Data mining of classification for sybil user detection

Publication Type: Thesis
Issue Date: 2016
Research in data analytics and Big Data applications, together with new structures in complex data, is revealing the complexity and patterns of that data, and with them valuable, critical and challenging issues, through newly designed tools, techniques and models in data science. A common example concerns the interconnectivity of social network users on mobile devices, which involves sharing content and information through mobile social networks. There have been a large number of studies on mobile networks. Many focus on a variety of secured applications that attempt to exploit social connections, impersonate users or attack social groups. Such applications are often created with the intention of collecting confidential information, laundering money, or carrying out blackmail or other criminal activities. Existing methods for identifying such activity, including distributed systems, social graph-based sybil detection, behaviour classification, and local ranking systems that estimate the trust level between users, rely on the dependencies between random nodes of connection on mobile social networks. These models aim to detect suspicious connections and have the advantage of learning the relationships between nodes and data. However, their detection patterns tend to impose the behavioural patterns typically associated with community-based and external networks. In data mining, the graph-based and classification models used for pattern collection can accurately predict patterns in data in targeted categories. Decision trees, commonly used for classification, are trees in which each branch represents a choice between a number of alternatives, and each leaf represents a classification, or decision. For example, a decision tree may help an institution decide whether a node in a dataset is suspicious, or considered to be sybil, provided that the tree can be induced from a set of data about instances and the classifications of those instances.
It could also provide the flexibility to demonstrate data distribution. Thus, researchers have tried to combine different techniques and methods into network-based models to detect various patterns generated by sybil nodes within a network. To address these research limitations, this thesis aims to abridge existing classification and regression techniques to identify sybil nodes and the correlation of those nodes with time. Classification and regression techniques predict behaviour based on continuous or categorical responses. If the predicted response is continuous, the tree is called a regression tree; if the response is categorical, it is called a classification tree. At each node of the tree, the value of one of the connected input nodes is checked, and a binary answer (yes or no) determines whether one continues to the left or right sub-branch. When a leaf is reached, a prediction follows from a series of entropy calculations and graphing techniques. This thesis introduces a novel classification model for sybil detection in mobile social behaviour that identifies dependencies using connection duration and other attributes. Ross Quinlan's C4.5 algorithm, its resulting decision tree and a random forest simplify the step-by-step identification process while maintaining its merits. Partial correlations between nodes are simplified using Rattle, and the dataset is divided into majority nodes to assist processing. This research also includes a behavioural survey of the nodes and an extended analysis using a classification system for sybil detection, with a particular focus on sybil attacks in mobile social network environments. Each sybil node is tracked and identified based on the frequency and duration of its connections with other nodes.
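The yes/no descent described above can be illustrated with a minimal sketch. It is not the thesis's model: the attribute names (`connections_per_hour`, `connection_duration_s`), the thresholds and the leaf labels are hypothetical values chosen only to show how a record is routed from the root to a leaf.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    # A categorical prediction makes this a classification tree;
    # a numeric prediction would make it a regression tree.
    prediction: Union[str, float]


@dataclass
class Split:
    attribute: str    # attribute checked at this node (hypothetical name)
    threshold: float  # illustrative threshold, not taken from the thesis
    yes: "Node"       # branch followed when record[attribute] <= threshold
    no: "Node"        # branch followed otherwise


Node = Union[Leaf, Split]


def predict(node: Node, record: dict) -> Union[str, float]:
    """Follow binary yes/no answers from the root until a leaf is reached."""
    while isinstance(node, Split):
        node = node.yes if record[node.attribute] <= node.threshold else node.no
    return node.prediction


# Hypothetical tree: many very short connections are labelled "sybil".
tree = Split(
    "connections_per_hour", 20.0,
    yes=Leaf("honest"),
    no=Split("connection_duration_s", 5.0, yes=Leaf("sybil"), no=Leaf("honest")),
)

busy_short = {"connections_per_hour": 50, "connection_duration_s": 2}
quiet_user = {"connections_per_hour": 3, "connection_duration_s": 2}
print(predict(tree, busy_short), predict(tree, quiet_user))
```

In practice the splits would be induced from labelled connection data rather than written by hand.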
An outline of how the classification model identifies behaviour is also included, along with an explanation of the flow of the decision tree and the C4.5 algorithm process, which groups identified sybil nodes based on the results of entropy calculations and information gain. The entropy calculated for each node connection across all the datasets informs the information gain. The maxGain calculations for each individual node lead to the final stage, drawing the decision tree, and help to predict the sybil nodes and to compare and justify the identified sybil attackers. These processes and new models applied to sybil detection provide insight into the behaviour of connections through deep analytics and entropy gain. The evidence gleaned from this research brings significant knowledge to data analytics and data science in the identification of threats on mobile social networks.
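The entropy and information-gain step can be sketched as follows. This is an illustrative example under stated assumptions, not the thesis's implementation: the function names, the categorical attributes (`duration`, `frequency`) and the four toy records are all hypothetical. It uses plain information gain as the abstract describes; C4.5 as usually formulated additionally normalises the gain by split information (gain ratio), which is omitted here for brevity.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the maximum gain, i.e. the split chosen for the next tree node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: connection duration separates the classes, frequency does not.
rows = [
    {"duration": "short", "frequency": "high"},
    {"duration": "short", "frequency": "low"},
    {"duration": "long", "frequency": "high"},
    {"duration": "long", "frequency": "low"},
]
labels = ["sybil", "sybil", "honest", "honest"]

print(best_attribute(rows, labels, ["duration", "frequency"]))
```

Applying `best_attribute` recursively to each partition is one way the tree could be grown until the leaves are pure.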