Structural community detection in big graphs

Publication Type:
Issue Date:
Full metadata record
Community detection in graphs is a fundamental problem widely experienced across industries. Given a graph structure, one popular method to identify communities is classifying the vertices, which is formally named graph clustering. Additionally, community structures are always dense and highly connected in graphs. There are also a large number of research works focusing on mining cohesive subgraphs for community detection. Even though the community detection problem is extensively studied, challenges still exist. With the development of social media, graphs are highly dynamic, and the size of graphs is sharply increasing. The large time and space cost of traditional solutions may hardly be endured in big and dynamic graphs. In this thesis, we propose an index-based algorithm for the structural graph clustering (SCAN). Based on the proposed index structure, the time expended to compute structural clustering depends only on the result size, not on the size of the original graph. The space complexity of the index is bounded by 𝑂(𝑚), where m is the number of edges in the graph. We also propose algorithms and several optimization techniques for maintaining our index in dynamic graphs. For the cohesive subgraph detection, we study both degree-constrained (𝑘-core) and connectivity-constrained (𝑘-VCC) cohesive subgraph metrics. A 𝑘-core is a maximal connected subgraph in which each vertex has degree at least 𝑘. We study I/O efficient core decomposition following a semi-external model, which only allows vertex information to be loaded in memory. We propose an optimized I/O efficient algorithm for both core decomposition and core maintenance. In addition, we extend our algorithm to compute the graph degeneracy order, which is an important graph problem that is highly related to core decomposition. A 𝑘-vertex connected component (𝑘-VCC) is a connected subgraph in which the removal of any 𝑘 − 1 vertices will not disconnect the subgraph. A 𝑘-VCC has many outstanding structural properties, such as high cohesiveness, high robustness, and subgraph overlapping. The state-of-the-art solution enumerates all 𝑘-VCCs following a partition-based framework. It requires high computational cost in connectivity tests. We prove the upper bound of the number of partitions, which implies the polynomial running time of this framework. We propose two effective optimization strategies, namely neighbor sweep and group sweep, to largely reduce the number of local connectivity tests. We conducted extensive performance studies using several large real-world datasets to show the efficiency and effectiveness of all our approaches.
Please use this identifier to cite or link to this item: