Discrete Representation Learning for Network Data

Network and graph data are widely used to describe a large body of real-world applications, ranging from social networks, biological graphs, and citation networks to transaction data. To obtain knowledge from network data, various machine learning models on networks have been proposed, such as graph embedding models and graph neural networks. Among graph embedding models, attributed network embedding represents a new branch that aims to obtain knowledge from attributed networks, where both node attributes and links between nodes are observable for learning. Existing attributed network embedding models enable joint representation learning of node links and attributes. These models, however, are all designed in continuous Euclidean spaces, which often introduce data redundancy and thus impose storage and computation costs on large-scale networks. This thesis focuses on a new problem of discrete representation learning for network data, which seeks compact network representations for efficient network analysis. In particular, we study three challenging problems of discrete representation learning for attributed networks. First, how to learn binary representation vectors for network nodes. Binary code learning can generate succinct representations; however, no existing work learns binary representations for attributed networks that preserve both network structure (e.g., edge weights) and node attributes. Second, how to learn discrete representation vectors that are more compact than binary vectors. Since binary representations may suffer from uncontrollable accuracy loss during prediction, it is natural to ask how to design a sparse network representation model that balances representation accuracy against representation size. Third, binary network embedding assumes that network structures are readily available for analysis.
However, in real-world scenarios such as social networks, it is sometimes impossible to collect explicit network structures due to network dynamics, and the structures must instead be inferred from implicit data such as information cascades. Thus, how to build an end-to-end discrete network embedding model that learns binary representations from underlying information cascades is the third challenge. In light of the above challenges, we propose a new class of attributed network embedding models that learn discrete node representations to reduce data redundancy and, in turn, the storage and computation costs of attributed network data. To solve the first challenge, we present a Binarized Attributed Network Embedding model (BANE for short) that learns binary node representations. Specifically, we define a new Weisfeiler-Lehman proximity matrix that captures the dependence between node links and attributes by aggregating the attribute and link information of neighboring nodes into a given target node in a layer-wise manner. Based on the Weisfeiler-Lehman proximity matrix, we formulate a new Weisfeiler-Lehman matrix factorization learning function under the binary node representation constraint. The learning problem is a mixed integer optimization, and an efficient cyclic coordinate descent (CCD) algorithm is used as the solution. To solve the second challenge, we propose a new Low-Bit Quantization for Attributed Network Representation Learning model (LQANR for short) that learns compact node representations with low bit-width values while preserving high representation accuracy. Specifically, we formulate a new representation learning function based on matrix factorization that jointly learns the low-bit node representations and the layer aggregation weights under the low-bit quantization constraint.
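The layer-wise aggregation behind a Weisfeiler-Lehman-style proximity matrix can be illustrated with a minimal sketch. The self-loops, row normalization, and averaging over layers below are illustrative assumptions, not the thesis's exact construction:

```python
import numpy as np

def wl_proximity(adj, attrs, num_layers=2):
    """Layer-wise aggregation of node attributes over links.

    adj   : (n, n) weighted adjacency matrix
    attrs : (n, d) node attribute matrix
    Returns an (n, d) matrix mixing each node's attributes with
    those of its multi-hop neighbors, layer by layer.
    """
    n = adj.shape[0]
    # Add self-loops so each node keeps its own attributes,
    # then row-normalize to obtain a transition-style matrix.
    a_hat = adj + np.eye(n)
    p = a_hat / a_hat.sum(axis=1, keepdims=True)
    # Propagate attributes one hop per layer, accumulating each hop.
    out = attrs.astype(float).copy()
    h = attrs.astype(float)
    for _ in range(num_layers):
        h = p @ h
        out += h
    return out / (num_layers + 1)
```

In this sketch, deeper `num_layers` values blend attribute information from farther neighbors into each node's row, which is the layer-wise dependence the proximity matrix is meant to capture.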
Because the new learning function falls into the category of mixed integer optimization, we propose an efficient mixed-integer alternating direction method of multipliers (ADMM) algorithm as the solution. To solve the third challenge, we present an end-to-end discrete network embedding model for latent networks (DELN) that learns binary representations from underlying information cascades. The essential idea is to infer a latent Weisfeiler-Lehman proximity matrix that captures node dependence from information cascades, and then to factorize this latent Weisfeiler-Lehman matrix under the binary node representation constraint. Since the learning problem is a mixed integer optimization problem, an efficient maximum likelihood estimation based cyclic coordinate descent (MLE-CCD) algorithm is used as the solution. Experiments on real-world datasets show that the proposed models outperform state-of-the-art network embedding methods.
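As a rough illustration of factorizing a proximity matrix under a binary constraint with cyclic coordinate descent, the following sketch alternates a least-squares step with bitwise sign updates. This is a simplified stand-in for the CCD solvers discussed above, not the thesis's exact algorithm; all names are hypothetical:

```python
import numpy as np

def binary_embed(m, dim=4, iters=20, seed=0):
    """Sketch of binary embedding: factorize a proximity matrix
    m (n x f) as B @ Z with B in {-1, +1}^(n x dim), alternating
    least squares for Z and cyclic coordinate descent over the
    bit columns of B."""
    rng = np.random.default_rng(seed)
    n = m.shape[0]
    b = rng.choice([-1.0, 1.0], size=(n, dim))
    z = np.zeros((dim, m.shape[1]))
    for _ in range(iters):
        # Z-step: ordinary least squares given the binary codes.
        z, *_ = np.linalg.lstsq(b, m, rcond=None)
        # B-step: update one bit column at a time with the rest
        # fixed; the optimal {-1, +1} entry is the sign of the
        # correlation between the residual and that column's basis.
        for k in range(dim):
            resid = m - b @ z + np.outer(b[:, k], z[k])
            s = resid @ z[k]
            b[:, k] = np.where(s >= 0, 1.0, -1.0)
    return b, z
```

Each bit update is the exact minimizer of the squared reconstruction error with the other bits held fixed, so the objective never increases across sweeps; this monotonicity is what makes cyclic coordinate descent attractive for mixed integer factorization problems of this kind.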