Maximum and top-k diversified biclique search at scale

Lyu, B; Qin, L; Lin, X; Zhang, Y; Qian, Z; Zhou, J

Publisher: SPRINGER
Publication Type: Conference Proceeding
Citation: VLDB Journal, 2022, 31, (6), pp. 1365-1389
Issue Date: 2022-11-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Lyu, B
dc.contributor.author	Qin, L https://orcid.org/0000-0001-6068-5062
dc.contributor.author	Lin, X
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638
dc.contributor.author	Qian, Z
dc.contributor.author	Zhou, J
dc.date.accessioned	2023-01-13T03:32:53Z
dc.date.available	2023-01-13T03:32:53Z
dc.date.issued	2022-11-01
dc.identifier.citation	VLDB Journal, 2022, 31, (6), pp. 1365-1389
dc.identifier.issn	1066-8888
dc.identifier.issn	0949-877X
dc.identifier.uri	http://hdl.handle.net/10453/164955
dc.description.abstract	Maximum biclique search, which finds the biclique with the maximum number of edges in a bipartite graph, is a fundamental problem with a wide spectrum of applications in domains such as e-commerce, social analysis, web services, and bioinformatics. Unfortunately, because of the difficulty of the problem in graph theory, no practical solution has been proposed for large-scale real-world datasets. Existing techniques for maximum clique search on a general graph cannot be applied because the search objective of maximum biclique search is two-dimensional, i.e., we have to consider the sizes of both sides of the biclique simultaneously. In this paper, we divide the problem into several subproblems, each of which is specified by two parameters. These subproblems are derived in a progressive manner, and in each subproblem we can restrict the search to a very small part of the original bipartite graph. We prove that a logarithmic number of subproblems is enough to guarantee the correctness of the algorithm. To minimize the computational cost, we show how to significantly reduce the size of the bipartite graph for each subproblem, while preserving the maximum biclique satisfying certain constraints, by exploring the properties of the one-hop and two-hop neighbors of each vertex. Furthermore, we study the diversified top-k biclique search problem, which aims to find k maximal bicliques that cover the most edges in total. The basic idea is to repeatedly find the maximum biclique in the bipartite graph and remove it from the graph, k times. We design an efficient algorithm that shares the computation cost among the k results, based on the idea of deriving the same subproblems for different results. We further propose two optimizations to accelerate the computation by pruning the search space with a size constraint and refining the candidates in a lazy manner.
We use several real datasets from various application domains, one of which contains over 300 million vertices and 1.3 billion edges, to demonstrate the high efficiency and scalability of our proposed solution. A 50% improvement in recall is reported after applying our method at Alibaba Group to identify fraudulent transactions in their e-commerce networks. This further demonstrates the usefulness of our techniques in practice.
dc.language	en
dc.publisher	SPRINGER
dc.relation	http://purl.org/au-research/grants/arc/DP160101513
dc.relation	http://purl.org/au-research/grants/arc/DP210101393
dc.relation	http://purl.org/au-research/grants/arc/FT200100787
dc.relation.ispartof	VLDB Journal
dc.relation.isbasedon	10.1007/s00778-021-00681-6
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	0804 Data Format, 0805 Distributed Computing, 0806 Information Systems
dc.subject.classification	Information Systems
dc.title	Maximum and top-k diversified biclique search at scale
dc.type	Conference Proceeding
utslib.citation.volume	31
utslib.for	0804 Data Format
utslib.for	0805 Distributed Computing
utslib.for	0806 Information Systems
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
dc.date.updated	2023-01-13T03:32:52Z
pubs.issue	6
pubs.publication-status	Published
pubs.volume	31
utslib.citation.issue	6

Abstract:

Maximum biclique search, which finds the biclique with the maximum number of edges in a bipartite graph, is a fundamental problem with a wide spectrum of applications in domains such as e-commerce, social analysis, web services, and bioinformatics. Unfortunately, because of the difficulty of the problem in graph theory, no practical solution has been proposed for large-scale real-world datasets. Existing techniques for maximum clique search on a general graph cannot be applied because the search objective of maximum biclique search is two-dimensional, i.e., we have to consider the sizes of both sides of the biclique simultaneously. In this paper, we divide the problem into several subproblems, each of which is specified by two parameters. These subproblems are derived in a progressive manner, and in each subproblem we can restrict the search to a very small part of the original bipartite graph. We prove that a logarithmic number of subproblems is enough to guarantee the correctness of the algorithm. To minimize the computational cost, we show how to significantly reduce the size of the bipartite graph for each subproblem, while preserving the maximum biclique satisfying certain constraints, by exploring the properties of the one-hop and two-hop neighbors of each vertex. Furthermore, we study the diversified top-k biclique search problem, which aims to find k maximal bicliques that cover the most edges in total. The basic idea is to repeatedly find the maximum biclique in the bipartite graph and remove it from the graph, k times. We design an efficient algorithm that shares the computation cost among the k results, based on the idea of deriving the same subproblems for different results. We further propose two optimizations to accelerate the computation by pruning the search space with a size constraint and refining the candidates in a lazy manner.
We use several real datasets from various application domains, one of which contains over 300 million vertices and 1.3 billion edges, to demonstrate the high efficiency and scalability of our proposed solution. A 50% improvement in recall is reported after applying our method at Alibaba Group to identify fraudulent transactions in their e-commerce networks. This further demonstrates the usefulness of our techniques in practice.
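
Illustrative sketch (not the paper's algorithm): the abstract describes the objective (maximize the number of edges, i.e. the product of the two side sizes) and the basic top-k idea (find the maximum biclique, remove it, repeat k times). The Python below is a brute-force toy for tiny graphs only, with none of the subproblem decomposition or graph reduction the paper proposes. The function names `maximum_biclique` and `top_k_by_removal` are hypothetical, and interpreting "remove" as deleting the biclique's edges is an assumption.

```python
from itertools import combinations


def maximum_biclique(edges):
    """Brute-force maximum-edge biclique of a small bipartite graph.

    edges: set of (u, v) pairs, with u on the left side and v on the right.
    Returns (left, right) as frozensets maximizing len(left) * len(right).
    Exponential in the number of left vertices; for illustration only.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    best = (frozenset(), frozenset())
    lefts = sorted(adj)
    # For a fixed left side, the best right side is the set of common
    # neighbours, so it suffices to enumerate subsets of left vertices.
    for size in range(1, len(lefts) + 1):
        for left in combinations(lefts, size):
            right = set.intersection(*(adj[u] for u in left))
            # The objective is two-dimensional: both side sizes matter.
            if len(left) * len(right) > len(best[0]) * len(best[1]):
                best = (frozenset(left), frozenset(right))
    return best


def top_k_by_removal(edges, k):
    """Repeat up to k times: find the maximum biclique, delete its edges."""
    remaining = set(edges)
    results = []
    for _ in range(k):
        left, right = maximum_biclique(remaining)
        if not left or not right:
            break  # no edges left to cover
        results.append((left, right))
        # Assumption: "removing" a biclique means deleting its edges.
        remaining -= {(u, v) for u in left for v in right}
    return results


if __name__ == "__main__":
    toy = {("a", 1), ("a", 2), ("a", 3),
           ("b", 1), ("b", 2), ("b", 3),
           ("c", 3), ("c", 4), ("d", 4)}
    for left, right in top_k_by_removal(toy, 2):
        print(sorted(left), sorted(right))
```

On the toy graph, the first result is the 2 x 3 biclique on {a, b} and {1, 2, 3}; once its six edges are deleted, the largest remaining biclique has two edges. Note that a biclique found in the residual graph need not be maximal in the original graph, which is one reason this sketch should not be read as the paper's method.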

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/164955