Exploration on efficient similar sentences extraction

Gu, Y; Yang, Z; Xu, G; Nakano, M; Toyoda, M; Kitsuregawa, M

Publication Type: Journal Article
Citation: World Wide Web, 2014, 17 (4), pp. 595-626
Issue Date: 2014-01-01

Closed Access

Filename	Description	Size	Format
WWWJ14-Gu.pdf	Published Version	2.93 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Gu, Y	en_US
dc.contributor.author	Yang, Z	en_US
dc.contributor.author	Xu, G https://orcid.org/0000-0003-4493-6663	en_US
dc.contributor.author	Nakano, M	en_US
dc.contributor.author	Toyoda, M	en_US
dc.contributor.author	Kitsuregawa, M	en_US
dc.date.issued	2014-01-01	en_US
dc.identifier.citation	World Wide Web, 2014, 17 (4), pp. 595 - 626	en_US
dc.identifier.issn	1386-145X	en_US
dc.identifier.uri	http://hdl.handle.net/10453/30302
dc.relation.ispartof	World Wide Web	en_US
dc.relation.isbasedon	10.1007/s11280-012-0195-z	en_US
dc.subject.classification	Information Systems	en_US
dc.title	Exploration on efficient similar sentences extraction	en_US
dc.type	Journal Article
utslib.citation.volume	4	en_US
utslib.citation.volume	17	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0805 Distributed Computing	en_US
utslib.for	0804 Data Format	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	closed_access
pubs.issue	4	en_US
pubs.publication-status	Published	en_US
pubs.volume	17	en_US

Abstract:

Measuring the semantic similarity between sentences is essential for many applications, such as text summarization, Web page retrieval, question-answer models, and image extraction. Several studies have explored this problem using knowledge-based, corpus-based, and hybrid strategies. Most of them focus on improving effectiveness. In this paper, we address efficiency: given a sentence collection, how to efficiently discover the top-k sentences that are most semantically similar to a query. Previous methods do not scale to large collections, because applying them directly requires testing every candidate sentence, which is time consuming. We propose efficient strategies for this problem based on a general framework. The basic idea is to build, for each similarity measure, a corresponding index during preprocessing. Traversing these indices at query time avoids testing many candidates and thereby improves efficiency. Moreover, an optimal aggregation algorithm is introduced to combine these similarities. The framework is general enough to incorporate many similarity metrics, as discussed in the paper. We conduct an extensive experimental evaluation on three real datasets to assess the efficiency of our proposal, and we illustrate the trade-off between effectiveness and efficiency. The experimental results demonstrate that our proposal outperforms state-of-the-art techniques in efficiency while maintaining the same high precision. © 2013 Springer Science+Business Media New York.
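
The index-then-aggregate idea in the abstract can be illustrated with a small sketch. The abstract does not name the aggregation algorithm, so the code below assumes a threshold-style top-k aggregation in the spirit of Fagin's threshold algorithm; it is not taken from the paper. The function name top_k_threshold, the weighted-sum combination, the weights, and the toy scores are all hypothetical. Each "index" is simply a list of (sentence id, similarity to the query) pairs sorted by descending score, one list per similarity measure.

```python
import heapq


def top_k_threshold(indices, k, weights=None):
    """Return the k best (sentence_id, score) pairs and the number of
    sentences actually scored.

    indices: one dict per similarity measure, mapping sentence id to that
             measure's similarity with the query (assumed to lie in [0, 1]).
    The combined score is a weighted sum, which is monotone in every
    measure; the early-stopping test below relies on that property.
    """
    m = len(indices)
    weights = weights or [1.0 / m] * m

    # Sorted access: each measure's entries in descending score order.
    sorted_lists = [sorted(ix.items(), key=lambda kv: -kv[1]) for ix in indices]

    def aggregate(sid):
        # Random access: fetch the sentence's score under every measure.
        return sum(w * ix.get(sid, 0.0) for w, ix in zip(weights, indices))

    seen = set()
    heap = []  # min-heap of (score, sentence_id), at most k entries
    max_depth = max(len(lst) for lst in sorted_lists)

    for depth in range(max_depth):
        last_scores = []
        for lst in sorted_lists:
            if depth >= len(lst):
                last_scores.append(0.0)  # list exhausted
                continue
            sid, score = lst[depth]
            last_scores.append(score)
            if sid in seen:
                continue
            seen.add(sid)
            total = aggregate(sid)
            if len(heap) < k:
                heapq.heappush(heap, (total, sid))
            elif total > heap[0][0]:
                heapq.heapreplace(heap, (total, sid))

        # No unseen sentence can score above this bound.
        threshold = sum(w * s for w, s in zip(weights, last_scores))
        if len(heap) == k and heap[0][0] >= threshold:
            break

    ranked = sorted(heap, key=lambda t: (-t[0], t[1]))
    return [(sid, score) for score, sid in ranked], len(seen)


# Toy data (hypothetical): two similarity measures over five sentences.
indices = [
    {"s1": 0.9, "s2": 0.6, "s3": 0.2, "s4": 0.1, "s5": 0.05},
    {"s1": 0.7, "s2": 0.8, "s3": 0.3, "s4": 0.1, "s5": 0.0},
]
result, tested = top_k_threshold(indices, k=2)
print(result, tested)
```

On this toy input the loop stops at the second position of each list, so only two of the five sentences are ever scored. This is the kind of saving the abstract attributes to traversing indices instead of testing every candidate.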

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/30302