Knowledge Entity Extraction and Text Mining in the Era of Big Data

Publisher: Elsevier
Publication Type: Journal Article
Citation: Data and Information Management, 2021, 5(3), pp. 309-311
Issue Date: 2021-07-02
In the era of big data, tremendous amounts of information and data have drastically changed human civilization. The rapid growth in the number of documents generated every day means that a large amount of knowledge is being proposed, improved, and used. For readers, especially newcomers to a given field, excavating suitable knowledge entities from massive document collections is time-consuming and labor-intensive, which negatively impacts research efficiency. The broad availability of information provides more opportunities for people, but a new challenge has arisen as well: how to extract and use knowledge from numerous information resources, and especially how to conduct knowledge extraction and text mining (TM) on massive documents in specialized domains.

A knowledge entity is a relatively independent and integral knowledge module in a specific discipline or research domain (Chang & Zheng, 2007). In scientific documents, knowledge entities refer to the knowledge mentioned or cited by authors, such as algorithms, models, theories, datasets, and software (Wang & Zhang, 2018), and reflect the various resources that authors use in problem-solving (Zhang, Mayr, Lu, & Zhang, 2020; Hou, Jochim, Gleize, Bonin, & Ganguly, 2019; Brack, D'Souza, Hoppe, Auer, & Ewerth, 2020). Extracting knowledge entities from numerous information resources is useful for many downstream tasks in information extraction, TM, natural language processing (NLP), information retrieval, digital library research, and so on. In particular, in artificial intelligence (AI), information science, and related disciplines, discovering methods in large-scale academic literature, and evaluating the performance and influence of those methods, have become increasingly necessary and meaningful (Hou et al., 2020).

In 2019, the "Heart of Machine" launched the "SOTA (state of the art) model" project. Targeting more than 100 tasks in machine learning research, the project extracted models, open datasets, evaluation metrics, and results from the academic literature through manual annotation and named entity recognition, and provided open retrieval services for users. The Defense Advanced Research Projects Agency (DARPA) recently launched the Automating Scientific Knowledge Extraction (ASKE) project to develop next-generation applications of artificial intelligence.

In parallel, deep learning techniques have brought new progress to NLP and TM. Many kinds of neural network models, e.g., convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and attention mechanisms, have been widely applied to these tasks (Qiu et al., 2020), particularly text classification (Zhang, Zhao, & LeCun, 2015; Lai, Xu, Liu, & Zhao, 2015; Yao, Mao, & Luo, 2019; Liu & Guo, 2019) and clustering (Xu et al., 2015; Xu et al., 2017). Several conferences and workshops are in line with this topic, such as the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) (Cabanac et al., 2016), the Workshop on Mining Scientific Publications (WOSP), the Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE) (Zhang et al., 2020), the Workshop on AI + Informetrics (AII) (Zhang, Zhang, Mayr, & Suominen, 2021), and the Workshop on Scholarly Document Processing (SDP) (Chandrasekaran et al., 2020). We are very grateful to the authors of the three contributions invited to this special issue of Data and Information Management (DIM).
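As a concrete, if greatly simplified, illustration of the kind of entity extraction such projects automate, the following Python sketch tags mentions of algorithms, datasets, and metrics against a hand-built lexicon. The lexicon, labels, and example sentence are illustrative assumptions; real systems such as the SOTA model project rely on manual annotation and learned named entity recognition rather than dictionary lookup.

```python
import re

# A minimal, dictionary-based sketch of knowledge entity tagging. The
# lexicon, labels, and sentence are illustrative assumptions; production
# systems use manual annotation and learned named entity recognition.
ENTITY_LEXICON = {
    "ALGORITHM": ["random forest", "BERT", "support vector machine"],
    "DATASET": ["PubMed", "ImageNet", "Web of Science"],
    "METRIC": ["F1 score", "accuracy", "recall"],
}

def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity type, surface form) pairs mentioned in the text."""
    mentions = []
    for label, terms in ENTITY_LEXICON.items():
        for term in terms:
            # Whole-word, case-insensitive match, so "BERT" does not
            # fire inside "RoBERTa".
            if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
                mentions.append((label, term))
    return mentions

sentence = ("We fine-tune BERT on PubMed abstracts and report the "
            "F1 score against a random forest baseline.")
print(extract_entities(sentence))
# [('ALGORITHM', 'random forest'), ('ALGORITHM', 'BERT'),
#  ('DATASET', 'PubMed'), ('METRIC', 'F1 score')]
```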
These three submissions were accepted after several rounds of peer review and revision.

The paper "Discovering Booming Bio-entities and Their Relationship with Funds" (Tan, Zhang, Yang, Wu, & Xu, 2021) tracked the overall trends and changes in biomedical topics from 1988 to 2017. It collected funding information from the PubMed database and the website of the United States National Institutes of Health (USNIH), and extracted funding-related entities and research hotspots in the corresponding fields. This study provides new insights for research funding allocation and may support the science policy and strategic management of stakeholders.

The paper "A Pattern and POS Auto-Learning Method for Terminology Extraction from Scientific Text" (Shao, Hua, & Song, 2021) proposed an unsupervised method based on sentence patterns and part-of-speech (POS) sequences extracted from scientific texts. The proposed method requires only a few initial learnable patterns to obtain initial terminological tokens and their POS sequences. Experiments on abstracts of articles in the Web of Science (WoS) database demonstrated its effectiveness; a simplified sketch of POS-pattern-based term extraction appears below.

The paper "Automatic Subject Classification of Public Messages in E-government Affairs" (Pan & Chen, 2021) addressed the task of automatic classification with a bidirectional long short-term memory (Bi-LSTM) network model equipped with an attention mechanism. The paper used the Bi-LSTM to capture the contextual dependencies between the preceding and following parts of a message, and introduced semantic attention to increase the weight of important text features; a minimal sketch of this style of architecture also appears below.
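To make the flavor of pattern-based term extraction concrete, here is a minimal Python sketch that keeps token spans whose POS sequence matches a single seed pattern (zero or more adjectives followed by one or more nouns). It is a simplification under an assumed, hard-coded pattern, not a reproduction of Shao, Hua, and Song's auto-learning procedure; the NLTK tag sets and example sentence are illustrative.

```python
import nltk

# A simplified sketch of POS-pattern-based term extraction. The single
# hard-coded pattern (adjectives followed by nouns) is an illustrative
# assumption; the paper's method learns its patterns automatically.
# One-time setup (resource names may vary by NLTK version):
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

ADJ = {"JJ", "JJR", "JJS"}            # Penn Treebank adjective tags
NOUN = {"NN", "NNS", "NNP", "NNPS"}   # Penn Treebank noun tags

def extract_terms(text: str) -> list[str]:
    """Return candidate terms whose POS sequence matches ADJ* NOUN+."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    terms, chunk, has_noun = [], [], False
    for word, tag in tagged:
        if tag in NOUN:
            chunk.append(word)
            has_noun = True
        elif tag in ADJ and not has_noun:   # adjectives may open a chunk
            chunk.append(word)
        else:                               # pattern broken: flush the chunk
            if has_noun:
                terms.append(" ".join(chunk))
            chunk, has_noun = ([word] if tag in ADJ else []), False
    if has_noun:
        terms.append(" ".join(chunk))
    return terms

abstract = ("Deep neural networks improve terminology extraction "
            "from scientific text.")
print(extract_terms(abstract))
# tagger-dependent, e.g. ['Deep neural networks', 'terminology extraction',
#                         'scientific text']
```

A real auto-learning pipeline would start from a handful of seed terms, collect the sentence patterns and POS sequences around them, and iterate; the fixed ADJ* NOUN+ rule above stands in for whatever patterns such a loop would learn.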
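Similarly, the following is a minimal PyTorch sketch of a Bi-LSTM classifier with an additive attention layer over the hidden states, in the spirit of, but not identical to, the model in Pan and Chen (2021); all hyperparameters (vocabulary size, embedding and hidden dimensions, number of classes) are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# A minimal Bi-LSTM + attention text classifier. Hyperparameters are
# illustrative assumptions, not the settings used by Pan and Chen (2021).
class BiLSTMAttention(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128,
                 hidden_dim=64, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # scores each time step
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        h, _ = self.lstm(self.embedding(token_ids))   # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over steps
        context = (weights * h).sum(dim=1)            # weighted sum of states
        return self.fc(context)                       # class logits

model = BiLSTMAttention()
logits = model(torch.randint(0, 10_000, (2, 30)))  # two messages, 30 tokens
print(logits.shape)  # torch.Size([2, 5])
```

Because the LSTM is bidirectional, each hidden state sees both the preceding and the following context of a token, and the attention weights let the classifier emphasize the most informative parts of a message, which matches the motivation described in the paper.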