Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition

Jauregi Unanue, I; Zare Borzeshi, E; Piccardi, M

Publication Type: Journal Article
Citation: Journal of Biomedical Informatics, 2017, 76, pp. 102-109
Issue Date: 2017-12-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Jauregi Unanue, I https://orcid.org/0000-0001-6223-9584	en_US
dc.contributor.author	Zare Borzeshi, E	en_US
dc.contributor.author	Piccardi, M https://orcid.org/0000-0001-9250-6604	en_US
dc.date.available	2020-05-25T19:11:15Z
dc.date.issued	2017-12-01	en_US
dc.identifier.citation	Journal of Biomedical Informatics, 2017, 76 pp. 102 - 109	en_US
dc.identifier.issn	1532-0464	en_US
dc.identifier.uri	http://hdl.handle.net/10453/120566
dc.relation.ispartof	Journal of Biomedical Informatics	en_US
dc.relation.isbasedon	10.1016/j.jbi.2017.11.007	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.subject.classification	Biomedical Engineering	en_US
dc.subject.classification	Medical Informatics	en_US
dc.subject.mesh	Humans	en_US
dc.subject.mesh	Algorithms	en_US
dc.subject.mesh	Databases, Factual	en_US
dc.subject.mesh	Machine Learning	en_US
dc.subject.mesh	Neural Networks, Computer	en_US
dc.title	Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition	en_US
dc.type	Journal Article
utslib.citation.volume	76	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	06 Biological Sciences	en_US
utslib.for	08 Information and Computing Sciences	en_US
utslib.for	11 Medical and Health Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access	*
pubs.publication-status	Published	en_US
pubs.volume	76	en_US

Abstract:

© 2017 Elsevier Inc.

Background: Previous state-of-the-art systems for Drug Name Recognition (DNR) and Clinical Concept Extraction (CCE) have combined text “feature engineering” with conventional machine learning algorithms such as conditional random fields (CRFs) and support vector machines. However, developing good features is inherently time-consuming. Conversely, more modern machine learning approaches such as recurrent neural networks (RNNs) have proved capable of automatically learning effective features from either random assignments or automated word “embeddings”.

Objectives: (i) To create a highly accurate DNR and CCE system that avoids conventional, time-consuming feature engineering. (ii) To create richer, more specialized word embeddings by using health-domain datasets such as MIMIC-III. (iii) To evaluate our systems over three contemporary datasets.

Methods: Two deep learning methods, namely the Bidirectional LSTM and the Bidirectional LSTM-CRF, are evaluated. A CRF model is set as the baseline to compare the deep learning systems to a traditional machine learning approach. The same features are used for all the models.

Results: We have obtained the best results with the Bidirectional LSTM-CRF model, which has outperformed all previously proposed systems. The specialized embeddings have helped to cover unusual words in DrugBank and MedLine, but not in the i2b2/VA dataset.

Conclusions: We present a state-of-the-art system for DNR and CCE. Automated word embeddings have allowed us to avoid costly feature engineering and achieve higher accuracy. Nevertheless, the embeddings need to be retrained over datasets that suit the domain in order to adequately cover the domain-specific vocabulary.
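As a rough illustration of how the two deep learning methods named in the abstract differ, the sketch below contrasts per-token tag selection (as a plain Bidirectional LSTM tagger would do) with joint Viterbi decoding over tag-transition scores (as a CRF output layer does). It is not the authors' implementation: the BIO tag set, the example tokens, and every score are invented for illustration, and the emission scores are hard-coded where a real system would take them from the trained network.

```python
from typing import Dict, List

Scores = Dict[str, float]


def viterbi_decode(
    emissions: List[Scores],
    transitions: Dict[str, Scores],
    tags: List[str],
) -> List[str]:
    """Return the tag sequence with the highest total score.

    emissions[t][tag] is the per-token score for `tag` at position t.
    transitions[prev][cur] is the score for moving from `prev` to `cur`.
    """
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag]: previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tag set and scores for the tokens: take / acetylsalicylic / acid
tags = ["O", "B-DRUG", "I-DRUG"]
emissions = [
    {"O": 2.0, "B-DRUG": 0.1, "I-DRUG": 0.0},
    {"O": 0.2, "B-DRUG": 1.5, "I-DRUG": 1.0},
    {"O": 1.0, "B-DRUG": 0.2, "I-DRUG": 0.9},
]
transitions = {
    "O": {"O": 0.0, "B-DRUG": 0.0, "I-DRUG": -10.0},  # "I" may not follow "O"
    "B-DRUG": {"O": 0.0, "B-DRUG": 0.0, "I-DRUG": 1.0},
    "I-DRUG": {"O": 0.0, "B-DRUG": 0.0, "I-DRUG": 0.5},
}

# Per-token choice: each position picks its own best tag independently.
greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]

# Joint choice: transition scores let the sequence as a whole decide.
decoded = viterbi_decode(emissions, transitions, tags)

print(greedy)
print(decoded)
```

With these made-up numbers, the per-token choice labels "acid" as outside the entity, while the joint decoding keeps it inside the drug mention because the transition from B-DRUG to I-DRUG is rewarded.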

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/120566