Joint Entity Linking with Deep Reinforcement Learning

Entity linking is the task of aligning mentions to corresponding entities in a given knowledge base. Previous studies have highlighted the necessity for entity linking systems to capture global coherence. However, previous global models share two common weaknesses. First, most of them calculate the pairwise scores between all candidate entities and select the most relevant group of entities as the final result. This process involves the consistency among wrong entities as well as that among right ones, which may introduce noisy data and increase the model complexity. Second, the cues of previously disambiguated entities, which could contribute to the disambiguation of subsequent mentions, are usually ignored. To address these problems, we convert global linking into a sequence decision problem and propose a reinforcement learning model which makes decisions from a global perspective. Our model makes full use of the previously referred entities and explores the long-term influence of the current selection on subsequent decisions. We conduct experiments on different types of datasets; the results show that our model outperforms state-of-the-art systems and has better generalization performance.


INTRODUCTION
Entity Linking (EL), which is also called Entity Disambiguation (ED), is the task of mapping mentions in text to corresponding entities in a given Knowledge Base (KB). This task is an important and challenging stage in text understanding because mentions are usually ambiguous, i.e., different named entities may share the same surface form and the same entity may have multiple aliases. EL is key for information extraction (IE) and has many applications, such as knowledge base population (KBP) and question answering (QA).
Existing EL methods can be divided into two categories: local models and global models. Local models focus mainly on the contextual words surrounding the mentions, and mentions are disambiguated independently. These methods do not work well when the context information is not rich enough. Global models take into account the topical coherence among the referred entities within the same document, and mentions are disambiguated jointly. Most previous global models [14,27,37] calculate the pairwise scores between all candidate entities and select the most relevant group of entities. However, this involves the consistency among wrong entities as well as that among right ones, which not only increases the model complexity but also introduces noise. For example, in Figure 1, there are three mentions "France", "Croatia" and "2018 World Cup", and each mention has three candidate entities. Here, "France" may refer to French Republic, France national basketball team or France national football team in the KB. It is difficult to disambiguate using local models, due to the scarce common information in the contextual words of "France" and the descriptions of its candidate entities. Besides, the topical coherence among the wrong entities related to the basketball team (linked by an orange dashed line) may make global models mistakenly link "France" to France national basketball team. So, how can these problems be solved?
We note that mentions in text usually differ in disambiguation difficulty according to the quality of contextual information and the topical coherence. Intuitively, if we start with mentions that are easier to disambiguate and obtain correct results, it will be effective to utilize the information provided by previously referred entities to disambiguate subsequent mentions. In the above example, it is much easier to map "2018 World Cup" to 2018 FIFA World Cup based on their common contextual words "France", "Croatia", "4-2". Then, it is obvious that "France" and "Croatia" should be linked to the national football teams because football-related terms are mentioned many times in the description of 2018 FIFA World Cup.
Inspired by this intuition, we design the solution with three principles: (i) utilizing local features to rank the mentions in text and deal with them sequentially; (ii) utilizing the information of previously referred entities for the subsequent entity disambiguation; (iii) making decisions from a global perspective to avoid error propagation if a previous decision is wrong.
In order to achieve these aims, we consider global EL as a sequence decision problem and propose a deep reinforcement learning (RL) based model, RLEL for short, which consists of three modules: Local Encoder, Global Encoder and Entity Selector. For each mention and its candidate entities, Local Encoder encodes the local features to obtain their latent vector representations. Then, the mentions are ranked according to their disambiguation difficulty, which is measured by the learned vector representations. In order to enforce global coherence between mentions, Global Encoder encodes the local representations of mention-entity pairs in a sequential manner via an LSTM network, which maintains a long-term memory of the features of entities that have been selected in previous states. Entity Selector uses a policy network to choose the target entities from the candidate set. For a single disambiguation decision, the policy network considers not only the current mention-entity representation pairs but also the features of the entities referred to in previous states, which are provided by the Global Encoder. In this way, Entity Selector is able to take actions based on the current state and previous ones. Once all mentions in the sequence have been disambiguated, delayed rewards are used to adjust its policy in order to reach an optimized global decision.
A deep RL model, which learns to directly optimize the overall evaluation metrics, works much better than models that learn with loss functions evaluating only a particular single decision. Owing to this property, RL has been successfully used in many NLP tasks, such as information retrieval [28], dialogue systems [10] and relation classification [12]. To the best of our knowledge, we are the first to design an RL model for global entity linking. In this paper, our RL model is able to produce more accurate results by exploring the long-term influence of independent decisions and encoding the entities disambiguated in previous states.
In summary, the main contributions of our paper include the following aspects:
• We are the first to consider EL as a sequence decision problem and to utilize a deep reinforcement learning model in this task.
• The proposed model takes into account both local context and global coherence. In the process of global disambiguation, we make full use of the previously selected entity information and make decisions from a global perspective.
• We evaluate our model on several benchmark datasets, and the experimental results show that our model achieves significant improvements over the state-of-the-art methods.

METHODOLOGY
The overall structure of our RLEL model is shown in Figure 2. The proposed framework mainly includes three parts: Local Encoder, which encodes local features of mentions and their candidate entities; Global Encoder, which encodes the global coherence of mentions in a sequential manner; and Entity Selector, which selects an entity from the candidate set. As the Entity Selector and the Global Encoder are mutually correlated, we train them jointly. Moreover, the Local Encoder, as the basis of the entire framework, is trained independently before the joint training process starts. In the following, we introduce the technical details of these modules.
Figure 2 (caption fragment): [...] denotes the concatenation of the mention context vector $V_{m_t}$ and one candidate entity vector $V_{e_t^k}$. The policy network selects one entity from the candidate set, and $V_{a_t}$ denotes the concatenation of the mention context vector $V_{m_t}$ and the selected entity vector $V_{e_t^*}$. $h_t$ represents the hidden state of $V_{a_t}$, and it is input into $S_{t+1}$.

Preliminaries
Before introducing our model, we first define the entity linking task. Formally, given a document $D$ with a set of mentions $M = \{m_1, m_2, \ldots, m_k\}$, each mention $m_t \in D$ has a set of candidate entities $C_{m_t} = \{e_t^1, e_t^2, \ldots, e_t^n\}$. The task of entity linking is to map each mention $m_t$ to its corresponding correct target entity $e_t^+$, or to return "NIL" if there is no correct target entity in the knowledge base. Before selecting the target entity, we need to generate a certain number of candidate entities for the model to select from.
Inspired by previous works [29,31,42], we use the mention's redirect and disambiguation pages in Wikipedia to generate candidate sets. For mentions without corresponding disambiguation pages, we use their n-grams to retrieve the candidates [31]. In most cases, the disambiguation page contains many entities, sometimes even hundreds. To reduce the model's memory usage and avoid unnecessary calculations, the candidate sets need to be filtered [1,14,37]. Here we utilize the XGBoost model [3] as an entity ranker to reduce the size of the candidate set. The features used in XGBoost fall into two groups: string similarity, such as the Jaro-Winkler distance between the entity title and the mention, and semantic similarity, such as the cosine distance between the mention context representation and the entity embedding. Furthermore, we also use statistical features based on the pageviews and hyperlinks in Wikipedia. Empirically, we get the pageviews of the entity from the Wikipedia Tool Labs, which counts the number of visits to each entity page in Wikipedia. After

Local Encoder
Given a mention $m_t$ and the corresponding candidate set $\{e_t^1, e_t^2, \ldots, e_t^k\}$, we aim to get their local representations based on the mention context and the candidate entity descriptions. For each mention, we first select its $n$ surrounding words and represent them as word embeddings using a pre-trained lookup table [24]. Then, we use a Long Short-Term Memory (LSTM) network to encode the contextual word sequence $\{w_c^1, w_c^2, \ldots, w_c^n\}$ as a fixed-size vector $V_{m_t}$. The description of each entity is encoded as $D_{e_t^i}$ in the same way. Apart from the description of the entity, there is much other valuable information in the knowledge base. To make full use of this information, many researchers have trained entity embeddings by combining the description, category, and relationships of entities. As shown in [14], entity embeddings compress the semantic meaning of entities and drastically reduce the need for manually designed features or co-occurrence statistics. Therefore, we use the pre-trained entity embedding $E_{e_t^i}$ and concatenate it with the description vector $D_{e_t^i}$ to enrich the entity representation. The concatenation result is denoted by $V_{e_t^i}$.
After getting $V_{e_t^i}$, we concatenate it with $V_{m_t}$ and then pass the concatenation result to a multilayer perceptron (MLP). The MLP outputs a scalar that represents the local similarity between the mention $m_t$ and the candidate entity $e_t^i$. The local similarity is calculated by the following equation:
$$\Psi(m_t, e_t^i) = \mathrm{MLP}(V_{m_t} \oplus V_{e_t^i}) \quad (1)$$
where $\oplus$ indicates vector concatenation. To distinguish the correct target entity from wrong candidate entities when training the local encoder, we utilize a hinge loss that ranks the ground truth higher than the others. The rank loss function is defined as follows:
$$L_{local} = \max\left(0,\ \gamma - \Psi(m_t, e_t^+) + \Psi(m_t, e_t^-)\right) \quad (2)$$
When optimizing the objective function, we minimize the rank loss, similar to [14,37]. In this ranking model, a training instance is constructed by pairing a positive target entity $e_t^+$ with a negative entity $e_t^-$. Here $\gamma > 0$ is a margin parameter, and our purpose is to make the score of the positive target entity $e_t^+$ at least a margin $\gamma$ higher than that of the negative candidate entity $e_t^-$. With the local encoder, we obtain the representations of the mention context and candidate entities, which are used as input to the global encoder and entity selector. In addition, the similarity scores calculated by the MLP are utilized for ranking mentions in the global encoder.
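The margin-based rank loss described above can be illustrated with a minimal Python sketch. The function names are hypothetical, the scores stand in for MLP outputs, and averaging over several negatives is an assumption rather than something stated in the text; the default margin 0.1 is the value of γ reported in the hyper-parameter settings.

```python
def rank_loss(pos_score, neg_score, margin=0.1):
    """Hinge loss for one (positive, negative) candidate pair.

    The loss is zero once the positive entity scores at least
    `margin` higher than the negative one.
    """
    return max(0.0, margin - pos_score + neg_score)


def mean_rank_loss(pos_score, neg_scores, margin=0.1):
    """Average pairwise hinge loss of one gold entity against its negatives.

    Averaging is an illustrative choice, not taken from the paper.
    """
    losses = [rank_loss(pos_score, s, margin) for s in neg_scores]
    return sum(losses) / len(losses)
```

A pair that is already separated by more than the margin contributes nothing, so training effort concentrates on negatives that score close to, or above, the gold entity.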

Global Encoder
In the global encoder module, we aim to enforce the topical coherence among the mentions and their target entities. To do so, we use an LSTM network, which is capable of maintaining long-term memory, to encode the ranked mention sequence. We emphasize that our global encoder only encodes the mentions that have already been disambiguated by the entity selector, whose representation is denoted as $V_{a_t}$.
As mentioned above, the mentions should be sorted according to their contextual information and topical coherence. We therefore first group adjacent mentions into segments in the order they appear in the document, based on the observation that topical consistency attenuates with the distance between mentions.
Then, we sort the mentions in a segment based on the local similarity and place the mention that has a higher similarity value at the front of the sequence. In Equation 1, we define the local similarity between a mention and one of its candidate entities. On this basis, we define $\Psi_{max}(m_i, e_i^a)$ as the maximum local similarity between $m_i$ and its candidate set. If $\Psi_{max}(m_i, e_i^a) > \Psi_{max}(m_j, e_j^b)$, then we place $m_i$ before $m_j$. Under these circumstances, the mentions in the front positions may not be able to make good use of global consistency, but their target entities have a high degree of similarity to the context words, which allows them to be disambiguated without relying on additional information. In the end, the previously selected target entity information is encoded by the global encoder, and the encoding result serves as input to the entity selector.
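The segment-then-sort procedure can be sketched in Python as follows. This is only an illustration: `rank_mentions` is a hypothetical helper, splitting the document into fixed-size chunks is an assumed way of forming segments of adjacent mentions, and the similarity values in the usage note are invented.

```python
def rank_mentions(mentions, seg_len=4):
    """Group adjacent mentions into segments, then sort each segment.

    mentions: list of (name, local_similarities) in document order, where
              local_similarities holds one score per candidate entity.
    seg_len:  number of adjacent mentions per segment (assumed fixed here).
    Returns a list of segments; inside each segment, mentions with a
    higher maximum local similarity come first.
    """
    ordered = []
    for start in range(0, len(mentions), seg_len):
        segment = mentions[start:start + seg_len]
        # max(scores) plays the role of the maximum local similarity
        # between a mention and its candidate set.
        segment_sorted = sorted(segment, key=lambda m: max(m[1]), reverse=True)
        ordered.append([name for name, _ in segment_sorted])
    return ordered
```

For the running example, `rank_mentions([("France", [0.4, 0.35, 0.3]), ("Croatia", [0.5, 0.2, 0.1]), ("2018 World Cup", [0.9, 0.1, 0.05])])` puts "2018 World Cup" first, so the easiest mention is disambiguated before the ambiguous ones.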
Before using the entity selector to choose target entities, we pre-train the global LSTM network. During the training process, we input not only positive samples but also negative ones to the LSTM, which enhances the robustness of the network. In the global encoder module, we adopt the following cross entropy loss function to train the model.
$$L_{global} = -\frac{1}{n}\sum_{x}\left[y \ln \hat{y} + (1 - y)\ln(1 - \hat{y})\right] \quad (3)$$
where $y \in \{0, 1\}$ represents the label of the candidate entity: $y = 1$ if the candidate entity is correct, and $y = 0$ otherwise. $\hat{y} \in (0, 1)$ indicates the output of our model. After pre-training the global encoder, we start using the entity selector to choose the target entity for each mention and encode these selections.
Figure 3: The architecture of the policy network. It is a feedforward neural network and the input consists of four parts: mention context representation, candidate entity representation, feature representation, and encoding of the previous decisions.
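As a small illustration of the cross entropy objective used for pre-training, the following Python sketch computes the binary cross entropy over a batch of candidate labels and model outputs. The function name, the averaging over the batch, and the clipping constant `eps` are assumptions made for the example.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross entropy.

    labels: 0/1 label per candidate entity (1 = correct entity).
    probs:  model outputs in (0, 1), one per candidate.
    eps:    clipping constant to keep log() finite (illustrative choice).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

Feeding both positive and negative samples, as the text describes, means both terms of the sum are exercised during pre-training.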

Entity Selector
In the entity selector module, we choose the target entity from the candidate set based on the results of the local and global encoders. In the process of sequence disambiguation, each selection result has an impact on subsequent decisions. Therefore, we transform the choice of the target entity into a reinforcement learning problem and view the entity selector as an agent. In particular, the agent is designed as a policy network, which can learn a stochastic policy and prevents the agent from getting stuck at an intermediate state [39]. Under the guidance of the policy, the agent decides which action (choosing the target entity from the candidate set) should be taken at each state, and receives a delayed reward when all the selections are made. In the following part, we first describe the state, action and reward. Then, we detail how to select the target entity via a policy network.
State. The result of entity selection is based on the current state information. For time $t$, the state vector $S_t$ is generated as follows:
$$S_t = V_{m_i}^t \oplus V_{e_i}^t \oplus V_{feature}^t \oplus V_{e^*}^{t-1} \quad (4)$$
where $\oplus$ indicates vector concatenation. $V_{m_i}^t$ and $V_{e_i}^t$ respectively denote the vectors of $m_i$ and $e_i$ at time $t$. Each mention has multiple candidate entities. To compare the semantic relevance between the mention and every candidate entity at the same time, we make multiple copies of the mention vector. Formally, we extend $V_{m_i}^t \in \mathbb{R}^{1 \times n}$ to $V_{m_i}^t \in \mathbb{R}^{k \times n}$ and then combine it with $V_{e_i}^t \in \mathbb{R}^{k \times n}$. Since $V_{m_i}^t$ and $V_{e_i}^t$ mainly represent semantic information, we add a feature vector $V_{feature}^t$ to enrich the state with lexical and statistical features. These features mainly include the popularity of the entity, the edit distance between the entity description and the mention context, the number of identical words in the entity description and the mention context, etc. After getting these feature values, we combine them into a vector and add it to the current state. In addition, the global vector $V_{e^*}^{t-1}$ is also added to $S_t$. As mentioned in the global encoder module, $V_{e^*}^{t-1}$ is the output of the global LSTM network at time $t-1$, which encodes the mention context and target entity information from $m_0$ to $m_{t-1}$. Thus, the state $S_t$ contains the current information and previous decisions, while also covering the semantic representations and a variety of statistical features. Next, the concatenated vector is fed into the policy network to generate an action.
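The construction of the state can be sketched with plain Python lists. The helper name `build_states` and the toy dimensions are hypothetical; the sketch copies the mention vector once per candidate and concatenates it with the candidate vector, the feature vector and the previous global encoding, in the order the four parts are listed in the text.

```python
def build_states(mention_vec, entity_vecs, feature_vecs, prev_global_vec):
    """Build one state row per candidate entity.

    mention_vec:     mention context vector (length n).
    entity_vecs:     k candidate entity vectors.
    feature_vecs:    k lexical/statistical feature vectors.
    prev_global_vec: global LSTM output from the previous time step.
    Returns k rows; each row is the concatenation of the four parts.
    """
    states = []
    for ent, feat in zip(entity_vecs, feature_vecs):
        # The mention vector and the global vector are repeated for every
        # candidate so that all candidates are compared in the same context.
        states.append(list(mention_vec) + list(ent) + list(feat) + list(prev_global_vec))
    return states
```

Each of the k rows then has the same length, which is what a shared feedforward policy network needs as input.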
Action. According to the state at each time step, we take the corresponding action. Specifically, we define the action at time step $t$ as selecting the target entity $e_t^*$ for $m_t$. The size of the action space is the number of candidate entities for each mention, where $a_i \in \{0, 1, 2, \ldots, k\}$ indicates the position of the selected entity in the candidate entity list. Clearly, each action is a direct indicator of target entity selection in our model. After completing all the actions in the sequence, we get a delayed reward.
Reward. The agent takes the reward value as the feedback on its action and learns the policy based on it. Since the current selection result has a long-term impact on subsequent decisions, we do not give an immediate reward when taking an action. Instead, a delayed reward is given as follows, which reflects whether the action improves the overall performance or not.
where $p(a_t) \in \{0, 1\}$ indicates whether the current action is correct or not: $p(a_t) = 1$ when the action is correct and $p(a_t) = 0$ otherwise. Hence $\sum_{j=t}^{T} p(a_j)$ is the number of correct actions from time $t$ to the end of the episode, while $\sum_{j=t}^{T} p(a_j) + t - T$ reflects the number of wrong actions over the same span (it equals one minus that number). Based on the above definition, our delayed reward can be used to guide the learning of the policy for entity linking.
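The reward equation itself does not appear in this text, so the following Python sketch is only one plausible reading of the two quantities defined above. It assumes, and this is an assumption rather than a statement from the paper, that a correct action receives the first quantity and a wrong action receives the second. The function name is hypothetical and time steps are 0-indexed.

```python
def delayed_rewards(correct_flags):
    """Delayed reward per action under the assumed reading.

    correct_flags[t] is p(a_t): 1 if the action at step t was correct,
    0 otherwise. With 0-indexed steps, `last` plays the role of T.
    """
    last = len(correct_flags) - 1
    rewards = []
    for t, p in enumerate(correct_flags):
        n_correct = sum(correct_flags[t:])   # sum_{j=t}^{T} p(a_j)
        wrong_term = n_correct + t - last    # sum_{j=t}^{T} p(a_j) + t - T
        # Assumption: correct actions get the count of correct actions,
        # wrong actions get the (non-positive) second quantity.
        rewards.append(n_correct if p == 1 else wrong_term)
    return rewards
```

Under this reading, an early mistake followed by further mistakes is penalized more heavily than an isolated one, which matches the stated aim of rewarding actions by their effect on the whole sequence.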
Policy Network. After defining the state, action, and reward, our main challenge becomes choosing an action from the action space. To solve this problem, we sample the value of each action by a policy network $\pi_\Theta(a|s)$. The structure of the policy network is shown in Figure 3. The input of the network is the current state, including the mention context representation, candidate entity representation, feature representation, and encoding of the previous decisions. We concatenate these representations and feed them into a multilayer perceptron. For each hidden layer, we generate the output by:
$$h_i(S_t) = \mathrm{ReLU}(W_i \cdot h_{i-1}(S_t) + b_i) \quad (6)$$
where $W_i$ and $b_i$ are the parameters of the $i$th hidden layer, and the ReLU activation function yields $h_i(S_t)$. After getting the output of the last hidden layer, we feed it into a softmax layer which generates the probability distribution of actions. The probability distribution is generated as follows:
$$\pi(a|s) = \mathrm{Softmax}(W \cdot h_l(S) + b) \quad (7)$$
where $h_l(S)$ is the output of the last hidden layer, and $W$ and $b$ are the parameters of the softmax layer. For each mention in the sequence, we take an action to select the target entity from its candidate set. After completing all decisions in the episode, each action gets an expected reward, and our goal is to maximize the expected total reward. Formally, the objective function is defined over the following quantities: $P_\Theta(s_t, a_t)$ is the state transfer function, $\pi_\Theta(a|s)$ indicates the probability of taking action $a$ under the state $s$, and $R(a_t)$ is the expected reward of action $a$ at time step $t$.
Algorithm 1
 [...]
 3: Generate the candidate set for each mention
 4: Divide the mentions in $D_k$ into multiple sequences $S = \{S_1, S_2, \ldots, S_N\}$;
 5: for $S_k$ in $S$ do
 6:   Rank the mentions $M = \{m_1, m_2, \ldots, m_n\}$ in $S_k$ based on the local similarity;
 7:   for $m_k$ in $M$ do
 8:     Sample the target entity $e_k^*$ for $m_k$ with $\Theta$;
 [...]
 12:   Compute delayed reward $R(a_t)$ for each action;
 13:   Update the parameter $\Theta$ of policy network: $\Theta \leftarrow \Theta + \alpha \sum_t R(a_t) \nabla_\Theta \log \pi_\Theta(a|s)$
 14: end for
 15: Update the parameter $\Phi$ in the global LSTM network
 16: end for
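The forward pass of such a policy network can be sketched in plain Python. This is an illustrative reading, not the authors' implementation: it assumes that every candidate's state row is scored by the same ReLU hidden layers followed by a linear output, and that the softmax is taken across the candidates. All function names and the weight layout are hypothetical.

```python
import math


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(candidate_states, hidden_layers, out_w, out_b):
    """Return a probability for each candidate entity (one per state row).

    hidden_layers: list of (weights, bias) pairs applied with ReLU.
    out_w, out_b:  parameters of the final linear scoring layer.
    """
    logits = []
    for state in candidate_states:
        h = state
        for weights, bias in hidden_layers:
            h = relu(affine(weights, bias, h))
        logits.append(sum(w * hi for w, hi in zip(out_w, h)) + out_b)
    return softmax(logits)


def sample_action(probs, rng):
    """Sample an action index from the distribution (stochastic policy)."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

Sampling from the distribution, for example with `sample_action(probs, random.Random(0))`, rather than always taking the most probable candidate is what lets the agent explore different selections during training.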
According to the REINFORCE policy gradient algorithm [38], we update the policy parameters as in Equation 9:
$$\Theta \leftarrow \Theta + \alpha \sum_t R(a_t) \nabla_\Theta \log \pi_\Theta(a|s) \quad (9)$$
As the global encoder and the entity selector are mutually correlated, we train them jointly after pre-training the two networks. The details of the joint learning are presented in Algorithm 1.
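To make the REINFORCE update concrete, here is a deliberately simplified Python sketch. It replaces the state-dependent policy network with a stateless softmax policy that has one logit per action, which is a simplification chosen for illustration; for such a policy the gradient of $\log \pi(a)$ with respect to the logits is the one-hot vector of $a$ minus the probability vector. The function names and the learning rate are assumptions.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, episode, alpha=0.1):
    """One REINFORCE step: theta <- theta + alpha * sum_t R(a_t) * grad log pi(a_t).

    logits:  parameters of a stateless softmax policy (one per action).
    episode: list of (action_index, delayed_reward) pairs.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, reward in episode:
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            # d/d theta_i of log softmax(theta)[action] = indicator - probs[i]
            grad[i] += reward * (indicator - probs[i])
    return [theta + alpha * g for theta, g in zip(logits, grad)]
```

Actions that received a positive delayed reward become more probable after the update, and actions with a negative reward become less probable; in the full model the same rule is applied to the weights of the policy network through backpropagation.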

EXPERIMENT
In order to evaluate the effectiveness of our method, we train the RLEL model and validate it on a series of popular datasets that are also used by [14,37]. To avoid overfitting to one dataset, we use both AIDA-Train and Wikipedia data in the training set. Furthermore, we compare RLEL with several baseline methods, and our model achieves state-of-the-art results. We implement our models in TensorFlow and run experiments on 4 Tesla V100 GPUs.

Experiment Setup
Datasets. We conduct experiments on several different types of public datasets, including news and encyclopedia corpora. The training set consists of the AIDA-Train and Wikipedia datasets, where AIDA-Train contains 18448 mentions and Wikipedia contains 25995 mentions.
Training Details. During the training of our RLEL model, we select the top K candidate entities for each mention to reduce memory usage and run time. We define $R_t$ as the recall of the correct target entity in the top K candidate list. According to our statistics, when K is set to 1, $R_t$ is 0.853; when K is 5, $R_t$ is 0.977; and when K increases to 10, $R_t$ is 0.993. Empirically, we choose the top 5 candidate entities as the input of our RLEL model. For the entity description, there is a lot of redundant information in the Wikipedia page. To reduce the impact of noisy data, we use the TextRank algorithm [23] to select 15 keywords as the description of the entity. Likewise, we choose 15 words around the mention as its context. In the global LSTM network, when the number of mentions does not reach the set length, we adopt a mention padding strategy: we copy the last mention in the sequence until the number of mentions reaches the set length.
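The mention padding strategy is simple enough to show directly. The following Python sketch uses a hypothetical helper name and treats mentions as opaque items; how sequences longer than the set length are handled is not described in the text, so the sketch returns them unchanged.

```python
def pad_mentions(mentions, target_len):
    """Pad a mention sequence by repeating its last mention.

    mentions:   non-empty list of mentions (any objects).
    target_len: the set sequence length of the global LSTM.
    """
    if not mentions:
        raise ValueError("cannot pad an empty mention sequence")
    if len(mentions) >= target_len:
        return list(mentions)  # nothing to pad
    padding = [mentions[-1]] * (target_len - len(mentions))
    return list(mentions) + padding
```

For example, `pad_mentions(["France", "Croatia"], 4)` yields `["France", "Croatia", "Croatia", "Croatia"]`.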
Hyper-parameter setting. We set the dimensions of the word embeddings and entity embeddings to 300, where the word embeddings and entity embeddings are released by [30] and [14] respectively. For the local LSTM network, the number of LSTM cell units is set to 512, the batch size is 64, and the rank margin γ is 0.1. Similarly, in the global LSTM network, the number of LSTM cell units is 700 and the batch size is 16. In the above two LSTM networks,

Comparing with Previous Work
Baselines. We compare RLEL with a series of EL systems that report state-of-the-art results on the test datasets. They cover various methods, including classification models [25], rank models [6,34] and probabilistic graph models [14,16,20,21,37]. In addition, Cheng et al. [5] formulate their global decision problem as an Integer Linear Program (ILP) which incorporates entity-relation inference. Globerson et al. [15] introduce a multi-focal attention model which allows each candidate to focus on a limited number of mentions, and Yamada et al. [40] propose a word and entity embedding model specifically designed for EL.
Evaluation Metric. We use the standard Accuracy, Precision, Recall and F1 at mention level (Micro) as the evaluation metrics, where $M^*$ is the gold standard set of linked name mentions and $M$ is the set of linked name mentions output by an EL method.
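For reference, the commonly used mention-level (micro) definitions of precision, recall and F1 can be computed as in the Python sketch below. The exact formulas are not shown in this text, so treating precision as the share of correct links among the system's outputs and recall as the share of gold links that were recovered is an assumption based on standard practice; the function name and the dictionary representation are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro precision, recall and F1 over linked mentions.

    gold:      dict mapping mention id -> gold entity (the set M*).
    predicted: dict mapping mention id -> entity output by the system (M).
    """
    correct = sum(1 for mention, entity in predicted.items() if gold.get(mention) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

When a system links every gold mention, the number of predictions equals the number of gold mentions, so precision, recall and F1 coincide.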
Results. As in previous work, we use in-KB accuracy and micro F1 to evaluate our method. We first test the model on the AIDA-B dataset. From Table 2, we can observe that our model achieves the best result. The previous best results on this dataset were produced by [14,37], which both built CRF models. They calculate the pairwise scores between all candidate entities. In contrast, our model only considers the consistency of the target entities and ignores the relationships between incorrect candidates. The experimental results show that our model can reduce the impact of noisy data and improve the accuracy of disambiguation. Apart from experimenting on AIDA-B, we also conduct experiments on several different datasets to verify the generalization performance of our model. From Table 3, we can see that RLEL achieves relatively good performance on ACE2004, CWEB and WIKI. At the same time, previous models [5,14,37] achieve better performance on news datasets such as MSNBC and AQUINT, but their results on encyclopedia datasets such as WIKI are relatively poor. To avoid overfitting to some datasets and improve the robustness of our model, we not only use AIDA-Train but also add Wikipedia data to the training set. In the end, our model achieves the best overall performance.
For most existing EL systems, entities with lower frequency are difficult to disambiguate. To gain further insight, we analyze the accuracy on the AIDA-B dataset for situations where gold entities have low popularity. We divide the gold entities according to their pageviews in Wikipedia; the statistical disambiguation results are shown in Table 4. Since some pageviews cannot be obtained, we only count part of the gold entities. The results indicate that our model still works well for low-frequency entities. For medium-frequency gold entities, however, our model does not work well enough. The most important reason is that other candidate entities corresponding to these medium-frequency gold entities have higher pageviews and local similarities, which makes them difficult for the model to distinguish.

Discussion on different RLEL variants
To demonstrate the effects of RLEL, we evaluate our model under different conditions. First, we evaluate the effect of sequence length on global decision making. Second, we assess whether sorting the mentions has a positive effect on the results. Third, we analyze the results of not adding the global encoding during entity selection. Last, we compare our RL selection strategy with the greedy choice.
Sequences of different lengths. A document may contain multiple topics, so we do not add all mentions to a single sequence. In practice, we add some adjacent mentions to the sequence and use reinforcement learning to select entities from beginning to end. To analyze the impact of the number of mentions on joint disambiguation, we experiment with sequences of different lengths. The results on AIDA-B are shown in Figure 4. We can see that when the sequence is too short or too long, the disambiguation results are both very poor. When the sequence length is less than 3, the delayed reward cannot take effect in reinforcement learning, and when the sequence length reaches 5 or more, noisy data may be added. Finally, we choose 4 adjacent mentions to form a sequence.
Influence of ranking mentions. In this section, we test whether ranking mentions is helpful for entity selection. First, we directly input the mentions into the global encoder in the order they appear in the text. We record the disambiguation results and compare them with the method that ranks mentions. As shown in Figure 5a, the model with ranked mentions achieves better performance on most datasets, indicating that it is effective to place the mention with a higher local similarity at the front of the sequence. It is worth noting that the effect of ranking mentions is not obvious on the MSNBC dataset. The reason is that most mentions in MSNBC have similar local similarities, so the order of disambiguation has little effect on the final result.
Effect of global encoding. Most previous methods mainly use the similarities between entities to correlate them with each other, but our model associates them by encoding the selected entity information. To assess whether the global encoding contributes to disambiguation rather than adding noise, we compare the performance with and without the global information. When the global encoding is not added, the current state only contains the mention context representation, candidate entity representation and feature representation; notably, the selected target entity information is not taken into account. From the results in Figure 5b, we can see that the model with global encoding achieves an improvement of 4% in accuracy over the method without global encoding.
Figure 5: (a) The influence of ranking mentions for entity selection. (b) The effect of the global encoding for entity selection. (c) Comparison of reinforcement learning selection with greedy choice.
Different entity selection strategies. To illustrate the necessity of adopting reinforcement learning for entity selection, we compare two entity selection strategies, as in [12]. Specifically, we perform entity selection with reinforcement learning and with greedy choice respectively. The greedy choice selects the entity with the largest local similarity from the candidate set, whereas the reinforcement learning selection is guided by the delayed reward, which has a global perspective. In the comparative experiment, we keep the other conditions consistent and only replace the RL selection with the greedy choice. Based on the results in Figure 5c, we conclude that our entity selector performs much better than the greedy strategy. Table 5 shows two entity selection examples from our RLEL model. For multiple mentions appearing in the document, we first sort them according to their local similarities, and then select the target entities in order with the reinforcement learning model. From the results of sorting and disambiguation, we can see that our model is able to utilize the topical consistency between mentions and make full use of the selected target entity information.

RELATED WORK
The related work can be roughly divided into two groups: entity linking and reinforcement learning.

Entity Linking
Entity linking falls broadly into two major approaches: local and global disambiguation. Early studies use local models to resolve mentions independently; they usually disambiguate mentions based on lexical matching between the mention's surrounding words and the entity profile in the reference KB. Various methods have been proposed to model a mention's local context, ranging from binary classification [25] to rank models [4,11]. In these methods, a large number of hand-designed features are applied. For some marginal mentions from which features are difficult to extract, researchers also exploit the data retrieved by search engines [7,8] or Wikipedia sentences [36]. However, the feature engineering and search engine methods are both time-consuming and laborious. Recently, with the popularity of deep learning models, representation learning has been utilized to automatically find semantic features [2,17]. The entity representations learned by jointly modeling textual contexts and the knowledge base are effective in combining multiple sources of information. To make full use of the information contained in these representations, we also utilize pre-trained entity embeddings in our model.
In recent years, under the assumption that the target entities of all mentions in a document should be related, many novel global models for joint linking have been proposed. Assuming topical coherence among mentions, the authors of [13,33] construct factor graph models, which represent the mentions and candidate entities as variable nodes and exploit factor nodes to denote a series of features. Two recent studies [14,37] use a fully-connected pairwise Conditional Random Field (CRF) model and exploit loopy belief propagation to estimate the max-marginal probability. Moreover, PageRank or Random Walk [16,19,42] is utilized to select the target entity for each mention. The above probabilistic models usually need many predefined features, and the max-marginal probability becomes difficult to calculate as the number of nodes increases. In order to automatically learn features from the data, Cao et al. [1] apply a Graph Convolutional Network to flexibly encode entity graphs. However, the graph-based methods are computationally expensive because there are many candidate entity nodes in the graph.
To reduce the calculation between candidate entity pairs, Globerson et al. [15] introduce a coherence model with an attention mechanism, where each mention only focuses on a fixed number of mentions. Unfortunately, choosing the number of attended mentions is not easy in practice. Two recent studies [31,32] finish linking all mentions by scanning the pairs of mentions at most once; they assume each mention only needs to be consistent with one other mention in the document. The limitation of their method is that the consistency information is too sparse, resulting in low confidence. Similar to us, Guo et al. [16] also sort mentions according to the difficulty of disambiguation, but they do not make full use of the information of previously referred entities for the subsequent entity disambiguation. Nguyen et al. [27] use a sequence model, but they simply encode the results of the greedy choice and measure the similarities between the global encoding and the candidate entity representations. Their model does not consider the long-term impact of current decisions on subsequent choices, nor does it add the selected target entity information to the current state to help disambiguation.

Reinforcement Learning
In the last few years, reinforcement learning has emerged as a powerful tool for solving complex sequential decision-making problems. It is well known for its great success in the game field, such as Go [35] and Atari games [26]. Recently, reinforcement learning has also been successfully applied to many natural language processing tasks and achieved good performance [12,22,39]. Feng et al. [12] used reinforcement learning for the relation classification task by filtering out the noisy data from the sentence bag, and they achieved large improvements compared with traditional classifiers. Zhang et al. [41] applied reinforcement learning to sentence representation by automatically discovering task-relevant structures. For automatic taxonomy induction from a set of terms, Han et al. [18] designed an end-to-end reinforcement learning model to determine which term to select and where to place it in the taxonomy, which effectively reduced the error propagation between the two phases. Inspired by the above works, we also add reinforcement learning to our framework.

CONCLUSIONS
In this paper we consider entity linking as a sequence decision problem and present a reinforcement learning based model. Our model learns a policy for selecting target entities in a sequential manner and makes decisions based on the current state and previous ones. By utilizing the information of previously referred entities, we can take advantage of global consistency to disambiguate mentions. Each selection result in the current state also has a long-term impact on subsequent decisions, which gives the learned policy a global view. In experiments, we evaluate our method on AIDA-B and other well-known datasets; the results show that our system outperforms state-of-the-art solutions. In the future, we would like to use reinforcement learning to detect mentions and to determine which mention should be disambiguated first in the document.