Data mining methods for conformational B-cell epitope prediction

Publication Type:
Thesis
Issue Date:
2016
Files:
01front.pdf: contents and abstract, 178.39 kB, Adobe PDF
02whole.pdf: thesis, 7.21 MB, Adobe PDF
NO FULL TEXT AVAILABLE. This thesis contains 3rd party copyright material.

Antigen-antibody binding is an essential immune mechanism. As the binding site on the antigen side, the B-cell epitope plays a fundamental role in immune recognition and provides an ideal target for vaccine development, disease diagnosis and immunotherapy. Conformational epitopes make up more than 90% of B-cell epitopes. However, the high complexity of wet-lab experiments is a primary obstacle to identifying them. Current prediction methods overlook some vital issues and typically have limited performance. To tackle these performance-affecting issues, this study designs and implements a series of data mining methods for conformational B-cell epitope prediction from both antigen structure and antigen sequence. The study also proposes a practical propensity that can be used to recognise conserved B-cell epitopes effectively.

A major contribution of this thesis is the construction of more accurate structure-based prediction models. A serious problem with existing structure-based approaches is that they conventionally build their prediction models on antigens isolated directly from antigen-antibody bound structures (i.e., quaternary structures). Such structures carry information about the binding site, such as shape and B-factor, that gives the models an unfair advantage. This is recognised as one of the primary causes of unsatisfactory performance. To deal with this issue, this study develops a new prediction method, CeePre, based on unbound antigen structures (i.e., tertiary structures). Additionally, this work applies the B-factor derived from the tertiary structure and shows its effectiveness through propensity analysis. Based on the principle of antigenic residue aggregation, a second learning step is deployed to further refine the results.

A second key issue that inhibits performance improvements is the incomplete annotation of the data sets.
There can be multiple epitopes on one antigen; nevertheless, in most cases only a portion of the epitopes have been determined or annotated. This situation is particularly obvious in previous methods based on bound structures. They conventionally label only one epitope for each antigen; all the unselected or undetermined epitopes are labelled as non-epitope, which biases epitope prediction. A novel positive-unlabelled learning method is proposed to handle this issue and is applied to conformational B-cell epitope prediction.

With species labelled manually, a species-specific analysis is performed on several propensities. This analysis reaches an important conclusion: similar trends between epitope and surface exist in different species, which implies that general predictors can work for all species. The details vary, however, so refinement using species information may help to enhance prediction performance.

Another primary contribution of this thesis is an accurate prediction model built from antigen sequences. The purpose is to overcome the main drawback of structure-based methods: fewer antigen structures are available than antigen sequences. In addition, this approach addresses the common problem of data heterogeneity by proposing a staged heterogeneity learning framework, which learns both the characteristics and the heterogeneity of the data in a phased manner. The framework is applied to build a sequence-based conformational B-cell epitope prediction model, which achieves excellent performance on heterogeneous data sources. Furthermore, an algorithm is designed to cluster the predicted individual antigenic residues into conformational B-cell epitopes, giving the predictions strong potential for real-world applications such as vaccine development.

A conserved epitope is an epitope retained by multiple strains of a virus. It is the target of a broadly neutralising antibody.
Identification of conserved epitopes can help to design broad-spectrum vaccines. This thesis proposes an effective propensity, the Average Amino Acid Conservation Score (AAACS), to identify conserved epitopes; its effectiveness is validated on the influenza HA (hemagglutinin) antigen.

All the prediction methods proposed in this thesis outperform the state-of-the-art approaches. They should contribute to the recognition and application of B-cell epitopes.
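
The abstract names the Average Amino Acid Conservation Score but does not define it. The sketch below illustrates only one plausible reading of the name: a per-position conservation score averaged over the residues of a candidate epitope. The entropy-based conservation measure, the function names, and the toy alignment are all assumptions made for illustration; they are not the thesis's definition of AAACS.

```python
import math
from collections import Counter

AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def column_conservation(column):
    """Conservation of one alignment column: 1 - normalised Shannon entropy.

    ASSUMPTION: the conservation measure is chosen for illustration only.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return 1.0 - entropy / math.log(AMINO_ACIDS)


def average_conservation(alignment, epitope_positions):
    """Mean conservation over the alignment columns at the given positions."""
    scores = [
        column_conservation([seq[pos] for seq in alignment])
        for pos in epitope_positions
    ]
    return sum(scores) / len(scores)


# Toy alignment of three hypothetical strains (equal length, no gaps).
strains = ["ACDKG", "ACDRG", "ACEKG"]
conserved = average_conservation(strains, [0, 1, 4])  # identical columns
variable = average_conservation(strains, [2, 3])      # columns with substitutions
```

Under these assumptions, a candidate epitope whose positions are identical across strains scores higher than one containing substitutions, which is the kind of ranking a conservation-based propensity is meant to provide.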