Data Mining In Epigenetic Modification and Gene Expression

Publication Type: Thesis
Issue Date: 2020
This thesis employs data mining techniques to discover domain knowledge in epigenetic modification and gene expression profiles. Computational methods are developed for three research questions: how to accurately predict DNA N⁴-methylcytosine sites, how to precisely identify mRNA N⁶-methyladenosine sites, and how to identify gene expression profile markers for lung cancer. The proposed methods aim to improve the performance of computational approaches by constructing efficient feature spaces, optimizing machine learning schemes, addressing data imbalance, and employing a novel statistical analysis approach, so as to provide researchers with efficient computational tools.

DNA N⁴-methylcytosine (4mC) is a critical epigenetic modification that plays various roles in the restriction-modification system. Because experimental laboratory detection is costly, computational methods for identifying 4mC in DNA sequences have been explored in recent years. However, the performance of state-of-the-art methods is limited by the lack of effective sequence features and the ad hoc choice of learning algorithms. Chapter 3 proposes a new method with a novel sequence feature space and machine learning scheme. In sequence encoding, five essential sequence features are integrated into a 292-dimensional feature space that represents both global and local sequence characteristics. A feature selection scheme is then built in which the feature importance scores produced during XGBoost training serve as the selection criterion. Finally, an SVM-based prediction model is trained on the selected features and optimized by 10-fold cross-validation. In the results, the impact of feature selection on model performance is evaluated by an independent test. The proposed method outperforms three state-of-the-art predictors in both the independent test and 10-fold cross-validation.
Furthermore, two case studies demonstrate the effectiveness of our method in practical situations.

N⁶-methyladenosine (m⁶A) is widely involved in mRNA metabolism and embryogenesis. Multiple computational predictors of human mRNA m⁶A sites have been developed. However, existing methods have two main drawbacks: first, they learn inadequately from imbalanced training data; second, their sequence text features represent m⁶A sequence characteristics poorly. Chapter 4 proposes cost-sensitive learning to address the data imbalance. This cost-sensitive approach learns from the entire imbalanced dataset without randomly selecting negative samples. In sequence representation, site location, entropy features, and specific single nucleotide polymorphism (SNP) positions are taken as new features, which improve performance significantly. Compared with existing predictors, our method achieves better correctness and robustness in both independent tests and case studies. The results suggest that imbalanced learning is a promising way to improve m⁶A prediction.

The early diagnosis of lung cancer has long been a challenging problem in clinical practice. Identifying differentially expressed genes as disease markers is a promising solution. Chapter 5 presents a novel approach to identify marker genes and define the boundaries of gene expression profiles for human lung cancer. By calculating the kernel maximum mean discrepancy (MMD), the proposed method evaluates the expression difference between normal, normal adjacent to tumor (NAT), and tumor samples. For marker genes, the expression level boundaries among the groups are defined using information entropy theory. Compared with two conventional methods, the t-test and fold change, the genes selected by MMD values perform better under all metrics in 10-fold cross-validation.
Furthermore, GO and KEGG enrichment analyses validate the discovered marker genes in functional pathways. Finally, we choose the ten most meaningful genes as lung cancer markers and calculate their expression profile boundaries. The proposed method is more accurate than conventional differential expression analysis (DEA) methods in marker gene identification and provides a reliable way to define gene expression level boundaries.
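
As a rough illustration of the kernel maximum mean discrepancy idea mentioned in the abstract, the following minimal sketch scores each gene by the squared MMD between its expression values in two sample groups and ranks genes by that score. It is not the thesis implementation: the Gaussian (RBF) kernel, the bandwidth parameter `gamma`, the biased estimator, the function names, and the toy expression values are all assumptions made for this example.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel on scalars; the kernel choice is an assumption."""
    return math.exp(-gamma * (x - y) ** 2)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two 1-D samples."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        total = sum(rbf_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def rank_genes(group_a, group_b, gamma=1.0):
    """Rank genes shared by both groups by descending squared MMD.

    group_a and group_b map a gene name to a list of expression values.
    """
    scores = {
        gene: mmd_squared(group_a[gene], group_b[gene], gamma)
        for gene in group_a
        if gene in group_b
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: gene names and values are made up for illustration.
normal = {"GENE_A": [1.0, 1.2, 0.9, 1.1], "GENE_B": [2.0, 2.1, 1.9, 2.0]}
tumor = {"GENE_A": [3.0, 3.2, 2.9, 3.1], "GENE_B": [2.0, 2.2, 1.8, 2.1]}

ranking = rank_genes(normal, tumor)
for gene, score in ranking:
    print(gene, round(score, 4))
```

In this toy setting the gene whose expression shifts between the groups receives the larger score. A comparison across three groups (normal, NAT, tumor) could apply the same function to each pair of groups, though how the thesis combines the pairwise values is not stated here.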