Gene Set Anomaly Score: A Genomic Data and Knowledge Driven Approach for Analysing Anomalous Gene Expression in Cancer Patients

Publication Type: Thesis
Issue Date: 2023
Genomics research often uses Gene Set Enrichment Analysis (GSEA) to rank genes that correlate with the presence of phenotypical traits and to interpret how variations in gene expression influence those traits. GSEA explains the genes it finds through their associations with gene sets. As gene sets represent different biological mechanisms, they can indicate overall shifts in expression values in relation to their biology. This thesis investigated the relationships between patients and diseases by using gene sets, integrating gene expression data and gene set ontologies to develop a new analytics method called gene expression anomaly scores. These scores measure the deviation of expression values from expected values. This thesis also investigated the representation of patient biology as two-dimensional graphs derived from anomaly scores. There are thousands of gene sets relating to a given disease, such as cancer. To identify strongly associated gene sets, this thesis applied principal component analysis (PCA) and maximum relevance and minimum redundancy (MRMR), selecting the two most prominent dimensions. Thus, PCA and MRMR were each used to embed patients into a two-dimensional anomaly score space. Embedding patients using anomaly scores revealed relationships between patients and patient biology through clustering and feature selection in this space. Moreover, this thesis applied explainable AI (XAI) to understand the patient biology (gene sets) responsible for the predictions made by predictive models or AI algorithms. Specifically, it applied the Local Interpretation-Driven Abstract Bayesian Network (LINDA-BN), which extracts patient biology and shows the relationships between the biologies responsible for a prediction. The proposed method was used to analyse gene expression data of cancer patients from four different data sets. More specifically, anomaly scores followed by PCA or MRMR showed groups of cancer patients in scatter plots.
These groups appeared to be related to treatment outcomes. In addition, MRMR was able to identify potential gene sets with meaningful biological implications. By comparison, when raw and state-of-the-art gene expression scores were analysed, only gene-level patterns were apparent. The distribution of anomaly scores varied significantly between patients who relapsed and those who did not. In addition, k-means clustering on anomaly scores produced better clusters than clustering on scores from state-of-the-art methodologies. Furthermore, anomaly scores uncovered novel cancer biology in contrast to GSEA and state-of-the-art approaches. Finally, the outcomes of instance-based LINDA-BN showed an interpretable and explicable method for predicting the medical condition of a cancer patient.
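The abstract does not give the exact formula for the anomaly score, so the following is only a minimal sketch of the general idea it describes: score each gene set by how far a patient's expression deviates from expected values, then embed patients in two dimensions with PCA. The z-score form of the deviation, the use of a reference cohort as the "expected" values, and the names `anomaly_scores`, `pca_2d`, and the toy gene sets are all assumptions made for illustration, not the thesis's method.

```python
import numpy as np

def anomaly_scores(expr, reference, gene_sets):
    """Hypothetical gene set anomaly score (assumed form, not the thesis's).

    expr:      (n_patients, n_genes) expression matrix to be scored
    reference: (n_ref, n_genes) expression matrix defining expected values
    gene_sets: dict mapping gene set name -> list of gene column indices
    Returns the sorted gene set names and an (n_patients, n_sets) score matrix.
    """
    mu = reference.mean(axis=0)
    sigma = reference.std(axis=0, ddof=1)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid division by zero
    z = (expr - mu) / sigma                   # per-gene deviation from expected
    names = sorted(gene_sets)
    # Assumption: a set's score is the mean absolute z-score of its genes.
    cols = [np.abs(z[:, gene_sets[n]]).mean(axis=1) for n in names]
    return names, np.column_stack(cols)

def pca_2d(scores):
    """Project patients onto the first two principal components via SVD."""
    centered = scores - scores.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Synthetic toy data: 6 genes, 3 hypothetical gene sets, 8 patients.
rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 6))
expr = rng.normal(size=(8, 6))
expr[:4, :3] += 5.0                           # first 4 patients deviate in genes 0-2
gene_sets = {"set_a": [0, 1, 2], "set_b": [3, 4, 5], "set_c": [0, 3]}

names, scores = anomaly_scores(expr, reference, gene_sets)
embedding = pca_2d(scores)                    # one 2-D point per patient
```

In this toy setup, the four patients with shifted expression in genes 0 to 2 receive a much higher `set_a` score than the others, so they separate from the rest in the 2-D embedding. A clustering step such as k-means could then be run on `embedding` or on `scores` directly.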