Leveraging SMOTE in a two-layer model for prediction of protein-protein interactions

Publication Type:
Conference Proceeding
Proceedings - 2019 7th International Conference on Advanced Cloud and Big Data, CBD 2019, 2019, pp. 133-138
© 2019 IEEE. The mechanisms of infectious disease between hosts and pathogens remain an active research topic. This research draws on host-pathogen interaction data, including protein and genome data, to support the discovery and prediction of the underlying mechanisms. However, incomplete protein-protein interaction data impede progress and necessitate wet-lab experiments to examine and verify latent interactions. Although numerous studies have applied computational models, especially machine learning models, their performance has not been good enough to produce high-fidelity candidate interactions, owing to the nature of protein-protein interaction data. In this paper, we propose a two-layer model for predicting host-pathogen protein-protein interactions that tackles the challenges associated with feature representation algorithms and imbalanced data. The model consists of two essential modules: XGBoost, which reduces the imbalance ratio of the data, and an SVM, which improves performance. SMOTE is incorporated as a key component of the model to alleviate the bias caused by the imbalance ratio. We carefully collected protein interaction data from public databases and built a dataset following a protocol consistent with the literature. A variety of models, including traditional models, models from the major literature, and our model, are evaluated on the datasets. Results demonstrate that our model significantly improves performance compared with other state-of-the-art models.
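The abstract names SMOTE as the component that counters class imbalance but does not give implementation details. As a rough illustration only, the sketch below shows the general SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for this example; it is not the authors' implementation and does not show how SMOTE is combined with XGBoost or the SVM in the paper.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours.

    Hypothetical helper for illustration; parameter names are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up feature vectors (not data from the paper).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
```

Each synthetic point lies on the line segment between two existing minority samples, so the minority class grows without duplicating records exactly. In practice a maintained library implementation would normally be preferred over a hand-written one.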