Educational data analysis refers to techniques, tools, and research designed to automatically extract meaning from large repositories of data generated by or related to people’s learning activities in educational environments. It is a research field that focuses on helping policymakers and administrators understand how analytics and data mining can be applied to improve education. Unfortunately, most research on educational data simply applies existing machine learning or data mining algorithms; very few publications have discussed the characteristics of the data itself. Traditional data mining algorithms have a disadvantage in that most of them assume that data objects, attributes, and values are independent and identically distributed (IID). However, real-world data usually contains strong couplings among values, attributes, and data objects, and this represents a considerable challenge to existing methods and tools. This thesis focuses on utilizing coupling analysis in educational data analysis tasks. In particular, it addresses two such tasks: student performance prediction and student social media sentiment analysis.
The student performance prediction task is examined first. This thesis begins with the most straightforward method, which integrates coupled similarities as the distance measure for a weighted k-nearest centroid classifier. This method considers not only the intra-coupled similarity within an attribute but also the inter-coupled similarity between attributes. Because the computational cost of coupling analysis is high, a more efficient method is proposed that uses selected centroid objects instead of all objects in the nearest neighbor search. Furthermore, support vector machines (SVMs) are integrated with coupled similarity. The original SVM is designed for numerical data. This thesis develops a novel pairwise SVM that uses the coupled similarity metric as a kernel between data objects with nominal attributes. The experimental results show that the two proposed methods outperform traditional SVMs and other popular classification methods on various public data sets and on the student performance prediction task.
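As an illustration of how intra-coupled and inter-coupled similarity can be combined for nominal data, the following is a minimal sketch. It does not reproduce the thesis's exact definitions, which this abstract does not state. It assumes one frequency-based formulation from the coupled nominal similarity literature: the intra-coupled term depends on value frequencies within an attribute, and the inter-coupled term compares the conditional distributions that two values induce on every other attribute. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def intra_similarity(column, x, y):
    """Frequency-based similarity of values x and y within one attribute (assumed form)."""
    counts = Counter(column)
    fx, fy = counts[x], counts[y]
    if fx == 0 or fy == 0:
        return 0.0
    return (fx * fy) / (fx + fy + fx * fy)


def inter_similarity(column, other, x, y):
    """Overlap of the distributions P(w | x) and P(w | y) over values w of another attribute."""
    cond = defaultdict(Counter)
    for v, w in zip(column, other):
        cond[v][w] += 1
    nx, ny = sum(cond[x].values()), sum(cond[y].values())
    if nx == 0 or ny == 0:
        return 0.0
    shared = set(cond[x]) | set(cond[y])
    return sum(min(cond[x][w] / nx, cond[y][w] / ny) for w in shared)


def coupled_value_similarity(rows, j, x, y):
    """Intra-coupled term times the mean inter-coupled term over all other attributes."""
    columns = list(zip(*rows))
    others = [k for k in range(len(columns)) if k != j]
    if not others:
        raise ValueError("this sketch assumes at least two attributes")
    inter = sum(inter_similarity(columns[j], columns[k], x, y) for k in others)
    return intra_similarity(columns[j], x, y) * inter / len(others)


def object_similarity(rows, a, b):
    """Sum of coupled value similarities across attributes for two data objects."""
    return sum(coupled_value_similarity(rows, j, a[j], b[j]) for j in range(len(a)))


# Toy data (made up for illustration): attendance level and whether the student passed.
rows = [("low", "yes"), ("low", "yes"), ("high", "no"), ("high", "no")]
same = coupled_value_similarity(rows, 0, "low", "low")    # 0.5
diff = coupled_value_similarity(rows, 0, "low", "high")   # 0.0
```

In this toy data the two attendance values occur equally often, so the intra-coupled term alone cannot separate them; the inter-coupled term does, because "low" and "high" co-occur with disjoint values of the second attribute. A similarity of this kind could then be used as the distance in a nearest neighbor search or as a kernel. Whether a given similarity yields a valid (positive semi-definite) SVM kernel has to be checked separately.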
Second, student social media sentiment analysis is examined. Unlike linguistic methods, this thesis learns to classify student sentiment by applying data mining to labeled historical data. Most previous research employs the vector-space model for text representation and analysis; however, the vector-space model does not utilize information about term-to-term relationships. In other words, traditional text mining techniques assume that terms are independent and identically distributed (IID). This thesis introduces a novel term representation that incorporates coupling relations between neighbors. This coupled representation provides much richer information, which makes it possible to define a coupled document-to-document similarity metric; a k-nearest centroid classifier based on this coupled document similarity is then applied to the classification task. Experiments verify that the proposed approach outperforms the classic vector-space-based classifier and displays distinct advantages and richness in student social media sentiment analysis tasks.
Finally, owing to the complexity of the proposed algorithms and the enormous volume of education-related data sources, a scalable educational data mining platform is in great demand. Hence, with the help of a Spark cluster, a novel coupled-similarity-based learning approach is proposed to address the big data learning problem by parallelizing the coupled similarity calculation. Further, a parallel k-NN for classification and a parallel k-Means for clustering are proposed. The experimental results show that the proposed methods not only outperform the baselines in clustering and classification performance but also, compared with the original algorithms, offer a substantial improvement in time efficiency as the data scale grows. The proposed framework has already been implemented, and a scalable educational data analysis platform with coupling analysis will serve to meet a host of future challenges.