Statistical analysis on distributed systems

Organisations are increasingly storing the large amounts of data they collect using distributed computing systems such as Hadoop. In these systems, instead of storing all the data on a single computer, which is infeasible in the big data setting, the data are divided into subsets that are stored across many computers. Such distributed data, however, present a challenge to many statistical methods and algorithms, which assume that all the data are available at one location (e.g., in the memory of a single computer). This thesis explores how statistical analysis can be performed on distributed computing systems using the divide and recombine paradigm, in which analysis is performed on individual subsets of the data, followed by combining the results of these analyses. The thesis demonstrates how existing statistical methods such as coarsening, the EM algorithm, and meta-analysis can be used within the divide and recombine paradigm to perform statistical analysis on big data. In addition, it investigates the use of other statistical methods such as multiple imputation, propensity scoring and penalised splines when working with so-called ‘messy’ data. The thesis shows that the divide and recombine paradigm is more than the simple averaging of results from the subsets. By tailoring existing statistical methods for divide and recombine, and by abstracting the idea of distributed data, this thesis shows that many existing statistical methods remain relevant in today’s big data world.
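To make the divide and recombine paradigm concrete, the sketch below illustrates its simplest, averaging-based form (the baseline the thesis argues the paradigm goes beyond): each subset is analysed independently and only small summaries are recombined. The NumPy-based setup, subset counts, and variable names are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

# Toy divide-and-recombine example: estimate a global mean from
# per-subset summaries, without ever needing the full data in one place.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# "Divide": in a real distributed system (e.g. Hadoop) each subset would
# live on a different machine; here we simply slice the array.
subsets = np.array_split(data, 8)

# Per-subset analysis: each worker returns only (n_j, mean_j),
# a tiny summary rather than its raw data.
summaries = [(len(s), s.mean()) for s in subsets]

# "Recombine": a weighted average of subset means recovers the exact
# global mean, because the mean is a linear statistic.
n_total = sum(n for n, _ in summaries)
combined = sum(n * m for n, m in summaries) / n_total

assert np.isclose(combined, data.mean())
```

For a linear statistic such as the mean this recombination is exact; the methods the thesis studies (coarsening, the EM algorithm, meta-analysis) address the harder cases where naive averaging of subset results is not.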