Big data and cloud computing are two disruptive trends that offer numerous opportunities to the IT industry and research communities while also posing significant challenges. The massive increase in computing power and data storage capacity provided by the cloud, together with advances in big data mining and analytics, has expanded the scope of information available to businesses, governments, and individuals by orders of magnitude. A major obstacle to adopting cloud computing for big data analysis in sectors such as health and business is the privacy risk of releasing data sets to third parties in the cloud. Data sets in these sectors often contain privacy-sensitive personal data, e.g., electronic health records and financial transaction records, yet they can offer significant economic and social benefits if analysed or mined by organisations such as disease research centres. Although some privacy issues are not new, they are aggravated by features of cloud computing such as ubiquitous access and multi-tenancy, and by the three V properties of big data, i.e., Volume, Velocity and Variety. Achieving privacy-preserving big data publishing in cloud computing therefore remains a significant challenge. A widely adopted technique for privacy-preserving data publishing with semantic correctness guarantees is to anonymise data via generalisation, and a number of anonymisation approaches have been proposed. However, most existing approaches are either inherently sequential or distributed without directly optimising for scalability, which makes them unsuitable for data-intensive applications and inapplicable to state-of-the-art parallel and distributed paradigms such as MapReduce.
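To make anonymisation via generalisation concrete, the following is a minimal sketch under stated assumptions, not the approach developed in this thesis. It coarsens two quasi-identifiers (age into fixed-width ranges, postcode into a masked prefix) and then checks the result against k-anonymity. The choice of k-anonymity as the privacy model, the toy records, and the names `generalise`, `smallest_group` and `is_k_anonymous` are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy records: (age, postcode, diagnosis). Age and postcode are
# treated as quasi-identifiers; diagnosis is the sensitive attribute.
RECORDS = [
    (23, "2007", "flu"),
    (27, "2008", "asthma"),
    (25, "2007", "flu"),
    (41, "2150", "diabetes"),
    (46, "2153", "flu"),
    (48, "2150", "asthma"),
]


def generalise(record, age_width=10, postcode_digits=2):
    """Replace exact quasi-identifier values with coarser ones."""
    age, postcode, diagnosis = record
    low = (age // age_width) * age_width
    age_range = f"{low}-{low + age_width - 1}"
    # Keep the leading digits of the postcode and mask the rest.
    masked = postcode[:postcode_digits] + "*" * (len(postcode) - postcode_digits)
    return (age_range, masked, diagnosis)


def smallest_group(records):
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    groups = Counter((age, postcode) for age, postcode, _ in records)
    return min(groups.values())


def is_k_anonymous(records, k):
    """Assumed privacy model: every quasi-identifier group has at least k records."""
    return smallest_group(records) >= k


anonymised = [generalise(r) for r in RECORDS]
print(is_k_anonymous(RECORDS, 2))     # raw records
print(is_k_anonymous(anonymised, 3))  # generalised records
```

The generalised values remain semantically correct (an age of 23 does fall in the range 20-29), which is the property that distinguishes generalisation from perturbation-based techniques. The sketch is sequential; computing the group sizes over a very large data set is the kind of step that raises the scalability concerns described above.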
In this thesis, we mainly investigate the problem of big data anonymisation for privacy preservation from the perspectives of scalability and cost-effectiveness. We exploit the advantages of cloud computing, including on-demand resource provisioning, rapid elasticity and pay-as-you-go pricing, to address the problem with high scalability and cost-effectiveness. Specifically, we examine three major phases in the lifecycle of privacy-preserving data publishing or sharing in cloud environments: data anonymisation, anonymous data update and anonymous data management. Accordingly, we propose a scalable and cost-effective privacy-preserving framework that provides a holistic conceptual foundation for privacy preservation over big data and enables users to realise the full potential of the high scalability, elasticity and cost-effectiveness of the cloud. We develop a corresponding prototype system consisting of a series of solutions to the scalability issues in the three phases. These solutions are based on MapReduce, the current de facto standard paradigm for big data processing, for the sake of high scalability, cost-effectiveness and compatibility with other big data mining and analytical tools. Through extensive experiments on real-world data sets, this thesis demonstrates that our solutions can significantly improve the scalability and cost-effectiveness of big data privacy preservation compared with existing approaches.