Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud

Publication: Handbook of Data Intensive Computing, 2011, 1st edition, pp. 129–153
Scientific applications are usually data intensive [1, 2], and the datasets they generate are often terabytes or even petabytes in size. As reported by Szalay and Gray in [3], science is in an exponential world: the amount of scientific data will double every year over the next decade and beyond. Producing scientific datasets involves a large number of computation-intensive tasks, e.g., orchestrated as scientific workflows [4], and hence takes a long time to execute. The generated datasets contain important intermediate or final results of the computation and need to be stored as valuable resources, for two reasons: (1) data can be reused: scientists may need to re-analyze the results or apply new analyses to the existing datasets [5]; (2) data can be shared for collaboration: computation results may be shared, so the datasets are used by scientists from different institutions [6]. Storing valuable generated application datasets saves their regeneration cost when they are reused, not to mention the waiting time caused by regeneration. However, the large size of scientific datasets is a major challenge for their storage.
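The storage-versus-regeneration trade-off can be made concrete with a back-of-the-envelope cost comparison. The sketch below is illustrative only and is not taken from the chapter: all prices, dataset sizes, and compute times are hypothetical placeholders, and real cloud pricing is more nuanced (tiered storage, data transfer fees, spot pricing).

```python
# Illustrative sketch (hypothetical prices): compare the cumulative cost of
# keeping a generated dataset in cloud storage against the one-off cost of
# regenerating it on demand when it is next needed.

def storage_cost(size_gb: float, months: float,
                 price_per_gb_month: float = 0.023) -> float:
    """Cumulative cost of storing a dataset for a given number of months."""
    return size_gb * months * price_per_gb_month

def regeneration_cost(cpu_hours: float,
                      price_per_cpu_hour: float = 0.10) -> float:
    """One-off compute cost of re-running the tasks that produced the dataset."""
    return cpu_hours * price_per_cpu_hour

def cheaper_to_store(size_gb: float, months: float, cpu_hours: float,
                     expected_reuses: int = 1) -> bool:
    """True if storing beats regenerating for the expected number of reuses."""
    return storage_cost(size_gb, months) < expected_reuses * regeneration_cost(cpu_hours)

# Example: a 1 TB intermediate dataset that took 5,000 CPU-hours to produce,
# expected to be reused once within the next 12 months.
print(storage_cost(1024, 12))        # ~282.62 (storage for a year)
print(regeneration_cost(5000))       # 500.0   (one regeneration)
print(cheaper_to_store(1024, 12, 5000))  # True: storing is cheaper here
```

Under these assumed prices, storing is cheaper; with a rarely reused dataset or a cheap-to-rerun workflow, the balance flips, which is precisely the trade-off the chapter examines.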