Sufficiency Revisited: Rethinking Statistical Algorithms in the Big Data Era

Lee, JYL; Brown, JJ; Ryan, LM

Publication Type: Journal Article
Citation: American Statistician, 2017, 71 (3), pp. 202 - 208
Issue Date: 2017-07-03

Closed Access

	Filename	Description	Size	Format
	6_14_2018_Sufficienc.pdf	Published Version	968.62 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Lee, JYL	en_US
dc.contributor.author	Brown, JJ https://orcid.org/0000-0002-7535-2874	en_US
dc.contributor.author	Ryan, LM https://orcid.org/0000-0001-5957-2490	en_US
dc.date.issued	2017-07-03	en_US
dc.identifier.citation	American Statistician, 2017, 71 (3), pp. 202 - 208	en_US
dc.identifier.issn	0003-1305	en_US
dc.identifier.uri	http://hdl.handle.net/10453/125284
dc.description.abstract	© 2017 American Statistical Association. The big data era demands new statistical analysis paradigms, since traditional methods often break down when datasets are too large to fit on a single desktop computer. Divide and Recombine (D&R) is becoming a popular approach for big data analysis, where results are combined over subanalyses performed in separate data subsets. In this article, we consider situations where unit record data cannot be made available by data custodians due to privacy concerns, and explore the concept of statistical sufficiency and summary statistics for model fitting. The resulting approach represents a type of D&R strategy, which we refer to as summary statistics D&R; as opposed to the standard approach, which we refer to as horizontal D&R. We demonstrate the concept via an extended Gamma–Poisson model, where summary statistics are extracted from different databases and incorporated directly into the fitting algorithm without having to combine unit record data. By exploiting the natural hierarchy of data, our approach has major benefits in terms of privacy protection. Incorporating the proposed modelling framework into data extraction tools such as TableBuilder by the Australian Bureau of Statistics allows for potential analysis at a finer geographical level, which we illustrate with a multilevel analysis of the Australian unemployment data. Supplementary materials for this article are available online.	en_US
dc.relation.ispartof	American Statistician	en_US
dc.relation.isbasedon	10.1080/00031305.2016.1255659	en_US
dc.subject.classification	Statistics & Probability	en_US
dc.title	Sufficiency Revisited: Rethinking Statistical Algorithms in the Big Data Era	en_US
dc.type	Journal Article
utslib.citation.volume	71	en_US
utslib.for	0104 Statistics	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Mathematical and Physical Sciences
utslib.copyright.status	closed_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	71	en_US
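
The idea described in the abstract, fitting a model from summary statistics rather than from unit record data, can be illustrated with the textbook conjugate Gamma–Poisson model. The sketch below is not the authors' extended model or fitting algorithm. It assumes counts y_i ~ Poisson(lambda * exposure_i) with a Gamma(shape, rate) prior on lambda, for which the total count and total exposure are sufficient. All names (Summary, summarise, combine, gamma_posterior) and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Summary:
    """Sufficient statistics for a Poisson rate: total count and total exposure."""
    total_count: int
    total_exposure: float


def summarise(counts: Sequence[int], exposures: Sequence[float]) -> Summary:
    """What a data custodian could release in place of unit records (assumed setup)."""
    return Summary(sum(counts), float(sum(exposures)))


def combine(summaries: Iterable[Summary]) -> Summary:
    """Recombine step: sufficient statistics from separate subsets simply add."""
    total_count = 0
    total_exposure = 0.0
    for s in summaries:
        total_count += s.total_count
        total_exposure += s.total_exposure
    return Summary(total_count, total_exposure)


def gamma_posterior(prior_shape: float, prior_rate: float,
                    summary: Summary) -> Tuple[float, float]:
    """Conjugate update: Gamma(a, b) prior gives Gamma(a + sum y, b + sum exposure)."""
    return (prior_shape + summary.total_count,
            prior_rate + summary.total_exposure)


if __name__ == "__main__":
    # Two hypothetical databases that never share their unit records.
    db_a = summarise([3, 0, 2], [10.0, 8.0, 12.0])
    db_b = summarise([5, 1, 4], [20.0, 5.0, 15.0])

    shape, rate = gamma_posterior(2.0, 1.0, combine([db_a, db_b]))
    print(f"posterior: Gamma(shape={shape}, rate={rate}), mean={shape / rate:.4f}")
```

Because the posterior depends on the data only through the two totals, the combined summaries give the same posterior as pooling the unit records would, which is the sense in which sufficiency supports this kind of divide-and-recombine analysis.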

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/125284