Mitigating the Impact of Data Sampling on Social Media Analysis and Mining

Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Computational Social Systems, 2020, 7, (2), pp. 546-555
Issue Date:
2020-04-01
Filename:
09001215.pdf
Description:
Published version
Size:
2.25 MB
Format:
Adobe PDF
© 2014 IEEE. The last decade has witnessed explosive growth in the users and content of online social media. Owing to the unprecedented scale and cascading power of the underlying social networks, social media has created a new paradigm in which any user can share information, broadcast breaking news, and report real-time events from anywhere at any time. Many popular social media sites, including Twitter, provide streaming data services through standard APIs to the broad research and developer communities. Given the sheer volume, rapid velocity, and feature variety of online social media data, these sites often supply only a sampled set of the streaming data rather than the full data set, in order to reduce the cost of computation, storage, and network bandwidth. In light of the substantial impact of sampling on the Twitter data stream, this article explores a combination of spectral clustering, locality-sensitive hashing (LSH), latent Dirichlet allocation (LDA) topic modeling, and differential equation modeling to mitigate the impact of sampling on social media data analysis, in particular on detecting real-world events and predicting information diffusion. Our extensive experiments demonstrate that the proposed method can effectively detect emerging events in real time and accurately predict the cascading patterns of these events from the 1% sampled Twitter data stream. To the best of our knowledge, this article is the first effort to introduce a systematic methodology to study and mitigate the impact of data sampling on social media analysis and mining.
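To make the sampling issue in the abstract concrete, the following is a minimal illustrative sketch, not the article's method. It assumes the 1% stream behaves like independent (Bernoulli) sampling of posts, which may not match how any real API samples. The function names `sample_stream` and `estimate_full_count`, the synthetic stream, and the `#event` label are all hypothetical. It shows how a count observed in the sample can be scaled back by the inverse sampling rate, and why such estimates become noisy for rare events.

```python
import random

def sample_stream(items, rate, seed=0):
    """Keep each item independently with probability `rate` (Bernoulli sampling).

    Assumption: this stands in for a platform's sampled stream; real sampling
    schemes may differ.
    """
    rng = random.Random(seed)
    return [item for item in items if rng.random() < rate]

def estimate_full_count(sampled_count, rate):
    """Scale a count seen in the sample back up by the inverse sampling rate."""
    return sampled_count / rate

# Hypothetical full stream: 200,000 posts, 1 in 20 mentioning an event hashtag.
full_stream = ["#event" if i % 20 == 0 else "other" for i in range(200_000)]
sample = sample_stream(full_stream, rate=0.01)

true_count = full_stream.count("#event")      # 10,000 by construction
sampled_count = sample.count("#event")        # roughly 100, varies with the seed
estimate = estimate_full_count(sampled_count, rate=0.01)

print(f"true={true_count} sampled={sampled_count} estimate={estimate:.0f}")
```

With only about a hundred matching posts surviving the sample, the scaled estimate can be off by a noticeable margin, and an event with just a few dozen posts in the full stream may not appear in the sample at all. That loss is the kind of sampling impact the article sets out to mitigate.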