Big Data: The End of Sampling As We Know It?

Feb 12, 2014 - 4:15 PM

Date:	Wednesday, February 12
Time:	4:10 pm -- 5:00 pm
Place:	Snedecor 3105
Speaker:	Lily Wang, Department of Statistics, University of Georgia, Athens

Abstract:

Each day in our lives we are breathing the air of digital data. Nowadays, “Big Data” is seemingly generated at all times by everything around us. It arrives with the well-known four V’s: alarming Velocity, ever-expanding Volume, multi-source Variety, and inhomogeneous Veracity. Traditionally, scientists have used sampling to draw inferences from data obtained from large populations. However, “Big Data” introduces the possibility that one can obtain exact results for an entire population. Will this eliminate the need for sampling? Is sampling an artifact of past best practices? In this talk we will address the above questions, and compare results obtained from analyses applied to populations and samples thereof.

“Big Data” come to us with great promise, as they can enhance and improve sample estimates by providing a huge number of auxiliary variables correlated with our primary variables of interest. We have entered an era where data collection is cheap, but extracting useful information from such data is not. In this talk, we introduce a general strategy for variable selection from large data sets under various sampling designs. A survey-weighted penalized estimating equation approach is proposed to simultaneously select significant variables and estimate model coefficients. The proposed estimators are design-consistent and perform as well as the oracle procedure, that is, as well as if the correct sub-model were known in advance. A fast and efficient variable selection algorithm is developed to identify significant variables for complex longitudinal surveys. Examples will illustrate the usefulness of the proposed methodology under various model settings and sampling designs.
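
To give a rough feel for what survey-weighted penalized selection can look like, the sketch below fits a linear model by coordinate descent, with each observation's squared error multiplied by its survey weight. It is only an illustration under stated assumptions and not the speaker's method: the abstract does not name the penalty or the estimating equations, so a lasso (L1) penalty on weighted least squares stands in for them. The function names, the simulated data, the weights, and the tuning value lam=0.2 are all hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding: the closed-form lasso update for a single coordinate."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def weighted_lasso(X, y, w, lam, n_iter=200):
    """Coordinate descent for
        (1 / (2W)) * sum_i w_i * (y_i - x_i'b)^2 + lam * sum_j |b_j|,
    where w_i are survey weights and W = sum_i w_i.
    Assumption: the lasso penalty is a stand-in chosen for this sketch."""
    n, p = len(X), len(X[0])
    W = sum(w)
    beta = [0.0] * p
    resid = list(y)  # residuals y - X*beta; beta starts at zero
    for _ in range(n_iter):
        for j in range(p):
            # Weighted second moment of column j.
            z_j = sum(w[i] * X[i][j] ** 2 for i in range(n)) / W
            if z_j == 0.0:
                continue
            # Weighted covariance of column j with the partial residual.
            rho_j = sum(
                w[i] * X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)
            ) / W
            new_b = soft_threshold(rho_j, lam) / z_j
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


def simulate(n=200, seed=0):
    """Hypothetical data: only the first of three covariates affects y.
    The weights are drawn at random purely for illustration."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [2.0 * row[0] + rng.gauss(0, 0.1) for row in X]
    w = [rng.uniform(1, 5) for _ in range(n)]
    return X, y, w


X, y, w = simulate()
beta = weighted_lasso(X, y, w, lam=0.2)
print(beta)
```

In this toy setting the penalty shrinks the coefficients of the two irrelevant covariates to exactly zero while keeping the first one, which is the "select and estimate simultaneously" behaviour the abstract describes. A design-based analysis would use weights derived from the actual sampling design (for example, inverse inclusion probabilities) rather than random ones.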