Noise Reduction in User Generated Datasets

By Louis Gutierrez
Advisors: Mukkai Krishnamoorthy and Ron Eglash
June 26, 2014

The purpose of this research is to address the issue of noise and its propagation in massively large datasets. We claim that larger datasets attract a disproportionate share of noise. Intuitively, this is because the sources of noise (malicious contributors, machine errors, observer bias, and groupthink) are generated, compelled, or incentivized by sheer numbers to contribute with higher frequency to larger datasets. We demonstrate this empirically by analyzing large user-generated datasets from Stack Exchange, Yelp, and Amazon, as well as machine-generated datasets from the National Energy Technology Laboratory (NETL). Across all of these datasets we find one unifying property in addition to exceptionally high levels of noise: the noise distribution is non-Gaussian. Moreover, the noise, which is characterized by an initial surge, a quick decline, and then a slow (slower than exponential) descent towards zero, strictly follows an Inverse Gaussian Distribution.
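For readers unfamiliar with the shape described above, the following is a minimal sketch of the standard Inverse Gaussian density with mean mu and shape lam. The function names and the parameter values (mu = 1, lam = 1) are assumptions chosen for illustration only; they are not taken from the thesis.

```python
import math


def inverse_gaussian_pdf(x, mu, lam):
    """Standard Inverse Gaussian density:
    f(x) = sqrt(lam / (2*pi*x^3)) * exp(-lam * (x - mu)^2 / (2 * mu^2 * x))
    for x > 0, and 0 otherwise."""
    if x <= 0:
        return 0.0
    scale = math.sqrt(lam / (2 * math.pi * x ** 3))
    return scale * math.exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x))


def inverse_gaussian_mode(mu, lam):
    """Location of the density's single peak (closed form)."""
    return mu * (math.sqrt(1 + 9 * mu ** 2 / (4 * lam ** 2)) - 3 * mu / (2 * lam))


# Illustrative parameters only (assumed, not from the thesis).
MU, LAM = 1.0, 1.0
peak = inverse_gaussian_mode(MU, LAM)

# Evaluate the density before the peak, at the peak, and along the tail.
for x in (0.05, peak, 1.0, 3.0, 6.0):
    print(f"x = {x:5.2f}   density = {inverse_gaussian_pdf(x, MU, LAM):.4f}")
```

Printing the density at a few points shows the sharp rise to a single early peak followed by a long right tail, which is the qualitative profile the abstract attributes to noise.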

The effect of modeling data noise using the Inverse Gaussian Distribution is twofold. First, statistical methods for mining, measuring, and analyzing data that assume a Normal (Gaussian) distribution become less effective, given that the noise distribution is non-Gaussian. Second, new statistical methods predicated on the Inverse Gaussian Distribution can be devised to mine, measure, and analyze large user-generated datasets with greater statistical integrity. Specifically, we demonstrate that predictive models can be developed by modeling the historical performance of an evolving dataset on the Inverse Gaussian Distribution, and that these models can be used to pre-process the dataset for noise.
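As one illustration of what modeling observations on the Inverse Gaussian Distribution can involve, the sketch below fits the distribution's two parameters using the textbook closed-form maximum likelihood estimates. This is not the thesis's procedure, which the abstract does not specify; the function name and the sample values are hypothetical.

```python
def fit_inverse_gaussian(samples):
    """Closed-form maximum likelihood estimates for the Inverse Gaussian:
    mu_hat = sample mean
    1 / lam_hat = mean(1 / x_i - 1 / mu_hat)
    """
    n = len(samples)
    if n == 0 or any(x <= 0 for x in samples):
        raise ValueError("samples must be non-empty and strictly positive")
    mu_hat = sum(samples) / n
    inv_lam = sum(1.0 / x - 1.0 / mu_hat for x in samples) / n
    if inv_lam <= 0:
        # Happens only when every sample is identical.
        raise ValueError("shape parameter is unbounded for identical samples")
    return mu_hat, 1.0 / inv_lam


# Hypothetical positive measurements (illustrative values, not thesis data).
noise_levels = [1.0, 2.0, 4.0]
mu_hat, lam_hat = fit_inverse_gaussian(noise_levels)
print(f"mu_hat = {mu_hat:.4f}, lam_hat = {lam_hat:.4f}")
```

Once fitted, the parameters describe the expected noise profile, and new observations could be compared against it. How that comparison drives pre-processing is left open here.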

The findings of this research have implications for the fields of data mining, analysis, and prediction. When employing complex statistical methods, a developer or statistician works on the assumption that the data being analyzed is normally distributed; if that is not the case, the results can be misleading. Because extremely large datasets are impractical to analyze thoroughly, the resulting statistical error will likely go unnoticed. Moreover, the results of this research suggest that noise leans towards the adversarial case for statistical methods that assume normality, and thus could potentially produce the largest possible margin of error. By modeling noise after the Inverse Gaussian Distribution, statistical methods can be modified to work optimally, with a greater focus on the signal and a minimization of noise.
