Statistical Characterization, Pattern Identification, and Analysis of Big Data 2017-01-0236
In the Big Data era, capability in statistical and probabilistic data characterization, data pattern identification, data modeling, and analysis is critical for understanding the data, finding trends in it, and making better use of it. This paper first describes the fundamental probability concepts and several commonly used probability distribution functions, such as the Weibull for spectrum events and the Pareto for extreme/rare events. An event quadrant is then established based on the commonality/rarity and the impact/effect of probabilistic events. Level of measurement, which is key to quantitative measurement of the data, is also discussed within the framework of probability. The damage density function, a measure of the relative damage contribution of each constituent, is proposed. The new measure is shown to distinguish between extreme/rare events and spectrum events. Several case studies, covering vehicle reliability, vehicle road test scores, warranty, the salary distribution of an institution, the city population distribution in three countries, and the earthquake distribution worldwide and in the USA, are provided to demonstrate the role of statistical and probabilistic approaches in the characterization and analysis of big data.
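The contrast drawn above between the Weibull (spectrum events) and the Pareto (extreme/rare events) can be illustrated with a minimal sketch that compares their tail probabilities. It uses the standard textbook survival functions of the two-parameter Weibull and the Type I Pareto distributions. The function names and all parameter values are assumptions chosen for illustration and are not taken from the paper, and the sketch does not attempt to reproduce the paper's damage density function.

```python
import math


def weibull_sf(x: float, shape: float, scale: float) -> float:
    """Survival function P(X > x) of a two-parameter Weibull distribution."""
    if x <= 0:
        return 1.0
    return math.exp(-((x / scale) ** shape))


def pareto_sf(x: float, alpha: float, x_min: float) -> float:
    """Survival function P(X > x) of a Pareto (Type I) distribution."""
    if x <= x_min:
        return 1.0
    return (x_min / x) ** alpha


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not taken from the paper).
    for x in (2.0, 5.0, 10.0):
        w = weibull_sf(x, shape=2.0, scale=1.0)
        p = pareto_sf(x, alpha=2.0, x_min=1.0)
        print(f"x={x:5.1f}  Weibull tail={w:.3e}  Pareto tail={p:.3e}")
```

With these assumed parameters the Weibull tail decays exponentially while the Pareto tail decays only as a power law, so the probability of very large values stays far higher under the Pareto. This is the usual reason a power-law model is chosen for rare, high-impact events.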