http://developeriq.in/articles/2014/feb/24/statistical-processing-using-apache-commons-math/
Statistical Processing using Apache Commons Math
Posted On February 24, 2014 by Ganesh P S filed under Enterprise
Dr.V.Nagaradjane
Statistics is an essential field of study for everyone, from high school students to seasoned researchers. Expertise in statistical processing is a desirable strength for employees of any organization, since most fields of work involve data processing. This article explains the use of the open source library Apache Commons Math, which provides a collection of functions for basic statistical processing.
INTRODUCTION
Wherever data processing is involved, statistical tools are the natural means of extracting essential information from the data points at hand. They are applied in medical research, which tries to relate diseases to their probable causes, in economics, where parametric indicators are used to predict market movements (although with very little success), and in many other fields of industry and research.
Many tools are available for statistical processing, such as MS Excel, OpenOffice Calc, MATLAB, SCILAB, R and STATA. However, they are standalone software packages, not libraries that can be called from one's own code. A handy library for statistical processing suits a programmer who wishes to solve the problem at hand with the most appropriate user interface and tools.
The statistical library of Apache Commons can be integrated into any software package that requires reliable mathematical processing. It is distributed under the Apache License, which permits inclusion of the library in any finished software product (including commercial ones) with few conditions. Hence, one can incorporate the library in a software package and redistribute it without infringing the rights of anyone else.
Apache Commons is a project containing an assorted collection of libraries for several programming tasks. The collection includes support for executing external programs, file compression, Java beans manipulation, Unix daemons, mailing, event logging, and a Virtual File System (VFS) that gives uniform access to local files, network URLs and zip archives.
Apache Commons Math is capable of solving mathematical problems in linear algebra, genetic algorithms, differential equations, data generation, special functions, complex numbers, distributions, 3D geometry, optimization and data filtering. This article explains the use of the statistical part of the Apache Commons Math library for simple day-to-day applications.
DOWNLOADING AND SETTING UP
The Apache Commons Math package can be downloaded from the Apache Commons site [1]. The package is in compressed format. Unzip the tar.gz or zip archive at a suitable location. The extracted folder contains a jar file named commons-mathX-Y.Z.jar, where X, Y and Z denote the version, subversion and release numbers.
To start using the Apache Commons Math library, add the jar file, with its complete path, to the CLASSPATH environment variable. On Windows, right click on My Computer, choose Properties, select the Advanced tab and click on the Environment Variables button to set the variable. On Linux, open the file /etc/profile as root user (or using sudo) and append the line export CLASSPATH=$CLASSPATH:/my/path/commons-math3-3.2.jar at the end of the file.
The path and the jar file name should correspond to the installation details on the computer. After setting up the environment variable, one can import the classes of the math library in Java programs and compile the source code into byte code without any difficulty.
BASIC FEATURES OF APACHE STATISTICS LIBRARY
The statistics library of Apache Commons provides packages for carrying out several statistical calculations. Table 1 shows the packages and their statistical specialization.
Table 1 Apache Commons Math Statistical Packages
Sl. No. | Package Name | Description |
1 | org.apache.commons.math3.stat | Provides the classes Frequency for frequency analysis and StatUtils with static methods for calculating mean, variance, minimum, maximum etc. |
2 | org.apache.commons.math3.stat.descriptive | Provides the DescriptiveStatistics class for calculating statistical parameters (mean, S.D., etc.). This class stores input data in memory for further processing. Also provides the SummaryStatistics class for calculating statistical parameters (mean, S.D., etc.). This class does not store input data in memory. |
3 | org.apache.commons.math3.stat.correlation | Provides the classes Covariance, PearsonsCorrelation and SpearmansCorrelation to calculate covariance and coefficient of correlation. |
4 | org.apache.commons.math3.stat.inference | Provides classes for carrying out statistical tests on data points. It provides the classes ChiSquareTest, GTest, MannWhitneyUTest, OneWayAnova, TTest and WilcoxonSignedRankTest. |
5 | org.apache.commons.math3.stat.regression | Provides the classes SimpleRegression (for univariate regression), OLSMultipleLinearRegression (for multiple linear regression using the Ordinary Least Squares method), GLSMultipleLinearRegression (for multiple linear regression using the Generalized Least Squares method) and MillerUpdatingRegression (for multiple regression using the updating procedure described by Alan J. Miller, which supplements the routines of Gentleman). |
BASIC STATISTICS – MEAN, MODE, MEDIAN, SD
The class DescriptiveStatistics calculates the minimum, maximum, mean, geometric mean, item count, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis and median. It contains an appropriately named method for each of these parameters.
An alternative class named SummaryStatistics calculates the minimum, maximum, mean, geometric mean, number of items, sum, sum of squares, standard deviation and variance. SummaryStatistics processes its data on the fly. It does not store the input values it has already processed, which makes it useful for single-pass calculations on frequently varying input data.
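To make the idea of a single-pass calculation concrete, here is a minimal plain-Java sketch of a running mean and variance using Welford's update. It is only an illustration of how statistics can be maintained without storing the input; the class name RunningStats is hypothetical, and the sketch does not claim to reproduce how SummaryStatistics is implemented internally. The sample values are the ones used later in the article for Fig. 1.

```java
/** Single-pass mean/variance accumulator (Welford's method); stores no input values. */
class RunningStats {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the current mean

    /** Folds one more value into the running statistics. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long count() { return n; }

    double mean() { return mean; }

    /** Sample variance (divides by n - 1); NaN when fewer than two values were added. */
    double variance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }

    double standardDeviation() { return Math.sqrt(variance()); }

    public static void main(String[] args) {
        RunningStats rs = new RunningStats();
        for (double v : new double[] {25, 34, 55, 56, 80, 95, 45, 82}) {
            rs.add(v); // each value is used once and then forgotten
        }
        System.out.printf("n=%d mean=%.4f sd=%.4f%n",
                rs.count(), rs.mean(), rs.standardDeviation());
    }
}
```

Because only three numbers are kept, quantities that need the full data set (median, percentiles) cannot be obtained this way, which is the trade-off between the two classes described above.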
The following program calculates the statistical results using DescriptiveStatistics class. The code imports org.apache.commons.math3.stat.descriptive package. The program expects input values at the top text area. On pressing the button appropriate to the required statistical parameter, results are displayed in the text area at bottom.
//Descriptive.java
The statistical calculations are done without much difficulty using the instance of DescriptiveStatistics class (created on line 07). The response to the pressing of buttons is processed in the actionPerformed method (lines 40 to 73). The program simply calls the appropriate method on the DescriptiveStatistics instance (ds) and the results are set on the result area.
It may be noted that most of the code in the listing relates to the creation and operation of the graphical user interface. All statistical calculations are taken care of by the Apache Commons library methods. Fig. 1 shows a sample run of the GUI, calculating the standard deviation of the numbers 25, 34, 55, 56, 80, 95, 45, 82.
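As a cross-check on the sample numbers, the following plain-Java sketch works out the mean, the sample standard deviation and the median by hand. It does not use the library; the class DescriptiveSketch and its methods are hypothetical names chosen for illustration. The sketch assumes the sample form of the standard deviation (n - 1 in the denominator), which is worth confirming against the Javadoc of DescriptiveStatistics before comparing results.

```java
import java.util.Arrays;

/** Hand-written versions of three descriptive statistics, for cross-checking. */
class DescriptiveSketch {
    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double sampleSd(double[] v) {
        double m = mean(v);
        double ss = 0;
        for (double d : v) {
            ss += (d - m) * (d - m);
        }
        return Math.sqrt(ss / (v.length - 1));
    }

    /** Median: middle value of a sorted copy, or the average of the two middle values. */
    static double median(double[] v) {
        double[] s = v.clone(); // sort a copy so the caller's array is untouched
        Arrays.sort(s);
        int mid = s.length / 2;
        return s.length % 2 == 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2.0;
    }

    public static void main(String[] args) {
        double[] data = {25, 34, 55, 56, 80, 95, 45, 82};
        System.out.printf("mean=%.4f sd=%.4f median=%.1f%n",
                mean(data), sampleSd(data), median(data));
    }
}
```

The median needs the whole data set in sorted order, which is why this kind of statistic requires a class that keeps its input in memory.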
CALCULATING COEFFICIENT OF CORRELATION
We notice certain phenomena (the amount of manure and the harvest, temperature and the sale of ice creams, the duration of power cuts and the sale of inverters, temperature and the time required for drying clothes) where we know that some relation exists but cannot say how strong it is. Under such circumstances, correlation provides a scientific means of ascertaining the existence of a relationship between given pairs of data.
Correlation also measures the strength of relation between given pairs (X and Y) of data. When a set of input parameters (X) and the results (Y) are taken up for examination of a possible relationship, coefficient of correlation helps to measure the relationship quantitatively.
The coefficient of correlation ranges from -1 to +1. If the coefficient is 0, there is no linear relationship between the pairs of data. If the coefficient is +1, it denotes a perfect positive relationship: the Y value increases with increase in the X value (like temperature and the sale of ice creams). If the coefficient is -1, it denotes a perfect negative relationship: the Y value decreases when the X value increases (like temperature and the time required for drying clothes).
In general, the sign of the coefficient indicates whether the relation between the pairs of data is positive or negative, and the relation is stronger as the absolute value of the coefficient approaches 1.
The data used for calculating the coefficient of correlation relates the sale of ice creams in an ice cream shop to the maximum temperature of the day, as shown in Table 2.
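For reference, the usual definition of Pearson's coefficient of correlation for n pairs of values can be written as follows. The symbols (x-bar and y-bar for the two means) are notation introduced here, not taken from the article.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The numerator is positive when X and Y tend to lie on the same side of their means and negative when they lie on opposite sides, which is where the sign of the coefficient comes from.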
Table 2 Daily Maximum Temperature and Sale of Ice Creams
Day number | Maximum temperature (°F) | Sale of ice creams (number) |
1 | 96 | 386 |
2 | 98 | 493 |
3 | 95 | 390 |
4 | 99 | 568 |
5 | 103 | 624 |
6 | 105 | 630 |
7 | 102 | 612 |
The following program calculates the coefficient of correlation for the data shown above. The data needs to be organized in the form of two arrays (x[] and y[]) of double type. The program first calculates the most common coefficient of correlation, called Pearson's coefficient, using the PearsonsCorrelation class. Then the rank coefficient of correlation based on Spearman's method is calculated using the SpearmansCorrelation class (a rank is the position of a value when the x or y values are sorted in ascending order). Both coefficients are obtained by calling the correlation method with the x and y arrays as arguments. Fig. 2 shows the results obtained by running the program.
//TestOfCorrelation.java
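The arithmetic behind the two coefficients can be sketched in plain Java for the Table 2 data. This is not the article's program and does not call the library; CorrelationSketch and its methods are hypothetical names. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks, with tied values sharing the average rank, which is one common convention and an assumption of this sketch.

```java
/** Pearson and Spearman correlation written out by hand, for illustration. */
class CorrelationSketch {
    /** Pearson's r: sum of cross-deviations over the product of the two spreads. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Ascending ranks starting at 1; tied values share the average rank. */
    static double[] ranks(double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int smaller = 0;
            int equal = 0; // counts v[i] itself as well
            for (double w : v) {
                if (w < v[i]) {
                    smaller++;
                } else if (w == v[i]) {
                    equal++;
                }
            }
            r[i] = smaller + (equal + 1) / 2.0;
        }
        return r;
    }

    /** Spearman's rank correlation: Pearson's r of the rank vectors. */
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] temperature = {96, 98, 95, 99, 103, 105, 102};
        double[] sales = {386, 493, 390, 568, 624, 630, 612};
        System.out.printf("Pearson  = %.4f%n", pearson(temperature, sales));
        System.out.printf("Spearman = %.4f%n", spearman(temperature, sales));
    }
}
```

Both values come out close to +1 for this data, consistent with the strong positive relation between temperature and ice cream sales described in the text.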
REGRESSION WITH SINGLE VARIABLE
Regression means fitting a mathematical relation to the data at hand, so that the result for any input value can be calculated using the fitted equation. The principle behind linear regression is to choose the line for which the sum of squared errors between the actual results and the calculated results is minimum. If the relation is calculated from a single set of X values and a corresponding set of Y values, it is called single variable regression or univariate regression.
Regression generalizes the points available at hand in the form of a function. Normally, a linear relationship is assumed. Since a straight line of the general form y = slope * x + intercept is fitted to the given data, regression amounts to calculating the slope and the y-axis intercept.
The regression coefficients (slope and y-axis intercept) for the data shown in Table 2 can be calculated using the program listed below. Fig. 3 shows the result obtained by running the program.
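For a straight line fitted by least squares, the slope and intercept have a standard closed form, shown below. The symbols b (slope) and a (intercept) are notation chosen here for illustration.

```latex
\min_{a,\,b}\ \sum_{i=1}^{n} \bigl(y_i - (a + b\,x_i)\bigr)^2
\quad\Longrightarrow\quad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```

The second relation says that the fitted line always passes through the point formed by the two means.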
//TestOfSimpleRegression.java
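The slope and intercept for the Table 2 data can also be worked out in a few lines of plain Java, as sketched below. The library's SimpleRegression class reports the same two quantities; this hand-written version (with the hypothetical class name SimpleRegressionSketch) is only meant to show the arithmetic behind them and is not the article's listing.

```java
/** Least-squares straight line through (x, y) pairs, written out by hand. */
class SimpleRegressionSketch {
    /** Returns {slope, intercept} of the line y = slope * x + intercept. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double slope = sxy / sxx;
        double intercept = my - slope * mx; // the line passes through the two means
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        double[] temperature = {96, 98, 95, 99, 103, 105, 102};
        double[] sales = {386, 493, 390, 568, 624, 630, 612};
        double[] line = fit(temperature, sales);
        System.out.printf("sales = %.4f * temperature + (%.4f)%n", line[0], line[1]);
        // Predicted sales for a 100 degree day, using the fitted line
        System.out.printf("prediction at 100 = %.1f%n", line[0] * 100 + line[1]);
    }
}
```

The large negative intercept has no physical meaning; it only reflects that the observed temperatures lie far from zero, so the line should be used within or near the observed range.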
REGRESSION WITH MULTIPLE VARIABLES
When the result (Y) depends on more than one input value (X1, X2, ...), the regression carried out on such data sets is called multiple regression or multivariate regression. Let us suppose that there is a relationship between the productivity of an individual, the rate (price) of the mobile phone or tablet owned by him or her, and the monthly data consumption on the device. Table 3 lists fictitious data on these two inputs together with the productivity of the owner of the device, as a percentage of full productivity.
Table 3 Fictitious Data on Ownership of Device and Productivity of Owner
Sl. No. | Rate of device (mobile/tablet), in rupees | Data usage (MB per month) | Productivity of its owner |
1 | 3,000 | 100 | 100% |
2 | 6,000 | 512 | 85% |
3 | 10,000 | 1024 | 70% |
4 | 10,000 | 2048 | 67% |
5 | 20,000 | 2048 | 65% |
6 | 20,000 | 3072 | 60% |
7 | 30,000 | 1024 | 66% |
8 | 30,000 | 2048 | 60% |
9 | 30,000 | 3072 | 55% |
10 | 45,000 | 1024 | 70% |
11 | 45,000 | 2048 | 65% |
12 | 45,000 | 3072 | 60% |
13 | 45,000 | 4098 | 55% |
14 | 60,000 | 1024 | 67% |
15 | 60,000 | 2048 | 60% |
16 | 60,000 | 3072 | 50% |
17 | 60,000 | 4098 | 40% |
The data is completely baseless and only pokes fun at the productivity of smart phone owners. It portrays the productivity of an average individual who owns a smart phone, compared to that of a person who owns a basic, non-smart phone. Productivity here means the output of work which can immediately help the owner of the device to earn money or fulfill the task assigned by his or her employer.
The Apache Commons Math library provides two classes for carrying out multiple regression, namely OLSMultipleLinearRegression and GLSMultipleLinearRegression. OLS stands for Ordinary Least Squares and GLS for Generalized Least Squares. Both methods aim to minimize the sum of squared errors between the actual results and the regression predictions; GLS additionally takes the covariance of the errors into account. The following example uses OLSMultipleLinearRegression for estimating the regression parameters.
//TestOfMultipleRegression.java
The predictions obtained from the fitted regression equation are displayed in Fig. 4. Although regression introduces some error into the predictions, it creates a generalized version of the data available at hand. The regression equation can therefore be applied even to input values that are not present in the data.
If a person buys a mobile phone for 42,000 rupees and subscribes to a monthly data plan of 2500 MB, the equation predicts the resulting productivity at 59.61%. The original data set has no entry for this pair of rate and data usage. Hence, the regression equation helps us to make predictions for points whose results are not known through observation.
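The idea of ordinary least squares with two inputs can be sketched in plain Java by building and solving the normal equations for the Table 3 data. This is an illustrative sketch, not the article's listing and not the library's implementation: OlsSketch, fit and predict are hypothetical names, and solving the normal equations by Gaussian elimination is a simple textbook route chosen for brevity, which can lose some accuracy on badly scaled inputs. The prediction it prints for 42,000 rupees and 2500 MB can be compared with the figure quoted above.

```java
/** Ordinary least squares with an intercept, solved through the normal equations. */
class OlsSketch {
    /** Returns {b0, b1, ..., bk} for the model y = b0 + b1*x1 + ... + bk*xk. */
    static double[] fit(double[][] x, double[] y) {
        int n = y.length;
        int p = x[0].length + 1; // +1 for the intercept column of ones
        double[][] a = new double[p][p + 1]; // augmented matrix [X'X | X'y]
        for (int i = 0; i < n; i++) {
            double[] row = new double[p];
            row[0] = 1.0;
            System.arraycopy(x[i], 0, row, 1, p - 1);
            for (int r = 0; r < p; r++) {
                for (int c = 0; c < p; c++) {
                    a[r][c] += row[r] * row[c];
                }
                a[r][p] += row[r] * y[i];
            }
        }
        // Forward elimination with partial pivoting
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        // Back substitution
        double[] b = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            double s = a[r][p];
            for (int c = r + 1; c < p; c++) {
                s -= a[r][c] * b[c];
            }
            b[r] = s / a[r][r];
        }
        return b;
    }

    /** Evaluates the fitted equation for one set of input values. */
    static double predict(double[] b, double... x) {
        double yhat = b[0];
        for (int i = 0; i < x.length; i++) {
            yhat += b[i + 1] * x[i];
        }
        return yhat;
    }

    public static void main(String[] args) {
        // Table 3: {rate of device in rupees, data usage in MB per month}
        double[][] x = {
            {3000, 100}, {6000, 512}, {10000, 1024}, {10000, 2048}, {20000, 2048},
            {20000, 3072}, {30000, 1024}, {30000, 2048}, {30000, 3072}, {45000, 1024},
            {45000, 2048}, {45000, 3072}, {45000, 4098}, {60000, 1024}, {60000, 2048},
            {60000, 3072}, {60000, 4098}
        };
        // Table 3: productivity of the owner, in percent
        double[] y = {100, 85, 70, 67, 65, 60, 66, 60, 55, 70, 65, 60, 55, 67, 60, 50, 40};
        double[] b = fit(x, y);
        System.out.printf("productivity = %.4f + (%.6f) * rate + (%.6f) * dataMB%n",
                b[0], b[1], b[2]);
        System.out.printf("prediction for 42000 rupees, 2500 MB: %.2f%%%n",
                predict(b, 42000, 2500));
    }
}
```

In practice the library class is the better choice, since it handles numerical details that this sketch ignores; the sketch only shows that the regression parameters are the solution of a small linear system built from the data.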
CONCLUSION
The need for statistical calculations is widespread in the programming world. Correlation and regression are applied in generating mathematical models (equations) for experimental results. Apache Commons Math provides a well-seasoned library for statistical calculations. We need only provide the data, and the calculations are carried out without the burden of coding each operation.
Apache Commons Math has much more to offer a programmer than statistics, since it covers several areas of mathematics. I wish you more than AVERAGE productivity with the Apache Commons Math library.
REFERENCES
1. | http://commons.apache.org/proper/commons-math/download_math.cgi |
2. | http://commons.apache.org/proper/commons-math/userguide/stat.html |
About Author
V. Nagaradjane is a freelance programmer. He may be contacted at nagaradjanev@rediffmail.com