Statistical Processing using Apache Commons Math

http://developeriq.in/articles/2014/feb/24/statistical-processing-using-apache-commons-math/

Posted On February 24, 2014 by Ganesh P S filed under Enterprise

Dr.V.Nagaradjane

Statistics is an essential field of study for everyone, from high school students to hardcore researchers. Expertise in statistical processing is a desirable strength for employees of any organization, since almost every field of work involves data processing. This article explains the use of the open source library Apache Commons Math, which provides a collection of functions for basic statistical processing.

INTRODUCTION

Wherever data processing is involved, statistical tools are the natural means of extracting essential information from the set of data points at hand. Statistical tools are applicable to medical research, which tries to relate diseases to probable causes, to economics, where parametric indicators are used for predicting market movements (although with very little success), and to many other fields of industry and research.

Many tools are available for statistical processing, such as MS Excel, OpenOffice Calc, MATLAB, SCILAB, R and STATA. However, they are software packages and not libraries. A handy library for statistical processing suits a programmer who wishes to solve the problem at hand with the most appropriate user interface and tools.

The statistical library of Apache Commons can be integrated into software packages which require reliable mathematical processing. It is distributed under the Apache License, which permits inclusion of the library in any finished software product (including commercial ones) with few conditions. Hence, one can incorporate the library in a software package and redistribute it without infringing the rights of anyone else.

Apache Commons is a project containing an assorted collection of libraries for several programming tasks. This large collection includes support for executing external programs, file compression, Java beans manipulation, Unix daemons, mailing, event logging and a Virtual File System (VFS) for accessing local files, network URLs and zip archives as if there were no difference between them.

Apache Commons Math is capable of solving mathematical problems in linear algebra, genetic algorithms, differential equations, data generation, special functions, complex numbers, distributions, 3D geometry, optimization and data filtering. This article explains the use of the statistical aspect of the Apache Commons Math library for simple day to day applications.

DOWNLOADING AND SETTING UP

The Apache Commons Math package can be downloaded from the Apache Commons site [1]. The package is in compressed format. Unzip the tar.gz or zip archive at a suitable location. The extracted folder contains a jar file named commons-mathX-Y.Z.jar, where X, Y and Z denote the version, subversion and release numbers.

To start using the Apache Commons Math library, add the jar file with its complete path to the CLASSPATH environment variable. On Windows, right click on My Computer, choose Properties, select the Advanced tab and click on the Environment Variables button to set the variable. On Linux, open the file /etc/profile as root user (or using sudo) and append the line export CLASSPATH=$CLASSPATH:/my/path/commons-math3-3.2.jar at the end of the file.

The path and jar file name should correspond to the installation details on the computer. After setting up the environment variable, one can import the classes of the math library in Java programs and compile the source code into byte code without any difficulty.
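
Before compiling the larger programs, it can help to confirm that the jar is actually visible to the JVM. The following is a minimal sketch, not part of Commons Math: the class name ClasspathCheck and its helper method are hypothetical, and the check simply tries to load one library class by name.

```java
// ClasspathCheck.java
// Hypothetical helper (not part of Commons Math): reports whether a class
// can be loaded from the classpath the JVM was started with.
class ClasspathCheck {

    // Returns true when the named class can be loaded, false otherwise.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.math3.stat.descriptive.DescriptiveStatistics";
        System.out.println(isOnClasspath(name)
                ? "Commons Math 3 found on the classpath"
                : "Commons Math 3 NOT found; check CLASSPATH or the -cp option");
    }
}
```

As an alternative to the environment variable, the jar can be named on the command line, for example java -cp .:/my/path/commons-math3-3.2.jar ClasspathCheck on Linux (use ; instead of : as the separator on Windows).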

BASIC FEATURES OF APACHE STATISTICS LIBRARY

The statistics library of Apache Commons provides packages for carrying out several statistical calculations. Table 1 shows the packages and their statistical specialization.

Table 1 Apache Commons Math Statistical Packages

Sl. No.	Package Name	Description
1	org.apache.commons.math3.stat	Provides the class Frequency for frequency analysis and the class StatUtils with static methods for calculating mean, variance, minimum, maximum etc.
2	org.apache.commons.math3.stat.descriptive	Provides the DescriptiveStatistics class for calculating statistical parameters (mean, S.D., etc.); this class stores input data in memory for further processing. Also provides the SummaryStatistics class for calculating statistical parameters (mean, S.D., etc.); this class does not store input data in memory.
3	org.apache.commons.math3.stat.correlation	Provides the classes Covariance, PearsonsCorrelation and SpearmansCorrelation to calculate covariance and coefficient of correlation.
4	org.apache.commons.math3.stat.inference	Provides classes for carrying out statistical tests on data points: ChiSquareTest, GTest, MannWhitneyUTest, OneWayAnova, TTest and WilcoxonSignedRankTest.
5	org.apache.commons.math3.stat.regression	Provides the classes SimpleRegression (for univariate regression), OLSMultipleLinearRegression (for multivariate linear regression using the Ordinary Least Squares method), GLSMultipleLinearRegression (for multivariate linear regression using the Generalized Least Squares method) and MillerUpdatingRegression (for multivariate regression using the procedure described by Alan J. Miller in Algorithm AS 274, which supplements Gentleman's least squares routines).

BASIC STATISTICS – MEAN, MODE, MEDIAN, SD

To calculate statistical parameters of min, max, mean, geometric mean, item count, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis and median, the commons math library provides the class named DescriptiveStatistics, which contains appropriately named functions for calculation of these parameters.

There is an alternative class named SummaryStatistics to calculate minimum, maximum, mean, geometric mean, number of items, sum, sum of squares, standard deviation and variance. SummaryStatistics processes its data on the fly. It does not remember the input data used for previous calculations and is useful for single pass calculations on frequently varying input data.
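
To see how mean and variance can be maintained without remembering the input, here is a minimal sketch of a single pass accumulator based on Welford's update formulas. It is an illustration only: the class RunningStats is hypothetical, it is not the source code of SummaryStatistics, and returning NaN for fewer than two values is a choice made for this sketch.

```java
// RunningStats.java
// Hypothetical single-pass accumulator (Welford's method); not library code.
class RunningStats {
    private long n;        // number of values seen so far
    private double mean;   // running mean
    private double m2;     // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Updates all statistics with one new value; the value itself is not stored.
    void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    long getN() { return n; }
    double getMin() { return min; }
    double getMax() { return max; }
    double getMean() { return n > 0 ? mean : Double.NaN; }

    // Sample variance (divides by n - 1); NaN when fewer than two values.
    double getVariance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }
    double getStandardDeviation() { return Math.sqrt(getVariance()); }

    public static void main(String[] args) {
        RunningStats rs = new RunningStats();
        for (double v : new double[] {25, 34, 55, 56, 80, 95, 45, 82})
            rs.addValue(v);
        System.out.println("Count: " + rs.getN());
        System.out.println("Mean: " + rs.getMean());
        System.out.println("S.D.: " + rs.getStandardDeviation());
    }
}
```

The trade-off is the one described above: because no values are kept, quantities that need the full data set, such as percentiles or sorted values, cannot be produced by such an accumulator.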

The following program calculates the statistical results using DescriptiveStatistics class. The code imports org.apache.commons.math3.stat.descriptive package. The program expects input values at the top text area. On pressing the button appropriate to the required statistical parameter, results are displayed in the text area at bottom.

//Descriptive.java
01) import org.apache.commons.math3.stat.descriptive.*;
02) import javax.swing.*;
03) import java.awt.*;
04) import java.awt.event.*;
05) public class Descriptive extends JFrame implements ActionListener
06) {
07) DescriptiveStatistics ds = new DescriptiveStatistics();
08) JTextArea input = new JTextArea(8,80), result = new JTextArea(8,80);
09) public Descriptive() {
10)        super("Descriptive Statistics");
11)
12)        result.setEditable(false);
13)        JScrollPane js = null;
14)        this.getContentPane().add(js = new JScrollPane(input), BorderLayout.NORTH);
15)        js.setBorder(BorderFactory.createTitledBorder("Input"));
16)        this.getContentPane().add(js = new JScrollPane(result), BorderLayout.SOUTH);
17)        Font f = new Font("Arial",Font.BOLD,20);
18)        input.setFont(f);
19)        result.setFont(f);
20)        result.setForeground(Color.red);
21)        js.setBorder(BorderFactory.createTitledBorder("Result"));
22)        JPanel buttonPanel = new JPanel(new FlowLayout());
23)        JButton []b = {new JButton("Geometric Mean"),
24)                    new JButton("Kurtosis"), new JButton("Maximum"),
25)                    new JButton("Mean"), new JButton("Minimum"),
26)                    new JButton("Skewness"), new JButton("Percentile"),
27)                    new JButton("Sorted Values"), new JButton("Standard Deviation"),
28)                    new JButton("Sum"), new JButton("Variance"), new JButton("Count")
29)                    };
30)        for(int i=0;i<b.length; i++) {
31)                    b[i].addActionListener(this);
32)                    buttonPanel.add(b[i]);
33)                    }
34)        this.getContentPane().add(buttonPanel, BorderLayout.CENTER);
35)        this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
36)        this.setBounds(GraphicsEnvironment.getLocalGraphicsEnvironment().
37)                    getMaximumWindowBounds());
38)        this.setVisible(true);
39)        }
40) public void actionPerformed(ActionEvent ae) {
41)        readValues();
42)        String com = ae.getActionCommand();
43)        if(com.equals("Geometric Mean"))
44)                    result.setText("Geometric Mean: "+ds.getGeometricMean());
45)        else if(com.equals("Kurtosis"))
46)                    result.setText("Kurtosis: "+ds.getKurtosis());
47)        else if(com.equals("Maximum"))
48)                    result.setText("Maximum: "+ds.getMax());
49)        else if(com.equals("Minimum"))
50)                    result.setText("Minimum: "+ds.getMin());
51)        else if(com.equals("Mean"))
52)                    result.setText("Mean: "+ds.getMean());
53)        else if(com.equals("Skewness"))
54)                    result.setText("Skewness: "+ds.getSkewness());
55)        else if(com.equals("Percentile"))
56)                    result.setText("Percentile: "+ds.getPercentile(Double.parseDouble(
57)                                JOptionPane.showInputDialog(this,"Enter N for percentile",
58)                                            "N-Value", JOptionPane.QUESTION_MESSAGE))));
59)        else if(com.equals("Sorted Values")) {
60)                    double d[] = ds.getSortedValues();
61)                    StringBuilder sb = new StringBuilder("Sorted values:\n");
62)                    for(int i=0; i<d.length; i++)
63)                                sb.append(""+d[i]+"\n");
64)                    result.setText(sb.toString());
65)                    }
66)        else if(com.equals("Standard Deviation"))
67)                    result.setText("S.D.: "+ds.getStandardDeviation());
68)        else if(com.equals("Sum"))
69)                    result.setText("Sum: "+ds.getSum());
70)        else if(com.equals("Variance"))
71)                    result.setText("Variance: "+ds.getVariance());
72)        else if(com.equals("Count"))
73)                    result.setText("Count: "+ds.getN());
74)        }
75) private void readValues() {
76)        java.util.StringTokenizer st = new java.util.StringTokenizer(
77)                    input.getText(),", \t\n\r");
78)        String str = null;
79)        ds.clear();
80)        while(st.hasMoreTokens()) {
81)                    if((str = st.nextToken()) == null || str.equals(""))
82)                                continue;
83)                    ds.addValue(Double.parseDouble(str));
84)                    }
85)        }
86) public static void main(String arg[]) {
87)        new Descriptive();
88)        }
89) }

The statistical calculations are done without much difficulty using the instance of DescriptiveStatistics class (created on line 07). The response to the pressing of buttons is processed in the actionPerformed method (lines 40 to 73). The program simply calls the appropriate method on the DescriptiveStatistics instance (ds) and the results are set on the result area.

It may be noted that most of the code shown above relates to the creation and operation of the Graphical User Interface. All statistical calculations are taken care of by the Apache Commons library methods. Fig.1 shows a sample operation of the GUI for calculating the standard deviation of the numbers 25,34,55,56,80,95,45,82.
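
The readValues method in the listing passes every token straight to Double.parseDouble, so a stray word in the input area raises a NumberFormatException. A minimal sketch of a more tolerant parser is given below. The class InputParser is hypothetical and uses only the standard library; skipping bad tokens rather than reporting them to the user is an assumption of the sketch.

```java
// InputParser.java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text on commas and whitespace and keeps
// only the tokens that parse as numbers.
class InputParser {

    static List<Double> parse(String text) {
        List<Double> values = new ArrayList<>();
        for (String token : text.split("[,\\s]+")) {
            if (token.isEmpty())
                continue;                       // leading delimiter yields an empty token
            try {
                values.add(Double.parseDouble(token));
            } catch (NumberFormatException e) {
                System.err.println("Skipping non-numeric token: " + token);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("25, 34\n55\tabc 56"));
    }
}
```

Each value returned by parse could then be handed to ds.addValue in place of the direct call to Double.parseDouble.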

CALCULATING COEFFICIENT OF CORRELATION

We notice that there are certain phenomena (the amount of manure and harvest, temperature and sale of ice creams, the duration of power cuts and sale of inverters, the temperature and time required for drying clothes) where we know some relation exists, but cannot ascertain how strong the relationship is. Under such circumstances, correlation provides a scientific means for ascertaining the existence of a relationship between given pairs of data.

Correlation also measures the strength of relation between given pairs (X and Y) of data. When a set of input parameters (X) and the results (Y) are taken up for examination of a possible relationship, coefficient of correlation helps to measure the relationship quantitatively.

The coefficient of correlation ranges from -1 to +1. If the coefficient of correlation is 0, there is no linear relationship between the pairs of data. If the coefficient of correlation is +1, it denotes a perfect linear relationship between the pairs of data, the positive sign indicating that the Y value increases with increase in the X value (positive relation, like temperature and sale of ice creams). If the coefficient of correlation is -1, it also denotes a perfect linear relationship, the minus sign indicating that the Y value decreases when the X value increases (negative relation, like temperature and time required for drying clothes).

In general, the sign of the coefficient of correlation indicates whether the relation between the pairs of data is positive or negative, and the relation is stronger as the absolute value of the coefficient approaches 1.
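
The behaviour described above can be checked by computing Pearson's coefficient directly from its definition, r = Sxy / sqrt(Sxx * Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means. The class PearsonSketch below is a hypothetical, standard-library-only sketch written for illustration; it is not the library's implementation.

```java
// PearsonSketch.java
// Hypothetical illustration of Pearson's coefficient of correlation.
class PearsonSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;   // co-deviation of x and y
            sxx += dx * dx;   // squared deviation of x
            syy += dy * dy;   // squared deviation of y
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        System.out.println("Rising line:  " + pearson(x, new double[] {2, 4, 6, 8}));
        System.out.println("Falling line: " + pearson(x, new double[] {8, 6, 4, 2}));
        System.out.println("Symmetric:    " + pearson(x, new double[] {1, -1, -1, 1}));
    }
}
```

The three calls give a coefficient of 1 for the rising line, -1 for the falling line and 0 for the symmetric pattern, which shows that a coefficient of 0 rules out only a linear relation: the third data set clearly follows a pattern, just not a straight line.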

The data used for calculating the coefficient of correlation relates the sale of ice creams in an ice cream shop to the maximum temperature of the day, as shown in Table 2.

Table 2 Daily Maximum Temperature and Sale of Ice Creams

Day number	Maximum Temperature (°F)	Sale of ice creams (in number)
1	96	386
2	98	493
3	95	390
4	99	568
5	103	624
6	105	630
7	102	612

The following program calculates the coefficient of correlation for the data shown above. The data needs to be organized in the form of two arrays (x[] and y[]) of double type. The program first calculates the most common coefficient of correlation, called Pearson's coefficient, using the PearsonsCorrelation class. Then the rank coefficient of correlation based on Spearman's method is calculated using the SpearmansCorrelation class (the rank is the position of a value when the x values, or the y values, are sorted in ascending order). Both coefficients are obtained by calling the correlation method with the x and y values as arguments. Finally, the covariance of x and y is printed using the Covariance class. Fig.2 shows the results obtained by running the program.

//TestOfCorrelation.java
import org.apache.commons.math3.stat.correlation.*;

public class TestOfCorrelation
{
    public static void main(String[] arg) {
        // Daily maximum temperature (x) and ice cream sales (y) from Table 2
        double[] x = {96,98,95,99,103,105,102},
                 y = {386,493,390,568,624,630,612};
        System.out.println("The data values are:");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i]+"\t"+y[i]);

        // Pearson's coefficient measures the linear relationship
        System.out.print("Pearson's coefficient of correlation: ");
        PearsonsCorrelation pc = new PearsonsCorrelation();
        double cc = pc.correlation(x,y);
        System.out.println(cc);
        // Spearman's coefficient is calculated on the ranks of the values
        SpearmansCorrelation sc = new SpearmansCorrelation();
        System.out.println("Spearman's Rank Correlation: "+sc.correlation(x,y));
        // Covariance of x and y
        Covariance cov = new Covariance();
        System.out.println("Covariance: "+cov.covariance(x,y));
    }
}
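
To make the idea of rank correlation concrete, the sketch below replaces each value by its rank and then applies Pearson's formula to the ranks. It is a hypothetical, standard-library-only illustration (the class SpearmanSketch is not part of Commons Math), and giving tied values the average of their ranks is an assumption of this sketch.

```java
// SpearmanSketch.java
// Hypothetical illustration: Spearman's coefficient as Pearson's coefficient of the ranks.
class SpearmanSketch {

    // Rank of each value in ascending order, starting at 1; ties share the average rank.
    static double[] ranks(double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int smaller = 0, equal = 0;
            for (double other : v) {
                if (other < v[i]) smaller++;
                else if (other == v[i]) equal++;   // counts v[i] itself too
            }
            r[i] = smaller + (equal + 1) / 2.0;
        }
        return r;
    }

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        // Temperature and ice cream sales from Table 2
        double[] x = {96, 98, 95, 99, 103, 105, 102};
        double[] y = {386, 493, 390, 568, 624, 630, 612};
        System.out.println("Ranks of x: " + java.util.Arrays.toString(ranks(x)));
        System.out.println("Ranks of y: " + java.util.Arrays.toString(ranks(y)));
        System.out.println("Spearman: " + spearman(x, y));
    }
}
```

For the Table 2 data the ranks of x and y differ only on days 1 and 3, so the rank coefficient works out to 27/28, approximately 0.964.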

REGRESSION WITH SINGLE VARIABLE

Regression means fitting a mathematical relation to the data at hand, so that the result for any input value can be calculated using the newly found equation. The principle behind regression is to fit a linear relationship to the data such that the error between the actual results and the calculated results is minimum. If we calculate the relation based on a single set of X values and a corresponding set of Y values, it is called single variable regression or univariate regression.

Regression is a generalization of the points available at hand in the form of a function. Normally, a linear relationship is assumed. Since a straight line relationship of the general form y = slope * X + Y-axis-intercept is fitted for the given data, regression helps us to calculate the slope and y-axis intercept values.
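
For a straight line, the least squares slope and intercept have closed forms: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x), where Sxy and Sxx are sums of products of deviations from the means. The following standard-library-only sketch evaluates these formulas. The class LeastSquaresSketch is hypothetical and is meant only to show what a simple regression has to compute, not how the library is implemented.

```java
// LeastSquaresSketch.java
// Hypothetical illustration of fitting y = slope * x + intercept by least squares.
class LeastSquaresSketch {

    // Returns {slope, intercept} of the least squares line through the points.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;   // the line passes through the means
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        // Points lying exactly on y = 2x + 1
        double[] line = fit(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9});
        System.out.println("slope = " + line[0] + ", intercept = " + line[1]);
    }
}
```

Because the sample points lie exactly on y = 2x + 1, the fit recovers a slope of 2 and an intercept of 1; with real, scattered data the same formulas give the line with the smallest sum of squared errors.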

We might calculate the regression coefficients (slope and y-axis intercept) for the data shown in Table 2, using the program listed below. Fig.3 shows the result obtained by running the program.

//TestOfSimpleRegression.java
import org.apache.commons.math3.stat.regression.*;

public class TestOfSimpleRegression
{
    public static void main(String[] arg) {
        // Daily maximum temperature (x) and ice cream sales (y) from Table 2
        double[] x = {96,98,95,99,103,105,102},
                 y = {386,493,390,568,624,630,612};
        java.text.DecimalFormat df = new java.text.DecimalFormat("0.00");

        System.out.println("The data values are:");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i]+"\t"+y[i]);

        // Feed the (x, y) pairs to the regression one at a time
        SimpleRegression sr = new SimpleRegression();
        for(int i=0; i<x.length; i++)
            sr.addData(x[i],y[i]);
        double slope = sr.getSlope(), intercept = sr.getIntercept();
        // Print an explicit sign in front of the intercept
        String interceptString = (intercept<0?(""+df.format(intercept)):
            ("+"+df.format(intercept)));
        System.out.println("y = "+df.format(slope)+"x "+interceptString);

        // Compare the fitted values with the observed y values
        System.out.println("Predictions are (look out for errors in y values)");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i]+"\t"+df.format(sr.predict(x[i])));
        System.out.println("Error in slope: "+sr.getSlopeStdErr());
        System.out.println("Error in intercept: "+sr.getInterceptStdErr());
        System.out.println("Mean squared error: "+sr.getMeanSquareError());
    }
}

REGRESSION WITH MULTIPLE VARIABLES

When the result (Y) depends on more than one input value (X1, X2, ...), the regression carried out on such data sets is called multiple regression or multivariate regression. Let us consider that there is a relationship between the productivity of an individual, the rate (price) of the mobile phone/tablet owned by him/her and the monthly data consumption on the device. Table 3 shows fictitious data on the rate of the device, the data usage and the productivity of the owner of the device, expressed as a percentage of full productivity.

Table 3 Fictitious Data on Ownership of Device and Productivity of Owner

Sl. No.	Rate of Device (Mobile/Tab) (Rupees)	Data Usage (MB) per month	Productivity of its owner
1	3,000	100	100%
2	6,000	512	85%
3	10,000	1024	70%
4	10,000	2048	67%
5	20,000	2048	65%
6	20,000	3072	60%
7	30,000	1024	66%
8	30,000	2048	60%
9	30,000	3072	55%
10	45,000	1024	70%
11	45,000	2048	65%
12	45,000	3072	60%
13	45,000	4098	55%
14	60,000	1024	67%
15	60,000	2048	60%
16	60,000	3072	50%
17	60,000	4098	40%

The data is completely baseless and only pokes fun at the productivity of smart phone owners. It portrays the productivity of an average individual who owns a smart phone, compared to that of a person who owns a basic, non-smart phone. Productivity here means the output of work which can immediately help the owner of the device to earn money or fulfill the task assigned by his or her employer.

The Apache Commons Math library provides two classes for carrying out multiple regression, namely, OLSMultipleLinearRegression and GLSMultipleLinearRegression. OLS stands for Ordinary Least Squares and GLS stands for Generalized Least Squares. Both methods aim to minimize the sum of the squared errors between the actual results and the regression predictions. The following example uses OLSMultipleLinearRegression for estimating the regression parameters.

//TestOfMultipleRegression.java
import org.apache.commons.math3.stat.regression.*;

public class TestOfMultipleRegression
{
    public static void main(String[] arg) {
        // Each row of x holds {rate of device, data usage per month} from Table 3
        double[][] x = {
            {3000,100},{6000,512},{10000,1024},
            {10000,2048},{20000,2048},{20000,3072},
            {30000,1024},{30000,2048},{30000,3072},
            {45000,1024},{45000,2048},{45000,3072},
            {45000,4098},{60000,1024},{60000,2048},
            {60000,3072},{60000,4098}
        };
        // Productivity (in percent) for the corresponding row of x
        double[] y = {100, 85, 70, 67,
            65, 60, 66, 60,
            55, 70, 65, 60,
            55, 67, 60, 50, 40};
        java.text.DecimalFormat df = new java.text.DecimalFormat("####0.00000");
        OLSMultipleLinearRegression mr = new OLSMultipleLinearRegression();
        mr.newSampleData(y,x);
        // param[0] is the intercept; param[1] and param[2] multiply rate and data usage
        double[] param = mr.estimateRegressionParameters();
        // Print an explicit sign in front of each coefficient
        String param1String = (param[1]<0?(""+df.format(param[1])):("+"+df.format(param[1]))),
               param2String = (param[2]<0?(""+df.format(param[2])):("+"+df.format(param[2])));
        System.out.println("y = "+df.format(param[0])+
            param1String+"Rd"+param2String+"Du");

        System.out.println("\nThe data values are:");
        System.out.println("Rate\tData\tGiven\tPredicted");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i][0]+"\t"+x[i][1]+"\t"+y[i]+"\t"+
                (param[0]+param[1]*x[i][0]+param[2]*x[i][1]));
    }
}

The predictions obtained from the above formula are displayed in Fig.4. Although regression introduces some error into the predictions, it creates a generalized version of the data available at hand. The regression equation can therefore be applied even to input values that do not appear in the original data.

If a person buys a mobile phone for 42,000 rupees and subscribes to a monthly data plan of 2500MB, the equation predicts the resulting productivity at 59.61%. The original data set does not have an answer for this pair of rate and data usage. Hence, the regression equation helps us to make predictions on points for which results are not known through observation.
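
The fitting and prediction steps can also be sketched without the library, which may help in understanding what the regression parameters mean. The hypothetical class OlsSketch below forms the normal equations (X'X)b = X'y and solves them by Gaussian elimination. This is a technique swapped in for illustration; it is not presented as the method used inside OLSMultipleLinearRegression, and it is less robust numerically than a library solver.

```java
// OlsSketch.java
// Hypothetical illustration of ordinary least squares via the normal equations.
class OlsSketch {

    // Returns {b0, b1, ..., bk}: the intercept followed by one coefficient per input column.
    static double[] fit(double[][] x, double[] y) {
        int n = x.length, k = x[0].length + 1;      // +1 for the intercept column
        double[][] a = new double[k][k + 1];        // augmented matrix [X'X | X'y]
        for (int r = 0; r < n; r++) {
            double[] row = new double[k];
            row[0] = 1.0;                           // constant term
            System.arraycopy(x[r], 0, row, 1, k - 1);
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++)
                    a[i][j] += row[i] * row[j];
                a[i][k] += row[i] * y[r];
            }
        }
        // Forward elimination with partial pivoting
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12)
                throw new ArithmeticException("inputs are linearly dependent");
            for (int r = col + 1; r < k; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= k; c++)
                    a[r][c] -= f * a[col][c];
            }
        }
        // Back substitution
        double[] b = new double[k];
        for (int i = k - 1; i >= 0; i--) {
            double s = a[i][k];
            for (int j = i + 1; j < k; j++)
                s -= a[i][j] * b[j];
            b[i] = s / a[i][i];
        }
        return b;
    }

    // Applies the regression equation y = b0 + b1*x1 + b2*x2 + ... to new inputs.
    static double predict(double[] b, double... inputs) {
        double yHat = b[0];
        for (int i = 0; i < inputs.length; i++)
            yHat += b[i + 1] * inputs[i];
        return yHat;
    }

    public static void main(String[] args) {
        // Rate of device, data usage and productivity from Table 3
        double[][] x = {
            {3000,100},{6000,512},{10000,1024},{10000,2048},{20000,2048},{20000,3072},
            {30000,1024},{30000,2048},{30000,3072},{45000,1024},{45000,2048},{45000,3072},
            {45000,4098},{60000,1024},{60000,2048},{60000,3072},{60000,4098}
        };
        double[] y = {100,85,70,67,65,60,66,60,55,70,65,60,55,67,60,50,40};
        double[] b = fit(x, y);
        System.out.println("Prediction for 42,000 rupees and 2500MB: "
            + predict(b, 42000, 2500));
    }
}
```

The value printed by main can be compared with the prediction quoted above for the same pair of inputs.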

CONCLUSION

The need for performing statistical calculations is widespread in the programming world. Correlation and regression are applied in generating mathematical models (equations) for experimental results. Apache Commons Math provides a well seasoned library for statistical calculations. We need only provide the data, and the calculations are carried out without the burden of coding each operation.

Apache Commons Math has much more to offer a programmer than statistics, since it covers several areas of mathematics. I wish you more than AVERAGE productivity using the Apache Commons Math library.