Statistical Processing using Apache Commons Math



http://developeriq.in/articles/2014/feb/24/statistical-processing-using-apache-commons-math/

Dr.V.Nagaradjane

Statistics is an essential field of study for everyone, from high school students to hardcore researchers. Expertise in statistical processing is a desirable strength for employees of any organization, since almost all fields of work involve data processing. This article explains the use of the open source library Apache Commons Math, which provides a collection of functions for basic statistical processing.

INTRODUCTION

Wherever data processing is involved, statistical tools are the natural choice for extracting essential information from the set of data points at hand. Statistical tools are applicable to medical research, which tries to relate diseases to probable causes, to economics, where parametric indicators are used for predicting market movements (although with very little success), and to many other fields of industry and research.

 

Many tools are available for statistical processing, like MS Excel, OpenOffice Calc, MATLAB, SCILAB, R, STATA, etc. But they are software packages and not libraries. A handy library for statistical processing suits a programmer, who always wishes to solve the problem at hand with the most appropriate user interface and tools.

The statistical library of Apache Commons can be integrated into software packages which require reliable mathematical processing. It is distributed under the Apache License, which permits inclusion of the library in any finished software product (including commercial ones) without many conditions. Hence, one can feel at ease to incorporate the library in any software package and redistribute the same without infringing the rights of anyone else.

Apache Commons is a project containing an assortment of libraries for several programming tasks. This large collection includes support for executing external programs, file compression, Java beans manipulation, Unix daemons, mailing, event logging and a Virtual File System (VFS) for accessing local files, network URLs and zip archives as if there were no difference between them.

Apache Commons Math is capable of solving mathematical problems in linear algebra, genetic algorithms, differential equations, data generation, special functions, complex numbers, distributions, 3D geometry, optimization and data filtering. This article explains the use of the statistical aspect of the Apache Commons Math library for simple day to day applications.

DOWNLOADING AND SETTING UP

The Apache Commons Math package can be downloaded from the Apache Commons site [1]. The package is in compressed format. Unzip the tar.gz or zip archive at a suitable location. The extracted folder contains a jar file named commons-mathX-Y.Z.jar, where X, Y and Z denote the version, subversion and release numbers.

To start using the Apache Commons Math library, add the jar file, with its complete path, to the CLASSPATH environment variable. On Windows, right click on My Computer, choose Properties, select the Advanced tab and click on the Environment Variables button to set the variable. On Linux, open the file /etc/profile as root user (or using sudo) and append the line export CLASSPATH=$CLASSPATH:/my/path/commons-math3-3.2.jar at the end of the file.

The path and the jar file name should correspond to the installation details on the computer. After setting up the environment variable, one can import the classes of the math library in Java programs and compile the source code into byte code without any difficulty.
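A quick way to confirm that the setup worked is to ask the JVM whether a library class can be loaded. The following is only a minimal sketch: the class ClasspathCheck and its method isAvailable are hypothetical helpers written for this illustration (they are not part of Commons Math), and StatUtils is used as the probe simply because it is one of the classes discussed in this article.

```java
// ClasspathCheck.java
// Hypothetical helper (not part of Commons Math): reports whether a class
// with the given fully qualified name can be found on the current classpath.
class ClasspathCheck {

    static boolean isAvailable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String probe = "org.apache.commons.math3.stat.StatUtils";
        System.out.println(probe + " available: " + isAvailable(probe));
    }
}
```

If the program reports false, the jar path in the CLASSPATH variable is most likely wrong or the variable has not been picked up by the current shell.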

BASIC FEATURES OF APACHE STATISTICS LIBRARY

The statistics library of Apache Commons provides packages for carrying out several statistical calculations. Table 1 shows the packages and their statistical specialization.

 

Table 1 Apache Commons Math Statistical Packages


| Sl. No. | Package Name | Description |
|---|---|---|
| 1 | org.apache.commons.math3.stat | Provides the classes Frequency for frequency analysis and StatUtils with static methods for calculating mean, variance, minimum, maximum etc. |
| 2 | org.apache.commons.math3.stat.descriptive | Provides the DescriptiveStatistics class for calculating statistical parameters (mean, S.D., etc.). This class stores input data in memory for further processing. Also provides the SummaryStatistics class for calculating statistical parameters (mean, S.D., etc.). This class does not store input data in memory. |
| 3 | org.apache.commons.math3.stat.correlation | Provides the classes Covariance, PearsonsCorrelation and SpearmansCorrelation to calculate covariance and coefficient of correlation. |
| 4 | org.apache.commons.math3.stat.inference | Provides classes for carrying out statistical tests on data points: ChiSquareTest, GTest, MannWhitneyUTest, OneWayAnova, TTest and WilcoxonSignedRankTest. |
| 5 | org.apache.commons.math3.stat.regression | Provides the classes SimpleRegression (for univariate regression), OLSMultipleLinearRegression (for multiple linear regression using the Ordinary Least Squares method), GLSMultipleLinearRegression (for multiple linear regression using the Generalized Least Squares method) and MillerUpdatingRegression (for multiple regression using the updating procedure of Gentleman, as described by Miller in his research paper). |

BASIC STATISTICS – MEAN, MODE, MEDIAN, SD

To calculate statistical parameters of min, max, mean, geometric mean, item count, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis and median, the commons math library provides the class named DescriptiveStatistics, which contains appropriately named functions for calculation of these parameters.

 

There is an alternative class named SummaryStatistics to calculate minimum, maximum, mean, geometric mean, number of items, sum, sum of squares, standard deviation and variance. SummaryStatistics processes its data on the fly. It does not remember the input data used for previous calculations and is useful for single pass calculations on frequently varying input data.
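To make the idea of "processing on the fly" concrete, the sketch below keeps only a count, a running mean and a running sum of squared deviations, and updates them as each value arrives (Welford's method). It is an illustration in plain Java, not the library's implementation: the class RunningStats and its methods are hypothetical names, the variance uses the n - 1 (sample) divisor, and returning NaN for fewer than two values is a choice made for this sketch.

```java
// RunningStats.java
// Hypothetical single-pass accumulator; it never stores the input values.
class RunningStats {
    private long n;        // number of values seen so far
    private double mean;   // running mean
    private double m2;     // running sum of squared deviations from the mean

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // uses the updated mean (Welford's update)
    }

    long getN() { return n; }

    double getMean() { return n == 0 ? Double.NaN : mean; }

    // Sample variance (divisor n - 1); NaN when fewer than two values were added
    double getVariance() { return n < 2 ? Double.NaN : m2 / (n - 1); }

    double getStandardDeviation() { return Math.sqrt(getVariance()); }

    public static void main(String[] args) {
        double[] data = {25, 34, 55, 56, 80, 95, 45, 82};   // arbitrary sample values
        RunningStats rs = new RunningStats();
        for (double v : data)
            rs.add(v);
        System.out.println("Count: " + rs.getN());
        System.out.println("Mean: " + rs.getMean());
        System.out.println("Variance: " + rs.getVariance());
        System.out.println("S.D.: " + rs.getStandardDeviation());
    }
}
```

For the eight sample values the mean is 59 and the sample variance is 4268/7, about 609.71. The price of this economy is that anything requiring the full data set, such as percentiles or sorted values, cannot be produced by such an accumulator.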

The following program calculates the statistical results using the DescriptiveStatistics class. The code imports the org.apache.commons.math3.stat.descriptive package. The program expects input values in the text area at the top. On pressing the button for the required statistical parameter, the result is displayed in the text area at the bottom.

//Descriptive.java
01) import org.apache.commons.math3.stat.descriptive.*;
02) import javax.swing.*;
03) import java.awt.*;
04) import java.awt.event.*;
05) public class Descriptive extends JFrame implements ActionListener
06) {
07) DescriptiveStatistics ds = new DescriptiveStatistics();
08) JTextArea input = new JTextArea(8,80), result = new JTextArea(8,80);
09) public Descriptive() {
10)        super("Descriptive Statistics");
11)       
12)        result.setEditable(false);
13)        JScrollPane js = null;
14)        this.getContentPane().add(js = new JScrollPane(input), BorderLayout.NORTH);
15)        js.setBorder(BorderFactory.createTitledBorder("Input"));
16)        this.getContentPane().add(js = new JScrollPane(result), BorderLayout.SOUTH);
17)        Font f = new Font("Arial",Font.BOLD,20);
18)        input.setFont(f);
19)        result.setFont(f);
20)        result.setForeground(Color.red);
21)        js.setBorder(BorderFactory.createTitledBorder("Result"));
22)        JPanel buttonPanel = new JPanel(new FlowLayout());
23)        JButton []b = {new JButton("Geometric Mean"),
24)                    new JButton("Kurtosis"), new JButton("Maximum"),
25)                    new JButton("Mean"), new JButton("Minimum"),
26)                    new JButton("Skewness"), new JButton("Percentile"),
27)                    new JButton("Sorted Values"), new JButton("Standard Deviation"),
28)                    new JButton("Sum"), new JButton("Variance"), new JButton("Count")
29)                    };
30)        for(int i=0;i<b.length; i++) {
31)                    b[i].addActionListener(this);
32)                    buttonPanel.add(b[i]);
33)                    }
34)        this.getContentPane().add(buttonPanel, BorderLayout.CENTER);
35)        this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
36)        this.setBounds(GraphicsEnvironment.getLocalGraphicsEnvironment().
37)                    getMaximumWindowBounds());
38)        this.setVisible(true);
39)        }
40) public void actionPerformed(ActionEvent ae) {
41)        readValues();
42)        String com = ae.getActionCommand();
43)        if(com.equals("Geometric Mean"))
44)                    result.setText("Geometric Mean: "+ds.getGeometricMean());
45)        else if(com.equals("Kurtosis"))
46)                    result.setText("Kurtosis: "+ds.getKurtosis());
47)        else if(com.equals("Maximum"))
48)                    result.setText("Maximum: "+ds.getMax());
49)        else if(com.equals("Minimum"))
50)                    result.setText("Minimum: "+ds.getMin());
51)        else if(com.equals("Mean"))
52)                    result.setText("Mean: "+ds.getMean());
53)        else if(com.equals("Skewness"))
54)                    result.setText("Skewness: "+ds.getSkewness());
55)        else if(com.equals("Percentile"))
56)                    result.setText("Percentile: "+ds.getPercentile(Double.parseDouble(
57)                                JOptionPane.showInputDialog(this,"Enter N for percentile",
58)                                            "N-Value", JOptionPane.QUESTION_MESSAGE))));
59)        else if(com.equals("Sorted Values")) {
60)                    double d[] = ds.getSortedValues();
61)                    StringBuilder sb = new StringBuilder("Sorted values:\n");
62)                    for(int i=0; i<d.length; i++)
63)                                sb.append(""+d[i]+"\n");
64)                    result.setText(sb.toString());
65)                    }
66)        else if(com.equals("Standard Deviation"))
67)                    result.setText("S.D.: "+ds.getStandardDeviation());
68)        else if(com.equals("Sum"))
69)                    result.setText("Sum: "+ds.getSum());
70)        else if(com.equals("Variance"))
71)                    result.setText("Variance: "+ds.getVariance());
72)        else if(com.equals("Count"))
73)                    result.setText("Count: "+ds.getN());
74)        }
75) private void readValues() {
76)        java.util.StringTokenizer st = new java.util.StringTokenizer(
77)                    input.getText(),", \t\n\r");
78)        String str = null;
79)        ds.clear();
80)        while(st.hasMoreTokens()) {
81)                    if((str = st.nextToken()) == null || str.equals(""))
82)                                continue;
83)                    ds.addValue(Double.parseDouble(str));
84)                    }
85)        }
86) public static void main(String arg[]) {
87)        new Descriptive();
88)        }
89) }

The statistical calculations are done without much difficulty using the instance of DescriptiveStatistics class (created on line 07). The response to the pressing of buttons is processed in the actionPerformed method (lines 40 to 73). The program simply calls the appropriate method on the DescriptiveStatistics instance (ds) and the results are set on the result area.

It may be noted that most of the code shown above relates to the creation and operation of the Graphical User Interface. All statistical calculations are taken care of by the Apache Commons library methods. Fig.1 shows a sample operation of the GUI for calculating the standard deviation of the numbers 25,34,55,56,80,95,45,82.

CALCULATING COEFFICIENT OF CORRELATION

We notice that there are certain phenomena (the amount of manure and harvest, temperature and sale of ice creams, the duration of power cuts and sale of inverters, the temperature and time required for drying clothes) where we know some relation exists, but cannot ascertain how strong the relationship is. Under such circumstances, correlation provides a scientific means for ascertaining the existence of a relationship between given pairs of data.

Correlation also measures the strength of relation between given pairs (X and Y) of data. When a set of input parameters (X) and the results (Y) are taken up for examination of a possible relationship, coefficient of correlation helps to measure the relationship quantitatively.

The coefficient of correlation ranges from -1 to +1. A value of 0 indicates that there is no linear relationship between the pairs of data. A value of +1 denotes a perfect positive relationship: the Y value increases as the X value increases (like temperature and sale of ice creams). A value of -1 denotes an equally strong relationship of the opposite kind: the Y value decreases as the X value increases (negative relation, like temperature and time required for drying clothes).

In general, the sign of the coefficient of correlation indicates whether the relation between the pairs of data is positive or negative, and the relation is stronger the closer the absolute value of the coefficient is to 1.
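The behaviour at the two extremes can be checked with a few lines of plain Java. The sketch below computes Pearson's coefficient directly from its textbook definition, r = Σ(x - mean of x)(y - mean of y) / sqrt(Σ(x - mean of x)² · Σ(y - mean of y)²). The class PearsonSketch and its method pearson are hypothetical names used only for this illustration; this is not how the library is necessarily implemented.

```java
// PearsonSketch.java
// Hand-written illustration of Pearson's coefficient of correlation.
class PearsonSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;   // sums of products of deviations
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] rising = {2, 4, 6};    // y grows with x
        double[] falling = {6, 4, 2};   // y shrinks as x grows
        System.out.println("Rising data:  " + pearson(x, rising));
        System.out.println("Falling data: " + pearson(x, falling));
    }
}
```

The rising data lie exactly on a straight line with positive slope and give +1, while the falling data give -1. Real measurements rarely fall exactly on a line, so values strictly between these limits are the norm.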

The data used for calculating the coefficient of correlation relates the sale of ice creams in an ice cream shop to the maximum temperature of the day, as shown in Table 2.

Table 2 Daily Maximum Temperature and Sale of Ice Creams


| Day number | Maximum Temperature (°F) | Sale of ice creams (in number) |
|---|---|---|
| 1 | 96 | 386 |
| 2 | 98 | 493 |
| 3 | 95 | 390 |
| 4 | 99 | 568 |
| 5 | 103 | 624 |
| 6 | 105 | 630 |
| 7 | 102 | 612 |


The following program calculates the coefficient of correlation for the data shown above. The data needs to be organized in the form of two arrays (x[] and y[]) of double type. The program calculates the most common coefficient of correlation, called Pearson's coefficient, using the PearsonsCorrelation class. Then, the rank coefficient of correlation based on Spearman's method is calculated using the SpearmansCorrelation class (rank is the position of a value when the x values, or the y values, are sorted in ascending order). Both coefficients are obtained by calling the correlation method with the x and y values as arguments. Finally, the covariance of the two series is calculated using the Covariance class. Fig.2 shows the results obtained by running the program.

//TestOfCorrelation.java
import org.apache.commons.math3.stat.correlation.*;

public class TestOfCorrelation
{
    public static void main(String arg[]) {
        // Daily maximum temperature (x) and sale of ice creams (y) from Table 2
        double x[] = {96,98,95,99,103,105,102},
               y[] = {386,493,390,568,624,630,612};
        System.out.println("The data values are:");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i]+"\t"+y[i]);

        // Pearson's coefficient of correlation
        System.out.print("Pearson's coefficient of correlation: ");
        PearsonsCorrelation pc = new PearsonsCorrelation();
        double cc = pc.correlation(x,y);
        System.out.println(cc);

        // Spearman's coefficient, calculated on the ranks of the values
        SpearmansCorrelation sc = new SpearmansCorrelation();
        System.out.println("Spearman's Rank Correlation: "+sc.correlation(x,y));

        // Covariance of the two series
        Covariance cov = new Covariance();
        System.out.println("Covariance: "+cov.covariance(x,y));
    }
}
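The notion of rank used by Spearman's method may be easier to follow with a hand calculation. The sketch below ranks the temperature and sales figures (1 for the smallest value) and applies the classic formula rho = 1 - 6 Σd² / (n(n² - 1)), where d is the difference between the two ranks of a pair. It assumes that no value is repeated within a series, which holds for this data; the class SpearmanSketch and its methods are hypothetical names, and the library may handle ties and other details differently.

```java
// SpearmanSketch.java
// Hand-written illustration of rank correlation for data without tied values.
import java.util.Arrays;

class SpearmanSketch {

    // Rank of each value within its series: 1 = smallest. Assumes no duplicates.
    static int[] ranks(double[] v) {
        double[] sorted = v.clone();
        Arrays.sort(sorted);
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++)
            r[i] = Arrays.binarySearch(sorted, v[i]) + 1;
        return r;
    }

    static double spearman(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        int n = x.length;
        int[] rx = ranks(x), ry = ranks(y);
        double sumD2 = 0;   // sum of squared rank differences
        for (int i = 0; i < n; i++) {
            double d = rx[i] - ry[i];
            sumD2 += d * d;
        }
        return 1.0 - 6.0 * sumD2 / (n * ((double) n * n - 1));
    }

    public static void main(String[] args) {
        double[] x = {96, 98, 95, 99, 103, 105, 102};
        double[] y = {386, 493, 390, 568, 624, 630, 612};
        System.out.println("Ranks of x: " + Arrays.toString(ranks(x)));
        System.out.println("Ranks of y: " + Arrays.toString(ranks(y)));
        System.out.println("Rank correlation: " + spearman(x, y));
    }
}
```

For this data the ranks differ only on days 1 and 3 (differences of 1 and -1), so Σd² = 2 and rho = 1 - 12/336, approximately 0.964.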

 

REGRESSION WITH SINGLE VARIABLE

Regression means fitting a mathematical relation to the data at hand, so that the result for any input value can be calculated using the newly found equation. The principle behind regression is to fit a linear relationship to the data at hand such that the error between the actual results and the calculated results is minimum. If we calculate the relation based on a single set of X values and a corresponding set of Y values, it is called single variable regression or univariate regression.

 

Regression is a generalization of the points available at hand in the form of a function. Normally, a linear relationship is assumed. Since a straight line of the general form y = slope * x + intercept is fitted to the given data, regression amounts to calculating the slope and the y-axis intercept.
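For a single input variable, the slope and intercept that minimize the sum of squared errors have a closed form: slope = Σ(x - mean of x)(y - mean of y) / Σ(x - mean of x)², and intercept = mean of y - slope * mean of x. The following plain Java sketch applies these two formulas; the class LineFitSketch and its method fit are hypothetical names for this illustration and are not part of the library.

```java
// LineFitSketch.java
// Hand-written least squares fit of a straight line y = slope * x + intercept.
class LineFitSketch {

    // Returns {slope, intercept}
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0)
            throw new IllegalArgumentException("all x values are identical");
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        // Points taken from the line y = 2x + 1, so the fit should recover it
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] line = fit(x, y);
        System.out.println("slope = " + line[0] + ", intercept = " + line[1]);
    }
}
```

Because the four sample points lie exactly on y = 2x + 1, the fit returns a slope of 2 and an intercept of 1. With noisy data the same formulas return the line that is closest to the points in the least squares sense.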

We might calculate the regression coefficients (slope and y-axis intercept) for the data shown in Table 2, using the program listed below. Fig.3 shows the result obtained by running the program.

//TestOfSimpleRegression.java
import org.apache.commons.math3.stat.regression.*;

public class TestOfSimpleRegression
{
    public static void main(String arg[]) {
        // Daily maximum temperature (x) and sale of ice creams (y) from Table 2
        double x[] = {96,98,95,99,103,105,102},
               y[] = {386,493,390,568,624,630,612};
        java.text.DecimalFormat df = new java.text.DecimalFormat("0.00");

        System.out.println("The data values are:");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i]+"\t"+y[i]);

        // Feed the (x, y) pairs to the regression object one by one
        SimpleRegression sr = new SimpleRegression();
        for(int i=0; i<x.length; i++)
            sr.addData(x[i],y[i]);

        // Print the fitted line with an explicit sign before the intercept
        double slope = sr.getSlope(), intercept = sr.getIntercept();
        String interceptString = (intercept<0?(""+df.format(intercept)):
                                 ("+"+df.format(intercept)));
        System.out.println("y = "+df.format(slope)+"x "+interceptString);

        System.out.println("Predictions are (look out for errors in y values)");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i]+"\t"+df.format(sr.predict(x[i])));
        System.out.println("Error in slope: "+sr.getSlopeStdErr());
        System.out.println("Error in intercept: "+sr.getInterceptStdErr());
        System.out.println("Mean squared error: "+sr.getMeanSquareError());
    }
}

 

REGRESSION WITH MULTIPLE VARIABLES

When the result (Y) depends on more than one input value (X1, X2, ...), the regression carried out on such data sets is called multiple regression or multivariate regression. Let us assume that there is a relationship between the productivity of an individual, the rate (price) of the mobile phone/tablet owned by him/her and the monthly data consumption on the device. Table 3 shows fictitious data on the rate of the device, the data usage and the productivity of its owner, expressed as a percentage of full productivity.

Table 3 Fictitious Data on Ownership of Device and Productivity of Owner


| Sl. No. | Rate of Device (Mobile/Tab) (Rupees) | Data Usage (MB) per month | Productivity of its owner |
|---|---|---|---|
| 1 | 3,000 | 100 | 100% |
| 2 | 6,000 | 512 | 85% |
| 3 | 10,000 | 1024 | 70% |
| 4 | 10,000 | 2048 | 67% |
| 5 | 20,000 | 2048 | 65% |
| 6 | 20,000 | 3072 | 60% |
| 7 | 30,000 | 1024 | 66% |
| 8 | 30,000 | 2048 | 60% |
| 9 | 30,000 | 3072 | 55% |
| 10 | 45,000 | 1024 | 70% |
| 11 | 45,000 | 2048 | 65% |
| 12 | 45,000 | 3072 | 60% |
| 13 | 45,000 | 4098 | 55% |
| 14 | 60,000 | 1024 | 67% |
| 15 | 60,000 | 2048 | 60% |
| 16 | 60,000 | 3072 | 50% |
| 17 | 60,000 | 4098 | 40% |

The data is completely fictitious and merely pokes fun at the productivity of smart phone owners. It portrays the productivity of an average individual who owns a smart phone, compared to that of a person who owns a basic, non-smart phone. Productivity here means the output of work which can immediately help the owner of the device to earn money or fulfill the task assigned by his or her employer.

      

The Apache Commons Math library provides two classes for carrying out multiple regression, namely, OLSMultipleLinearRegression and GLSMultipleLinearRegression. OLS stands for Ordinary Least Squares and GLS stands for Generalized Least Squares. Both methods aim to minimize the sum of the squared errors between the actual results and the regression predictions. The following example uses OLSMultipleLinearRegression for estimating the regression parameters.

//TestOfMultipleRegression.java
import org.apache.commons.math3.stat.regression.*;

public class TestOfMultipleRegression
{
    public static void main(String arg[]) {
        // Each row of x holds {rate of device, data usage} from Table 3
        double x[][] = {
                {3000,100},{6000,512},{10000,1024},
                {10000,2048},{20000,2048},{20000,3072},
                {30000,1024},{30000,2048},{30000,3072},
                {45000,1024},{45000,2048},{45000,3072},
                {45000,4098},{60000,1024},{60000,2048},
                {60000,3072},{60000,4098}
                },
        // Productivity (in percent) for the corresponding row of x
        y[] = {100, 85, 70, 67,
                65, 60, 66, 60,
                55, 70, 65, 60,
                55, 67, 60, 50, 40};
        java.text.DecimalFormat df = new java.text.DecimalFormat("####0.00000");
        OLSMultipleLinearRegression mr = new OLSMultipleLinearRegression();
        mr.newSampleData(y,x);

        // param[0] is the intercept; param[1] and param[2] are the
        // coefficients of rate (Rd) and data usage (Du)
        double param[] = mr.estimateRegressionParameters();
        String param1String = (param[1]<0?(""+df.format(param[1])):
                                   ("+"+df.format(param[1]))),
               param2String = (param[2]<0?(""+df.format(param[2])):
                                   ("+"+df.format(param[2])));
        System.out.println("y = "+df.format(param[0])+
                param1String+"Rd"+param2String+"Du");

        System.out.println("\nThe data values are:");
        System.out.println("Rate\tData\tGiven\tPredicted");
        for(int i=0; i<x.length; i++)
            System.out.println(x[i][0]+"\t"+x[i][1]+"\t"+y[i]+"\t"+
                    (param[0]+param[1]*x[i][0]+param[2]*x[i][1]));
    }
}

The predictions obtained from the above formula are displayed in Fig.4. Although regression introduces some error into the predictions, it generalizes the data at hand, so the regression equation can also be applied to input values that were never observed.

If a person buys a mobile phone for 42,000 rupees and subscribes to a monthly data plan of 2500 MB, the equation predicts the resulting productivity at 59.61%. The original data set has no entry for this pair of rate and data usage. Hence, the regression equation helps us to make predictions for points at which results are not known through observation.
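For readers curious about what "least squares" involves with more than one input, the sketch below solves the so-called normal equations (X'X)b = X'y by Gaussian elimination and then evaluates the fitted equation at a new point. It is a plain Java illustration under stated assumptions: the class OlsSketch and its methods fit and predict are hypothetical names, the toy data are made up so that the exact answer is known, and a production library would normally prefer a numerically more stable decomposition than this direct approach.

```java
// OlsSketch.java
// Hand-written multiple linear regression via the normal equations.
class OlsSketch {

    // Returns {b0, b1, ..., bk}: intercept followed by one coefficient per input column
    static double[] fit(double[][] x, double[] y) {
        if (x.length == 0 || x.length != y.length)
            throw new IllegalArgumentException("x and y must have the same non-zero length");
        int n = x.length, k = x[0].length + 1;
        double[][] a = new double[k][k + 1];        // augmented matrix [X'X | X'y]
        for (int r = 0; r < n; r++) {
            double[] row = new double[k];
            row[0] = 1.0;                           // constant term for the intercept
            System.arraycopy(x[r], 0, row, 1, k - 1);
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++)
                    a[i][j] += row[i] * row[j];
                a[i][k] += row[i] * y[r];
            }
        }
        // Gauss-Jordan elimination with partial pivoting
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col]))
                    pivot = r;
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12)
                throw new IllegalArgumentException("inputs are linearly dependent");
            for (int r = 0; r < k; r++) {
                if (r == col) continue;
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= k; c++)
                    a[r][c] -= f * a[col][c];
            }
        }
        double[] beta = new double[k];
        for (int i = 0; i < k; i++)
            beta[i] = a[i][k] / a[i][i];
        return beta;
    }

    // Evaluates b0 + b1*inputs[0] + b2*inputs[1] + ...
    static double predict(double[] beta, double... inputs) {
        double result = beta[0];
        for (int i = 0; i < inputs.length; i++)
            result += beta[i + 1] * inputs[i];
        return result;
    }

    public static void main(String[] args) {
        // Toy data generated from y = 1 + 2*x1 + 3*x2
        double[][] x = {{1, 1}, {2, 1}, {1, 2}, {3, 2}, {2, 3}};
        double[] y = {6, 8, 9, 13, 14};
        double[] beta = fit(x, y);
        System.out.println("b0 = " + beta[0] + ", b1 = " + beta[1] + ", b2 = " + beta[2]);
        System.out.println("Prediction at (4, 5): " + predict(beta, 4, 5));
    }
}
```

Since the toy data satisfy y = 1 + 2*x1 + 3*x2 exactly, the fit recovers those three numbers up to rounding, and the prediction at the unseen point (4, 5) is 24. The same predict step is what produces a productivity figure for a rate and data usage pair that does not appear in the table.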

CONCLUSION

The need for performing statistical calculations is widespread in the programming world. Correlation and regression are applied in generating mathematical models (equations) for experimental results. Apache Commons Math provides a well-seasoned library for statistical calculations. We need only to provide the data, and the calculations are carried out without the burden of coding each operation.

 

Apache Commons Math has much more to offer a programmer than statistics, since it covers several areas of mathematics. I wish you more than AVERAGE productivity with the Apache Commons Math library.

REFERENCES

1. http://commons.apache.org/proper/commons-math/download_math.cgi

2. http://commons.apache.org/proper/commons-math/userguide/stat.html

About Author

V. Nagaradjane is a freelance programmer. He may be contacted at nagaradjanev@rediffmail.com
