Statistical Analysis of Large Datasets An Exploration of R - MySQL Interface

1 篇文章 0 订阅
Roger Koenker
University of Illinois

University of Illinois


Least Squares

Quantile Regression


LM: Least Squares on Large Datasets

We illustrate the approach with a model for infant birth weight based on the 1999 U.S. Detailed Natality Survey. The dataset contains roughly 2.4 million observations on 18 variables when loaded into R it requires about 275 MB of RAM. Furthermore, any nontrivial statistical procedure like least squares will further increase the memory requirements to values beyond the capabilities of the vast majority of R users.

With that in mind, we have been experimenting with techniques that would allow users to compute least squares solutions with extremely large datasets, while maintaining memory requirements at low (common) levels. The principles of ourpackage are:

0) Given N observations on a MySQL server;
1) Determine variables needed for X,y,w, etc
2) While (n < N)
{     
fetch a block of n rows. 
accumulate X'X and X'y
        
}
3) Compute least squares solution by Cholesky.
Here's a synopsis of the example included in the documentation of LM, which we (naturally) limit to 500 observations on 3 variables. Later, we will report on an example with 2,376,497 observations on 18 variables. 

We run the example over the network using a 56k V.90 modem connection with an Athlon 750 MHz with 384 MB of RAM powered by Linux 2.2.17 SuSE 6.4. The MySQL server is version 3.22.32. R-1.2.0 takes about 12 seconds to source and execute the code below:
                                                                               
> library(LM)
> m <-  MySQL()
> mycon <- dbConnect(m, user='%', dbname="LM",
+            host="galton.econ.uiuc.edu")
>  rs <- dbExecStatement( mycon, "SELECT * FROM LMData
+             WHERE ID < 501" )
                 
> summary(LM( weight ~ momage + boy, data=rs, conn=mycon,     NumStdErr=TRUE ))

Call:
LM(formula = weight ~ momage + boy, data = rs, NumStdErr = TRUE, conn = mycon)

Residuals:
Min
1Q 
Median
3Q
Max 
-2766.41
-314.82
45.06
350.15
1265.74
          

Coefficients:        

Estimate
Std. Error
t value
Pr(>|t|)
(Intercept) 
2947.916
123.967
23.780
<2e-16
momage       
11.121
4.411
2.521
0.0120
boy           
119.983
50.519
2.375
0.0179

           
Residual standard error:563.2 on 497 degrees of freedom
Multiple R-Squared: 0.0248,  Adjusted R-squared: 0.02088
F-statistic: 6.321 on 2 and 497 degrees of freedom,    p-value: 0.001947



In the next example, we run a model with 16 left-hand side variableson 2,376,497 observations. We run the example on a 100Mbps network with a dual Pentium III at 650 MHz with 512 MB of RAM powered by Linux 2.2.16-SMP SuSE 6.4. 

It takes only about 10.5 minutes to complete. And, total memory usage never exceeds 115 MB.

These numbers may not seem too impressive, but if we consider that in order to load the ascii dataset to R with 'scan' it used all 512 MB of RAM plus some 300 MB of swap space, there are certainly substantial memory gains. Furthermore, just re-opening R with the dataset in .RData takes over 2.5 minutes and requires 272 MB! We were unable to use `read.table'; it drained all memory resources (1GB = RAM + swap)! We also used the package 'stataread' which turned out to be the fastest and less memory intensive. Finally, MySQL also seems quite efficient reading ascii data, it took only 3 min. 41.31 sec.

It's possible to reduce memory requirements by fetching fewer observations at a time; in our example we fetched 100,000 each time. This will not necessarily result in longer waiting times, but it's certainly more feasible. It appears that most of the time is occupied loading each block into a dataframe. In a separate exercise, fetching data from a MySQL server took approximately the same time as reading the data from text files with 'read.table ' (scan is faster). So, it would be interesting to investigate whether significant efficiency gains are possible in this phase of the process. We would be very interested in comments about this aspect of LM.



Download Area:

LM - Version 0.1-0
And the also the required RMySQL package.




Quantile Regression

The LM development project can be considered to be a successful feasibility study for a comparable functionality for quantile regression. We hope to have an alpha version of the quantile regression software in the near future. Stay tuned for further developments.



MySQL

In order to use our package(s), you need to have access to a MySQL server, either locally or remotely. Therefore, we will briefly describe the installation, administration and client use of MySQL.

You can skip this section altogether if you have experience with MySQL.

MySQL is a Relational DataBase Management System (RDBMS) available at http://www.mysql.com, free of charge under the conditions described in the site.

Installation

You will need at least a MySQL client (to talk with a MySQL server). MySQL is available for several OS (Unix, Linux, FreeBSDand Windows). If you don't have access to a server, please consider installing your own server. Under SuSE 6.4, we installed the following rpm packages: mysqlclnt (the client), mysql (the server), mysqllib (libraries) and mysqldev (development files and libraries).


For additional instructions please refer to the manual  provided by MySQL, which is quite thorough. We encourage you to read at least the security tips.

Administration


This is intended to be a brief introduction to administering a server and it is fully based on our experience with SuSE 6.4.

Once you have installed MySQL server, you need to:

0) Start the mysql daemon (e.g. rcmysql start);
1) Assign a password to the user root (the administrator):
root@galton$ mysqladmin -u root password yourpasswd
2) Add a user with certain access privileges to databases. For illustrative purposes, we will consider the user "dummy" and the database "dbdummy" restricting access from the localhost and "world.dummy.org". 

                                      
anovo@galton$ mysql -u root -p mysql

a) Update the 'user' table

mysql> insert into user (host,user,password)
   ->   values('localhost','dummy',password('NOTeasy'));
mysql> insert into user (host,user,password)
   ->   values('world.dummy.org','dummy',password('NOTeasy'));

b) Update the 'db' table

mysql> insert into db
   -> (host,db,user,Select_priv,Insert_priv,Update_priv,Delete_priv,
   ->   Create_priv,Drop_priv)
   -> values ('%','dbdummy','dummy','Y','Y','Y','Y','Y','Y');

c) Create the necessary database (dbdummy). First, exit mysql. Then,

anovo@galton$ mysqladmin -u root -p create dbdummy
enter password: *******

Database "dbdummy" created

d) Reload the server 

anovo@galton$ mysqladmin -u root -p reload;
enter password: *******

3) Create and load the data into tables. 
Notice that we gave permission to 'dummy' ( see 2b) ).

dummy@galton$ mysql -p dbdummy
enter password: *******

Welcome to the MySQL monitor.  
Commands end with ; or \g.
Your MySQL connection id is 5602 
to server version: 3.22.32

Type 'help' for help.

mysql> CREATE TABLE dummynat(
   -> ID INT NOT NULL AUTO_INCREMENT,
   -> birmon smallint unsigned not null,
   -> weight float not null,
   -> boy tinyint unsigned not null,
   -> momage float not null,
   -> PRIMARY KEY (ID));
Query OK, 0 rows affected (0.00sec)

4) Although we defined the fields (e.g. birmon) and their characteristics (e.g., smallint,float, tinyint, not null), our table is still empty, we need to load the (ascii) data into the server:

mysql> LOAD DATA LOCAL INFILE"~/projects/sql-r/births.txt"
   -> INTO TABLE dummynat FIELDS TERMINATED BY ' ' 
   -> (birmon, weight, boy, momage);
Query OK, 2376497 rows affected(3 min 41.31sec)
Records: 2376497  Deleted: 0  Skipped: 0  Warnings: 0

We created a table named 'dummynat' with the following fields (variables): ID, birmon, weight, boy, momage. The fields of statistical interest are birmon (month of birth), weight (in grams), boy (indicator) and momage (mom's age). ID is an identification field to which we assign a PRIMARY KEY. (The text in red represents the MySQL prompt; our input is in black). Notice how we omitted the field ID from our LOAD statement; ID is automatically created (and incremented).

Client Use

This section illustrates basic SQL statements to retrieve specific pieces of information contained in 'dummynat' table.

mysql> SHOW tables;
+--------------+
| Tables in LM |
+--------------+
| dummynat     |
+--------------+
1 row in set (0.16 sec)

mysql> SELECT * FROM dummynat LIMIT 2;
+----+--------+---------+-----+--------+
| ID | birmon | weight  | boy | momage |
+----+--------+---------+-----+--------+
|   1|      1 | 4054.00 |   0 |  36.00 |
|   2|      2 | 3487.00 |   0 |  20.00 |
+----+--------+---------+-----+--------+
2 rows inset (0.22 sec)

mysql> SELECT weight FROM dummynat WHERE boy=1 AND momage>40;
+--------+
| weight |
+--------+
| 3799.00|
| 3459.00|
| 4014.00|
| 2268.00|
| 2580.00|
| 3146.00|
| 3125.00|
| 3525.00|
| 2268.00|
| 2965.00|
| 3771.00|
+--------+
11 rows in set (0.24 sec)



devshed.com has several references. An introduction to MySQL Administration is given here and first principles can be found here.
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
统计静态时序分析(Statistical Static Timing Analysis,SSTA)是用于微处理器组件的一种方法。它是一种在静态时序分析的基础上引入统计模型的技术,用于评估微处理器的时序性能。 传统的静态时序分析方法假设所有的工艺变差都是确定性的,并通过考虑最差情况下的工艺变差来评估时序性能。然而,在现代微电子制造中,工艺变差是不可避免的,并且具有随机性质。因此,传统的分析方法可能会导致过于保守的结果。 在统计静态时序分析方法中,不再假设工艺变差是确定性的,而是将其建模为随机变量。通常使用均值和方差来描述工艺变差的统计特性。通过对工艺变差进行统计建模,可以更准确地评估时序性能。 在进行统计静态时序分析时,需要建立包括工艺变差模型、电路模型和时序分析算法的完整方法链。首先,需要获取和建模工艺变差数据,可以使用实测数据或者基于模拟仿真的方法。然后,需要开发适用于统计分析的电路模型,考虑工艺变差的随机性质。最后,使用统计分析算法对时序性能进行评估,例如使用蒙特卡洛方法进行抽样分析。 统计静态时序分析方法的主要优势是能够提供更准确和实用的时序性能评估。它可以考虑不确定性因素,提供更加精确的结果,并且可以为设计者提供更多的灵活性和决策支持,以优化设计和制造过程。但是,统计分析方法相对传统的方法更加复杂,并且需要更多的计算资源和时间。 总之,统计静态时序分析方法是用于微处理器组件的一种先进的时序分析技术。它可以更准确地评估时序性能,为设计者提供更多的灵活性和决策支持。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值