Plotting Very Large Data Sets

Latest recommended article published on 2022-08-08 22:55:07

城西小霸王

Category: Quantitative Ecology. Tags: R, big data, statistics

Article link: https://blog.csdn.net/sunfishtan/article/details/120420623

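I recently ran into a problem: I needed to draw a scatter plot, but the data set had more than ten million points.

R can do it, but it either runs out of memory, takes a very long time, or produces an enormous output file that leaves Adobe Acrobat spinning when you try to open it.

Clearly, a different approach is needed.

I found a method online for sampling from a very large data set, and I am sharing it here for anyone who needs it.
Original link: Plotting of very large data sets in R

Important caveat: this only applies when the purpose of the plot is to show a trend, not to inspect the positions of specific points. Otherwise, sampling could amount to manipulating the data, so use it according to your own needs.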

The problem is that you can't load all the data into memory. So you could sample the data, as indicated earlier by @Marek. On such huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but this should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.
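To make the 1% claim concrete, here is a minimal sketch that compares quantiles of a full data set with quantiles of a 1% random sample. The data are simulated with rnorm() purely for illustration (an assumption, not the data from the original question), and the names full, smp and max_diff are made up for this example.

```r
set.seed(42)                      # fixed seed so the sketch is reproducible
full <- rnorm(1e6)                # simulated stand-in for a large data set (assumption)
smp  <- sample(full, size = round(length(full) * 0.01))  # 1% random sample

# Compare a few quantiles of the full data and of the sample
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
comparison <- rbind(full   = quantile(full, probs),
                    sample = quantile(smp, probs))
print(round(comparison, 3))

# Largest absolute gap between the two sets of quantiles
max_diff <- max(abs(comparison["full", ] - comparison["sample", ]))
print(max_diff)
```

For well-behaved data like this the gap should be small. Heavy-tailed data or rare subgroups may need a larger fraction, so it is worth checking on your own data.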

If you can't subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame in text format when it gets too big. If you play a bit with the connection, you could easily convert this to a socketConnection or another connection type to read the data from a server, a database, or whatever. Just make sure you open the connection in the correct mode.
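As a rough illustration of swapping the connection type, the sketch below reads a gzip-compressed CSV in chunks through gzfile() opened in text mode ("rt"). The temporary file, its columns (id, value) and the chunk size of 200 are assumptions made up for this example. The point is only that read.table() continues from the current position of an already open connection.

```r
# Create a small compressed demo file (hypothetical data, for illustration only)
tmp <- tempfile(fileext = ".csv.gz")
gz <- gzfile(tmp, open = "wt")
write.csv(data.frame(id = 1:500, value = (1:500) * 2), gz,
          row.names = FALSE, quote = FALSE)
close(gz)

# Open the compressed file for reading in text mode
con <- gzfile(tmp, open = "rt")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # first line holds the column names

# Each call picks up where the previous one stopped
chunk1 <- read.table(con, nrows = 200, sep = ",", col.names = header)
chunk2 <- read.table(con, nrows = 200, sep = ",", col.names = header)
close(con)
unlink(tmp)

str(chunk1)
head(chunk2, 3)
```

The same pattern should carry over to other text-mode connections such as url() or socketConnection(), though I have not tested those here.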

Given a simple .csv file, the following function samples a fraction p of the data:

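# Define the sampling function: read the file in chunks of n lines
# and keep a random fraction p of each chunk
sample.df <- function(f, n = 10000, split = ",", p = 0.1) {
  con <- file(f, open = "rt")
  on.exit(close(con, type = "rt"))
  y <- data.frame()
  # Read the header, skipping any leading empty lines
  x <- character(0)
  while (length(x) == 0) {
    x <- strsplit(readLines(con, n = 1), split)[[1]]
  }
  Names <- x
  # Read and sample the data chunk by chunk
  repeat {
    # read.table() raises an error once no lines are left; treat that as end of file
    x <- tryCatch(read.table(con, nrows = n, sep = split), error = function(e) NULL)
    if (is.null(x)) break
    names(x) <- Names
    nn <- nrow(x)
    # sample.int() draws row indices for this chunk
    id <- sample.int(nn, round(nn * p))
    y <- rbind(y, x[id, , drop = FALSE])
  }
  rownames(y) <- NULL
  return(y)
}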

An example of usage:

# Generate a data file
Df <- data.frame(
  X1 = 1:10000,
  X2 = 1:10000,
  X3 = rep(letters[1:10], 1000)
)
write.csv(Df, file = "test.txt", row.names = FALSE, quote = FALSE)

# n is the number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt", n = 1000, p = 0.2)
str(DF2)

# Clean up the data file
unlink("test.txt")
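
Since the original goal was a scatter plot, here is a minimal sketch of plotting a sample instead of every point. The data are simulated in memory for illustration (an assumption; in practice the sample would come from a function like sample.df() above), and the names big, smp and out are made up for this example. Semi-transparent points via adjustcolor() help show density where points overlap.

```r
set.seed(1)
n <- 1e5                                   # stand-in for a much larger data set (assumption)
big <- data.frame(x = rnorm(n))
big$y <- 2 * big$x + rnorm(n)              # simulated linear trend plus noise

# Keep a 1% random sample of the rows
idx <- sample.int(nrow(big), size = round(0.01 * nrow(big)))
smp <- big[idx, ]

# Write the plot to a PDF; with only the sampled points the file stays small
out <- tempfile(fileext = ".pdf")
pdf(out, width = 7, height = 5)
plot(smp$x, smp$y,
     pch = 16, cex = 0.5,
     col = adjustcolor("black", alpha.f = 0.3),   # semi-transparent points
     xlab = "x", ylab = "y",
     main = "1% random sample")
dev.off()

file.size(out)                             # size of the output file in bytes
```

If the trend looks the same across a few different seeds, the sample is probably large enough for the purpose described at the top of this post.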
