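I recently ran into a problem: I needed to draw a scatter plot, but the data set has more than ten million points.
R can do it, but it either runs out of memory, takes a very long time, or produces an enormous output file that leaves Adobe Acrobat spinning when you try to open it.
Clearly, a different approach is needed.
I found a method online for sampling from a very large data set, and I am sharing it here for anyone who needs it.
Original post: Plotting of very large data sets in R
Important caveat: this only makes sense when the goal of the plot is to show a trend rather than the exact position of particular points. Otherwise it risks looking like data manipulation, so use it according to your own needs.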
The problem is that you can’t load all the data into memory. So you could sample the data, as indicated earlier by @Marek. On such huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but this should give a very decent approximation. It is essentially the same as the “randomized method” described in the link @aix gave.
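To get a feel for the claim that a 1% sample gives essentially the same result, here is a minimal sketch on synthetic data. The normal distribution, the seed, and the variable names (full, samp, probs) are assumptions made for illustration only; real data may behave differently, especially in the tails.

```r
# Minimal sketch: compare quantiles of a full vector with those of a 1% sample.
set.seed(42)                          # fixed seed so the run is repeatable
full <- rnorm(1e6)                    # synthetic stand-in for a large data set
idx  <- sample.int(length(full), size = round(0.01 * length(full)))
samp <- full[idx]                     # 1% simple random sample without replacement

probs  <- c(0.05, 0.25, 0.5, 0.75, 0.95)
q_full <- quantile(full, probs)       # quantiles of all one million values
q_samp <- quantile(samp, probs)       # quantiles of the 10,000 sampled values

# Side-by-side comparison; the last column is the absolute difference
print(round(cbind(full = q_full, sample = q_samp,
                  abs_diff = abs(q_full - q_samp)), 4))
```

The differences shrink as the sample grows, so the sampling fraction can be tuned to the precision the plot actually needs.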
If you can’t subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame stored in text format when it gets too big. If you play a bit with the connection, you could easily convert this to a socketConnection or another connection type to read the data from a server, a database, whatever. Just make sure you open the connection in the correct mode.
Take a simple .csv file; the following function then samples a fraction p of the data:
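# Define the sampling function: read f in chunks of n lines, keep a fraction p of each chunk
sample.df <- function(f, n = 10000, split = ",", p = 0.1) {
  con <- file(f, open = "rt")
  on.exit(close(con))
  y <- data.frame()
  # Read the header, skipping any leading empty lines
  x <- character(0)
  while (length(x) == 0) {
    x <- strsplit(readLines(con, n = 1), split)[[1]]
  }
  Names <- x
  # Read and process the data chunk by chunk
  repeat {
    # read.table() raises an error once the connection is exhausted; use that to stop
    x <- tryCatch(read.table(con, nrows = n, sep = split), error = function(e) NULL)
    if (is.null(x)) break
    names(x) <- Names
    nn <- nrow(x)
    # Draw round(nn * p) row indices from this chunk, without replacement
    id <- sample(seq_len(nn), round(nn * p))
    y <- rbind(y, x[id, , drop = FALSE])
  }
  rownames(y) <- NULL
  return(y)
}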
An example of the usage:
# Generate a data file
Df <- data.frame(
  X1 = 1:10000,
  X2 = 1:10000,
  X3 = rep(letters[1:10], 1000)
)
write.csv(Df, file = "test.txt", row.names = FALSE, quote = FALSE)
# n is the number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt", n = 1000, p = 0.2)
str(DF2)
# Clean up the data file
unlink("test.txt")
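Coming back to the scatter plot that started all this: once a sample is in memory, plotting it is cheap. The sketch below is only an illustration under assumptions that are not from the original answer: the synthetic columns x and y, the 5% fraction, the output file name, and the choice of the pdf() device are all mine. A raster device such as png() is another option when the output should stay small no matter how many points are drawn.

```r
# Minimal sketch: scatter plot of a random sample instead of every point.
set.seed(1)
n  <- 2e5
df <- data.frame(x = rnorm(n))        # synthetic stand-in for the large data set
df$y <- 2 * df$x + rnorm(n)           # linear trend plus noise

p   <- 0.05                           # sampling fraction (assumed value)
sub <- df[sample.int(nrow(df), size = round(nrow(df) * p)), ]

out <- file.path(tempdir(), "scatter_sample.pdf")   # hypothetical output path
pdf(out, width = 6, height = 6)
# pch = "." and a semi-transparent colour keep dense regions readable
plot(sub$x, sub$y, pch = ".", col = rgb(0, 0, 0, 0.3),
     xlab = "x", ylab = "y",
     main = sprintf("%d of %d points", nrow(sub), nrow(df)))
dev.off()
```

Only the sampled points are written to the file, so it stays small enough to open quickly, while the overall trend should remain visible.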