Reading Data with R (Practical Data Science with R, Chapter 2)

但为世同

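This post shows how to read and inspect structured data in R, using the car dataset from the UCI Machine Learning Repository as the example, and how to handle semi-structured data. It checks the data type with class() and outlines an approach for data that is not well structured.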

1. Reading data from files with R

1.1 Reading structured data with R

Take the car data from the University of California Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/) as an example:

uciCar <- read.table(
    'http://www.win-vector.com/dfiles/car.data.csv',
    sep = ',',      # fields are separated by commas
    header = TRUE   # the first line holds the column names
)

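The variable uciCar now holds the data from the linked file. The read.table() command is also quite capable: if the file name ends in .gz, read.table() decompresses the file automatically while loading the data.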
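The decompression behavior can be tried locally without any download. The following is a minimal sketch, assuming a small made-up table written to a temporary gzip file; the names `original`, `gz_path`, and `restored` are illustrative and not part of the original post.

```r
# Write a small comma-separated table into a gzip-compressed temporary file.
# The two rows are made-up values used only for this illustration.
original <- data.frame(buying = c('vhigh', 'low'), doors = c(2L, 4L))
gz_path <- tempfile(fileext = '.csv.gz')
con <- gzfile(gz_path, 'w')
write.table(original, con, sep = ',', row.names = FALSE, quote = FALSE)
close(con)

# read.table() is given only the path; no explicit decompression step is written.
restored <- read.table(gz_path, sep = ',', header = TRUE)
print(restored)
```

This sketch covers a local compressed file only; whether a compressed file behind a URL is handled the same way is not checked here.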

Once the data has been read, we take a first look at it. The following commands are useful:

Check the type of the data with class():

class(uciCar)

Summarize the data with summary():

summary(uciCar)
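To try these commands without network access, here is a small sketch on an inline sample. The three rows and the name `sample_cars` are assumptions made for illustration (the column names are assumed to match the car file's header), and dim() is added as one more quick check alongside class() and summary().

```r
# A tiny inline sample standing in for the downloaded file (illustrative rows only).
csv_text <- "buying,maint,doors,persons,lug_boot,safety,rating
vhigh,vhigh,2,2,small,low,unacc
vhigh,med,2,4,small,high,acc
low,low,4,more,big,high,vgood"
sample_cars <- read.table(text = csv_text, sep = ',', header = TRUE)

print(class(sample_cars))    # the kind of R object that read.table() returned
print(dim(sample_cars))      # number of rows, then number of columns
print(summary(sample_cars))  # per-column overview of the values
```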

1.2 Reading semi-structured data with R

Some data is not well structured, for example the data at http://mng.bz/mZbu:

A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 ...
A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 ...
A14 12 A34 A46 20

d <- read.table(
    paste('http://archive.ics.uci.edu/ml/',
          'machine-learning-databases/statlog/german/german.data', sep = ''),
    stringsAsFactors = FALSE,  # keep the coded values as character strings
    header = FALSE             # the file has no header row
)
print(d[1:3, ])
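The effect of header=FALSE can be seen on a small offline sample. This sketch uses only the leading fields of the first two rows shown above (the remaining fields are elided there and are not reconstructed); `raw_text` and `sample_d` are illustrative names.

```r
# The leading fields of the first two sample rows; whitespace is the default separator.
raw_text <- "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4
A12 48 A32 A43 5951 A61 A73 2 A92 A101 2"
sample_d <- read.table(text = raw_text, stringsAsFactors = FALSE, header = FALSE)

# With no header row, read.table() supplies placeholder names V1, V2, ...
print(colnames(sample_d))

# Coded fields such as 'A11' are read as character, plain numbers as integer.
print(sapply(sample_d, class))
```

The placeholder names carry no meaning, which is why descriptive column names are worth assigning by hand.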

We can set the column names:

# one name per column, in the order the columns appear in the file
colnames(d) <- c('Status.of.existing.checking.account',
    'Duration.in.month', 'Credit.history', 'Purpose',
    'Credit.amount', 'Savings account/bonds',
    'Present.employment.since',
    'Installment.rate.in.percentage.of.disposable.income',
    'Personal.status.and.sex', 'Other.debtors/guarantors',
    'Present.residence.since', 'Property', 'Age.in.years',
    'Other.installment.plans', 'Housing',
    'Number.of.existing.credits.at.this.bank', 'Job',
    'Number.of.people.being.liable.to.provide.maintenance.for',
    'Telephone', 'foreign.worker', 'Good.Loan')
# recode the outcome: 1 becomes 'GoodLoan', any other value 'BadLoan'
d$Good.Loan <- as.factor(ifelse(d$Good.Loan == 1, 'GoodLoan', 'BadLoan'))
print(d[1:3, ])
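The same two steps, renaming and recoding, can be sketched on a toy table. The data frame `toy` and its values are hypothetical and are used only to show what colnames(), ifelse(), and as.factor() do here.

```r
# Hypothetical two-column table; the values are made up for illustration.
toy <- data.frame(V1 = c(6, 48, 12), V2 = c(1, 2, 1))

# Assign descriptive names in column order.
colnames(toy) <- c('Duration.in.month', 'Good.Loan')

# Recode the numeric outcome: 1 becomes 'GoodLoan', everything else 'BadLoan'.
toy$Good.Loan <- as.factor(ifelse(toy$Good.Loan == 1, 'GoodLoan', 'BadLoan'))

print(toy)
print(levels(toy$Good.Loan))  # factor levels are sorted alphabetically by default
```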

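The c() command builds a vector.
The dataset's documentation explains what the codes mean. For example, in column 4, A40 is a new car loan and A41 is a used car loan. We collect all of these explanations into a list: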

mapping <- list(
'A40'='car (new)',
'A41'='car (used)',
'A42'='furniture/equipment',
'A43'='radio/television',
'A44'='domestic appliances',
...
)

We use a loop to replace the codes in the data:

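for (i in seq_len(ncol(d))) {
    # only the character columns hold coded values such as 'A43'
    if (is.character(d[, i])) {
        # look up each code in mapping, then store the column as a factor
        d[, i] <- as.factor(as.character(mapping[d[, i]]))
    }
}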

Take a look at the data:

print(d[1:3,'Purpose'])
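The lookup-and-replace step can be run end to end on a toy table. This is a sketch under stated assumptions: `toy` is a hypothetical data frame with made-up rows, and `mapping` here contains only three of the codes listed earlier rather than the full list.

```r
# A shortened mapping, using three of the code descriptions given earlier.
mapping <- list(
  'A40' = 'car (new)',
  'A41' = 'car (used)',
  'A43' = 'radio/television'
)

# Hypothetical table with one coded column and one numeric column.
toy <- data.frame(Purpose = c('A43', 'A40', 'A43'),
                  Duration.in.month = c(6, 48, 12),
                  stringsAsFactors = FALSE)

for (i in seq_len(ncol(toy))) {
  # numeric columns are skipped; only character columns are decoded
  if (is.character(toy[, i])) {
    # indexing the list by a character vector looks up every code at once
    toy[, i] <- as.factor(as.character(mapping[toy[, i]]))
  }
}

print(toy)
```

Every code in the toy column is present in the shortened mapping; codes missing from the list would need separate handling.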

If the code shown here is incomplete and does not run, the book's authors have shared all of the code:

https://github.com/WinVector/zmPDSwR/tree/master/Statlog/
