R语言小白学习笔记3—R语言读取数据

最新推荐文章于 2024-07-20 09:40:13 发布

LL_2048

最新推荐文章于 2024-07-20 09:40:13 发布

阅读量1.3w

点赞数 6

分类专栏： R语言学习笔记文章标签： r语言大数据

本文链接：https://blog.csdn.net/LL_2048/article/details/113563379

版权

R语言学习笔记专栏收录该内容

14 篇文章

订阅专栏

本文详细介绍R语言中多种数据读取方式，包括CSV、Excel、数据库等常见格式及网络数据，涵盖基本函数使用及常见问题解决方法。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

笔记链接

学习笔记1—R语言基础.
学习笔记2—高级数据结构.

想说的话

我觉得R语言读取数据还是比较重要的，这章真的，遇到了特别多的问题，然后一个问题一个问题去解决，所以花了两天才学完。笔记中我也给出了我的解决思路和方法，每一个例子我都敲了一遍，加深印象。

学习笔记3—R语言读取数据

3.1 读取CSV文件

读取CSV文件最好的方法是使用read.table函数，其返回结果为data.frame

read.table函数第一个参数为文件所在路径，本地或网页
第二个参数header表示数据的第一行，即列名
第三个参数sep表示数据的分隔符
例：

> theUrl <- "http://www.jaredlander.com/data/TomatoFirst.csv"
> tomato <- read.table(file=theUrl, header=TRUE, sep=",")
> head(tomato)
  Round             Tomato Price      Source Sweet Acid
1     1         Simpson SM  3.99 Whole Foods   2.8  2.8
2     1  Tuttorosso (blue)  2.99     Pioneer   3.3  2.8
3     1 Tuttorosso (green)  0.99     Pioneer   2.8  2.6
4     1     La Fede SM DOP  3.99   Shop Rite   2.6  2.8
5     2       Cento SM DOP  5.49  D Agostino   3.3  3.1
6     2      Cento Organic  4.99  D Agostino   3.2  2.9
  Color Texture Overall Avg.of.Totals Total.of.Avg
1   3.7     3.4     3.4          16.1         16.1
2   3.4     3.0     2.9          15.3         15.3
3   3.3     2.8     2.9          14.3         14.3
4   3.0     2.3     2.8          13.4         13.4
5   2.9     2.8     3.1          14.4         15.2
6   2.9     3.1     2.9          15.5         15.1

大文件使用read.table函数读取到内存比较慢，读取大CSV文件两个主流函数：read_delim和fread，因为两者都不把字符数据自动转换成factor。

3.1.1 read_delim函数

readr包提供读取文本文件的一系列函数。最常用的是read_delim函数。
其第一个参数为读取的文件路径或者URL。col_names默认为TRUE，指定文件的第一行为列名。
其返回为tibble数据类型，是data.frame的扩展。
例：

> theUrl <- "http://www.jaredlander.com/data/TomatoFirst.csv"
> tomato2 <- read_delim(file=theUrl, delim=',')

-- Column specification ---------------------------------
cols(
  Round = col_double(),
  Tomato = col_character(),
  Price = col_double(),
  Source = col_character(),
  Sweet = col_double(),
  Acid = col_double(),
  Color = col_double(),
  Texture = col_double(),
  Overall = col_double(),
  `Avg of Totals` = col_double(),
  `Total of Avg` = col_double()
)

> tomato2
# A tibble: 16 x 11
   Round Tomato Price Source Sweet  Acid Color Texture
   <dbl> <chr>  <dbl> <chr>  <dbl> <dbl> <dbl>   <dbl>
 1     1 Simps~  3.99 Whole~   2.8   2.8   3.7     3.4
 2     1 Tutto~  2.99 Pione~   3.3   2.8   3.4     3  
 3     1 Tutto~  0.99 Pione~   2.8   2.6   3.3     2.8
 4     1 La Fe~  3.99 Shop ~   2.6   2.8   3       2.3
 5     2 Cento~  5.49 D Ago~   3.3   3.1   2.9     2.8
 6     2 Cento~  4.99 D Ago~   3.2   2.9   2.9     3.1
 7     2 La Va~  3.99 Shop ~   2.6   2.8   3.6     3.4
 8     2 La Va~  3.99 Faicos   2.1   2.7   3.1     2.4
 9     3 Stani~  4.53 Resta~   3.4   3.3   4.1     3.2
10     3 Ciao   NA    Other    2.6   2.9   3.4     3.3
11     3 Scott~  0    Home ~   1.6   2.9   3.1     2.4
12     3 Di Ca~ 12.8  Eataly   1.7   3.6   3.8     2.3
13     4 Trade~  1.49 Trade~   3.4   3.3   4       3.6
14     4 365 W~  1.49 Whole~   2.8   2.7   3.4     3.1
15     4 Muir ~  3.19 Whole~   2.9   2.8   2.7     3.2
16     4 Biona~  3.39 Whole~   2.4   3.3   3.4     3.2
# ... with 3 more variables: Overall <dbl>, `Avg of
#   Totals` <dbl>, `Total of Avg` <dbl>

在这里安装readr包的时候出现了一些问题：

第一个问题：

WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding。

解决：问度娘->需要下载Rtools->下载

然后是漫长的等待时间，这里我下载的是rtools40，为了节省大家时间，安装包见：
https://download.csdn.net/download/LL_2048/14984549.

第二个问题：

downloaded length ！= reported length

导致需要下载的好多zip文件都不能用。

解决：问度娘->是因为下载速度太慢，所以出现了长度不相同的问题->修改设定CRAN镜像

然后终于导入好了：
在这里插入图片描述

3.1.2 fread函数

另一个读取大量数据的函数是data.table包的fread函数。

第一个参数是读取的文件路径
header参数表示第一行是列名
sep指定分隔符

其返回结果为data.table对象，是data.frame的扩展和优化
例：

> library(data.table)
> tomato3 <- fread(input=theUrl, sep=',', header=TRUE)
试开URL’http://www.jaredlander.com/data/TomatoFirst.csv'
Content type 'text/csv' length 1107 bytes
downloaded 1107 bytes

> head(tomato3)
   Round             Tomato Price      Source Sweet Acid
1:     1         Simpson SM  3.99 Whole Foods   2.8  2.8
2:     1  Tuttorosso (blue)  2.99     Pioneer   3.3  2.8
3:     1 Tuttorosso (green)  0.99     Pioneer   2.8  2.6
4:     1     La Fede SM DOP  3.99   Shop Rite   2.6  2.8
5:     2       Cento SM DOP  5.49  D Agostino   3.3  3.1
6:     2      Cento Organic  4.99  D Agostino   3.2  2.9
   Color Texture Overall Avg of Totals Total of Avg
1:   3.7     3.4     3.4          16.1         16.1
2:   3.4     3.0     2.9          15.3         15.3
3:   3.3     2.8     2.9          14.3         14.3
4:   3.0     2.3     2.8          13.4         13.4
5:   2.9     2.8     3.1          14.4         15.2
6:   2.9     3.1     2.9          15.5         15.1

3.2 读取Excel数据

使用readxl包中的read_excel函数，但其需要先下载好文件。
原书中的代码：

> download.file(url='http://www.jaredlander.com/data/ExcelExample.xlsx', destfile='data/ExcelExample.xlsx', method='curl')
> library(readxl)
> excel_sheets('data/ExcelExample.xlsx')

但我遇到了好多问题：

首先是用download.file时出现了一些问题：
一直出现这个错误：

Error in download.file(url = “http://www.jaredlander.com/data/ExcelExample.xlsx”, :
‘curl’ call had nonzero exit status

百度后发现解决方案都在说linux，所以难道是因为windows中没有curl方法吗？于是：

>install.packages("RCurl")

走起！

安装了之后还是这样的问题：
在这里插入图片描述
阿西吧！

于是我想看下download.file的帮助文件

> help("download.file")

在这里插入图片描述

这里提到了method里的其他方法，于是我试了一下auto，发现真的解决了上边那个问题，但新的问题：

>download.file(url='http://www.jaredlander.com/data/ExcelExample.xlsx', destfile='data/ExcelExample.xlsx', method='auto')
Error in download.file(url = "http://www.jaredlander.com/data/ExcelExample.xlsx",  : 
  无法打开目的文件'data/ExcelExample.xlsx'，原因是'No such file or directory'

我想了想，这里可能是没有建立data文件夹的问题，所以我无奈地把data删了（当时想法比较单纯，我只想把他下载下来）

> download.file(url='http://www.jaredlander.com/data/ExcelExample.xlsx', destfile='ExcelExample.xlsx', method='auto')
试开URL’http://www.jaredlander.com/data/ExcelExample.xlsx'
Content type 'application/vnd.openxmlformats' length 2045364 bytes (2.0 MB)
downloaded 2.0 MB

这里终于下载好了

然而

> excel_sheets('ExcelExample.xlsx')
错误: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo

我直接反手就是…… ……复制粘贴->百度

可惜百度上也没有解决方法，于是我检查了一下下载的excel文件：
在这里插入图片描述
阿西吧，为啥打不开嘞？

我猜测可能是下载的时候出现损坏，或者原来这个文件就是坏的。

于是无奈复制粘贴了网址直接百度下载，终于下载好了。

> excel_sheets('E:/B/R/ExcelExample.xlsx')
[1] "Tomato" "Wine"   "ACS"

太感动了！继续！

这里用的read_excel函数进行读取，返回的是tibble类型的数据。

> tomatoXL <- read_excel('E:/B/R/ExcelExample.xlsx')
> tomatoXL
# A tibble: 16 x 11
   Round Tomato Price Source Sweet  Acid Color Texture
   <dbl> <chr>  <dbl> <chr>  <dbl> <dbl> <dbl>   <dbl>
 1     1 Simps~  3.99 Whole~   2.8   2.8   3.7     3.4
 2     1 Tutto~  2.99 Pione~   3.3   2.8   3.4     3  
 3     1 Tutto~  0.99 Pione~   2.8   2.6   3.3     2.8
 4     1 La Fe~  3.99 Shop ~   2.6   2.8   3       2.3
 5     2 Cento~  5.49 D Ago~   3.3   3.1   2.9     2.8
 6     2 Cento~  4.99 D Ago~   3.2   2.9   2.9     3.1
 7     2 La Va~  3.99 Shop ~   2.6   2.8   3.6     3.4
 8     2 La Va~  3.99 Faicos   2.1   2.7   3.1     2.4
 9     3 Stani~  4.53 Resta~   3.4   3.3   4.1     3.2
10     3 Ciao   NA    Other    2.6   2.9   3.4     3.3
11     3 Scott~  0    Home ~   1.6   2.9   3.1     2.4
12     3 Di Ca~ 12.8  Eataly   1.7   3.6   3.8     2.3
13     4 Trade~  1.49 Trade~   3.4   3.3   4       3.6
14     4 365 W~  1.49 Whole~   2.8   2.7   3.4     3.1
15     4 Muir ~  3.19 Whole~   2.9   2.8   2.7     3.2
16     4 Biona~  3.39 Whole~   2.4   3.3   3.4     3.2
# ... with 3 more variables: Overall <dbl>, `Avg of
#   Totals` <dbl>, `Total of Avg` <dbl>

可以通过提供表的位置索引（数字）或者表的名称（字符）来读取Excel文件的指定表：

> wineXL1 <- read_excel('E:/B/R/ExcelExample.xlsx', sheet=2)
> head(wineXL1)
# A tibble: 6 x 14
  Cultivar Alcohol `Malic acid`   Ash `Alcalinity of ~
     <dbl>   <dbl>        <dbl> <dbl>            <dbl>
1        1    14.2         1.71  2.43             15.6
2        1    13.2         1.78  2.14             11.2
3        1    13.2         2.36  2.67             18.6
4        1    14.4         1.95  2.5              16.8
5        1    13.2         2.59  2.87             21  
6        1    14.2         1.76  2.45             15.2
# ... with 9 more variables: Magnesium <dbl>, `Total
#   phenols` <dbl>, Flavanoids <dbl>, `Nonflavanoid
#   phenols` <dbl>, Proanthocyanins <dbl>, `Color
#   intensity` <dbl>, Hue <dbl>, `OD280/OD315 of diluted
#   wines` <dbl>, Proline <dbl>

3.3 读取数据库数据

大部分数据库都可以通过各种驱动访问（通常是ODBC连接）。最流行的开源数据库有R语言实验包，如：RPostgreSQL和RMySQL。

这里用简单的SQLite来演示读取数据库：
首先用download.file函数下载数据库文件：

> download.file("http://www.jaredlander.com/data/diamonds.db",destfile="E:/B/R/diamonds.db", mode='wb')

竟然一次就成功了，有点兴奋，继续：

安装加载RSQLite包：

> install.packages("RSQLite")
> library(RSQLite)

首先用dbDriver函数指定驱动连接数据库，该函数主要参数为驱动的类型，如"SQLite"或者"ODBC"

> drv <- dbDriver('SQLite')
> class(drv)
[1] "SQLiteDriver"
attr(,"package")
[1] "RSQLite"

然后用dbConnect函数建立数据库连接，第一个参数是驱动，第二个常用参数是DSN连接字符串或路径。

> con <- dbConnect(drv, 'E:/B/R/diamonds.db')
> class(con)
[1] "SQLiteConnection"
attr(,"package")
[1] "RSQLite"

现在就已经连上数据库了，用DBI包的函数可以查看更多：

> dbListTables(con)
[1] "DiamondColors" "diamonds"      "sqlite_stat1" 
> dbListFields(con, name='diamonds')
 [1] "carat"   "cut"     "color"   "clarity" "depth"  
 [6] "table"   "price"   "x"       "y"       "z"

使用dbGetQuery函数准备执行查询数据库，返回data.frame数据类型：

> diamondsTable <- dbGetQuery(con, "SELECT * FROM diamonds", stringAsFactors=FALSE)
> head(diamondsTable)
  carat       cut color clarity depth table price    x
1  0.23     Ideal     E     SI2  61.5    55   326 3.95
2  0.21   Premium     E     SI1  59.8    61   326 3.89
3  0.23      Good     E     VS1  56.9    65   327 4.05
4  0.29   Premium     I     VS2  62.4    58   334 4.20
5  0.31      Good     J     SI2  63.3    58   335 4.34
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94
     y    z
1 3.98 2.43
2 3.84 2.31
3 4.07 2.31
4 4.23 2.63
5 4.35 2.75
6 3.96 2.48

3.4 读取其他统计工具数据

foreign包提供了一些用于读取其他格式的数据的命令

部分常用统计工具数据导入的函数：

函数	读取格式
read.spss	SPSS
read.dta	Stata
read.ssd	SAS
read.octave	Octave
read.mtp	Minitab
read.systat	Systat

这些函数的返回是data.frame，但并不是每次都会成功，虽然read.ssd可以读入SAS数据，但它需要有效的SAS许可证。

Hadley Wickham重写了新包haven，优化了速度和易用性，返回结果为tibble类型，常用函数：

函数	读取格式
read_spss	SPSS
read_sas	Stata（书上是这样写的，但我觉得这里是SAS）
read_stata	Systat（同上，我觉得这里是Stata）

3.5 读取R语言二进制文件

与其他程序员一起工作时，传递数据或R对象最好方法是用Rdata文件，可储存多个对象，并在三大系统间传递

例：创建一些对象存储在RData文件中，删除后，再加载

> n <- 20
> r <- 1:10
> w <- data.frame(n, r)
> n
[1] 20
> w
    n  r
1  20  1
2  20  2
3  20  3
4  20  4
5  20  5
6  20  6
7  20  7
8  20  8
9  20  9
10 20 10
> save(n, r, w, file="E:/B/R/multiple.rdata")
> rm(n, r, w)
> n
错误: 找不到对象'n'
> load("E:/B/R/multiple.rdata")
> n
[1] 20

这些对象存放到工作环境时和存储到RData中名字相同，所以不用把load函数结果分配给对象。

函数saveRDS将对象保存到一个二进制RDS文件中，此时保存的对象没有名字，所以加载到工作环境后需要分配对象：

> smallVector <- c(1, 5, 4)
> smallVector
[1] 1 5 4
> saveRDS(smallVector, file='thisObject.rds')
> thatVect <- readRDS('thisObject.rds')
> thatVect
[1] 1 5 4
> identical(smallVector, thatVect)
[1] TRUE

3.6 读取R语言数据

R和一些软件包自带数据

只需要输入data( )即可
例：

> data(diamonds, package = 'ggplot2')
> head(diamonds)
# A tibble: 6 x 10
  carat cut   color clarity depth table price     x     y
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98
2 0.21  Prem~ E     SI1      59.8    61   326  3.89  3.84
3 0.23  Good  E     VS1      56.9    65   327  4.05  4.07
4 0.290 Prem~ I     VS2      62.4    58   334  4.2   4.23
5 0.31  Good  J     SI2      63.3    58   335  4.34  4.35
6 0.24  Very~ J     VVS2     62.8    57   336  3.94  3.96
# ... with 1 more variable: z <dbl>

3.7 读取网页数据

3.7.1读取HTML表格

如果数据存储在HTML表格，可以使用XML包中的readHTMLTable函数抽取数据。
例：

> theURL <- "https://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
> bowlPool <- readHTMLTable(theURL, which=1, header=FALSE, stringsAsFactors=FALSE)

结果出现了错误：

错误: failed to load external entity “http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool”

可能是网站变了的缘故？

于是我用浏览器进入这个网址，发现前边是https，然后修改：

> theURL <- "https://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
> bowlPool <- readHTMLTable(theURL, which=1, header=FALSE, stringsAsFactors=FALSE)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
此外: Warning message:
XML content does not seem to be XML: ''

出错！

百度发现好像不能抓取https的网页，额~好吧，但现在网页好像都是https，可能学到后边会有新的方法吧，go on！

3.7.2抽取网页数据

网页数据一般在table、div、span或者其他html元素中。本例以纽约Ribalta比萨网站的菜单和门店信息为例，地址电话存储在ordered list中，列标记在span元素中，物品和价格在table中。

这里使用rvest包进行数据抽取。

网页文件用read_html函数读取：

> library(rvest)
> ribalta <- read_html('http://www.jaredlander.com/data/ribalta.html')
> class(ribalta)
[1] "xml_document" "xml_node"    
> ribalta
{html_document}
<html xmlns="http://www.w3.org/1999/xhtml">
[1] <head>\n<meta http-equiv="Content-Type" content="t ...
[2] <body>\r\n<ul>\n<li class="address">\r\n    <span  ...

通过查看HTML，可以看到地址存储在span中，首先用html_nodes函数选择ul元素里的所有span元素：

> ribalta %>% html_nodes('ul') %>% html_nodes('span')
{xml_nodeset (6)}
[1] <span class="street">48 E 12th St</span>
[2] <span class="city">New York</span>
[3] <span class="zip">10003</span>
[4] <span>\r\n    \t<span id="latitude" value="40.7333 ...
[5] <span id="latitude" value="40.733384"></span>
[6] <span id="longitude" value="-73.9915618"></span>

管道符见笔记1.8管道

这将返回ul元素的所有span节点的列表，接下来使用html_nodes函数搜索带有’.street’类的任意元素：

> ribalta %>% html_nodes('.street') 
{xml_nodeset (1)}
[1] <span class="street">48 E 12th St</span>

为了实现数据抽取，可以调用html_text函数抽取span元素内的文本：

> ribalta %>% html_nodes('.street') %>% html_text()
[1] "48 E 12th St"

当待抽取的信息数据存储在html元素的属性中，则需要使用html_attr函数：

例，抽取ID’longitude’的span元素属性：

> ribalta %>% html_nodes('#longitude') %>% html_attr('value')
[1] "-73.9915618"

本例许多信息存储在’food-items’类的table中，所以通过html_nodes函数搜索’food-items’类的table，因为有多个表格，我们需要的为第六个，这里用magrittr包里的extract2函数对表格选取。

使用html_table函数抽取的数据最终存为data.frame数据类型：

> ribalta %>% html_nodes('table.food-items') %>% magrittr::extract2(5) %>% html_table()
                       X1
1    Marinara Pizza Rosse
2         Doc Pizza Rosse
3 Vegetariana Pizza Rosse
4    Brigante Pizza Rosse
5     Calzone Pizza Rosse
6   Americana Pizza Rosse
                                                              X2
1                                     basil, garlic and oregano.
2                                  buffalo mozzarella and basil.
3                 mozzarella cheese, basil and baked vegetables.
4                       mozzarella cheese, salami and spicy oil.
5 ricotta, mozzarella cheese, prosciutto cotto and black pepper.
6                          mozzarella cheese, wurstel and fries.
  X3
1  9
2 15
3 15
4 15
5 16
6 16

3.8 读取JOSN数据

对于API和文档数据库特别常用的数据格式是JSON，它是基于纯文本的可嵌套的数据格式。

读取JSON数据两个主要的包：rjson和rjsonlite

fromJSON函数读取JSON文件进入R语言，解析JSON文本，默认其将数据简化为data.frame数据类型：

> library(jsonlite)
> pizza <- fromJSON('http://www.jaredlander.com/data/PizzaFavorites.json')
> pizza
                    Name
1          Di Fara Pizza
2          Fiore's Pizza
3              Juliana's
4     Keste Pizza & Vino
5  L & B Spumoni Gardens
6 New York Pizza Suprema
7           Paulie Gee's
8                Ribalta
9              Totonno's
                                 Details
1     1424 Avenue J, Brooklyn, NY, 11230
2   165 Bleecker St, New York, NY, 10012
3  19 Old Fulton St, Brooklyn, NY, 11201
4   271 Bleecker St, New York, NY, 10014
5      2725 86th St, Brooklyn, NY, 11223
6       413 8th Ave, New York, NY, 10001
7 60 Greenpoint Ave, Brooklyn, NY, 11222
8      48 E 12th St, New York, NY, 10003
9  1524 Neptune Ave, Brooklyn, NY, 11224