Working with the arrow package in R

Article link: https://blog.csdn.net/weixin_45906368/article/details/131979975

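Contents

Creating a Parquet file with arrow
- Install the package
- Load the package
- Create a data frame
- Write the data to a Parquet file with write_parquet()
Reading the Parquet file
Working with the data
1. Reading and writing data: single files
- Converting an existing object to an Arrow Table
- Converting an Arrow Table to a data frame
- Writing a Parquet file: writing a single Parquet file to disk
- Reading a Parquet file: reading a single Parquet file into memory
- - Reading the file in as an Arrow Table
- Selecting columns when reading a Parquet file
- Writing a Feather V2/Arrow IPC file
- Reading a Feather V2/Arrow IPC file
- Writing a streaming Arrow IPC file (Arrow IPC stream format)
- Reading a streaming Arrow IPC file (Arrow IPC stream format)
- Writing a CSV file (writing Arrow data to a single CSV file)
- Reading a CSV file (reading a single CSV file into memory)
- Reading a JSON file (reading a JSON file into memory)
2. Reading and writing data: multiple files
- Writing data to disk: Parquet
- Writing partitioned data: Parquet
- Reading partitioned data
- Writing data to disk: Feather/Arrow IPC format
- Reading Feather/Arrow IPC data as an Arrow Dataset
- Writing data to disk: CSV format
- Reading CSV data as an Arrow Dataset
- Reading a CSV dataset (no headers)
- Writing compressed partitioned data (in progress)
- Reading compressed data (in progress)
3. Creating Arrow objects
- Creating an Arrow Array from an R object
- Creating an Arrow Table from an R object
- Viewing the contents of an Arrow Table/RecordBatch
- Manually creating a RecordBatch from an R object
4. Defining data types
- Updating the data type of an existing Arrow Array with cast
- Updating the data type of a field in an existing Arrow Table
- Specifying data types when creating an Arrow Table from an R object
- Specifying data types when reading a file
5. Manipulating data: Arrays
- Filtering by values matching a predicate or mask
- Computing the mean/min/max etc. of an Array
- Counting occurrences of elements in an Array
- Applying arithmetic functions to Arrays
- Calling Arrow compute functions directly on Arrays
6. Manipulating data: Tables
- Using dplyr verbs in Arrow
- Using R functions in dplyr verbs in Arrow
- Using Arrow functions in dplyr verbs in Arrow
7. Using PyArrow from R
- Creating an Arrow object using PyArrow in R
- Calling a PyArrow function from R
8. Flight
- Connecting to a Flight server
- Sending data to a Flight server
- Checking what resources exist on a Flight server
- Retrieving data from a Flight server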

Creating a Parquet file with arrow

Install the package

install.packages("arrow")

Load the package

library(arrow)

Create a data frame

data <- data.frame(column1 = c(1, 2, 3), column2 = c("a", "b", "c"), column3 = c(TRUE, FALSE, TRUE))

Write the data to a Parquet file with write_parquet()

arrow::write_parquet(data, "C:/Users/86133/Desktop/RData/data1.parquet")

Reading the Parquet file

data <- arrow::read_parquet("C:/Users/86133/Desktop/RData/data1.parquet")

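Working with the data

Once the file has been read in, the data can be inspected and manipulated. Note that arrow itself does not export select(), filter(), or sort(); these operations come from dplyr, for which arrow provides bindings. You can use the following calls:

Inspect the schema:

arrow::arrow_table(data)$schema

This prints the column names and their Arrow data types.

Select specific columns:

dplyr::select(data, column1, column2)

This selects the named columns and returns a new data frame.

Filter rows:

dplyr::filter(data, column1 > 1)

This keeps the rows that satisfy the condition and returns a new data frame.

Sort rows:

dplyr::arrange(data, column1)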

1. Reading and writing data: single files

Converting an existing object to an Arrow Table

air_table <- arrow_table(airquality)

Converting an Arrow Table to a data frame

air_df <- as.data.frame(air_table)
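As a quick check of the two conversions above, the following minimal sketch round-trips `airquality` through an Arrow Table and compares the result with the original. The name `round_trip` is only illustrative.

```r
library(arrow)

# Convert the built-in airquality data frame to an Arrow Table
air_table <- arrow_table(airquality)

# The Table carries a schema: column names plus Arrow data types
print(air_table$schema)

# Convert back to an R data frame
round_trip <- as.data.frame(air_table)

# Row count and column names should survive the round trip
nrow(round_trip) == nrow(airquality)
identical(names(round_trip), names(airquality))
```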

Writing a Parquet file: writing a single Parquet file to disk

# Create table
my_table <- arrow_table(data.frame(group = c("A", "B", "C"), score = c(99, 97, 99)))

# Write to Parquet
write_parquet(my_table, "my_table.parquet")

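Reading a Parquet file: reading a single Parquet file into memory

parquet_tb1 <- read_parquet("my_table.parquet")

Because the as_data_frame argument was left at its default value (TRUE), the file is read in as a data frame:

class(parquet_tb1)
[1] "data.frame"

Reading the file in as an Arrow Table

my_table_arrow <- read_parquet("my_table.parquet", as_data_frame = FALSE)

class(my_table_arrow)
[1] "Table" "ArrowTabular" "ArrowObject" "R6"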

Selecting columns when reading a Parquet file

# Create table to read back in
dist_time <- arrow_table(data.frame(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))

# Write to Parquet
write_parquet(dist_time, "dist_time.parquet")

# Read in only the "time" column
time_only <- read_parquet("dist_time.parquet", col_select = "time")
time_only

Writing a Feather V2/Arrow IPC file

my_table <- arrow_table(data.frame(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_feather(my_table, "my_table.arrow")

Reading a Feather V2/Arrow IPC file

my_feather_tbl <- read_feather("my_table.arrow")

Writing a streaming Arrow IPC file (Arrow IPC stream format)

my_table <- arrow_table(data.frame(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_ipc_stream(my_table, "my_table.arrows")

Reading a streaming Arrow IPC file (Arrow IPC stream format)

my_ipc_stream <- arrow::read_ipc_stream("my_table.arrows")

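Writing a CSV file (writing Arrow data to a single CSV file)

write_csv_arrow(cars, "cars.csv")

Reading a CSV file (reading a single CSV file into memory)

my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE)

Reading a JSON file (reading a JSON file into memory)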

# Create a file to read back in
tf <- tempfile()
writeLines('
    {"country": "United Kingdom", "code": "GB", "long": -3.44, "lat": 55.38}
    {"country": "France", "code": "FR", "long": 2.21, "lat": 46.23}
    {"country": "Germany", "code": "DE", "long": 10.45, "lat": 51.17}
  ', tf, useBytes = TRUE)

# Read in the data
countries <- read_json_arrow(tf, col_select = c("country", "long", "lat"))
countries


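2. Reading and writing data: multiple files

Writing data to disk: Parquet

Write the data to disk as a dataset. With no partitioning, write_dataset() creates the target directory and stores the data in a single Parquet file inside it.

write_dataset(dataset = airquality, path = "airquality_data")

Writing partitioned data: Parquet

Save the data to disk as multiple Parquet files, partitioned by a column in the data.

write_dataset(airquality, "airquality_partitioned", partitioning = c("Month"))

This creates one folder per value of the supplied partitioning variable, Month.

list.files("airquality_partitioned")
[1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"

Reading partitioned data

Read the partitioned data files in as an Arrow Dataset.

air_data <- open_dataset("airquality_partitioned")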
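An opened dataset is lazy: rows are only read when a query is collected. The sketch below, which assumes dplyr is installed, writes `airquality` partitioned by Month to a temporary directory (the path `part_dir` is a placeholder for this example), then filters on the partition column before pulling the result into R.

```r
library(arrow)
library(dplyr)

# Write airquality partitioned by Month into a temporary directory
# (placeholder path chosen for this sketch)
part_dir <- file.path(tempdir(), "airquality_partitioned_demo")
write_dataset(airquality, part_dir, partitioning = "Month")

# Open the dataset; this only scans file metadata
air_data <- open_dataset(part_dir)

# Filter on the partition column, pick columns, then collect into R
july <- air_data %>%
  filter(Month == 7) %>%
  select(Day, Temp, Month) %>%
  collect()

nrow(july)
```

Because the filter is on the partitioning column, Arrow can skip the folders for the other months instead of reading every file.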

Writing data to disk: Feather/Arrow IPC format

write_dataset(dataset = airquality, path = "airquality_data_feather", format = "feather")

Reading Feather/Arrow IPC data as an Arrow Dataset

# write Arrow file to use in this example
write_dataset(dataset = airquality, path = "airquality_data_arrow", format = "arrow")
# read into R
open_dataset("airquality_data_arrow", format = "arrow")

Writing data to disk: CSV format

write_dataset(dataset = airquality, path = "airquality_data_csv", format = "csv")

Reading CSV data as an Arrow Dataset

# write CSV file to use in this example
write_dataset(dataset = airquality, path = "airquality_data_csv", format = "csv")
# read into R
open_dataset("airquality_data_csv", format = "csv")

Reading a CSV dataset (no headers)

Read in a dataset made up of CSV files that have no header row.

# write CSV file to use in this example
dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")]
dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")]

dir.create("airquality")
write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE)
write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE)

# read into R
open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))

If the dataset is made up of headerless CSV files, you must supply the name of each column. There are several ways to do this: via the column_names argument (as shown above) or via a schema:

open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))

Writing compressed partitioned data (in progress)

Reading compressed data (in progress)
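The two headings above are left as placeholders. As a minimal single-file sketch of the idea (not the partitioned case), `write_parquet()` accepts a `compression` argument, and `read_parquet()` decompresses transparently. Codec support depends on how arrow was built, so the example checks `codec_is_available()` first; the file path `gz_file` is a placeholder chosen here.

```r
library(arrow)

# Placeholder path for this sketch
gz_file <- file.path(tempdir(), "airquality_gzip.parquet")

# gzip may not be compiled into every arrow build, so check before using it
if (codec_is_available("gzip")) {
  # Write a gzip-compressed Parquet file
  write_parquet(airquality, gz_file, compression = "gzip")

  # Reading needs no extra arguments: the codec is recorded in the file
  air_gz <- read_parquet(gz_file)
  print(dim(air_gz))
}
```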

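3. Creating Arrow objects

Creating an Arrow Array from an R object

Convert an existing vector in R to an Arrow Array object.

score <- c(97.99, 86)
score_array <- Array$create(score)

Creating an Arrow Table from an R object

Convert an existing data frame in R to an Arrow Table object.

my_tibble <- tibble::tibble(group = c("A", "B", "C"), score = c(99, 45, 89))   # create a data frame
my_table <- arrow_table(my_tibble)   # convert the data frame to an Arrow Table

Viewing the contents of an Arrow Table/RecordBatch

dplyr::collect(my_table)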
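Besides collecting everything into R, a Table or RecordBatch can be inspected in place. The sketch below uses a small base data frame (`my_df`, an illustrative name) and shows a few accessors that avoid a full conversion.

```r
library(arrow)

my_df <- data.frame(group = c("A", "B", "C"), score = c(99, 45, 89))
my_table <- arrow_table(my_df)
my_record <- record_batch(my_df)

# Shape and schema are available without converting to R
my_table$num_rows
my_record$num_columns
my_table$schema

# Pull a single column out as an R vector
score_vec <- as.vector(my_table$score)
score_vec

# Convert the whole object back to a data frame
as.data.frame(my_record)
```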

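Manually creating a RecordBatch from an R object

Convert an existing data frame in R to an Arrow RecordBatch object.

my_tibble <- tibble::tibble(group = c("A", "B", "C"), score = c(99, 45, 89))   # create a data frame
my_record <- record_batch(my_tibble)   # convert the data frame to a RecordBatch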

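4. Defining data types

Updating the data type of an existing Arrow Array with cast

integer_arr <- Array$create(1:5)   # create an Array
uint_arr <- integer_arr$cast(target_type = uint8())   # cast the Array to another type

Updating the data type of a field in an existing Arrow Table

Change the type of one or more fields in an existing Arrow Table.

oscars <- tibble::tibble(actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), num_awards = c(4, 3, 3))   # create a data frame
oscars_arrow <- arrow_table(oscars)   # convert it to an Arrow Table
oscars_schema <- schema(actor = string(), num_awards = int16())   # new data types for the fields
oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema)   # cast the Table to the new schema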
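To confirm that a cast did what was intended, the types can be read back before and after. This is a small sketch reusing the Array and Table from above; the variables `type_before` and `type_after` are introduced here only for illustration.

```r
library(arrow)

# Array-level cast: R integers are stored as Arrow int32 by default
integer_arr <- Array$create(1:5)
uint_arr <- integer_arr$cast(uint8())
integer_arr$type$ToString()
uint_arr$type$ToString()

# Table-level cast: R doubles are stored as Arrow double (float64) by default
oscars_arrow <- arrow_table(data.frame(
  actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
  num_awards = c(4, 3, 3)
))
oscars_schema <- schema(actor = string(), num_awards = int16())
oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema)

# Compare the type of num_awards before and after the cast
type_before <- oscars_arrow$schema$GetFieldByName("num_awards")$type$ToString()
type_after <- oscars_arrow_int$schema$GetFieldByName("num_awards")$type$ToString()
c(before = type_before, after = type_after)
```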

Specifying data types when creating an Arrow Table from an R object

oscars <- tibble::tibble(actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), num_awards = c(4, 3, 3))   # create a data frame
oscars_schema <- schema(actor = string(), num_awards = int16())   # define the field types
oscars_data_arrow <- arrow_table(oscars, schema = oscars_schema)   # convert the data frame to an Arrow Table with the given types

Specifying data types when reading a file

Manually specify Arrow data types when reading a file.

oscars <- tibble::tibble(actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), num_awards = c(4, 3, 3))   # create a data frame
write_dataset(oscars, path = "oscars_data")   # write a dataset to disk
oscars_schema <- schema(actor = string(), num_awards = int16())   # define the data type of each column
oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema)   # open the dataset and apply the column types

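5. Manipulating data: Arrays

Filtering by values matching a predicate or mask

Search an Array for values that match a predicate condition.

my_values <- Array$create(c(1:5, NA))
my_values[my_values > 3]   # keep the values greater than 3

Computing the mean/min/max etc. of an Array

my_values <- Array$create(c(1:5, NA))
mean(my_values, na.rm = TRUE)  # compute the mean, ignoring null values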
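Both operations above return Arrow objects (an Array and a Scalar) rather than plain R values. A minimal sketch of bringing the results back into R with `as.vector()`; the names `big_values` and `avg` are illustrative.

```r
library(arrow)

my_values <- Array$create(c(1:5, NA))

# Filtering with a predicate returns another Arrow Array
big_values <- my_values[my_values > 3]
as.vector(big_values)

# Aggregations return an Arrow Scalar; as.vector() gives a plain R number
avg <- as.vector(mean(my_values, na.rm = TRUE))
avg

# The same pattern works for other summaries
as.vector(min(my_values, na.rm = TRUE))
as.vector(max(my_values, na.rm = TRUE))
```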

Counting occurrences of elements in an Array

repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
value_counts(repeated_vals)

Applying arithmetic functions to Arrays

Use arithmetic operators on Array objects.

num_array <- Array$create(1:10)
num_array + 10

Calling Arrow compute functions directly on Arrays

first_100_numbers <- Array$create(1:100)
call_function("variance", first_100_numbers, options = list(ddof = 0))     # variance of 1 to 100, with delta degrees of freedom set to 0
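To see which names `call_function()` accepts, `list_compute_functions()` returns the compute functions available in the installed Arrow library. The sketch below also cross-checks the `ddof = 0` result against base R, where `var()` divides by n - 1 rather than n; `pop_var` and `r_pop_var` are illustrative names.

```r
library(arrow)

# Names of the compute functions exposed by the installed Arrow library
all_fns <- list_compute_functions()
head(all_fns)
"variance" %in% all_fns

first_100_numbers <- Array$create(1:100)
pop_var <- call_function("variance", first_100_numbers, options = list(ddof = 0))

# With ddof = 0 this is the population variance; rescale R's sample variance to match
r_pop_var <- var(1:100) * 99 / 100
all.equal(as.vector(pop_var), r_pop_var)
```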

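6. Manipulating data: Tables

Using dplyr verbs in Arrow

Use dplyr syntax on an Arrow Table.

library(dplyr)   # provides the verbs, %>%, and the starwars data
arrow_table(starwars) %>%
  filter(species == "Human") %>%
  mutate(height_ft = height/30.48) %>%
  select(name, height_ft) %>%
  collect()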
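dplyr verbs on an Arrow Table build a lazy query; nothing runs until the query is evaluated. The sketch below uses `mtcars` instead of `starwars` to keep it small, and contrasts `compute()`, which keeps the result as an Arrow Table, with `collect()`, which returns an R data frame. The variable names are illustrative.

```r
library(arrow)
library(dplyr)

# Build a query; no computation happens yet
query <- arrow_table(mtcars) %>%
  filter(cyl == 4) %>%
  select(mpg, cyl)

# The pipeline is a lazy query object, not a data frame
class(query)

# compute() evaluates the query but keeps the result in Arrow memory
four_cyl_table <- compute(query)

# collect() evaluates the query and returns the result to R
four_cyl_df <- collect(query)
nrow(four_cyl_df)
```

Keeping intermediate results in Arrow with `compute()` can be useful when further Arrow operations follow; `collect()` is the step that brings data into R.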

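Using R functions in dplyr verbs in Arrow

library(dplyr)
arrow_table(starwars) %>%
  filter(species == "Human", homeworld == "Tatooine") %>%
  collect()

Using Arrow functions in dplyr verbs in Arrow

# str_detect() is a stringr function; arrow translates it to an Arrow compute call
arrow_table(starwars) %>%
  filter(str_detect(name, "Darth")) %>%
  collect()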

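7. Using PyArrow from R

Creating an Arrow object using PyArrow in R

Use PyArrow to create an Arrow object in an R session.

library(reticulate)
pa <- import("pyarrow")   # import the pyarrow Python module
pyarrow_scalar <- pa$scalar(42)   # create an Arrow object
pyarrow_scalar

Calling a PyArrow function from R

Call a PyArrow function on Arrow Tables created in R.

table_1 <- arrow_table(mtcars[1:5,])
table_2 <- arrow_table(mtcars[11:15,])
pa$concat_tables(tables = list(table_1, table_2)) %>%
  collect()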