3-02-1 Data Loading: CSV

Yehchitsai

Last modified: 2022-04-19 10:00:12

Category: Python Data Processing; Tags: Python

First published: 2022-04-17 15:26:13

Article link: https://blog.csdn.net/m0_50614038/article/details/124230278

3.2 Data Loading and Storage

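Whether the task is big data analysis or machine learning, the first step is to load raw data into the system. Raw data comes in many forms: log files, dataset files, open data published on the web, databases, and so on. The pandas I/O API is a set of reader/writer functions. For example, pandas.read_csv() reads a CSV file and returns a pandas DataFrame object, and the corresponding writers are DataFrame methods such as DataFrame.to_csv(). The table below lists commonly used reader and writer functions. This section covers three common ways to load data with pandas: CSV, Excel, and JSON.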

Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	OpenDocument	read_excel	to_excel
binary	HDF5 Format	read_hdf	to_hdf
SQL	SQL	read_sql	to_sql
SQL	Google BigQuery	read_gbq	to_gbq
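
As a minimal sketch of how a reader/writer pair from the table fits together, the snippet below writes a DataFrame to CSV text with to_csv() and parses it back with read_csv(). The sample data and variable names are made up for illustration, and an in-memory io.StringIO buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"city": ["Montreal", "Quebec"], "count": [12, 7]})

# Writer: serialize the DataFrame to CSV text (index=False omits the row index).
csv_text = df.to_csv(index=False)

# Reader: parse the CSV text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored.equals(df))
```

The other pairs in the table follow the same pattern: a pandas.read_* function produces a DataFrame and a DataFrame.to_* method writes one out.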

Loading CSV Files

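A CSV (Comma-Separated Values) file is a text file that uses commas as delimiters. The pandas read_csv() function loads a CSV file from a local path or a remote URL. Because read_csv() accepts a great many parameters, only a simple case is shown here, practicing on the open data used in the pandas cookbook. The example first fetches the data directly from a URL. The dataset is large, so the nrows parameter limits the read to the first 100 rows. Once the data is loaded, the shape attribute gives the dimensions of the dataset and info() prints a summary of it. Data processing usually reads the same file many times, so it is common to download the file once and work with the local copy, as the second half of the example does. info() reports the following: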

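Number of rows (RangeIndex: 100 entries)
Number of columns (total 45 columns)
The index, name, and data type of each column (0, Date, object)
Count of columns per data type: in the example below, the 45 columns comprise 5 float64, 39 int64, and 1 object
Memory usage (memory usage: 35.3+ KB)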
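
The same facts that info() prints can also be read programmatically. The following is a small sketch on made-up data (the column names counter_a and counter_b and the generated rows are assumptions for illustration); it shows nrows limiting the read, and the attributes that correspond to the items listed above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a much larger CSV file: 60 data rows.
csv_text = "Date,counter_a,counter_b\n" + "".join(
    f"2020-01-01 00:{m:02d},{m},{m / 2}\n" for m in range(60)
)

# nrows limits how many data rows are parsed (the header line is not counted).
df = pd.read_csv(io.StringIO(csv_text), nrows=10)

n_rows, n_cols = df.shape                       # (rows, columns)
dtype_counts = df.dtypes.value_counts()         # number of columns per dtype
total_bytes = df.memory_usage(deep=True).sum()  # deep=True counts string contents

print(n_rows, n_cols)
print(dtype_counts)
print(total_bytes)
df.info()  # prints the summary itself and returns None
```

Because info() returns None, wrapping it in print() adds a trailing "None" line to the output.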

Example

import pandas as pd

dataURL = 'http://donnees.ville.montreal.qc.ca/dataset/f170fecc-18db-44bc-b4fe-5b0b6d2c7297/resource/ca3e704f-6145-40ad-9e73-f7be707c4932/download/comptage_velo_2020.csv'

# Load only the first 100 rows from the remote file.
print("Loading a remote CSV file")
df = pd.read_csv(dataURL, nrows=100)
print(df.shape)
df.info()  # info() prints the summary itself and returns None
print(df[:3])

# Load the full local copy of the same file.
dataDir = './data/comptage_velo_2020.csv'
print("Loading a local CSV file")
df2 = pd.read_csv(dataDir)
print(df2.shape)
df2.info()
print(df2[:3])

Output:

Loading a remote CSV file
(100, 45)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 45 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                100 non-null    object 
 1   compteur_100054073  100 non-null    int64  
 2   compteur_100052606  100 non-null    int64  
 3   compteur_100003032  100 non-null    int64  
 ...
 43  compteur_100047030  100 non-null    int64  
 44  compteur_100057052  100 non-null    int64  
dtypes: float64(5), int64(39), object(1)
memory usage: 35.3+ KB
               Date  compteur_100054073  compteur_100052606  \
0  2020-01-01 00:00                   0                   0   
1  2020-01-01 00:15                   0                   0   
2  2020-01-01 00:30                   0                   0   

   compteur_100003032  compteur_100053057  compteur_100053058  \
0                   0                   0                   0   
1                   1                   0                   0   
2                   0                   0                   0   
...
[3 rows x 45 columns]

Loading a local CSV file
(28019, 45)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28019 entries, 0 to 28018
Data columns (total 45 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                28019 non-null  object 
 1   compteur_100054073  28019 non-null  int64  
 2   compteur_100052606  28019 non-null  int64  
 3   compteur_100003032  28019 non-null  int64  
 ...
 43  compteur_100047030  28019 non-null  int64  
 44  compteur_100057052  28019 non-null  int64  
dtypes: float64(18), int64(26), object(1)
memory usage: 9.6+ MB
            Date  compteur_100054073  compteur_100052606  compteur_100003032  \
0  2020/1/1 0:00                   0                   0                   0   
1  2020/1/1 0:15                   0                   0                   1   
2  2020/1/1 0:30                   0                   0                   0   

   compteur_100053057  compteur_100053058  compteur_100012218  \
0                   0                 0.0                 1.0   
1                   0                 0.0                 0.0   
2                   0                 0.0                 0.0
...
[3 rows x 45 columns]
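
The earlier advice to download a file once and then work from the local copy can be sketched as a small helper. load_with_cache is a hypothetical function written for this illustration, not part of pandas; it relies only on read_csv() accepting either a URL or a local path.

```python
from pathlib import Path

import pandas as pd


def load_with_cache(source, cache_path):
    """Read cache_path if it exists; otherwise read source and save a copy.

    Hypothetical helper: `source` may be a URL or a local path, since
    pandas.read_csv accepts both.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Later runs read the local copy and never touch the network.
        return pd.read_csv(cache_path)
    df = pd.read_csv(source)
    # Create the target folder if needed, then store the copy without the index.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df
```

With the names from the example above, a call might look like load_with_cache(dataURL, './data/comptage_velo_2020.csv'): the first call downloads and saves the file, and subsequent calls load the saved copy.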

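Some files resemble CSV but use a different delimiter, such as a space, a colon, or a tab. The file shown in the figure below (1) uses tabs as the delimiter and (2) is encoded as GB2312 because its content is Chinese. read_csv() can still read such a file; it just needs additional parameters to specify the details: sep='\t' sets the delimiter to a tab, and encoding='GB18030' sets the encoding. Although the file is encoded as GB2312, if decoding errors occur, it is advisable to switch to GB18030, a compatible encoding that covers more characters.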

Figure 3-2-1 A file that uses tabs as the delimiter

Example

import pandas as pd

dataDir = './data/2021_duplicate checking2.txt'
# Tab-separated text; decode with GB18030, which is compatible with GB2312.
df = pd.read_csv(dataDir, sep='\t', encoding='GB18030')
print(df.shape)
df.info()  # info() prints the summary itself and returns None

Output:

(1007, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   序号          1007 non-null   int64  
 1   专业          1007 non-null   object 
 2   状态          1007 non-null   object 
 3   检测结果        1007 non-null   object 
 4   重合字数        1007 non-null   object 
 5   总字数         1007 non-null   object 
 6   去除引用        1007 non-null   object 
 7   去除本人        1007 non-null   object 
 8   中英文互检文字复制比  305 non-null    object 
 9   中英文互检重合字数   305 non-null    float64
 10  上传日期        305 non-null    object 
dtypes: float64(1), int64(1), object(9)
memory usage: 86.7+ KB
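
To try the sep and encoding parameters without the original data file, the sketch below builds a few tab-separated lines in memory, encodes them as GB2312, and decodes them with GB18030. The sample rows are made up for illustration, and io.BytesIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical tab-separated content with Chinese headers and values.
text = "序号\t专业\t状态\n1\t软件技术\t已上传\n2\t信息安全与管理\t已上传\n"

# Encode as GB2312 to imitate a file saved with that legacy encoding.
raw = text.encode("gb2312")

# GB18030 is compatible with GB2312, so it decodes the same bytes correctly.
df = pd.read_csv(io.BytesIO(raw), sep="\t", encoding="GB18030")

print(df.shape)
print(df.columns.tolist())
```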

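A DataFrame can be sliced by its index and column labels (columns). In addition, the pandas loc attribute accesses data by label, while iloc is primarily integer-position based. Note that a label slice with loc includes its end label, whereas a position slice with iloc excludes its end position, which is why 3:5 returns three rows with loc but two rows with iloc.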

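Example

# Columns used below: 专业 (major), 状态 (status), 检测结果 (check result)
print("Slice by row position and column labels: rows 0-4, columns '专业', '状态', '检测结果'")
print(df[:5][['专业', '状态', '检测结果']])
print("\nSlice by index labels and column labels: index labels 3, 4, 5, columns '专业', '状态', '检测结果'")
print(df.loc[3:5, ['专业', '状态', '检测结果']])  # slice by label (end label included)
print("\nSlice by row position and column position: row positions 3 and 4, column positions 1, 2, 3")
print(df.iloc[3:5, 1:4])  # slice by position (end position excluded)

Output:

Slice by row position and column labels: rows 0-4, columns '专业', '状态', '检测结果'
        专业   状态 检测结果
0  信息安全与管理  已上传   0%
1  信息安全与管理  已上传   0%
2     软件技术  已上传   0%
3     软件技术  已上传   0%
4  计算机网络技术  已上传   0%

Slice by index labels and column labels: index labels 3, 4, 5, columns '专业', '状态', '检测结果'
        专业   状态 检测结果
3     软件技术  已上传   0%
4  计算机网络技术  已上传   0%
5  计算机网络技术  已上传   0%

Slice by row position and column position: row positions 3 and 4, column positions 1, 2, 3
        专业   状态 检测结果
3     软件技术  已上传   0%
4  计算机网络技术  已上传   0%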
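
The difference between loc and iloc is easier to see on a tiny frame. This is a minimal sketch with made-up data (the column names major and status are assumptions for illustration): the same 3:5 slice returns three rows by label but two rows by position, and the distinction still holds when the index consists of strings.

```python
import pandas as pd

# Hypothetical six-row frame with the default integer index 0..5.
df = pd.DataFrame({"major": list("ABCDEF"), "status": ["ok"] * 6})

by_label = df.loc[3:5, ["major", "status"]]  # labels 3, 4, 5: end label included
by_position = df.iloc[3:5, 0:2]              # positions 3, 4: end position excluded
print(len(by_label), len(by_position))       # 3 2

# With a string index, loc slices by label and iloc still slices by position.
named = df.set_index("major")
print(named.loc["B":"D"].index.tolist())     # ['B', 'C', 'D']
print(named.iloc[1:3].index.tolist())        # ['B', 'C']
```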