How to Read Data with the pandas Library in Python: Learning the Pandas Library in Python (Part 2): Reading and Writing Data


weixin_39578197


Article link: https://blog.csdn.net/weixin_39578197/article/details/111799414


1. I/O API Tools

Read function     Write function
read_csv          to_csv
read_excel        to_excel
read_hdf          to_hdf
read_sql          to_sql
read_json         to_json
read_html         to_html
read_stata        to_stata
read_clipboard    to_clipboard
read_pickle       to_pickle
read_msgpack      to_msgpack
read_gbq          to_gbq

(read_msgpack and to_msgpack were removed in pandas 1.0.)

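2. Reading and Writing CSV Files

A file in which the elements of each row are separated by commas is called a CSV (comma-separated values) file.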

2.1. Reading Data from a CSV File

Simple read

excited.csv

white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

code.py

>>> import pandas as pd
>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv')
>>> csvframe
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
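To try the simple read without creating a file on disk, the same data can be fed to read_csv from an in-memory buffer. This is only a minimal sketch: io.StringIO stands in for the excited.csv path and is an assumption of this example, not part of the original code.

```python
import io

import pandas as pd

# Same content as excited.csv, held in memory instead of on disk.
csv_text = """white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
"""

# read_csv accepts a file-like object as well as a path.
csvframe = pd.read_csv(io.StringIO(csv_text))

print(csvframe.shape)          # (rows, columns)
print(list(csvframe.columns))  # names taken from the first line of the data
print(csvframe.dtypes)         # column types inferred by pandas
```

The first line of the data becomes the column names, and the row index is generated automatically because none was specified.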

Specifying the header with header and names

excited.csv

1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

code.py

>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv', header=None)
>>> csvframe
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse
>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv', names=['white', 'red', 'blue', 'green', 'animal'])
>>> csvframe
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
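The difference between header and names can be sketched side by side. The third call below covers a case the example above does not: a file that already has a header line which should be replaced. The in-memory buffers and the short stand-in header `w,r,b,g,a` are assumptions made for illustration.

```python
import io

import pandas as pd

data = "1,5,2,3,cat\n2,7,8,5,dog\n3,3,6,7,horse\n"
colors = ['white', 'red', 'blue', 'green', 'animal']

# header=None: the file has no header line, so columns are numbered 0..4.
no_header = pd.read_csv(io.StringIO(data), header=None)

# names=[...]: supply the column names yourself; every line is treated as data.
named = pd.read_csv(io.StringIO(data), names=colors)

# If the file already has a header line, header=0 together with names
# discards that line and uses the supplied names instead.
with_header = "w,r,b,g,a\n" + data
replaced = pd.read_csv(io.StringIO(with_header), header=0, names=colors)

print(list(no_header.columns))
print(list(named.columns))
print(replaced)
```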

Creating a DataFrame with a hierarchical index

excited.csv

color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4

code.py

>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv', index_col=['color', 'status'])
>>> csvframe
              item1  item2  item3
color status
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4
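Once two columns have been turned into a hierarchical index, rows can be addressed by one or both levels. The following is a small sketch of that idea, again assuming an in-memory buffer in place of the file; the variable names are made up for the example.

```python
import io

import pandas as pd

csv_text = """color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
"""

csvframe = pd.read_csv(io.StringIO(csv_text), index_col=['color', 'status'])

# All rows under the outer label 'white'; the result is indexed by status.
white_rows = csvframe.loc['white']

# A single row addressed by both levels.
black_down = csvframe.loc[('black', 'down')]

# A single cell: row ('red', 'down'), column 'item3'.
value = csvframe.loc[('red', 'down'), 'item3']

print(white_rows)
print(black_down)
print(value)
```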

2.2. Writing Data to a CSV File

Simple write

code.py

>>> import numpy as np
>>> frame = pd.DataFrame(np.arange(16).reshape((4, 4)), columns=['red', 'blue', 'orange', 'black'], index=['a', 'b', 'c', 'd'])
>>> frame
   red  blue  orange  black
a    0     1       2      3
b    4     5       6      7
c    8     9      10     11
d   12    13      14     15
>>> frame.to_csv('E:\\Python\\Codes\\excited.csv')

excited.csv

,red,blue,orange,black
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15

Note that the first line starts with a ',': the index is written as the first column, and because the index has no name, its header field is empty.
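The empty leading header field matters when the file is read back. A short round-trip sketch, assuming an in-memory buffer instead of the file on disk, and using 'id' as a purely illustrative index label:

```python
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     columns=['red', 'blue', 'orange', 'black'],
                     index=['a', 'b', 'c', 'd'])

# to_csv accepts a file-like object as well as a path.
buffer = io.StringIO()
frame.to_csv(buffer)
text = buffer.getvalue()
first_line = text.splitlines()[0]

# index_col=0 turns the unnamed first column back into the index.
restored = pd.read_csv(io.StringIO(text), index_col=0)

# index_label gives the index column a header name when writing.
labelled = io.StringIO()
frame.to_csv(labelled, index_label='id')
labelled_first_line = labelled.getvalue().splitlines()[0]

print(first_line)
print(labelled_first_line)
print(restored)
```

Without index_col=0, the first column would come back as an ordinary column named "Unnamed: 0".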

Writing without the index and the column names

code.py

>>> frame.to_csv('E:\\Python\\Codes\\excited.csv', index=False, header=False)

excited.csv

0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15

Handling NaN values

code.py

>>> frame = pd.DataFrame([[3, 2, np.nan], [np.nan, np.nan, np.nan], [2, 3, 3]], index=['a', 'b', 'c'], columns=['red', 'black', 'orange'])
>>> frame
   red  black  orange
a  3.0    2.0     NaN
b  NaN    NaN     NaN
c  2.0    3.0     3.0
>>> frame.to_csv('E:\\Python\\Codes\\excited.csv')

excited.csv

,red,black,orange
a,3.0,2.0,
b,,,
c,2.0,3.0,3.0

Every NaN is written as an empty field.

Use the na_rep parameter to write a replacement value instead of the empty fields:

>>> frame.to_csv('E:\\Python\\Codes\\excited.csv', na_rep='lalala')

excited.csv

,red,black,orange
a,3.0,2.0,lalala
b,lalala,lalala,lalala
c,2.0,3.0,3.0

The first field of the header line is still empty. It is not a NaN value: it is the header of the index column, and the index has no name, so na_rep does not apply to it.
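A custom marker written with na_rep has to be declared again when the file is read, otherwise it comes back as an ordinary string. The sketch below illustrates this under the assumption that the CSV text is kept in memory; na_values is the read_csv parameter that lists extra strings to treat as NaN.

```python
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame([[3, 2, np.nan], [np.nan, np.nan, np.nan], [2, 3, 3]],
                     index=['a', 'b', 'c'],
                     columns=['red', 'black', 'orange'])

# With no path argument, to_csv returns the CSV text as a string.
default_text = frame.to_csv()
marked_text = frame.to_csv(na_rep='lalala')

# Empty fields are parsed as NaN by default; a custom marker must be listed.
restored = pd.read_csv(io.StringIO(marked_text), index_col=0,
                       na_values=['lalala'])

print(default_text)
print(marked_text)
print(restored)
```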

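3. Reading and Writing TXT Files

Data in a TXT file is not always separated by commas or semicolons. In that case the separator can be given as a regular expression, usually combined with a quantifier: '*' matches zero or more repetitions and '+' matches one or more. For columns separated by whitespace, use r'\s+'. A pattern such as '\s*' can also match the empty string, which makes it a poor separator.

Symbol    Meaning
.         any single character except a newline
\d        a digit
\D        a non-digit character
\s        a whitespace character
\S        a non-whitespace character
\n        newline
\t        tab
\uxxxx    the Unicode character with hexadecimal code xxxx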
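The symbols in the table can be tried directly with Python's standard re module, independent of pandas. A small sketch with made-up sample strings:

```python
import re

# A line whose fields are separated by a mix of spaces and a tab.
line = "1  5\t2   3"

# '\s+' means one or more whitespace characters.
fields = re.split(r'\s+', line)

# '\d' matches a digit, '\D' matches a non-digit character.
digits = re.findall(r'\d', "a1b22")
non_digits = re.findall(r'\D', "a1b22")

# '*' allows zero repetitions, so '\s*' matches the empty string;
# '+' requires at least one whitespace character.
star_matches_empty = re.fullmatch(r'\s*', '') is not None
plus_matches_empty = re.fullmatch(r'\s+', '') is not None

print(fields)
print(digits, non_digits)
print(star_matches_empty, plus_matches_empty)
```

Because '\s*' can match nothing at all, it is a risky choice for a field separator, while '\s+' only matches where whitespace is actually present.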

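3.1. Reading Data from a TXT File

Simple read

excited.txt (spaces and tabs inserted at random)

white red blue green
1 5 2 3
2 7 8 5
2 3 3 3

code.py

>>> pd.read_table('E:\\Python\\Codes\\excited.txt', sep='\s*')
__main__:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
E:\Python\Python3\lib\site-packages\pandas\io\parsers.py:2137: FutureWarning: split() requires a non-empty pattern match.
yield pat.split(line.strip())
E:\Python\Python3\lib\site-packages\pandas\io\parsers.py:2139: FutureWarning: split() requires a non-empty pattern match.
yield pat.split(line.strip())
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      2    3     3      3

The first attempt produces warnings rather than errors. The ParserWarning goes away when engine='python' is passed, as the message suggests. The FutureWarning points at a real problem, though: '\s*' also matches the empty string, and from Python 3.7 on, splitting on such a pattern also splits between every pair of characters. Use r'\s+' (one or more whitespace characters) instead. The default C engine supports this separator directly, so engine='python' is not needed:

>>> pd.read_table('E:\\Python\\Codes\\excited.txt', sep=r'\s+')
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      2    3     3      3

This works. Here '+' means "one or more"; '*' means "zero or more".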

Skipping lines while reading

excited.txt

12#$@!%$!$#!@$!@$!@
#$%^$^%$#!
@#%!
white red blue green
!$#$!@$#!@$
1 5 2 3
2 7 8 5
2 3 3 3
^#$^@FGSDQAS

code.py

>>> pd.read_table('E:\\Python\\Codes\\excited.txt', sep=r'\s+', engine='python', skiprows=[0, 1, 2, 4, 8])
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      2    3     3      3

The list gives the 0-based numbers of the lines to skip.
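skiprows also accepts a function that is called with each line number and returns True for lines to skip, which can be handier than a long list. The sketch below compares both forms on the same data; the in-memory buffer and the name `junk_lines` are assumptions of this example.

```python
import io

import pandas as pd

# Same layout as excited.txt: junk on lines 0, 1, 2, 4 and 8 (0-based).
text = """12#$@!%$!$#!@$!@$!@
#$%^$^%$#!
@#%!
white red  blue\tgreen
!$#$!@$#!@$
1 5  2 3
2  7 8\t5
2 3 3   3
^#$^@FGSDQAS
"""

junk_lines = {0, 1, 2, 4, 8}

# Form 1: an explicit list of line numbers.
by_list = pd.read_csv(io.StringIO(text), sep=r'\s+',
                      skiprows=sorted(junk_lines))

# Form 2: a callable evaluated against each line number.
by_func = pd.read_csv(io.StringIO(text), sep=r'\s+',
                      skiprows=lambda i: i in junk_lines)

print(by_list)
print(by_list.equals(by_func))
```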

Reading part of the data

It turns out that sep can be used with read_csv as well. nrows sets how many rows of data to read: nrows=3 reads three rows.

chunksize splits the file into pieces: with chunksize=3, each piece has three rows (the last piece may have fewer).

excited.txt

white red blue green black orange golden
1 5 2 3 111 222 233
100 7 8 5 2333 23333 233333
20 3 3 3 12222 1222 23232
2000 7 8 5 2333 23333 233333
300 3 3 3 12222 1222 23232

code.py

>>> frame = pd.read_csv('E:\\Python\\Codes\\excited.txt', sep=r'\s+', skiprows=[2], nrows=3, engine='python')
>>> frame
   white  red  blue  green  black  orange  golden
0      1    5     2      3    111     222     233
1     20    3     3      3  12222    1222   23232
2   2000    7     8      5   2333   23333  233333

This reads three rows of data from the top while skipping the third line of the file (line number 2, counting from 0).

>>> pieces = pd.read_csv('E:\\Python\\Codes\\excited.txt', sep=r'\s+', chunksize=2, engine='python')
>>> for piece in pieces:
...     print(piece)
...     print(type(piece))
...
   white  red  blue  green  black  orange  golden
0      1    5     2      3    111     222     233
1    100    7     8      5   2333   23333  233333
<class 'pandas.core.frame.DataFrame'>
   white  red  blue  green  black  orange  golden
2     20    3     3      3  12222    1222   23232
3   2000    7     8      5   2333   23333  233333
<class 'pandas.core.frame.DataFrame'>
   white  red  blue  green  black  orange  golden
4    300    3     3      3  12222    1222   23232
<class 'pandas.core.frame.DataFrame'>

Each piece holds two rows (the last one holds the single remaining row), and every piece is a DataFrame.
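A typical reason to read in pieces is to compute something over a file without holding all of it in memory at once. The following sketch sums one column piece by piece; the in-memory buffer and the variable names are assumptions made for the example.

```python
import io

import pandas as pd

text = """white red blue green black orange golden
1 5 2 3 111 222 233
100 7 8 5 2333 23333 233333
20 3 3 3 12222 1222 23232
2000 7 8 5 2333 23333 233333
300 3 3 3 12222 1222 23232
"""

# nrows limits the number of data rows; skiprows=[2] drops the third line.
first_three = pd.read_csv(io.StringIO(text), sep=r'\s+', skiprows=[2], nrows=3)

# With chunksize, read_csv returns an iterator that yields DataFrames.
chunk_lengths = []
white_total = 0
for piece in pd.read_csv(io.StringIO(text), sep=r'\s+', chunksize=2):
    chunk_lengths.append(len(piece))
    white_total += int(piece['white'].sum())

print(first_three['white'].tolist())
print(chunk_lengths)
print(white_total)
```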

3.2. Writing Data to a TXT File

Writing works the same way as for CSV files.
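In practice this means calling to_csv with a different sep. A minimal sketch that writes tab-separated text and reads it back, assuming a small made-up frame and keeping the text in memory:

```python
import io

import pandas as pd

frame = pd.DataFrame({'white': [1, 2], 'red': [5, 7]})

# sep='\t' writes tab-separated text; index=False leaves out the row labels.
text = frame.to_csv(sep='\t', index=False)
lines = text.splitlines()

# Reading it back with the same separator restores the data.
restored = pd.read_csv(io.StringIO(text), sep='\t')

print(lines)
print(restored)
```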

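4. Reading and Writing HTML Files

4.1. Writing Data to an HTML File

First, take a look at the to_html() method.

code.py

>>> frame
   white  red  blue  green  black  orange  golden
0      1    5     2      3    111     222     233
1    100    7     8      5   2333   23333  233333
2     20    3     3      3  12222    1222   23232
3   2000    7     8      5   2333   23333  233333
4    300    3     3      3  12222    1222   23232
>>> print(frame.to_html())
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>white</th>
      <th>red</th>
      ...
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>5</td>
      ...
    </tr>
    ...
  </tbody>
</table>

As the output shows, DataFrame.to_html() converts a DataFrame directly into the markup of an HTML table. To turn a DataFrame into an HTML file that can be opened in a browser, we only need to wrap that markup in a few more tags.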

code.py

>>> s = ['<HTML>']
>>> s.append('<HEAD><TITLE>DataFrame</TITLE></HEAD>')
>>> s.append('<BODY>')
>>> s.append(frame.to_html())
>>> s.append('</BODY></HTML>')
>>> html = ''.join(s)
>>> html_file = open('E:\\Python\\Codes\\DataFrame.html', 'w')
>>> html_file.write(html)
1193
>>> html_file.close()
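The same steps can be written with a with block, which closes the file even if writing fails. This is only a sketch: the temporary directory, the two-column sample frame and the explicit utf-8 encoding are assumptions of the example, not part of the original code.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'white': [1, 100], 'red': [5, 7]})

# Wrap the table markup produced by to_html() in a minimal page.
parts = [
    '<html>',
    '<head><title>DataFrame</title></head>',
    '<body>',
    frame.to_html(),
    '</body></html>',
]
html = ''.join(parts)

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'DataFrame.html')

# The with block closes the file automatically.
with open(path, 'w', encoding='utf-8') as html_file:
    written = html_file.write(html)  # number of characters written

print(path)
print(written)
```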

DataFrame.html (as rendered in a browser)

     white   red   blue   green   black   orange   golden
0        1     5      2       3     111      222      233
1      100     7      8       5    2333    23333   233333
2       20     3      3       3   12222     1222    23232
3     2000     7      8       5    2333    23333   233333
4      300     3      3       3   12222     1222    23232

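4.2. Reading Data from an HTML File

The read_html() method returns every table found on the page, so the result is a list of DataFrames. (read_html() needs an HTML parser to be installed: lxml, or BeautifulSoup4 together with html5lib.)

code.py

Reading the file written in the previous example:

>>> web_frames = pd.read_html('E:\\Python\\Codes\\DataFrame.html')
>>> for web_frame in web_frames:
...     print(web_frame)
...
   Unnamed: 0  white  red  blue  green  black  orange  golden
0           0      1    5     2      3    111     222     233
1           1    100    7     8      5   2333   23333  233333
2           2     20    3     3      3  12222    1222   23232
3           3   2000    7     8      5   2333   23333  233333
4           4    300    3     3      3  12222    1222   23232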
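The extra "Unnamed: 0" column is the old row index, which to_html() wrote as the first column of the table. A possible way to get it back as the index is the index_col parameter of read_html(). The sketch below assumes an HTML parser may or may not be installed and falls back to an empty list if it is missing; the sample frame is made up.

```python
import io

import pandas as pd

frame = pd.DataFrame({'white': [1, 100], 'red': [5, 7]})
html = frame.to_html()

try:
    # index_col=0 uses the first table column as the row index.
    tables = pd.read_html(io.StringIO(html), index_col=0)
except ImportError:
    # No HTML parser (lxml, or bs4 with html5lib) is installed.
    tables = []

for table in tables:
    print(table)
```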

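Best of all, read_html() also accepts a URL: it fetches the page, parses it, and extracts the tables directly.

code.py

>>> favors = pd.read_html('http://baike.baidu.com/item/%E5%9B%9B%E6%9C%88%E6%98%AF%E4%BD%A0%E7%9A%84%E8%B0%8E%E8%A8%80/13382872#viewPageContent')
>>> now = favors[0].copy()
>>> now = now.set_index(0)
>>> now.columns = now.loc['话']  # the row labelled '话' (episode) holds the column names; .ix was removed in pandas 1.0, so use .loc
>>> now.index.name = None
>>> now.drop('话')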

话标题(日/中) 剧本 \

1 モノトーン・カラフル单调·多彩吉冈孝夫

2 友人A 友人A 石黑恭平

3 春の中春光里神户守

4 旅立ち启程岩田和也河野亚矢子石黑恭平

5 どんてんもよう阴天石滨真史

6 帰り道归途井端义秀

7 カゲささやく暗影低语神户守

8 响け回响后藤圭二

9 共鸣共鸣神户守

10 君といた景色与你共赏的景色中村章子

11 命の灯生命之光朝仓海斗

12 トゥインクルリトルスター小星星神户守

13 爱の悲しみ爱的忧伤仓田绫子

14 足迹足迹柴山智隆

15 うそつき骗子神户守

16 似たもの同士相似的人黑木美幸

17 トワイライト暮光神户守

18 心重ねる心心相印石井俊匡

19 さよならヒーロー再见了英雄井端义秀

20 手と手手与手神户守

21 雪雪仓田绫子柴山智隆

22 春风春风石黑恭平

23 MOMENTS 岩田和也

话分镜 \

1 石黑恭平

2 原田孝宏

3 岩田和也

4 三木俊明河合拓也牧田昌也野野下伊织山田慎也菅井爱明小泉初荣浅贺和行

5 石滨真史小岛崇史

6 野野下伊织

7 间岛崇宽

8 高桥英俊

9 黑木美幸

10 原田孝宏

11 石黑恭平川越崇弘

12 福岛利规

13 野野下伊织

14 小泉初荣

15 矢岛武

16 山田真也野野下伊织小泉初荣三木俊明浅贺和行

17 河野亚矢子

18 河合拓也

19 こさや

20 矢岛武

21 野野下伊织小泉初荣门之园惠美高野绫河合拓也山田真也

22 石黑恭平黑木美幸

23 爱敬由纪子奥田佳子山田真也伊藤香织

话演出作画监督演奏作画监督总作画监督

1 爱敬由纪子浅贺和行 - NaN

2 三木俊明小林惠祐爱敬由纪子 NaN NaN

3 河合拓也 NaN NaN NaN

4 浅贺和行仓田绫子爱敬由纪子高野绫 NaN NaN

5 小岛崇史 - 爱敬由纪子 NaN

6 浅贺和行 NaN NaN NaN

7 山田真也 - NaN NaN

8 河合拓也浅贺和行 NaN NaN

9 小泉初荣 NaN NaN NaN

10 高野绫 NaN NaN NaN

11 山下惠中野彰子 - NaN NaN

12 长森佳容浅贺和行 NaN NaN

13 NaN NaN NaN NaN

14 - NaN NaN NaN

15 北岛勇树山下惠 C Company NAMU Animation 浅贺和行 NaN NaN

16 - 高野绫 NaN NaN

17 三木俊明高田晃浅贺和行爱敬由纪子 NaN

18 NaN NaN NaN NaN

19 小泉初荣野野下伊织高野绫山田真也河合拓也 NaN NaN NaN

20 野野下伊织小泉初荣河合拓也山田真也高野绫薗部爱子奥田佳子加藤万由子高田晃薮本和彦 NaN NaN NaN

21 NaN NaN NaN NaN

22 奥田桂子河合拓也野野下伊织高野绫小泉初荣伊藤香织浅贺和行高田晃爱敬由纪子 NaN NaN NaN

23 NaN NaN NaN NaN

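Very powerful. However, the header row had been read in as an ordinary row of data, shifted down by one line, so it took quite a while to get the table to display properly.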
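The header fix used above can be tried on a small table without any network access. In this sketch the frame `raw` and its values are a hypothetical stand-in for a scraped table whose header arrived as the first data row.

```python
import pandas as pd

# Hypothetical stand-in for a scraped table: the header is in row 0.
raw = pd.DataFrame([
    ['ep', 'title', 'writer'],
    ['1', 'first', 'A'],
    ['2', 'second', 'B'],
])

table = raw.set_index(0)                   # first column becomes the index
table.columns = table.loc['ep'].tolist()   # promote the header row to column names
table.index.name = None                    # drop the leftover index name (0)
table = table.drop('ep')                   # remove the now redundant header row

print(table)
```

drop() returns a new DataFrame, so the result has to be assigned (or just displayed, as in the interactive session above).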

5. Other Formats

Besides the file formats covered above, pandas also handles others, such as HDF5 and pickle.
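As an illustration of one of these formats, here is a minimal pickle round trip using to_pickle and read_pickle from the table in section 1. The temporary path and the sample frame are assumptions of the sketch. Pickle files should only be loaded from sources you trust.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'white': [1, 2], 'animal': ['cat', 'dog']},
                     index=['a', 'b'])

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'frame.pkl')

frame.to_pickle(path)             # serialize the DataFrame to disk
restored = pd.read_pickle(path)   # load it back, index and dtypes included

same = restored.equals(frame)
print(same)
```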
