Python数据分析笔记----第六章pandas文件操作

最新推荐文章于 2024-01-29 15:01:12 发布

Vae+

最新推荐文章于 2024-01-29 15:01:12 发布

阅读量292

点赞数

分类专栏： Python 文章标签： python

本文链接：https://blog.csdn.net/m0_48302179/article/details/111460411

版权

Python 专栏收录该内容

11 篇文章 1 订阅

订阅专栏

1.读取文件

1.1 以指定分隔符的方式读取文件

read_table(sep=)

In [3]: pd.read_table('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.csv',sep=',')
Out[3]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

1.2 忽略表头

read_csv()读取csv文件是，默认第一行为表头，索引列默认为整数索引

header=None：不要将第一行作为表头

In [4]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',header=None)
Out[4]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

1.3 自定义表头

names=[]

In [5]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
   ...: e'])
Out[5]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

1.4 指定特定列为索引列

index_col = 列的序号或者列名

In [7]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
   ...: e'],index_col=4)
Out[7]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

In [8]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
   ...: e'],index_col='message')
Out[8]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

index_col可以传入一个列名的列表

In [10]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/csv_mindex.csv',index_col=['key1','key
    ...: 2'])
Out[10]:
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16

1.5 读取文件时，跳过指定行

skiprows=[]

In [12]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex4.csv',skiprows=[0,2,3])
Out[12]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [18]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\ex4.csv#预览文件内容
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

1.6 自定义缺失值标识

na_values=[]或字典

In [19]: result = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv',na_values=['NULL'])

In [20]: result
Out[20]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [21]: sentinels = {'message':['foo','NA'],'something':['two']}

In [22]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv',na_values=sentinels)
Out[22]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN

1.7 设置pandas的显示的行数

pd.options.display.max_rows = 10#若DataFrame的行数超过10行，则只会显示前五行和后五行，中间用省略号代替

1.8 只读一个文件的前几行

In [23]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex6.csv',nrows=5)
Out[23]:
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q

1.9 分块读文件

In [24]: chunker = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex6.csv',chunksize=1000)

In [25]: chunker
Out[25]: <pandas.io.parsers.TextFileReader at 0x183e5e4eeb8>

In [26]: tot = pd.Series([])
    
In [27]: for piece in chunker:
    ...:     tot = tot.add(piece['key'].value_counts(),fill_value=0)
    ...:

In [28]: tot = tot.sort_values(ascending=False)

In [32]: tot.head(5).append(tot.tail(5))
Out[32]:
E    304.0
X    287.0
M    281.0
P    273.0
L    271.0
5    128.0
2    127.0
0    123.0
9    120.0
1    117.0
dtype: float64

2. 写文件

df.to_csv()默认会将行和列的标签都写入进文件

2.1 忽略行标签和列标签

to_csv(index=False,header=False)

In [33]: data = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv')

In [34]: data
Out[34]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [35]: data.to_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/out.csv',index = False,header=False)

In [36]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\out.csv
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

2.1 仅写入dataframe的子集

In [37]: data.to_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/out.csv',index=False,columns=['a','b',
    ...: 'c'])

In [38]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\out.csv
a,b,c
1,2,3.0
5,6,
9,10,11.0

Series也有to_csv()方法

3. 读取JSON数据

JSON(JavaScript Object Notation)已经成为web浏览器和其他应用间通过HTTP请求发送数据的标准格式。它是一种比CSV等表格文本形式更为自由的数据形式。字典字符串

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

不允许列表末尾出现逗号
基本类型是对象（字典）、数组（列表）、字符串、数字、布尔、空值
所有键必须是字符串

3.1 解析JSON数据

In [3]: import json

In [4]: obj = """^M
   ...: {"name": "Wes",^M
   ...:  "places_lived": ["United States", "Spain", "Germany"],^M
   ...:  "pet": null,^M
   ...:  "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},^M
   ...:               {"name": "Katie", "age": 38,^M
   ...:                "pets": ["Sixes", "Stache", "Cisco"]}]^M
   ...: }^M
   ...: """    

In [5]: result = json.loads(obj)

In [6]: result
Out[6]:
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

3.2 生成JSON数据

In [11]: result
Out[11]:
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [12]: json.dumps(result)
Out[12]: '{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

3.3 pd.read_json()

3.4 data_json.to_json()

3.3和3.4不常用，容易出问题

4. 读取html文件

pd.read_html(url)，会将html文件中所有的table解析成DataFrame并以列表的形式返回

5. 正则表达式

正则表达式的用法

re库中方法的用法

‘.’不能匹配\n

6. 读写二进格式

pd.read_pickle()
pd.to_pickle()

pickle仅被推荐作为短期的存储格式。问题在pickle很难确保格式的长期有效；一个今天被pickle化的对象可能明天因为库的新版本而无法反序列化。

7. 读写HDF5

HDF5用于存储大量的科学数组数据
HDF代表分层数据格式
HDF5文件可以存储多个数据集并且支持元数据（元数据：描述数据的数据）

8.读写Excel文件

pd.ExcelFile()
pd.read_excel()

上面两个函数都是基于附加包xlrd和openpyxl来分别读取xls和xlsx文件的

当读取含有多个表的文件，生成ExcelFile更快

#方法一：通过pd.ExcelFile()来读取文件
In [14]: pd.ExcelFile('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx')
Out[14]: <pandas.io.excel._base.ExcelFile at 0x2ae34f4b550>

In [15]: xlsx = pd.ExcelFile('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx')

In [16]: pd.read_excel(xlsx)
Out[16]:
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo

#方法二：直接通过pd.read_excel()来读取文件
In [17]: pd.read_excel('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx','Sheet1')#此处的Sheet1必须首字母大写，否则报错
Out[17]:
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo

pd.ExcelWriter()
to_excel()

In [20]: data
Out[20]:
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo

#方法一：通过pd.ExcelWriter()来写入文件
In [21]: writer = pd.ExcelWriter('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.xlsx')

In [22]: data.to_excel(writer,'Sheet1')

#方法二：通过pd.to_excel()来写入文件
In [23]: data.to_excel('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.xlsx','Sheet1')

10. 爬虫

11.与数据库交互

import pymssql
connect = pymssql.connect('localhost', '用户名', '密码', '数据库名称')

if connect:
    print("连接成功!")
cursor = connect.cursor()
cursor.execute("create table C_test02(id varchar(20))")
connect.commit()
cursor.close()
connect.close()

Python数据库与DataFrame

Vae+

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Python数据分析笔记----第六章pandas文件操作

文章目录1.读取文件1.1 以指定分隔符的方式读取文件1.2 忽略表头1.3 自定义表头1.4 指定特定列为索引列1.5 读取文件时，跳过指定行1.6 自定义缺失值标识1.7 设置pandas的显示的行数1.8 只读一个文件的前几行1.9 分块读文件2. 写文件2.1 忽略行标签和列标签3. 读取JSON数据3.1 解析JSON数据3.2 生成JSON数据3.3 pd.read_json()3.4 data_json.to_json()4. 读取html文件5. 正则表达式6. 读写二进格式7. 读写HD
复制链接

扫一扫