Python数据分析笔记----第六章pandas文件操作

1.读取文件

1.1 以指定分隔符的方式读取文件

read_table(sep=)

In [3]: pd.read_table('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.csv',sep=',')
Out[3]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
1.2 忽略表头

read_csv()读取csv文件是,默认第一行为表头,索引列默认为整数索引

header=None:不要将第一行作为表头

In [4]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',header=None)
Out[4]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
1.3 自定义表头

names=[]

In [5]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
   ...: e'])
Out[5]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
1.4 指定特定列为索引列

index_col = 列的序号或者列名

In [7]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
   ...: e'],index_col=4)
Out[7]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

In [8]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
   ...: e'],index_col='message')
Out[8]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

index_col可以传入一个列名的列表

In [10]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/csv_mindex.csv',index_col=['key1','key
    ...: 2'])
Out[10]:
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
1.5 读取文件时,跳过指定行

skiprows=[]

In [12]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex4.csv',skiprows=[0,2,3])
Out[12]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [18]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\ex4.csv#预览文件内容
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
1.6 自定义缺失值标识

​ na_values=[]或字典

In [19]: result = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv',na_values=['NULL'])

In [20]: result
Out[20]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [21]: sentinels = {'message':['foo','NA'],'something':['two']}

In [22]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv',na_values=sentinels)
Out[22]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
1.7 设置pandas的显示的行数
pd.options.display.max_rows = 10#若DataFrame的行数超过10行,则只会显示前五行和后五行,中间用省略号代替
1.8 只读一个文件的前几行
In [23]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex6.csv',nrows=5)
Out[23]:
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
1.9 分块读文件
In [24]: chunker = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex6.csv',chunksize=1000)

In [25]: chunker
Out[25]: <pandas.io.parsers.TextFileReader at 0x183e5e4eeb8>

In [26]: tot = pd.Series([])
    
In [27]: for piece in chunker:
    ...:     tot = tot.add(piece['key'].value_counts(),fill_value=0)
    ...:

In [28]: tot = tot.sort_values(ascending=False)

In [32]: tot.head(5).append(tot.tail(5))
Out[32]:
E    304.0
X    287.0
M    281.0
P    273.0
L    271.0
5    128.0
2    127.0
0    123.0
9    120.0
1    117.0
dtype: float64

2. 写文件

df.to_csv()默认会将行和列的标签都写入进文件

2.1 忽略行标签和列标签

to_csv(index=False,header=False)

In [33]: data = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv')

In [34]: data
Out[34]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [35]: data.to_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/out.csv',index = False,header=False)

In [36]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\out.csv
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

2.1 仅写入dataframe的子集

In [37]: data.to_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/out.csv',index=False,columns=['a','b',
    ...: 'c'])

In [38]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\out.csv
a,b,c
1,2,3.0
5,6,
9,10,11.0

Series也有to_csv()方法

3. 读取JSON数据

JSON(JavaScript Object Notation)已经成为web浏览器和其他应用间通过HTTP请求发送数据的标准格式。它是一种比CSV等表格文本形式更为自由的数据形式。字典字符串

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
  • 不允许列表末尾出现逗号
  • 基本类型是对象(字典)、数组(列表)、字符串、数字、布尔、空值
  • 所有键必须是字符串
3.1 解析JSON数据
In [3]: import json

In [4]: obj = """^M
   ...: {"name": "Wes",^M
   ...:  "places_lived": ["United States", "Spain", "Germany"],^M
   ...:  "pet": null,^M
   ...:  "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},^M
   ...:               {"name": "Katie", "age": 38,^M
   ...:                "pets": ["Sixes", "Stache", "Cisco"]}]^M
   ...: }^M
   ...: """    

In [5]: result = json.loads(obj)

In [6]: result
Out[6]:
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
3.2 生成JSON数据
In [11]: result
Out[11]:
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [12]: json.dumps(result)
Out[12]: '{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

3.3 pd.read_json()
3.4 data_json.to_json()

3.3和3.4不常用,容易出问题

4. 读取html文件

pd.read_html(url),会将html文件中所有的table解析成DataFrame并以列表的形式返回

5. 正则表达式

正则表达式的用法

re库中方法的用法

‘.’不能匹配\n

6. 读写二进格式

  • pd.read_pickle()
  • pd.to_pickle()

pickle仅被推荐作为短期的存储格式。问题在pickle很难确保格式的长期有效;一个今天被pickle化的对象可能明天因为库的新版本而无法反序列化。

7. 读写HDF5

  • HDF5用于存储大量的科学数组数据
  • HDF代表分层数据格式
  • HDF5文件可以存储多个数据集并且支持元数据(元数据:描述数据的数据)

8.读写Excel文件

  • pd.ExcelFile()
  • pd.read_excel()

上面两个函数都是基于附加包xlrd和openpyxl来分别读取xls和xlsx文件的

当读取含有多个表的文件,生成ExcelFile更快

#方法一:通过pd.ExcelFile()来读取文件
In [14]: pd.ExcelFile('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx')
Out[14]: <pandas.io.excel._base.ExcelFile at 0x2ae34f4b550>

In [15]: xlsx = pd.ExcelFile('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx')

In [16]: pd.read_excel(xlsx)
Out[16]:
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo

#方法二:直接通过pd.read_excel()来读取文件
In [17]: pd.read_excel('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx','Sheet1')#此处的Sheet1必须首字母大写,否则报错
Out[17]:
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo
  • pd.ExcelWriter()
  • to_excel()
In [20]: data
Out[20]:
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo

#方法一:通过pd.ExcelWriter()来写入文件
In [21]: writer = pd.ExcelWriter('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.xlsx')

In [22]: data.to_excel(writer,'Sheet1')

#方法二:通过pd.to_excel()来写入文件
In [23]: data.to_excel('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.xlsx','Sheet1')

10. 爬虫

11.与数据库交互

import pymssql
connect = pymssql.connect('localhost', '用户名', '密码', '数据库名称')

if connect:
    print("连接成功!")
cursor = connect.cursor()
cursor.execute("create table C_test02(id varchar(20))")
connect.commit()
cursor.close()
connect.close()

Python数据库与DataFrame

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值