Data Analysis Engineer, Lecture 02: Pandas Tutorial (Part 1)
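pandas is a Python library built specifically for data analysis.
Introduction to Pandas
- A Python package for data analysis and processing
- Built on NumPy (scientific computing on "matrices")
- Feels somewhat like operating Excel or SQL from Python
Contents
- Series
- DataFrame
- Index
- Reading and writing CSV files
The Series data structure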
import numpy as np
import pandas as pd
# json.loads() parses a JSON string into Python objects
import json
jsonStr = '{"name":"aspiring", "age": 17, "hobby": ["money","power", "read"],"parames":{"a":1,"b":2}}'
jsonData = json.loads(jsonStr)
print(jsonData)
print(type(jsonData))
print(jsonData['hobby'])
{'name': 'aspiring', 'age': 17, 'hobby': ['money', 'power', 'read'], 'parames': {'a': 1, 'b': 2}}
<class 'dict'>
['money', 'power', 'read']
# Read a JSON file
# json.load() parses JSON from an open file object; here we only peek at the first raw line
path1 = 'data/example.json'
open(path1,'r',encoding='utf-8').readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
# Example from a Python data analysis book
import json
path2 = 'data/example.json'
records = [json.loads(line) for line in open(path2, 'r', encoding='utf-8')]
records[0]
records[0]['tz']
'America/New_York'
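The list comprehension above never closes the file it opens. A minimal sketch of the same line-by-line parsing with a context manager follows. It uses an in-memory `io.StringIO` buffer holding two made-up records in place of `data/example.json`, so the sample values are assumptions and not the contents of the real file.

```python
import io
import json

# Two made-up JSON records, one per line, standing in for data/example.json.
sample = io.StringIO(
    '{"tz": "America/New_York", "c": "US"}\n'
    '{"tz": "Europe/London", "c": "GB"}\n'
)

# The context manager closes the buffer (or a real file) when the block ends.
# Blank lines are skipped so json.loads() never sees an empty string.
with sample as f:
    records = [json.loads(line) for line in f if line.strip()]

print(records[0]['tz'])
print(len(records))
```

With a real file, `open(path2, 'r', encoding='utf-8')` would take the place of `sample`.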
Constructing and initializing a Series
s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'])
s
0 7
1 Beijing
2 3.14
3 -12345
4 HanXiaoyang
dtype: object
s.values
array([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], dtype=object)
s.index
RangeIndex(start=0, stop=5, step=1)
s[1]
'Beijing'
s
0 7
1 Beijing
2 3.14
3 -12345
4 HanXiaoyang
dtype: object
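By default, pandas uses the integers 0 to n-1 as the index of a Series, but we can also specify the index ourselves. The index can be thought of as analogous to the keys of a dict.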
s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], index=['A', 'B', 'C', 'D', 'E'])
s
A 7
B Beijing
C 3.14
D -12345
E HanXiaoyang
dtype: object
s['A']
7
s[ ['A','D','B'] ]
A 7
D -12345
B Beijing
dtype: object
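The dict analogy can be made concrete with a small sketch. The three-element Series below is a shortened, illustrative version of `s`; it shows that index labels support membership tests like dict keys and that a Series can be converted to a dict.

```python
import pandas as pd

# Shortened, illustrative version of the Series above.
s = pd.Series([7, 'Beijing', 3.14], index=['A', 'B', 'C'])

# The index labels play the role of dict keys.
labels = list(s.index)
has_b = 'B' in s          # membership is tested against the index, not the values
as_dict = s.to_dict()     # label -> value mapping

print(labels)
print(has_b)
print(as_dict)
```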
We can build a Series from a list and specify the index at the same time. We can also initialize a Series from a dict, because a Series is essentially a key-value structure.
cities = {
'Beijing':55000, 'ShangHai':60000, 'Shenzhen':50000, 'Hangzhou':30000, 'Guangzhou':40000, 'Suzhou':None}
cities
{'Beijing': 55000,
'Guangzhou': 40000,
'Hangzhou': 30000,
'ShangHai': 60000,
'Shenzhen': 50000,
'Suzhou': None}
apt = pd.Series(cities, name='income')
apt
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
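Note that the `None` value for Suzhou shows up as `NaN` and the integers become floats. A minimal sketch of that conversion, using a two-entry dict chosen only for illustration:

```python
import pandas as pd

# A None among numeric values is stored as NaN, which forces a float dtype.
incomes = pd.Series({'Beijing': 55000, 'Suzhou': None}, name='income')

print(incomes.dtype)
print(incomes.isnull())
```

`isnull()` marks the missing entry, which is how such gaps are usually detected later on.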
# Indexing
apt['Guangzhou']
40000.0
apt[1]
40000.0
apt[1:]
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
apt[:-1]
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Name: income, dtype: float64
apt[[3,4,1]]
ShangHai 60000.0
Shenzhen 50000.0
Guangzhou 40000.0
Name: income, dtype: float64
apt[ ['ShangHai', 'Shenzhen', 'Guangzhou'] ]
ShangHai 60000.0
Shenzhen 50000.0
Guangzhou 40000.0
Name: income, dtype: float64
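The cells above mix label-based and position-based access through plain square brackets. Recent pandas releases emit a deprecation warning for integer keys such as `apt[1]` on a label-indexed Series, and a Series built from a dict now keeps the dict's insertion order instead of sorting the keys as in the output above, so positional results can differ between versions. A sketch of the explicit accessors, using a three-city subset chosen for illustration:

```python
import pandas as pd

apt = pd.Series(
    {'Beijing': 55000, 'Guangzhou': 40000, 'Hangzhou': 30000},
    name='income',
)

by_label = apt.loc['Guangzhou']    # label-based lookup
by_position = apt.iloc[1]          # position-based lookup (second element)
several = apt.iloc[[2, 0]]         # several positions, in the requested order

print(by_label)
print(by_position)
print(several)
```

Using `.loc` and `.iloc` makes it unambiguous whether a key is meant as a label or as a position.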
# Simple arithmetic
# Broadcasting: the scalar is applied to every element
3*apt
Beijing 165000.0
Guangzhou 120000.0
Hangzhou 90000.0
ShangHai 180000.0
Shenzhen 150000.0
Suzhou NaN
Name: income, dtype: float64
apt/2.5
Beijing 22000.0
Guangzhou 16000.0
Hangzhou 12000.0
ShangHai 24000.0
Shenzhen 20000.0
Suzhou NaN
Name: income, dtype: float64
# A plain list does not support element-wise arithmetic
my_list = [2,4,6,8,10]
my_list/2
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-39aba40a404f> in <module>()
----> 1 my_list/2
TypeError: unsupported operand type(s) for /: 'list' and 'int'
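For comparison, here is a short sketch of three ways to halve every element: a list comprehension on the plain list, a NumPy array, and a Series. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

my_list = [2, 4, 6, 8, 10]

# A plain list needs an explicit loop or comprehension.
halved_list = [x / 2 for x in my_list]

# NumPy arrays and pandas Series broadcast the scalar over every element.
halved_array = np.array(my_list) / 2
halved_series = pd.Series(my_list) / 2

print(halved_list)
print(halved_array)
print(halved_series)
```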
apt[1:]
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
apt[:-1]
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Name: income, dtype: float64
# Arithmetic is aligned on the index labels, not on position
apt[1:] + apt[:-1]
Beijing NaN
Guangzhou 80000.0
Hangzhou 60000.0
ShangHai 120000.0
Shenzhen 100000.0
Suzhou NaN
Name: income, dtype: float64
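The same alignment rule can be seen on two small made-up Series whose labels overlap only partly and appear in a different order. This is a sketch for illustration, not data from the lecture.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# Values are matched by label; labels present on only one side yield NaN.
total = a + b

print(total)
```

The result covers the union of both indexes, which is why Beijing and Suzhou are `NaN` in the output above: each appears in only one of the two slices.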
# Use `in` to check whether a label exists in the index
'Hangzhou' in apt
True
'Chongqing' in apt
False
# apt['Chongqing'] would raise a KeyError; get() returns None instead
print(apt.get('Chongqing'))
None
print(apt.get('Guangzhou'))
40000.0
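A small sketch of the difference between bracket access and `get()`, including the optional default argument of `get()`. The two-city Series is a reduced stand-in for `apt`.

```python
import pandas as pd

apt = pd.Series({'Beijing': 55000.0, 'Guangzhou': 40000.0}, name='income')

missing = apt.get('Chongqing')              # no such label: returns None
with_default = apt.get('Chongqing', 0.0)    # or a caller-supplied default

# Bracket access on a missing label raises instead.
try:
    apt['Chongqing']
    raised = False
except KeyError:
    raised = True

print(missing, with_default, raised)
```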
Boolean indexing / conditional indexing
apt>=40000
Beijing True
Guangzhou True
Hangzhou False
ShangHai True
Shenzhen True
Suzhou False
Name: income, dtype: bool
# Conditional indexing: keep only the rows where the mask is True
apt[apt>=40000]
Beijing 55000.0
Guangzhou 40000.0
ShangHai 60000.0
Shenzhen 50000.0
Name: income, dtype: float64
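Several conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses a reduced copy of the income data; the thresholds are arbitrary and chosen for illustration.

```python
import numpy as np
import pandas as pd

apt = pd.Series(
    {'Beijing': 55000, 'Guangzhou': 40000, 'Hangzhou': 30000,
     'ShangHai': 60000, 'Suzhou': np.nan},
    name='income',
)

# Parentheses are required because & and | bind tighter than comparisons.
mid_range = apt[(apt >= 40000) & (apt < 60000)]
extremes = apt[(apt < 40000) | (apt >= 60000)]

print(mid_range)
print(extremes)
```

Comparisons against `NaN` evaluate to `False`, so Suzhou is excluded from both results.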
# Summary statistics
apt.mean()
47000.0
apt.median()
50000.0
apt.max()
60000.0
apt.min()
30000.0
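These statistics ignore the missing Suzhou value: the mean of 47000.0 is taken over the five known incomes. A short sketch of that behavior, using the same six values without their labels:

```python
import numpy as np
import pandas as pd

apt = pd.Series([55000, 40000, 30000, 60000, 50000, np.nan])

mean_default = apt.mean()              # NaN is skipped by default
mean_strict = apt.mean(skipna=False)   # NaN propagates when skipping is disabled
n_valid = apt.count()                  # number of non-missing values

print(mean_default, mean_strict, n_valid)
```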
Assigning values to a Series
apt
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
apt['Shenzhen'] = 70000
apt
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Name: income, dtype: float64
# Conditional assignment: overwrite every value that satisfies the condition
apt[apt<=40000] = 45000
apt
Beijing 55000.0
Guangzhou 45000.0
Hangzhou 45000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Name: income, dtype: float64
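The assignment above modifies `apt` in place. When the original should stay untouched, one alternative is `Series.mask`, which returns a new Series. This is a sketch of that alternative on a reduced copy of the data, not the approach used in the lecture.

```python
import numpy as np
import pandas as pd

apt = pd.Series(
    {'Beijing': 55000, 'Guangzhou': 40000, 'Hangzhou': 30000, 'Suzhou': np.nan},
    name='income',
)

# mask() replaces values where the condition is True and returns a new Series.
raised = apt.mask(apt <= 40000, 45000)

print(raised)
print(apt)   # the original is unchanged
```

As with the in-place version, the `NaN` entry fails the comparison and is left as it is.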
type(apt)
pandas.core.series.Series
# More advanced math: NumPy functions apply element-wise to a Series
np.log(apt)
Beijing 10.915088
Guangzhou 10.714418
Hangzhou 10.714418
ShangHai 11.002100
Shenzhen 11.156251
Suzhou NaN
Name: income, dtype: float64
cars = pd.Series({
'Beijing':350000, 'ShangHai':400000, 'Shenzhen':300000,
'Tianjin':200000, 'Guangzhou':250000, 'Chongqing':150000
})
cars
Beijing 350000
Chongqing 150000
Guangzhou 250000
ShangHai 400000
Shenzhen 300000
Tianjin 200000
dtype: int64
expense = cars + 10*apt
expense
Beijing 900000.0
Chongqing NaN
Guangzhou 700000.0
Hangzhou NaN
ShangHai 1000000.0
Shenzhen 1000000.0
Suzhou NaN
Tianjin NaN
dtype: float64
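Cities that appear in only one of the two Series end up as `NaN` in `expense`. If a missing entry should instead count as zero, `Series.add` accepts a `fill_value`. Whether zero is a sensible substitute is a modeling assumption made here for illustration; the sketch uses two-city subsets of `cars` and `apt`.

```python
import pandas as pd

cars = pd.Series({'Beijing': 350000, 'Tianjin': 200000})
apt = pd.Series({'Beijing': 55000.0, 'Hangzhou': 45000.0})

# Labels missing on one side are treated as 0 before the addition.
expense = cars.add(10 * apt, fill_value=0)

print(expense)
```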
Missing data
'Hangzhou' in apt
True
'Hangzhou' in cars
False
apt
Beijing 55000.0
Guangzhou 45000.0
Hangzhou 45000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Name: income, dtype: float64
# Returns a boolean Series: True where the value is not missing
apt.notnull()
Beij