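The exercises below come from "pandas数据分析100道练习题" (100 pandas data analysis exercises). They are just enough to get familiar with the various pandas operations, although I still don't fully understand some of the functions used in a few of the problems. I have put all the problems into a single ipynb file and uploaded it to CSDN (0 points to download): link.
I'd also like to share two sets of online exercises that let you practice pandas on real tasks (ten hands-on problems). I haven't gone through both of them, only the one on 和鲸 Kesci, but judging from the problem titles and the format of the GitHub version, they are essentially identical. 和鲸 Kesci: pandas数分析练习 \ pandas数分析练习: GitHub link. 知乎: pandas练习题100道 \ pandas练习题100道: GitHub link.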
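How to import pandas and check the version
import pandas as pd
print(pd.__version__)
# show_versions() prints its own report and returns None, so the outer print() adds the trailing "None" seen below
print(pd.show_versions(as_json=True))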
0.25.1
{'system': {'commit': None, 'python': '3.7.4.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 142 Stepping 9, GenuineIntel', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None', 'LOCALE': 'None.None'}, 'dependencies': {'pandas': '0.25.1', 'numpy': '1.16.5', 'pytz': '2019.3', 'dateutil': '2.8.0', 'pip': '19.2.3', 'setuptools': '41.4.0', 'Cython': '0.29.13', 'pytest': '5.2.1', 'hypothesis': None, 'sphinx': '2.2.0', 'blosc': None, 'feather': None, 'xlsxwriter': '1.2.1', 'lxml.etree': '4.4.1', 'html5lib': '1.0.1', 'pymysql': '0.9.3', 'psycopg2': None, 'jinja2': '2.10.3', 'IPython': '7.8.0', 'pandas_datareader': None, 'bs4': '4.8.0', 'bottleneck': '1.2.1', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.1.1', 'numexpr': '2.7.0', 'odfpy': None, 'openpyxl': '3.0.0', 'pandas_gbq': None, 'pyarrow': None, 'pytables': None, 's3fs': None, 'scipy': '1.3.1', 'sqlalchemy': '1.3.9', 'tables': '3.5.2', 'xarray': None, 'xlrd': '1.2.0', 'xlwt': '1.3.0'}}
None
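Convert a list, NumPy array, or dict to a pd.Series
import numpy as np
mylist = list('abcdefghijklmnopqrstuvwxyz')
myarr = np.arange(26)
# Note: zip() yields (letter, number) tuples, not a dict, so ser3 below holds tuples; wrap it in dict(...) to key the Series by letter
mydict = zip(mylist, myarr)
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())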
0 (a, 0)
1 (b, 1)
2 (c, 2)
3 (d, 3)
4 (e, 4)
dtype: object
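The output above shows tuples because `zip` alone does not build a dict. As a minimal sketch of the dict case (the variable names here are my own, not from the original exercise), wrapping the `zip` in `dict(...)` makes the letters become the Series index:

```python
import numpy as np
import pandas as pd

letters = list('abcdefghijklmnopqrstuvwxyz')
numbers = np.arange(26)

# dict(zip(...)) builds a real mapping; the dict keys become the Series index
letter_to_number = dict(zip(letters, numbers))
ser_from_dict = pd.Series(letter_to_number)

print(ser_from_dict.head())
```

With this version the index holds the letters, which is what the next exercise (moving the index into a column) is really meant to demonstrate.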
Convert the index of a series into a dataframe column
df = ser3.to_frame().reset_index()
df.head()
   index       0
0      0  (a, 0)
1      1  (b, 1)
2      2  (c, 2)
3      3  (d, 3)
4      4  (e, 4)
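The default column names `index` and `0` are not very descriptive. A small sketch of one way to name them, assuming a letter-indexed Series (the names `letter` and `number` are just illustrative):

```python
import pandas as pd

ser = pd.Series({'a': 0, 'b': 1, 'c': 2})

# rename_axis names the index; reset_index(name=...) names the value column
df = ser.rename_axis('letter').reset_index(name='number')
print(df)
```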
Combine multiple series into a dataframe
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
df.head()
  col1  col2
0    a     0
1    b     1
2    c     2
3    d     3
4    e     4
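Another way to get the same kind of result, sketched here with shortened data, is `pd.concat` along `axis=1`; each Series' name then becomes its column label:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(list('abcde'))
ser2 = pd.Series(np.arange(5))

# rename() sets the Series name, which concat uses as the column label
df = pd.concat([ser1.rename('col1'), ser2.rename('col2')], axis=1)
print(df)
```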
Combine multiple series into a dataframe, aligned on index
s1 = ser1[:16]
s2 = ser2[14:]
pd.concat([s1, s2], axis=1)
      0     1
0     a   NaN
1     b   NaN
2     c   NaN
3     d   NaN
4     e   NaN
5     f   NaN
6     g   NaN
7     h   NaN
8     i   NaN
9     j   NaN
10    k   NaN
11    l   NaN
12    m   NaN
13    n   NaN
14    o  14.0
15    p  15.0
16  NaN  16.0
17  NaN  17.0
18  NaN  18.0
19  NaN  19.0
20  NaN  20.0
21  NaN  21.0
22  NaN  22.0
23  NaN  23.0
24  NaN  24.0
25  NaN  25.0
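The result above keeps every index label from either series and fills the gaps with NaN (an outer join, which is the default). If only the labels present in both series are wanted, `join='inner'` can be passed instead. A minimal sketch with the same slices:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(list('abcdefghijklmnop'))   # index labels 0..15
s2 = pd.Series(np.arange(26))[14:]         # index labels 14..25

# join='inner' keeps only the labels that appear in both series (14 and 15)
inner = pd.concat([s1, s2], axis=1, join='inner')
print(inner)
```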
Concatenate two series end to end
pd.concat([s1, s2], axis=0)
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
dtype: object
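Notice that the labels 14 and 15 each appear twice in the result above, because `concat` keeps the original index of each piece. If a fresh 0..n-1 index is preferred, `ignore_index=True` is one option. A short sketch:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(list('abcdefghijklmnop'))   # index labels 0..15
s2 = pd.Series(np.arange(26))[14:]         # index labels 14..25

stacked = pd.concat([s1, s2], axis=0)                         # keeps duplicate labels
renumbered = pd.concat([s1, s2], axis=0, ignore_index=True)   # fresh RangeIndex

print(stacked.index.is_unique)
print(renumbered.index.is_unique)
```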
Find the elements that are in series A but not in series B
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
ser1[~ser1.isin(ser2)]
0 1
1 2
2 3
dtype: int64
Union of two series
np.union1d(ser1, ser2)
array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)
Intersection of two series
np.intersect1d(ser1, ser2)
array([4, 5], dtype=int64)
Elements that are not common to both series (symmetric difference)
u = pd.Series(np.union1d(ser1, ser2))
i = pd.Series(np.intersect1d(ser1, ser2))
u[~u.isin(i)]
0 1
1 2
2 3
5 6
6 7
7 8
dtype: int64
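The union-minus-intersection approach above works, but NumPy also has a dedicated function for this, `np.setxor1d`. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# setxor1d returns the sorted values that are in exactly one of the two inputs
not_shared = np.setxor1d(ser1, ser2)
print(not_shared)
```

Unlike the version above, this returns a plain NumPy array rather than a Series, so the positional labels from the union are not kept.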
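How to get the minimum, 25th percentile, median, 75th percentile, and maximum of a series
ser = pd.Series(np.random.normal(10, 5, 25))
# Note: this RandomState is created after the draw and never used, so the values above are unseeded and the output below changes from run to run
np.random.RandomState(100)
np.percentile(ser, q=[0, 25, 50, 75, 100])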
array([-2.54523372, 7.86187042, 10.16123596, 15.60337005, 23.12409334])
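For a reproducible draw, the `RandomState` object has to be the thing that generates the numbers. Here is a sketch of how that might look, together with the equivalent `Series.quantile` call (the seed 100 is taken from the snippet above; everything else is my own naming):

```python
import numpy as np
import pandas as pd

# Draw from the seeded generator itself so that reruns give the same data
state = np.random.RandomState(100)
ser = pd.Series(state.normal(10, 5, 25))

# Minimum, quartiles and maximum, first with NumPy and then with pandas
five_num = np.percentile(ser, q=[0, 25, 50, 75, 100])
via_quantile = ser.quantile([0, 0.25, 0.5, 0.75, 1])

print(five_num)
print(via_quantile)
```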
How to get the frequency counts of the unique items in a series
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
ser.value_counts()
h 8
b 5
c 4
g 4
f 4
a 3
d 2
dtype: int64
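If relative frequencies are more useful than raw counts, `value_counts` accepts `normalize=True`. A small sketch on fixed data, so the numbers are easy to check by hand:

```python
import pandas as pd

ser = pd.Series(list('aabbbc'))

counts = ser.value_counts()                 # b: 3, a: 2, c: 1
shares = ser.value_counts(normalize=True)   # the same counts divided by len(ser)

print(counts)
print(shares)
```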
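The top 2 elements of a series by count
v_cnt = ser.value_counts()
print(v_cnt)
# Note: this takes the two most common *count values* (4 and 5 here), not the two most frequent elements; v_cnt.index[:2] would give the latter
cnt_cnt = v_cnt.value_counts().index[:2]
print(cnt_cnt)
cnt_cnt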
h 8
b 5
c 4
g 4
f 4
a 3
d 2
dtype: int64
Int64Index([4, 5], dtype='int64')
Int64Index([4, 5], dtype='int64')
# Selects every element whose count is one of the values in cnt_cnt
v_cnt[v_cnt.isin(cnt_cnt)].index
Index(['b', 'c', 'g', 'f'], dtype='object')
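Since `value_counts` already sorts by count in descending order, the two most frequent elements can be read straight off its index. A minimal sketch on fixed data (the `'Other'` replacement step is an optional extra, not part of the code above):

```python
import pandas as pd

ser = pd.Series(list('hhhhbbbcca'))   # h: 4, b: 3, c: 2, a: 1

# The first two index labels of value_counts() are the two most frequent elements
top2 = ser.value_counts().index[:2]
print(top2)

# Optionally keep only those two and label everything else 'Other'
collapsed = ser.where(ser.isin(top2), 'Other')
print(collapsed.tolist())
```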
How to split a numeric series into 10 equal-sized groups
ser = pd.Series(np.random.random(20))
ser.head()
0 0.588945
1 0.356710
2 0.798986
3 0.170943
4 0.076717
dtype: float64
groups = pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1],
                 labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'])
groups.head()
0 5th
1 3rd
2 9th
3 3rd
4 1st
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]
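Spelling out all eleven quantile edges is not strictly necessary: `pd.qcut` also accepts an integer number of bins. A sketch, assuming integer bin codes are acceptable in place of the `'1st'` ... `'10th'` labels:

```python
import numpy as np
import pandas as pd

# 20 distinct values in a shuffled order (seeded so the example is repeatable)
rng = np.random.RandomState(0)
ser = pd.Series(rng.permutation(20))

# q=10 asks for deciles; labels=False returns the bin number 0..9 for each value
decile = pd.qcut(ser, q=10, labels=False)
sizes = decile.value_counts()

print(sizes.sort_index())
```

With 20 distinct values, each of the 10 groups ends up with exactly 2 members.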
How to convert a NumPy array into a dataframe of a given shape
ser = pd.Series(np.random.randint(1, 10, 35))
df = pd.DataFrame(ser.values.reshape(7, 5))
df
   0  1  2  3  4
0  8  6  5  4  1
1  1  7  8  1  4
2  5  3  5  7  5
3  8  6  3  4  6
4  5  9  2  4  3
5  3  7  6  8  7
6  6  2  2  7  5
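When only one dimension is known, `reshape` can infer the other one from `-1`. A small sketch using `np.arange` so the layout is easy to follow (the column labels are made up for illustration):

```python
import numpy as np
import pandas as pd

arr = np.arange(35)

# -1 lets NumPy work out the number of columns (35 / 7 = 5); rows are filled in order
df = pd.DataFrame(arr.reshape(7, -1), columns=list('abcde'))
print(df)
```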
How to find the positions of the multiples of 2 in a series
ser = pd.Series(np.random.randint(1, 10, 7))
ser
0 1
1 8
2 7
3 9
4 9
5 2
6 6
dtype: int32
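# Pass the underlying boolean array; calling np.argwhere directly on a Series can fail on newer pandas versions
np.argwhere((ser % 2 == 0).values)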
array([[1],
[5],
[6]], dtype=int64)
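`np.argwhere` returns one row per match, which is why the result above is a column of single-element rows. If a flat list of positions is easier to work with, `np.flatnonzero` is one alternative. A sketch using the same values as the series printed above:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 8, 7, 9, 9, 2, 6])

# flatnonzero returns the positions of the True entries as a 1-D array
positions = np.flatnonzero((ser % 2 == 0).to_numpy())
print(positions)
```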
How to extract the items at given positions from a series
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]
ser.take(pos)
0 a
4 e
8 i
14 o
20 u
dtype: object
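In the example above positions and index labels happen to coincide, which hides the fact that `take` works by position. A small sketch with a non-default index (chosen arbitrarily) makes that visible, and shows that `iloc` gives the same result:

```python
import pandas as pd

ser = pd.Series(list('abcde'), index=[10, 20, 30, 40, 50])

# Both select the first and last items by position, whatever the labels are
by_take = ser.take([0, 4])
by_iloc = ser.iloc[[0, 4]]

print(by_take)
```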
Get the positions of given elements
aims = list('adhz')
[pd.Index(ser).get_loc(i) for i in aims]
[0, 3, 7, 25]
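The list comprehension above can also be written as a single call to `Index.get_indexer`, which additionally reports values that are not found as `-1` instead of raising. A sketch, assuming the series values are unique (which `get_indexer` requires):

```python
import pandas as pd

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))

# One vectorised lookup for all targets
positions = pd.Index(ser).get_indexer(list('adhz'))
print(positions)

# A value that is not in the series comes back as -1
with_missing = pd.Index(ser).get_indexer(['a', '?'])
print(with_missing)
```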
How to compute the mean squared error between a truth series and a predicted series
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
np.mean((truth - pred) ** 2)
0.32466394194250286
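Because the prediction above uses random noise, the result changes on every run. Here is a tiny fixed example (numbers chosen by me) where the mean squared error can be checked by hand:

```python
import pandas as pd

truth = pd.Series([0.0, 1.0, 2.0, 3.0])
pred = pd.Series([0.5, 1.0, 1.0, 3.0])

# Squared errors are 0.25, 0, 1 and 0, so the mean is 1.25 / 4
mse = ((truth - pred) ** 2).mean()
print(mse)
```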
How to convert the first character of each element in a series to uppercase
ser = pd.Series(['how', 'are', 'you'])
ser.map(lambda x: x.title())
0 How
1 Are
2 You
dtype: object
How to count the number of characters in each word of a series
ser.map(lambda x: len(x))
0 3
1 3
2 3
dtype: int64
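Both of the last two exercises can also be done without a lambda through the `.str` accessor. One thing worth knowing is that `title()` capitalises every word, while `capitalize()` only touches the first character, which only matters once an element contains more than one word. A short sketch with multi-word strings of my own:

```python
import pandas as pd

ser = pd.Series(['how are you', 'fine thanks'])

titled = ser.str.title()             # first letter of every word
capitalized = ser.str.capitalize()   # first letter of the whole string only
lengths = ser.str.len()              # number of characters per element

print(titled.tolist())
print(capitalized.tolist())
print(lengths.tolist())
```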
How to compute the differences of a time series
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
ser.diff()
0 NaN
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 6.0
7 8.0
dtype: float64
ser.diff().diff()
0 NaN
1 NaN
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 2.0
dtype: float64
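To make clear what `diff` is doing, here is a sketch that rebuilds the first difference by hand with `shift` and compares it with the built-in result. The first value is NaN because there is no earlier element to subtract:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# diff() is the element minus the previous element
manual_first = ser - ser.shift(1)
first = ser.diff()

# Applying diff twice gives the second difference (two leading NaNs)
second = ser.diff().diff()

print(first.tolist())
print(second.tolist())
```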
How to convert a series of date strings to datetimes
import pandas as pd
ser = pd.Series(
    ['01 Jan 2010',
     '02-02-2011',
     '20120303',
'2013/0