Pandas Reading Notes: Data Analysis
Notes from 2021.8.6
Some of the book's material, such as function mapping, has been left out.
① Series
1 Basic usage 1
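- Code
# Imports used throughout these notes
import numpy as np
import pandas as pd
from pandas import Series
# Example 1
obj = Series([4,-8,2,3])
print(obj.values)
print(obj.index)
print(obj)
- Output
[ 4 -8  2  3]
RangeIndex(start=0, stop=4, step=1)
0    4
1   -8
2    2
3    3
dtype: int64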
2 Basic usage 2 (modifying the index)
- Code
# Set the index when constructing the Series
obj2 = Series([1,3,-5,2],index = ['a','b','c','d'])
# Or assign it afterwards: obj2.index = ['a','b','c','d']
print(obj2['a'])
print(obj2[obj2 > 0])
print(obj2)
obj2.index
- Output
1
a    1
b    3
d    2
dtype: int64
a    1
b    3
c   -5
d    2
dtype: int64
Out[8]:
Index(['a', 'b', 'c', 'd'], dtype='object')
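3 Passing a dict
If only a dict is passed, the Series keeps the dict's key order. If both a dict and an index are passed, the result follows the index: index labels that are not found in the dict get NaN, and dict keys that are not in the index are dropped.
- Code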
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj1 = Series(sdata)
print(obj1)
obj2 = Series(sdata,index = states)
print(obj2)
- Output
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
4 Detecting missing values with isnull()
Detects missing data such as NaN.
- Code
# Reuses obj2 from the previous example
obj2.isnull()
- Output
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
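As a small follow-up to the isnull() example, the sketch below counts the missing entries, keeps only the non-missing ones, and fills the gap with 0. The variable names (prices, missing_count, present, filled) are made up for this illustration; notnull() and fillna() are the standard counterparts of isnull().

```python
import pandas as pd

# Same data as in the dict example above.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
prices = pd.Series(sdata, index=states)  # hypothetical name for this sketch

missing_count = int(prices.isnull().sum())  # each True counts as 1
present = prices[prices.notnull()]          # keep only the non-missing entries
filled = prices.fillna(0)                   # new Series with NaN replaced by 0

print(missing_count)
print(present)
print(filled)
```

Neither notnull() nor fillna() changes prices itself; each returns a new object.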
5 The name attribute of a Series
- Code
obj2.name = 'Location Price'
obj2.index.name = 'locationName'
obj2
- Output
locationName
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: Location Price, dtype: float64
② DataFrame
1 Building a DataFrame
A DataFrame gets a default integer index automatically, and its columns follow the key order of the dict.
- Code
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
- Output
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
2 Specifying index and columns
- Code: specifying columns
pd.DataFrame(data, columns=['year', 'state', 'pop'])
- Output
| | year | state | pop |
|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
| 5 | 2003 | Nevada | 3.2 |
- Specifying both index and columns
Any column name that is not found in data is filled with NaN.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2
- Output
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
3 Column operations
① Adding a new column
Assigning to a column that does not exist creates a new column.
- Code
# Add a new column directly: a scalar is broadcast to every row, while an array must match the DataFrame's length
frame2['new'] = 99
frame2['new2'] = np.arange(3,9,1)
# Add via a Series: values are aligned to the DataFrame by index label
val = Series([1,3,-2],index = ['one','three','four'])
frame2['new3_Series'] = val
# Assign the result of a condition on the existing DataFrame to a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2
- Output
       year   state  pop debt  new  new2  new3_Series  eastern
one    2000    Ohio  1.5  NaN   99     3          1.0     True
two    2001    Ohio  1.7  NaN   99     4          NaN     True
three  2002    Ohio  3.6  NaN   99     5          3.0     True
four   2001  Nevada  2.4  NaN   99     6         -2.0    False
five   2002  Nevada  2.9  NaN   99     7          NaN    False
six    2003  Nevada  3.2  NaN   99     8          NaN    False
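The three ways of assigning a column behave differently when the lengths or labels do not line up. The sketch below, on a small made-up DataFrame (df and its column names are hypothetical), shows a scalar being broadcast, a wrong-length array being rejected, and a Series being aligned by index label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada']},
                  index=['one', 'two', 'three'])

# A scalar is broadcast to every row.
df['flag'] = 99

# An array must have exactly as many elements as there are rows;
# otherwise pandas raises ValueError and no column is created.
try:
    df['bad'] = np.arange(5)
    length_error = False
except ValueError:
    length_error = True

# A Series is aligned on the index: labels missing from the Series
# become NaN, and Series labels not in the DataFrame ('four') are ignored.
df['partial'] = pd.Series([1.5, -2.0], index=['one', 'four'])

print(length_error)
print(df)
```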
② Deleting a column
- Code
del frame2['new2']
frame2
- Output
       year   state  pop debt  new  new3_Series  eastern
one    2000    Ohio  1.5  NaN   99          1.0     True
two    2001    Ohio  1.7  NaN   99          NaN     True
three  2002    Ohio  3.6  NaN   99          3.0     True
four   2001  Nevada  2.4  NaN   99         -2.0    False
five   2002  Nevada  2.9  NaN   99          NaN    False
six    2003  Nevada  3.2  NaN   99          NaN    False
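del removes the column from the DataFrame in place. When the original should stay untouched, drop(columns=...) is an alternative that returns a new DataFrame instead. A minimal sketch on a made-up DataFrame (df and trimmed are hypothetical names):

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7], 'debt': [0.0, 0.0]})

# drop returns a new DataFrame; df itself still has the 'debt' column.
trimmed = df.drop(columns=['debt'])

# del modifies df in place.
del df['pop']

print(list(trimmed.columns))
print(list(df.columns))
```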
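③ Data that can be passed to DataFrame
4 Index objects
Index objects are immutable, so their individual elements cannot be modified in place (the whole index can still be replaced by assigning a new one, as with obj2.index above).
pandas Index objects are responsible for holding the axis labels and other metadata (such as the axis name). When a Series or DataFrame is built, any array or other sequence of labels that is used gets converted to an Index.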
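To make the immutability concrete, the sketch below tries to overwrite one label of an Index, which pandas rejects with a TypeError, and then replaces the whole index, which is allowed. The variable mutated is only there to record what happened.

```python
import pandas as pd

obj = pd.Series([1, 3, -5], index=['a', 'b', 'c'])

# Assigning to a single element of an Index is not allowed.
try:
    obj.index[1] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index with a new sequence of labels is fine.
obj.index = ['x', 'y', 'z']

print(mutated)
print(obj.index)
```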
Methods and properties of Index
5 reindex
reindex on a Series rearranges the data to match the new index. If an index value is not currently present, a missing value is introduced.
- Code
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
# If an index value is not currently present, a missing value is introduced
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
# Fill the gaps by forward-filling (method='ffill')
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj4 = obj3.reindex(range(6), method='ffill')
print(obj4)
# Reindex the rows
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
# Reindex the columns
states = ['Texas', 'Utah', 'California']
frame3 = frame.reindex(columns=states)
frame
frame2
frame3
- Output
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
- frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
- frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
- frame3
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Options for the method (interpolation) argument of reindex
Argument | Description |
---|---|
ffill or pad | Fill (carry) values forward |
bfill or backfill | Fill (carry) values backward |
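The two options in the table can be compared side by side on the same data as obj3 above. This is only a sketch; the names colors, forward and backward are chosen for the illustration.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill: each new label takes the last value seen before it.
forward = colors.reindex(range(6), method='ffill')

# bfill: each new label takes the next value after it;
# label 5 has nothing after it, so it stays NaN.
backward = colors.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```

Both methods need the original index to be sorted (monotonic), which holds for [0, 2, 4].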
③
1 Dropping entries from an axis
- Series
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj.drop(['c','d'])
a    0.0
b    1.0
e    4.0
dtype: float64
- DataFrame
# By default, DataFrame.drop removes rows (axis 0)
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['Ohio', 'Colorado', 'Utah', 'New York'],columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio'])
# Name the axis to drop columns instead
data.drop(['two', 'four'], axis='columns')
# Equivalent: data.drop(['two', 'four'], axis=1)
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
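2 Indexing
- Code
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
# A Series can be indexed with its own index labels
print(obj)
print("obj['b']",obj['b'])
# Integer positions go through iloc; plain obj[1] on a label index is deprecated in recent pandas
print("obj.iloc[1]",obj.iloc[1])
print("obj.iloc[2:4]",obj.iloc[2:4])
print("obj[['b', 'a', 'd']]",obj[['b', 'a', 'd']])
print("obj.iloc[[1, 3]]",obj.iloc[[1, 3]])
# A boolean condition on the values can also be used to select
print("obj[obj < 2]",obj[obj < 2])
- Output
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
obj['b'] 1.0
obj.iloc[1] 1.0
obj.iloc[2:4] c    2.0
d    3.0
dtype: float64
obj[['b', 'a', 'd']] b    1.0
a    0.0
d    3.0
dtype: float64
obj.iloc[[1, 3]] b    1.0
d    3.0
dtype: float64
obj[obj < 2] a    0.0
b    1.0
dtype: float64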
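One detail that the indexing example does not show: slicing by label includes the end label, while slicing by position excludes the end position. A short sketch on the same Series (by_label and by_position are names made up here):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: the end label 'c' is included
by_position = obj.iloc[1:2]     # position slice: position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
```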
3 loc and iloc
In recent versions ix has been removed, so there is no point in studying that part of the book. Luckily I was running the code while reading, otherwise the effort would have been wasted.
- Difference between the two
① loc indexes by label
② iloc indexes by integer position (integer location)
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data.loc[['Utah','Ohio'],'one'])
print(data.iloc[[0,2],[1,3]])
Utah    8
Ohio    0
Name: one, dtype: int32
      two  four
Ohio    1     3
Utah    9    11
print(data)
# Some conversions
data.columns # get all the column labels
data.columns.get_loc('two') # get the integer position of the column named 'two'
data.iloc[-1, data.columns.get_loc('two')] # -1 means the last row; the position found above goes in the column slot
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Index(['one', 'two', 'three', 'four'], dtype='object')
1
13
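As a further sketch of the same distinction, loc also accepts a boolean mask for the rows, and the same cell can be reached either by label or by position. The names big_rows, by_label and by_position are made up for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Boolean mask for the rows plus column labels, all through loc.
big_rows = data.loc[data['three'] > 5, ['one', 'three']]

# The same cell reached by label and by integer position.
by_label = data.loc['Utah', 'two']
by_position = data.iloc[2, 1]

print(big_rows)
print(by_label, by_position)
```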
4 Alignment in arithmetic
- Adding two DataFrame objects whose row and column labels differ: labels that do not overlap come out as NaN
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print(df2)
df1+df2
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
Out[100]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
- To deal with this, use the add method and pass a fill_value argument
df2.add(df1,fill_value=0)
Out[105]:
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0
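Tips
fill_value only fills a cell that exists in one object and is missing from the other; a cell that is missing from both is still NaN.
To turn every remaining NaN into 0, store the sum in a variable first, for example result = df2.add(df1,fill_value=0), and then use result[np.isnan(result)] = 0 or result.fillna(0).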
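The tip can be checked on the same df1 and df2. The sketch below counts the cells that fill_value could not fill and then clears them with fillna(0). The names result, remaining_nan and cleaned are made up for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

result = df2.add(df1, fill_value=0)

# Cells missing from BOTH inputs are still NaN after fill_value.
remaining_nan = int(result.isna().sum().sum())

# fillna returns a new DataFrame with every remaining NaN replaced.
cleaned = result.fillna(0)

print(remaining_nan)
print(cleaned)
```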
5 DataFrame + Series
By default the index of the Series is matched against the columns of the DataFrame, and the operation is then broadcast down the rows. A label that is found in only one of the two objects shows up in the result as a column filled with NaN.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
print(frame)
print(series)
frame + series
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
Out[114]:
          b     d     e
Utah    0.0   2.0   4.0
Ohio    3.0   5.0   7.0
Texas   6.0   8.0  10.0
Oregon  9.0  11.0  13.0
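Two cases go beyond the example above: a Series whose labels only partly match the columns, and matching on the row index instead of the columns. The sketch below covers both; series2, by_columns, col_d and by_rows are names chosen here, and the row-wise case uses the sub method with axis='index'.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Series labels are matched against the columns. 'f' is not a column and
# 'd' is not in the Series, so both columns come out as all NaN.
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
by_columns = frame + series2

# To match on the row index instead, use an arithmetic method with axis='index'.
col_d = frame['d']
by_rows = frame.sub(col_d, axis='index')

print(by_columns)
print(by_rows)
```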
6 Sorting and ranking
- Sorting by axis labels
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
print(frame)
# Sort the row labels in ascending order ('one' comes before 'three')
frame = frame.sort_index(axis=0, ascending=True)
print(frame)
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
- Sorting by values
# sort_values returns a new sorted object; frame itself is left unchanged
print(frame.sort_values(by='d'))
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
# Missing values are placed at the end by default
obj.sort_values()
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
Out[124]:
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
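Two options of sort_values come up often: sorting by several columns at once and moving missing values to the front. The sketch below uses a small made-up DataFrame; by_two and nan_first are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' first, then by 'b' among rows with the same 'a'.
by_two = frame.sort_values(by=['a', 'b'])

# NaN goes last by default; na_position='first' moves it to the front.
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
nan_first = obj.sort_values(na_position='first')

print(by_two)
print(nan_first)
```

In both calls the original object keeps its order, because sort_values returns a new object.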
- rank
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank(method='first')
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
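method='first' breaks ties by order of appearance. For comparison, the sketch below ranks the same data with the default method (ties share the average rank), with method='min', and in descending order with method='max'. The variable names are made up for the illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                # default: tied values share the mean of their ranks
lowest = obj.rank(method='min')     # tied values all take the lowest rank of the group
descending = obj.rank(ascending=False, method='max')  # rank from largest to smallest

print(average.tolist())
print(lowest.tolist())
print(descending.tolist())
```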
7 Statistical methods
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
print(df)
# Mean across the columns of each row; NaN values are skipped
df.mean(axis='columns', skipna=True)
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
Out[128]:
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
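For contrast with skipna=True, the sketch below shows the default column-wise reduction and what happens when NaN is not skipped. The names col_sums and strict_means are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default axis: one result per column, with NaN skipped.
col_sums = df.sum()

# skipna=False: any NaN in a row makes that row's mean NaN.
strict_means = df.mean(axis='columns', skipna=False)

print(col_sums)
print(strict_means)
```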
8 Unique values, value counts, and membership
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
# The top-level pd.value_counts is deprecated in recent pandas, so the Series method is used here; with sort=False the row order can differ between versions
counts = obj.value_counts(sort=False)
mask = obj.isin(['b', 'c'])
print("uniques",uniques)
print("counts",counts)
print("mask",mask)
obj[mask]
uniques ['c' 'a' 'd' 'b']
counts b    2
d    1
a    3
c    3
dtype: int64
mask 0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
Out[130]:
0    c
5    b
6    b
7    c
8    c
dtype: object
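As a compact recap of this section, the sketch below uses value_counts with its default sorting (most frequent first) and counts how many entries pass the isin test. The names counts and n_b_or_c are chosen for this illustration.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Default: sorted by count in descending order.
counts = obj.value_counts()

# isin gives a boolean mask; summing it counts the matches.
n_b_or_c = int(obj.isin(['b', 'c']).sum())

print(counts)
print(n_b_or_c)
```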