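🙌🙌🙌 pandas basics
My first little blog post (really just my class notes 😜), hahaha
I wrote it in Jupyter and pasted it over, so it is pretty messy; please bear with me
First, some basic code demos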
import pandas as pd
dict_data = {
'student': ['lilei', 'hanmeimei', 'madongmei'],
'score': [98,99,100],
'gender':['M','F','F']
}
data = pd.DataFrame(dict_data)
print(data)
student score gender
0 lilei 98 M
1 hanmeimei 99 F
2 madongmei 100 F
print(data['student'])
0 lilei
1 hanmeimei
2 madongmei
Name: student, dtype: object
print(data.columns)
Index(['student', 'score', 'gender'], dtype='object')
Use index to set the row labels (and columns to choose the column order)
data = pd.DataFrame(dict_data,
columns = ['gender', 'student', 'score'],
index = ['a', 'b', 'c'] )
print(data)
gender student score
a M lilei 98
b F hanmeimei 99
c F madongmei 100
Get a single column from a DataFrame
print(data['student'])
print(data.student)
a lilei
b hanmeimei
c madongmei
Name: student, dtype: object
a lilei
b hanmeimei
c madongmei
Name: student, dtype: object
Get a single row from a DataFrame
By row position (iloc)
print(data.iloc[0])
gender M
student lilei
score 98
Name: a, dtype: object
By row label (loc)
print(data.loc['a'])
gender M
student lilei
score 98
Name: a, dtype: object
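To make the difference between the two lookups concrete, here is a minimal sketch that rebuilds the same toy table. The names row_by_position and row_by_label are my own, not from the class notes.

```python
import pandas as pd

# Same toy table as above, with string row labels.
data = pd.DataFrame(
    {"gender": ["M", "F", "F"],
     "student": ["lilei", "hanmeimei", "madongmei"],
     "score": [98, 99, 100]},
    index=["a", "b", "c"],
)

row_by_position = data.iloc[0]  # first row, whatever its label is
row_by_label = data.loc["a"]    # the row labelled 'a'

# Both lookups return the same row here.
print(row_by_position.equals(row_by_label))  # True

# loc also accepts a column label to reach a single cell.
print(data.loc["b", "score"])  # 99
```

iloc always counts positions from 0, while loc looks up the labels you chose for the index, so the two only agree when position and label happen to point at the same row.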
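Slicing
**A slice can be a view of the original data, so writing to it can change the original (that is what happened with the pandas version used here; with Copy-on-Write enabled in newer pandas, the original stays unchanged)
*To get an independent copy that never affects the original, use data['score'].copy() in the example below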
slice_data = data['score']
print(slice_data)
slice_data.iloc[0] = 70  # explicit positional assignment, since the labels are 'a', 'b', 'c'
print(data)
a 98
b 99
c 100
Name: score, dtype: int64
gender student score
a M lilei 70
b F hanmeimei 99
c F madongmei 100
D:\Anaconda\anaconda\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
This is separate from the ipykernel package so we can avoid doing imports until
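The copy() advice above can be sketched as follows. This is a minimal illustration on a one-column table, and the name independent is made up for the example. Whatever the pandas version, an explicit copy should leave the original untouched.

```python
import pandas as pd

data = pd.DataFrame({"score": [98, 99, 100]}, index=["a", "b", "c"])

# Note the parentheses: copy() is a method call that returns a new Series.
independent = data["score"].copy()
independent.loc["a"] = 70  # change the copy only

print(independent.loc["a"])     # 70
print(data.loc["a", "score"])   # 98, the original is untouched
```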
slice_data = data['score'].copy()  # copy() needs the parentheses; without them slice_data is the method itself, not a Series
print(slice_data)
slice_data.iloc[1] = 80  # only the copy changes
print(data)
a 70
b 99
c 100
Name: score, dtype: int64
gender student score
a M lilei 70
b F hanmeimei 99
c F madongmei 100
Index-aligned assignment
score_data = pd.Series([1,2,3],
index = ['c', 'a', 'b']
)
print(data)
print(score_data)
gender student score
a M lilei 70
b F hanmeimei 99
c F madongmei 100
c 1
a 2
b 3
dtype: int64
data['score'] = score_data
print(data)
gender student score
a M lilei 2
b F hanmeimei 3
c F madongmei 1
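A small sketch of what alignment means when a label is missing. It assumes a made-up Series called partial that has no entry for 'b'; values are matched by label, and the unmatched row ends up as NaN.

```python
import pandas as pd

data = pd.DataFrame({"student": ["lilei", "hanmeimei", "madongmei"]},
                    index=["a", "b", "c"])

# The Series is in a different order and has no entry for 'b'.
partial = pd.Series([1, 2], index=["c", "a"])

data["score"] = partial  # matched by label, not by position
print(data)
```

Because of the NaN, the score column is stored as float here, which is the same effect seen later when reindex adds empty rows.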
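Delete a column from a DataFrame
**del
data_without_score = data.copy()  # work on a copy, because the examples below still use the score column
del data_without_score['score']
print(data_without_score)
gender student
a M lilei
b F hanmeimei
c F madongmei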
Reorder the index (which reorders the rows)
data = data.reindex(['c','b','a'])
print(data)
gender student score
c F madongmei 1
b F hanmeimei 3
a M lilei 2
Add an extra row
data = data.reindex(['c','b','a','d'])
print(data)
gender student score
c F madongmei 1.0
b F hanmeimei 3.0
a M lilei 2.0
d NaN NaN NaN
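Add an extra row and give it a value (fill_value only applies to newly created rows, so the existing row d keeps its NaN)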
data = data.reindex(['c','b','a','d','e'], fill_value=0)
print(data)
gender student score
c F madongmei 1.0
b F hanmeimei 3.0
a M lilei 2.0
d NaN NaN NaN
e 0 0 0.0
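Fill missing values from the previous or the next row
ffill: fill missing values from the row before (forward fill)
bfill: fill missing values from the row after (backward fill)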
data = data.reindex(['c','b','a','d','e','f']).ffill()
print(data)
gender student score
c F madongmei 1.0
b F hanmeimei 3.0
a M lilei 2.0
d M lilei 2.0
e 0 0 0.0
f 0 0 0.0
data = data.reindex(['c','b','a','d','e','f','g']).bfill()  # g stays NaN: no row follows it to fill from
print(data)
gender student score
c F madongmei 1.0
b F hanmeimei 3.0
a M lilei 2.0
d M lilei 2.0
e 0 0 0.0
f 0 0 0.0
g NaN NaN NaN
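Here is a smaller sketch of the same reindex-then-fill idea on a made-up two-value Series, so the direction of each fill is easier to see. The names scores, expanded, forward and backward are mine.

```python
import pandas as pd

scores = pd.Series([1.0, 3.0], index=["a", "c"])

# 'b' and 'd' do not exist yet, so reindex creates them as NaN.
expanded = scores.reindex(["a", "b", "c", "d"])

forward = expanded.ffill()   # b takes a's value, d takes c's value
backward = expanded.bfill()  # b takes c's value, d has nothing after it

print(forward.tolist())   # [1.0, 1.0, 3.0, 3.0]
print(backward.tolist())  # [1.0, 3.0, 3.0, nan]
```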
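Drop rows that contain any missing data (NaN)
#print(data.dropna())
Drop rows in which every value is missing (NaN)
#print(data.dropna(how='all'))
Fill all missing data with a single value
#print(data.fillna(0))  # every missing cell becomes 0
Fill missing data with a different value per column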
print(data.fillna({'gender': 'M', 'student': 'unknown', 'score': 60}))
gender student score
c F madongmei 1.0
b F hanmeimei 3.0
a M lilei 2.0
d M lilei 2.0
e 0 0 0.0
f 0 0 0.0
g M unknown 60.0
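Since the dropna calls above are commented out, here is a small runnable sketch of how the default differs from how='all'. The table and the variable names are invented for the illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"student": ["lilei", None, None],
     "score": [98.0, 60.0, np.nan]},
    index=["a", "b", "c"],
)

any_missing_dropped = frame.dropna()           # drops b (one NaN) and c (all NaN)
all_missing_dropped = frame.dropna(how="all")  # drops only c
filled = frame.fillna({"student": "unknown", "score": 0})

print(any_missing_dropped)
print(all_missing_dropped)
print(filled)
```

None of these calls change frame itself; each returns a new DataFrame, which is why the notes assign or print the result.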
Delete a row
data = data.drop('f')
print(data)
gender student score
c F madongmei 1.0
b F hanmeimei 3.0
a M lilei 2.0
d M lilei 2.0
e 0 0 0.0
g NaN NaN NaN
🙌🙌🙌 Filtering data in pandas
print(data[data['score']>=2])
gender student score
b F hanmeimei 3.0
a M lilei 2.0
d M lilei 2.0
Select the rows whose value appears in a list
select_list = [1,3]
print(data[data['score'].isin(select_list)])
gender student score
c F madongmei 1.0
b F hanmeimei 3.0
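The two filters above can also be combined. A minimal sketch, reusing the first toy table from these notes; the result names are made up.

```python
import pandas as pd

data = pd.DataFrame(
    {"student": ["lilei", "hanmeimei", "madongmei"],
     "score": [98, 99, 100],
     "gender": ["M", "F", "F"]}
)

# Each condition needs its own parentheses; & means "and", | means "or", ~ negates.
high_scoring_girls = data[(data["score"] >= 99) & (data["gender"] == "F")]
not_listed = data[~data["score"].isin([98, 99])]

print(high_scoring_girls)
print(not_listed)
```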
👏 Group the data with groupby and compute sum, mean, and so on
data = pd.DataFrame({
'tag_id': ['a', 'b', 'c', 'a', 'a', 'c'],
'count': [10,30,20,10,15,22]
})
grouped_data = data.groupby('tag_id')
print(grouped_data.sum())
# Sums the count values for each tag_id; the group keys become the index
count
tag_id
a 35
b 30
c 42
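To get several statistics per group in one go, agg can take a list of function names. A minimal sketch on the same data; summary is a name I picked.

```python
import pandas as pd

data = pd.DataFrame({
    "tag_id": ["a", "b", "c", "a", "a", "c"],
    "count": [10, 30, 20, 10, 15, 22],
})

# One row per tag_id, one column per statistic.
summary = data.groupby("tag_id")["count"].agg(["sum", "mean", "max"])
print(summary)
```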
Sorting: by index label, ascending
print(data.sort_index())
tag_id count
0 a 10
1 b 30
2 c 20
3 a 10
4 a 15
5 c 22
Sorting: by index label, descending
print(data.sort_index(ascending = False))
tag_id count
5 c 22
4 a 15
3 a 10
2 c 20
1 b 30
0 a 10
Sorting: by the values in a column
print(data.sort_values(by = 'count'))
tag_id count
0 a 10
3 a 10
4 a 15
2 c 20
5 c 22
1 b 30
print(data.sort_values(by = 'count', ascending = False))
tag_id count
1 b 30
5 c 22
2 c 20
4 a 15
0 a 10
3 a 10
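sort_values also accepts a list of columns, with one ascending flag per column. A small sketch on the same table, sorting by tag_id first and then by count from high to low; ranked is a made-up name.

```python
import pandas as pd

data = pd.DataFrame({
    "tag_id": ["a", "b", "c", "a", "a", "c"],
    "count": [10, 30, 20, 10, 15, 22],
})

# tag_id ascending, and within each tag_id the largest count first.
ranked = data.sort_values(by=["tag_id", "count"], ascending=[True, False])
print(ranked)
```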
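Some commonly used methods
count | number of non-NaN values
min, max | minimum and maximum values
argmin, argmax | integer positions of the minimum and maximum (idxmin, idxmax return the index labels instead)
sum | sum of the values
mean | mean
median | median
var | variance
std | standard deviation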
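A quick sketch that calls the methods from the list on a made-up Series with one NaN in it. I use idxmin and idxmax here because they return labels, which is usually what you want with a labelled index.

```python
import numpy as np
import pandas as pd

scores = pd.Series([98.0, np.nan, 100.0, 99.0], index=["a", "b", "c", "d"])

print(scores.count())                    # 3, the NaN is not counted
print(scores.min(), scores.max())        # NaN is skipped by default
print(scores.idxmin(), scores.idxmax())  # labels of the smallest and largest values
print(scores.sum(), scores.mean(), scores.median())
print(scores.var(), scores.std())        # sample variance and standard deviation (ddof=1)
```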
One axis can carry several index levels (hierarchical indexing)
import numpy as np
book_ratings = pd.Series(
np.random.randint(1,6,size=7),
index = [
['b1','b1','b2','b2','b3','b4','b4'],
[1,2,1,2,1,2,3]
]
)
print(book_ratings)
# There are two index levels (the ratings are random, so your values will differ)
b1 1 3
2 4
b2 1 4
2 1
b3 1 5
b4 2 4
3 2
dtype: int32
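Once a Series has two index levels, you can select by the outer level alone or by both. A minimal sketch, using fixed ratings instead of random ones so the result is reproducible.

```python
import pandas as pd

book_ratings = pd.Series(
    [3, 4, 4, 1, 5, 4, 2],
    index=[
        ["b1", "b1", "b2", "b2", "b3", "b4", "b4"],
        [1, 2, 1, 2, 1, 2, 3],
    ],
)

print(book_ratings["b1"])           # all ratings for b1, indexed by the inner level
print(book_ratings.loc[("b2", 2)])  # one rating: outer label b2, inner label 2
print(book_ratings.unstack())       # inner level becomes columns; missing pairs are NaN
```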
Merging two DataFrames
book_name = pd.DataFrame({
'book_name': ['a','b','c','d','e','f'],
'book_id': [11,22,33,44,55,66]
})
id_rating = pd.DataFrame({
'book_id': [11,22,33,44,55,66,33,11,55],
'rating': [1,3,5,2,4,3,2,4,5]
})
print(pd.merge(book_name, id_rating))
# Join: rows are matched on the column both tables share (book_id)
book_name book_id rating
0 a 11 1
1 a 11 4
2 b 22 3
3 c 33 5
4 c 33 2
5 d 44 2
6 e 55 4
7 e 55 5
8 f 66 3
Merging two DataFrames without specifying the join type
🐱👤🐱👤🐱👤 The default is an inner join: keep the keys both sides share, drop the rest
data1 = pd.DataFrame({
'key': ['a','b','a','c','b','d'],
'data1': [1,2,3,4,5,6]
})
data2 = pd.DataFrame({
'key': ['a','b','c'],
'data2': [8,9,7]
})
print(pd.merge(data1, data2))
key data1 data2
0 a 1 8
1 a 3 8
2 b 2 9
3 b 5 9
4 c 4 7
Merging two DataFrames with an explicit join type
print(pd.merge(data1, data2, how = 'outer'))
# how can also be 'left' or 'right'
key data1 data2
0 a 1 8.0
1 a 3 8.0
2 b 2 9.0
3 b 5 9.0
4 c 4 7.0
5 d 6 NaN
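To see where each row of a join came from, merge has an indicator option. A small sketch on shortened versions of data1 and data2 (three rows each, my own simplification), comparing a left join with an outer join.

```python
import pandas as pd

data1 = pd.DataFrame({"key": ["a", "b", "d"], "data1": [1, 2, 6]})
data2 = pd.DataFrame({"key": ["a", "b", "c"], "data2": [8, 9, 7]})

# Left join: keep every row of data1, even when data2 has no match.
left = pd.merge(data1, data2, on="key", how="left")

# Outer join: keep keys from both sides; _merge says which side each row came from.
outer = pd.merge(data1, data2, on="key", how="outer", indicator=True)

print(left)
print(outer)
```

In the outer result, the _merge column reads left_only, right_only or both, which is handy for spotting keys that failed to match.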
Merging two DataFrames, naming the column to join on
print(pd.merge(data1,data2,on = 'key'))
key data1 data2
0 a 1 8
1 a 3 8
2 b 2 9
3 b 5 9
4 c 4 7
Merging two DataFrames, naming the join column separately for each side
data1 = pd.DataFrame({
'lkey': ['a','b','a','c','b','d'],
'data1': [1,2,3,4,5,6]
})
data2 = pd.DataFrame({
'rkey': ['a','b','c'],
'data2': [8,9,7]
})
print(pd.merge(data1, data2, left_on = 'lkey', right_on = 'rkey'))
lkey data1 rkey data2
0 a 1 a 8
1 a 3 a 8
2 b 2 b 9
3 b 5 b 9
4 c 4 c 7
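🙌🙌🙌 Reading and writing files with pandas
Reading a CSV file
#data = pd.read_csv('rating.csv')
#print(data)
#>> By default the first row is treated as the header row
#data = pd.read_csv('rating.csv', header = None)
#print(data)
#>> Tells pandas not to treat the first row as a header
#data = pd.read_csv('rating.csv', names = ['user_id', 'book_id', 'rating'])
#print(data)
#>> Of course, you can also supply your own column names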
Specifying the index column
#data = pd.read_csv('rating.csv',
# names = ['user_id', 'book_id', 'rating'],
# index_col = 'user_id'
# )
#print(data)
# Uses user_id as the index
Specifying the separator
#data = pd.read_csv('rating.csv',
# names = ['user_id', 'book_id', 'rating'],
# sep = '|'
# )
#print(data)
#>> Splits the fields on "|"
Saving data to a CSV file
#data.to_csv('filename.csv')
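The read_csv options above are all commented out because they need a rating.csv on disk. Here is a sketch that runs without any file by feeding read_csv an in-memory string. The two data rows are invented, and I am assuming the file is header-less and pipe-separated.

```python
import io

import pandas as pd

# Stand-in for a header-less, pipe-separated rating.csv (made-up contents).
raw = "1|11|5\n2|22|3\n"

ratings = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    header=None,
    names=["user_id", "book_id", "rating"],
    index_col="user_id",
)
print(ratings)

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = ratings.to_csv()
print(csv_text)
```

Passing a file name to to_csv, as in the line above, writes the same text to disk instead.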