01. Pandas Overview
import pandas as pd
df = pd.read_csv('./data/example.csv')
df
| | Name | Age | Sex | City |
---|---|---|---|---|
0 | Alice | 30 | female | New York |
1 | Bob | 25 | male | Los Angeles |
2 | Charlie | 35 | female | Chicago |
3 | Tom | 60 | male | Beijing |
4 | Danny | 45 | male | Canada |
5 | Jenny | 75 | female | Shanghai |
.head() returns the first few rows; pass a number to choose how many (the default is 5)
df.head(3)
| | Name | Age | Sex | City |
---|---|---|---|---|
0 | Alice | 30 | female | New York |
1 | Bob | 25 | male | Los Angeles |
2 | Charlie | 35 | female | Chicago |
.info() prints a summary of the DataFrame: index range, columns, non-null counts, dtypes, and memory usage
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
0 Name 6 non-null object
1 Age 6 non-null int64
2 Sex 6 non-null object
3 City 6 non-null object
dtypes: int64(1), object(3)
memory usage: 324.0+ bytes
df.index returns the index (row labels) of the DataFrame df
df.index
RangeIndex(start=0, stop=6, step=1)
df.columns returns the column labels of the DataFrame df
df.columns
Index(['Name', 'Age', 'Sex', 'City'], dtype='object')
df.dtypes shows the data type of each column in df. It returns a Series whose index is the column names and whose values are the column dtypes
df.dtypes
Name object
Age int64
Sex object
City object
dtype: object
df.values returns the data in df as a NumPy array. The array holds every value in the DataFrame but not the index or the column names
df.values
array([['Alice', 30, 'female', 'New York'],
       ['Bob', 25, 'male', 'Los Angeles'],
       ['Charlie', 35, 'female', 'Chicago'],
       ['Tom', 60, 'male', 'Beijing'],
       ['Danny', 45, 'male', 'Canada'],
       ['Jenny', 75, 'female', 'Shanghai']], dtype=object)
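The file ./data/example.csv is not included here, so the sketch below rebuilds the same six rows in memory (values copied from the output above; the name df_demo is only for illustration) and checks the attributes introduced in this section. It also uses to_numpy(), which the pandas documentation recommends over .values.

```python
import pandas as pd

# Rebuild the example table in memory (values copied from the output above).
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Tom', 'Danny', 'Jenny'],
    'Age': [30, 25, 35, 60, 45, 75],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Beijing', 'Canada', 'Shanghai'],
})

print(df_demo.shape)           # (number of rows, number of columns)
print(list(df_demo.columns))   # column labels as a plain list
print(df_demo.head(3))         # first three rows
arr = df_demo.to_numpy()       # the values only, without index or column labels
print(arr.shape)
```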
02. Basic Pandas Operations
Create a DataFrame from scratch
import pandas as pd
data = {
    'country': ['aaa', 'bbb', 'ccc'],
    'population': [10, 12, 14],
}
df_data = pd.DataFrame(data)
df_data
| | country | population |
---|---|---|
0 | aaa | 10 |
1 | bbb | 12 |
2 | ccc | 14 |
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
0 country 3 non-null object
1 population 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes
Select specific data
df
| | Name | Age | Sex | City |
---|---|---|---|---|
0 | Alice | 30 | female | New York |
1 | Bob | 25 | male | Los Angeles |
2 | Charlie | 35 | female | Chicago |
3 | Tom | 60 | male | Beijing |
4 | Danny | 45 | male | Canada |
5 | Jenny | 75 | female | Shanghai |
age = df['Age']
# age
age[:5]  # same as age[0:5]
0 30
1 25
2 35
3 60
4 45
Name: Age, dtype: int64
A Series is a single row or column of a DataFrame
age.index
RangeIndex(start=0, stop=6, step=1)
age.values
array([30, 25, 35, 60, 45, 75], dtype=int64)
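To make the row/column point concrete, here is a small sketch on a two-row frame (df_demo is a made-up name for illustration). A column comes back as a Series indexed by the row labels, while a row comes back as a Series indexed by the column labels.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [30, 25]})

col = df_demo['Age']    # a column: Series indexed by the row labels
row = df_demo.iloc[0]   # a row: Series indexed by the column labels

print(type(col).__name__, list(col.index))
print(type(row).__name__, list(row.index))
```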
You can set the index yourself
df = df.set_index('Name')
df.head()
Name | Age | Sex | City |
---|---|---|---|
Alice | 30 | female | New York |
Bob | 25 | male | Los Angeles |
Charlie | 35 | female | Chicago |
Tom | 60 | male | Beijing |
Danny | 45 | male | Canada |
df['Age'][:5]
Name
Alice 30
Bob 25
Charlie 35
Tom 60
Danny 45
Name: Age, dtype: int64
age = df['Age']
age[:5]
Name
Alice 30
Bob 25
Charlie 35
Tom 60
Danny 45
Name: Age, dtype: int64
age['Alice']
30
age = age + 10
age[:5]
Name
Alice 40
Bob 35
Charlie 45
Tom 70
Danny 55
Name: Age, dtype: int64
age = age * 10
age[:5]
Name
Alice 400
Bob 350
Charlie 450
Tom 700
Danny 550
Name: Age, dtype: int64
age.mean()
550.0
age.max()
850
age.min()
350
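Arithmetic on a Series is applied element by element and returns a new Series. The sketch below repeats the two steps above in one expression on a hand-built copy of the Age column (values taken from the table earlier) and confirms the statistics shown.

```python
import pandas as pd

age = pd.Series(
    [30, 25, 35, 60, 45, 75],
    index=['Alice', 'Bob', 'Charlie', 'Tom', 'Danny', 'Jenny'],
    name='Age',
)

scaled = (age + 10) * 10   # element-wise; the result is a new Series

print(scaled.mean(), scaled.max(), scaled.min())
print(age['Alice'])        # the original Series is untouched
```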
.describe() returns basic summary statistics for the numeric columns
df.describe()
| | Age |
---|---|
count | 6.000000 |
mean | 45.000000 |
std | 19.235384 |
min | 25.000000 |
25% | 31.250000 |
50% | 40.000000 |
75% | 56.250000 |
max | 75.000000 |
03. Pandas Indexing
import pandas as pd
df = pd.read_csv('./data/example.csv')
df
| | Name | Age | Sex | City | Price |
---|---|---|---|---|---|
0 | Alice | 30 | female | New York | 100.4 |
1 | Bob | 25 | male | Los Angeles | 60.6 |
2 | Charlie | 35 | female | Chicago | 50.2 |
3 | Tom | 60 | male | Beijing | 64.8 |
4 | Danny | 45 | male | Canada | 75.5 |
5 | Jenny | 75 | female | Shanghai | 90.5 |
df[['Age','Name','Price']][:4]
| | Age | Name | Price |
---|---|---|---|
0 | 30 | Alice | 100.4 |
1 | 25 | Bob | 60.6 |
2 | 35 | Charlie | 50.2 |
3 | 60 | Tom | 64.8 |
- loc selects by label
- iloc selects by integer position
df.iloc[0]
Name Alice
Age 30
Sex female
City New York
Price 100.4
Name: 0, dtype: object
df.iloc[0:2]
| | Name | Age | Sex | City | Price |
---|---|---|---|---|---|
0 | Alice | 30 | female | New York | 100.4 |
1 | Bob | 25 | male | Los Angeles | 60.6 |
df.iloc[0:2,1:3]
| | Age | Sex |
---|---|---|
0 | 30 | female |
1 | 25 | male |
df = df.set_index('Name')
df
Name | Age | Sex | City | Price |
---|---|---|---|---|
Alice | 30 | female | New York | 100.4 |
Bob | 25 | male | Los Angeles | 60.6 |
Charlie | 35 | female | Chicago | 50.2 |
Tom | 60 | male | Beijing | 64.8 |
Danny | 45 | male | Canada | 75.5 |
Jenny | 75 | female | Shanghai | 90.5 |
df.loc['Bob']
Age 25
Sex male
City Los Angeles
Price 60.6
Name: Bob, dtype: object
df.loc['Bob','Price']
60.6
df.loc['Bob':'Danny',:]
Name | Age | Sex | City | Price |
---|---|---|---|---|
Bob | 25 | male | Los Angeles | 60.6 |
Charlie | 35 | female | Chicago | 50.2 |
Tom | 60 | male | Beijing | 64.8 |
Danny | 45 | male | Canada | 75.5 |
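One detail worth noting in the slice above: a label slice with loc includes both endpoints, whereas a position slice with iloc excludes the stop position. A minimal sketch on a cut-down frame (df_demo is an illustrative name, with ages copied from the table):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Age': [30, 25, 35, 60, 45]},
    index=['Alice', 'Bob', 'Charlie', 'Tom', 'Danny'],
)

by_label = df_demo.loc['Bob':'Tom']   # label slice: both endpoints included
by_pos = df_demo.iloc[1:3]            # position slice: stop is excluded

print(list(by_label.index))
print(list(by_pos.index))
```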
df.loc['Bob','Price'] = 1000
df.head()
Name | Age | Sex | City | Price |
---|---|---|---|---|
Alice | 30 | female | New York | 100.4 |
Bob | 25 | male | Los Angeles | 1000.0 |
Charlie | 35 | female | Chicago | 50.2 |
Tom | 60 | male | Beijing | 64.8 |
Danny | 45 | male | Canada | 75.5 |
- Boolean indexing
df['Price'] > 100
Name
Alice True
Bob True
Charlie False
Tom False
Danny False
Jenny False
Name: Price, dtype: bool
df[df['Price'] > 100]
Name | Age | Sex | City | Price |
---|---|---|---|---|
Alice | 30 | female | New York | 100.4 |
Bob | 25 | male | Los Angeles | 1000.0 |
df[df['Sex'] == 'male'][:5]
Name | Age | Sex | City | Price |
---|---|---|---|---|
Bob | 25 | male | Los Angeles | 1000.0 |
Tom | 60 | male | Beijing | 64.8 |
Danny | 45 | male | Canada | 75.5 |
Compute the mean Age of the rows where Sex is male
df.loc[df['Sex'] == 'male','Age'].mean()
43.333333333333336
Count the people whose Age is greater than 40
(df['Age'] > 40).sum()
3
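Boolean masks can also be combined. The sketch below, on a hand-built copy of the example rows, selects the males older than 40. Each condition needs its own parentheses, and the element-wise operators & and | are used rather than Python's and / or.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Tom', 'Danny', 'Jenny'],
    'Age': [30, 25, 35, 60, 45, 75],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],
})

# Combine two conditions element-wise; parentheses are required around each one.
mask = (df_demo['Sex'] == 'male') & (df_demo['Age'] > 40)

older_men = df_demo.loc[mask, 'Name'].tolist()
print(older_men)
print(int(mask.sum()))   # True counts as 1, so sum() gives the number of matches
```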
04. groupby Operations
import pandas as pd
df = pd.DataFrame({
'key':['A','B','C','A','B','C','A','B','C'],
'data':[1,5,10,5,10,15,10,15,20]})
df
| | key | data |
---|---|---|
0 | A | 1 |
1 | B | 5 |
2 | C | 10 |
3 | A | 5 |
4 | B | 10 |
5 | C | 15 |
6 | A | 10 |
7 | B | 15 |
8 | C | 20 |
Suppose A, B, and C are three brands and data holds their sales figures; compute the total sales of each brand
for key in ['A','B','C']:
    # print(key, df[df['key'] == key].sum())
    print(key)
    print(df[df['key'] == key].sum())
A
key AAA
data 16
dtype: object
B
key BBB
data 30
dtype: object
C
key CCC
data 45
dtype: object
df.groupby('key').sum()
key | data |
---|---|
A | 16 |
B | 30 |
C | 45 |
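aggregate is a method of the pandas DataFrameGroupBy object that performs aggregation on grouped data. groupby splits the data into groups by the values of one or more columns, and aggregate then applies one or more aggregation functions to each group.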
df.groupby('key').aggregate('mean')
key | data |
---|---|
A | 5.333333 |
B | 10.000000 |
C | 15.000000 |
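Since aggregate accepts more than one function, a list of function names yields one output column per function. A short sketch on the same brand data, using agg (the shorter alias of aggregate) and an illustrative frame name df_sales:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'data': [1, 5, 10, 5, 10, 15, 10, 15, 20],
})

# One row per group, one column per aggregation function.
summary = df_sales.groupby('key')['data'].agg(['sum', 'mean', 'max'])
print(summary)
```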
df = pd.read_csv('./data/example.csv')
- Group by Sex and compute the mean Age
df.groupby('Sex')['Age'].mean()
Sex
female 46.666667
male 43.333333
Name: Age, dtype: float64
05. Numeric Operations
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]],index = ['a','b'],columns = ['A','B','C'])
df
| | A | B | C |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
By default, sum() adds up each column (axis=0)
df.sum()
A 5
B 7
C 9
dtype: int64
df.sum(axis = 0)
A 5
B 7
C 9
dtype: int64
Sum across each row
df.sum(axis = 1)
a 6
b 15
dtype: int64
The axis can also be given by name: axis='columns' is the same as axis=1
df.sum(axis = 'columns')
a 6
b 15
dtype: int64
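The axis argument names the direction that gets collapsed: axis=0 (or 'index') runs down the rows and leaves one value per column, while axis=1 (or 'columns') runs across the columns and leaves one value per row. A quick check that the two spellings agree:

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

per_column = df_demo.sum(axis='index')    # same as axis=0: one total per column
per_row = df_demo.sum(axis='columns')     # same as axis=1: one total per row

print(per_column.tolist())
print(per_row.tolist())
```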
Mean
df.mean()
A 2.5
B 3.5
C 4.5
dtype: float64
df.mean(axis = 1)
a 2.0
b 5.0
dtype: float64
Minimum and maximum (a notebook cell displays only its last expression, so the output below is df.max())
df.min()
df.max()
A 4
B 5
C 6
dtype: int64
Median: column A holds the values 1 and 4, so its median is (1 + 4) / 2 = 2.5
df
| | A | B | C |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
df.median()
A 2.5
B 3.5
C 4.5
dtype: float64
Bivariate statistics
df = pd.read_csv('./data/example.csv')
df.head()
| | Age | Name | Sex | City | Price |
---|---|---|---|---|---|
0 | 30 | Alice | female | New York | 100.4 |
1 | 25 | Bob | male | Los Angeles | 60.6 |
2 | 35 | Charlie | female | Chicago | 50.2 |
3 | 60 | Tom | male | Beijing | 64.8 |
4 | 45 | Danny | male | Canada | 75.5 |
Compute the covariance for the numeric columns only
import numpy as np
numeric_df = df.select_dtypes(include=[np.number])
numeric_df.cov()
| | Age | Price |
---|---|---|
Age | 309.027778 | 16.361111 |
Price | 16.361111 | 357.706944 |
Compute the correlation matrix
1 means a perfect positive correlation,
-1 means a perfect negative correlation,
0 means no linear correlation.
numeric_df.corr()
| | Age | Price |
---|---|---|
Age | 1.00000 | 0.04921 |
Price | 0.04921 | 1.00000 |
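The two matrices are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks that relation. It uses the six Age/Price rows printed in section 03 rather than the file loaded in this section, so its numbers will not match the matrices above.

```python
import numpy as np
import pandas as pd

# Six rows copied from the section 03 table (an assumption for illustration).
df_demo = pd.DataFrame({
    'Age': [30, 25, 35, 60, 45, 75],
    'Price': [100.4, 60.6, 50.2, 64.8, 75.5, 90.5],
})

cov = df_demo.cov()   # sample covariance (ddof=1)
std = df_demo.std()   # sample standard deviation (ddof=1)

manual = cov.loc['Age', 'Price'] / (std['Age'] * std['Price'])
builtin = df_demo.corr().loc['Age', 'Price']

print(np.isclose(manual, builtin))
```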
Count how many people there are at each age
df['Age'].value_counts()  # descending by default
Age
25 3
45 2
30 1
35 1
60 1
75 1
Name: count, dtype: int64
df['Age'].value_counts(ascending = True)  # ascending
Age
30 1
35 1
60 1
75 1
45 2
25 3
Name: count, dtype: int64
bins groups the values into intervals; you can specify how many bins to use
df['Age'].value_counts(ascending = True,bins = 5)  # ascending
(45.0, 55.0] 0
(55.0, 65.0] 1
(65.0, 75.0] 1
(35.0, 45.0] 2
(24.948999999999998, 35.0] 5
Name: count, dtype: int64
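Binning with value_counts(bins=...) appears to correspond to cutting the values into equal-width intervals with pd.cut and then counting each interval. The sketch below compares the two on nine ages reconstructed from the counts above (25 three times, 45 twice, and so on); treat the equivalence as an observation on this data, not a guarantee from the documentation.

```python
import pandas as pd

# Ages reconstructed from the value_counts output above.
ages = pd.Series([25, 25, 25, 30, 35, 45, 45, 60, 75], name='Age')

# Five equal-width bins, kept in interval order instead of sorted by count.
binned = ages.value_counts(bins=5, sort=False)

# The same bins built explicitly with pd.cut, then counted.
cut_counts = pd.cut(ages, bins=5, include_lowest=True).value_counts(sort=False)

print(binned)
print(cut_counts)
```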
# print(help(pd.value_counts))
.count() returns the number of non-null values
df['Age'].count()
9
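Note that count() skips missing values, so it can differ from the number of rows. A tiny sketch with one NaN (s_demo is an illustrative name):

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([30, np.nan, 45])

n_valid = s_demo.count()   # non-null values only
n_rows = len(s_demo)       # every row, including the NaN

print(n_valid, n_rows)
```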
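06. Object Operations
Adding, deleting, updating, and querying a Series
Adding, deleting, updating, and querying objects
pd.Series is typically used to store and manipulate a single column of data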
import pandas as pd
data = [10,11,12]
index = ['a','b','c']
s = pd.Series(data = data,index = index)
s
a 10
b 11
c 12
dtype: int64
- Querying
s.iloc[0]  # position-based; plain s[0] is deprecated on a label-indexed Series
10
s[0:2]
a 10
b 11
dtype: int64
mask = [True,False,True]
s[mask]
a 10
c 12
dtype: int64
s.loc['b']
11
s.iloc[1]
11
- Updating
s1 = s.copy()
s1['a'] = 100
s1
a 100
b 11
c 12
dtype: int64
In pandas, Series.replace() replaces values in a Series. You can pass the value to replace (to_replace), the new value (value), and whether to modify the Series in place (inplace).
In the call s1.replace(to_replace=100, value=101, inplace=False):
- to_replace=100 targets every element equal to 100.
- value=101 is the value those elements are changed to.
- inplace=False leaves the original Series s1 unchanged and returns a new Series containing the result.
s1.replace(to_replace= 100,value = 101,inplace = False)
a 101
b 11
c 12
dtype: int64
s1
a 100
b 11
c 12
dtype: int64
s1.replace(to_replace= 100,value = 101,inplace = True)
s1
a 101
b 11
c 12
dtype: int64
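As an alternative to inplace=True, the result can simply be assigned to a name. replace() also accepts a dict that maps old values to new ones, which handles several replacements in one call. A minimal sketch (s_demo and replaced are illustrative names):

```python
import pandas as pd

s_demo = pd.Series([100, 11, 12], index=['a', 'b', 'c'])

# Map old values to new values; the original Series is not modified.
replaced = s_demo.replace({100: 101, 12: 13})

print(replaced.tolist())
print(s_demo.tolist())
```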
s1.index
Index(['a', 'b', 'c'], dtype='object')
To change the index, assign a new sequence to it directly
s1.index = ['a','b','d']
s1
a 101
b 11
d 12
dtype: int64
You can also rename a single index label by passing a dict
s1.rename(index = {'a':'A'},inplace = True)
s1
A 101
b 11
d 12
dtype: int64
- Adding
s2 = pd.Series([100,500],index = ['h','k'])
s2
h 100
k 500
dtype: int64
s3 = pd.concat([s1, s2])
s3
A 101
b 11
d 12
h 100
k 500
dtype: int64
s1['j'] = 500
s1
A 101
b 11
d 12
j 500
dtype: int64
Should the original index be kept when concatenating?
result = pd.concat([s1, s2],ignore_index = False)
result
A 101
b 11
d 12
j 500
h 100
k 500
dtype: int64
result = pd.concat([s1, s2],ignore_index = True)
result
0 101
1 11
2 12
3 500
4 100
5 500
dtype: int64
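The choice matters most when the two Series share labels. Keeping the index can leave duplicate labels, so a lookup by label returns several values, while ignore_index=True renumbers everything from 0. A small sketch with made-up Series a and b:

```python
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([3, 4], index=['y', 'z'])

kept = pd.concat([a, b])                       # labels preserved, 'y' appears twice
reset = pd.concat([a, b], ignore_index=True)   # fresh 0..n-1 index

print(kept.index.is_unique)
print(kept['y'].tolist())    # both values stored under the label 'y'
print(list(reset.index))
```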
- Deleting
s1
A 101
b 11
d 12
j 500
dtype: int64
del s1['A']
s1
b 11
d 12
j 500
dtype: int64
s1.drop(['b','d'],inplace = True)
s1
j 500
dtype: int64
Adding, deleting, updating, and querying a DataFrame
data = [[1,2,3],[4,5,6]]
index = ['a','b']
columns = ['A','B','C']
df = pd.DataFrame(data=data,index=index,columns=columns)
df
| | A | B | C |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
- Querying works the same way
df['A']
a 1
b 4
Name: A, dtype: int64
df.iloc[0]
A 1
B 2
C 3
Name: a, dtype: int64
df.loc['a']
A 1
B 2
C 3
Name: a, dtype: int64
- Updating
df.loc['a']['A']
1
df.loc['a','A'] = 150  # one .loc call; chained df.loc['a']['A'] = ... may write to a temporary copy
df
| | A | B | C |
---|---|---|---|
a | 150 | 2 | 3 |
b | 4 | 5 | 6 |
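When updating a cell, writing through a single .loc[row, column] call avoids chained indexing, which pandas warns about because the assignment may land on a temporary copy instead of the original frame. A minimal sketch of the single-call form, including an update of several cells at once (df_demo is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

df_demo.loc['a', 'A'] = 150                 # one cell, one indexing call
df_demo.loc['b', ['B', 'C']] = [50, 60]     # several cells in the same row

print(df_demo)
```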
df.index = ['f','g']
df
| | A | B | C |
---|---|---|---|
f | 150 | 2 | 3 |
g | 4 | 5 | 6 |
- Adding
df.loc['c'] = [1,2,3]
df
| | A | B | C |
---|---|---|---|
f | 150 | 2 | 3 |
g | 4 | 5 | 6 |
c | 1 | 2 | 3 |
data = [[1,2,3],[4,5,6]]
index = ['j','k']
columns = ['A','B','C']
df2 = pd.DataFrame(data=data,index=index,columns=columns)
df2