Pandas: A Data Analysis and Processing Library





01. Pandas Overview

import pandas as pd
df = pd.read_csv('./data/example.csv')
df
      Name  Age     Sex         City
0    Alice   30  female     New York
1      Bob   25    male  Los Angeles
2  Charlie   35  female      Chicago
3      Tom   60    male      Beijing
4    Danny   45    male       Canada
5    Jenny   75  female     Shanghai

.head() returns the first few rows; pass the number of rows you want (the default is 5)

df.head(3)
Name Age Sex City
0 Alice 30 female New York
1 Bob 25 male Los Angeles
2 Charlie 35 female Chicago

.info() prints a summary of the DataFrame: index range, columns, non-null counts, dtypes, and memory usage

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    6 non-null      object
 1   Age     6 non-null      int64
 2   Sex     6 non-null      object
 3   City    6 non-null      object
dtypes: int64(1), object(3)
memory usage: 324.0+ bytes


df.index is the index (the row labels) of the DataFrame df
df.index
RangeIndex(start=0, stop=6, step=1)

The df.columns attribute returns the column labels of the DataFrame df

df.columns
Index(['Name', 'Age', 'Sex', 'City'], dtype='object')

The df.dtypes attribute shows the data type of each column in df. It returns a Series whose index is the column names and whose values are the column dtypes

df.dtypes

Name object
Age int64
Sex object
City object
dtype: object

The df.values attribute returns the data in df as a NumPy array. The array holds all the values of the DataFrame but not the index or the column names

df.values

array([['Alice', 30, 'female', 'New York'],
       ['Bob', 25, 'male', 'Los Angeles'],
       ['Charlie', 35, 'female', 'Chicago'],
       ['Tom', 60, 'male', 'Beijing'],
       ['Danny', 45, 'male', 'Canada'],
       ['Jenny', 75, 'female', 'Shanghai']], dtype=object)
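
As a small supplement, the pandas documentation recommends DataFrame.to_numpy() over .values in new code. The sketch below uses a tiny hand-made frame named demo (an assumption standing in for example.csv) to show what comes back.

```python
import pandas as pd

# Hypothetical two-row frame standing in for example.csv
demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [30, 25]})

arr = demo.to_numpy()          # same data that demo.values exposes
print(arr.shape)               # (rows, columns); index and column labels are dropped
print(arr.dtype)               # mixed column types fall back to object

ages = demo['Age'].to_numpy()  # a single numeric column keeps its numeric dtype
print(ages.dtype)
```

Because a mixed frame becomes an object array, it is usually better to select the numeric columns first when the array is meant for numerical work.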



02. Pandas Basic Operations

Create a DataFrame yourself from a dictionary

import pandas as pd
data = {'country': ['aaa', 'bbb', 'ccc'], 'population': [10, 12, 14]}
df_data = pd.DataFrame(data)
df_data
country population
0 aaa 10
1 bbb 12
2 ccc 14
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     3 non-null      object
 1   population  3 non-null      int64
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes
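
A dictionary of columns is not the only possible input. As a hedged sketch, the same table can also be built from a list of row dictionaries; the names records, df_records, and df_columns are made up for this illustration.

```python
import pandas as pd

# One dict per row; keys become column names
records = [
    {'country': 'aaa', 'population': 10},
    {'country': 'bbb', 'population': 12},
    {'country': 'ccc', 'population': 14},
]
df_records = pd.DataFrame(records)

# One list per column, as in the example above
df_columns = pd.DataFrame({'country': ['aaa', 'bbb', 'ccc'], 'population': [10, 12, 14]})

print(df_records.equals(df_columns))  # both constructions give the same frame
```

The row-oriented form tends to be convenient when the data arrives record by record, for example from parsed JSON.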


Select specific data
df
Name Age Sex City
0 Alice 30 female New York
1 Bob 25 male Los Angeles
2 Charlie 35 female Chicago
3 Tom 60 male Beijing
4 Danny 45 male Canada
5 Jenny 75 female Shanghai
age = df['Age']
# age
age[:5]  # equivalent to age[0:5]

0 30
1 25
2 35
3 60
4 45
Name: Age, dtype: int64



A Series is a single row or column of a DataFrame

age.index

RangeIndex(start=0, stop=6, step=1)

age.values

array([30, 25, 35, 60, 45, 75], dtype=int64)
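
To make the row/column point concrete, here is a minimal sketch on a hypothetical two-row frame (demo is not the tutorial's CSV). Both a column selection and a row selection come back as a Series; only the index differs.

```python
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [30, 25]})

col = demo['Age']     # one column -> Series indexed by the row labels
row = demo.iloc[0]    # one row    -> Series indexed by the column names

print(type(col).__name__, list(col.index))
print(type(row).__name__, list(row.index))
```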



You can also choose the index yourself
df = df.set_index('Name')
df.head()
Age Sex City
Name
Alice 30 female New York
Bob 25 male Los Angeles
Charlie 35 female Chicago
Tom 60 male Beijing
Danny 45 male Canada
df['Age'][:5]

Name
Alice 30
Bob 25
Charlie 35
Tom 60
Danny 45
Name: Age, dtype: int64

age = df['Age']
age[:5]

Name
Alice 30
Bob 25
Charlie 35
Tom 60
Danny 45
Name: Age, dtype: int64

age['Alice']

30

age = age + 10
age[:5]

Name
Alice 40
Bob 35
Charlie 45
Tom 70
Danny 55
Name: Age, dtype: int64

age = age * 10
age[:5]

Name
Alice 400
Bob 350
Charlie 450
Tom 700
Danny 550
Name: Age, dtype: int64

age.mean()

550.0

age.max()

850

age.min()

350


.describe() returns basic summary statistics for the numeric columns

df.describe()
Age
count 6.000000
mean 45.000000
std 19.235384
min 25.000000
25% 31.250000
50% 40.000000
75% 56.250000
max 75.000000

03. Pandas Indexing

import pandas as pd
df = pd.read_csv('./data/example.csv')
df
      Name  Age     Sex         City  Price
0    Alice   30  female     New York  100.4
1      Bob   25    male  Los Angeles   60.6
2  Charlie   35  female      Chicago   50.2
3      Tom   60    male      Beijing   64.8
4    Danny   45    male       Canada   75.5
5    Jenny   75  female     Shanghai   90.5
df[['Age','Name','Price']][:4]
Age Name Price
0 30 Alice 100.4
1 25 Bob 60.6
2 35 Charlie 50.2
3 60 Tom 64.8
  • loc selects by label
  • iloc selects by integer position
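
One difference between the two is easy to miss: a label slice with loc includes both endpoints, while a position slice with iloc excludes the stop position. A minimal sketch on a hypothetical three-row frame (demo is an assumption, not the tutorial's CSV):

```python
import pandas as pd

# Hypothetical frame indexed by name
demo = pd.DataFrame(
    {'Age': [30, 25, 35], 'Price': [100.4, 60.6, 50.2]},
    index=['Alice', 'Bob', 'Charlie'],
)

by_label = demo.loc['Alice':'Bob']   # label slice: both endpoints included
by_position = demo.iloc[0:1]         # position slice: stop is excluded

print(len(by_label))     # 2 rows
print(len(by_position))  # 1 row
print(demo.loc['Bob', 'Price'] == demo.iloc[1, 1])  # same cell reached two ways
```
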
df.iloc[0]

Name Alice
Age 30
Sex female
City New York
Price 100.4
Name: 0, dtype: object

df.iloc[0:2]
Name Age Sex City Price
0 Alice 30 female New York 100.4
1 Bob 25 male Los Angeles 60.6
df.iloc[0:2,1:3]
Age Sex
0 30 female
1 25 male
df = df.set_index('Name')  # use the Name column as the index so that loc can select by name
df
         Age     Sex         City  Price
Name
Alice     30  female     New York  100.4
Bob       25    male  Los Angeles   60.6
Charlie   35  female      Chicago   50.2
Tom       60    male      Beijing   64.8
Danny     45    male       Canada   75.5
Jenny     75  female     Shanghai   90.5
df.loc['Bob']

Age 25
Sex male
City Los Angeles
Price 60.6
Name: Bob, dtype: object

df.loc['Bob','Price']

60.6

df.loc['Bob':'Danny',:]
Age Sex City Price
Name
Bob 25 male Los Angeles 60.6
Charlie 35 female Chicago 50.2
Tom 60 male Beijing 64.8
Danny 45 male Canada 75.5
df.loc['Bob','Price'] = 1000
df.head()
Age Sex City Price
Name
Alice 30 female New York 100.4
Bob 25 male Los Angeles 1000.0
Charlie 35 female Chicago 50.2
Tom 60 male Beijing 64.8
Danny 45 male Canada 75.5
  • Boolean indexing
df['Price'] > 100

Name
Alice   True
Bob   True
Charlie  False
Tom   False
Danny  False
Jenny  False
Name:  Price, dtype: bool

df[df['Price'] > 100]
Age Sex City Price
Name
Alice 30 female New York 100.4
Bob 25 male Los Angeles 1000.0
df[df['Sex'] == 'male'][:5]
Age Sex City Price
Name
Bob 25 male Los Angeles 1000.0
Tom 60 male Beijing 64.8
Danny 45 male Canada 75.5

Compute the mean Age of the rows where Sex is male

df.loc[df['Sex'] == 'male','Age'].mean()

43.333333333333336


Count the people whose Age is greater than 40
(df['Age'] > 40).sum() 

3
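
Boolean masks can also be combined. The sketch below, on a hypothetical four-row frame (demo is an assumption), shows the element-wise operators & and |; each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

# Hypothetical frame for illustration only
demo = pd.DataFrame({
    'Age': [30, 25, 35, 60],
    'Sex': ['female', 'male', 'female', 'male'],
})

# & is element-wise "and", | is element-wise "or"
older_men = demo[(demo['Sex'] == 'male') & (demo['Age'] > 40)]
young_or_female = demo[(demo['Age'] < 30) | (demo['Sex'] == 'female')]

print(older_men)
print(len(young_or_female))
```

Python's plain and / or do not work here, since they try to reduce a whole Series to a single truth value.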



04. groupby Operations

import pandas as pd

df = pd.DataFrame({
   'key':['A','B','C','A','B','C','A','B','C'],
                   'data':[1,5,10,5,10,15,10,15,20]})
df
key data
0 A 1
1 B 5
2 C 10
3 A 5
4 B 10
5 C 15
6 A 10
7 B 15
8 C 20

Suppose A, B, and C are three brands and the data values are sales figures; compute the total sales of each brand

for key in ['A','B','C']:
    # print(key,df[df['key'] == key].sum())
    print(key)
    print(df[df['key'] == key].sum())

A
key   AAA
data   16
dtype:  object
B
key   BBB
data   30
dtype:  object
C
key   CCC
data   45
dtype:  object

df.groupby('key').sum()
data
key
A 16
B 30
C 45

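aggregate is a method of the pandas DataFrameGroupBy object that applies aggregations to grouped data. In pandas, groupby splits the data into groups according to the values of one or more columns, and aggregate then runs one or more aggregation functions on each group.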

df.groupby('key').aggregate('mean')
data
key
A 5.333333
B 10.000000
C 15.000000
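
aggregate also accepts a list of function names, which gives one result column per function. A minimal sketch that rebuilds the same key/data frame under the hypothetical name df_sales:

```python
import pandas as pd

# Same values as the key/data frame above, under a separate name
df_sales = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'data': [1, 5, 10, 5, 10, 15, 10, 15, 20],
})

# A list of function names yields one column per function
summary = df_sales.groupby('key')['data'].aggregate(['sum', 'mean', 'max'])
print(summary)
```
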
df = pd.read_csv('./data/example.csv')
  • Group by Sex and compute the mean Age of each group
df.groupby('Sex')['Age'].mean()

Sex
female 46.666667
male 43.333333
Name: Age, dtype: float64



05. Numerical Operations

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]],index = ['a','b'],columns = ['A','B','C'])
df
A B C
a 1 2 3
b 4 5 6

By default, sum() adds up each column (axis=0)

df.sum()

A  5
B  7
C  9
dtype: int64

df.sum(axis = 0)

A  5
B  7
C  9
dtype: int64

Sum across each row

df.sum(axis = 1)

a  6
b  15
dtype: int64

The axis can also be given by name: axis='columns' is the same as axis=1

df.sum(axis = 'columns')

a  6
b  15
dtype: int64
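
The axis argument is easy to mix up, so here is a short sketch that checks the two spellings against each other. It rebuilds the same 2x3 frame under the hypothetical name df_num.

```python
import pandas as pd

df_num = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# axis=0 or 'index':   collapse the rows, one result per column
# axis=1 or 'columns': collapse the columns, one result per row
per_column = df_num.sum(axis='index')
per_row = df_num.sum(axis='columns')

print(per_column.equals(df_num.sum(axis=0)))
print(per_row.equals(df_num.sum(axis=1)))
```

One way to remember it: the axis you name is the one that disappears from the result.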


Mean
df.mean()

A  2.5
B  3.5
C  4.5
dtype: float64

df.mean(axis = 1)

a  2.0
b  5.0
dtype: float64


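Minimum and maximum (a notebook cell displays only its last expression, so the output below is that of df.max())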
df.min()
df.max()

A  4
B  5
C  6
dtype: int64


Median: column A contains the values 1 and 4, so its median is (1 + 4) / 2 = 2.5
df
A B C
a 1 2 3
b 4 5 6
df.median()

A  2.5
B  3.5
C  4.5
dtype: float64


Bivariate statistics

df = pd.read_csv('./data/example.csv')
df.head()
Age Name Sex City Price
0 30 Alice female New York 100.4
1 25 Bob male Los Angeles 60.6
2 35 Charlie female Chicago 50.2
3 60 Tom male Beijing 64.8
4 45 Danny male Canada 75.5

Compute the covariance on the numeric columns only

import numpy as np
numeric_df = df.select_dtypes(include=[np.number])
numeric_df.cov()
Age Price
Age 309.027778 16.361111
Price 16.361111 357.706944


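Compute the correlation coefficient matrix

1 means perfect positive correlation,
-1 means perfect negative correlation,
0 means no linear correlation.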

numeric_df.corr()
Age Price
Age 1.00000 0.04921
Price 0.04921 1.00000
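
The two matrices are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on hypothetical numbers (demo is not the contents of example.csv) and then on the values printed in the cov() output above.

```python
import numpy as np
import pandas as pd

# Hypothetical numbers, not the contents of example.csv
demo = pd.DataFrame({'Age': [30, 25, 35, 60], 'Price': [100.4, 60.6, 50.2, 64.8]})

cov_xy = demo.cov().loc['Age', 'Price']
manual = cov_xy / (demo['Age'].std() * demo['Price'].std())
print(np.isclose(manual, demo.corr().loc['Age', 'Price']))

# Same formula applied to the numbers printed by cov() above;
# the variances sit on the diagonal of the covariance matrix
printed = 16.361111 / np.sqrt(309.027778 * 357.706944)
print(round(printed, 5))
```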

Count how many people have each Age value

df['Age'].value_counts()  # descending by default

Age
25   3
45   2
30   1
35   1
60   1
75   1
Name: count, dtype: int64

df['Age'].value_counts(ascending = True)  # ascending

Age
30   1
35   1
60   1
75   1
45   2
25   3
Name: count, dtype: int64


bins groups the values into intervals; you choose how many
df['Age'].value_counts(ascending = True,bins = 5)  # ascending

(45.0, 55.0]   0
(55.0, 65.0]   1
(65.0, 75.0]   1
(35.0, 45.0]   2
(24.948999999999998, 35.0]  5
Name: count, dtype: int64
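
Passing an integer for bins splits the value range into that many equal-width intervals, which is roughly what pd.cut does. The sketch below uses the ages read off the value_counts() output above (25 three times, 45 twice, and 30, 35, 60, 75 once each); the Series name ages is made up for the illustration.

```python
import pandas as pd

# Ages taken from the value_counts() output above
ages = pd.Series([25, 25, 25, 45, 45, 30, 35, 60, 75], name='Age')

binned = pd.cut(ages, bins=5)                   # 5 equal-width intervals over 25..75
counts = binned.value_counts().sort_index()     # order by interval instead of by count
print(counts)
```

pd.cut also accepts an explicit list of bin edges, which is often the better choice when the boundaries have a meaning of their own, such as age brackets.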

# print(help(pd.value_counts))

.count() returns the number of non-null values

df['Age'].count()

9



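06. Object Operations

Create, read, update, and delete on a Series

Adding, deleting, updating, and querying

pd.Series is typically used to store and manipulate a single column of data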

import pandas as pd
data = [10,11,12]
index = ['a','b','c']
s = pd.Series(data = data,index = index)
s

a 10
b 11
c 12
dtype: int64


  • Read
s.iloc[0]  # positional access; plain s[0] on a label index is deprecated in recent pandas

10

s[0:2]

a   10
b   11
dtype: int64

mask = [True,False,True]
s[mask]

a   10
c   12
dtype: int64

s.loc['b']

11

s.iloc[1]

11

  • Update
s1 = s.copy()
s1['a'] = 100
s1

a   100
b   11
c   12
dtype: int64



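In pandas, the Series.replace() method replaces values in a Series. When calling it, you can specify the value to replace (to_replace), the new value (value), and whether to modify the Series in place (inplace).

In the call s1.replace(to_replace=100, value=101, inplace=False):

  • to_replace=100 means every element equal to 100 is replaced.
  • value=101 means those elements become 101.
  • inplace=False means the original Series s1 is not modified; a new Series containing the result is returned instead.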
s1.replace(to_replace= 100,value = 101,inplace = False)

a   101
b   11
c   12
dtype: int64

s1

a   100
b   11
c   12
dtype: int64

s1.replace(to_replace= 100,value = 101,inplace = True)
s1

a   101
b   11
c   12
dtype: int64
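
replace() can also take a dict that maps several old values to new ones in a single call. A minimal sketch on a hypothetical Series named s_demo:

```python
import pandas as pd

s_demo = pd.Series([100, 11, 12], index=['a', 'b', 'c'])

# Keys are the values to look for, dict values are their replacements
replaced = s_demo.replace({100: 101, 12: 120})

print(replaced.tolist())   # new Series with both substitutions
print(s_demo.tolist())     # original is untouched (inplace defaults to False)
```

Assigning the returned Series back to a name is generally easier to follow than inplace=True, since it makes the data flow explicit.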

s1.index

Index(['a', 'b', 'c'], dtype='object')


To change the index, assign a new sequence to it directly

s1.index = ['a','b','d']
s1

a   101
b   11
d   12
dtype: int64


A single label can also be renamed by passing a dict that maps the old label to the new one
s1.rename(index = {'a':'A'},inplace = True)
s1

A   101
b   11
d   12
dtype: int64


  • Add
s2 = pd.Series([100,500],index = ['h','k'])
s2

h   100
k   500
dtype: int64

s3 = pd.concat([s1, s2])
s3

A   101
b   11
d   12
h   100
k   500
dtype: int64

s1['j'] = 500
s1

A   101
b   11
d   12
j   500
dtype: int64

Should the original index labels be kept when concatenating? The ignore_index parameter controls this.

result = pd.concat([s1, s2],ignore_index = False)
result

A   101
b   11
d   12
j   500
h   100
k   500
dtype: int64

result = pd.concat([s1, s2],ignore_index = True)
result

0   101
1   11
2   12
3   500
4   100
5   500
dtype: int64
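
Keeping the labels matters most when the two inputs share a label. The sketch below uses two hypothetical Series, left and right, that both contain the label 'b'.

```python
import pandas as pd

left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([3, 4], index=['b', 'c'])

kept = pd.concat([left, right])                      # label 'b' now appears twice
fresh = pd.concat([left, right], ignore_index=True)  # labels replaced by 0..3

print(kept.index.is_unique)   # False: the index has a duplicate
print(kept['b'].tolist())     # a duplicated label returns every match
print(fresh.index.tolist())
```

If duplicate labels would be a bug in your data, pd.concat also takes verify_integrity=True, which raises an error instead of silently keeping them.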

  • Delete
s1

A   101
b   11
d   12
j   500
dtype: int64

del s1['A']
s1

b   11
d   12
j   500
dtype: int64

s1.drop(['b','d'],inplace = True)
s1

j   500
dtype: int64


Create, read, update, and delete on a DataFrame

data = [[1,2,3],[4,5,6]]
index = ['a','b']
columns = ['A','B','C']

df = pd.DataFrame(data=data,index=index,columns=columns)
df
A B C
a 1 2 3
b 4 5 6
  • Reading works the same way
df['A']
a    1
b    4
Name: A, dtype: int64
df.iloc[0]
A    1
B    2
C    3
Name: a, dtype: int64
df.loc['a']
A    1
B    2
C    3
Name: a, dtype: int64
  • Update
df.loc['a']['A']

1

df.loc['a', 'A'] = 150  # one .loc call; the chained form df.loc['a']['A'] = 150 may write to a temporary copy
df
A B C
a 150 2 3
b 4 5 6
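
When assigning a single cell, passing the row and column labels to one indexer is the dependable form, because an assignment through two chained brackets can end up modifying a temporary object instead of the frame. A minimal sketch on a copy of the same data under the hypothetical name df_demo:

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

df_demo.loc['a', 'A'] = 150   # row label and column label in one indexer
df_demo.at['b', 'C'] = 60     # .at is the scalar-only counterpart of .loc

print(df_demo)
```
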
df.index = ['f','g']
df
A B C
f 150 2 3
g 4 5 6
  • Add
df.loc['c'] = [1,2,3]
df
A B C
f 150 2 3
g 4 5 6
c 1 2 3
data = [[1,2,3],[4,5,6]]
index = ['j','k']
columns = ['A','B','C']

df2 = pd.DataFrame(data=data,index=index,columns=columns)
df2