机器学习 - Pandas 练习, 常见功能查阅

机器学习记录

Pandas

Pandas 官网

安装 pandas 库:

conda install -y pandas openpyxl

pd.show_versions()

查看版本信息

import os
import pandas as pd
import requests

PATH = r"/your_local_dir"

r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')

with open(PATH + 'iris.data', 'w') as f:
    f.write(r.text)

os.chdir(PATH)

# 读取 csv 文件
df = pd.read_csv(PATH + 'iris.data', names=['sepal length', 'sepal width', 'petal length', 'petal width', 'class'])

# 读取前几条数据
print(df.head())

# 读取指定列
print(df['sepal length'])

# 读取指定行和列
print(df.loc[3:5, ['sepal length', 'petal length', 'class']])

# 读取指定行和列
print(df.iloc[3:5, [0, 2, 4]])

# 读取列名
print(df.columns)

# class 列去重后的值
print(df['class'].unique())

# 统计每列的数量
print(df.count())

# 根据 class 进行筛选, 序号还是原来的
print(df[df['class'] == 'Iris-versicolor'])

print(df[df['class'] == 'Iris-versicolor'].count())

# 根据 class 进行筛选, 重新编号, 从 0 开始
versicolor = df[df['class'] == 'Iris-versicolor'].reset_index(drop=True)
print(versicolor)

# 筛选数据
print(df[(df['class'] == 'Iris-versicolor') & (df['petal width'] > 1.5)])

# 各列统计信息
print(df.describe())

# 指定统计的百分比
print(df.describe(percentiles=[.20, .40, .80, .90, .95]))

# Pearson 相关系数
print(df.corr())

print(df.corr(method="spearman"))

print(df.corr(method="kendall"))

.describe()

对数据集进行统计描述, 包括计数, 均值, 标准差, 最小值, 最大值和四分位数等.

只对数值型时间型数据有效, 对于文本型数据, 只会给出计数, 唯一值, 最大频率和数据类型等信息.
返回值是一个 DataFrame, 每列是一个统计指标, 每行是数据集中的一种数值型或时间型变量.

.corr()

用于计算 DataFrame 中列之间的相关系数(correlation coefficient)的方法. 相关系数可以衡量两个变量之间的线性关系强度, 取值范围是-1到1, 值越接近1表示两个变量越正相关, 值越接近-1表示两个变量越负相关, 值为0表示两个变量之间没有线性关系.

pandas corr 方法语法如下:

DataFrame.corr(method='pearson', min_periods=1)

其中, 参数 method 指定所使用的相关系数计算方法, 可以是 pearson(默认), spearman(Spearman’s) 或 kendall(Kendall’s). 参数 min_periods 为最小的非空值数量, 当非空值数量小于 min_periods 时, 返回 NaN.

皮尔逊积矩相关系数公式(Pearson product-moment correlation coefficient, PPMCC)

该公式计算两个变量之间的线性关系, 其值介于 -1 和 +1 之间, 表示两个变量之间的相关性强度和方向. 具体公式为:

r = (n * Σ(xy) - Σ x * Σ y) / sqrt((n * Σ(x^2) - (Σ x)^2) * (n * Σ(y^2) - (Σ y)^2))

其中, x 和 y 分别表示两个变量的值, Σ 表示求和符号, n 表示样本个数, r 表示相关系数.

Kendall

Kendall 相关系统是用于度量数据集中的相关性和相似性的数学和统计工具. 其中最常用的是 Kendall 相关系数和 Kendall tau 相关系数. 这两种相关系数都是用于比较两组顺序或排名数据的相似性的方法.

  • Kendall 相关系数是一种度量两组数据(X 和 Y)之间的相关性的方法. 它的范围在 -1 和 1 之间, 其中 -1 表示完全的反相关, 0 表示无关, 1 表示完全相关.
  • Kendall tau 相关系数是一种度量两组数据(X 和 Y)之间的相似性的方法. 它的范围在 -1 和 1 之间, 其中 -1 表示完全的相反, 0 表示无关, 1 表示完全相同.
Kendall’s tau

可以用来测试排名数据中变量之间的关系. 它的公式为:

τ = (2/ n(n-1)) Σ Σ sign(x_i - x_j) sign(y_i - y_j)

其中, n 是样本大小, x_i 和 y_i 是第 i 个观察值的排名, i ≠ j. sign()是符号函数, 如果其参数为正则返回1, 否则返回-1.

该公式计算每个组合的符号和符号积, 并将它们加起来. 如果两个变量是同向的(即它们都按相同的方向移动), 则它们的符号积将是1, 否则它将是-1. 最终的结果将在-1到1之间, 其中-1表示完全相反的排名, 而1表示完全相同的排名.

Spearman’s

一种衡量两个变量相关性的方法, 也称为 Spearman’s 秩相关系数.

在 Spearman’s 中, 每个变量的值都被转换成它们的排序值(例如, 从最小值到最大值的排名). 然后, 比较这些排名, 以确定变量之间的关系强度.

Spearman’s 秩相关系数的公式如下:

ρ = 1 - (6Σ d² / n(n²-1))

其中,

  • ρ 是 Spearman’s 秩相关系数
  • d 是排名之间的差异
  • n 是双方比较的对象数量

Spearman’s 秩相关系数的范围是 -1 到 1, 其值的符号表示相关性的方向.

  • 如果 Spearman’s 秩相关系数等于 1, 则两个变量完全正相关;
  • 如果等于 -1, 则两个变量完全负相关;
  • 如果等于 0, 则两个变量没有相关性.

Map

适用于序列数据

修改 class 的内容, 使其值为 map 后的简称:

df['class'] = df['class'].map({'Iris-setosa': 'SET', 'Iris-virginica': 'VIR', 'Iris-versicolor': 'VER'})

选择一个不同的名称, 就可以保留原来的 class, 同时增加新的一列, 新的一列的值就是 class 的简称:

df['class_short'] = df['class'].map({'Iris-setosa': 'SET', 'Iris-virginica': 'VIR', 'Iris-versicolor': 'VER'})

通过函数来完成:

df['wide length'] = df['petal length'].map(lambda v: 2 if v >= 2 else 0)

Apply

可用于数据框 和 序列

通过函数来完成, 根据一列的数据来创建新的一列:

df['wide petal'] = df['petal width'].apply(lambda v: 1 if v >= 1.3 else 0)

在数据框上使用 apply (map 没有的功能):

df['petal area'] = df.apply(lambda r: r['petal length'] * r['petal width'], axis=1)

ApplyMap

对数据框里所有的数据单元执行一个函数.

常见用法: 根据一定的条件标准来转变或格式化每一个单元.

对于整个数据框的数据, 如果是 float 类型, 则取其对数, 否则使用原值:

df.applymap(lambda v: np.log(v) if isinstance(v, float) else v)

Groupby

基于某些选择的类别对数据进行分组

根据 class 进行分组, 求平均值:

df.groupby('class').mean(numeric_only=True)
  • numeric_only - default: False
    • False: 只能使用数字
    • True: 允许非数字

根据 class 进行分组, 统计信息:

df.groupby('class').describe()

根据 petal width 分类, 类别去重后得到数组.

df.groupby('petal width')['class'].unique().to_frame()

根据 class 分类, 计算 petal width 的最小值, 最大值, 以及间距

df.groupby('class')['petal width'].agg([
    ('max', np.max), 
    ('min', np.min),
    ('delta', lambda x: x.max() - x.min()),
])

数据

git clone https://github.com/KeithGalli/pandas.git

练习

import pandas as pd
data_dir = "/data_dir"
df = pd.read_csv(f'{data_dir}/pandas/pokemon_data.csv')
df.shape
(800, 12)
df.head()
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison4549496565451False
12IvysaurGrassPoison6062638080601False
23VenusaurGrassPoison808283100100801False
33VenusaurMega VenusaurGrassPoison80100123122120801False
44CharmanderFireNaN3952436050651False
# 查看列的数据类型
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 795 to 799
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           5 non-null      int64 
 1   Name        5 non-null      object
 2   Type 1      5 non-null      object
 3   Type 2      5 non-null      object
 4   HP          5 non-null      int64 
 5   Attack      5 non-null      int64 
 6   Defense     5 non-null      int64 
 7   Sp. Atk     5 non-null      int64 
 8   Sp. Def     5 non-null      int64 
 9   Speed       5 non-null      int64 
 10  Generation  5 non-null      int64 
 11  Legendary   5 non-null      bool  
dtypes: bool(1), int64(8), object(3)
memory usage: 577.0+ bytes
df = pd.read_excel(f'{data_dir}/pandas/pokemon_data.xlsx')
df.shape
(800, 12)
df.head()
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison4549496565451False
12IvysaurGrassPoison6062638080601False
23VenusaurGrassPoison808283100100801False
33VenusaurMega VenusaurGrassPoison80100123122120801False
44CharmanderFireNaN3952436050651False
def read():
    global df
    df = pd.read_csv(f'{data_dir}/pandas/pokemon_data.csv')
read()
df.columns
Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')
df.Name
0                  Bulbasaur
1                    Ivysaur
2                   Venusaur
3      VenusaurMega Venusaur
4                 Charmander
               ...          
795                  Diancie
796      DiancieMega Diancie
797      HoopaHoopa Confined
798       HoopaHoopa Unbound
799                Volcanion
Name: Name, Length: 800, dtype: object
df['Name']
0                  Bulbasaur
1                    Ivysaur
2                   Venusaur
3      VenusaurMega Venusaur
4                 Charmander
               ...          
795                  Diancie
796      DiancieMega Diancie
797      HoopaHoopa Confined
798       HoopaHoopa Unbound
799                Volcanion
Name: Name, Length: 800, dtype: object
df[['Name', 'HP', 'Speed']]
NameHPSpeed
0Bulbasaur4545
1Ivysaur6060
2Venusaur8080
3VenusaurMega Venusaur8080
4Charmander3965
............
795Diancie5050
796DiancieMega Diancie50110
797HoopaHoopa Confined8070
798HoopaHoopa Unbound8080
799Volcanion8070

800 rows × 3 columns

# 第 [n] 行的数据
df.iloc[1]
#                   2
Name          Ivysaur
Type 1          Grass
Type 2         Poison
HP                 60
Attack             62
Defense            63
Sp. Atk            80
Sp. Def            80
Speed              60
Generation          1
Legendary       False
Name: 1, dtype: object
# 第 [n, m) 行的数据
df.iloc[0:4]
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison4549496565451False
12IvysaurGrassPoison6062638080601False
23VenusaurGrassPoison808283100100801False
33VenusaurMega VenusaurGrassPoison80100123122120801False
# 第 [n] 行, [m] 列的数据
df.iloc[2, 2]
'Grass'
# 行遍历
# for index, row in df.iterrows():
# #     print(index, row)
#     print(index, row['Name'])
0 Bulbasaur
1 Ivysaur
2 Venusaur
3 VenusaurMega Venusaur
4 Charmander
5 Charmeleon
...
796 DiancieMega Diancie
797 HoopaHoopa Confined
798 HoopaHoopa Unbound
799 Volcanion
# 根据字段过滤数据
df.loc[df['Type 1'] == "Grass"]
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison4549496565451False
12IvysaurGrassPoison6062638080601False
23VenusaurGrassPoison808283100100801False
33VenusaurMega VenusaurGrassPoison80100123122120801False
4843OddishGrassPoison4550557565301False
.......................................
718650ChespinGrassNaN5661654845386False
719651QuilladinGrassNaN6178955658576False
720652ChesnaughtGrassFighting881071227475646False
740672SkiddoGrassNaN6665486257526False
741673GogoatGrassNaN123100629781686False

70 rows × 12 columns

# 统计, 只对"数值"类型统计
# count:非空值的数量。
# mean:平均值。
# std:标准差。
# min:最小值。
# 25%:下四分位数。
# 50%:中位数(下四分位数和上四分位数的平均值)。
# 75%:上四分位数。
# max:最大值。
df.describe()
#HPAttackDefenseSp. AtkSp. DefSpeedGeneration
count800.000000800.000000800.000000800.000000800.000000800.000000800.000000800.00000
mean362.81375069.25875079.00125073.84250072.82000071.90250068.2775003.32375
std208.34379825.53466932.45736631.18350132.72229427.82891629.0604741.66129
min1.0000001.0000005.0000005.00000010.00000020.0000005.0000001.00000
25%184.75000050.00000055.00000050.00000049.75000050.00000045.0000002.00000
50%364.50000065.00000075.00000070.00000065.00000070.00000065.0000003.00000
75%539.25000080.000000100.00000090.00000095.00000090.00000090.0000005.00000
max721.000000255.000000190.000000230.000000194.000000230.000000180.0000006.00000
# 排序
df.sort_values('Name')
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
510460AbomasnowGrassIce9092759285604False
511460AbomasnowMega AbomasnowGrassIce90132105132105304False
6863AbraPsychicNaN25201510555901False
392359AbsolDarkNaN65130607560753False
393359AbsolMega AbsolDarkNaN6515060115601153False
.......................................
632571ZoroarkDarkNaN6010560120601055False
631570ZoruaDarkNaN4065408040655False
4641ZubatPoisonFlying4045353040551False
695634ZweilousDarkDragon7285706570585False
794718Zygarde50% FormeDragonGround1081001218195956True

800 rows × 12 columns

# 排序: 倒序
df.sort_values('Name', ascending=False)
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
794718Zygarde50% FormeDragonGround1081001218195956True
695634ZweilousDarkDragon7285706570585False
4641ZubatPoisonFlying4045353040551False
631570ZoruaDarkNaN4065408040655False
632571ZoroarkDarkNaN6010560120601055False
.......................................
393359AbsolMega AbsolDarkNaN6515060115601153False
392359AbsolDarkNaN65130607560753False
6863AbraPsychicNaN25201510555901False
511460AbomasnowMega AbomasnowGrassIce90132105132105304False
510460AbomasnowGrassIce9092759285604False

800 rows × 12 columns

# 排序: 多字段排序
df.sort_values(['Type 1', 'HP'])
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
316292ShedinjaBugGhost190453030403False
230213ShuckleBugRock20102301023052False
462415CombeeBugFlying3030423042704False
603543VenipedeBugPoison3045593039575False
314290NincadaBugGround3145903030403False
.......................................
142131LaprasWaterIce13085808595601False
145134VaporeonWaterNaN130656011095651False
350320WailmerWaterNaN13070357035603False
655594AlomomolaWaterNaN16575804045655False
351321WailordWaterNaN17090459045603False

800 rows × 12 columns

# 排序: 多字段排序, 指定每个字段的排序顺序
df.sort_values(['Type 1', 'HP'], ascending=[0, 1])
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
139129MagikarpWaterNaN2010551520801False
381349FeebasWaterNaN2015201055803False
9790ShellderWaterNaN30651004525401False
10698KrabbyWaterNaN30105902525501False
125116HorseaWaterNaN3040707025601False
.......................................
232214HeracrossMega HeracrossBugFighting8018511540105752False
678617AccelgorBugNaN807040100601455False
734666VivillonBugFlying8052509050896False
698637VolcaronaBugFire8560651351051005False
520469YanmegaBugFlying86768611656954False

800 rows × 12 columns

df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']
df
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendaryTotal
01BulbasaurGrassPoison4549496565451False318
12IvysaurGrassPoison6062638080601False405
23VenusaurGrassPoison808283100100801False525
33VenusaurMega VenusaurGrassPoison80100123122120801False625
44CharmanderFireNaN3952436050651False309
..........................................
795719DiancieRockFairy50100150100150506True600
796719DiancieMega DiancieRockFairy501601101601101106True700
797720HoopaHoopa ConfinedPsychicGhost8011060150130706True600
798720HoopaHoopa UnboundPsychicDark8016060170130806True680
799721VolcanionFireWater8011012013090706True600

800 rows × 13 columns

df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']
# df.drop 不会修改 df, 需要赋值. 如果删除的 列 不存在, 则会报错
df = df.drop(columns=['Total'])

df.head()
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison4549496565451False
12IvysaurGrassPoison6062638080601False
23VenusaurGrassPoison808283100100801False
33VenusaurMega VenusaurGrassPoison80100123122120801False
44CharmanderFireNaN3952436050651False
# 当 axis=0 时,对每一列进行求和,返回一行结果。
# 当 axis=1 时,对每一行进行求和,返回一列结果。
df['Total'] = df.iloc[:, 4:10].sum(axis=1)

df.head()
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendaryTotal
01BulbasaurGrassPoison4549496565451False318
12IvysaurGrassPoison6062638080601False405
23VenusaurGrassPoison808283100100801False525
33VenusaurMega VenusaurGrassPoison80100123122120801False625
44CharmanderFireNaN3952436050651False309
read()

df['Total'] = df.iloc[:, 4:10].sum(axis=1)

# 输出指定列
df = df[['Total', 'HP', 'Defense']]

df.head(5)
TotalHPDefense
03184549
14056063
25258083
362580123
43093943
read()

df['Total'] = df.iloc[:, 4:10].sum(axis=1)

cols = list(df.columns)
print(cols)

# 输出指定列
df = df[cols[0:4] + [cols[-1]] + cols[4:-1]]

df.head(5)
['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary', 'Total']
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison3184549496565451False
12IvysaurGrassPoison4056062638080601False
23VenusaurGrassPoison525808283100100801False
33VenusaurMega VenusaurGrassPoison62580100123122120801False
44CharmanderFireNaN3093952436050651False
# 写文件
# df.to_csv('/tmp/modified.csv', index=False)
# df.to_excel('/tmp/modified.xlsx', index=False)
# 多个字段过滤
df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison')]
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison3184549496565451False
12IvysaurGrassPoison4056062638080601False
23VenusaurGrassPoison525808283100100801False
33VenusaurMega VenusaurGrassPoison62580100123122120801False
4843OddishGrassPoison3204550557565301False
4944GloomGrassPoison3956065708575401False
5045VileplumeGrassPoison49075808511090501False
7569BellsproutGrassPoison3005075357030401False
7670WeepinbellGrassPoison3906590508545551False
7771VictreebelGrassPoison490801056510070701False
344315RoseliaGrassPoison40050604510080653False
451406BudewGrassPoison2804030355070554False
452407RoseradeGrassPoison515607065125105904False
651590FoongusGrassPoison2946955455555155False
652591AmoongussGrassPoison46411485708580305False
# 多个字段过滤
df.loc[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Poison')]
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison3184549496565451False
12IvysaurGrassPoison4056062638080601False
23VenusaurGrassPoison525808283100100801False
33VenusaurMega VenusaurGrassPoison62580100123122120801False
1613WeedleBugPoison1954035302020501False
..........................................
718650ChespinGrassNaN3135661654845386False
719651QuilladinGrassNaN4056178955658576False
720652ChesnaughtGrassFighting530881071227475646False
740672SkiddoGrassNaN3506665486257526False
741673GogoatGrassNaN531123100629781686False

89 rows × 13 columns

new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]

# new_df.to_csv('/tmp/filtered.csv', index=False)

# "index" 是 df 的 index. 会重新为 new_df 生成index
new_df = new_df.reset_index()

new_df
index#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
023VenusaurGrassPoison525808283100100801False
133VenusaurMega VenusaurGrassPoison62580100123122120801False
25045VileplumeGrassPoison49075808511090501False
37771VictreebelGrassPoison490801056510070701False
4652591AmoongussGrassPoison46411485708580305False
new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]

# 会重新为 new_df 生成index, 删除 df 保留下来的 index 列
new_df = new_df.reset_index(drop=True)

new_df
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
03VenusaurGrassPoison525808283100100801False
13VenusaurMega VenusaurGrassPoison62580100123122120801False
245VileplumeGrassPoison49075808511090501False
371VictreebelGrassPoison490801056510070701False
4591AmoongussGrassPoison46411485708580305False
df.loc[df['Name'].str.contains('Mega')]
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
33VenusaurMega VenusaurGrassPoison62580100123122120801False
76CharizardMega Charizard XFireDragon63478130111130851001False
86CharizardMega Charizard YFireFlying63478104781591151001False
129BlastoiseMega BlastoiseWaterNaN63079103120135115781False
1915BeedrillMega BeedrillBugPoison495651504015801451False
2318PidgeotMega PidgeotNormalFlying579838080135801211False
7165AlakazamMega AlakazamPsychicNaN590555065175951501False
8780SlowbroMega SlowbroWaterPsychic590957518013080301False
10294GengarMega GengarGhostPoison600606580170951301False
124115KangaskhanMega KangaskhanNormalNaN590105125100601001001False
137127PinsirMega PinsirBugFlying6006515512065901051False
141130GyaradosMega GyaradosWaterDark6409515510970130811False
154142AerodactylMega AerodactylRockFlying615801358570951501False
163150MewtwoMega Mewtwo XPsychicFighting7801061901001541001301True
164150MewtwoMega Mewtwo YPsychicNaN780106150701941201401True
168154MeganiumGrassNaN525808210083100802False
196181AmpharosMega AmpharosElectricDragon6109095105165110452False
224208SteelixMega SteelixSteelGround610751252305595302False
229212ScizorMega ScizorBugSteel6007015014065100752False
232214HeracrossMega HeracrossBugFighting6008018511540105752False
248229HoundoomMega HoundoomDarkFire600759090140901152False
268248TyranitarMega TyranitarRockDark70010016415095120712False
275254SceptileMega SceptileGrassDragon6307011075145851453False
279257BlazikenMega BlazikenFireFighting6308016080130801003False
283260SwampertMega SwampertWaterGround63510015011095110703False
306282GardevoirMega GardevoirPsychicFairy6186885651651351003False
327302SableyeMega SableyeDarkGhost480508512585115203False
329303MawileMega MawileSteelFairy480501051255595503False
333306AggronMega AggronSteelNaN630701402306080503False
336308MedichamMega MedichamFightingPsychic510601008580851003False
339310ManectricMega ManectricElectricNaN575707580135801353False
349319SharpedoMega SharpedoWaterDark5607014070110651053False
354323CameruptMega CameruptFireGround56070120100145105203False
366334AltariaMega AltariaDragonFairy59075110110110105803False
387354BanetteMega BanetteGhostNaN55564165759383753False
393359AbsolMega AbsolDarkNaN5656515060115601153False
397362GlalieMega GlalieIceNaN5808012080120801003False
409373SalamenceMega SalamenceDragonFlying70095145130120901203False
413376MetagrossMega MetagrossSteelPsychic700801451501051101103False
418380LatiasMega LatiasDragonPsychic700801001201401501103True
420381LatiosMega LatiosDragonPsychic700801301001601201103True
426384RayquazaMega RayquazaDragonFlying7801051801001801001153True
476428LopunnyMega LopunnyNormalFighting580651369454961354False
494445GarchompMega GarchompDragonGround70010817011512095924False
498448LucarioMega LucarioFightingSteel6257014588140701124False
511460AbomasnowMega AbomasnowGrassIce59490132105132105304False
527475GalladeMega GalladePsychicFighting6186816595651151104False
591531AudinoMega AudinoNormalFairy5451036012680126505False
796719DiancieMega DiancieRockFairy700501601101601101106True
import re
df.loc[df['Type 1'].str.contains('Fire|grass', regex=True, flags=re.I)]
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison3184549496565451False
12IvysaurGrassPoison4056062638080601False
23VenusaurGrassPoison525808283100100801False
33VenusaurMega VenusaurGrassPoison62580100123122120801False
44CharmanderFireNaN3093952436050651False
..........................................
735667LitleoFireNormal3696250587354726False
736668PyroarFireNormal507866872109661066False
740672SkiddoGrassNaN3506665486257526False
741673GogoatGrassNaN531123100629781686False
799721VolcanionFireWater6008011012013090706True

122 rows × 13 columns

import re
df.loc[df['Name'].str.contains('^pi[a-z]', regex=True, flags=re.I)]
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
2016PidgeyNormalFlying2514045403535561False
2117PidgeottoNormalFlying3496360555050711False
2218PidgeotNormalFlying47983807570701011False
2318PidgeotMega PidgeotNormalFlying579838080135801211False
3025PikachuElectricNaN3203555405050901False
136127PinsirBugNaN500651251005570851False
137127PinsirMega PinsirBugFlying6006515512065901051False
186172PichuElectricNaN2052040153535602False
219204PinecoBugNaN2905065903535152False
239221PiloswineIceGround450100100806060502False
438393PiplupWaterNaN3145351536156404False
558499PigniteFireFighting4189093557055555False
578519PidoveNormalFlying2645055503630435False
read()
# 第一个参数满足条件的, 设置第二个参数的列为后面的值
df.loc[df['Type 1'] == 'Grass', 'Type 1'] = 'Flamer'

df
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurFlamerPoison4549496565451False
12IvysaurFlamerPoison6062638080601False
23VenusaurFlamerPoison808283100100801False
33VenusaurMega VenusaurFlamerPoison80100123122120801False
44CharmanderFireNaN3952436050651False
.......................................
795719DiancieRockFairy50100150100150506True
796719DiancieMega DiancieRockFairy501601101601101106True
797720HoopaHoopa ConfinedPsychicGhost8011060150130706True
798720HoopaHoopa UnboundPsychicDark8016060170130806True
799721VolcanionFireWater8011012013090706True

800 rows × 12 columns

read()
# 第一个参数满足条件的, 设置第二个参数的列为后面的值
df.loc[df['Type 1'] == 'Grass', ['Type 1', 'Lendary']] = 'TEST VALUE'

df
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendaryLendary
01BulbasaurTEST VALUEPoison4549496565451FalseTEST VALUE
12IvysaurTEST VALUEPoison6062638080601FalseTEST VALUE
23VenusaurTEST VALUEPoison808283100100801FalseTEST VALUE
33VenusaurMega VenusaurTEST VALUEPoison80100123122120801FalseTEST VALUE
44CharmanderFireNaN3952436050651FalseNaN
..........................................
795719DiancieRockFairy50100150100150506TrueNaN
796719DiancieMega DiancieRockFairy501601101601101106TrueNaN
797720HoopaHoopa ConfinedPsychicGhost8011060150130706TrueNaN
798720HoopaHoopa UnboundPsychicDark8016060170130806TrueNaN
799721VolcanionFireWater8011012013090706TrueNaN

800 rows × 13 columns

read()
# 第一个参数满足条件的, 设置第二个参数的列为后面的值
df.loc[df['Type 1'] == 'Grass', ['Type 1', 'Lendary']] = ['TEST 1', 'TEST 2']

df
#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendaryLendary
01BulbasaurTEST 1Poison4549496565451FalseTEST 2
12IvysaurTEST 1Poison6062638080601FalseTEST 2
23VenusaurTEST 1Poison808283100100801FalseTEST 2
33VenusaurMega VenusaurTEST 1Poison80100123122120801FalseTEST 2
44CharmanderFireNaN3952436050651FalseNaN
..........................................
795719DiancieRockFairy50100150100150506TrueNaN
796719DiancieMega DiancieRockFairy501601101601101106TrueNaN
797720HoopaHoopa ConfinedPsychicGhost8011060150130706TrueNaN
798720HoopaHoopa UnboundPsychicDark8016060170130806TrueNaN
799721VolcanionFireWater8011012013090706TrueNaN

800 rows × 13 columns

read()

# 分组 求平均数
df.groupby(['Type 1']).mean(numeric_only=True).sort_values('Defense', ascending=False)
#HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
Type 1
Steel442.85185265.22222292.703704126.37037067.51851980.62963055.2592593.8518520.148148
Rock392.72727365.36363692.863636100.79545563.34090975.47727355.9090913.4545450.090909
Dragon474.37500083.312500112.12500086.37500096.84375088.84375083.0312503.8750000.375000
Ground356.28125073.78125095.75000084.84375056.46875062.75000063.9062503.1562500.125000
Ghost486.50000064.43750073.78125081.18750079.34375076.46875064.3437504.1875000.062500
Water303.08928672.06250074.15178672.94642974.81250070.51785765.9642862.8571430.035714
Ice423.54166772.00000072.75000071.41666777.54166776.29166763.4583333.5416670.083333
Grass344.87142967.27142973.21428670.80000077.50000070.42857161.9285713.3571430.042857
Bug334.49275456.88405870.97101470.72463853.86956564.79710161.6811593.2173910.000000
Dark461.35483966.80645288.38709770.22580674.64516169.51612976.1612904.0322580.064516
Poison251.78571467.25000074.67857168.82142960.42857164.39285763.5714292.5357140.000000
Fire327.40384669.90384684.76923167.76923188.98076972.21153874.4423083.2115380.096154
Psychic380.80701870.63157971.45614067.68421198.40350986.28070281.4912283.3859650.245614
Electric363.50000059.79545569.09090966.29545590.02272773.70454584.5000003.2727270.090909
Flying677.75000070.75000078.75000066.25000094.25000072.500000102.5000005.5000000.500000
Fighting363.85185269.85185296.77777865.92592653.11111164.70370466.0740743.3703700.000000
Fairy449.52941274.11764761.52941265.70588278.52941284.70588248.5882354.1176470.058824
Normal319.17346977.27551073.46938859.84693955.81632763.72449071.5510203.0510200.020408
read()

# 分组 求和
df.groupby(['Type 1']).sum(numeric_only=True).sort_values('HP', ascending=False)
#HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
Type 1
Water339468071830581708379789873883204
Normal312797573720058655470624570122992
Grass241414709512549565425493043352353
Psychic2170640264073385856094918464519314
Bug230803925489748803717447142562220
Fire170253635440835244627375538711675
Rock172802876408644352787332124601524
Dragon1518026663588276430992843265712412
Electric159942631304029173961324337181444
Ground114012361306427151807200820451014
Dark143022071274021772314215523611252
Ghost155682062236125982539244720591342
Fighting9824188626131780143417471784910
Poison7050188320911927169218031780710
Steel119571761250334121823217714921044
Ice10165172817461714186118311523852
Fairy764212601046111713351440826701
Flying2711283315265377290410222
read()

# 分组 求数量
df.groupby(['Type 1']).count().sort_values('HP', ascending=False)
#NameType 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
Type 1
Water11211253112112112112112112112112
Normal9898379898989898989898
Grass7070377070707070707070
Bug6969526969696969696969
Psychic5757195757575757575757
Fire5252245252525252525252
Electric4444174444444444444444
Rock4444354444444444444444
Ghost3232223232323232323232
Ground3232193232323232323232
Dragon3232213232323232323232
Dark3131213131313131313131
Poison2828132828282828282828
Fighting272772727272727272727
Steel2727222727272727272727
Ice2424112424242424242424
Fairy171721717171717171717
Flying44244444444
read()

# 每行都设置 count = 1
df['count'] = 1

# 分组 求数量
df.groupby(['Type 1']).count()['count']
Type 1
Bug          69
Dark         31
Dragon       32
Electric     44
Fairy        17
Fighting     27
Fire         52
Flying        4
Ghost        32
Grass        70
Ground       32
Ice          24
Normal       98
Poison       28
Psychic      57
Rock         44
Steel        27
Water       112
Name: count, dtype: int64
read()

# 每行都设置 count = 1
df['count'] = 1

# 分组 求数量
df.groupby(['Type 1', 'Type 2']).count()['count']
Type 1  Type 2  
Bug     Electric     2
        Fighting     2
        Fire         2
        Flying      14
        Ghost        1
                    ..
Water   Ice          3
        Poison       3
        Psychic      5
        Rock         4
        Steel        1
Name: count, Length: 136, dtype: int64
# 读取大文件
# 分批读取数据, chunksize 行数
for df in pd.read_csv(f'{data_dir}/pandas/pokemon_data.csv', chunksize=5):
#     print('CHUNK DF:::')
#     print(df['Name'])
    ;

# 读取大文件
# 分批读取数据, chunksize 行数

new_df = pd.DataFrame()
for df in pd.read_csv(f'{data_dir}/pandas/pokemon_data.csv', chunksize=5):
    new_df = pd.concat([new_df, df])
#     results = df.groupby(['Type 1']).count()
#     new_df = pd.concat([new_df, results])

new_df

#NameType 1Type 2HPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison4549496565451False
12IvysaurGrassPoison6062638080601False
23VenusaurGrassPoison808283100100801False
33VenusaurMega VenusaurGrassPoison80100123122120801False
44CharmanderFireNaN3952436050651False
.......................................
795719DiancieRockFairy50100150100150506True
796719DiancieMega DiancieRockFairy501601101601101106True
797720HoopaHoopa ConfinedPsychicGhost8011060150130706True
798720HoopaHoopa UnboundPsychicDark8016060170130806True
799721VolcanionFireWater8011012013090706True

800 rows × 12 columns

read()

# 分组 求平均数
df.groupby(['Type 1'])['Defense'].mean().plot()

在这里插入图片描述

read()

# 分组 求平均数
df.groupby(['Type 1'])['Defense'].mean().plot(kind='bar')

在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值