Pandas Basics (Part 2)

AI小白日记

Published 2024-08-06 16:26:12

Category: Python Basics; tags: pandas

Article link: https://blog.csdn.net/zhanghemeng/article/details/140959141

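1. Lesson objectives

Understand the characteristics and usage of DataFrame
Analyze CSV files with Pandas
Analyze JSON files with Pandas
Clean data with Pandas
Use common Pandas functions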

2. Data structure: DataFrame

DataFrame is another core data structure in Pandas; it represents two-dimensional tabular data
A DataFrame has both a row index and a column index; it can be viewed as a dictionary of Series that share one index
DataFrame supports data access, filtering, splitting, merging, reshaping, aggregation, and transformation
Creating a DataFrame
From a list of lists
- data = [['Google', 10], ['Baidu', 12], ['Wiki', 13]]
- df = pd.DataFrame(data, columns=['Site', 'Age'])
From a dictionary
- data = {'Site': ['Google', 'Baidu', 'Wiki'], 'Age': [10, 12, 13]}
- df = pd.DataFrame(data)
- data = [{'Site': 'Google', 'Age': 10}, {'Site': 'Baidu', 'Age': 10}, {'Site': 'Wiki'}]  # the missing 'Age' becomes NaN
- df = pd.DataFrame(data)
- df = pd.DataFrame({"Nevada": {2001: 2.4, 2002: 2.9}, "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}})
From an ndarray
- ndarray_data = np.array([['Google', 10], ['Baidu', 12], ['Wiki', 13]])
- df = pd.DataFrame(ndarray_data, columns=['Site', 'Age'])

import pandas as pd
import numpy as np

# Each inner list of the 2-D list is one row
dt = [['google', 20], ['baidu', 30], ['facebook', 40]]
df = pd.DataFrame(dt, columns = ['size', 'age'])
print(df)
print("*" * 30)

# Dictionary of lists: keys become column names, values become column data
di = {'size':['google', 'baidu', 'facebook'], 'age':[20, 30, 40]}
df = pd.DataFrame(di, columns = ['size', 'age'])
print(df)
print("*" * 30)

# List of dictionaries: each dictionary is one row
dl = [{'size':'google', 'age':20}, {'size':'baidu', 'age':30}, {'size':'facebook', 'age':40}]
df = pd.DataFrame(dl, columns = ['size', 'age'])
print(df)

# Dictionary of dictionaries: outer keys become column names, inner keys become the index
di1 = {'a': 1, 'b': 2, 'c': 3}
di2 = {'a': 4, 'b': 5, 'd': 6}
di3 = {'k1': di1, 'k2': di2}
df = pd.DataFrame(di3)
print(df)
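
The ndarray case listed above is not exercised by the code in this section, and it has one catch worth seeing: a NumPy array holds a single dtype. The following is a minimal sketch, assuming the same site/age data; the names `mixed` and `df_mixed` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype, so mixing strings and integers
# turns every element into a string.
mixed = np.array([['Google', 10], ['Baidu', 12], ['Wiki', 13]])
df_mixed = pd.DataFrame(mixed, columns=['Site', 'Age'])
print(df_mixed.dtypes)  # neither column is numeric at this point

# Convert the column explicitly when numbers are needed.
df_mixed['Age'] = df_mixed['Age'].astype('int64')
print(df_mixed['Age'].sum())
```

When the columns have different types, building the DataFrame from a list of lists or a dictionary avoids the conversion step.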

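3. The loc attribute

loc selects rows by index label; with the default index, the labels are the integers starting from 0
Note: a single label returns a Series, while a list of labels returns a DataFrame
For example
- data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
- # load the data into a DataFrame object
- df = pd.DataFrame(data)
- # return the first row, then the second row (each as a Series)
- print(df.loc[0])
- print(df.loc[1])
- # return the first and second rows together (as a DataFrame)
- print(df.loc[[0, 1]])
- df = pd.DataFrame(data, index=["day1", "day2", "day3"])
- print(df.loc["day2"])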

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, columns=["calories", "duration"])
print(df)
print("*" * 30)
print(df.loc[2])
print(df.loc[0])
print('*' * 30)
print(df.loc[[0, 2]])
print(type(df.loc[0]))
print(type(df.loc[[0, 2]]))

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=['day1', 'day2', 'day3'])
print(df)
print("*" * 30)
print(df.loc['day2']['calories'])
print(df.loc[['day1', 'day2']].loc['day2']['calories'])
'''
      calories  duration
day1       420        50
day2       380        40
day3       390        45
******************************
380
380
'''
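
The listing above reads a single cell with chained indexing (`df.loc['day2']['calories']`). As a possible alternative, the sketch below passes the row label and the column label in one `loc` call, and also uses `.at`, the scalar accessor. The variable names `value_loc` and `value_at` are illustrative.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Row label and column label in a single loc call
value_loc = df.loc["day2", "calories"]
# .at is a scalar-only accessor with the same label semantics
value_at = df.at["day2", "calories"]
print(value_loc, value_at)

# Assignment through a single loc call writes to df itself
df.loc["day2", "calories"] = 400
print(df.loc["day2", "calories"])
```

The single-call form matters mostly for assignment, where chained indexing may end up modifying a temporary copy instead of the original DataFrame.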

4. Practice

# Create a DataFrame from a dictionary
df = pd.DataFrame({'Column': [1, 2, 3], 'Column2': [4, 5, 6]})
# Create a DataFrame from a list of lists
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns = ['Column', 'Column2', 'Column3'])
# Create a DataFrame from a NumPy array (every row must have the same length)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# Create a DataFrame from Series
s1 = pd.Series(['Alice', 'Bob', 'Charlie'])
s2 = pd.Series([25, 30, 35])
s3 = pd.Series(['New York', 'Los Angeles', 'Chicago'])
df = pd.DataFrame({'Name':s1, 'Age':s2, 'City':s3})

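5. DataFrame attributes and methods

print(df.shape) # shape as (rows, columns)
print(df.columns) # column names
print(df.index) # row index
print(df.head()) # first rows, 5 by default
print(df.tail()) # last rows, 5 by default
df.info() # summary of the data; info() prints by itself and returns None
print(df.describe()) # descriptive statistics
print(df.mean(numeric_only=True)) # mean of the numeric columns
print(df.sum(numeric_only=True)) # sum of the numeric columns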
s1 = pd.Series(["tom", "jack", "rose"],["001", "002", "003"])
s2 = pd.Series([99, 88, 77], ['001', '002', '003'])
s3 = pd.Series([66, 89, 90], ['001', '002', '003'])
df = pd.DataFrame({'name':s1, 'chinese':s2, 'math':s3})
print(df)
print("*" * 30)
print(df.shape)
print(df.columns)
print(df.index.values)
print(df.head(2))
print("*" * 30)
print(df.tail(2))
print("*" * 30)
df.info()  # prints directly; wrapping it in print() would also print None
print("*" * 30)
print(df.describe())
# print(df.mean())  # raises an error in recent pandas versions because 'name' is a non-numeric column
print(df[['chinese', 'math']].mean())
print(df.sum())
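
The error noted above comes from averaging a DataFrame that contains a string column. Two ways around it are sketched below, assuming the same name/chinese/math data; `means` and `means_by_dtype` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["tom", "jack", "rose"], "chinese": [99, 88, 77], "math": [66, 89, 90]},
    index=["001", "002", "003"],
)

# Option 1: let pandas skip the non-numeric columns
means = df.mean(numeric_only=True)
# Option 2: pick the numeric columns by dtype first
means_by_dtype = df.select_dtypes(include="number").mean()
print(means)
print(means.equals(means_by_dtype))
```

Selecting the columns by name, as in `df[['chinese', 'math']].mean()`, works equally well when the numeric columns are known in advance.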

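6. Accessing DataFrame elements

Accessing columns
- print(df['Column']) # by column name
- print(df.Column) # by attribute
- print(df.loc[:, 'Column']) # by column label
- print(df.iloc[:, 0]) # by integer position
- print(df['Column'][0]) # a single element (df.loc[0, 'Column'] does the same in one step)
Accessing rows
- print(df.loc[Index, 'Columns']) # by row label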

s1 = pd.Series(["tom", "jack", "rose"], ["001", "002", "003"])
s2 = pd.Series([99, 88, 77], ['001', '002', '003'])
s3 = pd.Series([66, 89, 90], ['001', '002', '003'])
df = pd.DataFrame({'name': s1, 'chinese': s2, 'math': s3})
print(df)
print("*" * 30)
print(df['chinese'])
print(df.loc['001':'002','chinese'])
print(df.chinese)
print(df.iloc[:,1])
print(df[['chinese', 'math']])
df['math'] = df['math'] + 10
print(df['math'])
df2 = df.drop(['math'], axis=1)
print(df2)
df2 = df.drop(['001', '002'], axis = 0)
print(df2)

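7. Modifying a DataFrame

Modify column data: assign to the column directly.
- df['Column'] = [data1, data2, data3]
Add a new column: assign to a new column name.
- df['NewColumn'] = [data1, data2, data3]
Add a new row: assign through loc with a new index label
- df.loc[Index] = [data1, data2, data3]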
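
The three operations above can be sketched on a small table. This is only an illustration with assumed data; the values and index labels are made up.

```python
import pandas as pd

df = pd.DataFrame({"name": ["tom", "jack"], "chinese": [99, 88]}, index=["001", "002"])

# Modify an existing column: the list length must match the number of rows
df["chinese"] = [95, 85]
# Add a new column by assigning to a name that does not exist yet
df["math"] = [66, 89]
# Add a new row: values follow the column order name, chinese, math
df.loc["003"] = ["rose", 77, 90]
print(df)
```

All three assignments change `df` in place, so no reassignment is needed.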

8. Deleting DataFrame elements

Delete a column: use the drop method
- df_dropped = df.drop('Column', axis=1)
Delete a row: also use the drop method
- df_dropped = df.drop(Index)
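
One detail that is easy to miss: by default `drop` returns a new DataFrame and leaves the original untouched. A minimal sketch with assumed data follows; `no_math` and `no_first` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["tom", "jack", "rose"], "math": [66, 89, 90]},
                  index=["001", "002", "003"])

no_math = df.drop("math", axis=1)   # equivalent to df.drop(columns="math")
no_first = df.drop(["001"])         # axis=0 (rows) is the default
print(list(no_math.columns))
print(list(no_first.index))
print(list(df.columns))             # df itself is unchanged
```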

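9. Statistical analysis of a DataFrame

Descriptive statistics: use .describe() for a statistical summary of the numeric columns
- df.describe()
Computing statistics: use aggregation functions such as .sum(), .mean(), .max()
- df['Column'].sum()
- df.mean() # pass numeric_only=True if the DataFrame has non-numeric columns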

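df_no_math = df.copy()
del df_no_math['math']  # del removes a column in place; use a copy because 'math' is still needed below
print(df_no_math)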
print(df.describe())
print(df[['chinese', 'math']].sum())
print(df[['chinese', 'math']].min())
print(df[['chinese', 'math']].max())
print(df[['chinese', 'math']].std())
print(df[['chinese', 'math']].mean())
print(df[['chinese', 'math']].count())
df2 = df.reset_index(drop = True)
print(df2)
df2 = df.set_index("name")
print(df2)
print(df[(df['chinese'] != 87) & (df['math'] >= 80)])
print(df[['chinese', 'math']].dtypes)
df2 = df["chinese"].astype("float64")
print(df2)

10. DataFrame index operations

Reset the index: use .reset_index()
- df_reset = df.reset_index(drop=True)
Set the index: use .set_index()
- df_set = df.set_index('Column')
Filter a DataFrame by condition
- df[(df['Column'] > 2) & (df['Column'] < 5)]
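
The index operations and the conditional filter above can be tried together. The following sketch uses assumed score data; the names `by_name`, `restored`, `dropped` and `passed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["tom", "jack", "rose"],
                   "chinese": [99, 88, 77],
                   "math": [66, 89, 90]})

# set_index moves a column into the index; reset_index() moves it back
by_name = df.set_index("name")
restored = by_name.reset_index()
# reset_index(drop=True) discards the old index instead of keeping it as a column
dropped = by_name.reset_index(drop=True)

# Each condition needs its own parentheses; & is element-wise AND, | is OR
passed = df[(df["chinese"] >= 80) & (df["math"] >= 80)]
print(passed)
```

The parentheses are required because `&` binds more tightly than the comparison operators in Python.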

11. DataFrame data types

Inspect data types: use the dtypes attribute
- df.dtypes
Convert data types: use the astype method
- df['Column'] = df['Column'].astype('float64')
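
A short sketch of the conversion, assuming one integer column and one column of digit-only strings (the column name `score_text` is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"chinese": [99, 88, 77], "score_text": ["66", "89", "90"]})
print(df.dtypes)

# astype returns a new object; assign it back to keep the result
df["chinese"] = df["chinese"].astype("float64")
df["score_text"] = df["score_text"].astype("int64")  # digit-only strings convert cleanly
print(df.dtypes)
```

Strings that do not represent a number make `astype('int64')` raise an error, so the input is assumed to be clean here.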

12. Merging DataFrames

Merging: use concat or merge
- # vertical concatenation (stack rows)
  - pd.concat([df1, df2], ignore_index=True)
- # horizontal merge (join on a key column)
  - pd.merge(df1, df2, on='Column')

df1 = pd.DataFrame({"c1": ["aa", "bb", "cc"]})
df2 = pd.DataFrame({"c2": ["dd", "ee", "ff"]})
print(df1)
print(df2)
print("=" * 30)
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
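
In the listing above the two frames have different column names, so the stacked result is padded with NaN. The sketch below contrasts that with stacking frames that share a column and with side-by-side concatenation via `axis=1`; the frames `a`, `b`, `c` are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"c1": ["aa", "bb"]})
b = pd.DataFrame({"c1": ["cc", "dd"]})
c = pd.DataFrame({"c2": ["ee", "ff"]})

stacked = pd.concat([a, b], ignore_index=True)     # rows of b appended under a
side_by_side = pd.concat([a, c], axis=1)           # columns aligned on the index
mismatched = pd.concat([a, c], ignore_index=True)  # different columns: gaps become NaN
print(stacked, side_by_side, mismatched, sep="\n\n")
```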

df1 = pd.DataFrame({"name": ["tom", "jack", "rose"], "chinese":[88, 99, 96]})
df2 = pd.DataFrame({"name": ["tom", "jack", "john"], "chinese":[77, 93, 86]})
print(df1)
print(df2)
print("=" * 20)
df3 = pd.merge(df1, df2, on="name")
print(df3)
df3 = pd.merge(df1, df2, on="name", how="left")
print(df3)
df3 = pd.merge(df1, df2, on="name", how="right")
print(df3)
df3 = pd.merge(df1, df2, on="name", how = "outer")
print(df3)
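
Because both frames above have a `chinese` column, pandas renames them `chinese_x` and `chinese_y` in the result. As a possible refinement, the sketch below names the suffixes explicitly and adds `indicator=True` to show where each row came from. The suffix names `_term1` and `_term2` are an assumption for illustration, not part of the original data.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["tom", "jack", "rose"], "chinese": [88, 99, 96]})
df2 = pd.DataFrame({"name": ["tom", "jack", "john"], "chinese": [77, 93, 86]})

# Inner join (the default): only names present in both frames
inner = pd.merge(df1, df2, on="name", suffixes=("_term1", "_term2"))
# Outer join: every name from either frame; _merge records the origin of each row
outer = pd.merge(df1, df2, on="name", how="outer",
                 suffixes=("_term1", "_term2"), indicator=True)
print(inner)
print(outer)
```

In the outer result, rows found on only one side have NaN in the columns that come from the other side.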

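13. Indexing and slicing

DataFrame supports indexing and slicing on both rows and columns
- print(df[['Name', 'Age']]) # select multiple columns
- print(df[1:3]) # slice rows by position (rows 1 and 2)
- print(df.loc[:, 'Name']) # select a single column
- print(df.loc[1:2, ['Name', 'Age']]) # select rows and columns by label (both endpoints included)
- print(df.iloc[:, 1:]) # select rows and columns by integer position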

s1 = pd.Series(["tom", "jack", "rose", "mark", "john"], ['001', '002', '003', '004', '005'])
s2 = pd.Series([99, 88, 77, 56, 69], ['001', '002', '003', '004', '005'])
s3 = pd.Series([66, 89, 90, 34, 87], ['001', '002', '003', '004', '005'])
df = pd.DataFrame({'name': s1, 'chinese': s2, 'math': s3})
print(df)
print("="*20)
# Select one column
print(df['name'])
print("="*20)
# Select multiple columns
print(df[['name', 'math', 'chinese']])
print("="*20)
# Select one row
print(df.loc['001'])
print("="*20)
# Select multiple rows
print(df.loc[['001', '002', '003']])

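# Slice multiple rows
print(df[1:3])  # positional slice: rows at positions 1 and 2
print(df.loc["001": "002", ["name", "chinese"]])  # label slice: both endpoints included
print(df.iloc[1:3, 0:2])  # integer positions; the end position is excluded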

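14. The difference between iloc and loc

How they access data
The main difference between loc and iloc in Pandas is how they access data.

loc is label-based: it accesses data through row and column labels. When you use loc, you provide the row and column labels (the index names) to locate data. For example, df.loc[row_label, column_label] selects the data at that row and column. Use : to select all rows or all columns. Slices in loc include both endpoints, for rows and columns alike: [start_label:end_label] returns the end label as well.

iloc is integer-position-based: it accesses data through row and column numbers. When you use iloc, you provide the integer positions of the rows and columns (counting from 0). For example, df.iloc[row_number, column_number] selects a single value, and df.iloc[:, column_number] selects all the data in one column. Slices in iloc are half-open for both rows and columns: the start position is included and the end position is not.

In summary, the choice between loc and iloc depends on whether your data has meaningful labels (index names or column names) or whether you need to access data by integer position. If the data has meaningful labels, loc is more intuitive and convenient; if you need positional access, use iloc.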
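
The endpoint rule is the part that most often causes off-by-one mistakes, so here is a minimal sketch of it. It assumes a 5x5 table labeled a-e and A-E; the Series `s` is a made-up example with an integer index.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(25).reshape(5, 5), index=list("abcde"), columns=list("ABCDE"))

by_label = data.loc["a":"c", "A":"B"]  # rows a, b, c and columns A, B (end labels included)
by_position = data.iloc[0:3, 0:2]      # positions 0, 1, 2 and 0, 1 (end positions excluded)
print(by_label.equals(by_position))    # True

# With an integer index, identical-looking slices give different results
s = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])
print(len(s.loc[1:2]), len(s.iloc[1:2]))  # 2 1
```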

import numpy as np
import pandas as pd

# Create the DataFrame
data = pd.DataFrame(np.arange(25).reshape(5, 5), index=list('abcde'), columns=list('ABCDE'))
print(data)
print("*" * 30)
# 1. Select a row (the first row, as an example)
test1 = data.loc['a']
print(test1)

test2 = data.iloc[0]
print(test2)

# 2. Select a column (the third column, as an example)
test3 = data.loc[:,['C']]
print(test3)

test4 = data.iloc[:,[2]]
print(test4)

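# 3. Select specific rows and columns (rows with index 'a', 'b' and columns 'A', 'B')

test5 = data.loc[['a','b'],['A','B']]
print(test5)
test6 = data.iloc[[0, 1], [0, 1]]  # positions 0, 1 correspond to labels 'a', 'b' and 'A', 'B'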
print(test6)

# 4. Select data by condition
# Select the rows where column A equals 15
test7 = data.loc[data['A']==15]
print(test7)

# The isin function

# Select the rows where column A equals 10
test8 = data[data['A'].isin([10])]
print(test8)
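
With a single value, `isin` behaves like `==`; it becomes more useful with several values. A brief sketch on the same 5x5 table, where the names `mask`, `selected` and `excluded` are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(25).reshape(5, 5), index=list("abcde"), columns=list("ABCDE"))

mask = data["A"].isin([10, 15])  # boolean Series, True for rows c and d
selected = data.loc[mask]
excluded = data.loc[~mask]       # ~ negates the mask
print(selected)
print(list(excluded.index))
```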