pandas Study Notes

Wanghd_yummy

Category: Python scientific data libraries. Tags: python, pandas

Article link: https://blog.csdn.net/Wanghd_yummy/article/details/119302649

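Series
DataFrame
- DataFrame Indexing
Pandas String Methods
Handling Missing Data in Pandas
Common Pandas Statistical Methods
Exercise: Counting Genres
Merging Data
Groupby
- Indexes
Group-and-Aggregate Exercises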

Series

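A pandas Series is a one-dimensional array whose elements carry labels.
Labels can be set by passing a list as the index; if none is given, the default labels are integers starting at 0.

import pandas as pd
t1 = pd.Series([1, 3, 5], index=[1, 2, 3])
print(t1)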

A Series can also be initialized from a dictionary, in which case the dictionary keys become the labels.

temp_dict = {"name": "Jessica", "age": 20, "tel": 10086}
t2 = pd.Series(temp_dict)
print(t2)

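Elements can be selected by label or by position.

t2["tel"] # by label
t2.iloc[2] # by position; plain t2[2] on a string-labeled Series is deprecated in recent pandas versions
t2[:2] # a contiguous range of rows
t2.iloc[[0, 2]] # several non-contiguous rows, by position
t2[["name", "tel"]] # several non-contiguous rows, by label (note the list inside the brackets)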
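The difference between label-based and position-based selection can be made explicit with loc and iloc. The following is a minimal sketch that rebuilds the t2 example; the variable names by_label, by_position and several are only illustrative.

```python
import pandas as pd

# Same data as the t2 example above
t2 = pd.Series({"name": "Jessica", "age": 20, "tel": 10086})

by_label = t2.loc["tel"]            # explicit label-based lookup
by_position = t2.iloc[2]            # explicit position-based lookup (third element)
several = t2.loc[["name", "tel"]]   # a list of labels returns a smaller Series

print(by_label, by_position)
print(several)
```

Using loc and iloc avoids any ambiguity about whether a key is meant as a label or as a position.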

Get the index labels or the values of a Series:

t2.index # the index labels
t2.values # the values

DataFrame

A DataFrame has both a row index (index) and a column index (columns).

import numpy as np
t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))
print(t)

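Create a DataFrame from a dictionary of columns (when it is built from a list of dictionaries instead, keys missing from a record are filled with NaN):

d1 = {"name": ["Wang", "Cao"], "age": [18, 20]}
t1 = pd.DataFrame(d1)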
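The NaN filling is easiest to see when a DataFrame is built from a list of dictionaries whose keys differ. This is a small sketch with made-up records; the "tel" field and the name t_records are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records: the second one has no "tel" key
records = [
    {"name": "Wang", "age": 18, "tel": 10086},
    {"name": "Cao", "age": 20},
]

t_records = pd.DataFrame(records)
print(t_records)                  # the missing "tel" value shows up as NaN
print(t_records["tel"].isnull())  # True only for the second row
```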

Basic DataFrame attributes:
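The original list of attributes is not reproduced in these notes, so the selection below is an assumption: a short sketch of attributes and summary methods that are commonly used to inspect a DataFrame.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))

print(t.shape)       # (number of rows, number of columns)
print(t.ndim)        # number of dimensions
print(t.dtypes)      # data type of each column
print(t.index)       # row labels
print(t.columns)     # column labels
print(t.values)      # the underlying NumPy array
print(t.head(2))     # first two rows
print(t.tail(2))     # last two rows
t.info()             # concise summary: index, columns, non-null counts, memory
print(t.describe())  # count, mean, std, min, quartiles, max per numeric column
```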

Sort the rows by the values of one column:

df = df.sort_values(by = "Count_AnimalName",ascending=False)
print(df.head())

DataFrame Indexing

A numeric slice in square brackets selects rows, while a string selects a column.

print(df[:20]["Count_AnimalName"])

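Optimized indexing methods in pandas:
df.loc selects by label; in a label slice, the label after the colon is included in the result.
df.iloc selects by position.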

t3 = pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list("ABCD"))
print(t3.loc["a","A"])
print(t3.iloc[2])
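The inclusive end of a label slice is the main difference from positional slicing. A minimal sketch on the same t3 frame (label_slice and pos_slice are illustrative names):

```python
import numpy as np
import pandas as pd

t3 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))

# Label slices include both end labels: rows a..b, columns A..C
label_slice = t3.loc["a":"b", "A":"C"]

# Position slices exclude the end position: rows 0..1, columns 0..2
pos_slice = t3.iloc[0:2, 0:3]

print(label_slice)
print(pos_slice)
```

Both expressions select the same 2 x 3 block, even though the loc slice ends at "b" and "C" while the iloc slice ends at 2 and 3.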

Boolean indexing in pandas:

t1 = df[(df["Count_AnimalName"] > 800) & (df["Row_Labels"].str.len() > 4)] # each condition needs its own parentheses
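To make the boolean filter above runnable without the original data file, the sketch below uses a tiny hypothetical table with the same two column names. The rows are invented for illustration only.

```python
import pandas as pd

# Hypothetical sample with the column names used above
df = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "LUCY"],
    "Count_AnimalName": [1195, 1153, 856, 710],
})

# & means "and", | means "or"; each condition must be parenthesized
# because & and | bind more tightly than comparison operators
popular_and_long = df[(df["Count_AnimalName"] > 800) & (df["Row_Labels"].str.len() > 4)]
popular_or_long = df[(df["Count_AnimalName"] > 1000) | (df["Row_Labels"].str.len() > 4)]

print(popular_and_long)
print(popular_or_long)
```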

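Pandas String Methods

[Figure: pandas string methods]
The tolist() method converts a Series into a list; applied after str.split(), it produces a list of lists.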

print(df["info"].str.split("/").tolist())
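A few string methods on a small hypothetical "info" column are sketched below. The sample values are invented, and the choice of methods (split, upper, contains, len) is only a sample of what the str accessor offers.

```python
import pandas as pd

# Hypothetical data for the "info" column
df = pd.DataFrame({"info": ["a/b/c", "d/e", "f"]})

parts = df["info"].str.split("/").tolist()  # list of lists, one inner list per row
upper = df["info"].str.upper()              # element-wise upper-casing
has_slash = df["info"].str.contains("/")    # boolean Series
lengths = df["info"].str.len()              # length of each string

print(parts)
print(upper.tolist(), has_slash.tolist(), lengths.tolist())
```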

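Handling Missing Data in Pandas

Check for NaN values:

pd.isnull(t3) # True where a value is NaN
pd.notnull(t3) # True where a value is not NaN
t3[pd.notnull(t3["W"])] # keep only the rows where column W is not NaN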

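Drop rows that contain NaN with dropna:

t3.dropna(axis=0, how="any") # how defaults to "any": drop a row if it contains at least one NaN
t3.dropna(axis=0, how="all") # drop a row only if all of its values are NaN
t3.dropna(axis=0, how="any", inplace=True) # modify t3 itself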

Fill NaN values:

t3.fillna(0) # fill NaN with 0
t3.fillna(t3.mean()) # fill NaN with the mean of each column
t2["age"] = t2["age"].fillna(t2["age"].mean()) # fill only the age column of t2
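The detection, dropping and filling steps can be tried together on a small frame. This sketch uses an invented t3 with two missing values; the column names W, X, Y are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values
t3 = pd.DataFrame(
    {"W": [0.0, np.nan, 8.0], "X": [1.0, 5.0, np.nan], "Y": [2.0, 6.0, 10.0]},
    index=list("abc"),
)

mask = pd.isnull(t3)                    # True where a value is NaN
dropped = t3.dropna(axis=0, how="any")  # keeps only rows without any NaN
filled = t3.fillna(t3.mean())           # mean() skips NaN when averaging each column

print(mask)
print(dropped)
print(filled)
```

Note that t3.mean() must be called; passing the method object t3.mean without parentheses does not fill with the column means.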

Common Pandas Statistical Methods

[Figure: common statistical methods]
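Since the list of methods itself is not reproduced here, the following sketch shows some frequently used ones. Which methods the original listed is an assumption, and the Rating and Runtime columns are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data
df = pd.DataFrame({"Rating": [8.1, 7.0, 6.2, 7.0], "Runtime": [121, 124, 108, 117]})

rating_mean = df["Rating"].mean()      # arithmetic mean
rating_max = df["Rating"].max()        # largest value
rating_min = df["Rating"].min()        # smallest value
rating_median = df["Rating"].median()  # median
best_row = df["Rating"].idxmax()       # index label of the largest value
worst_row = df["Rating"].idxmin()      # index label of the smallest value

print(rating_mean, rating_max, rating_min, rating_median, best_row, worst_row)
print(df.describe())                   # summary statistics for every numeric column
```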

Exercise: Counting Genres

Count how often each genre occurs in the movie data:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
file_path = './IMDB-Movie-Data.csv'
df = pd.read_csv(file_path)

print(df.head(1))

temp_list = df["Genre"].str.split(",").tolist()
genre_list = list(set([i for j in temp_list for i in j]))

zero_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)

for i in range(df.shape[0]):
    zero_df.loc[i, temp_list[i]] = 1
print(zero_df.head(3))

# Sum the count of genre
genre_count = zero_df.sum(axis=0)
genre_count = genre_count.sort_values()

plt.figure(figsize=(20, 8), dpi=80)
_x = genre_count.index
_y = genre_count.values
plt.bar(range(len(_x)), _y)
plt.xticks(range(len(_x)), _x)
plt.show()
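As an alternative to building the zero matrix and filling it in a loop, the same 0/1 table can be produced with Series.str.get_dummies. This is a different technique from the loop above, sketched on three invented rows rather than the real CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the "Genre" column of IMDB-Movie-Data.csv
df = pd.DataFrame({"Genre": ["Action,Adventure,Sci-Fi", "Adventure,Mystery", "Action,Comedy"]})

# One 0/1 column per genre, split on the comma separator
dummies = df["Genre"].str.get_dummies(sep=",")

# Summing each column gives the number of movies per genre
genre_count = dummies.sum(axis=0).sort_values()

print(dummies)
print(genre_count)
```

The resulting genre_count can be plotted with the same bar chart code as above.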

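Merging Data

t1.join(t2): combines rows that share the same row index. By default t1's index is used: rows missing from t2 are filled with NaN, and rows that exist only in t2 are dropped.

t1.merge(t2, on="a", how="inner") # inner join on column a
t1.merge(t2, on="a", how="outer") # outer join
t1.merge(t2, on="a", how="left") # left join, keeps every key of t1
t1.merge(t2, on="a", how="right") # right join, keeps every key of t2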
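The four join types are easier to compare on concrete data. The sketch below uses two small invented frames that share the key column a; the columns x and y are assumptions.

```python
import pandas as pd

# Hypothetical frames: keys 2 and 3 appear in both
t1 = pd.DataFrame({"a": [1, 2, 3], "x": ["p", "q", "r"]})
t2 = pd.DataFrame({"a": [2, 3, 4], "y": ["u", "v", "w"]})

inner = t1.merge(t2, on="a", how="inner")  # keys present in both frames
outer = t1.merge(t2, on="a", how="outer")  # keys present in either frame
left = t1.merge(t2, on="a", how="left")    # every key of t1, NaN where t2 has no match
right = t1.merge(t2, on="a", how="right")  # every key of t2, NaN where t1 has no match

# join works on the row index, so the key column is moved into the index first
joined = t1.set_index("a").join(t2.set_index("a"))

for name, frame in [("inner", inner), ("outer", outer), ("left", left), ("right", right)]:
    print(name)
    print(frame)
print(joined)
```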

Groupby

grouped = df.groupby(by="column names") returns a DataFrameGroupBy object. Iterating over grouped yields tuples of the form (group key, DataFrame containing that group).

import pandas as pd
import numpy as np

file_path = "./directory.csv"
df = pd.read_csv(file_path)
# Group by country
grouped = df.groupby(by="Country")
print(grouped)
# Count the rows in each group
country_count = grouped["Brand"].count()
print(country_count["US"])
print(country_count["CN"])
# Count the Starbucks stores in each province of China
# First extract the rows for China
china_data = df[df["Country"]=="CN"]
# Then group and count
grouped = china_data.groupby(by="State/Province").count()["Brand"]
print(grouped)
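The (group key, DataFrame) tuples described above can be seen by iterating over the grouped object. Since directory.csv is not included here, this sketch uses five invented rows with the same column names.

```python
import pandas as pd

# Hypothetical stand-in for directory.csv
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 5,
    "Country": ["US", "US", "CN", "CN", "CN"],
    "State/Province": ["CA", "NY", "31", "31", "11"],
})

grouped = df.groupby(by="Country")

# Each iteration yields the group key and the sub-DataFrame for that key
keys = []
for country, sub_df in grouped:
    keys.append(country)
    print(country, sub_df.shape)

country_count = grouped["Brand"].count()
print(country_count)
```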

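Other methods of GroupBy objects:
[Figure: other groupby methods]
Grouping a single column by several keys returns a Series whose index is built from those keys:

grouped = df["Brand"].groupby(by=[df["Country"], df["State/Province"]]).count()
print(grouped)

To get a DataFrame instead, wrap the selected column name in an extra pair of brackets [ ]:

grouped = df[["Brand"]].groupby(by=[df["Country"], df["State/Province"]]).count()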
print(grouped)
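The difference between the Series and DataFrame results can be checked directly. The sketch below repeats both groupings on a small invented table; as_series and as_frame are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for directory.csv
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 5,
    "Country": ["US", "US", "CN", "CN", "CN"],
    "State/Province": ["CA", "NY", "31", "31", "11"],
})

keys = [df["Country"], df["State/Province"]]

as_series = df["Brand"].groupby(by=keys).count()   # single brackets: Series
as_frame = df[["Brand"]].groupby(by=keys).count()  # double brackets: DataFrame

print(type(as_series).__name__, as_series.index.nlevels)
print(type(as_frame).__name__)
print(as_series.loc[("CN", "31")])  # look up one (country, province) pair
```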

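Indexes

Selecting rows of a DataFrame by label requires loc, whereas a Series can be indexed by label directly.
Index values may repeat.

grouped1.index # get the index
df1.index = ["a", "b"] # replace the index
df.reindex(["a", "f"]) # take rows a and f of df; if f does not exist, its row is filled with NaN
df1.set_index("a", drop=False) # use column a as the index; drop defaults to True, which removes a from the columns
df1.set_index("a").index.unique() # the index with duplicates removed
df1.set_index(["a", "b"]) # several columns can be used as a multi-level index
df1["one"]["t"] # for a Series with a multi-level index: the value at one-t
df1.swaplevel() # for a Series: swap the outer and inner index levels
df1.loc["one"].loc["h"] # for a DataFrame with a multi-level index: select by both levels
df1.swaplevel() # for a DataFrame: swap the outer and inner index levels, same as for a Series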
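The multi-level index operations listed above can be tried on a small frame. This is a sketch under the assumption that df1 has columns a, b and c; the data is invented.

```python
import pandas as pd

# Hypothetical frame; columns a and b become the two index levels
df1 = pd.DataFrame({
    "a": ["one", "one", "two", "two"],
    "b": ["h", "t", "h", "t"],
    "c": [1, 2, 3, 4],
})

unique_a = df1.set_index("a").index.unique()  # duplicates in the index removed
indexed = df1.set_index(["a", "b"])           # DataFrame with a two-level index
s = indexed["c"]                              # Series with the same two-level index

value_series = s["one"]["t"]                  # Series: outer label, then inner label
swapped = s.swaplevel()                       # inner level becomes the outer level
value_swapped = swapped.loc[("t", "one")]
value_frame = indexed.loc["one"].loc["h", "c"]  # DataFrame: loc for each level

reindexed = df1.reindex([0, 9])               # row 9 does not exist, so it is all NaN

print(value_series, value_swapped, value_frame)
print(reindexed)
```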

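Group-and-Aggregate Exercises

Given data on 10,000 of the world's top-ranked books, answer the following question:
What is the average rating of the books published in each year?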

import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import font_manager

file_path = './books.csv'
df = pd.read_csv(file_path)

data1 = df[pd.notnull(df["original_publication_year"])]

grouped = data1["average_rating"].groupby(by=data1["original_publication_year"]).mean()
print(grouped)
plt.figure(figsize=(20, 8), dpi=80)

_x = grouped.index
_y = grouped.values
plt.xticks(range(len(_x))[::10], _x[::10].astype(int), rotation=45)
plt.plot(range(len(_x)), _y)
plt.show()

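Given data on 10,000 of the world's top-ranked books, answer the following question:
How many books were published in each year?
Note that the code that follows does not use the books data: it reads directory.csv, keeps the rows for China, and plots the 25 cities with the most stores.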

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import font_manager
my_font = font_manager.FontProperties(fname="/usr/share/fonts/truetype/arphic/uming.ttc")

file_path = './directory.csv'
df = pd.read_csv(file_path)
df =df[df["Country"] == "CN"]
print(df.head(1))
data = df.groupby(by="City")["Brand"].count().sort_values(ascending=False)[:25]
_x = data.index
_y = data.values

plt.figure(figsize=(20, 8), dpi=80)
plt.bar(range(len(_x)), _y, width=0.3, color="orange")
plt.xticks(range(len(_x)), _x, fontproperties=my_font)
plt.show()
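For the number of books per year, the approach of the average-rating exercise carries over with count() in place of mean(). The sketch below is an assumption about how it could look: it uses four invented rows, and the "title" column is a hypothetical name, since only original_publication_year and average_rating appear in the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for books.csv; "title" is an assumed column name
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "original_publication_year": [2008.0, 2008.0, np.nan, 1997.0],
})

# Drop the rows without a publication year, as in the average-rating exercise
data1 = df[pd.notnull(df["original_publication_year"])]

# Count the books in each year
books_per_year = data1.groupby(by="original_publication_year")["title"].count()
print(books_per_year)
```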
