Data Analysis: pandas

Z_Coding

Article link: https://blog.csdn.net/z1360408752/article/details/113736220

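Summary (auto-generated by CSDN): This post introduces the basics of the pandas library, including creating, slicing, and indexing a Series, and constructing a DataFrame and inspecting its basic attributes. It explains how to select data with loc and iloc and demonstrates boolean indexing and the handling of missing data. It also looks at pandas in practical data analysis, such as computing the average rating and the number of directors in a movie dataset.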

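Why learn pandas

NumPy helps us work with numeric data. pandas is built on NumPy and handles numeric data as well as other types of data, such as strings.

Series

A pandas Series is a one-dimensional array with labels.

Creating a Series

There are two common ways to create a Series: from a list or from a dictionary.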

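import pandas as pd
import string
# Create a Series from a list
t1=pd.Series([1,2,3,4])
# Specify a custom index
t2=pd.Series([1,2,3,4,5],index=list("abcde"))
print(t2)

# Create a Series from a dictionary
# (named info rather than dict so the built-in dict is not shadowed)
info={"name":"xiaoming","age":"18","tel":"123456"}
t3=pd.Series(info)
print(t3)
a={string.ascii_uppercase[i]:i for i in range(10)}  # build dictionary a with a dict comprehension
print(a)
print(pd.Series(a)) # create a Series from the dictionary
t4=pd.Series(a,index=list(string.ascii_uppercase[5:15]))
print(t4)   # labels missing from a become NaN; NaN is a float in NumPy, so pandas converts the dtype to float64
print(t4.dtype)
print(t2.astype(int))   # change the dtype of a Series explicitly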

a    1
b    2
c    3
d    4
e    5
dtype: int64
name    xiaoming
age           18
tel       123456
dtype: object
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9}
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
F    5.0
G    6.0
H    7.0
I    8.0
J    9.0
K    NaN
L    NaN
M    NaN
N    NaN
O    NaN
dtype: float64
float64
a    1
b    2
c    3
d    4
e    5
dtype: int32
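
To see why t4 above ends up as float64, here is a minimal sketch that builds a Series with a label missing from the dictionary. The names scores, subset, and restored are hypothetical and not from the original post. It also shows one possible way to get an integer dtype back, assuming that filling the gap with 0 makes sense for your data.

```python
import pandas as pd

scores = {"A": 0, "B": 1}  # hypothetical data for illustration

# "C" is not a key in scores, so its value becomes NaN and the dtype becomes float64
subset = pd.Series(scores, index=["B", "C"])
print(subset.dtype)

# Fill the gap first, then convert back to an integer dtype
restored = subset.fillna(0).astype("int64")
print(restored)
```

Calling astype with an integer dtype directly on a Series that still contains NaN raises an error, which is why the fill comes first.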

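Slicing and indexing a Series

Slicing: pass a start, an end, and optionally a step.
Indexing: for a single element, pass its position or its index label; for several elements, pass a list of positions or a list of labels.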

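import pandas as pd
# Create a Series from a dictionary
info={"name":"xiaoming","age":"18","tel":"123456"}
t=pd.Series(info)
print(t)
# Select by position (iloc makes positional access explicit; t[0] on a string index is deprecated in recent pandas)
print(t.iloc[0])
print(t.iloc[0:2])
print(t.iloc[[0,2]])
# Select by index label
print(t["name"])
print(t[["name","tel"]])
# Select with a boolean mask
t2=pd.Series(range(10))
print(t2[t2>5])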

name    xiaoming
age           18
tel       123456
dtype: object
xiaoming
name    xiaoming
age           18
dtype: object
name    xiaoming
tel       123456
dtype: object
xiaoming
name    xiaoming
tel       123456
dtype: object
6    6
7    7
8    8
9    9
dtype: int64

Given an unfamiliar Series, how do we get its index and its values?

import pandas as pd
# Create a Series from a dictionary
info={"name":"xiaoming","age":"18","tel":"123456"}
t=pd.Series(info)
print(t)
print(t.index)
print(t.values)
print(type(t.index))
print(type(t.values))

name    xiaoming
age           18
tel       123456
dtype: object
Index(['name', 'age', 'tel'], dtype='object')
['xiaoming' '18' '123456']
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>

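A Series is essentially made of two arrays.
One array holds the keys (the index) and the other holds the values, with each key mapping to a value.

Many ndarray methods also work on a Series, for example argmax and clip.
A Series also has a where method, but its result differs from that of NumPy's np.where.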
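
The difference between the two where variants is easier to see side by side. The following is a small sketch with a made-up Series s (not from the original post): Series.where keeps the shape and replaces values where the condition is False, while np.where with a single argument returns the positions where the condition is True.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))  # made-up data: 0, 1, 2, 3, 4

# ndarray-style methods that also exist on a Series
clipped = s.clip(1, 3)       # values limited to the range [1, 3]
largest_pos = s.argmax()     # position of the largest value

# Series.where keeps values where the condition is True
# and replaces the others (NaN by default, or the given value)
kept = s.where(s > 2)
replaced = s.where(s > 2, -1)

# np.where with one argument returns the positions where the condition is True
positions = np.where(s.values > 2)[0]

print(clipped.tolist(), largest_pos)
print(kept)
print(replaced)
print(positions)
```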

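Reading external data with pandas

import pandas as pd
df=pd.read_csv("./IMDB-Movie-Data.csv")   # pass the path of the CSV file; this is the file used later in this post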
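
pd.read_csv needs a file path or a file-like object. As a self-contained sketch that runs without any file on disk, the example below reads CSV text from an in-memory buffer. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV content; in practice this would be a file on disk
csv_text = "name,age\nxiaoming,18\nxiaohong,20\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.shape)
```

The first line of the CSV is used as the column names by default, and numeric columns such as age are parsed as numbers.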

DataFrame in pandas

import pandas as pd
import numpy as np
t=pd.DataFrame(np.arange(12).reshape(3,4))
print(t)

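A DataFrame has both a row index and a column index.
Row index: labels the rows; it is called index and corresponds to axis 0 (axis=0).
Column index: labels the columns; it is called columns and corresponds to axis 1 (axis=1).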
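
One way to get a feel for the two axes is to aggregate along each of them. In this small sketch (the variable names col_sums and row_sums are illustrative), axis=0 collapses the rows and gives one result per column, while axis=1 collapses the columns and gives one result per row.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4))

col_sums = t.sum(axis=0)  # move down the rows: one sum per column
row_sums = t.sum(axis=1)  # move across the columns: one sum per row

print(col_sums.tolist())
print(row_sums.tolist())
```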

import pandas as pd
import numpy as np
import string
t=pd.DataFrame(np.arange(12).reshape(3,4))
print(t)
# Set the row index and column index of a DataFrame
# index is the row index, columns is the column index
t1=pd.DataFrame(np.arange(12).reshape(3,4),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))
print(t1)

Basic DataFrame attributes and methods
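
The original figure for this section is not available. As a rough substitute, the sketch below prints a selection of commonly used DataFrame attributes and methods; the selection is an assumption and may not match the original figure exactly.

```python
import string
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4),
                 index=list(string.ascii_uppercase[:3]),
                 columns=list(string.ascii_uppercase[-4:]))

# Basic attributes
print(t.shape)    # (number of rows, number of columns)
print(t.ndim)     # number of dimensions
print(t.index)    # row index
print(t.columns)  # column index
print(t.dtypes)   # dtype of each column
print(t.values)   # underlying data as a NumPy array

# Quick overview methods
print(t.head(2))      # first 2 rows
print(t.tail(1))      # last row
t.info()              # concise summary: index, columns, non-null counts, memory
print(t.describe())   # count, mean, std, min, quartiles, max per numeric column
```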

loc in pandas

df.loc selects rows (and columns) by label
df.iloc selects rows (and columns) by position

import pandas as pd
import numpy as np
import string
t=pd.DataFrame(np.arange(12).reshape(3,4),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))
print(t)
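print(t.loc["A","W"])   # value at row A, column W
print(t.loc["A",["W","Z"]]) # values at row A, columns W and Z
print(type(t.loc["A"]))
# Select several non-adjacent rows and columns
print(t.loc[["A","C"],["W","Z"]])
print(t.loc["A":,["W","Z"]])
print(t.loc["A":"C",["W","Z"]]) # slices in loc are closed intervals: the label after the colon is included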

   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
0
W    0
Z    3
Name: A, dtype: int32
<class 'pandas.core.series.Series'>
   W   Z
A  0   3
C  8  11
   W   Z
A  0   3
B  4   7
C  8  11
   W   Z
A  0   3
B  4   7
C  8  11

iloc in pandas

df.loc selects rows (and columns) by label
df.iloc selects rows (and columns) by position

import pandas as pd
import numpy as np
import string
t=pd.DataFrame(np.arange(12).reshape(3,4),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))
print(t)
print(t.iloc[1:3,1:3])
# Assign to a selection to change the data
t.loc["A","Y"]=100
print(t)
t.iloc[1:2,0:2]=666
print(t)

   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
   X   Y
B  5   6
C  9  10
   W  X    Y   Z
A  0  1  100   3
B  4  5    6   7
C  8  9   10  11
     W    X    Y   Z
A    0    1  100   3
B  666  666    6   7
C    8    9   10  11
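
A detail that is easy to miss is how the two accessors treat the end of a slice. The following sketch, which uses the same DataFrame as above and the illustrative names by_label and by_position, compares them directly.

```python
import string
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4),
                 index=list(string.ascii_uppercase[:3]),
                 columns=list(string.ascii_uppercase[-4:]))

by_label = t.loc["A":"B", "W":"X"]   # label slices include the end label
by_position = t.iloc[0:1, 0:1]       # position slices exclude the end position

print(by_label.shape)
print(by_position.shape)
```

So loc["A":"B"] returns two rows, while iloc[0:1] returns only one, just like ordinary Python list slicing.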

Boolean indexing in pandas

import pandas as pd
import numpy as np
import string
t=pd.DataFrame(np.arange(12).reshape(3,4),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))
print(t)
print(t[t>5])   # values that do not satisfy the condition become NaN

   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
     W    X     Y     Z
A  NaN  NaN   NaN   NaN
B  NaN  NaN   6.0   7.0
C  8.0  9.0  10.0  11.0
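
Boolean indexing is also commonly used to filter whole rows by a condition on one or more columns. The sketch below is an illustration on the same DataFrame; the thresholds are arbitrary. Conditions are combined with & (and) and | (or), and each condition needs its own parentheses.

```python
import string
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4),
                 index=list(string.ascii_uppercase[:3]),
                 columns=list(string.ascii_uppercase[-4:]))

# A boolean Series selects the rows where it is True
rows = t[t["W"] > 0]

# Combine conditions with & and |, wrapping each one in parentheses
both = t[(t["W"] > 0) & (t["Z"] < 11)]
either = t[(t["W"] > 4) | (t["X"] < 2)]

print(rows)
print(both)
print(either)
```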

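Handling missing data

Check whether values are NaN:
pd.isnull(df), pd.notnull(df)

Ways to handle NaN:

Drop the rows or columns that contain NaN
t.dropna(axis=0, how='any', inplace=False)
Fill in a meaningful value
t["Y"]=t["Y"].fillna(t["Y"].mean())

Handling zeros that actually stand for missing values: convert them to NaN so that calculations such as mean() skip them
t[t==0]=np.nan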

import pandas as pd
import numpy as np
import string
t=pd.DataFrame(np.arange(12).reshape(3,4),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))
print(t)
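print(t[t>5])   # values that do not satisfy the condition become NaN
t=t[t>5]    # t now contains NaN
print(t)
print(t[pd.notnull(t["W"])])    # keep only the rows where column W is not NaN
# Approach 1: drop the rows or columns that contain NaN
# axis selects the axis; how is 'any' (drop if any value is NaN) or 'all' (drop only if every value is NaN); inplace controls whether t is modified in place
print(t.dropna(axis=0,how='any',inplace=False))    # with inplace=False the result is returned and t itself is unchanged
# Approach 2: fill in a meaningful value
print(t.fillna(0))  # replace NaN with 0
# print(t.fillna(t.mean()))
t["Y"]=t["Y"].fillna(t["Y"].mean()) # operate on a single column only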
print(t)
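
To make the difference between how='any' and how='all' concrete, and to show why zeros are sometimes converted to NaN, here is a small sketch on made-up data. The values and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data: row B is entirely NaN, row C is partly NaN
t = pd.DataFrame({"W": [0.0, np.nan, 8.0],
                  "X": [1.0, np.nan, 9.0],
                  "Y": [2.0, np.nan, np.nan]},
                 index=list("ABC"))

nan_per_column = t.isnull().sum()            # count NaN in each column
dropped_any = t.dropna(axis=0, how="any")    # drop rows with at least one NaN
dropped_all = t.dropna(axis=0, how="all")    # drop rows where every value is NaN

# Fill each column with that column's mean (mean() skips NaN)
filled = t.fillna(t.mean())

# Treat 0 as missing: afterwards the mean of W ignores the former 0
zero_as_nan = t.copy()
zero_as_nan[zero_as_nan == 0] = np.nan

print(nan_per_column)
print(dropped_any)
print(dropped_all)
print(filled)
print(t["W"].mean(), zero_as_nan["W"].mean())
```

Whether a 0 should count as missing depends on the data, so this conversion is a judgment call rather than a fixed rule.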

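Hands-on example

Suppose we have data on the 1000 most popular movies from 2006 to 2016. How do we get information such as the average rating and the number of directors from this data?
Data source: https://www.kaggle.com/damianpanek/sunday-eda/data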

import pandas as pd
from matplotlib import pyplot as plt

file_path="./IMDB-Movie-Data.csv"

df=pd.read_csv(file_path)
print(df.head(1))
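df.info()    # print a concise summary (info() prints directly and returns None)

# Distribution of runtime (rating can be explored in a similar way)
# Chart type: histogram
# Prepare the data

runtime_data=df["Runtime (Minutes)"].values

max_runtime=runtime_data.max()
min_runtime=runtime_data.min()

# Compute the number of bins, using a bin width of 5 minutes
print(max_runtime-min_runtime)
num_bin=(max_runtime-min_runtime)//5

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

# Set the x-axis ticks
plt.xticks(range(min_runtime,max_runtime+5,5))

# Draw the histogram
plt.hist(runtime_data,num_bin)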

plt.show()
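
The bin calculation above can be checked without drawing anything. The sketch below uses np.histogram on a handful of made-up runtimes (not taken from the IMDB file) to show how the number of bins and the bin edges come out. The bin edges line up with the 5-minute ticks only when the range (max minus min) is an exact multiple of the bin width.

```python
import numpy as np

# Made-up runtimes in minutes, for illustration only
runtimes = np.array([90, 95, 101, 118, 120, 125, 150])

bin_width = 5
num_bins = int((runtimes.max() - runtimes.min()) // bin_width)

# np.histogram computes the same counts that plt.hist would draw
counts, edges = np.histogram(runtimes, bins=num_bins)

print(num_bins)
print(edges)
print(counts)
```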

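Common statistical methods

Using the same dataset of the 1000 most popular movies from 2006 to 2016, how do we get the average rating, the number of directors, and similar figures?
Data source: https://www.kaggle.com/damianpanek/sunday-eda/data

Removing duplicates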

import pandas as pd
from matplotlib import pyplot as plt

file_path="./IMDB-Movie-Data.csv"

df=pd.read_csv(file_path)
print(df.head(1))
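df.info()    # print a concise summary (info() prints directly and returns None)

# Average rating
print(df["Rating"].mean())

# Number of directors
print(len(set(df["Director"].tolist())))    # deduplicate with set
print(len(df["Director"].unique()))  # deduplicate with unique()

# Number of actors
temp_actors_list=df["Actors"].str.split(",").tolist()   # nested list such as [['a','b'],['c','d']]
actors_list=[i.strip() for j in temp_actors_list for i in j]    # flatten with a double loop; strip() removes the space left after each comma
actors_nums=len(set(actors_list))
print(actors_nums)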
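
The same counts can be obtained with pandas methods alone. This is an alternative sketch, not the approach used in the original code: nunique counts distinct values, and explode turns each list element into its own row. The tiny DataFrame and its names are made up so the example runs without the CSV file.

```python
import pandas as pd

# Made-up stand-in for the movie data
df = pd.DataFrame({
    "Director": ["A. Director", "B. Director", "A. Director"],
    "Actors": ["Ann Lee, Bob Ray", "Bob Ray, Cy Fox", "Ann Lee, Dee Kim"],
})

# Distinct directors
n_directors = df["Director"].nunique()

# Split each cell into a list, give every actor a row, trim spaces, count distinct names
actors = df["Actors"].str.split(",").explode().str.strip()
n_actors = actors.nunique()

# Without strip(), "Bob Ray" and " Bob Ray" would be counted as different people
n_unstripped = df["Actors"].str.split(",").explode().nunique()

print(n_directors, n_actors, n_unstripped)
```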

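Merging data with join

join: by default it combines rows that share the same row index. It is a left join, so every row of the calling DataFrame is kept, and the columns that come from the other DataFrame are NaN where no matching index exists.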

import pandas as pd
import numpy as np
import string
t1=pd.DataFrame(np.arange(12).reshape(3,4),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))
print(t1)
t2=pd.DataFrame(np.arange(12).reshape(2,6),index=list(string.ascii_uppercase[:2]))
print(t2)
print(t1.join(t2))
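
The shape of the result depends on which DataFrame calls join and on the how parameter. The sketch below reuses t1 and t2 from above; the variable names left, inner, and reversed_join are illustrative.

```python
import string
import numpy as np
import pandas as pd

t1 = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list(string.ascii_uppercase[:3]),
                  columns=list(string.ascii_uppercase[-4:]))
t2 = pd.DataFrame(np.arange(12).reshape(2, 6),
                  index=list(string.ascii_uppercase[:2]))

left = t1.join(t2)                # default how="left": rows A, B, C; row C has NaN in t2's columns
inner = t1.join(t2, how="inner")  # only rows present in both: A, B
reversed_join = t2.join(t1)       # rows follow t2: A, B

print(left)
print(inner)
print(reversed_join)
```

Because row C has no match in t2, its six joined columns are NaN in the default result, which also turns those columns into floats.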