pandas

qq_45780540

Category: python. Tags: python

Article link: https://blog.csdn.net/qq_45780540/article/details/120628172

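What is pandas

pandas is a third-party Python library used mainly for processing and analyzing tabular data, much like working with a spreadsheet in Excel.

Import it with import pandas as pd.

Common pandas data types

1. Series: a one-dimensional labeled array

2. DataFrame: a two-dimensional labeled table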

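Creating a Series

1. Create a one-dimensional array

import pandas as pd

t=pd.Series([1,2,31,12,3,4])
print(t) # the first column is the index, the second column is the data
t2=pd.Series([1,2,3,2,2,1],index=list('abcedf')) # index sets the labels
print(t2)

In t2 above, the index argument sets the index labels; without it, pandas uses the default integer index 0, 1, 2, and so on.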

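2. Create from a dict

temp_dict={'name':'xiaohong','age':30,'tel':10086}
t3=pd.Series(temp_dict) # the keys become the index, the values become the data
print(t3)

3. Slicing and indexing work much as they do in numpy (covered in the previous post); a Series can also be indexed by its labels.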
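The article does not show the indexing itself, so here is a minimal sketch of the common cases. The Series s and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article.
s = pd.Series([10, 20, 30, 40], index=list("abcd"))

first = s.iloc[0]          # select by position
by_label = s["b"]          # select by label
pos_slice = s.iloc[1:3]    # positions 1 and 2; the end is excluded, as in numpy
label_slice = s["b":"c"]   # labels b through c; the end label is included
large = s[s > 25]          # boolean mask, as with a numpy array

print(pos_slice.tolist(), label_slice.tolist(), large.tolist())
```

The one difference worth remembering is that a label slice includes its end label, while a position slice does not.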

Reading external data with pandas

df=pd.read_csv('./dogNames2.csv')
print(df)

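DataFrame

1. Create with pd.DataFrame

import numpy as np

print(pd.DataFrame(np.arange(12).reshape(3,4)))

The result is an array with 3 rows and 4 columns that also carries a row index and a column index, both defaulting to 0, 1, 2, and so on.

t1=pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('wxyz')) # index: row labels, columns: column labels
print(t1)

The index and columns arguments set the row labels and column labels.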

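2. Create from a dict

d1={'name':['xiaoming','xiaohong'],'age':[20,32],'tel':[10086,10010]}
print(pd.DataFrame(d1)) # each key becomes a column

A list of dicts also works, with each dict becoming one row:

d2=[{'name':'xiaoming','age':'32','tel':'10010'},{'name':'xiaogang','tel':'10000'},{'name':'xiaohong','age':'22'}]
t2=pd.DataFrame(d2) # missing entries are filled with NaN
print(t2)

The dicts above do not all have the same keys; the missing entries are filled with NaN.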
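To see where the NaN values end up, the sketch below builds a similar frame and counts the missing entries per column. The records list is a variation on the article's d2 with numeric ages and phone numbers, which is my own change.

```python
import pandas as pd

# Variation on the article's d2; numeric values are an assumption for illustration.
records = [
    {"name": "xiaoming", "age": 32, "tel": 10010},
    {"name": "xiaogang", "tel": 10000},   # no age
    {"name": "xiaohong", "age": 22},      # no tel
]
frame = pd.DataFrame(records)

# isnull() marks missing cells; summing the booleans counts them per column
missing_per_column = frame.isnull().sum()

print(frame)
print(missing_per_column)
```

Note that the age column comes out as a float column, because NaN is itself a float value.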

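3. Attributes and summary methods

d2=[{'name':'xiaoming','age':'32','tel':'10010'},{'name':'xiaogang','tel':'10000'},{'name':'xiaohong','age':'22'}]
t2=pd.DataFrame(d2) # missing entries are filled with NaN

print(t2.index) # row index
print(t2.columns) # column index
print(t2.values) # the data as a numpy array
print(t2.shape) # shape as (rows, columns)
print(t2.dtypes) # dtype of each column
print(t2.ndim) # number of dimensions
print(t2.head(1)) # first row
print(t2.tail(2)) # last 2 rows
t2.info() # overview of the DataFrame (info() prints directly and returns None)
print(t2.describe()) # summary statistics; for numeric columns: count, mean, max, min, etc.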

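4. Example

Read the dog-name data, then sort it by count.

df=pd.read_csv('./dogNames2.csv')
# print(df.head())
# df.info()

# sorting a DataFrame
df=df.sort_values(by='Count_AnimalName',ascending=False) # ascending=False sorts in descending order; the default is ascending
#print(df)
#print(df.head(5))

# selecting rows and columns with square brackets
# a slice inside the brackets selects rows
# a string inside the brackets selects a column
# print(df[:20]) # first 20 rows
# print(df[:20]['Row_Labels']) # first 20 rows of the Row_Labels column

# boolean indexing
print(df[(80<df['Count_AnimalName'])&(df['Count_AnimalName']<100)]) # wrap each condition in ()

A DataFrame is sorted with sort_values().

Selecting rows and columns with square brackets:

1. A slice inside the brackets selects rows.
2. A string inside the brackets selects the column with that name.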
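Since dogNames2.csv is not included with the post, here is a small self-contained sketch of the same sorting and selection steps. The dogs frame reuses the article's column names, but its rows are invented.

```python
import pandas as pd

# Hypothetical stand-in for dogNames2.csv, using the same column names.
dogs = pd.DataFrame({
    "Row_Labels": ["LUNA", "BELLA", "ROCKY", "MAX"],
    "Count_AnimalName": [90, 120, 85, 110],
})

ranked = dogs.sort_values(by="Count_AnimalName", ascending=False)

top2 = ranked[:2]                      # a slice selects rows
names = ranked["Row_Labels"]           # a string selects one column (a Series)
top2_names = ranked[:2]["Row_Labels"]  # rows first, then the column

print(top2_names.tolist())
```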

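loc and iloc

1. df.loc selects data by label

t3=pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
print(t3)
print(t3.loc['a','Z']) # row a, column Z
print(t3.loc['a']) # row a
print(t3.loc[['a','c']]) # rows a and c
print(t3.loc['a':'c']) # rows a through c; with loc the end label c is included

2. df.iloc selects data by position

t3=pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
print(t3.iloc[1]) # row at position 1, i.e. the second row (positions start at 0)
print(t3.iloc[:,3]) # column at position 3, i.e. the fourth column
print(t3.iloc[:,[2,1]]) # columns at positions 2 and 1; everything else works like ordinary slicing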
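A quick side-by-side may help fix the difference between the two. This sketch rebuilds the same 3x4 frame under the name t (my own name) and selects the same block once by label and once by position.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ"))

by_label = t.loc["a":"b", "W":"X"]   # label slices include both ends
by_position = t.iloc[0:2, 0:2]       # position slices exclude the end
cell = t.loc["b", "Y"]               # a single value: row b, column Y

print(by_label)
print(cell)
```

Both selections return rows a and b with columns W and X, even though the slice bounds look different.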

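Boolean indexing in pandas

In the DataFrame example above we read the dog names and sorted them. To keep only the names that appear more than 80 but fewer than 100 times, we use boolean indexing.

Format: df[df['Count_AnimalName']>n], where n is the count threshold

print(df[(80<df['Count_AnimalName'])&(df['Count_AnimalName']<100)]) # wrap each condition in (); combine conditions with & (and) or | (or)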
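Here is a small sketch of the three ways to combine conditions, on invented data. The counts frame and its values are assumptions; only the column names come from the article.

```python
import pandas as pd

# Hypothetical data with the article's column names.
counts = pd.DataFrame({
    "Row_Labels": ["A", "B", "C", "D"],
    "Count_AnimalName": [70, 85, 95, 120],
})
col = counts["Count_AnimalName"]

between = counts[(col > 80) & (col < 100)]          # and
outside = counts[(col <= 80) | (col >= 100)]        # or
not_between = counts[~((col > 80) & (col < 100))]   # not

print(between["Row_Labels"].tolist())
```

The parentheses matter because & and | bind more tightly than the comparison operators in Python.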

String methods in pandas

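Handling missing data

Missing data usually takes one of two forms:

1. The value is empty (None and the like), which pandas represents as NaN (the same as np.nan).

How to handle it:

t3=pd.DataFrame(np.arange(12,dtype=float).reshape(3,4),index=list('abc'),columns=list('WXYZ')) # float dtype, so the columns can hold NaN
t3.iloc[1:,:2]=np.nan

# test for NaN
print(pd.isnull(t3)) # True where the value is NaN

# test for not NaN
print(pd.notnull(t3)) # False where the value is NaN

print(t3[pd.notnull(t3['W'])]) # keep the rows whose W value is not NaN

# drop rows or columns containing NaN (axis=0 drops rows, axis=1 drops columns)
print(t3.dropna(axis=0,how='any')) # drop rows containing any NaN; returns a new DataFrame
print(t3.dropna(axis=0,how='all')) # drop a row only when all of its values are NaN
# t3.dropna(axis=0,how='any',inplace=True) # inplace=True modifies t3 itself, like t3=t3.dropna(axis=0,how='any'); commented out so the fill examples below still have NaN to fill

# fill in missing values
print(t3.fillna(10)) # replace NaN with 10
print(t3.fillna(t3.mean())) # replace NaN with the mean of each column
t3['X']=t3['X'].fillna(t3['X'].mean()) # fill only column X
print(t3)

2. The value is 0 where 0 stands for missing data.

How to handle it: assign NaN to those entries, then handle them as described above.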
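The zero case can be sketched as follows. The readings frame is invented, and treating every 0 as missing is an assumption that only holds when 0 is not a legitimate value in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where 0 means "not recorded".
readings = pd.DataFrame({"W": [3.0, 0.0, 5.0], "X": [0.0, 2.0, 4.0]})

cleaned = readings.replace(0, np.nan)     # mark the zeros as missing
filled = cleaned.fillna(cleaned.mean())   # fill each column with its own mean

print(cleaned)
print(filled)
```

The mean is computed after the zeros become NaN, so the zeros no longer drag the column averages down.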

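Common pandas statistical methods

1. Example

Suppose we have data on the 1000 most popular movies from 2006 to 2016 and want to know the average rating, the number of directors, and similar figures. How do we get them?

import pandas as pd
import numpy as np
file_path='IMDB-Movie-Data.csv'
df=pd.read_csv(file_path)
pd.set_option('display.max_columns',None) # show all columns instead of truncating with an ellipsis
pd.set_option('display.max_rows',None) # show all rows
df.info()

# average rating
print(df['Rating'].mean())

# number of directors
print(len(set(df['Director'].tolist()))) # convert to a list, then to a set; its length is the number of distinct directors
print(len(df['Director'].unique())) # method 2: unique() returns an array of the distinct values


# number of actors
temp_actors_list=df['Actors'].str.split(',').tolist() # a list of lists, one inner list per movie
actor_list=[i.strip() for j in temp_actors_list for i in j] # flatten; strip() removes spaces left around names after splitting
actors_num=len(set(actor_list))
print(actors_num)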
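The same three figures can also be computed without leaving pandas. The sketch below swaps in nunique() and explode() for the set-based approach above; the movies frame is invented and only mimics the column names of the movie file.

```python
import pandas as pd

# Hypothetical rows shaped like the movie data (made-up names).
movies = pd.DataFrame({
    "Rating": [8.0, 7.0, 6.0],
    "Director": ["Dir A", "Dir B", "Dir A"],
    "Actors": ["Ann, Bob", "Bob, Cy", "Ann, Dee"],
})

mean_rating = movies["Rating"].mean()
director_count = movies["Director"].nunique()   # number of distinct values

# split each string into a list, give every actor its own row, trim spaces
actors = movies["Actors"].str.split(",").explode().str.strip()
actor_count = actors.nunique()

print(mean_rating, director_count, actor_count)
```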

2. For the same movie data, how can we present the distribution of rating and runtime?

import pandas as pd
from matplotlib import pyplot as plt

file_path='./IMDB-Movie-Data.csv'

df=pd.read_csv(file_path)
#print(df.head(1))
#df.info()

# distribution of rating and runtime (only runtime is plotted here)
# chart type: histogram
# prepare the data
runtime_data=df['Runtime (Minutes)'].values

max_runtime=runtime_data.max()
min_runtime=runtime_data.min()
print(max_runtime-min_runtime)

# number of bins, using a bin width of 5 minutes
num_bin=(max_runtime-min_runtime)//5

plt.hist(runtime_data,num_bin)

plt.xticks(range(min_runtime,max_runtime+5,5))

plt.show()

3. For the same movie data, how should we process the data to count the movies in each genre?

Approach: build an array of zeros whose column names are the genres; whenever a genre appears in a row, set that 0 to 1.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

file_path='./IMDB-Movie-Data.csv'

df=pd.read_csv(file_path)
pd.set_option('display.max_columns',None) # show all columns instead of truncating with an ellipsis
print(df['Genre'])

# list of genre lists, one inner list per movie
temp_list=df['Genre'].str.split(',').tolist() # [[...], [...], ...]
#print(temp_list)

genre_list=list(set(i for j in temp_list for i in j)) # flatten and deduplicate

# build the DataFrame of zeros
zero_df=pd.DataFrame(np.zeros((df.shape[0],len(genre_list))),columns=genre_list)
#print(zero_df)

# set 1 at the genres each movie belongs to
for i in range(df.shape[0]):
    zero_df.loc[i,temp_list[i]]=1

#print(zero_df.head(3))

# number of movies in each genre
genre_count=zero_df.sum(axis=0)
print(genre_count)

# sort
genre_count=genre_count.sort_values()

_x=genre_count.index
_y=genre_count.values
# plot
plt.figure(figsize=(20,8),dpi=80)
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x)
plt.show()
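As an alternative to the loop over rows, pandas can build the same 0/1 table in one call with Series.str.get_dummies. This is a different technique from the article's, sketched here on a made-up genres Series.

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated format.
genres = pd.Series(["Action,Sci-Fi", "Drama", "Action,Drama"], name="Genre")

indicator = genres.str.get_dummies(sep=",")   # one 0/1 column per genre
genre_count = indicator.sum().sort_values()   # movies per genre, ascending

print(indicator)
print(genre_count)
```

For a frame like the article's, the call would presumably be df['Genre'].str.get_dummies(sep=','), after which the plotting code stays the same.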

Merging data

join: by default, combines rows that share the same row index.

merge: combines data on the specified column(s), using the chosen join type.

import pandas as pd
import numpy as np

df1=pd.DataFrame(np.ones((2,4)),index=['A','B'],columns=list('abcd'))
df1.loc['A','a']=5
print(df1)
df2=pd.DataFrame(np.zeros((3,3)),index=['A','B','C'],columns=list('xyz'))
print(df1.join(df2)) # join aligns on the row index
df3=pd.DataFrame(np.arange(9).reshape((3,3)),columns=list('fax'))
print(df3)
# inner join (the default)
print(df1.merge(df3,on='a'))
# outer join
print(df1.merge(df3,on='a',how='outer'))
# left join
print(df1.merge(df3,on='a',how='left')) # keeps every row of df1; columns from df3 are NaN where there is no match
# right join
print(df1.merge(df3,on='a',how='right')) # keeps every row of df3; columns from df1 are NaN where there is no match
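The four join types are easier to compare on a tiny example where the keys only partly overlap. The frames left and right and the column name key are invented for this sketch.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "l": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "r": ["x", "y"]})

inner = left.merge(right, on="key")                    # keys in both: 2
outer = left.merge(right, on="key", how="outer")       # keys in either: 1, 2, 3
left_join = left.merge(right, on="key", how="left")    # keys of left: 1, 2
right_join = left.merge(right, on="key", how="right")  # keys of right: 2, 3

print(outer)
```

In the left join, the row with key 1 has no partner in right, so its r value is missing; the right join is the mirror image.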

Grouping and aggregation

df.groupby(by="columns_name") groups the rows by the values in the named column.

Suppose we have data on Starbucks stores worldwide. How do we find out whether the US or China has more stores, or how many stores each Chinese province has?

import pandas as pd
import numpy as np

file_path='./starbucks_store_worldwide.csv'

df=pd.read_csv(file_path)
pd.set_option('display.max_columns',None) # show all columns instead of truncating with an ellipsis
pd.set_option('display.max_rows',None)
# print(df.head(1))
# df.info()

grouped=df.groupby(by='Country') # group by country: rows for the same country are collected together
#print(grouped)

# grouped is a DataFrameGroupBy object
# it can be iterated over; each item is a (country, DataFrame) tuple
for i in grouped:
    print(i)
    print('*'*100)
df[df['Country']=='US'] # rows for a single country, selected with boolean indexing
# call an aggregation method
# country_count=grouped['Brand'].count() # count the rows in each group
# print(country_count['US'])
# print(country_count['CN'])

# number of stores in each Chinese province
# china_data=df[df['Country']=='CN']
#
# grouped=china_data.groupby(by='State/Province').count()['Brand']
# print(grouped)


# group by several keys, returning a Series
# grouped=df['Brand'].groupby(by=[df['Country'],df['State/Province']]).count()
# print(grouped) # a Series with a two-level index

# group by several keys, returning a DataFrame
grouped1=df[['Brand']].groupby(by=[df['Country'],df['State/Province']]).count()
# grouped2=df.groupby(by=[df['Country'],df['State/Province']])[['Brand']].count()
# grouped3=df.groupby(by=[df['Country'],df['State/Province']])[['Brand']].count()[['Brand']]
# print(grouped1)
# print(type(grouped1))
# print(grouped2)
# print(type(grouped2))
# print(grouped3)
# print(type(grouped3))

# index methods and attributes
#print(grouped1.index) # get the index

# df1=pd.DataFrame(np.ones((2,4)),index=['A','B'],columns=list('abcd'))
# df1.loc['A','a']=5
# df1.index=['a','b'] # replace the index
#print(df1)
#print(df1.reindex(['a','f'])) # row a keeps its values; row f does not exist, so it is all NaN
# print(df1.set_index('a')) # use column a as the index
# print(df1.set_index('a',drop=False)) # use column a as the index and keep the original column a
#print(df1['d'].unique()) # distinct values of column d
# print(df1.set_index(['a','b'])) # use columns a and b as the index
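Because the Starbucks file is not attached to the post, here is a runnable sketch of the same grouping steps on a five-row stand-in. The stores frame and its province codes are invented; only the column names follow the article.

```python
import pandas as pd

# Hypothetical store list using the article's column names.
stores = pd.DataFrame({
    "Brand": ["Starbucks"] * 5,
    "Country": ["US", "US", "CN", "CN", "CN"],
    "State/Province": ["CA", "NY", "31", "31", "11"],
})

# stores per country
per_country = stores.groupby("Country")["Brand"].count()

# stores per province, within China only
china = stores[stores["Country"] == "CN"]
per_province = china.groupby("State/Province")["Brand"].count()

# two grouping keys give a Series with a two-level index
per_both = stores.groupby(["Country", "State/Province"])["Brand"].count()

print(per_country)
print(per_both)
```

Filtering to one country before grouping by province keeps provinces of other countries out of the result.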

Composite index (MultiIndex)

import pandas as pd

a = pd.DataFrame({'a': range(7),
                  'b': range(7, 0, -1),
                  'c': ['one','one','one','two','two','two', 'two'],
                  'd': list("hjklmno")})

#print(a)
b=a.set_index(['c','d'])
print(b)
print(b.loc['one'].loc['h']) # the row whose index is ('one', 'h')
# c=b['a']
# print(c) # a Series
# print(c['one']['j'])
# d=a.set_index(['d','c'])['a']
# print(d)
# print(d.swaplevel()) # to select by the inner level, swap the inner and outer levels
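A tuple passed to loc addresses both index levels at once, which is often shorter than chaining two loc calls. The sketch below rebuilds the same frame and shows that form alongside swaplevel; the variable names row, val and by_inner are my own.

```python
import pandas as pd

a = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": list("hjklmno"),
})
b = a.set_index(["c", "d"])

row = b.loc[("one", "h")]            # one row, selected by both levels
val = b.loc[("two", "m"), "a"]       # a single cell
by_inner = b["a"].swaplevel()["h"]   # after the swap, d is the outer level

print(row)
print(val)
```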

Using matplotlib to show the 10 countries with the most stores

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

file_path='./starbucks_store_worldwide.csv'

df=pd.read_csv(file_path)

# show the 10 countries with the most stores
# prepare the data
data1=df.groupby(by='Country').count()['Brand'].sort_values(ascending=False)[:10] # sort_values() sorts; ascending=False means descending
_x=data1.index
_y=data1.values

# plot
plt.figure(figsize=(20,8),dpi=80)

plt.bar(range(len(_x)),_y)

plt.xticks(range(len(_x)),_x)

plt.show()

Using matplotlib to show the number of stores in each Chinese city

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import font_manager

my_font=font_manager.FontProperties(fname='C:/Windows/Fonts/msyh.ttc') # a font that can render Chinese city names; this path is Windows-specific
file_path='./starbucks_store_worldwide.csv'

df=pd.read_csv(file_path)
df=df[df['Country']=='CN']
# show the number of stores in each Chinese city
# prepare the data
data1=df.groupby(by='City').count()['Brand'].sort_values(ascending=False)[:25] # the 25 cities with the most stores, in descending order
_x=data1.index
_y=data1.values

# plot
plt.figure(figsize=(20,8),dpi=80)

plt.bar(range(len(_x)),_y,width=0.3,color='orange')
#plt.barh(range(len(_x)),_y,height=0.3,color='orange') # horizontal bars; pair with plt.yticks instead of plt.xticks

plt.xticks(range(len(_x)),_x,fontproperties=my_font)

plt.show()

Now suppose we have data on 10000 of the top-ranked books worldwide. Answer the following:

The number of books per year

The average rating of books per year

import pandas as pd
from matplotlib import pyplot as plt

file_path='./books.csv'

df=pd.read_csv(file_path)

# df.info()

# number of books per year
# data1=df[pd.notnull(df['original_publication_year'])]
#
# grouped=data1.groupby(by='original_publication_year').count()['title']

# the rest is the same as in the previous example

# average rating of books per year
# drop the rows whose original_publication_year is NaN
data1=df[pd.notnull(df['original_publication_year'])]

grouped=data1['average_rating'].groupby(by=data1['original_publication_year']).mean()

# print(grouped)

_x=grouped.index
_y=grouped.values

plt.plot(range(len(_x)),_y)
plt.xticks(list(range(len(_x)))[::10],_x[::10].astype(int),rotation=90) # label every 10th year
plt.show()

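Time series

1. Generating a range of dates

df=pd.date_range(start='20171230',end='20180131',freq='D') # start: first date, end: last date, freq='D': one step per day
#print(df)
df1=pd.date_range(start='20171230',end='20180131',freq='10D') # one value every 10 days
#print(df1)
df2=pd.date_range(start='20171230',periods=10,freq='D') # periods=10 generates 10 values
print(df2)

start: the first date

end: the last date

freq: the step between values

periods: the number of values to generate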
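To make the effect of each parameter concrete, the sketch below repeats the three calls and checks how many dates each one produces. The variable names are my own.

```python
import pandas as pd

# start + end + freq: every step between the two dates, both ends included
daily = pd.date_range(start="20171230", end="20180131", freq="D")

# a larger step gives fewer values within the same range
every_10_days = pd.date_range(start="20171230", end="20180131", freq="10D")

# start + periods + freq: a fixed number of values, no end date needed
ten_days = pd.date_range(start="20171230", periods=10, freq="D")

print(len(daily), len(every_10_days), len(ten_days))
print(every_10_days)
```

Give either end or periods together with start, since end and periods are two ways of saying where the range stops.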

2. Using time series in a DataFrame

df["timeStamp"] = pd.to_datetime(df["timeStamp"]) # convert the column to datetime values
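Once a column holds datetime values, its parts are available through the .dt accessor. Here is a minimal sketch on an invented two-row frame; the calls frame, its timestamps and the explicit format string are assumptions for illustration.

```python
import pandas as pd

# Hypothetical log with a text column named timeStamp, as in the article.
calls = pd.DataFrame({"timeStamp": ["2015-12-10 17:10:52", "2016-01-03 08:05:00"]})

# format is optional; stating it makes the expected layout explicit
calls["timeStamp"] = pd.to_datetime(calls["timeStamp"], format="%Y-%m-%d %H:%M:%S")

months = calls["timeStamp"].dt.month   # month number of each timestamp

print(calls.dtypes)
print(months.tolist())
```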

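3. Resampling

Resampling converts a time series from one frequency to another. Converting high-frequency data to a lower frequency is downsampling; converting low-frequency data to a higher frequency is upsampling.

Syntax: df.resample("M"), followed by an aggregation such as .mean(). The DataFrame needs a datetime-like index, or a datetime column passed through the on argument.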
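A small downsampling sketch, assuming made-up readings taken every 12 hours. It uses the daily alias "D" instead of "M" because, as far as I know, recent pandas releases prefer "ME" for month-end, so the right monthly alias depends on your version.

```python
import pandas as pd

# Hypothetical readings taken every 12 hours.
index = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 12:00",
                        "2018-01-02 00:00", "2018-01-02 12:00"])
half_daily = pd.Series([1, 2, 3, 4], index=index)

# downsample: group the 12-hourly values into days and sum each day
daily_sum = half_daily.resample("D").sum()

print(daily_sum)
```

resample only defines the groups; the aggregation that follows (sum, mean, count and so on) decides what each group becomes.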

4. Converting time fields stored in separate columns into a pandas time type with PeriodIndex

pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H")
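If timestamps rather than periods are enough, pd.to_datetime can also assemble them from columns named year, month, day and hour. This is a different technique from the PeriodIndex call above, and my understanding is that the keyword form of PeriodIndex is deprecated in newer pandas releases, so check your version. The parts frame is invented.

```python
import pandas as pd

# Hypothetical frame with the time fields stored in separate columns.
parts = pd.DataFrame({
    "year": [2010, 2010],
    "month": [1, 1],
    "day": [1, 2],
    "hour": [0, 23],
})

# to_datetime recognizes these column names and combines them row by row
stamps = pd.to_datetime(parts[["year", "month", "day", "hour"]])

print(stamps)
```

The result is a datetime Series, so it can be set as the index and then resampled.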
