Pandas, the Sharp Blade of Data Analysis (Part 2)

程序员修炼

Category: Data Analysis. Tags: data analysis, pandas

Article link: https://blog.csdn.net/jingyoushui/article/details/98777803

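pandas string methods (accessed through Series.str)

Method	Description
cat	Concatenate strings element-wise, with an optional separator
contains	Return a boolean array indicating whether each string contains the given pattern
count	Count occurrences of the pattern in each string
endswith, startswith	Equivalent to x.endswith(pattern) or x.startswith(pattern) for each element
findall	Return a list of all matches of the pattern in each string
get	Get the i-th character of each element
join	Join the strings within each element of the Series using the given separator
len	Compute the length of each string
lower, upper	Convert case; equivalent to x.lower() and x.upper()
match	Apply re.match to each element with the given regular expression
pad	Add whitespace to the left, right, or both sides of each string
center	Equivalent to pad(side="both")
repeat	Repeat values; for example, s.str.repeat(3) is equivalent to x * 3 for each string
replace	Replace occurrences of the pattern with the given string
slice	Take a substring of each string in the Series
split	Split each string on a separator or regular expression
strip, rstrip, lstrip	Strip whitespace, including newlines; equivalent to x.strip(), x.rstrip(), x.lstrip()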
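
A few of the methods in the table can be chained, as in this minimal sketch. The sample Series and the variable names are made up for illustration and are not part of the original article.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series(["  Alice,Smith ", "bob,jones", "Carol,White\n"])

cleaned = s.str.strip()                         # strip whitespace, including the newline
has_comma = cleaned.str.contains(",")           # boolean Series
first_name = cleaned.str.split(",").str.get(0)  # first piece after splitting on ","
starts_upper = cleaned.str.match(r"[A-Z]")      # re.match anchors at the start of each string
lengths = cleaned.str.len()

print(first_name.tolist())
print(lengths.tolist())
```

Every call goes through the .str accessor and returns a new Series, so s itself is left unchanged.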

Merging data: join

join combines DataFrames on their row index

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.ones((2,4)),index=["A","B"],columns=list("abcd"))
df1

df2 = pd.DataFrame(np.zeros((3,3)),index=["A","B","C"],columns=list("xyz"))
df2

df1.join(df2)

df2.join(df1)

Output:

df1:
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

df2:
     x    y    z
A  0.0  0.0  0.0
B  0.0  0.0  0.0
C  0.0  0.0  0.0

df1.join(df2):
     a    b    c    d    x    y    z
A  1.0  1.0  1.0  1.0  0.0  0.0  0.0
B  1.0  1.0  1.0  1.0  0.0  0.0  0.0

df2.join(df1):
     x    y    z    a    b    c    d
A  0.0  0.0  0.0  1.0  1.0  1.0  1.0
B  0.0  0.0  0.0  1.0  1.0  1.0  1.0
C  0.0  0.0  0.0  NaN  NaN  NaN  NaN
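
The two results above differ because join defaults to how="left", which keeps the index of the calling DataFrame. The sketch below rebuilds the same two frames and tries the other how values; the variable names left, outer and inner are only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
df2 = pd.DataFrame(np.zeros((3, 3)), index=["A", "B", "C"], columns=list("xyz"))

left = df1.join(df2)                # default how="left": keeps df1's index (A, B)
outer = df1.join(df2, how="outer")  # union of both indexes (A, B, C)
inner = df2.join(df1, how="inner")  # intersection of both indexes (A, B)

print(outer)
```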

Merging data: merge

merge combines DataFrames on the values of one or more columns

df3 = pd.DataFrame(np.arange(9).reshape(3,3),columns=list("man"))
df3

df1.loc["A","a"] = 11
df1

Output:

df3:
   m  a  n
0  0  1  2
1  3  4  5
2  6  7  8

df1:
      a    b    c    d
A  11.0  1.0  1.0  1.0
B   1.0  1.0  1.0  1.0

df1.merge(df3,on="a")

Merge on the values of column a; only rows whose a value appears in both DataFrames are kept:

     a    b    c    d  m  n
0  1.0  1.0  1.0  1.0  0  2

df1.merge(df3,on="a",how="inner")

inner is the default merge method

     a    b    c    d  m  n
0  1.0  1.0  1.0  1.0  0  2

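Outer merge:

df1.merge(df3,on="a",how="outer")

Result (newer pandas releases may sort the rows by key for outer merges, so the row order can differ):

      a    b    c    d    m    n
0  11.0  1.0  1.0  1.0  NaN  NaN
1   1.0  1.0  1.0  1.0  0.0  2.0
2   4.0  NaN  NaN  NaN  3.0  5.0
3   7.0  NaN  NaN  NaN  6.0  8.0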

Left merge:

df1.merge(df3,on="a",how="left")

      a    b    c    d    m    n
0  11.0  1.0  1.0  1.0  NaN  NaN
1   1.0  1.0  1.0  1.0  0.0  2.0

Right merge:

df1.merge(df3,on="a",how="right")

     a    b    c    d  m  n
0  1.0  1.0  1.0  1.0  0  2
1  4.0  NaN  NaN  NaN  3  5
2  7.0  NaN  NaN  NaN  6  8

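Merging on specified columns:

df1.merge(df3,left_on="a",right_on="m",how="outer")

Both DataFrames have a column named a that is not the join key here, so pandas adds the suffixes _x and _y to tell them apart:

    a_x    b    c    d    m  a_y    n
0  11.0  1.0  1.0  1.0  NaN  NaN  NaN
1   1.0  1.0  1.0  1.0  NaN  NaN  NaN
2   NaN  NaN  NaN  NaN  0.0  1.0  2.0
3   NaN  NaN  NaN  NaN  3.0  4.0  5.0
4   NaN  NaN  NaN  NaN  6.0  7.0  8.0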
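
When comparing the how options it helps to see which side each row came from. One way, sketched here with the same df1 and df3, is merge's indicator=True parameter; the names result and origin_counts are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
df1.loc["A", "a"] = 11
df3 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("man"))

# indicator=True adds a _merge column: "left_only", "right_only" or "both"
result = df1.merge(df3, on="a", how="outer", indicator=True)

# Count how many rows came from each side
origin_counts = result["_merge"].value_counts()

print(result)
print(origin_counts)
```

Rows marked both are the ones an inner merge would keep; left_only and right_only rows are what the left and right merges add on top.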

Indexes

Changing the index of df1. Before the change:

      a    b    c    d
A  11.0  1.0  1.0  1.0
B   1.0  1.0  1.0  1.0

df1.index = ["C","D"]
df1

Result after the change:

      a    b    c    d
C  11.0  1.0  1.0  1.0
D   1.0  1.0  1.0  1.0

reindex:

df1.reindex(["C","E"])

This method works like selecting rows by label; a label that is not in the index produces a row of NaN

      a    b    c    d
C  11.0  1.0  1.0  1.0
E   NaN  NaN  NaN  NaN
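
reindex returns a new object and leaves df1 untouched. If NaN is not wanted for the missing labels, the fill_value parameter can supply a default, as in this small sketch (the names with_nan and with_zero are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 4)), index=["C", "D"], columns=list("abcd"))
df1.loc["C", "a"] = 11

# reindex returns a new DataFrame; df1 itself is not modified
with_nan = df1.reindex(["C", "E"])
with_zero = df1.reindex(["C", "E"], fill_value=0)  # missing labels get 0 instead of NaN

print(with_zero)
```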

Using a column's values as the index: set_index

df1.set_index("a")

In the result, a has become the index:

        b    c    d
a                  
11.0  1.0  1.0  1.0
1.0   1.0  1.0  1.0

Data can now be selected by index value

df1.set_index("a").loc[11.0]["b"]

To use a as the index while also keeping it as a column:

df1.set_index("a",drop=False)

Result:

         a    b    c    d
a                        
11.0  11.0  1.0  1.0  1.0
1.0    1.0  1.0  1.0  1.0

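unique() returns the distinct values

df1["d"].unique() # result: array([1.])

index.unique() returns the distinct values in the index

df1.set_index("b").index.unique()
# Result: Float64Index([1.0], dtype='float64', name='b')
# (pandas 2.x prints Index([1.0], dtype='float64', name='b') instead)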

MultiIndex (hierarchical index)

Using both a and b as the index:

df1.set_index(["a","b"])

Result:

             c    d
a    b             
11.0 1.0   1.0  1.0
1.0  1.0   1.0  1.0

Inspecting the index:

df1.set_index(["a","b"]).index

MultiIndex([(11.0, 1.0),
            ( 1.0, 1.0)],
           names=['a', 'b'])

Creating a new DataFrame

a = pd.DataFrame({'a': range(7),'b': range(7, 0, -1),'c': ['one','one','one','two','two','two', 'two'],'d': list("hjklmno")})
a

Result:

   a  b    c  d
0  0  7  one  h
1  1  6  one  j
2  2  5  one  k
3  3  4  two  l
4  4  3  two  m
5  5  2  two  n
6  6  1  two  o

Setting the index:

b = a.set_index(["c","d"])
b

       a  b
c   d      
one h  0  7
    j  1  6
    k  2  5
two l  3  4
    m  4  3
    n  5  2
    o  6  1

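Selecting with the outer level:

b.loc["one"]

Result:

   a  b
d      
h  0  7
j  1  6
k  2  5

Selecting with the outer level and a column name

b.loc["one","a"]

Selecting with both index levels: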

b.loc["one"].loc["j"]

a    1
b    6
Name: j, dtype: int64

b.loc["one"].loc["j"]["a"] # result: 1
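
The chained .loc calls above can also be written as a single lookup by passing a tuple with one entry per index level. A minimal sketch, rebuilding the same b (the names row and value are illustrative):

```python
import pandas as pd

a = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": list("hjklmno"),
})
b = a.set_index(["c", "d"])

row = b.loc[("one", "j")]         # a tuple addresses both index levels at once
value = b.loc[("one", "j"), "a"]  # row key plus column label in one lookup

print(row)
print(value)
```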

Selecting column a:

c = b["a"]
c

c    d
one  h    0
     j    1
     k    2
two  l    3
     m    4
     n    5
     o    6
Name: a, dtype: int64

Swapping index levels

d = a.set_index(["d","c"])["a"]
d
d.index

d  c  
h  one    0
j  one    1
k  one    2
l  two    3
m  two    4
n  two    5
o  two    6
Name: a, dtype: int64

MultiIndex([('h', 'one'),
            ('j', 'one'),
            ('k', 'one'),
            ('l', 'two'),
            ('m', 'two'),
            ('n', 'two'),
            ('o', 'two')],
           names=['d', 'c'])

How do we select by the inner level in this case? Swap the index levels first

d.swaplevel()

c    d
one  h    0
     j    1
     k    2
two  l    3
     m    4
     n    5
     o    6
Name: a, dtype: int64

Now the same selection approach as before works.
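
As a sketch of both routes, the snippet below selects the "one" entries from d after swapping, and then does the same without swapping by using Series.xs with the level argument. The names swapped, via_swap and via_xs are illustrative.

```python
import pandas as pd

a = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": list("hjklmno"),
})
d = a.set_index(["d", "c"])["a"]

swapped = d.swaplevel()          # levels become (c, d); the data is unchanged
via_swap = swapped.loc["one"]    # select on what used to be the inner level
via_xs = d.xs("one", level="c")  # same selection without swapping

print(via_xs)
```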

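Grouping data

Suppose we have a dataset of Starbucks stores worldwide. How do we find out whether the US or China has more stores, or how many stores each Chinese province has?
The dataset is starbucks_store_worldwide.csv.
All datasets used below were shared through a download link (extraction code: nwls)

pandas makes this kind of grouping straightforward:
df.groupby(by="columns_name")
grouped = df.groupby(by="columns_name")
grouped is a DataFrameGroupBy object and is iterable
Each element of grouped is a tuple
The tuple is (group key (the value grouped on), the DataFrame for that group)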
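
The (key, sub-DataFrame) structure can be checked on a tiny frame. The data here is a made-up miniature of the store file, used only so the sketch runs without the CSV.

```python
import pandas as pd

# Hypothetical miniature of the store data, for illustration only
df = pd.DataFrame({
    "Country": ["US", "CN", "US", "CN", "AD"],
    "Brand": ["Starbucks"] * 5,
})

grouped = df.groupby(by="Country")

# Each item is a (group key, sub-DataFrame) tuple
sizes = {key: len(group) for key, group in grouped}

print(sizes)
```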

import pandas as pd
import numpy as np

file_path = "./youtube_video_data/starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)
print(df.head(1))
print(df.info())
grouped = df.groupby(by="Country")
print(grouped)

Output:

       Brand  Store Number     Store Name Ownership Type     Street Address  \
0  Starbucks  47370-257954  Meritxell, 96       Licensed  Av. Meritxell, 96   

               City State/Province Country Postcode Phone Number  \
0  Andorra la Vella              7      AD    AD500    376818720   

                  Timezone  Longitude  Latitude  
0  GMT+1:00 Europe/Andorra       1.53     42.51  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25600 entries, 0 to 25599
Data columns (total 13 columns):
Brand             25600 non-null object
Store Number      25600 non-null object
Store Name        25600 non-null object
Ownership Type    25600 non-null object
Street Address    25598 non-null object
City              25585 non-null object
State/Province    25600 non-null object
Country           25600 non-null object
Postcode          24078 non-null object
Phone Number      18739 non-null object
Timezone          25600 non-null object
Longitude         25599 non-null float64
Latitude          25599 non-null float64
dtypes: float64(2), object(11)
memory usage: 2.5+ MB
None
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc013f29f60>

After grouping, the groups can be iterated over:

# DataFrameGroupBy
# is iterable
for i,j in grouped:
    print(i)
    print(j,type(j))

Counting the Starbucks stores in each country

country_count = grouped["Brand"].count()
print(country_count)
print(country_count["US"])
print(country_count["CN"])

Output:

Country
AD        1
AE      144
AR      108
AT       18
AU       22
      ...  
TT        3
TW      394
US    13608
VN       25
ZA        3
Name: Brand, Length: 73, dtype: int64
13608
2734

Counting the Starbucks stores in each Chinese province:

# Count the stores in each Chinese province
china_data = df[df["Country"] =="CN"]
grouped = china_data.groupby(by="State/Province").count()["Brand"]
print(grouped)

Grouping by multiple keys, returning a Series

grouped = df["Brand"].groupby(by=[df["Country"],df["State/Province"]]).count()
print(grouped)
print(type(grouped))

Output:

Country  State/Province
AD       7                  1
AE       AJ                 2
         AZ                48
         DU                82
         FU                 2
                           ..
US       WV                25
         WY                23
VN       HN                 6
         SG                19
ZA       GT                 3
Name: Brand, Length: 545, dtype: int64
<class 'pandas.core.series.Series'>

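Grouping by multiple keys, returning a DataFrame (df[["Brand"]] with double brackets is a DataFrame, so the grouped result is one too)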

grouped1 = df[["Brand"]].groupby(by=[df["Country"],df["State/Province"]]).count()
print(grouped1,type(grouped1))

                        Brand
Country State/Province       
AD      7                   1
AE      AJ                  2
        AZ                 48
        DU                 82
        FU                  2
...                       ...
US      WV                 25
        WY                 23
VN      HN                  6
        SG                 19
ZA      GT                  3

[545 rows x 1 columns] <class 'pandas.core.frame.DataFrame'>
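
The same Series versus DataFrame distinction also shows up when the grouping keys are passed as column names, which is an equivalent and somewhat shorter spelling. A sketch on made-up data (the frame and the names keys, as_series and as_frame are illustrative):

```python
import pandas as pd

# Hypothetical miniature of the store data, for illustration only
df = pd.DataFrame({
    "Country": ["US", "US", "US", "CN", "CN"],
    "State/Province": ["CA", "CA", "WA", "31", "11"],
    "Brand": ["Starbucks"] * 5,
})

keys = ["Country", "State/Province"]
as_series = df.groupby(keys)["Brand"].count()   # single brackets -> Series
as_frame = df.groupby(keys)[["Brand"]].count()  # double brackets -> DataFrame

print(as_series)
print(type(as_frame))
```

In both cases the result carries a two-level MultiIndex built from the grouping keys.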

Use matplotlib to show the 10 countries with the most stores

import pandas as pd
from matplotlib import pyplot as plt

file_path = "./youtube_video_data/starbucks_store_worldwide.csv"
df = pd.read_csv(file_path)

# Show the 10 countries with the most stores using matplotlib
# Prepare the data
data1 = df.groupby(by="Country").count()["Brand"].sort_values(ascending=False)[:10]

_x = data1.index
_y = data1.values

# Plot
plt.figure(figsize=(20,8),dpi=80)

plt.bar(range(len(_x)),_y)

plt.xticks(range(len(_x)),_x)

plt.show()
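
The data-preparation step (group, count, sort descending, slice) can also be written with value_counts, which counts rows per value and returns them already sorted in descending order. The sketch uses made-up data and takes the top 3 instead of the top 10; the two approaches agree as long as the Brand column has no missing values.

```python
import pandas as pd

# Hypothetical miniature of the store data, for illustration only
df = pd.DataFrame({
    "Country": ["US"] * 4 + ["CN"] * 3 + ["CA"] * 2 + ["AD"],
    "Brand": ["Starbucks"] * 10,
})

by_groupby = df.groupby(by="Country")["Brand"].count().sort_values(ascending=False)[:3]
by_value_counts = df["Country"].value_counts().head(3)  # sorted descending by default

print(by_value_counts)
```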

Use matplotlib to show the cities in China with the most stores (the code keeps the top 25)

# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import font_manager

my_font = font_manager.FontProperties(fname="/usr/share/fonts/simsun.ttc")

file_path = "./youtube_video_data/starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)
df = df[df["Country"]=="CN"]
# print(df)

# Show the cities with the most stores using matplotlib
# Prepare the data
data1 = df.groupby(by="City").count()["Brand"].sort_values(ascending=False)[:25]
_x = data1.index
_y = data1.values

# Plot
plt.figure(figsize=(20,12),dpi=80)

# plt.bar(range(len(_x)),_y,width=0.3,color="orange")
plt.barh(range(len(_x)),_y,height=0.3,color="orange")

plt.yticks(range(len(_x)),_x,fontproperties=my_font)

plt.show()

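Aggregating data

Function	Description
count	Number of non-NA values in the group
sum	Sum of non-NA values
mean	Mean of non-NA values
median	Arithmetic median of non-NA values
std, var	Unbiased standard deviation and variance
min, max	Minimum and maximum of non-NA values

Some of these methods, such as count, have already been used above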
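
Several of these functions can be applied in one pass by handing a list of names to agg. The frame below is made up for illustration and includes one missing value to show that NA entries are skipped.

```python
import pandas as pd

# Hypothetical data with one missing value, for illustration only
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "value": [1.0, 2.0, None, 10.0, 30.0],
})

# One column per aggregation function, one row per group
stats = df.groupby("key")["value"].agg(["count", "sum", "mean", "median", "min", "max"])

print(stats)
```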

Some simple examples

The datasets used below all come from Kaggle; the first three examples use IMDB-Movie-Data.csv
1. Plot the distribution of movie runtimes as a histogram.


# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt
file_path = "./youtube_video_data/IMDB-Movie-Data.csv"

df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())

# Distribution of rating and runtime
# Chart type: histogram
# Prepare the data
runtime_data = df["Runtime (Minutes)"].values

max_runtime = runtime_data.max()
min_runtime = runtime_data.min()

# Compute the number of bins (bin width of 5 minutes)
print(max_runtime-min_runtime)
num_bin = (max_runtime-min_runtime)//5


# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin)
plt.xticks(range(min_runtime,max_runtime+5,5))
plt.show()

2. Plot the distribution of movie ratings as a histogram.

# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt
file_path = "./youtube_video_data/IMDB-Movie-Data.csv"

df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())

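# Distribution of rating and runtime
# Chart type: histogram
# Prepare the data (the variable names are reused from the runtime example; here they hold ratings)
runtime_data = df["Rating"].values

max_runtime = runtime_data.max()
min_runtime = runtime_data.min()

# Compute the number of bins (bin width of 0.5)
print(max_runtime-min_runtime)
num_bin = (max_runtime-min_runtime)//0.5
num_bin = int(num_bin)


# Set the figure size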
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin)

_x = [min_runtime]
i = min_runtime
while i<=max_runtime+0.5:
    i = i+0.5
    _x.append(i)

plt.xticks(_x)

plt.show()

The histogram shows that few movies are rated between 1.9 and 3.5, so we merge that range into a single bin:

import pandas as pd
from matplotlib import pyplot as plt

file_path = "./youtube_video_data/IMDB-Movie-Data.csv"

df = pd.read_csv(file_path)
runtime_data = df["Rating"].values

max_runtime = runtime_data.max()
min_runtime = runtime_data.min()
print(min_runtime,max_runtime)

# Use unequal bin widths; hist treats each bin as half-open, e.g. [1.9, 3.5), except the last bin, which includes its right edge
num_bin_list = [1.9,3.5]
i=3.5
while i<=max_runtime:
    i += 0.5
    num_bin_list.append(i)
print(num_bin_list)

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin_list)

# Use the bin edges as xticks so the ticks line up with the bins
plt.xticks(num_bin_list)

plt.show()
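
To see what the unequal bins contain without drawing anything, the same edges can be passed to np.histogram, which is what plt.hist uses for the counting. The ratings and bin edges below are made up for illustration, not taken from the IMDB file.

```python
import numpy as np

# Hypothetical ratings and bin edges, for illustration only
ratings = np.array([1.9, 2.5, 3.4, 3.5, 3.9, 4.0, 4.2, 4.5, 5.0])
bin_edges = [1.9, 3.5, 4.0, 4.5, 5.0]

# counts[i] is the number of values in [edges[i], edges[i+1]);
# only the last bin also includes its right edge
counts, edges = np.histogram(ratings, bins=bin_edges)

print(counts)
```

The value 3.5 lands in the second bin and 5.0 in the last one, which matches the half-open behaviour described in the comment above.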

3. Count the movies in each genre

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

file_path = "./youtube_video_data/IMDB-Movie-Data.csv"

df = pd.read_csv(file_path)
# Build the list of genres
temp_list = df["Genre"].str.split(",").tolist()  #[[],[],[]]
# print(temp_list)

genre_list = list(set([i for j in temp_list for i in j]))
print(genre_list)

The resulting genre list:

['Drama', 'Mystery', 'Family', 'Sport', 'Fantasy', 'Western', 'Sci-Fi', 'Action', 'Music', 'Thriller', 'Horror', 'Musical', 'Animation', 'Comedy', 'Adventure', 'War', 'Biography', 'Romance', 'Crime', 'History']

Next, build an all-zero array with one row per row of df and one column per genre

zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(genre_list))),columns=genre_list)

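Count the movies in each genre: in the all-zero array, set each movie's genre columns to 1, then sum each column (down the rows) to get the number of movies per genre

# Set the genre columns of each movie to 1
for i in range(df.shape[0]):
    # e.g. zeros_df.loc[0,["Sci-Fi","Musical"]] = 1
    zeros_df.loc[i,temp_list[i]] = 1
# print(zeros_df.head(3))

# Sum each genre column (axis=0 adds down the rows)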
genre_count = zeros_df.sum(axis=0)
print(genre_count)

Output:

Drama        513.0
Mystery      106.0
Family        51.0
Sport         18.0
Fantasy      101.0
Western        7.0
Sci-Fi       120.0
Action       303.0
Music         16.0
Thriller     195.0
Horror       119.0
Musical        5.0
Animation     49.0
Comedy       279.0
Adventure    259.0
War           13.0
Biography     81.0
Romance      141.0
Crime        150.0
History       29.0
dtype: float64

These are the movie counts for each genre
Next, sort the counts and show them as a bar chart:

# Sort
genre_count = genre_count.sort_values()
_x = genre_count.index
_y = genre_count.values
# Plot
plt.figure(figsize=(20,8),dpi=80)
plt.bar(range(len(_x)),_y,width=0.3,color="orange")
plt.xticks(range(len(_x)),_x)
plt.show()
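
The zero-matrix loop used for the genre counts can probably be replaced by Series.str.get_dummies, which builds the 0/1 indicator columns directly from the comma-separated strings. This is an alternative to the article's approach, sketched on made-up data rather than the IMDB file.

```python
import pandas as pd

# Hypothetical genre strings, for illustration only
df = pd.DataFrame({"Genre": ["Action,Sci-Fi", "Drama", "Action,Drama", "Comedy"]})

dummies = df["Genre"].str.get_dummies(sep=",")  # one 0/1 column per genre
genre_count = dummies.sum(axis=0).sort_values()

print(genre_count)
```

The counts come out as integers here, whereas the zero-matrix version produces floats because np.zeros defaults to float64.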

Average book rating by publication year
The dataset used is books.csv

# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt


file_path = "./youtube_video_data/books.csv"

df = pd.read_csv(file_path)



# Average rating of books by publication year
# Drop rows where original_publication_year is NaN
data1 = df[pd.notnull(df["original_publication_year"])]

grouped = data1["average_rating"].groupby(by=data1["original_publication_year"]).mean()

# print(grouped)

_x = grouped.index
_y = grouped.values

# Plot
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
print(len(_x))

plt.xticks(list(range(len(_x)))[::10],_x[::10].astype(int),rotation=45)
plt.show()