Python Data Science Libraries: Pandas Basics

Bayesian小孙

Last modified: 2022-05-25 16:18:23

Category: Data Science Libraries; tags: python

First published: 2022-05-25 16:07:25

Article link: https://blog.csdn.net/weixin_43172869/article/details/124967727

Pandas Basics

Note: NumPy handles numeric data. pandas is built on NumPy, so it handles numeric data too, and it can also work with other types of data such as strings.

I. Series Basics

(1) Creating a Series with pd.Series

import pandas as pd
t = pd.Series([12,34,25,75,67,87,54])
print(t)
print(type(t))

# Output:
0    12
1    34
2    25
3    75
4    67
5    87
6    54
dtype: int64
<class 'pandas.core.series.Series'>

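The output has two columns: the left one is the index and the right one is the value stored at each index label.

By default the index is RangeIndex(start=0, stop=7, step=1). How can we change it?

(2) Changing the index of a pd.Series

a. Pass the index argument to pd.Series to specify the index labels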

import pandas as pd
t_1 = pd.Series([12,34,25,75,67,87,54],index = list("noodles"))
t_1
# Output:
n    12
o    34
o    25
d    75
l    67
e    87
s    54
dtype: int64

b. Assign to the index attribute

import pandas as pd
t = pd.Series([12,34,25,75,67,87,54])
t.index = list('noodles')
print(t)
# Output:
n    12
o    34
o    25
d    75
l    67
e    87
s    54
dtype: int64

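(3) Creating a Series from a dictionary

When a Series is created from a dictionary, the dictionary keys become the index and the dictionary values become the data.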

tep_dic = {"name":"lxy","age":"24","tel":None}
c = pd.Series(tep_dic)
c
# Output:
name     lxy
age       24
tel     None
dtype: object

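(4) Changing the dtype of a Series

Sometimes we need to convert the data to floating point, because NaN is a float value and NumPy's integer dtypes cannot store it. In that case we change the dtype of the Series.

A concrete example: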

b = pd.Series([1,22,33,44,55],index = list("abcde"))
b.astype('float')  # returns a new Series; b itself is not modified
# Output:
a     1.0
b    22.0
c    33.0
d    44.0
e    55.0
dtype: float64
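
As a small sketch of why the float dtype matters (the Series names and values below are made up for illustration): integer data that contains a missing value ends up stored as float64, and astype returns a converted copy rather than changing the original.

```python
import numpy as np
import pandas as pd

# Hypothetical data: integers only, so pandas picks an integer dtype
ints = pd.Series([1, 22, 33])

# The same kind of data with a missing value: NaN forces float64
with_nan = pd.Series([1, 22, np.nan])

# astype returns a converted copy; `ints` keeps its own dtype
as_float = ints.astype("float")

print(ints.dtype)      # int64
print(with_nan.dtype)  # float64
print(as_float.dtype)  # float64
```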

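(5) Indexing and slicing a Series

Indexing and slicing a Series work much like the basic operations on NumPy arrays and lists: you can use positions such as [0:2], and you can also index by key.

We keep using the Series c from above.

a. Indexing by key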

tep_dic = {"name":"lxy","age":"24","tel":None}
c = pd.Series(tep_dic)
# c prints as:
name     lxy
age       24
tel     None
dtype: object
# Index by key
c["name"]
# Output:
'lxy'

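b. Indexing by position

Slicing by position works like list slicing: the interval is half-open, so the start is included and the stop is excluded.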

c[:2]
# Output:
name    lxy
age      24
dtype: object

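Note: indexing with a key that is not in the index, such as c["missing"], raises a KeyError rather than returning NaN. To get a default value instead of an error, use c.get(key); to get NaN for labels that are absent, use c.reindex([...]).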
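
A minimal sketch of what happens when a label is absent, using a hypothetical label "email" that is not in c: plain key lookup raises KeyError, get returns a default, and reindex yields NaN for the absent label.

```python
import pandas as pd

c = pd.Series({"name": "lxy", "age": "24", "tel": None})

# A single missing label raises KeyError
try:
    c["email"]
    raised = False
except KeyError:
    raised = True

# get() returns a default (None unless specified) instead of raising
email = c.get("email")

# reindex() keeps the requested labels and fills absent ones with NaN
subset = c.reindex(["name", "email"])

print(raised)
print(email)
print(subset)
```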

(6) Boolean indexing

b_t = pd.Series([12,3,25,75,17,87,54])
b_t[b_t>20]
# Output:
2    25
3    75
5    87
6    54
dtype: int64

Note that the index is itself an iterable object, so you can loop over it, for example:

for i in b_t.index:
    print(i)

print(type(b_t.values))
# Output:
<class 'numpy.ndarray'>

# The values of the Series c created from the dictionary above
c.values
# Output:
array(['lxy', '24', None], dtype=object)

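(7) Series summary:

A Series is essentially made of two arrays: one holds the keys of the object (the index) and the other holds its values.

Many ndarray methods also work on a Series, for example argmax and clip.

Series also has a where method, but its result differs from that of NumPy's where. See the official documentation for details; it is not used often.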
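
The points in this summary can be sketched on a small made-up Series. The example shows argmax and clip, and contrasts Series.where (which keeps the shape and puts NaN where the condition fails) with np.where called with only a condition (which returns positions).

```python
import numpy as np
import pandas as pd

s = pd.Series([12, 3, 25, 75, 17])

pos_of_max = s.argmax()     # integer position of the largest value
clipped = s.clip(10, 30)    # values limited to the range [10, 30]

# Series.where keeps values where the condition holds, NaN elsewhere
kept = s.where(s > 20)

# np.where with only a condition returns the positions where it is True
positions = np.where(s > 20)[0]

print(pos_of_max)
print(clipped.tolist())
print(kept)
print(positions)
```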

II. DataFrame Basics

A DataFrame is a two-dimensional tabular data structure in which the data is arranged in rows and columns.

We introduce DataFrame by reading a CSV file.

import pandas as pd
df = pd.read_csv("./dogNames2.csv")
print(df)

Output:

      Row_Labels  Count_AnimalName
0              1                 1
1              2                 2
2          40804                 1
3          90201                 1
4          90203                 1
...          ...               ...
16215      37916                 1
16216      38282                 1
16217      38583                 1
16218      38948                 1
16219      39743                 1

[16220 rows x 2 columns]

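Note: the data now has num(rows) × num(columns) entries. It is no longer a one-to-one mapping from an index to a value, but a matrix-like structure.

So we now have to distinguish the row index from the column index.

Row index: labels the rows, is called index, and corresponds to axis 0 (axis=0).

Column index: labels the columns, is called columns, and corresponds to axis 1 (axis=1).

The basic form of the DataFrame constructor is: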

df = pd.DataFrame(data = None,index = None,columns = None)

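Parameters:

data: the data itself, such as an ndarray, an iterable, a dictionary, or a DataFrame
index: the row index, which labels the rows (axis 0, axis=0).
columns: the column index, which labels the columns (axis 1, axis=1).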
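
As a brief sketch of these parameters (all labels and values below are made up for illustration), a DataFrame can be built from a 2-D ndarray with explicit labels, or from a dictionary whose keys become the column labels:

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray with explicit row and column labels
df_a = pd.DataFrame(
    data=np.arange(6).reshape(2, 3),
    index=["r0", "r1"],
    columns=["x", "y", "z"],
)

# From a dict: keys become column labels, the index defaults to 0..n-1
df_b = pd.DataFrame({"name": ["lxy", "abc"], "age": [24, 30]})

print(df_a)
print(df_b)
```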

III. Common DataFrame Methods

(1) Basic DataFrame attributes

**df.shape:** the number of rows and columns

(16220, 2)

**df.dtypes:** the dtype of each column

Row_Labels          object
Count_AnimalName     int64
dtype: object

**df.ndim:** the number of dimensions

# Output:
2

**df.index:** the row index

# Output:
RangeIndex(start=0, stop=16220, step=1)

**df.columns:** the column index

# Output:
Index(['Row_Labels', 'Count_AnimalName'], dtype='object')

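(2) Inspecting a DataFrame as a whole

**df.head():** shows the first rows; 5 by default, or as many as the argument you pass

**df.tail():** shows the last rows; 5 by default, or as many as the argument you pass

**df.info():** an overview of the DataFrame

Number of rows, number of columns, column index, non-null count per column, column dtypes, and memory usage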

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16220 entries, 0 to 16219
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Row_Labels        16217 non-null  object
 1   Count_AnimalName  16220 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 253.6+ KB

**df.describe():** quick summary statistics

Count, mean, standard deviation, minimum, quartiles, and maximum
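
A short sketch of head, tail and describe on a small stand-in table (six rows copied from the dog-name data; the variable names are assumptions for illustration):

```python
import pandas as pd

# Small stand-in table, not the full CSV
small = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "ROCKY", "LOLA"],
    "Count_AnimalName": [1195, 1153, 856, 852, 823, 795],
})

first_two = small.head(2)   # first 2 rows
last_three = small.tail(3)  # last 3 rows

# describe() summarises the numeric columns by default
stats = small.describe()
print(stats)
```

The rows of stats are count, mean, std, min, 25%, 50%, 75% and max, and only the numeric column Count_AnimalName appears in it.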

(3) DataFrame methods

**.sort_values():** sorts the data

DataFrame.sort_values(by, 
               axis=0, 
               ascending=True, 
               inplace=False, 
               kind='quicksort', 
               na_position='last', # 'last' or 'first'; the default is 'last'
               ignore_index=False, 
               key=None)

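**by:** the column name(s) or index label(s) to sort by; one or several
**axis:** the axis to sort along; the default is axis=0, which reorders the rows
**ascending:** whether to sort in ascending or descending order; the default is ascending
**inplace:** whether to sort the original data in place or return a new DataFrame
**kind:** the sorting algorithm: quicksort, mergesort, heapsort, or stable; the default is quicksort
**na_position:** where to place missing values; the default is last, the other option is first
**ignore_index:** whether to relabel the index of the result as 0, 1, ..., n-1; the default is False (keep the original index)
**key:** a function applied to the values before sorting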

df_sorted = df.sort_values(by="Count_AnimalName",ascending=False)
print(df_sorted[:20])

      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
8417        LOLA               795
8552       LUCKY               723
8560        LUCY               710
2032       BUDDY               677
3641       DAISY               649
11703   PRINCESS               603
829       BAILEY               532
9766       MOLLY               519
14466      TEDDY               485
2913       CHLOE               465
14779       TOBY               446
8620        LUNA               432
6515        JACK               425
8788      MAGGIE               393
13762     SOPHIE               383

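**ascending:**
- True (default): sort in ascending order
- False: sort in descending order
**inplace:** whether to modify the original data
- inplace=False: returns the sorted data and leaves the original unchanged.
- inplace=True: returns None and sorts the original data in place.

ascending and inplace are the two most commonly used parameters.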
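
The parameters described above can be sketched on a made-up table with one missing value (the column names name and count are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing count
df_demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "count": [3.0, np.nan, 10.0, 1.0],
})

# Returns a new, sorted DataFrame; df_demo is unchanged
desc = df_demo.sort_values(by="count", ascending=False)

# Missing values first, and a fresh 0..n-1 index
na_first = df_demo.sort_values(by="count", na_position="first", ignore_index=True)

# inplace=True sorts df_demo itself and returns None
result = df_demo.sort_values(by="count", inplace=True)

print(desc)
print(na_first)
print(result)
```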

Besides sorting, we can also slice the sorted result.

print(df_sorted[:20]["Row_Labels"])

# A string in square brackets selects by column label, so this operates on a column

# Output:
1156        BELLA
9140          MAX
2660      CHARLIE
3251         COCO
12368       ROCKY
8417         LOLA
8552        LUCKY
8560         LUCY
2032        BUDDY
3641        DAISY
11703    PRINCESS
829        BAILEY
9766        MOLLY
14466       TEDDY
2913        CHLOE
14779        TOBY
8620         LUNA
6515         JACK
8788       MAGGIE
13762      SOPHIE
Name: Row_Labels, dtype: object

We now know how to sort the data by a row or column. What if we want to study only the 100 most-used names?

df_sorted = df.sort_values(by="Count_AnimalName", ascending=False)

df_sorted[:100]

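(4) loc and iloc in pandas

**What loc and iloc mean:** loc stands for location, and the i in iloc stands for integer, so iloc accepts only integer positions as arguments. Details are as follows.

1. df.loc selects rows by label
2. df.iloc selects rows by position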
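
To illustrate the difference between label-based and position-based selection, here is a small sketch on a made-up DataFrame (the row labels a, b, c and the column labels W, X, Y are assumptions for illustration):

```python
import pandas as pd

# Illustrative frame with string row labels
df_demo = pd.DataFrame(
    {"W": [0, 4, 8], "X": [1, 5, 9], "Y": [2, 6, 10]},
    index=["a", "b", "c"],
)

by_label = df_demo.loc["a", "X"]          # row label "a", column label "X"
label_slice = df_demo.loc["a":"b", "W"]   # label slices include the end label
by_position = df_demo.iloc[1, 2]          # second row, third column
pos_slice = df_demo.iloc[0:2, 0]          # position slices exclude the stop

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```

Note that the label slice "a":"b" and the position slice 0:2 select the same two rows here, because label slices are closed on both ends while position slices are half-open.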

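(5) Boolean indexing in pandas

Let's go back to the dog-name data that we read in earlier.

Suppose we want the names of all dogs whose name was used more than 700 times and whose name is longer than 4 characters. How do we select them?

This is where boolean indexing comes in.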

df[(df["Row_Labels"].str.len()>4)&(df["Count_AnimalName"]>700)]

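Output (the rows of the sorted table above that satisfy both conditions):

      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
8552       LUCKY               723
12368      ROCKY               823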

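**&**: and

**|**: or

Note the use of the .str accessor in pandas. Each condition is wrapped in parentheses, and the conditions are joined with a logical operator.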
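
A compact sketch of combining conditions, using five rows copied from the sorted dog-name output (the variable names are assumptions for illustration). Storing each condition in its own variable avoids the parentheses; written inline, the parentheses are needed because & and | bind more tightly than comparison operators such as >.

```python
import pandas as pd

# A few rows taken from the sorted output shown earlier
dogs = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "BUDDY"],
    "Count_AnimalName": [1195, 1153, 856, 852, 677],
})

long_name = dogs["Row_Labels"].str.len() > 4
popular = dogs["Count_AnimalName"] > 700

both = dogs[long_name & popular]     # and
either = dogs[long_name | popular]   # or
not_popular = dogs[~popular]         # not

print(both)
print(either)
print(not_popular)
```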

(6) String methods in pandas
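
The .str accessor exposes vectorized string methods on a Series. The following is a short sketch with made-up names that shows a few of them (upper, len, contains, split); it is not a complete list.

```python
import pandas as pd

names = pd.Series(["Bella", "max", "Charlie Brown", None])

upper = names.str.upper()                    # upper-case every string
lengths = names.str.len()                    # length of every string
has_ll = names.str.contains("ll", na=False)  # substring test; missing counts as False
parts = names.str.split(" ")                 # list of words per element

print(upper)
print(lengths)
print(has_ll)
print(parts)
```

Missing values stay missing in the results of upper, len and split; contains needs the na argument if a plain boolean result is wanted.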

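IV. Handling Missing Values

Missing data usually comes in two forms:

One is an empty value such as None, which pandas represents as NaN (the same as np.nan).

The other is a value that has been set to 0.

How did we handle NaN data in NumPy?

In pandas it is very easy to handle.

To check whether the data is NaN: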

pd.isnull(df)
pd.notnull(df)

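Option 1: drop the rows or columns that contain NaN with dropna(axis=0, how='any', inplace=False)

Option 2: fill in the data, for example t.fillna(t.mean()), t.fillna(t.median()), or t.fillna(0)

Handling data that is 0: t[t==0]=np.nan

Of course, zeros do not always need to be handled this way.

When computing the mean and similar statistics, NaN is left out of the calculation but 0 is included.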
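
A minimal sketch of these options on a made-up Series t, showing that the mean skips NaN but counts 0:

```python
import numpy as np
import pandas as pd

t = pd.Series([10.0, np.nan, 0.0, 20.0])

mean_skipping_nan = t.mean()        # NaN is skipped: (10 + 0 + 20) / 3
filled_mean = t.fillna(t.mean())    # NaN replaced by the mean
filled_zero = t.fillna(0)           # NaN replaced by 0

# Treat 0 as missing too, then recompute the mean: (10 + 20) / 2
t_no_zero = t.copy()
t_no_zero[t_no_zero == 0] = np.nan
mean_without_zero = t_no_zero.mean()

print(mean_skipping_nan)   # 10.0
print(mean_without_zero)   # 15.0
```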

pd.isnull and pd.notnull
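
As a small illustration of the two functions on a made-up DataFrame (the column names A and B are assumptions): both return a boolean object of the same shape, and a boolean column can be used to select rows.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, np.nan], "B": ["x", None]})

missing = pd.isnull(df_demo)     # True where a value is missing
present = pd.notnull(df_demo)    # element-wise opposite of isnull

# Keep only the rows whose "A" value is not missing
rows_with_a = df_demo[pd.notnull(df_demo["A"])]

print(missing)
print(present)
print(rows_with_a)
```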


The df.dropna() function

Parameters of dropna():

df.dropna(axis=0, how='any', inplace=False)  # how is 'any' or 'all'; inplace is False or True

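(1) axis:

With axis=0 or axis='index' (index means the row index), a row that contains a missing value is dropped;

With axis=1 or axis='columns' (columns means the column index), a column that contains a missing value is dropped.

The default value of axis is 0.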

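(2) how:

'any': the row/column is dropped if it contains at least one missing value;

'all': the row/column is dropped only if all of its values are missing. The default is 'any'.

(3) inplace:

Whether to drop the missing values directly in the original DataFrame.

The default is False: the original DataFrame is not modified, and dropna() returns the result with the missing values dropped. If True, the missing values are dropped in the original DataFrame and dropna() returns None.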
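
The three parameters can be sketched together on a made-up DataFrame in which the last row is entirely missing (the column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [4.0, 5.0, np.nan],
    "C": [7.0, 8.0, np.nan],
})

rows_any = df_demo.dropna(axis=0, how="any")   # keeps only row 0
rows_all = df_demo.dropna(axis=0, how="all")   # drops only row 2 (all NaN)
cols_any = df_demo.dropna(axis=1, how="any")   # every column has a NaN, so none remain

# inplace=True modifies df_demo itself and returns None
result = df_demo.dropna(how="all", inplace=True)

print(rows_any)
print(rows_all)
print(cols_any.shape)
print(result)
```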

V. Reading Movie Data with pandas

Reading the movies

# Read the movie data
import pandas as pd
file_path = "./IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
df.info()

# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
# Look at the first rows
df.head()

Getting the average movie rating

df["Rating"].mean()
# Output:
>>>6.723200000000003

Counting the directors

# Count the directors
# unique() returns the distinct values, each listed only once.
len(df["Director"].unique())

# Output:
>>>644

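The approach above calls the built-in unique method directly. The same count can also be obtained by converting the column to a list and then to a set:

df["Director"].tolist()
# The result is a list of director names.
# The list is not nested, so it can be converted directly to a set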
print(len(set(df["Director"].tolist())))
>>>644

Getting actor information

# Get actor information
# The names within each entry are separated by commas
Actor_list_raw = df["Actors"].str.split(",").tolist()
actors_list = [i for j in Actor_list_raw for i in j]
actors_num = len(set(actors_list))
print(actors_num)
>>>2394

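Explanation of the code:

actors_list = [i for j in Actor_list_raw for i in j]

# The comprehension above visits the elements in the same order as this nested loop:
for j in Actor_list_raw:
    for i in j:
        print(i)
# The actor information is a nested list, so a double for loop reads it out; set then counts the distinct names.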
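
One detail worth checking when flattening: if the names in the file are separated by a comma followed by a space (an assumption about the file format, not something stated here), split(",") leaves a leading space on every name but the first, so the same actor can be counted twice. A sketch on a hypothetical two-movie sample:

```python
import pandas as pd

# Hypothetical two-movie sample; the real file may be formatted differently
sample = pd.DataFrame({"Actors": ["Ann Lee, Bob Ray", "Bob Ray, Cy Day"]})

nested = sample["Actors"].str.split(",").tolist()

# Flatten without stripping: " Bob Ray" and "Bob Ray" count as two names
raw_names = {name for movie in nested for name in movie}

# Flatten and strip surrounding whitespace before de-duplicating
clean_names = {name.strip() for movie in nested for name in movie}

print(len(raw_names), len(clean_names))
```

If the two counts differ on the real data, the stripped version is probably the one that reflects the number of distinct actors.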

Plotting

We plot the distributions of rating and runtime.

A histogram is the right chart for this.

from matplotlib import pyplot as plt

# Runtime of every movie as a NumPy array
runtime_data = df["Runtime (Minutes)"].values
min_runtime = int(runtime_data.min())
max_runtime = int(runtime_data.max())
# Roughly one bin per 5 minutes of runtime
num_bin = (max_runtime - min_runtime) // 5

# Set the figure size
plt.figure(figsize=(20, 8), dpi=80)

plt.hist(runtime_data, num_bin)

# Set the x-axis ticks
plt.xticks(range(min_runtime, max_runtime + 5, 5))

plt.show()
print(min_runtime)
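
When the runtime range is not an exact multiple of 5, an integer bin count gives bins that are slightly wider than 5 minutes, so the bars do not line up with ticks placed every 5 minutes. One possible alternative, sketched here on hypothetical runtimes, is to compute the bin edges explicitly; np.histogram is used only to check the edges without opening a plot window.

```python
import numpy as np

# Hypothetical runtimes in minutes
runtimes = np.array([66, 90, 95, 101, 120, 128, 150, 191])

bin_width = 5
low = int(runtimes.min())
high = int(runtimes.max())

# Edges low, low+5, ... reaching at least `high`, so every value falls in a bin
edges = np.arange(low, high + bin_width, bin_width)

counts, used_edges = np.histogram(runtimes, bins=edges)

print(edges[0], edges[-1])
print(counts.sum())
```

The same edges could then be passed to both plt.hist(runtimes, bins=edges) and plt.xticks(edges) so that bars and ticks share positions.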