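Creating pandas Series (1-D arrays) and changing their index, index slicing and boolean indexing, plus creating DataFrames (2-D arrays), basic attributes, indexing (bracket indexing and loc/iloc), NaN handling, sorting, and a case study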

斯外戈的小白

Tags: data analysis, pandas

Article link: https://blog.csdn.net/weixin_51589123/article/details/116502494

I. Why learn pandas?

NumPy handles numeric data; pandas builds on NumPy and also handles strings, time series, and other non-numeric data.

II. Common pandas data types

1. Series: a one-dimensional labeled array

(1) Creating a Series

From a list: pd.Series(a list)

From a dictionary: pd.Series(a dict); the keys become the index

import pandas as pd

t = pd.Series([1, 2, 31, 12, 3, 4])
print(t)
'''
0     1
1     2
2    31
3    12
4     3
5     4
dtype: int64
'''
# The first column holds the labels (the index); they can be changed by specifying an index
print(type(t))
#<class 'pandas.core.series.Series'>


# Create from a dictionary
temp_dict = {"name": "小星星", "age": 18, "tel": 10086}
t2 = pd.Series(temp_dict)
print(t2)
'''
name      小星星
age        18
tel     10086
dtype: object
'''

print(t.dtype)
# int64
print(t2.dtype)
# object: the values include a string, so the dtype is object

# Indexing by label
print(t2["age"])
#18

print(t2["tel"])
#10086

print(t2["name"])
#小星星

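# Positional access: use .iloc (plain t2[0] on a label-indexed Series is deprecated in recent pandas)
print(t2.iloc[0])
print(t2.iloc[1])
print(t2.iloc[2])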
'''
小星星
18
10086
'''

(2) A Series created from a list can be given custom labels by passing the index parameter

# Specify the index
t1 = pd.Series([1, 23, 2, 2, 1], index=list("abcde"))
print(t1)
'''
a     1
b    23
c     2
d     2
e     1
dtype: int64
'''
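The index of an existing Series can also be changed after creation. The following is a minimal sketch, not part of the original post: it assumes you either want to relabel every element (assigning to `.index`) or select and reorder by label (`reindex`). The variable names `s` and `r` are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 23, 2, 2, 1])   # default integer index 0..4

# Option 1: assign a new index of the same length (relabels the Series in place)
s.index = list("abcde")

# Option 2: reindex selects and reorders by label;
# labels that do not exist ("z" here) are filled with NaN
r = s.reindex(["e", "a", "z"])

print(s)
print(r)
```

Because `reindex` may introduce NaN, the result `r` is upcast to a float dtype even though `s` holds integers.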

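(3) Selecting values: slicing, indexing, and boolean indexing

To select non-contiguous values, pass a list inside the indexing brackets (that is, use an extra pair of list brackets).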

temp_dict = {"name": "小星星", "age": 18, "tel": 10086}
t2 = pd.Series(temp_dict)
print(t2)
'''
name      小星星
age        18
tel     10086
dtype: object
'''

# Indexing by label
print(t2["age"])
#18

print(t2["tel"])
#10086

print(t2["name"])
#小星星

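# Positional access: use .iloc (plain t2[0] on a label-indexed Series is deprecated in recent pandas)
print(t2.iloc[0])
print(t2.iloc[1])
print(t2.iloc[2])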
'''
小星星
18
10086
'''

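# Select a contiguous range with a slice
print(t2[:3])
'''
name      小星星
age        18
tel     10086
dtype: object
'''

# Select non-contiguous values by position
print(t2.iloc[[0, 2]])
'''
name      小星星
tel     10086
dtype: object
'''

print(t2[["age", "tel"]])
# Selecting non-contiguous values requires an extra pair of list brackets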
'''
age       18
tel    10086
dtype: object
'''


print(t2.index)
#Index(['name', 'age', 'tel'], dtype='object')

for i in t2.index:  # the index is iterable
    print(i)
'''
name
age
tel
'''


# Boolean indexing
t1 = pd.Series([1, 23, 2, 2, 1], index=list("abcde"))
print(t1)
print(t1[t1 > 10])
'''
b    23
dtype: int64
'''
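Boolean masks can be combined. The sketch below reuses the `t1` Series from the example above and shows an element-wise "and" plus a negation; the variable name `mask` is only for illustration.

```python
import pandas as pd

t1 = pd.Series([1, 23, 2, 2, 1], index=list("abcde"))

# Element-wise "and": each condition must be wrapped in parentheses
mask = (t1 > 1) & (t1 < 10)
print(mask)       # a boolean Series aligned with t1
print(t1[mask])   # keeps the rows labeled c and d

# ~ negates a mask: keeps the rows where t1 > 1 is False (a and e)
print(t1[~(t1 > 1)])
```

Python's `and`, `or`, and `not` do not work element-wise on a Series, which is why `&`, `|`, and `~` are used here.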

(4) Basic attributes

temp_dict = {"name": "小星星", "age": 18, "tel": 10086}
t2 = pd.Series(temp_dict)
print(t2)
'''
name      小星星
age        18
tel     10086
dtype: object
'''

print(type(t2.index))
#<class 'pandas.core.indexes.base.Index'>
print(len(t2.index))
#3
print(list(t2.index))
#['name', 'age', 'tel']
print(list(t2.index)[:2])
#['name', 'age']

print(t2.values)
#['小星星' 18 10086]

print(type(t2.values))
#<class 'numpy.ndarray'>
# .values returns the data in the Series as a NumPy array

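2. DataFrame: a two-dimensional labeled array

(1) Creating a DataFrame

A DataFrame has two indexes, a row index and a column index; both can be set by passing the index and columns parameters.

* Wrap a 2-D array generated with NumPy in pd.DataFrame

* Build a DataFrame from a dict of lists or from a list of dicts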

import pandas as pd
import numpy as np

t = pd.DataFrame(np.arange(12).reshape((3, 4)))

print(t)
'''
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Compared with a Series, there is now both a row index (axis=0) and a column index (axis=1)
'''

t1 = pd.DataFrame(np.arange(12).reshape((3, 4)), index=list("abc"), columns=list("WXYZ"))
print(t1)
'''
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
'''

d1 = {"name": ["大熊", "静香"], "age": [7, 8], "tel": [10086, 10010]}
d = pd.DataFrame(d1)
print(d)
'''
  name  age    tel
0   大熊    7  10086
1   静香    8  10010

Each row is one record
'''

print(type(d))
#<class 'pandas.core.frame.DataFrame'>

d2 = [{"name": "大熊", "age": 7, "tel": 10086}, {"name": "静香", "age": 8, "tel": 10010}, {"name": "哆啦A梦", "age": 100}]
t = pd.DataFrame(d2)
print(t)
'''
   name  age      tel
0    大熊    7  10086.0
1    静香    8  10010.0
2  哆啦A梦  100      NaN
'''

(2) Basic DataFrame attributes and methods

index, columns, dtypes, head(), info(), describe()

# Basic DataFrame attributes
print(t.index)
# RangeIndex(start=0, stop=3, step=1): three row labels

print(t.columns)
# Index(['name', 'age', 'tel'], dtype='object'): three column labels

print(t.dtypes)
'''
name     object
age       int64
tel     float64
dtype: object
'''

print(t.head())  # shows the first five rows by default
'''
   name  age      tel
0    大熊    7  10086.0
1    静香    8  10010.0
2  哆啦A梦  100      NaN
(the data has only 3 rows, so all of them are shown)
'''

print("#"*100)
print(t.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    3 non-null      object 
 1   age     3 non-null      int64  
 2   tel     2 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
None
'''

print(t.describe())
'''
              age           tel
count    3.000000      2.000000
mean    38.333333  10048.000000
std     53.407240     53.740115
min      7.000000  10010.000000
25%      7.500000  10029.000000
50%      8.000000  10048.000000
75%     54.000000  10067.000000
max    100.000000  10086.000000

describe() reports summary statistics for the numeric columns: count, mean, standard deviation, min, quartiles, and max
'''
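A few related attributes are not listed above but are often checked together with them. The sketch below is an addition, not from the original post; it rebuilds the same three-row DataFrame `t` and inspects `shape`, `ndim`, `tail()`, and a per-column count of missing values.

```python
import pandas as pd

d2 = [
    {"name": "大熊", "age": 7, "tel": 10086},
    {"name": "静香", "age": 8, "tel": 10010},
    {"name": "哆啦A梦", "age": 100},          # no "tel" key, so tel becomes NaN
]
t = pd.DataFrame(d2)

print(t.shape)            # (rows, columns)
print(t.ndim)             # number of dimensions
print(t.tail(1))          # the last row
print(t.isnull().sum())   # number of missing values in each column
```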

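(3) Selection method 1: slice rows by position with numbers, select columns by label

Patterns worth remembering:

dn[:20]["Row_Labels"]  # only the first 20 rows of the Row_Labels column

dn[(80 < dn["Count_AnimalName"]) & (dn["Count_AnimalName"] < 100)]

dn[(dn["Row_Labels"].str.len() > 4) & (dn["Count_AnimalName"] > 80)]

Note: besides str.len(), pandas provides other string methods through the .str accessor.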
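As a rough illustration of the `.str` accessor, here is a small sketch using a handful of names taken from the dog-name data shown in this section. The Series `names` is built by hand for the example; it is not how the original post loads the data.

```python
import pandas as pd

names = pd.Series(["RENNY", "BABY GIRL", "C.C.", "SUSHI MAE"])

print(names.str.len())                           # length of each string
print(names.str.lower())                         # lowercase copy
print(names.str.contains(" "))                   # True where the name contains a space
print(names.str.split(" ").str[0])               # first word of each name
print(names.str.replace(".", "", regex=False))   # remove literal dots
```

Each `.str` method works element by element and returns a new Series, so the results can be used directly as boolean masks or new columns.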

import pandas as pd
import numpy as np

# Read a CSV file with pandas
dn = pd.read_csv(r".\dogNames2.csv")
#print(dn)
'''
     Row_Labels  Count_AnimalName
0         RENNY                 1
1        DEEDEE                 2
2     GLADIATOR                 1
3        NESTLE                 1
4          NYKE                 1
...         ...               ...
4159    ALEXXEE                 1
4160  HOLLYWOOD                 1
4161      JANGO                 2
4162  SUSHI MAE                 1
4163      GHOST                 3

[4164 rows x 2 columns]
'''

print(dn[:20])  # take the first 20 rows
'''
      Row_Labels  Count_AnimalName
0          RENNY                 1
1         DEEDEE                 2
2      GLADIATOR                 1
3         NESTLE                 1
4           NYKE                 1
5      BABY GIRL                 3
6          EVVIE                 1
7        AMADEUS                 1
8         FINLEY                 4
9           C.C.                 1
10      ALLY MAY                 1
11         ADELE                 1
12  PRINCESS PEA                 1
13          OSLO                 1
14    ROMEO GRAY                 1
15          APPA                 1
16       BANDIDO                 1
17        BESSIE                 1
18    SUSIE Q II                 1
19       NAMASTE                 1

'''


print(dn["Row_Labels"])
'''
0           RENNY
1          DEEDEE
2       GLADIATOR
3          NESTLE
4            NYKE
          ...    
4159      ALEXXEE
4160    HOLLYWOOD
4161        JANGO
4162    SUSHI MAE
4163        GHOST
Name: Row_Labels, Length: 4164, dtype: object
'''

print(dn[:20]["Row_Labels"])
# Only the first 20 rows of the Row_Labels column
'''
0            RENNY
1           DEEDEE
2        GLADIATOR
3           NESTLE
4             NYKE
5        BABY GIRL
6            EVVIE
7          AMADEUS
8           FINLEY
9             C.C.
10        ALLY MAY
11           ADELE
12    PRINCESS PEA
13            OSLO
14      ROMEO GRAY
15            APPA
16         BANDIDO
17          BESSIE
18      SUSIE Q II
19         NAMASTE
Name: Row_Labels, dtype: object
'''
print("*" * 1000)
# Boolean indexing
print(dn[80 < dn["Count_AnimalName"]])
'''
     Row_Labels  Count_AnimalName
858       BELLA               112
3273       LUCY                82
4134        MAX                82
'''

print(dn[(80 < dn["Count_AnimalName"]) & (dn["Count_AnimalName"] < 100)])
'''
     Row_Labels  Count_AnimalName
3273       LUCY                82
4134        MAX                82
'''
# Use & for "and" and | for "or"; wrap each condition in parentheses

print(dn[(dn["Row_Labels"].str.len() > 4) & (dn["Count_AnimalName"] > 80)])
# Names longer than 4 characters that appear more than 80 times
'''
    Row_Labels  Count_AnimalName
858      BELLA               112
'''

(4) Selection method 2: loc and iloc

loc selects by index labels

import pandas as pd
import numpy as np

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))
print(t1)
'''
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
'''

print(t1.loc["A", "W"])  # 0
print(t1.loc["A"])
'''
W    0
X    1
Y    2
Z    3
Name: A, dtype: int32
'''

print(t1.loc["A":"C", ["W", "Z"]])
'''
   W   Z
A  0   3
B  4   7
C  8  11
'''

iloc selects by integer position

import pandas as pd
import numpy as np

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))
print(t1)
'''
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
'''

print(t1.iloc[:, [2, 1]])
'''
    Y  X
A   2  1
B   6  5
C  10  9
'''
print(t1.iloc[1, :])
'''
W    4
X    5
Y    6
Z    7
All columns of the row at position 1 (the second row, B)
'''

print(t1.iloc[[0, 2], [2, 1]])
'''
    Y  X
A   2  1
C  10  9
'''

t1.iloc[1:, :2] = 30
print(t1)
'''
    W   X   Y   Z
A   0   1   2   3
B  30  30   6   7
C  30  30  10  11
'''
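One difference between the two accessors is easy to miss: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like ordinary Python slicing. A minimal sketch on the same `t1` frame (the names `by_label` and `by_pos` are illustrative):

```python
import numpy as np
import pandas as pd

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))

# Label slices are inclusive on both ends: rows A and B, columns W and X
by_label = t1.loc["A":"B", "W":"X"]

# Position slices exclude the end: rows 0 and 1, columns 0 and 1
by_pos = t1.iloc[0:2, 0:2]

print(by_label)
print(by_pos)
print(by_label.equals(by_pos))
```

Both selections return the same 2x2 block here, even though the `loc` slice ends at "B" and the `iloc` slice ends at 2.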

(5) NaN handling in pandas

import pandas as pd
import numpy as np

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))
print(t1)
'''
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
'''

t1.iloc[1:, :2] = np.nan
print(t1)
'''
     W    X   Y   Z
A  0.0  1.0   2   3
B  NaN  NaN   6   7
C  NaN  NaN  10  11
'''
# The columns that now contain NaN are automatically converted to a float dtype

print(pd.isnull(t1))
'''
       W      X      Y      Z
A  False  False  False  False
B   True   True  False  False
C   True   True  False  False
'''

print(pd.notnull(t1["W"]))
# Boolean mask marking which rows of column W are not null
'''
A     True
B    False
C    False
Name: W, dtype: bool
'''

a = t1.dropna(axis=0)
# Drop the rows that contain NaN
print(a)
'''
     W    X  Y  Z
A  0.0  1.0  2  3
'''

t2 = t1.dropna(axis=0, how="all")
# The default how is "any": a row is dropped if it contains any NaN
# With how="all", a row is dropped only when every value in it is NaN
print(t2)
'''
     W    X   Y   Z
A  0.0  1.0   2   3
B  NaN  NaN   6   7
C  NaN  NaN  10  11
'''

#t1.dropna(axis=0, how="any", inplace=True)
# Modifies t1 in place, which avoids reassigning the result
#print(t1)
'''
     W    X  Y  Z
A  0.0  1.0  2  3
'''

# Fill NaN values in pandas
t2 = t1.fillna(t1.mean())
# Fill each NaN with the mean of its own column (NaN values are skipped when computing the mean)
print(t2)
'''
     W    X   Y   Z
A  0.0  1.0   2   3
B  0.0  1.0   6   7
C  0.0  1.0  10  11
'''
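Filling can also be limited to a single column or done with a constant. The sketch below is an illustrative variant, not from the original post: it casts the frame to float up front (an assumption made here to avoid an implicit dtype change when NaN is assigned), fills only column W with its mean, and then fills whatever remains with 0.

```python
import numpy as np
import pandas as pd

t1 = pd.DataFrame(
    np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ")
).astype(float)
t1.iloc[1:, :2] = np.nan

# Fill only column W with that column's mean; column X keeps its NaN values
t1["W"] = t1["W"].fillna(t1["W"].mean())
print(t1.isnull().sum())

# Fill every remaining NaN with a constant
filled = t1.fillna(0)
print(filled)
```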

(6) Sorting

sort_values sorts in ascending order by default; pass ascending=False to sort in descending order.

import pandas as pd
import numpy as np

# Read a CSV file with pandas
dn = pd.read_csv(r".\dogNames2.csv")
#print(dn)
'''
     Row_Labels  Count_AnimalName
0         RENNY                 1
1        DEEDEE                 2
2     GLADIATOR                 1
3        NESTLE                 1
4          NYKE                 1
...         ...               ...
4159    ALEXXEE                 1
4160  HOLLYWOOD                 1
4161      JANGO                 2
4162  SUSHI MAE                 1
4163      GHOST                 3

[4164 rows x 2 columns]
'''
#a = dn.sort_values(by="Count_AnimalName")
# Sort by Count_AnimalName (ascending by default)
#print(a)
'''
       Row_Labels  Count_AnimalName
0           RENNY                 1
1975        SUSSI                 1
1976      PRANCER                 1
1977         LITA                 1
3382  ALMOND ROCA                 1
...           ...               ...
433         SADIE                77
843         BUDDY                79
3273         LUCY                82
4134          MAX                82
858         BELLA               112
'''


# sort_values is ascending by default; pass ascending=False for descending order
a = dn.sort_values(by="Count_AnimalName", ascending=False)
print(a)
'''
     Row_Labels  Count_AnimalName
858       BELLA               112
4134        MAX                82
3273       LUCY                82
843       BUDDY                79
433       SADIE                77
...         ...               ...
1654  RUBY ROSE                 1
1655    MOO MOO                 1
1656      KYLIE                 1
1657       JEEP                 1
2082      ANIOT                 1
'''
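When several rows share the same count, a second sort key makes the order deterministic. The following sketch uses a tiny hand-built frame (`dn_small`, a hypothetical stand-in for the CSV data, with four rows copied from the output above) and sorts by count descending, then by name ascending.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data
dn_small = pd.DataFrame({
    "Row_Labels": ["MAX", "LUCY", "BELLA", "BUDDY"],
    "Count_AnimalName": [82, 82, 112, 79],
})

# One ascending flag per sort key: count descending, name ascending for ties
out = dn_small.sort_values(
    by=["Count_AnimalName", "Row_Labels"], ascending=[False, True]
)
print(out)
print(out.head(2))   # the two most common names
```

Note that `sort_values` returns a new DataFrame and keeps the original row labels, which is why the index in the output is no longer in order.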


III. Case study: movie data analysis

Analyze the distribution of movie runtimes; continuous data is analyzed with a histogram.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

file_path = "datasets_IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)

print(df["Rating"].mean())
#6.723200000000003

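# Number of distinct directors
# Method 1
#print(len(set(df["Director"].tolist())))
# tolist() converts a Series to a Python list
#644

# Method 2
print(len(df["Director"].unique()))
# unique() keeps each director name once and returns the distinct values as a NumPy array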

actors_list = df["Actors"].str.split(",").tolist()
print(actors_list)
#[['Chris Pratt', ' Vin Diesel', ' Bradley Cooper', ' Zoe Saldana']]
# The result is a nested (2-D) list, so flatten it into a 1-D list

actors_list = [i for j in actors_list for i in j]
# Flatten the nested list into a flat list
print(actors_list)
#['Chris Pratt', ' Vin Diesel', ' Bradley Cooper', ' Zoe Saldana', 'Noomi Rapace']
# Note: names that follow a comma keep their leading space
actors_num = len(set(actors_list))
print(actors_num)
#2394

# Plot a histogram of the continuous Runtime data
#print(df["Runtime (Minutes)"])
runtime_data = df["Runtime (Minutes)"].values
#print(runtime_data)
a = runtime_data.max()
b = runtime_data.min()

plt.figure(figsize=(20, 8), dpi=80)

num_bins = (a - b) // 5

plt.hist(runtime_data, num_bins)
plt.xticks(range(b, a+5, 5))

plt.show()
# From the plot, most movie runtimes fall between 90 and 120 minutes
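The code above passes a bin count to plt.hist and separately builds tick positions with range(); the two only line up when the runtime range is an exact multiple of the bin width. One alternative, sketched here under assumptions, is to compute explicit bin edges once and reuse them for both the bins and the ticks. The runtime values below are made up for illustration and are not taken from the IMDB file; np.histogram is used so the sketch runs without opening a plot window.

```python
import numpy as np

# Made-up runtimes in minutes (illustrative only)
runtime_data = np.array([66, 90, 95, 101, 108, 117, 120, 123, 141, 191])

width = 5
lo, hi = runtime_data.min(), runtime_data.max()

# Edges start at the minimum and step by `width` until they cover the maximum
edges = np.arange(lo, hi + width, width)

# Count how many values fall into each bin defined by the edges
counts, _ = np.histogram(runtime_data, bins=edges)
print(edges)
print(counts)

# With matplotlib, the same edges could be reused for bins and ticks:
# plt.hist(runtime_data, bins=edges); plt.xticks(edges)
```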
