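I. Why learn pandas?
NumPy handles numeric data; pandas also handles strings, time series, and similar labeled data.
II. Common pandas data types
1. Series: a one-dimensional labeled array
(1) Creating a Series
From a list: pd.Series(a_list)
From a dictionary: pd.Series(a_dict); the keys become the index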
import pandas as pd
t = pd.Series([1, 2, 31, 12, 3, 4])
print(t)
'''
0 1
1 2
2 31
3 12
4 3
5 4
dtype: int64
'''
#The first column holds the labels (the index); they can be changed by specifying an index
print(type(t))
#<class 'pandas.core.series.Series'>
#Create from a dictionary
temp_dict = {"name": "小星星", "age": 18, "tel": 10086}
t2 = pd.Series(temp_dict)
print(t2)
'''
name 小星星
age 18
tel 10086
dtype: object
'''
print(t.dtype)
#int64
print(t2.dtype)
#object, because the values include a string
#Indexing by label
print(t2["age"])
#18
print(t2["tel"])
#10086
print(t2["name"])
#小星星
#Indexing by position (t2[0] also works in older pandas but is deprecated)
print(t2.iloc[0])
print(t2.iloc[1])
print(t2.iloc[2])
'''
小星星
18
10086
'''
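Because a Series can be read either by label or by position, it may help to make the choice explicit. The following is a minimal sketch, assuming the same dictionary data as above; the variable names `s`, `by_label`, and `by_position` are only illustrative.

```python
import pandas as pd

# Same data as the dictionary example above
s = pd.Series({"name": "小星星", "age": 18, "tel": 10086})

by_label = s.loc["age"]     # label-based lookup
by_position = s.iloc[1]     # position-based lookup (second element)

print(by_label, by_position)
print(s.dtype)              # mixed strings and integers are stored as object
```

Using `.loc` and `.iloc` keeps the intent unambiguous even when the labels themselves are integers.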
(2) A Series created from a list can be given a custom index by passing the index argument
#Specify the index
t1 = pd.Series([1, 23, 2, 2, 1], index=list("abcde"))
print(t1)
'''
a 1
b 23
c 2
d 2
e 1
dtype: int64
'''
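The index argument can also be combined with a dictionary. The sketch below uses hypothetical data (`scores`) to show what seems to happen in that case: the listed labels select and order the entries, and a label without a matching key becomes NaN.

```python
import pandas as pd

scores = {"a": 1, "b": 23}   # hypothetical data

# Labels listed in `index` select and order the dictionary entries;
# a label with no matching key ("c") is filled with NaN.
s = pd.Series(scores, index=["b", "a", "c"])
print(s)

# An existing Series can also be relabeled by assigning a new index
# of the same length.
s.index = ["x", "y", "z"]
print(s.index.tolist())
```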
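(3) Selecting values: slicing, indexing, and boolean indexing
To select non-contiguous values, wrap the labels or positions in an extra list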
temp_dict = {"name": "小星星", "age": 18, "tel": 10086}
t2 = pd.Series(temp_dict)
print(t2)
'''
name 小星星
age 18
tel 10086
dtype: object
'''
#Indexing by label
print(t2["age"])
#18
print(t2["tel"])
#10086
print(t2["name"])
#小星星
#Indexing by position (t2[0] also works in older pandas but is deprecated)
print(t2.iloc[0])
print(t2.iloc[1])
print(t2.iloc[2])
'''
小星星
18
10086
'''
#Select contiguous values with a slice
print(t2[:3])
'''
name 小星星
age 18
tel 10086
dtype: object
'''
#Select non-contiguous values by position
print(t2.iloc[[0, 2]])
'''
name 小星星
tel 10086
dtype: object
'''
print(t2[["age", "tel"]])
#Non-contiguous labels also need the extra list brackets
'''
age 18
tel 10086
dtype: object
'''
print(t2.index)
#Index(['name', 'age', 'tel'], dtype='object')
for i in t2.index: #the index is iterable
    print(i)
'''
name
age
tel
'''
#Boolean indexing
t1 = pd.Series([1, 23, 2, 2, 1], index=list("abcde"))
print(t1)
print(t1[t1 > 10])
'''
b 23
dtype: int64
'''
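Boolean conditions on a Series can be combined as well. This is a small sketch on the same `t1` data, assuming only the standard `&` (and) and `|` (or) operators; the names `mask`, `between`, and `either` are illustrative.

```python
import pandas as pd

t1 = pd.Series([1, 23, 2, 2, 1], index=list("abcde"))

# Each comparison needs its own parentheses because & and | bind tighter than > and <
mask = (t1 > 1) & (t1 < 10)
between = t1[mask]
either = t1[(t1 == 1) | (t1 > 10)]

print(between)
print(either)
```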
(4) Basic attributes
temp_dict = {"name": "小星星", "age": 18, "tel": 10086}
t2 = pd.Series(temp_dict)
print(t2)
'''
name 小星星
age 18
tel 10086
dtype: object
'''
print(type(t2.index))
#<class 'pandas.core.indexes.base.Index'>
print(len(t2.index))
#3
print(list(t2.index))
#['name', 'age', 'tel']
print(list(t2.index)[:2])
#['name', 'age']
print(t2.values)
#['小星星' 18 10086]
print(type(t2.values))
#<class 'numpy.ndarray'>
#values returns the underlying data as a NumPy array
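As a compact recap of these attributes, the sketch below pulls the labels and the values out of a Series. It uses `to_numpy()`, which should be equivalent to `.values` for this data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series({"name": "小星星", "age": 18, "tel": 10086})

labels = s.index.tolist()    # plain Python list of labels
values = s.to_numpy()        # NumPy array of the values

print(labels)
print("age" in s.index)      # membership test against the labels
print(isinstance(values, np.ndarray))
```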
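2. DataFrame: a two-dimensional labeled array
(1) Creating a DataFrame
A DataFrame has two indexes, a row index and a column index; both can be set by passing the index and columns arguments
*Wrap a two-dimensional NumPy array in pd.DataFrame
*Build it from a dictionary of lists or from a list of dictionaries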
import pandas as pd
import numpy as np
t = pd.DataFrame(np.arange(12).reshape((3, 4)))
print(t)
'''
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Compared with a Series, there is now a row index (axis=0) and a column index (axis=1)
'''
t1 = pd.DataFrame(np.arange(12).reshape((3, 4)), index=list("abc"), columns=list("WXYZ"))
print(t1)
'''
W X Y Z
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
'''
d1 = {"name": ["大熊", "静香"], "age": [7, 8], "tel": [10086, 10010]}
d = pd.DataFrame(d1)
print(d)
'''
name age tel
0 大熊 7 10086
1 静香 8 10010
Each row is one record
'''
print(type(d))
#<class 'pandas.core.frame.DataFrame'>
d2 = [{"name": "大熊", "age": 7, "tel": 10086}, {"name": "静香", "age": 8, "tel": 10010}, {"name": "哆啦A梦", "age": 100}]
t = pd.DataFrame(d2)
print(t)
'''
name age tel
0 大熊 7 10086.0
1 静香 8 10010.0
2 哆啦A梦 100 NaN
'''
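The last output shows `tel` printed as `10086.0`. A short sketch of why, using two hypothetical records: a key that is missing from one dictionary becomes NaN, and since NaN is a float the whole column is stored as float64.

```python
import pandas as pd

records = [
    {"name": "大熊", "age": 7, "tel": 10086},
    {"name": "哆啦A梦", "age": 100},            # no "tel" key
]
df = pd.DataFrame(records)

print(df["tel"].dtype)             # float64, because NaN is a float
print(df["tel"].isna().tolist())   # which rows are missing "tel"
```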
(2) Basic DataFrame attributes
index, columns, dtypes, head(), info(), describe()
#Basic DataFrame attributes
print(t.index)
#RangeIndex(start=0, stop=3, step=1): three row labels
print(t.columns)
#Index(['name', 'age', 'tel'], dtype='object'): three column labels
print(t.dtypes)
'''
name object
age int64
tel float64
dtype: object
'''
print(t.head()) #shows the first five rows by default
'''
name age tel
0 大熊 7 10086.0
1 静香 8 10010.0
2 哆啦A梦 100 NaN
The data has only 3 rows, so all of them are shown
'''
print("#"*100)
print(t.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 3 non-null object
1 age 3 non-null int64
2 tel 2 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
None
'''
print(t.describe())
'''
age tel
count 3.000000 2.000000
mean 38.333333 10048.000000
std 53.407240 53.740115
min 7.000000 10010.000000
25% 7.500000 10029.000000
50% 8.000000 10048.000000
75% 54.000000 10067.000000
max 100.000000 10086.000000
describe() reports the count, mean, standard deviation, and quartiles of the numeric columns
'''
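Two further points may be worth checking on a small frame: `shape` gives the row and column counts, and `describe()` by default summarizes only the numeric columns. The sketch below reuses the two-row example data; `summary` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"name": ["大熊", "静香"], "age": [7, 8], "tel": [10086, 10010]})

print(df.shape)                     # (rows, columns)
summary = df.describe()
print(summary.columns.tolist())     # the text column "name" is left out
print(summary.loc["mean", "age"])   # individual statistics can be read with loc
```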
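(3) Selection method 1: slice rows by position, select columns by label
Patterns worth remembering:
dn[:20]["Row_Labels"] #first 20 rows of the Row_Labels column only
dn[(80 < dn["Count_AnimalName"]) & (dn["Count_AnimalName"] < 100)]
dn[(dn["Row_Labels"].str.len() > 4) & (dn["Count_AnimalName"] > 80)]
Note: besides str.len(), pandas provides other vectorized string methods through the .str accessor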
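A few commonly used `.str` methods, as a minimal sketch. The sample names are hypothetical and not read from dogNames2.csv; each call returns a new Series and leaves `names` unchanged.

```python
import pandas as pd

# Hypothetical sample data
names = pd.Series(["Bella", "max", "Sushi Mae"])

upper = names.str.upper()                          # upper-case every string
lengths = names.str.len()                          # character count per string
has_ma = names.str.contains("Ma")                  # case-sensitive substring test
parts = names.str.split(" ")                       # list of words per string
joined = names.str.replace(" ", "_", regex=False)  # literal replacement

print(upper.tolist())
print(lengths.tolist())
print(has_ma.tolist())
print(parts.tolist())
print(joined.tolist())
```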
import pandas as pd
import numpy as np
#Read the CSV file with pandas
dn = pd.read_csv(r".\dogNames2.csv")
#print(dn)
'''
Row_Labels Count_AnimalName
0 RENNY 1
1 DEEDEE 2
2 GLADIATOR 1
3 NESTLE 1
4 NYKE 1
... ... ...
4159 ALEXXEE 1
4160 HOLLYWOOD 1
4161 JANGO 2
4162 SUSHI MAE 1
4163 GHOST 3
[4164 rows x 2 columns]
'''
print(dn[:20]) #first 20 rows
'''
Row_Labels Count_AnimalName
0 RENNY 1
1 DEEDEE 2
2 GLADIATOR 1
3 NESTLE 1
4 NYKE 1
5 BABY GIRL 3
6 EVVIE 1
7 AMADEUS 1
8 FINLEY 4
9 C.C. 1
10 ALLY MAY 1
11 ADELE 1
12 PRINCESS PEA 1
13 OSLO 1
14 ROMEO GRAY 1
15 APPA 1
16 BANDIDO 1
17 BESSIE 1
18 SUSIE Q II 1
19 NAMASTE 1
'''
print(dn["Row_Labels"])
'''
0 RENNY
1 DEEDEE
2 GLADIATOR
3 NESTLE
4 NYKE
...
4159 ALEXXEE
4160 HOLLYWOOD
4161 JANGO
4162 SUSHI MAE
4163 GHOST
Name: Row_Labels, Length: 4164, dtype: object
'''
print(dn[:20]["Row_Labels"])
#first 20 rows of the Row_Labels column only
'''
0 RENNY
1 DEEDEE
2 GLADIATOR
3 NESTLE
4 NYKE
5 BABY GIRL
6 EVVIE
7 AMADEUS
8 FINLEY
9 C.C.
10 ALLY MAY
11 ADELE
12 PRINCESS PEA
13 OSLO
14 ROMEO GRAY
15 APPA
16 BANDIDO
17 BESSIE
18 SUSIE Q II
19 NAMASTE
Name: Row_Labels, dtype: object
'''
print("*" * 1000)
#Boolean indexing
print(dn[80 < dn["Count_AnimalName"]])
'''
Row_Labels Count_AnimalName
858 BELLA 112
3273 LUCY 82
4134 MAX 82
'''
print(dn[(80 < dn["Count_AnimalName"]) & (dn["Count_AnimalName"] < 100)])
'''
Row_Labels Count_AnimalName
3273 LUCY 82
4134 MAX 82
'''
#Use & for AND and | for OR
print(dn[(dn["Row_Labels"].str.len() > 4) & (dn["Count_AnimalName"] > 80)])
#Names longer than four characters that occur more than 80 times
'''
Row_Labels Count_AnimalName
858 BELLA 112
'''
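Since dogNames2.csv may not be at hand, the same selection patterns can be tried on a tiny stand-in table. The four rows below are a hypothetical miniature with the same column names, not the real file.

```python
import pandas as pd

# Hypothetical miniature of the dog-name table
dn = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "LUCY", "RENNY"],
    "Count_AnimalName": [112, 82, 82, 1],
})

popular = dn[dn["Count_AnimalName"] > 80]
popular_long = dn[(dn["Row_Labels"].str.len() > 4) & (dn["Count_AnimalName"] > 80)]
rare_or_top = dn[(dn["Count_AnimalName"] < 2) | (dn["Count_AnimalName"] > 100)]

# Rows by slice first, then a column by label
first_two_names = dn[:2]["Row_Labels"]

print(popular)
print(popular_long)
print(rare_or_top)
print(first_two_names)
```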
(4) Selection method 2: loc and iloc
loc takes index labels
import pandas as pd
import numpy as np
t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))
print(t1)
'''
W X Y Z
A 0 1 2 3
B 4 5 6 7
C 8 9 10 11
'''
print(t1.loc["A", "W"])
#0
print(t1.loc["A"])
'''
W 0
X 1
Y 2
Z 3
Name: A, dtype: int32
'''
print(t1.loc["A":"C", ["W", "Z"]])
'''
W Z
A 0 3
B 4 7
C 8 11
'''
iloc takes integer positions
import pandas as pd
import numpy as np
t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))
print(t1)
'''
W X Y Z
A 0 1 2 3
B 4 5 6 7
C 8 9 10 11
'''
print(t1.iloc[:, [2, 1]])
'''
Y X
A 2 1
B 6 5
C 10 9
'''
print(t1.iloc[1, :])
'''
W 4
X 5
Y 6
Z 7
All columns of the row at position 1 (the second row)
'''
print(t1.iloc[[0, 2], [2, 1]])
'''
Y X
A 2 1
C 10 9
'''
t1.iloc[1:, :2] = 30
print(t1)
'''
W X Y Z
A 0 1 2 3
B 30 30 6 7
C 30 30 10 11
'''
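One difference between the two accessors is easy to miss: a label slice in loc includes its end label, while a position slice in iloc excludes its end position. A minimal sketch on the same `t1` frame:

```python
import numpy as np
import pandas as pd

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))

by_label = t1.loc["A":"B", "W":"X"]   # both end labels are included
by_position = t1.iloc[0:1, 0:1]       # the end position is excluded

print(by_label.shape)
print(by_position.shape)
```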
(5) Handling NaN in pandas
import pandas as pd
import numpy as np
t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))
print(t1)
'''
W X Y Z
A 0 1 2 3
B 4 5 6 7
C 8 9 10 11
'''
t1.iloc[1:, :2] = np.nan
print(t1)
'''
W X Y Z
A 0.0 1.0 2 3
B NaN NaN 6 7
C NaN NaN 10 11
'''
#Columns that receive NaN are automatically upcast to float
print(pd.isnull(t1))
'''
W X Y Z
A False False False False
B True True False False
C True True False False
'''
print(pd.notnull(t1["W"]))
#Marks which rows of column W are not null
'''
A True
B False
C False
Name: W, dtype: bool
'''
a = t1.dropna(axis=0)
#Drops every row that contains a NaN
print(a)
'''
W X Y Z
A 0.0 1.0 2 3
'''
t2 = t1.dropna(axis=0, how="all")
#how defaults to "any": a row is dropped if it contains any NaN
#With how="all", a row is dropped only when all of its values are NaN
print(t2)
'''
W X Y Z
A 0.0 1.0 2 3
B NaN NaN 6 7
C NaN NaN 10 11
'''
#t1.dropna(axis=0, how="any", inplace=True)
#inplace=True modifies t1 directly, so no reassignment is needed
#print(t1)
'''
W X Y Z
A 0.0 1.0 2 3
'''
#Filling NaN values in pandas
t2 = t1.fillna(t1.mean())
#Fill each NaN with the mean of its own column
print(t2)
'''
W X Y Z
A 0.0 1.0 2 3
B 0.0 1.0 6 7
C 0.0 1.0 10 11
'''
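Filling and dropping can also be limited to chosen columns. The sketch below is one possible way to do it, assuming a float version of `t1` with two hand-placed NaN values; `w_filled` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

t1 = pd.DataFrame(
    np.arange(12, dtype=float).reshape(3, 4),
    index=list("ABC"),
    columns=list("WXYZ"),
)
t1.loc["B", "W"] = np.nan
t1.loc["C", "X"] = np.nan

# Fill one column with its own mean; NaN is skipped when the mean is computed
w_filled = t1["W"].fillna(t1["W"].mean())

# Drop a row only when the chosen column is missing
kept = t1.dropna(subset=["X"])

print(w_filled)
print(kept)
```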
(6) Sorting
sort_values sorts in ascending order by default; pass ascending=False to sort in descending order
import pandas as pd
import numpy as np
#Read the CSV file with pandas
dn = pd.read_csv(r".\dogNames2.csv")
#print(dn)
'''
Row_Labels Count_AnimalName
0 RENNY 1
1 DEEDEE 2
2 GLADIATOR 1
3 NESTLE 1
4 NYKE 1
... ... ...
4159 ALEXXEE 1
4160 HOLLYWOOD 1
4161 JANGO 2
4162 SUSHI MAE 1
4163 GHOST 3
[4164 rows x 2 columns]
'''
#a = dn.sort_values(by="Count_AnimalName")
#Sort by Count_AnimalName
#print(a)
'''
Row_Labels Count_AnimalName
0 RENNY 1
1975 SUSSI 1
1976 PRANCER 1
1977 LITA 1
3382 ALMOND ROCA 1
... ... ...
433 SADIE 77
843 BUDDY 79
3273 LUCY 82
4134 MAX 82
858 BELLA 112
'''
#Ascending is the default; pass ascending=False for descending order
a = dn.sort_values(by="Count_AnimalName", ascending=False)
print(a)
'''
Row_Labels Count_AnimalName
858 BELLA 112
4134 MAX 82
3273 LUCY 82
843 BUDDY 79
433 SADIE 77
... ... ...
1654 RUBY ROSE 1
1655 MOO MOO 1
1656 KYLIE 1
1657 JEEP 1
2082 ANIOT 1
'''
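In the output above, MAX and LUCY tie at 82. If a reproducible order for ties is wanted, a second sort key can be supplied. This is a sketch on a hypothetical four-row stand-in for the dog-name table, not the real file.

```python
import pandas as pd

# Hypothetical miniature of the dog-name table
dn = pd.DataFrame({
    "Row_Labels": ["MAX", "BELLA", "LUCY", "RENNY"],
    "Count_AnimalName": [82, 112, 82, 1],
})

# Count descending, then name ascending to break ties
ranked = dn.sort_values(by=["Count_AnimalName", "Row_Labels"], ascending=[False, True])
top3 = ranked.head(3)
print(top3)
```

sort_values returns a new DataFrame, so `dn` keeps its original row order.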
III. Case study: movie data analysis
Analyze the distribution of movie runtimes; continuous data is best shown with a histogram
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
file_path = "datasets_IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
print(df["Rating"].mean())
#6.723200000000003
#Number of directors
#print(len(set(df["Director"].tolist())))
#tolist() converts a Series to a Python list
#644
#Method 2
print(len(df["Director"].unique()))
#unique() keeps each director once and returns a NumPy array
actors_list = df["Actors"].str.split(",").tolist()
print(actors_list)
#[['Chris Pratt', ' Vin Diesel', ' Bradley Cooper', ' Zoe Saldana']]
#The result is a list of lists; flatten it into a single list
actors_list = [i for j in actors_list for i in j]
print(actors_list)
#['Chris Pratt', ' Vin Diesel', ' Bradley Cooper', ' Zoe Saldana', 'Noomi Rapace']
#Names after a comma keep a leading space, so the same actor can be counted twice
actors_num = len(set(actors_list))
print(actors_num)
#2394
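Splitting on "," leaves a leading space on every name but the first, so 'Vin Diesel' and ' Vin Diesel' would count as different actors. One way to avoid this, sketched on two hypothetical rows in the same format as the Actors column, is to strip each name before counting:

```python
import pandas as pd

# Hypothetical rows in the same comma-separated format as the "Actors" column
df = pd.DataFrame({"Actors": ["Chris Pratt, Vin Diesel", "Vin Diesel, Zoe Saldana"]})

actors = (
    df["Actors"]
    .str.split(",")   # each row becomes a list of names
    .explode()        # one name per row
    .str.strip()      # remove the space left after each comma
)
actors_num = actors.nunique()
print(actors_num)
```

On the real dataset this would likely give a smaller count than the unstripped version.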
#Plot a histogram of the continuous Runtime data
#print(df["Runtime (Minutes)"])
runtime_data = df["Runtime (Minutes)"].values
#print(runtime_data)
a = runtime_data.max()
b = runtime_data.min()
plt.figure(figsize=(20, 8), dpi=80)
num_bins = (a - b) // 5
plt.hist(runtime_data, num_bins)
plt.xticks(range(b, a+5, 5))
plt.show()
#The plot shows that most runtimes fall between 90 and 120 minutes
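When (a - b) is not a multiple of 5, passing an integer bin count produces bins that are not exactly 5 minutes wide, so the tick marks may not line up with the bars. A possible alternative is to pass explicit bin edges. The sketch below checks the arithmetic with NumPy only; the runtimes are hypothetical values standing in for the real column.

```python
import numpy as np

# Hypothetical runtimes in minutes, standing in for df["Runtime (Minutes)"].values
runtime_data = np.array([66, 90, 95, 101, 118, 120, 191])

bin_width = 5
low, high = runtime_data.min(), runtime_data.max()
num_bins = (high - low) // bin_width   # same formula as in the plot above

# Explicit edges guarantee that every bin is exactly bin_width wide
edges = np.arange(low, high + bin_width, bin_width)
counts, _ = np.histogram(runtime_data, bins=edges)

print(num_bins, len(edges) - 1, counts.sum())
```

The same `edges` array can be passed as the second argument of plt.hist and reused for plt.xticks.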