Data Analysis, Lecture 6: pandas

加油小羽哥

Article link: https://blog.csdn.net/yangyusir/article/details/115066039

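I. Introduction to pandas

1. Why learn pandas

NumPy already handles numerical data, and together with matplotlib it covers many data-analysis tasks, so why learn pandas as well?
NumPy works best with purely numeric data, and that is often not enough: real datasets also contain strings, time series, and other non-numeric values.

2. What is pandas?

pandas is a library built on top of NumPy. It provides high-performance array and table operations and was created for data-analysis work.
It bundles a number of libraries and standard data models, and provides the tools needed to work efficiently with large datasets.
It was created in 2008, originally as a tool for analyzing financial data.
Installing pandas: pip install pandas -i https://pypi.douban.com/simple --trusted-host pypi.douban.com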

II. Common pandas data types

1. Series: one-dimensional, labeled data

2. DataFrame: two-dimensional, a container of Series
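
To make the "container of Series" idea concrete, here is a minimal sketch. The names `scores` and `frame` and the sample values are made up for illustration; they do not come from the lecture.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
scores = pd.Series([90, 85, 77], index=["ann", "bob", "cat"], name="score")

# A DataFrame: two-dimensional; every column is a Series sharing the same row index.
frame = pd.DataFrame({"score": scores, "passed": scores >= 80})

column = frame["score"]          # selecting one column gives back a Series
print(type(column).__name__)
print(frame.shape)
```

Selecting a single column returns a Series, which is why most Series methods shown later also work column by column on a DataFrame.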

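III. Creating a Series

1. From a list

2. With an explicit index

3. From a dictionary

4. From an ndarray

Previewing a Series
head() returns the first five rows by default
tail() returns the last five rows by default
The name attribute of a Series
pd.Series([1,3,5,4,55],index=list("abcde"),name='series')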

import pandas as pd  # pip install pandas -i https://pypi.douban.com/simple --trusted-host pypi.douban.com
import numpy as np

t = pd.Series([1,2,3,4,5])  # the default index starts at 0
print(t)
'''
0    1
1    2
2    3
3    4
4    5
dtype: int64'''
t1 = pd.Series([1,2,3,4,5],index=list("abcde"))  # explicit index
print(t1)
'''
a    1
b    2
c    3
d    4
e    5
dtype: int64'''
dict1 = {'name':"yangyu",'age':18,'sex':'man'}
t3 = pd.Series(dict1)
print(t3)
'''
name    yangyu
age         18
sex        man
dtype: object'''
print(np.random.rand(5))  # five random floats in [0, 1)
'''
[0.52727399 0.58752451 0.25906087 0.9116376  0.28573861]'''
t4 = pd.Series(np.random.rand(5))
print(t4)
'''
0    0.084299
1    0.076995
2    0.946208
3    0.728213
4    0.506164
dtype: float64
'''
# Change the dtype
print(t4.astype('int'))  # float64 -> int (the fractional part is truncated)
'''
0    0
1    0
2    0
3    0
4    0
dtype: int32'''
print(t.astype('float'))  # int64 -> float64
'''
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64'''

# Preview the data
t5 = pd.Series(np.random.rand(100))
print(t5)
'''
0     0.008632
1     0.772691
2     0.422130
3     0.931042
4     0.467934
        ...   
95    0.804642
96    0.410508
97    0.865550
98    0.279784
99    0.562883
Length: 100, dtype: float64'''
print(t5.head())  # first 5 rows by default
'''
0    0.959057
1    0.279906
2    0.644710
3    0.628255
4    0.960321
dtype: float64'''
print(t5.tail())  # last 5 rows by default
'''
95    0.266070
96    0.579535
97    0.457201
98    0.520111
99    0.276324
dtype: float64
'''
t6 = pd.Series(np.random.rand(3),index=list('abc'),name='t6')
print(t6)
'''
a    0.478545
b    0.983166
c    0.407203
Name: t6, dtype: float64'''
print(t6.name)  # t6
t6.index.name = 'Series'
print(t6)
'''
Series
a    0.993078
b    0.959705
c    0.843601
Name: t6, dtype: float64'''
t7 = pd.Series([1,3,5,4,55],index=list("abcde"),name='series')
print(t7)
'''
a     1
b     3
c     5
d     4
e    55
Name: series, dtype: int64'''
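
One case the listing above does not cover is combining a dictionary with an explicit index. As a small sketch (the `prices` dictionary and its labels are hypothetical), the index then selects and reorders by label, and any label missing from the dictionary becomes NaN:

```python
import pandas as pd

prices = {"apple": 3.5, "pear": 2.0}

# With index=, values are looked up by label; "kiwi" is not in the dict, so it is NaN.
s = pd.Series(prices, index=["pear", "apple", "kiwi"])
print(s)
print(s.isna().sum())  # number of missing entries
```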

IV. Slicing and indexing a Series

1. Slicing and indexing a Series

dict1 = {"name":"yangyu","age":18,"sex":'man'}
t1 = pd.Series(dict1)
1. Access by label (key)
2. Access by position
3. t1.index
4. t1.values

# Selecting values
# 1. By label:     t1['key']       t1.loc['key']
# 2. By position:  t1[position]    t1.iloc[position]
import pandas as pd


dict1 = {"name":"yangyu","age":18,"sex":'man'}
t1 = pd.Series(dict1)
print(t1)
'''
name    yangyu
age         18
sex        man
dtype: object'''
print(t1['age'])  # 18
print(t1[1])  # 18 (positional fallback on a labeled index; newer pandas versions deprecate this, prefer t1.iloc[1])
print(t1.loc['age'])  # 18
print(t1.iloc[1])  # 18
# First two rows
print(t1[:2])
'''
name    yangyu
age         18
dtype: object'''

# Rows 1 and 3 (positions 0 and 2)
print(t1[[0,2]])  # by position (newer pandas versions prefer t1.iloc[[0,2]])
'''
name    yangyu
sex        man
dtype: object'''
print(t1.iloc[[0,2]])  # by position, using iloc
'''
name    yangyu
sex        man
dtype: object'''
print(t1[['name','sex']])  # by label
'''
name    yangyu
sex        man
dtype: object'''
print(t1.values)
'''['yangyu' 18 'man']'''
print(t1.keys())  # keys() is a method; without the parentheses the bound method itself is printed
'''Index(['name', 'age', 'sex'], dtype='object')'''
print(t1.index)
'''Index(['name', 'age', 'sex'], dtype='object')'''
t2 = pd.Series([1,2,3,4,5])
print(t2)
'''
0    1
1    2
2    3
3    4
4    5
dtype: int64'''
print(t2>3)
'''
0    False
1    False
2    False
3     True
4     True
dtype: bool'''
print(t2[t2>3])
'''
3    4
4    5
dtype: int64'''
# `in` tests the index labels (keys), not the values
print('name' in t1)  # True
print('yangyu' in t1)  # False

# pandas represents missing data according to the data type
data = ['a','b','c',None]
print(pd.Series(data))
'''
0       a
1       b
2       c
3    None
dtype: object
'''
data1 = [1,2,3,None]
print(pd.Series(data1))
'''
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64'''
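
The difference between label-based and position-based access is easiest to see when the labels are themselves integers. The following is a small sketch with made-up data, not part of the original lecture:

```python
import pandas as pd

s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[20]        # label 20
by_position = s.iloc[0]     # first element
label_slice = s.loc[10:20]  # label slices include both endpoints
pos_slice = s.iloc[0:1]     # positional slices exclude the stop, like Python lists

print(by_label, by_position, len(label_slice), len(pos_slice))
```

Spelling out `.loc` or `.iloc` avoids any ambiguity about whether a number means a label or a position.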

2. Index and values of a Series

Given an unfamiliar Series, how do we find out its index and its values?

# Index and values of a Series
import pandas as pd

t = pd.Series([1,2,3,4,5],name='Series')
# Get the index
print(t.index)  # RangeIndex(start=0, stop=5, step=1)
print(t)
'''
0    1
1    2
2    3
3    4
4    5
Name: Series, dtype: int64'''
t1 = pd.Series([1,2,3,4,5],index=list("abcde"),name='Series')
# Get the index
print(t1.index)  # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
# Iterate over the index with a for loop
for i in t1.index:
    print(i)
'''
a
b
c
d
e'''
print(type(t1.index))  # <class 'pandas.core.indexes.base.Index'>
print(t1.index[1])  # b
# t1.index[1] = 'f'  # not allowed: an Index is immutable
# TypeError: Index does not support mutable operations

# Reset the index (for a named Series this returns a DataFrame with the old index as a column)
t1 = t1.reset_index()
print(t1)
'''
  index  Series
0     a       1
1     b       2
2     c       3
3     d       4
4     e       5
'''
print(t1.index)  # RangeIndex(start=0, stop=5, step=1)
t1 = t1.reset_index(drop=True)
print(t1.index)  # RangeIndex(start=0, stop=5, step=1)
print(t1.values)
'''
[['a' 1]
 ['b' 2]
 ['c' 3]
 ['d' 4]
 ['e' 5]]
'''
print(type(t1.values))  # <class 'numpy.ndarray'>
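
Because `reset_index()` turned `t1` into a DataFrame above, it may help to see the two variants side by side, along with one way to relabel an index without mutating it. This is a sketch with a hypothetical Series named `s`:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=list("abc"), name="value")

as_frame = s.reset_index()             # old index becomes a column -> DataFrame
as_series = s.reset_index(drop=True)   # old index is discarded -> still a Series
renamed = s.rename(index={"b": "f"})   # returns a relabeled copy; s itself is unchanged

print(type(as_frame).__name__, type(as_series).__name__)
print(list(renamed.index))
```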

3. Arithmetic on Series

t = pd.Series(range(10,20),index=range(10))
t1 = pd.Series(range(20,25),index=range(5))
t + t1

# Arithmetic on Series
import pandas as pd

t = pd.Series(range(10,20),index=range(10))
print(t)
'''
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64'''
t1 = pd.Series(range(20,25),index=range(5))
print(t1)
'''
0    20
1    21
2    22
3    23
4    24
dtype: int64'''
print(t+t1)  # values are added where the index labels match (the result is float); unmatched labels give NaN
'''
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
dtype: float64'''
t2 = pd.Series(range(10,20),index=range(0,20,2))
print(t2)
'''
0     10
2     11
4     12
6     13
8     14
10    15
12    16
14    17
16    18
18    19
dtype: int64
'''
print(t1)
'''
0    20
1    21
2    22
3    23
4    24
dtype: int64'''
print(t2+t1)  # values are added where the index labels match (the result is float); unmatched labels give NaN
'''
0     30.0
1      NaN
2     33.0
3      NaN
4     36.0
6      NaN
8      NaN
10     NaN
12     NaN
14     NaN
16     NaN
18     NaN
dtype: float64'''
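
If NaN for unmatched labels is not what you want, the method form of the operator accepts a `fill_value`. A minimal sketch reusing the `t` and `t1` definitions from above:

```python
import pandas as pd

t = pd.Series(range(10, 20), index=range(10))
t1 = pd.Series(range(20, 25), index=range(5))

# Labels missing from one side are treated as 0 instead of producing NaN.
total = t.add(t1, fill_value=0)
print(total)
```

Labels 0 to 4 are summed as before, while labels 5 to 9 keep the values from `t`.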

V. Reading external data with pandas

Our data is stored in a CSV file, so we can read it directly with pd.read_csv.

# Reading external data with pandas
import pandas as pd

data = pd.read_csv('./catNames2.csv')
print(data)
print(type(data))  # <class 'pandas.core.frame.DataFrame'>
'''
      Row_Labels  Count_AnimalName
0              1                 1
1              2                 2
2          40804                 1
3          90201                 1
4          90203                 1
...          ...               ...
16215      37916                 1
16216      38282                 1
16217      38583                 1
16218      38948                 1
16219      39743                 1

[16220 rows x 2 columns]'''
data1 = pd.read_clipboard()  # read tabular data from the clipboard
print(data1)
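
The listing above needs the `catNames2.csv` file on disk. For experimenting without that file, `read_csv` also accepts a file-like object. In this sketch the CSV text is made up (two names taken from the output shown later, plus a hypothetical row with an empty name), so it only imitates the real file's layout:

```python
import io
import pandas as pd

# Hypothetical stand-in for catNames2.csv; the last row has a missing name.
csv_text = "Row_Labels,Count_AnimalName\nBELLA,1195\nMAX,1153\n,3\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df["Row_Labels"].isna().sum())  # empty fields are read as NaN
```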

VI. pandas DataFrame

1. Creating a DataFrame

1.1 A DataFrame resembles a multi-dimensional array in which each column may hold a different type; it has both a row index and a column index

t = pd.DataFrame(np.arange(12).reshape(3,4))
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
How are DataFrame and Series related?

1.2 A Series can be built from a dictionary. Can a DataFrame take a dictionary as its data too?

dict_data = {
    'A': 1.,
    'B': date(year=2019,month=8,day=29),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': ['Python', 'Java', 'C++', 'C#'],
    'F': 'ChinaHadoop'
}

# Creating a DataFrame
import pandas as pd
import numpy as np
from datetime import date


t = pd.DataFrame(np.arange(12).reshape(3,4))
print(t)
'''
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11'''
# Row index:    index    (axis 0, axis = 0)
# Column index: columns  (axis 1, axis = 1)
t1 = pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('wxyz'))
print(t1)
'''
   w  x   y   z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11'''
# How are DataFrame and Series related? A DataFrame is a container of Series
dict_data = {
        'A': 1.,
        'B': date(year=2021,month=3,day=22),
        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
        'D': np.array([3] * 4, dtype='int32'),
        'E': ['Python', 'Java', 'C++', 'C#'],
        'F': 'ChinaHadoop'
}
t2 = pd.DataFrame(dict_data)
print(t2)
'''
     A           B    C  D       E            F
0  1.0  2021-03-22  1.0  3  Python  ChinaHadoop
1  1.0  2021-03-22  1.0  3    Java  ChinaHadoop
2  1.0  2021-03-22  1.0  3     C++  ChinaHadoop
3  1.0  2021-03-22  1.0  3      C#  ChinaHadoop
'''
dict_data1 = {
        'A': 1.,
        'B': date(year=2021,month=3,day=22),
        'C': pd.Series(1, index=list(range(5)), dtype='float32'),
        'D': np.array([3] * 5, dtype='int32'),
        'E': ['Python', 'Java', 'C++', 'C#', 'PHP'],
        'F': 'ChinaHadoop'
}
t3 = pd.DataFrame(dict_data1)
print(t3)
'''
     A           B    C  D       E            F
0  1.0  2021-03-22  1.0  3  Python  ChinaHadoop
1  1.0  2021-03-22  1.0  3    Java  ChinaHadoop
2  1.0  2021-03-22  1.0  3     C++  ChinaHadoop
3  1.0  2021-03-22  1.0  3      C#  ChinaHadoop
4  1.0  2021-03-22  1.0  3     PHP  ChinaHadoop'''

dict3 = {'name':['yangyu','king'],'age':[18,20],'address':['shanghai','chengdu']}
t4 = pd.DataFrame(dict3)
print(t4)
'''
     name  age   address
0  yangyu   18  shanghai
1    king   20   chengdu'''
print(type(t4))  # <class 'pandas.core.frame.DataFrame'>
dict4 = [{'name':'yangyu','age':18,'tel':10000},{'name':'king','tel':10001},{'name':'lilei','age':20}]
t5 = pd.DataFrame(dict4)  # missing entries are filled with NaN
print(t5)
'''
     name   age      tel
0  yangyu  18.0  10000.0
1    king   NaN  10001.0
2   lilei  20.0      NaN
'''
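
A DataFrame can also be built from a dictionary of Series, in which case the rows are aligned by label rather than by position. The sketch below uses hypothetical `ages` and `tels` Series loosely based on the sample records above:

```python
import pandas as pd

ages = pd.Series({"yangyu": 18, "lilei": 20})
tels = pd.Series({"yangyu": 10000, "king": 10001})

# The row index is the union of both Series' labels; gaps become NaN.
people = pd.DataFrame({"age": ages, "tel": tels})
print(people)
```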

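1.3 A DataFrame has both a row index and a column index. What can we do with it?

1.3.1 Basic DataFrame attributes

df.shape # (number of rows, number of columns)
df.dtypes # dtype of each column
df.ndim # number of dimensions
df.index # row index
df.columns # column index
df.values # the data as a two-dimensional ndarray
df.drop(columns=['name','age']) # returns a new DataFrame without those columns; the original is unchanged
del df['name'] # deletes the column in place

# Working with a DataFrame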
import pandas as pd

dict1 = [{'name': 'yangyu', 'age': 18, 'tel': 10000}, {'name': 'king', 'tel': 10001}, {'name': 'lilei', 'age': 20}]
t1 = pd.DataFrame(dict1)  # missing entries are filled with NaN
print(t1)
'''
     name   age      tel
0  yangyu  18.0  10000.0
1    king   NaN  10001.0
2   lilei  20.0      NaN'''
print(type(t1))  # <class 'pandas.core.frame.DataFrame'>
print(t1.shape)  # (3, 3): rows, columns
print(t1.dtypes)  # dtype of each column
'''
name     object
age     float64
tel     float64
dtype: object'''
print(t1.ndim)  # number of dimensions: 2
print(t1.index)  # row index: RangeIndex(start=0, stop=3, step=1)
t2 = pd.DataFrame(dict1,index=list('abc'))
print(t2.index)  # row index: Index(['a', 'b', 'c'], dtype='object')
print(t1.columns)  # column index: Index(['name', 'age', 'tel'], dtype='object')
print(t1.values)
'''
[['yangyu' 18.0 10000.0]
 ['king' nan 10001.0]
 ['lilei' 20.0 nan]]'''
print(type(t1.values))  # <class 'numpy.ndarray'>
print(t2.drop(index='a'))
'''
    name   age      tel
b   king   NaN  10001.0
c  lilei  20.0      NaN'''
print(t2.drop(index='a',columns='age'))
'''
    name      tel
b   king  10001.0
c  lilei      NaN'''
print(t2)  # t2 itself is unchanged
'''
     name   age      tel
a  yangyu  18.0  10000.0
b    king   NaN  10001.0
c   lilei  20.0      NaN'''
# inplace=True modifies the object itself
t2.drop(index='a',columns='age',inplace=True)
print(t2)
'''
    name      tel
b   king  10001.0
c  lilei      NaN'''
del t1['name']  # delete an entire column
print(t1)
'''
    age      tel
0  18.0  10000.0
1   NaN  10001.0
2  20.0      NaN'''
del t1     # delete the variable itself
print(t1)  # NameError: name 't1' is not defined
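
As a compact recap of the removal options, the sketch below contrasts `drop`, which returns a new object, with `pop`, which removes a column in place and hands it back. The small `df` here is made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"name": ["yangyu", "king"], "age": [18, 20], "tel": [10000, 10001]})

trimmed = df.drop(columns=["tel"])  # new DataFrame; df still has all three columns
ages = df.pop("age")                # removes "age" from df and returns it as a Series

print(list(trimmed.columns))
print(list(df.columns))
print(ages.tolist())
```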

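1.3.2 Getting an overview of a DataFrame

t2.head(3) # show the first rows (5 by default)
t2.tail(3) # show the last rows (5 by default)
t2.info() # summary of index, columns, non-null counts, and dtypes
t2.describe() # quick summary statistics for numeric columns

# Getting an overview of a DataFrame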
import pandas as pd

dict1 = [{'name': 'yangyu', 'age': 18, 'tel': 10000}, {'name': 'king', 'tel': 10001}, {'name': 'lilei', 'age': 20},{'name': 'yangyu', 'age': 18, 'tel': 10000},{'name': 'yangyu', 'age': 18, 'tel': 10000},{'name': 'yangyu', 'age': 18, 'tel': 10000},]
t1 = pd.DataFrame(dict1)  # missing entries are filled with NaN
print(t1)
'''
     name   age      tel
0  yangyu  18.0  10000.0
1    king   NaN  10001.0
2   lilei  20.0      NaN
3  yangyu  18.0  10000.0
4  yangyu  18.0  10000.0
5  yangyu  18.0  10000.0
'''
print(t1.head())  # first 5 rows by default
'''
     name   age      tel
0  yangyu  18.0  10000.0
1    king   NaN  10001.0
2   lilei  20.0      NaN
3  yangyu  18.0  10000.0
4  yangyu  18.0  10000.0'''
print(t1.tail())  # last 5 rows by default
'''
     name   age      tel
1    king   NaN  10001.0
2   lilei  20.0      NaN
3  yangyu  18.0  10000.0
4  yangyu  18.0  10000.0
5  yangyu  18.0  10000.0'''
print(t1.info())  # summary; info() prints directly and returns None, hence the trailing None
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    6 non-null      object 
 1   age     5 non-null      float64
 2   tel     5 non-null      float64
dtypes: float64(2), object(1)
memory usage: 272.0+ bytes
None'''
print(t1.describe())  # summary statistics
'''
             age           tel
count   5.000000      5.000000
mean   18.400000  10000.200000
std     0.894427      0.447214
min    18.000000  10000.000000
25%    18.000000  10000.000000
50%    18.000000  10000.000000
75%    18.000000  10000.000000
max    20.000000  10001.000000
'''

2. Exercise

Suppose we have a set of statistics on cat names and want to know which names are used most often.
df.sort_values(by="",ascending=False)

# Suppose we have a set of statistics on cat names and want to know which names are used most often.
import pandas as pd

# Sorting
df = pd.read_csv('catNames2.csv')
# print(df)

# Ascending by default; `by` names the column to sort on
# print(df.sort_values(by='Count_AnimalName'))  # ascending (default ascending=True)
# print(df.sort_values(by='Count_AnimalName',ascending=False))  # descending
df.sort_values(by='Count_AnimalName',ascending=False,inplace=True)  # modifies df itself
print(df.info())
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16220 entries, 1156 to 16219
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Row_Labels        16217 non-null  object
 1   Count_AnimalName  16220 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 380.2+ KB
None
'''
print(df)
'''
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
...          ...               ...
6884        J-LO                 1
6888       JOANN                 1
6890        JOAO                 1
6891     JOAQUIN                 1
16219      39743                 1

[16220 rows x 2 columns]'''
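
When only the top few rows are needed, `nlargest` is a shorter alternative to sorting the whole table. The sketch below uses a small hand-made frame (names and counts copied from the output above, in shuffled order) rather than the real CSV file:

```python
import pandas as pd

names = pd.DataFrame({
    "Row_Labels": ["COCO", "BELLA", "JOAO", "MAX", "ROCKY", "CHARLIE"],
    "Count_AnimalName": [852, 1195, 1, 1153, 823, 856],
})

top3 = names.nlargest(3, "Count_AnimalName")
same = names.sort_values(by="Count_AnimalName", ascending=False).head(3)

print(top3)
```

Both approaches select the same three rows here; `nlargest` simply states the intent more directly.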

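3. Indexing in pandas

Selecting rows and columns
A numeric slice inside square brackets selects rows
A string inside square brackets selects a column by its label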

import pandas as pd

# Sorting
df = pd.read_csv('catNames2.csv')
# print(df)

# Ascending by default; `by` names the column to sort on
# print(df.sort_values(by='Count_AnimalName'))  # ascending (default ascending=True)
# print(df.sort_values(by='Count_AnimalName',ascending=False))  # descending
df.sort_values(by='Count_AnimalName',ascending=False,inplace=True)  # modifies df itself
# Select rows: the first 2 rows
print(df[:2])  # or print(df.head(2))
'''
     Row_Labels  Count_AnimalName
1156      BELLA              1195
9140        MAX              1153'''
# Select a column by its label
print(df['Count_AnimalName'])
'''
1156     1195
9140     1153
2660      856
3251      852
12368     823
         ... 
6884        1
6888        1
6890        1
6891        1
16219       1
'''
# Select rows and a column together
print(df[:2]['Row_Labels'])
'''
1156    BELLA
9140      MAX
Name: Row_Labels, dtype: object'''
print(type(df[:2]['Row_Labels']))  # <class 'pandas.core.series.Series'>

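pandas also provides optimized selection methods
df.loc selects rows by label
df.iloc selects rows by position

# pandas also provides optimized selection methods
# df.loc selects rows by label
# df.iloc selects rows by position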
import pandas as pd
import numpy as np

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('wxyz'))
print(t)
'''
   w  x   y   z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11'''
# Select one row
print(t.loc['a'])
print(type(t.loc['a']))
'''
w    0
x    1
y    2
z    3
Name: a, dtype: int32
<class 'pandas.core.series.Series'>'''
# Select one column
print(t.loc[:, 'z'])
print(type(t.loc[:, 'z']))
'''
a     3
b     7
c    11
Name: z, dtype: int32
<class 'pandas.core.series.Series'>'''
# Select multiple rows by label
print(t.loc[['a', 'c']])
print(type(t.loc[['a', 'c']]))
'''
   w  x   y   z
a  0  1   2   3
c  8  9  10  11
<class 'pandas.core.frame.DataFrame'>'''
# Select multiple rows by position
print(t.iloc[[0, 2]])
print(type(t.iloc[[0, 2]]))
'''
   w  x   y   z
a  0  1   2   3
c  8  9  10  11
<class 'pandas.core.frame.DataFrame'>'''
# Select multiple columns by label
print(t.loc[:, ['w', 'z']])
print(type(t.loc[:, ['w', 'z']]))
'''
   w   z
a  0   3
b  4   7
c  8  11
<class 'pandas.core.frame.DataFrame'>'''
# Select multiple columns by position
print(t.iloc[:, [0, 3]])
print(type(t.iloc[:, [0, 3]]))
'''
   w   z
a  0   3
b  4   7
c  8  11
<class 'pandas.core.frame.DataFrame'>'''
# Select a single value
print(t.iloc[0, 0])  # 0
t.iloc[0, 0] = 100
print(t)
'''
     w  x   y   z
a  100  1   2   3
b    4  5   6   7
c    8  9  10  11'''
t.iloc[0, 0] = np.nan   # no manual cast to float is needed: pandas upcasts the column itself (newer versions may warn about the implicit upcast)
print(t)
'''
     w  x   y   z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11'''
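
`loc` and `iloc` can also take a row selector and a column selector at the same time. A short sketch on the same 3x4 frame used above (rebuilt here so the example is self-contained):

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("wxyz"))

block_by_label = t.loc["a":"b", ["w", "z"]]  # label slice: rows a through b, inclusive
block_by_pos = t.iloc[0:2, [0, 3]]           # positional slice: rows 0 and 1
cell = t.loc["c", "y"]                       # single value by labels

print(block_by_label)
print(cell)
```

Note that the label slice `"a":"b"` includes row `b`, while the positional slice `0:2` stops before position 2.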

4. Arithmetic on DataFrames

t = pd.DataFrame(np.ones((2,2)),columns=['a','b'])
t1 = pd.DataFrame(np.ones((3,3)),columns=['a','b','c'])

t2 = pd.Series(range(20,25),index=range(5))
t = pd.DataFrame(np.ones((2,2)),columns=['a','b'])

# Arithmetic on DataFrames
import pandas as pd
import numpy as np

t = pd.DataFrame(np.ones((2,2)),columns=['a','b'])
t1 = pd.DataFrame(np.ones((3,3)),columns=['a','b','c'])
print(t)
'''
     a    b
0  1.0  1.0
1  1.0  1.0'''
print(t1)
'''
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0'''
print(t + t1)
'''
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN'''
t2 = pd.Series(range(20,25),index=range(5))
print(t2)
'''
0    20
1    21
2    22
3    23
4    24
dtype: int64'''
print(t)
'''
     a    b
0  1.0  1.0
1  1.0  1.0'''
print(t2 + t)
'''
    a   b   0   1   2   3   4
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN'''
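
The all-NaN result above comes from alignment: with `+`, a Series' index is matched against the DataFrame's column labels. Two variations that are often closer to what is wanted are sketched below; the `row_offsets` Series is a made-up example value.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])
t1 = pd.DataFrame(np.ones((3, 3)), columns=["a", "b", "c"])

# Cells present in only one frame are treated as 0 on the other side.
filled = t.add(t1, fill_value=0)

# axis=0 aligns the Series' index with the row index instead of the columns.
row_offsets = pd.Series([10, 20, 30])
shifted = t1.add(row_offsets, axis=0)

print(filled)
print(shifted)
```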

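5. Boolean indexing in pandas

Suppose we want every cat name that is used more than 800 times. How do we select them?
Suppose we want every dog name that is used more than 700 times and is longer than 4 characters. How do we select them?

# Boolean indexing in pandas
import pandas as pd
import numpy as np
# Which cat names are used more than 800 times?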

df = pd.read_csv('catNames2.csv')
print(df[df['Count_AnimalName']>800])
'''
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
3251        COCO               852
9140         MAX              1153
12368      ROCKY               823'''

# First combine two numeric conditions: used more than 700 times but fewer than 1000 (each condition needs its own parentheses)
print(df[(df['Count_AnimalName']>700) & (df['Count_AnimalName']<1000)])
'''
      Row_Labels  Count_AnimalName
2660     CHARLIE               856
3251        COCO               852
8417        LOLA               795
8552       LUCKY               723
8560        LUCY               710
12368      ROCKY               823'''
# Used more than 700 times and the name is longer than 4 characters
print(df[(df['Count_AnimalName']>700) & (df['Row_Labels'].str.len()>4)])
'''
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
8552       LUCKY               723
12368      ROCKY               823'''
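
Besides `&`, boolean masks can be combined with `|` (or) and negated with `~` (not). The sketch below uses a small made-up frame that imitates the name data, including one missing name, so it runs without the CSV file:

```python
import pandas as pd

df = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "LOLA", None, "ROCKY"],
    "Count_AnimalName": [1195, 1153, 795, 3, 823],
})

long_name = df["Row_Labels"].str.len() > 4   # a missing name compares as False
popular = df["Count_AnimalName"] > 800

both = df[popular & long_name]
either = df[popular | long_name]
neither = df[~(popular | long_name)]

print(both)
print(either)
print(neither)
```

Python's `and`, `or`, and `not` do not work element-wise on a Series, which is why the operator forms are used here.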

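6. Common pandas string methods

Method        Description
contains      returns a boolean array indicating whether each string contains the given pattern
lower, upper  convert case
slice         extract a substring from each string in the Series
split         split each string by a delimiter or regular expression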
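
These methods are reached through the `.str` accessor of a Series. A minimal sketch on a made-up Series of names:

```python
import pandas as pd

names = pd.Series(["Bella Rose", "max", "Charlie"])

has_ll = names.str.contains("ll")   # does each string contain "ll"?
upper = names.str.upper()           # convert to upper case
first3 = names.str.slice(0, 3)      # first three characters of each string
parts = names.str.split(" ")        # split on a space; each element becomes a list

print(has_ll.tolist())
print(upper.tolist())
print(first3.tolist())
print(parts.tolist())
```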

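7. Sorting in pandas

Sort by index: sort_index()
Sort by values: sort_values(by, ascending)
Sort by the values of a single column
by='label'
ascending: True for ascending order, False for descending order

# Sorting in pandas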
import pandas as pd
import numpy as np

t = pd.DataFrame(np.arange(12).reshape(3,4))
print(t)
'''
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
'''
print(t.sort_index())
'''
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11'''
print(t.sort_index(ascending=False))  # index in descending order
'''
   0  1   2   3
2  8  9  10  11
1  4  5   6   7
0  0  1   2   3'''
t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'))
print(t.sort_index(ascending=False))  # descending index; strings sort by character code: a=97, b=98, c=99, A=65, B=66
'''
   0  1   2   3
c  8  9  10  11
b  4  5   6   7
a  0  1   2   3'''
t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('aBc'))
print(t.sort_index(ascending=False))  # descending index: lowercase letters sort after uppercase ones
'''
   0  1   2   3
c  8  9  10  11
a  0  1   2   3
B  4  5   6   7'''
print(t.sort_index(ascending=False, axis=1))  # column labels in descending order (axis=1)
'''
    3   2  1  0
a   3   2  1  0
B   7   6  5  4
c  11  10  9  8'''
print(t.sort_values(by=0))  # sort by column 0, ascending
'''
   0  1   2   3
a  0  1   2   3
B  4  5   6   7
c  8  9  10  11'''
print(t.sort_values(by=0,ascending=False))  # sort by column 0, descending
'''
   0  1   2   3
c  8  9  10  11
B  4  5   6   7
a  0  1   2   3'''
t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('aBc'),columns=list('efgh'))
print(t.sort_values(by='e',ascending=False))  # sort by column 'e', descending
'''
   e  f   g   h
c  8  9  10  11
B  4  5   6   7
a  0  1   2   3'''
print(t.sort_values(by='a',ascending=False, axis=1))  # reorder the columns by the values in row 'a', descending
'''
    h   g  f  e
a   3   2  1  0
B   7   6  5  4
c  11  10  9  8'''

# Sorting in pandas, continued

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
                   'col2': [2, 1, 9, 8, 7, 7],
                   'col3': [0, 1, 5, 4, 8, 2]})
print(df)
'''
  col1  col2  col3
0    A     2     0
1    A     1     1
2    B     9     5
3  NaN     8     4
4    D     7     8
5    C     7     2'''
print(df.sort_values(by=['col2'],ascending=False))  # sort by col2 in descending order: the row with the largest col2 comes first
'''
  col1  col2  col3
2    B     9     5
3  NaN     8     4
4    D     7     8
5    C     7     2
0    A     2     0
1    A     1     1'''
print(df.sort_values(by=['col2','col3']))  # sort by col2, then col3, both ascending (rows with equal col2 are ordered by col3)
'''
  col1  col2  col3
1    A     1     1
0    A     2     0
5    C     7     2
4    D     7     8
3  NaN     8     4
2    B     9     5'''
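
Two further options of `sort_values` are worth a quick sketch: a per-column list for `ascending`, and `na_position` to control where missing values land. The frame is the same `df` as above, rebuilt so the example runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
                   'col2': [2, 1, 9, 8, 7, 7],
                   'col3': [0, 1, 5, 4, 8, 2]})

# col2 descending; rows with equal col2 are ordered by col3 ascending.
mixed = df.sort_values(by=['col2', 'col3'], ascending=[False, True])

# Missing values go last by default; na_position='first' moves them to the top.
na_first = df.sort_values(by='col1', na_position='first')

print(mixed)
print(na_first)
```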

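8. Handling missing data in pandas

Missing data usually takes one of two forms:
One is an empty value such as None, which pandas represents as NaN (the same as np.nan)
The other is a value that has been set to 0 (highlighted with a blue box in the original figure)

Checking whether values are NaN: pd.isnull(df), pd.notnull(df)
Option 1: drop the rows or columns that contain NaN: dropna(axis=0, how='any', inplace=False)
Option 2: fill in values: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)

# Handling missing data in pandas
'''
Checking whether values are NaN: pd.isnull(df), pd.notnull(df)
Option 1: drop the rows or columns that contain NaN: t.dropna(axis=0, how='any', inplace=False)
Option 2: fill in values: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)
'''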
import pandas as pd
import numpy as np

t = pd.DataFrame(np.arange(24).reshape(4,6),dtype=float,index=list('ABCD'),columns=list('UVWXYZ'))
print(t)
'''
      U     V     W     X     Y     Z
A   0.0   1.0   2.0   3.0   4.0   5.0
B   6.0   7.0   8.0   9.0  10.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0  20.0  21.0  22.0  23.0'''
t.iloc[0,0] = np.nan
t.iloc[0,5] = np.nan
t.iloc[3,2] = np.nan
t.iloc[1,4] = 0
print(t)
'''
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0   NaN
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0'''
print(pd.isnull(t))
'''
       U      V      W      X      Y      Z
A   True  False  False  False  False   True
B  False  False  False  False  False  False
C  False  False  False  False  False  False
D  False  False   True  False  False  False'''
print(pd.notnull(t))
'''
       U     V      W     X     Y      Z
A  False  True   True  True  True  False
B   True  True   True  True  True   True
C   True  True   True  True  True   True
D   True  True  False  True  True   True'''
# how="all": a row or column is dropped only if all of its values are NaN
print(t.dropna())  # defaults: axis=0, how="any", thresh=None, subset=None, inplace=False; drops every row that contains NaN
'''
      U     V     W     X     Y     Z
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0'''
print(t.dropna(axis=1))  # drop every column that contains NaN
'''
      V     X     Y
A   1.0   3.0   4.0
B   7.0   9.0   0.0
C  13.0  15.0  16.0
D  19.0  21.0  22.0'''

print(t)
'''
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0   NaN
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0'''
print(t.fillna(2))   # fill NaN with the number 2
'''
      U     V     W     X     Y     Z
A   2.0   1.0   2.0   3.0   4.0   2.0
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   2.0  21.0  22.0  23.0'''
print(t.fillna(t.mean()))  # fill each NaN with the mean of its column
'''
      U     V     W     X     Y     Z
A  12.0   1.0   2.0   3.0   4.0  17.0
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   8.0  21.0  22.0  23.0'''
print("="*30)
print(t)
'''
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0   NaN
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0'''
t['Z'] = t['Z'].fillna(t['Z'].mean())  # fill a single column with its own mean
print(t)
'''
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0  17.0
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0'''
t['W'] = t['W'].fillna(t['X'].mean())  # fill a single column with the mean of another column
print("="*30)
print(t)
'''   
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0  17.0
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0  12.0  21.0  22.0  23.0'''
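
The other `dropna` parameters and the median fill mentioned in the summary are not exercised above. Here is a sketch on the same 4x6 frame (rebuilt here, without the extra 0 in column Y) showing `how='all'`, `subset`, `thresh`, and `fillna` with `median()`:

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(24).reshape(4, 6), dtype=float,
                 index=list('ABCD'), columns=list('UVWXYZ'))
t.iloc[0, 0] = np.nan
t.iloc[0, 5] = np.nan
t.iloc[3, 2] = np.nan

all_nan_only = t.dropna(how='all')    # drops a row only if every value is NaN
need_w = t.dropna(subset=['W'])       # only column W is checked for NaN
at_least_5 = t.dropna(thresh=5)       # keep rows with at least 5 non-NaN values
by_median = t.fillna(t.median())      # fill each NaN with its column's median

print(need_w.index.tolist())
print(at_least_5.index.tolist())
print(by_median)
```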

9. Handling duplicate data in pandas

data = pd.DataFrame(
    {
        'age':[28,31,27,28],
        'gender':['M','M','M','F'],
        'name':['Liu','Li','Chen','Liu']
    }
)
Check for duplicate rows
data.duplicated()

Remove duplicate rows
data.drop_duplicates()
subset: restrict the comparison to certain columns
keep: which occurrence to keep (the first one by default)

# Handling duplicate data in pandas
'''
Check for duplicate rows: data.duplicated()
Remove duplicate rows: data.drop_duplicates(), with subset to choose columns and keep to choose which occurrence survives
'''
import pandas as pd
import numpy as np

data = pd.DataFrame(
    {
        'age': [28, 31, 27, 28],
        'gender': ['M', 'M', 'M', 'F'],
        'name': ['Liu', 'Li', 'Chen', 'Liu']
    }
)
print(data)
'''
   age gender  name
0   28      M   Liu
1   31      M    Li
2   27      M  Chen
3   28      F   Liu'''

# Check for duplicates with data.duplicated()
print(data.duplicated())   # compares entire rows
'''
0    False
1    False
2    False
3    False
dtype: bool'''
data1 = pd.DataFrame(
    {
        'age': [28, 31, 27, 28],
        'gender': ['M', 'M', 'M', 'M'],
        'name': ['Liu', 'Li', 'Chen', 'Liu']
    }
)
print(data1.duplicated())  # compares entire rows
'''
0    False
1    False
2    False
3     True
dtype: bool'''
# Check for duplicates using only the age and name columns
print(data.duplicated(subset=['age','name']))
'''
0    False
1    False
2    False
3     True
dtype: bool'''
data.drop_duplicates(inplace=True)  # no row is fully duplicated here, so nothing is removed
print(data)
'''
   age gender  name
0   28      M   Liu
1   31      M    Li
2   27      M  Chen
3   28      F   Liu'''
data.drop_duplicates(subset=['age','name'],inplace=True)  # remove rows that duplicate the given columns
print(data)
'''
   age gender  name
0   28      M   Liu
1   31      M    Li
2   27      M  Chen'''
data1.drop_duplicates(inplace=True)  # remove fully duplicated rows
print(data1)
'''
   age gender  name
0   28      M   Liu
1   31      M    Li
2   27      M  Chen'''
print("=="*30)
data2 = pd.DataFrame(
    {
        'age': [28, 31, 27, 28],
        'gender': ['M', 'M', 'M', 'M'],
        'name': ['Liu', 'Li', 'Chen', 'Liu']
    }
)
data2.drop_duplicates(inplace=True,keep='last')  # keep the last occurrence and drop the earlier ones (the default keep='first' keeps the first)
print(data2)
'''
   age gender  name
1   31      M    Li
2   27      M  Chen
3   28      M   Liu'''
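
Sometimes the goal is to inspect every copy of a duplicated row before deciding what to delete. A short sketch with `keep=False`, using the same values as `data1` above:

```python
import pandas as pd

data = pd.DataFrame({
    'age': [28, 31, 27, 28],
    'gender': ['M', 'M', 'M', 'M'],
    'name': ['Liu', 'Li', 'Chen', 'Liu'],
})

# keep=False marks every member of a duplicate group, not just the later ones.
all_copies = data[data.duplicated(keep=False)]

# After dropping duplicates the old row labels remain; reset_index renumbers them.
deduped = data.drop_duplicates().reset_index(drop=True)

print(all_copies)
print(deduped)
```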

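10. Replacing values in pandas

replace(to_replace)
to_replace is the value to be replaced; it can be
a number or a string
a list
a dictionary

# Replacing values in pandas
'''
replace(to_replace)
to_replace is the value to be replaced; it can be
a number or a string
a list
a dictionary
'''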
import pandas as pd
import numpy as np

data = pd.DataFrame(
    {
        'age': [28, 31, 27, 28],
        'gender': ['M', 'M', 'M', 'F'],
        'name': ['Liu', 'Li', 'Chen', 'Liu']
    }
)

print(data.replace(28, 30))  # replace every 28 with 30
'''
   age gender  name
0   30      M   Liu
1   31      M    Li
2   27      M  Chen
3   30      F   Liu
'''
print(data.replace('M', 30))  # replace every 'M' with 30
'''
   age gender  name
0   28     30   Liu
1   31     30    Li
2   27     30  Chen
3   28      F   Liu'''
print(data.replace(['M', 'F'], 30))  # replace both 'M' and 'F' with 30
'''
   age  gender  name
0   28      30   Liu
1   31      30    Li
2   27      30  Chen
3   28      30   Liu'''
print(data.replace(['M', 'F'], ['MAN', 'FAN']))  # replace 'M' with 'MAN' and 'F' with 'FAN'
'''
   age gender  name
0   28    MAN   Liu
1   31    MAN    Li
2   27    MAN  Chen
3   28    FAN   Liu'''
print(data.replace({28: 30, 'M': 'MAN'}))  # with a dictionary: replace 28 with 30 and 'M' with 'MAN'
'''
   age gender  name
0   30    MAN   Liu
1   31    MAN    Li
2   27    MAN  Chen
3   30      F   Liu'''
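
The replacements above apply to the whole DataFrame. To restrict a replacement to one column, `replace` also accepts a nested dictionary keyed by column name. A minimal sketch; the target strings 'male' and 'female' are arbitrary choices for the example:

```python
import pandas as pd

data = pd.DataFrame({
    'age': [28, 31, 27, 28],
    'gender': ['M', 'M', 'M', 'F'],
    'name': ['Liu', 'Li', 'Chen', 'Liu'],
})

# Outer key: column name; inner dict: old value -> new value for that column only.
per_column = data.replace({'gender': {'M': 'male', 'F': 'female'}})

print(per_column)
print(data['gender'].tolist())  # replace returns a new object; data is unchanged
```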
