Pandas Advanced 01: The Basics

平原2018


Article link: https://blog.csdn.net/sinat_30353259/article/details/80765260


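I. Introduction to pandas

pandas is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
pandas is a NumFOCUS-sponsored project.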

II. pandas installation environment

1. Operating system:
Windows 8.1
2. Development tools:
• Anaconda 5.1
• Jupyter Notebook
3. Python version:
• 3.6
4. Third-party packages:
• numpy 1.13.3
• pandas 0.20.3
• matplotlib 2.1.2

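III. pandas quick start

Contents:
1. pandas basic data structures: Series
2. pandas basic data structures: DataFrame
3. pandas advanced: slicing
4. pandas advanced: selecting and filtering by index
5. pandas advanced: arithmetic
6. pandas advanced: function application and mapping
7. pandas advanced: simple statistical functions
8. pandas advanced: sorting by index
9. pandas advanced: handling missing values
10. pandas advanced: combining value_counts with apply
11. pandas advanced: time and date series

1. pandas basic data structures: Series

A Series is a one-dimensional, array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (its index). The simplest Series can be created from a set of data alone.

Import pandas and numpy: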

from pandas import Series,DataFrame
import pandas as pd
import numpy as np

Create a Series:

obj = Series(["a",22,False])
obj

Output of obj:

0        a
1       22
2    False
dtype: object

Create a Series with a custom index given as a list; the index labels and the values must correspond one to one:

series1 = Series(["小明","校长","小兰"],index=["a","b","c"])
series1

Output of series1:

a    小明
b    校长
c    小兰
dtype: object

Create a Series from np.arange:

randomNum = Series(np.arange(5),index=['a','b','c','d','e'])
randomNum

Output of randomNum:

a    0
b    1
c    2
d    3
e    4
dtype: int32

Create a Series from a dict; the keys become the index and the values become the data:

names = {'jack':22,'小红':18,'Mary':23}
persons = Series(names)
persons

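Output of persons (pandas 0.20.3 sorts the dict keys; pandas 0.23 and later keep the insertion order on Python 3.6+):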

Mary    23
jack    22
小红      18
dtype: int64
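
A dict can also be combined with an explicit index to control the order of the result. The sketch below reuses the names dict from above; the label 'Tom' is a hypothetical key that is not in the dict, so pandas should fill its value with NaN.

```python
import pandas as pd

names = {'jack': 22, '小红': 18, 'Mary': 23}

# index= selects and orders the dict entries by label.
# 'Tom' is a hypothetical key that is absent from the dict, so its value is NaN.
persons = pd.Series(names, index=['Mary', 'jack', 'Tom'])

print(persons)
print(persons['jack'])     # access a value by its label
print(persons.isnull())    # True only for the missing label
```

Because one value is missing, the result is stored as float64 rather than int64.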

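2. pandas basic data structures: DataFrame

A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that share the same index. Compared with similar data structures (such as R's data.frame), row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or other one-dimensional structure.

#1. First way to build a DataFrame: from a dict   column index (省份, 年份, 人口)   row index (0, 1, 2)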
pop = {"省份":["上海","浙江","河南"],
        "年份":[2015,2016,2017],
      "人口":[0.8,1.2,1.5]}

# Set the column order explicitly (the column names must match the keys in the data)
df = DataFrame(pop,columns=["省份","年份","人口"])
df

Output of df:
    省份  年份  人口
0   上海  2015    0.8
1   浙江  2016    1.2
2   河南  2017    1.5

# Select multiple columns from the DataFrame:
df[["省份","年份","人口"]]

Output:
    省份  年份  人口
0   上海  2015    0.8
1   浙江  2016    1.2
2   河南  2017    1.5

# Select multiple rows from the DataFrame
df[0:2]

Output:
    省份  年份  人口
0   上海  2015    0.8
1   浙江  2016    1.2

#2. Use loc (label-based) or iloc (position-based) to get a single row by its index
df.loc[0]

Output:
省份      上海
年份    2015
人口     0.8
Name: 0, dtype: object
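
With the default 0, 1, 2 index, loc and iloc look identical. The sketch below uses hypothetical string labels ("x", "y", "z", not part of the original example) to show the difference between selecting by label and selecting by position.

```python
import pandas as pd

df = pd.DataFrame({"省份": ["上海", "浙江", "河南"],
                   "年份": [2015, 2016, 2017],
                   "人口": [0.8, 1.2, 1.5]},
                  index=["x", "y", "z"])   # hypothetical row labels

row_by_label = df.loc["y"]       # select by index label
row_by_position = df.iloc[1]     # select by integer position

print(row_by_label)
print(row_by_label.equals(row_by_position))
```

Both calls return the same row here because the label "y" sits at position 1.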


#3. Add a column to the DataFrame
df["财政赤字"]=[1.5,1.3,1.1]    # or df["财政赤字"] = np.arange(3)
df

Output:
    省份  年份  人口  财政赤字
0   上海  2015    0.8     1.5
1   浙江  2016    1.2     1.3
2   河南  2017    1.5     1.1


#4. Create a Series whose values will replace a DataFrame column
series = Series([1.5,2.2,5.4])
series

Output:
0    1.5
1    2.2
2    5.4
dtype: float64


# Replacement, method 1:
# Replace the values of the 财政赤字 column with those of series. Note: the indexes must match (both are 0, 1, 2)
df["财政赤字"]=series
df

Output:
    省份  年份  人口  财政赤字
0   上海  2015    0.8     1.5
1   浙江  2016    1.2     2.2
2   河南  2017    1.5     5.4


# Get the row index of the DataFrame
list(df.index)

Output:
[0, 1, 2]


# Replacement, method 2
# Build a Series with a custom index
series = Series([1.5,2.2,5.4],index=["one","two","three"])
# Change the row index of the DataFrame so that it matches
df.index = ["one","two","three"]
df["财政赤字"]=series
df

Output:
       省份    年份   人口  财政赤字
one    上海  2015  0.8   1.5
two    浙江  2016  1.2   2.2
three  河南  2017  1.5   5.4
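
The reason the indexes must match is that column assignment aligns values by label, not by position. A minimal sketch, assuming a Series (here called partial, a name introduced for illustration) that covers only two of the three row labels and lists them in a different order:

```python
import pandas as pd

df = pd.DataFrame({"省份": ["上海", "浙江", "河南"]},
                  index=["one", "two", "three"])

# The Series covers only two of the three row labels, in a different order.
partial = pd.Series([2.2, 1.5], index=["two", "one"])

df["财政赤字"] = partial   # values are matched by label, not by position
print(df)
```

The row "three" has no matching label in partial, so its value should be NaN.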


# 5. Add a column based on a condition
df["magic"] = df.省份 =="上海"
df

Output:
       省份    年份   人口  财政赤字  magic
one    上海  2015  0.8   1.5   True
two    浙江  2016  1.2   2.2  False
three  河南  2017  1.5   5.4  False


# 6. Delete a column
del df["magic"]
df

Output:
    省份  年份  人口  财政赤字
one 上海  2015    0.8 1.5
two 浙江  2016    1.2 2.2
three   河南  2017    1.5 5.4


# Get the column names
df.columns

Output:
Index(['省份', '年份', '人口', '财政赤字'], dtype='object')


#7. Build a DataFrame from a dict of dicts
# The keys of the outer dict become the column index
# The keys of the inner dicts become the row index
provice = {"上海":{2000:0.7,2001:0.8,2002:1.2},
          "江苏":{2000:1.1,2001:1.2,2002:1.3}}
df2 = DataFrame(provice)
df2

Output:
        上海  江苏
2000    0.7     1.1
2001    0.8     1.2
2002    1.2     1.3


df2.index
Output (pandas 2.0 and later print Index([2000, 2001, 2002], dtype='int64') instead):
Int64Index([2000, 2001, 2002], dtype='int64')


#8. Transpose the DataFrame (swap rows and columns); the original is unchanged and a new DataFrame is returned
df2.T

Output:
        2000    2001    2002
上海  0.7     0.8     1.2
江苏  1.1     1.2     1.3


# Give the column index a name
df2.columns.name="省份"
# Give the row index a name
df2.index.name="年份"
df2

Output:
省份  上海  江苏
年份      
2000    0.7     1.1
2001    0.8     1.2
2002    1.2     1.3


# Get all the values stored in the DataFrame
df2.values

Output, a NumPy array:
array([[0.7, 1.1],
       [0.8, 1.2],
       [1.2, 1.3]])

3. pandas advanced: slicing

#1. Create a Series
obj = Series(range(3),index=["a","b","c"])
obj

Output:
a    0
b    1
c    2
dtype: int64


#2. Get the index
index = obj.index
index

Output:
Index(['a', 'b', 'c'], dtype='object')


#3. Get a single label from the Series index by position
index[1]

Output:
'b'


#4. Slice the index to get several labels
index[1:]

Output:
Index(['b', 'c'], dtype='object')


#5. Create a DataFrame for the drop examples
df = DataFrame(np.arange(16).reshape((4,4)),
              index=["安徽","北京","上海","南京"],
              columns=["one","two","three","four"])
df

Output:
    one two three   four
安徽  0   1   2   3
北京  4   5   6   7
上海  8   9   10  11
南京  12  13  14  15


#6. Drop rows by index label (北京 and 上海)
df.drop(["北京","上海"])

Output:
     one    two three   four
安徽  0   1   2       3
南京  12  13  14      15


#7. Drop multiple columns
df.drop(["one","two"],axis=1)

Output:
      three four
安徽  2   3
北京  6   7
上海  10  11
南京  14  15
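
Note that drop returns a new object and leaves the original untouched. The sketch below illustrates this with the same data; it uses the index= and columns= keywords, which are available in pandas 0.21 and later (newer than the 0.20.3 used in this article), and the name smaller is introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["安徽", "北京", "上海", "南京"],
                  columns=["one", "two", "three", "four"])

# drop returns a new DataFrame; df itself is not modified
smaller = df.drop(index=["北京"], columns=["one", "two"])

print(smaller.shape)   # one row and two columns fewer than df
print(df.shape)        # still 4 rows and 4 columns
```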

4. pandas advanced: selecting and filtering by index

series = Series(np.arange(4),index=["a","b","c","d"])
series
Output:
a    0
b    1
c    2
d    3
dtype: int32

series[1]
Output, the value at position 1 (newer pandas versions recommend series.iloc[1] for positional access):
1


series[["a","b","c"]]

Output:
a    0
b    1
c    2
dtype: int32

# Filter with a boolean condition
series[series<2]

Output:
a    0
b    1
dtype: int32


# Slice by label. Note: label slices include both endpoints
series["c":]
Output:
c    2
d    3
dtype: int32
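
The inclusive behavior of label slices is easiest to see next to a positional slice. A short sketch on the same Series (the names by_label and by_position are introduced for illustration):

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

by_label = series.loc["b":"c"]    # label slice: both "b" and "c" are included
by_position = series.iloc[1:2]    # positional slice: the stop position is excluded

print(by_label)
print(by_position)
```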

5. pandas advanced: arithmetic

1. Adding Series
s1 = Series([4.7,1.5,2.3,6.3],index=["a","b","c","d"])
s2 = Series([1,2,3,4,5],index=["a","b","c","d","e"])

s1
Output:
a    4.7
b    1.5
c    2.3
d    6.3
dtype: float64

s2
Output:
a    1
b    2
c    3
d    4
e    5
dtype: int64

s1+s2
(A label that does not exist in both Series produces NaN)
Output:
a     5.7
b     3.5
c     5.3
d    10.3
e     NaN
dtype: float64


2. DataFrame arithmetic (add, subtract, multiply, divide)
df1 = DataFrame(np.arange(9).reshape((3,3)),columns=list("bcd"),index=["江苏","浙江","上海"])
df2 = DataFrame(np.arange(12).reshape((4,3)),columns=list("bde"),index=["安徽","江苏","浙江","北京"])

df1
Output:
        b   c   d
江苏  0   1   2
浙江  3   4   5
上海  6   7   8

df2
Output:
        b   d   e
安徽  0   1   2
江苏  3   4   5
浙江  6   7   8
北京  9   10  11

df1+df2 (only cells whose row and column labels exist in both frames are added)
Output:
        b    c     d    e
上海  NaN  NaN   NaN  NaN
北京  NaN  NaN   NaN  NaN
安徽  NaN  NaN   NaN  NaN
江苏  3.0  NaN   6.0  NaN
浙江  9.0  NaN  12.0  NaN


3. Handling the NaN values produced by adding DataFrames
df3 = DataFrame(np.arange(12).reshape((3,4)),columns=list("abcd"))
df4 = DataFrame(np.arange(20).reshape((4,5)),columns=list("abcde"))

df3
Output:
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9   10  11

df4
Output:
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2   10  11  12  13  14
3   15  16  17  18  19


df3+df4
Output:
      a     b     c     d    e
0   0.0   2.0   4.0   6.0  NaN
1   9.0  11.0  13.0  15.0  NaN
2  18.0  20.0  22.0  24.0  NaN
3   NaN   NaN   NaN   NaN  NaN


# Add df3 and df4; a cell missing from one frame is treated as 0 before the addition
df3.add(df4,fill_value=0)

Output:
    a       b       c       d       e
0   0.0     2.0     4.0     6.0     4.0
1   9.0     11.0    13.0    15.0    9.0
2   18.0    20.0    22.0    24.0    14.0
3   15.0    16.0    17.0    18.0    19.0


# Add df3 and df4 first, then replace the NaN values in the result with 0
df3.add(df4).fillna(0)
Output:
    a       b       c       d       e
0   0.0     2.0     4.0     6.0     0.0
1   9.0     11.0    13.0    15.0    0.0
2   18.0    20.0    22.0    24.0    0.0
3   0.0     0.0     0.0     0.0     0.0


# Addition with the add method, equivalent to df3+df4
df3.add(df4)
Output:
    a       b       c       d       e
0   0.0     2.0     4.0     6.0     NaN
1   9.0     11.0    13.0    15.0    NaN
2   18.0    20.0    22.0    24.0    NaN
3   NaN     NaN     NaN     NaN     NaN
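
The same fill_value idea applies to the other arithmetic methods. The sketch below uses sub and mul on the same two frames; choosing 1 as the fill value for multiplication is an assumption made here for illustration, since 1 leaves the other operand unchanged.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list("abcd"))
df4 = pd.DataFrame(np.arange(20).reshape((4, 5)), columns=list("abcde"))

diff = df3.sub(df4, fill_value=0)      # cells missing from one frame count as 0
product = df3.mul(df4, fill_value=1)   # cells missing from one frame count as 1

print(diff)
print(product)
```

A cell that is missing from both frames would still be NaN; that case does not occur with this data.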


# 4. Adding or subtracting a DataFrame and a Series
df5 = DataFrame(np.arange(12).reshape((4,3)),columns=list("abc"),index=["北京","上海","天津","芜湖"])
df5

Output:
        a   b   c
北京  0   1   2
上海  3   4   5
天津  6   7   8
芜湖  9   10  11


# Get one row of the DataFrame
series=df5.loc["北京"]
series

Output:
a    0
b    1
c    2
Name: 北京, dtype: int32


# Arithmetic between a DataFrame and a Series: the Series is matched on the column labels and applied to every row
df5+series
Output:
        a   b   c
北京  0   2   4
上海  3   5   7
天津  6   8   10
芜湖  9   11  13


df5-series
Output:
        a   b   c
北京  0   0   0
上海  3   3   3
天津  6   6   6
芜湖  9   9   9

Multiplication and division work the same way.
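
The operators above match the Series against the column labels. To match against the row labels instead, the arithmetic methods accept an axis argument. A minimal sketch with the same df5 (the names col and centered are introduced for illustration):

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list("abc"),
                   index=["北京", "上海", "天津", "芜湖"])

col = df5["a"]                     # one column, indexed by the row labels
centered = df5.sub(col, axis=0)    # match on the row index, apply to every column

print(centered)
```

Each row has its own "a" value subtracted, so every row of the result should read 0, 1, 2.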

6. pandas advanced: function application and mapping

df = DataFrame(np.random.randn(4,3),columns=list("abc"),index=["one","two","three","four"])
df

Output:
          a         b           c
one     -0.436913   -0.041775   1.409574
two     -0.438378   0.451142    -0.826057
three   -0.637730   1.305314    -0.590082
four    1.580527    0.683622    2.069763


# Take the absolute value of every element; returns a new object
np.abs(df)
Output:
            a           b           c
one     0.436913    0.041775    1.409574
two     0.438378    0.451142    0.826057
three   0.637730    1.305314    0.590082
four    1.580527    0.683622    2.069763


# Difference between the maximum and the minimum of each column
f = lambda x: x.max()-x.min()
df.apply(f)

Output:
a    2.218256
b    1.347089
c    2.895820
dtype: float64

# Difference between the maximum and the minimum of each row
df.apply(f,axis=1)
Output:
one      1.846487
two      1.277199
three    1.943043
four     1.386141
dtype: float64


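# Format each value with two decimal places
f = lambda x: "%.2f"%x
# def toFix(x):
#     return "%.2f"%x
# Apply the function to every element of the DataFrame (pandas 2.1 and later name this method DataFrame.map)
df.applymap(f)

Output: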
         a      b       c
one     -0.44   -0.04   1.41
two     -0.44   0.45    -0.83
three   -0.64   1.31    -0.59
four    1.58    0.68    2.07


Note:
# Difference between DataFrame.apply and DataFrame.applymap
# apply operates on a whole row or a whole column at a time
# applymap operates on each element of the DataFrame


# Use map on a Series
df["c"].map(f)

Output:
one       1.41
two      -0.83
three    -0.59
four      2.07
Name: c, dtype: object
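
Formatting with "%.2f" turns the numbers into strings, which is why the result above is no longer numeric. When the values should stay numeric, round is an alternative. The sketch below uses two rows of sample values copied from the output above; the names rounded and as_text are introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [-0.436913, 1.580527], "b": [-0.041775, 0.683622]},
                  index=["one", "two"])

rounded = df.round(2)                          # values stay float64
as_text = df["a"].map(lambda x: "%.2f" % x)    # values become strings

print(rounded.dtypes)
print(as_text.tolist())
```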

7. pandas advanced: simple statistical functions

from pandas import Series,DataFrame
import pandas as pd
import numpy as np
from numpy import nan as NA


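# Note: index=[list("abcd")] wraps the labels in an extra list, so pandas builds a one-level MultiIndex
# (visible in the df.info() output further down); use index=list("abcd") for a plain Index
df = DataFrame([[1.4,NA],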
                [7.1,-4.5],
                [NA,NA],
                [0.75,-1.3]],
              index=[list("abcd")],
              columns=["one","two"])
df
Output:
    one     two
a   1.40    NaN
b   7.10    -4.5
c   NaN NaN
d   0.75    -1.3


# Sum down each column; NaN values are skipped by default
df.sum()
Output:
one    9.25
two   -5.80
dtype: float64


# Sum across each row
df.sum(axis=1)
Output:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64


# Mean of each column by default; NaN values are excluded from the calculation
df.mean()
Output:
one    3.083333
two   -2.900000
dtype: float64


# Mean across each row. NaN values are skipped by default; with skipna=False, any row that contains NaN yields NaN
df.mean(axis=1,skipna=False)
Output:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
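
In the row sums shown earlier, the all-NaN row "c" comes out as 0.0. If such rows should stay NaN, sum accepts a min_count argument in pandas 0.22 and later (newer than the 0.20.3 used in this article). A minimal sketch, using a plain index=list("abcd") and the illustrative names default_sum and strict_sum:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list("abcd"), columns=["one", "two"])

default_sum = df.sum(axis=1)               # the all-NaN row "c" sums to 0.0
strict_sum = df.sum(axis=1, min_count=1)   # require at least one non-NaN value

print(default_sum)
print(strict_sum)
```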


# Cumulative sum, down each column by default
df.cumsum()
Output:
    one     two
a   1.40    NaN
b   8.50    -4.5
c   NaN NaN
d   9.25    -5.8


# Common summary statistics
df.describe()
Output:
        one         two
count   3.000000    2.000000
mean    3.083333    -2.900000
std     3.493685    2.262742
min     0.750000    -4.500000
25%     1.075000    -3.700000
50%     1.400000    -2.900000
75%     4.250000    -2.100000
max     7.100000    -1.300000


# Show summary information about the DataFrame
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4 entries, (a,) to (d,)
Data columns (total 2 columns):
one    3 non-null float64
two    2 non-null float64
dtypes: float64(2)
memory usage: 276.0+ bytes


# describe on a Series of strings
series=Series(["a","a","c","d"]*4)
series
Output:
0     a
1     a
2     c
3     d
4     a
5     a
6     c
7     d
8     a
9     a
10    c
11    d
12    a
13    a
14    c
15    d
dtype: object


series.describe()   # the statistics that describe reports depend on the dtype of the data
Output:
count     16
unique     3
top        a
freq       8
dtype: object


# Create sample data
obj = Series(["a","a","c","d"]*4)
obj
Output:
0     a
1     a
2     c
3     d
4     a
5     a
6     c
7     d
8     a
9     a
10    c
11    d
12    a
13    a
14    c
15    d
dtype: object


# Extract the unique values
obj.unique()
Output:
array(['a', 'c', 'd'], dtype=object)


# Count how many times each value occurs
obj.value_counts()
Output:
a    8
c    4
d    4
dtype: int64


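# The top-level pandas function does the same counting (deprecated since pandas 2.1; prefer Series.value_counts)
# sort: sort by the number of occurrences
# ascending: whether to sort in ascending order
pd.value_counts(obj.values,sort=True,ascending=True)
Output: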
d    4
c    4
a    8
dtype: int64


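# The isin method
# Tests, element by element, whether each value of the Series is contained in the list passed in
mask = obj.isin(["c","d"])
# Filter the data with the boolean mask
obj[mask]
Output: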
2     c
3     d
6     c
7     d
10    c
11    d
14    c
15    d
dtype: object
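
value_counts can also report relative frequencies instead of absolute counts. A short sketch on the same data (the names counts and shares are introduced for illustration):

```python
import pandas as pd

obj = pd.Series(["a", "a", "c", "d"] * 4)

counts = obj.value_counts()                 # absolute number of occurrences
shares = obj.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```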

8. pandas advanced: sorting by index

series = Series(np.arange(4),index=list("abdc"))
series
Output:
a    0
b    1
d    2
c    3
dtype: int32


# Sort a Series by its index
series.sort_index()
Output:
a    0
b    1
c    3
d    2
dtype: int32


# Sort a DataFrame by its index: row index or column index
frame = DataFrame(np.arange(8).reshape((2,4)),index=["two","one"],
                 columns=list("dabc"))
frame
Output:
    d   a   b   c
two 0   1   2   3
one 4   5   6   7


# Sort by the row index
frame.sort_index()
Output:
    d   a   b   c
one 4   5   6   7
two 0   1   2   3


# Sort by the column index
frame.sort_index(axis=1)
Output:
    a   b   c   d
two 1   2   3   0
one 5   6   7   4


# Index sorting is ascending by default; pass ascending=False to sort the column index in descending order
frame.sort_index(axis=1,ascending=False)
Output:
    d   c   b   a
two 0   3   2   1
one 4   7   6   5


# Sort the rows by the values in columns b and c, in descending order
frame.sort_values(by=["b","c"],ascending=False)
Output:
    d   a   b   c
one 4   5   6   7
two 0   1   2   3


# Create a Series with duplicate index labels
series = Series(np.arange(5),index=list("aabbd"))
series
Output:
a    0
a    1
b    2
b    3
d    4
dtype: int32


# Check whether the index is unique
series.index.is_unique    # False: there are duplicate labels; True: no duplicates
Output:
False
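
A non-unique index changes what label selection returns. The sketch below shows this and one possible way to keep only the first row per label; the names dup, single, and first_only are introduced for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(5), index=list("aabbd"))

dup = series.loc["a"]       # duplicated label: a Series with two rows
single = series.loc["d"]    # unique label: a single value

# Keep only the first row for each label
first_only = series[~series.index.duplicated(keep="first")]

print(dup)
print(single)
print(first_only)
```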

9. pandas advanced: handling missing values

from pandas import Series,DataFrame
import pandas as pd
import numpy as np
from numpy import nan as NA


data = Series([1,NA,3.5,NA,7])
data
Output:
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64


# Remove missing values with the dropna method
data.dropna()
Output:
0    1.0
2    3.5
4    7.0
dtype: float64


# Select the non-NaN values with a boolean mask
data[data.notnull()]
Output:
0    1.0
2    3.5
4    7.0
dtype: float64


# Removing missing values from a DataFrame
frame = DataFrame([[1,6.5,3],
                  [1,NA,NA],
                  [NA,NA,NA],
                  [NA,6.5,3]])
frame
Output:
    0   1   2
0   1.0 6.5 3.0
1   1.0 NaN NaN
2   NaN NaN NaN
3   NaN 6.5 3.0


clean = frame.dropna()    # by default, drops every row that contains at least one NaN
clean
Output:
    0   1   2
0   1.0 6.5 3.0


frame.dropna(how="all")  # drop only the rows in which every value is NaN
Output:
    0   1   2
0   1.0 6.5 3.0
1   1.0 NaN NaN
3   NaN 6.5 3.0


frame.dropna(axis=1,how="all")  # drop only the columns in which every value is NaN (there are none here, so nothing is dropped)
Output:
    0   1   2
0   1.0 6.5 3.0
1   1.0 NaN NaN
2   NaN NaN NaN
3   NaN 6.5 3.0
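
Between dropping any row with a NaN and dropping only all-NaN rows, dropna also offers the thresh argument, which keeps rows that have at least a given number of non-NaN values. A minimal sketch on the same frame (the name at_least_two is introduced for illustration):

```python
import numpy as np
import pandas as pd

NA = np.nan
frame = pd.DataFrame([[1, 6.5, 3],
                      [1, NA, NA],
                      [NA, NA, NA],
                      [NA, 6.5, 3]])

# Keep only the rows that have 2 or more non-NaN values
at_least_two = frame.dropna(thresh=2)

print(at_least_two)
```

Row 1 has a single non-NaN value and row 2 has none, so only rows 0 and 3 should remain.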


df = DataFrame(np.random.randn(7,3))
df
Output:
    0           1           2
0   0.482700    0.012109    0.296137
1   -1.252882   -0.848309   1.349779
2   -0.138185   1.265977    -0.296554
3   -0.646391   0.361134    -0.590972
4   0.386160    0.277657    0.952054
5   -2.542542   0.359897    -0.958098
6   0.537748    -0.124399   1.319221


series = Series([2,4,5,NA,8,NA,9])
series
Output:
0    2.0
1    4.0
2    5.0
3    NaN
4    8.0
5    NaN
6    9.0
dtype: float64


# Assign the values of series to df[0], the first column
df[0] = series
df
Output:
    0       1       2
0   2.0 0.012109    0.296137
1   4.0 -0.848309   1.349779
2   5.0 1.265977    -0.296554
3   NaN 0.361134    -0.590972
4   8.0 0.277657    0.952054
5   NaN 0.359897    -0.958098
6   9.0 -0.124399   1.319221


series = Series([NA,3.5,NA,-6.6,8,NA,"张三"])
series
Output:
0    NaN
1    3.5
2    NaN
3   -6.6
4      8
5    NaN
6     张三
dtype: object



# Replace df[2] with series
df[2] = series
df
Output:

    0       1       2
0   2.0 0.012109    NaN
1   4.0 -0.848309   3.5
2   5.0 1.265977    NaN
3   NaN 0.361134    -6.6
4   8.0 0.277657    8
5   NaN 0.359897    NaN
6   9.0 -0.124399   张三


# Call fillna with a dict to fill NaN with a different value for each column
df.fillna({0:11,2:0})
Output:
    0       1       2
0   2.0 0.012109    0
1   4.0 -0.848309   3.5
2   5.0 1.265977    0
3   11.0    0.361134    -6.6
4   8.0 0.277657    8
5   11.0    0.359897    0
6   9.0 -0.124399   张三


df.fillna(0,inplace=True)  # with inplace=True, fillna modifies the original object
df
Output:
    0   1   2
0   2.0 0.012109    0
1   4.0 -0.848309   3.5
2   5.0 1.265977    0
3   0.0 0.361134    -6.6
4   8.0 0.277657    8
5   0.0 0.359897    0
6   9.0 -0.124399   张三
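
A constant is not the only possible fill value. One common alternative is to fill each column with its own mean, since fillna also accepts a Series keyed by column label. The sketch below uses hypothetical columns "x" and "y" rather than the random data above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one NaN per column
df = pd.DataFrame({"x": [2.0, np.nan, 8.0],
                   "y": [np.nan, 4.0, 6.0]})

# df.mean() is a Series indexed by column name; each column's NaN
# is replaced by that column's mean
filled = df.fillna(df.mean())

print(filled)
```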

10. pandas advanced: combining value_counts with apply

from pandas import Series,DataFrame
import pandas as pd
import numpy as np
from numpy import nan as NA



data = DataFrame({"Qu1":[1,3,4,3,4],
                 "Qu2":[2,3,1,2,3],
                 "Qu3":[1,5,6,4,4]})
data
Output:
    Qu1 Qu2 Qu3
0   1   2   1
1   3   3   5
2   4   1   6
3   3   2   4
4   4   3   4


# For each column, count how many times each value occurs
data.apply(pd.value_counts).fillna(0)
Output:
    Qu1 Qu2 Qu3
1   1.0 1.0 1.0
2   0.0 2.0 0.0
3   2.0 2.0 0.0
4   2.0 0.0 2.0
5   0.0 0.0 1.0
6   0.0 0.0 1.0


result = data.apply(pd.value_counts)
result
Output:
    Qu1 Qu2 Qu3
1   1.0 1.0 1.0
2   NaN 2.0 NaN
3   2.0 2.0 NaN
4   2.0 NaN 2.0
5   NaN NaN 1.0
6   NaN NaN 1.0


mask = result.isnull()
mask
Output:
    Qu1     Qu2     Qu3
1   False   False   False
2   True    False   True
3   False   False   True
4   False   True    False
5   True    True    False
6   True    True    False
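
Since the top-level pd.value_counts function is deprecated in recent pandas releases, the same table can be sketched with the Series method instead. The lambda, the final cast to int, and the name counts are choices made here for illustration.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 6, 4, 4]})

# Count the values of each column, align the results on the value,
# then turn the missing combinations into 0
counts = (data.apply(lambda col: col.value_counts())
              .fillna(0)
              .astype(int)
              .sort_index())

print(counts)
```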


# Some ways of handling NaN
string_data = Series(["江苏","上海",NA,"南京"])
mask = string_data.isnull()
string_data
Output:
0     江苏
1     上海
2    NaN
3     南京
dtype: object

# Show the rows that are null
string_data[mask]
Output:
2    NaN
dtype: object


# Invert the mask with ~
string_data[~mask]
Output:
0    江苏
1    上海
3    南京
dtype: object


notnull = string_data.notnull()
string_data
Output:
0     江苏
1     上海
2    NaN
3     南京
dtype: object


notnull
Output:
0     True
1     True
2    False
3     True
dtype: bool


# Select the values that are not null
string_data[notnull]
Output:
0    江苏
1    上海
3    南京
dtype: object

11. pandas advanced: time and date series

from pandas import Series,DataFrame
import pandas as pd
import numpy as np
from datetime import datetime

# A list of dates given as strings
datestr=["7-6-2011","8-6-2022"]


# Convert the date strings to a DatetimeIndex with pandas
pd.to_datetime(datestr)
Output:
DatetimeIndex(['2011-07-06', '2022-08-06'], dtype='datetime64[ns]', freq=None)

idx = pd.to_datetime(datestr+[None])
idx
Output:
DatetimeIndex(['2011-07-06', '2022-08-06', 'NaT'], dtype='datetime64[ns]', freq=None)

# Access one element of the DatetimeIndex by position
idx[1]
Output:
Timestamp('2022-08-06 00:00:00')


# Check which values in the DatetimeIndex are NaT
pd.isnull(idx)
Output:
array([False, False,  True])


mask = pd.isnull(idx)
mask
Output:
array([False, False,  True])


idx[~mask]
Output:
DatetimeIndex(['2011-07-06', '2022-08-06'], dtype='datetime64[ns]', freq=None)
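
Strings such as "7-6-2011" are ambiguous: the output above reads them month first. Passing an explicit format removes the guesswork. A minimal sketch with the same strings (the names month_first and day_first are introduced for illustration):

```python
import pandas as pd

datestr = ["7-6-2011", "8-6-2022"]

month_first = pd.to_datetime(datestr, format="%m-%d-%Y")   # 7-6 read as July 6
day_first = pd.to_datetime(datestr, format="%d-%m-%Y")     # 7-6 read as June 7

print(month_first)
print(day_first)
```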


# Create a time series indexed by datetime objects
dates = [datetime(2011,1,2),datetime(2012,3,3),datetime(2014,5,6),datetime(2017,7,8),datetime(2012,11,2),datetime(2001,10,2)]
ts  = Series(np.random.randn(6),index=dates)
ts
Output:
2011-01-02   -0.842407
2012-03-03    0.312516
2014-05-06    0.114702
2017-07-08   -1.109900
2012-11-02    0.625863
2001-10-02   -0.176027
dtype: float64


stamp = ts.index[2]
stamp
Output:
Timestamp('2014-05-06 00:00:00')


ts[stamp]
Output:
0.11470208297245334


ts["1/2/2011"]
Output:
2011-01-02   -0.842407
dtype: float64


# Select only a given month or year
ts["2011-1"]
Output:
2011-01-02   -0.842407
dtype: float64


# Create a date-time series with pd.date_range()
long_str = Series(np.random.randn(1000),index=pd.date_range("2000-1-1",periods=1000))
long_str

# Select only the data for 2002
long_str["2002"]


# Select only the data for February 2002
long_str["2002-2"]
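
The selections above rely on partial string indexing. The sketch below shows the same idea with .loc and deterministic values, so the sizes of the results can be checked: 1000 daily entries starting on 2000-01-01 run through 2002-09-26. The names ts, year_2002, feb_2002, and span are introduced for illustration.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(1000),
               index=pd.date_range("2000-01-01", periods=1000))

year_2002 = ts.loc["2002"]                   # every row dated in 2002
feb_2002 = ts.loc["2002-02"]                 # every row dated in February 2002
span = ts.loc["2001-12-30":"2002-01-02"]     # string slices include both endpoints

print(len(year_2002), len(feb_2002), len(span))
```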


# Create a date range between two endpoints
index=pd.date_range("4/1/2012","2012-6-2")
index
Output:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01', '2012-06-02'],
              dtype='datetime64[ns]', freq='D')


# Build a date range from start and periods: begin at start and count forward; periods is the number of dates to generate
date_index = pd.date_range(start = "4/1/2011",periods=30)
date_index
Output:
DatetimeIndex(['2011-04-01', '2011-04-02', '2011-04-03', '2011-04-04',
               '2011-04-05', '2011-04-06', '2011-04-07', '2011-04-08',
               '2011-04-09', '2011-04-10', '2011-04-11', '2011-04-12',
               '2011-04-13', '2011-04-14', '2011-04-15', '2011-04-16',
               '2011-04-17', '2011-04-18', '2011-04-19', '2011-04-20',
               '2011-04-21', '2011-04-22', '2011-04-23', '2011-04-24',
               '2011-04-25', '2011-04-26', '2011-04-27', '2011-04-28',
               '2011-04-29', '2011-04-30'],
              dtype='datetime64[ns]', freq='D')


# Build a date range from end and periods: finish at end and count backward; periods is the number of dates to generate
date_index = pd.date_range(end = "4/1/2011",periods=30)
date_index
Output:
DatetimeIndex(['2011-03-03', '2011-03-04', '2011-03-05', '2011-03-06',
               '2011-03-07', '2011-03-08', '2011-03-09', '2011-03-10',
               '2011-03-11', '2011-03-12', '2011-03-13', '2011-03-14',
               '2011-03-15', '2011-03-16', '2011-03-17', '2011-03-18',
               '2011-03-19', '2011-03-20', '2011-03-21', '2011-03-22',
               '2011-03-23', '2011-03-24', '2011-03-25', '2011-03-26',
               '2011-03-27', '2011-03-28', '2011-03-29', '2011-03-30',
               '2011-03-31', '2011-04-01'],
              dtype='datetime64[ns]', freq='D')
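
All the ranges above use the default daily frequency (freq='D'). The freq argument selects a different step; the sketch below compares the default with "B" (business days) over a ten-day window chosen here for illustration.

```python
import pandas as pd

daily = pd.date_range("2011-04-01", "2011-04-10")                # default freq="D"
business = pd.date_range("2011-04-01", "2011-04-10", freq="B")   # weekdays only

print(len(daily))
print(business)
```

2011-04-01 was a Friday, so the business-day range should skip the two weekend days at each end of the following week.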
