Data_analysis

xiongsheng666

Last modified 2022-03-28 17:08:23

First published 2020-01-13 14:27:07

Article link: https://blog.csdn.net/weixin_45523107/article/details/103957325

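Data Analysis 01

1. The numpy module

What is data analysis?

Data analysis extracts the information hidden behind seemingly chaotic data and distills the underlying patterns of the subject being studied.
It applies appropriate methods to large amounts of collected data to help people make judgments and then take appropriate action, for example:
- how much of a product to purchase
- how much stock headquarters should ship to each regional agent
- …

Why learn data analysis?

There is demand for it on the job market.
It is the foundation of data science in Python.
It is the foundation for machine learning courses.

The data analysis workflow

Pose the question
Prepare the data
Analyze the data
Draw conclusions
Visualize the results

The three core data analysis libraries

numpy
pandas
matplotlib

The numpy module: one-dimensional or multi-dimensional arrays (think of them as a more restricted version of a list)

NumPy (Numerical Python) is the foundational library for scientific computing in Python. It focuses on numerical computation, underlies most of Python's scientific libraries, and is mostly used for numerical operations on large, multi-dimensional arrays.

Creating numpy arrays

Create with np.array()
Create with plt (plt.imread() loads an image as an array)
Create with numpy's array-creation routines
Using array() to create a one-dimensional array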

In [3]:

import numpy as np
arr = np.array([1,2,3,4,5,6])
arr

Out[3]:

array([1, 2, 3, 4, 5, 6])

Using array() to create a multi-dimensional array

In [4]:

np.array([[1,2,3,4],[5,6,7,8],[9,9,9,9]])

Out[4]:

array([[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 9, 9, 9]])

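How do arrays differ from lists?
- All elements stored in an array must have the same data type
- When the inputs are of mixed types, they are promoted by priority:
  - str > float > int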

In [6]:

arr = np.array([1,2.2,3,4,5,6])
arr

Out[6]:

array([1. , 2.2, 3. , 4. , 5. , 6. ])
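
The promotion rule above can be checked directly. The following is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

ints = np.array([1, 2, 3])            # all ints    -> integer dtype
mixed = np.array([1, 2.2, 3])         # int + float -> float64
with_text = np.array([1, 2.2, "3"])   # any str     -> unicode string dtype

# dtype.kind is 'i' for integers, 'f' for floats, 'U' for unicode strings
print(ints.dtype.kind, mixed.dtype.kind, with_text.dtype.kind)
```

Once a string is present, every element is stored as text, so arithmetic on `with_text` no longer works.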

Load an external image into a numpy array, then change the values of the array elements and see how that affects the original picture

In [10]:

import matplotlib.pyplot as plt
img_arr = plt.imread('./1.jpg')
plt.imshow(img_arr)

Out[10]:

In [11]:

plt.imshow(img_arr-100)

Out[11]:

<matplotlib.image.AxesImage at 0x165794d4e48>

zeros()
ones()
linspace()
arange()
the random family

In [12]:

np.zeros((3,4))

Out[12]:

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [13]:

np.linspace(0,100,num=20)

Out[13]:

array([  0.        ,   5.26315789,  10.52631579,  15.78947368,
        21.05263158,  26.31578947,  31.57894737,  36.84210526,
        42.10526316,  47.36842105,  52.63157895,  57.89473684,
        63.15789474,  68.42105263,  73.68421053,  78.94736842,
        84.21052632,  89.47368421,  94.73684211, 100.        ])

In [14]:

np.arange(0,100,step=3)

Out[14]:

array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48,
       51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99])

In [16]:

np.random.randint(0,100,size=(5,6))

Out[16]:

array([[71, 76, 47, 11,  7,  6],
       [47, 89, 70, 44, 41, 96],
       [58, 42, 36, 53, 49, 55],
       [13, 32, 64, 58, 15,  7],
       [78, 56, 40, 71, 45, 63]])

In [18]:

np.random.random((3,4))

Out[18]:

array([[0.24913375, 0.91988476, 0.36386714, 0.58404557],
       [0.15544885, 0.73892461, 0.82189615, 0.80368295],
       [0.07230386, 0.45535116, 0.75370029, 0.03377829]])

Randomness:
- Random factor (seed): x (for example, the current time)

In [23]:

# fix the seed so the random output is reproducible
np.random.seed(10)
np.random.randint(0,100,size=(5,6))

Out[23]:

array([[ 9, 15, 64, 28, 89, 93],
       [29,  8, 73,  0, 40, 36],
       [16, 11, 54, 88, 62, 33],
       [72, 78, 49, 51, 54, 77],
       [69, 13, 25, 13, 92, 86]])
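
To see what fixing the seed buys us, here is a small sketch that draws the same array twice. The names `first`, `second` and `same` are illustrative only.

```python
import numpy as np

np.random.seed(10)
first = np.random.randint(0, 100, size=(5, 6))

np.random.seed(10)                      # reset to the same seed
second = np.random.randint(0, 100, size=(5, 6))

same = np.array_equal(first, second)    # identical seed, identical draws
print(same)
```

Without the second `seed` call the two arrays would almost certainly differ.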

Common numpy attributes

shape
ndim
size
dtype

In [30]:

img_arr.shape
img_arr.ndim
img_arr.size
img_arr.dtype
type(img_arr)

Out[30]:

numpy.ndarray

In [32]:

arr = np.array([1,2,3],dtype='uint8')

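numpy data types

array(dtype=?): sets the data type when the array is created
arr.dtype = '?': changes the data type of an existing array. Note that this reinterprets the underlying bytes rather than converting the values; arr.astype('?') returns a properly converted copy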

In [38]:

arr = np.array([1,2,3])

In [39]:

arr.dtype = 'int32'
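
The difference between converting and reinterpreting can be sketched as follows. This example is an addition, not part of the original notebook; it uses `view()`, which reads the existing bytes under a new dtype much like assigning to `arr.dtype` does, and it starts from an explicit `int64` array so the result does not depend on the platform's default integer size.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype='int64')

converted = arr.astype('int32')     # new array, values preserved
reinterpreted = arr.view('int32')   # same bytes read as int32, so twice as many items

print(converted.shape, reinterpreted.shape)
```

For ordinary type conversion, `astype` is the safer choice.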

Indexing and slicing numpy arrays (key topic)

Indexing works the same way as with lists

In [40]:

arr = np.random.randint(0,100,size=(6,8))
arr

Out[40]:

array([[30, 30, 89, 12, 65, 31, 57, 36],
       [27, 18, 93, 77, 22, 23, 94, 11],
       [28, 74, 88,  9, 15, 18, 80, 71],
       [88, 11, 17, 46,  7, 75, 28, 33],
       [84, 96, 88, 44,  5,  4, 71, 88],
       [88, 50, 54, 34, 15, 77, 88, 15]])

In [41]:

arr[1]

Out[41]:

array([27, 18, 93, 77, 22, 23, 94, 11])

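Slicing
- Slice out the first two columns
- Slice out the first two rows
- Slice out the first two columns of the first two rows
- Reverse the array data
- Exercise: flip an image vertically and horizontally
- Exercise: crop a specified region of an image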

In [44]:

arr.shape

Out[44]:

(6, 8)

In [43]:

# slice out the first two rows
arr[0:2]

Out[43]:

array([[30, 30, 89, 12, 65, 31, 57, 36],
       [27, 18, 93, 77, 22, 23, 94, 11]])

In [45]:

# slice out the first two columns: arr[rows, cols]
arr[:,0:2]

Out[45]:

array([[30, 30],
       [27, 18],
       [28, 74],
       [88, 11],
       [84, 96],
       [88, 50]])

In [46]:

# slice out the first two columns of the first two rows
arr[0:2,0:2]

Out[46]:

array([[30, 30],
       [27, 18]])

In [49]:

# reversing array data (the original image first)
plt.imshow(img_arr)

Out[49]:

<matplotlib.image.AxesImage at 0x16578fb6a20>

In [50]:

img_arr.shape  # the first two dimensions are the pixel rows and columns, the last is the color channel

Out[50]:

(426, 640, 3)

In [51]:

# flip the image upside down
plt.imshow(img_arr[::-1,:,:])

Out[51]:

<matplotlib.image.AxesImage at 0x16578b53080>

In [52]:

plt.imshow(img_arr[:,::-1,:])

Out[52]:

<matplotlib.image.AxesImage at 0x16579085f28>

In [53]:

plt.imshow(img_arr[::-1,::-1,::-1])

Out[53]:

<matplotlib.image.AxesImage at 0x16578bfc588>

In [54]:

# cropping
plt.imshow(img_arr)

Out[54]:

<matplotlib.image.AxesImage at 0x165791662b0>

In [55]:

plt.imshow(img_arr[50:200,50:300,:])

Out[55]:

<matplotlib.image.AxesImage at 0x16578f80c88>

Slicing summary:
- Slice rows: arr[index1:index3]
- Slice columns: arr[row_slice, col_slice]
- Reverse: arr[::-1]
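
The same three operations can be tried without an image file. The sketch below uses a small made-up grid in place of `img_arr`; the names are illustrative.

```python
import numpy as np

# A tiny 3x4 grid standing in for an image
grid = np.arange(12).reshape(3, 4)

flipped_rows = grid[::-1]        # reverse the row order (upside down)
flipped_cols = grid[:, ::-1]     # reverse the column order (mirror image)
cropped = grid[0:2, 1:3]         # rows 0-1, columns 1-2

print(flipped_rows)
print(flipped_cols)
print(cropped)
```

An image array simply has one more trailing dimension for the color channel, which is why the notebook writes `img_arr[::-1,:,:]`.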

Reshaping with reshape

The array must contain the same number of elements before and after reshaping

In [57]:

arr = np.array([1,2,3,4,5,6])
arr

Out[57]:

array([1, 2, 3, 4, 5, 6])

In [60]:

# reshape the 1-D array into 2-D
arr.reshape((2,3))

Out[60]:

array([[1, 2, 3],
       [4, 5, 6]])

In [61]:

arr.reshape((-1,2))

Out[61]:

array([[1, 2],
       [3, 4],
       [5, 6]])

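Concatenation: concatenate

Joins numpy arrays horizontally or vertically
Understanding the axis argument
- 0: the column direction (arrays are stacked vertically)
- 1: the row direction (arrays are joined side by side)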

In [64]:

arr1 = np.array([[1,2,3],[4,5,6]])
arr1

Out[64]:

array([[1, 2, 3],
       [4, 5, 6]])

In [68]:

np.concatenate((arr1,arr1),axis=1)

Out[68]:

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

In [69]:

arr2 = np.array([[1,2,3,3],[4,5,6,6]])
arr2

Out[69]:

array([[1, 2, 3, 3],
       [4, 5, 6, 6]])

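Matching concatenation
- the arrays being joined all have the same shape
Non-matching concatenation
- the arrays being joined have different shapes (the number of dimensions must still be the same)
  - arrays with the same number of rows can be joined side by side (axis=1)
  - arrays with the same number of columns can be stacked vertically (axis=0)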

In [71]:

# concatenate arr1 and arr2
np.concatenate((arr1,arr2),axis=1)

Out[71]:

array([[1, 2, 3, 1, 2, 3, 3],
       [4, 5, 6, 4, 5, 6, 6]])
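
A short sketch of why the axis matters for arrays of different shapes: `arr1` and `arr2` share a row count, so only the side-by-side join works. The `stacked_ok` flag is an illustrative name added for this example.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
arr2 = np.array([[1, 2, 3, 3], [4, 5, 6, 6]])  # shape (2, 4)

side_by_side = np.concatenate((arr1, arr2), axis=1)  # row counts match -> shape (2, 7)

try:
    np.concatenate((arr1, arr2), axis=0)  # column counts differ (3 vs 4)
    stacked_ok = True
except ValueError:
    stacked_ok = False

print(side_by_side.shape, stacked_ok)
```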

Common aggregation operations

sum, max, min, mean

In [74]:

arr = np.random.randint(0,10,size=(4,5))
arr

Out[74]:

array([[6, 6, 5, 6, 0],
       [0, 6, 9, 1, 8],
       [9, 1, 2, 8, 9],
       [9, 5, 0, 2, 7]])

In [77]:

arr.sum(axis=1)

Out[77]:

array([23, 24, 29, 23])

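Common math functions

NumPy provides the standard trigonometric functions: sin(), cos(), tan()
numpy.around(a, decimals) returns the values rounded to the given number of decimals.
- Parameters:
  - a: the array
  - decimals: number of decimal places to round to. Defaults to 0. If negative, rounding happens at positions to the left of the decimal point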

In [78]:

np.sin(arr)

Out[78]:

array([[-0.2794155 , -0.2794155 , -0.95892427, -0.2794155 ,  0.        ],
       [ 0.        , -0.2794155 ,  0.41211849,  0.84147098,  0.98935825],
       [ 0.41211849,  0.84147098,  0.90929743,  0.98935825,  0.41211849],
       [ 0.41211849, -0.95892427,  0.        ,  0.90929743,  0.6569866 ]])

In [81]:

arr = np.random.random(size=(3,4))
arr

Out[81]:

array([[0.07961309, 0.30545992, 0.33071931, 0.7738303 ],
       [0.03995921, 0.42949218, 0.31492687, 0.63649114],
       [0.34634715, 0.04309736, 0.87991517, 0.76324059]])

In [83]:

np.around(arr,decimals=2)

Out[83]:

array([[0.08, 0.31, 0.33, 0.77],
       [0.04, 0.43, 0.31, 0.64],
       [0.35, 0.04, 0.88, 0.76]])

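Common statistical functions

numpy.amin() and numpy.amax() compute the minimum and maximum of the array elements along a given axis.
numpy.ptp() computes the range of the elements (maximum - minimum).
numpy.median() computes the median of the elements of array a.
Standard deviation, std(): a measure of how far a set of values spreads out from its mean.
- Formula: std = sqrt(mean((x - x.mean())**2))
- If the array is [1, 2, 3, 4], its mean is 2.5. The squared deviations are therefore [2.25, 0.25, 0.25, 2.25], their mean is 5/4, and the square root of that mean, sqrt(5/4), is 1.1180339887498949.
Variance, var(): the mean of the squared differences between each value and the mean of all values, i.e. mean((x - x.mean())**2). With numpy's default settings this is the population variance (divide by N), not the sample variance (divide by N - 1). In other words, the standard deviation is the square root of the variance.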
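
The worked example for [1, 2, 3, 4] can be reproduced step by step. This is a sketch with illustrative variable names; `ddof=1` is numpy's switch for the sample variance.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
squared_dev = (x - x.mean()) ** 2      # [2.25, 0.25, 0.25, 2.25]
manual_var = squared_dev.mean()        # 1.25
manual_std = np.sqrt(manual_var)       # sqrt(5/4)

sample_var = np.var(x, ddof=1)         # divides by N - 1 instead of N

print(manual_std, np.std(x), manual_var, np.var(x), sample_var)
```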

In [85]:

arr = np.random.randint(0,20,size=(5,3))
arr

Out[85]:

array([[12, 18, 17],
       [17, 16,  0],
       [ 5,  9,  0],
       [ 6,  0,  2],
       [ 3,  3, 18]])

In [86]:

np.amin(arr,axis=0)

Out[86]:

array([3, 0, 0])

In [87]:

np.ptp(arr,axis=0)

Out[87]:

array([14, 18, 18])

In [88]:

np.median(arr,axis=0)

Out[88]:

array([6., 9., 2.])

In [93]:

np.std(arr,axis=0)

Out[93]:

array([5.16139516, 7.02566723, 8.28492607])

In [94]:

np.var(arr,axis=0)

Out[94]:

array([26.64, 49.36, 68.64])

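Matrices

NumPy includes a matrix library, numpy.matlib, whose functions return a matrix rather than an ndarray object. A matrix is a rectangular array of elements arranged in rows and columns.
matlib.empty() returns a new matrix; the syntax is numpy.matlib.empty(shape, dtype). The entries are uninitialized, so they hold whatever arbitrary values happened to be in memory
- Parameters:
  - shape: an int or tuple of ints defining the shape of the new matrix
  - dtype: optional, the data type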

In [98]:

import numpy.matlib as matlib
matlib.empty(shape=(4,5))

Out[98]:

matrix([[-0.2794155 , -0.2794155 , -0.95892427, -0.2794155 ,  0.        ],
        [ 0.        , -0.2794155 ,  0.41211849,  0.84147098,  0.98935825],
        [ 0.41211849,  0.84147098,  0.90929743,  0.98935825,  0.41211849],
        [ 0.41211849, -0.95892427,  0.        ,  0.90929743,  0.6569866 ]])

numpy.matlib.zeros() and numpy.matlib.ones() return a matrix filled with 0s or 1s

In [ ]:

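numpy.matlib.eye() returns a matrix with ones on a diagonal and zeros everywhere else.
- numpy.matlib.eye(n, M, k, dtype)
  - n: number of rows of the returned matrix
  - M: number of columns of the returned matrix, defaults to n
  - k: index of the diagonal
  - dtype: data type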

In [101]:

matlib.eye(5,5,1)

Out[101]:

matrix([[0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.]])

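numpy.matlib.identity() returns the identity matrix of the given size. The identity matrix is a square matrix whose elements on the diagonal from the top-left to the bottom-right corner (the main diagonal) are all 1 and whose other elements are all 0.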

In [99]:

matlib.identity(6)

Out[99]:

matrix([[1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 1.]])

Transposing a matrix
- .T

In [103]:

arr = matlib.identity(6)
arr

Out[103]:

matrix([[1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 1.]])

In [106]:

a = np.array([[1,2,3],[4,5,6]])
a

Out[106]:

array([[1, 2, 3],
       [4, 5, 6]])

In [107]:

a.T

Out[107]:

array([[1, 4],
       [2, 5],
       [3, 6]])

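Matrix multiplication
- numpy.dot(a, b, out=None)
  - a: ndarray
  - b: ndarray
- Each number in the first row of the first matrix (2 and 1) is multiplied by the number in the corresponding position of the first column of the second matrix (1 and 1), and the products are added up (2 x 1 + 1 x 1) to give 3, the top-left value of the result matrix. In other words, the value where row m and column n of the result cross equals the sum of the products of the corresponding entries of row m of the first matrix and column n of the second matrix.
- A matrix-based derivation in linear algebra:
  - https://www.cnblogs.com/alantu2018/p/8528299.html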

In [109]:

arr_1 = np.array([[1,2,3],[4,5,6]])  # 2 rows, 3 columns
arr_2 = np.array([[1,2,3],[4,5,6]]) 
arr_2 = arr_2.T

In [110]:

arr_1

Out[110]:

array([[1, 2, 3],
       [4, 5, 6]])

In [111]:

arr_2

Out[111]:

array([[1, 4],
       [2, 5],
       [3, 6]])

In [112]:

np.dot(arr_1,arr_2)

Out[112]:

array([[14, 32],
       [32, 77]])
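
The row-times-column rule described above can be written out with explicit loops and compared against np.dot. The helper `matmul_by_hand` is a hypothetical name introduced for this sketch; it is meant only to illustrate the rule, not to replace np.dot.

```python
import numpy as np

def matmul_by_hand(a, b):
    """Multiply two 2-D arrays using the row-times-column rule."""
    rows, inner = a.shape
    inner_b, cols = b.shape
    assert inner == inner_b, "columns of a must equal rows of b"
    result = np.zeros((rows, cols), dtype=a.dtype)
    for m in range(rows):
        for n in range(cols):
            # entry (m, n) = sum of products of row m of a and column n of b
            result[m, n] = sum(a[m, k] * b[k, n] for k in range(inner))
    return result

arr_1 = np.array([[1, 2, 3], [4, 5, 6]])
arr_2 = arr_1.T
by_hand = matmul_by_hand(arr_1, arr_2)
print(by_hand)
```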

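Key points to master:
- creating arrays
- indexing and slicing arrays
- concatenating and reshaping data
- numpy aggregation (sum, max, mean) and statistical functions (std())
- how matrix multiplication works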

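2. Basic pandas operations

Why learn pandas?

numpy can already help us process data, so what is the point of learning pandas?
- numpy handles numeric data. Data analysis also involves many other kinds of data (strings, time series), and pandas handles those non-numeric types well.

What is pandas?

Let us start with the two most commonly used classes in pandas
- Series
- DataFrame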

In [8]:

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

Series

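A Series is an object similar to a one-dimensional array and consists of two parts:
- values: a set of data (of type ndarray)
- index: the labels associated with the data
Creating a Series
- from a list or a numpy array
- from a dictionary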

In [3]:

s = Series(data=[1,2,3,4,5])
s

Out[3]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

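Series indexes
- Implicit index: the default (integer positions)
- Explicit index: improves the readability of the data
  - specified through the index parameter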

In [4]:

s1 = Series(data=[1,2,3],index=['a','b','c'])
s1

Out[4]:

a    1
b    2
c    3
dtype: int64

In [7]:

dic = {
    '数学':100,
    '理综':188
}
s3 = Series(data=dic)
s3

Out[7]:

数学    100
理综    188
dtype: int64

In [10]:

s4 = Series(data=np.random.randint(0,100,size=(3,)))
s4

Out[10]:

0     9
1    91
2    24
dtype: int32

Indexing and slicing a Series

In [12]:

s1

Out[12]:

a    1
b    2
c    3
dtype: int64

In [16]:

s1['a']
s1[0]
s1.a

Out[16]:

In [19]:

s1[0:2]
s1['a':'c']

Out[19]:

a    1
b    2
c    3
dtype: int64

Common Series attributes
- shape
- size
- index
- values

In [24]:

s1.shape
s1.size
s1.index
s1.values

Out[24]:

array([1, 2, 3], dtype=int64)

Common Series methods
- head(),tail()
- unique()
- isnull(),notnull()
- add() sub() mul() div()

In [26]:

s1.head(2)  # show only the first two values
s1.tail(2)

Out[26]:

b    2
c    3
dtype: int64

Series arithmetic

In [27]:

s1 = Series(data=[1,2,3,4],index=['a','b','c','d'])
s2 = Series(data=[1,2,3,4],index=['a','b','e','d'])
s1

Out[27]:

a    1
b    2
c    3
d    4
dtype: int64

In [28]:

s2

Out[28]:

a    1
b    2
e    3
d    4
dtype: int64

Rules of Series arithmetic:
- elements whose index labels match are combined; everywhere else the result is filled with a null (NaN)

In [29]:

s = s1+s2
s

Out[29]:

a    2.0
b    4.0
c    NaN
d    8.0
e    NaN
dtype: float64
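
If NaN is not the desired result for unmatched labels, the add() method listed earlier accepts a fill value. This is a minimal sketch, assuming that treating the missing side as 0 is what is wanted; the names `plain_sum` and `filled_sum` are illustrative.

```python
from pandas import Series

s1 = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = Series(data=[1, 2, 3, 4], index=['a', 'b', 'e', 'd'])

plain_sum = s1 + s2                     # 'c' and 'e' become NaN
filled_sum = s1.add(s2, fill_value=0)   # a label missing on one side counts as 0

print(filled_sum)
```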

Filtering null (missing) values in a Series
- isnull, notnull: test whether each element is null

In [30]:

s.isnull()

Out[30]:

a    False
b    False
c     True
d    False
e     True
dtype: bool

In [33]:

# using implicit and explicit indexes
s[[0,1,2]]
s[['a','c']]

Out[33]:

a    2.0
c    NaN
dtype: float64

In [35]:

Out[35]:

a    2.0
b    4.0
c    NaN
d    8.0
e    NaN
dtype: float64

In [36]:

# using boolean values as the index
s[[True,True,False,True,False]]

Out[36]:

a    2.0
b    4.0
d    8.0
dtype: float64

In [37]:

s.notnull()

Out[37]:

a     True
b     True
c    False
d     True
e    False
dtype: bool

In [38]:

s[s.notnull()]

Out[38]:

a    2.0
b    4.0
d    8.0
dtype: float64

DataFrame

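A DataFrame is a tabular data structure made up of several columns of data arranged in a fixed order. It was designed to extend the use of Series from one dimension to several. A DataFrame has both a row index and a column index.
- Row index: index
- Column index: columns
- Values: values
Creating a DataFrame
- from an ndarray
- from a dictionary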

In [39]:

DataFrame(data=np.random.randint(0,100,size=(4,6)))

Out[39]:

	0	1	2	3	4	5
0	93	24	73	95	46	36
1	17	98	7	13	79	34
2	82	51	52	21	4	50
3	77	23	91	31	6	12

In [42]:

dic = {
    'name':['张三','李四','王老五'],
    'salary':[10000,20000,15555]
}
df = DataFrame(data=dic,index=['a','b','c'])
df

Out[42]:

	name	salary
a	张三	10000
b	李四	20000
c	王老五	15555

DataFrame attributes
- values、columns、index、shape

In [46]:

df.values
df.columns
df.index
df.shape

Out[46]:

(3, 2)

============================================

Exercise 4:

Build a DataFrame named df from the exam score table below:

    张三  李四  
语文 150  0
数学 150  0
英语 150  0
理综 300  0

============================================

DataFrame indexing
- indexing rows
- indexing columns
- indexing individual elements

In [49]:

df

Out[49]:

	name	salary
a	张三	10000
b	李四	20000
c	王老五	15555

In [51]:

# select the first column
df['name']

Out[51]:

a     张三
b     李四
c    王老五
Name: name, dtype: object

In [52]:

# select several columns
df[['name','salary']]

Out[52]:

	name	salary
a	张三	10000
b	李四	20000
c	王老五	15555

In [56]:

# select one row
df.loc['a']

Out[56]:

name         张三
salary    10000
Name: a, dtype: object

In [54]:

# select several rows
df.loc[['a','c']]

Out[54]:

	name	salary
a	张三	10000
c	王老五	15555

In [58]:

df.iloc[[1,2]]

Out[58]:

	name	salary
b	李四	20000
c	王老五	15555

loc['explicit index']
iloc[implicit index]

In [59]:

df

Out[59]:

	name	salary
a	张三	10000
b	李四	20000
c	王老五	15555

In [61]:

# select a single element (李四's salary)
df.iloc[1,1]
df.loc['b','salary']

Out[61]:

In [62]:

# select several elements
df.loc[['a','c'],'salary']

Out[62]:

a    10000
c    15555
Name: salary, dtype: int64

Slicing a DataFrame
- slicing rows
- slicing columns

In [63]:

# slice out the first two rows
df[0:2]

Out[63]:

	name	salary
a	张三	10000
b	李四	20000

In [65]:

# slice out the first two columns
df.iloc[:,0:2]

Out[65]:

	name	salary
a	张三	10000
b	李四	20000
c	王老五	15555

Summary of indexing and slicing
- Indexing:
  - df[col]: one column
  - df[[col1,col2]]: several columns
  - df.loc[row]: one row
  - df.loc[[row1,row2]]: several rows
  - df.loc[row,col]: one element
- Slicing
  - rows: df[row1:row3]
  - columns: df.loc[:,col1:col3]
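
One detail the summary does not spell out is that label slices and positional slices treat their end point differently. The sketch below rebuilds the small salary table from earlier; the variable names on the left are illustrative.

```python
from pandas import DataFrame

df = DataFrame(
    data={'name': ['张三', '李四', '王老五'], 'salary': [10000, 20000, 15555]},
    index=['a', 'b', 'c'],
)

cell_by_label = df.loc['b', 'salary']      # explicit (label) index
cell_by_position = df.iloc[1, 1]           # implicit (positional) index
label_rows = df.loc['a':'b']               # label slices include the end label
position_rows = df.iloc[0:2]               # positional slices exclude the end position

print(cell_by_label, cell_by_position)
print(label_rows)
```

Both slices return rows a and b, even though one is written `'a':'b'` and the other `0:2`.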

In [ ]:

DataFrame arithmetic: follows the same rules as Series arithmetic

============================================

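Exercise:

Suppose ddd holds the midterm exam scores and ddd2 the final exam scores. Create ddd2 however you like, add it to ddd, and compute the average of the midterm and final scores.
Suppose 张三 is caught cheating on the midterm math exam and the score must be recorded as 0. How would you do that?
李四 is rewarded for reporting 张三: 100 points are added to every midterm subject. How would you do that?
Later the teacher finds that one question was wrong and, to placate the students, adds 10 points to every subject for every student. How would you do that?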

============================================

In [ ]:

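Converting to a datetime type
- pd.to_datetime(col)
Setting a column as the row index
- df.set_index()
Stocks:
- Use the tushare package to fetch the historical quotes of a stock.
  - tushare is a financial data interface package that provides all kinds of historical trading data
  - Install tushare: pip install tushare
- Output every date on which the stock closed more than 3% above its open.
- Output every date on which the stock opened more than 2% below the previous day's close.
- Suppose that, starting on January 1, 2010, I buy 1 lot of the stock on the first trading day of every month and sell all shares on the last trading day of every year. What is my profit as of today?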

In [70]:

import tushare as ts
df = ts.get_k_data('600519',start='2000-01-01')

In [72]:

# write to a file
df.to_csv('./maotai.csv')

In [74]:

# read the local data back into df
df = pd.read_csv('./maotai.csv')
df.head(5)

Out[74]:

	Unnamed: 0	date	open	close	high	low	volume	code
0	0	2001-08-27	5.392	5.554	5.902	5.132	406318.00	600519
1	1	2001-08-28	5.467	5.759	5.781	5.407	129647.79	600519
2	2	2001-08-29	5.777	5.684	5.781	5.640	53252.75	600519
3	3	2001-08-30	5.668	5.796	5.860	5.624	48013.06	600519
4	4	2001-08-31	5.804	5.782	5.877	5.749	23231.48	600519

In [79]:

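# drop the useless column; in the drop family of functions axis=0 means rows and axis=1 means columns
df.drop(labels='Unnamed: 0',axis=1,inplace=True)  # inplace=True removes the data from the original DataFrame itself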

df.info():
- returns basic information about df
  - the number of rows
  - the data type of each column
  - whether any column has missing data

In [83]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4385 entries, 0 to 4384
Data columns (total 7 columns):
date      4385 non-null object
open      4385 non-null float64
close     4385 non-null float64
high      4385 non-null float64
low       4385 non-null float64
volume    4385 non-null float64
code      4385 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 239.9+ KB

In [88]:

# convert the date column to a datetime type
df['date'] = pd.to_datetime(df['date'])

In [90]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4385 entries, 0 to 4384
Data columns (total 7 columns):
date      4385 non-null datetime64[ns]
open      4385 non-null float64
close     4385 non-null float64
high      4385 non-null float64
low       4385 non-null float64
volume    4385 non-null float64
code      4385 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 239.9 KB

In [94]:

# use the date column as the row index of the source data
df.set_index('date',inplace=True)

Data Analysis 02

3. Consolidating basic DataFrame operations: stock analysis

In [42]:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import tushare as ts  # financial data interface package
import matplotlib.pyplot as plt

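Requirement: stock analysis

Use the tushare package to fetch the historical quotes of a stock.
Output every date on which the stock closed more than 3% above its open.
Output every date on which the stock opened more than 2% below the previous day's close.
Suppose that, starting on January 1, 2010, I buy 1 lot of the stock on the first trading day of every month and sell all shares on the last trading day of every year. What is my profit as of today?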

In [4]:

df = ts.get_k_data('600519',start='2000-01-01')
df.to_csv('./maotai.csv')

In [50]:

df = pd.read_csv('./maotai.csv')
df.head()

Out[50]:

	Unnamed: 0	date	open	close	high	low	volume	code
0	0	2001-08-27	5.392	5.554	5.902	5.132	406318.00	600519
1	1	2001-08-28	5.467	5.759	5.781	5.407	129647.79	600519
2	2	2001-08-29	5.777	5.684	5.781	5.640	53252.75	600519
3	3	2001-08-30	5.668	5.796	5.860	5.624	48013.06	600519
4	4	2001-08-31	5.804	5.782	5.877	5.749	23231.48	600519

In [51]:

df.drop(labels='Unnamed: 0',axis=1,inplace=True)

In [7]:

df.head(3)

Out[7]:

	date	open	close	high	low	volume	code
0	2001-08-27	5.392	5.554	5.902	5.132	406318.00	600519
1	2001-08-28	5.467	5.759	5.781	5.407	129647.79	600519
2	2001-08-29	5.777	5.684	5.781	5.640	53252.75	600519

In [8]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4386 entries, 0 to 4385
Data columns (total 7 columns):
date      4386 non-null object
open      4386 non-null float64
close     4386 non-null float64
high      4386 non-null float64
low       4386 non-null float64
volume    4386 non-null float64
code      4386 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 239.9+ KB

In [10]:

df.describe()  # summary statistics (aggregation)

Out[10]:

	open	close	high	low	volume	code
count	4386.000000	4386.000000	4386.000000	4386.000000	4386.000000	4386.0
mean	198.346553	198.656232	201.125280	195.931863	27921.072905	600519.0
std	260.048946	260.403673	263.249482	257.143597	24503.505290	0.0
min	4.049000	4.045000	4.068000	4.012000	238.100000	600519.0
25%	27.526000	27.536750	27.820250	27.093000	11310.195000	600519.0
50%	113.967500	113.987500	115.515500	112.401000	23793.000000	600519.0
75%	194.410000	193.878000	197.158500	191.120750	37651.250000	600519.0
max	1231.000000	1233.750000	1241.610000	1228.060000	406318.000000	600519.0

In [52]:

# convert the date column to datetime and use it as the row index of the source data
df['date'] = pd.to_datetime(df['date'])

In [53]:

df.set_index('date',inplace=True)

In [54]:

df.head()

Out[54]:

	open	close	high	low	volume	code
date
2001-08-27	5.392	5.554	5.902	5.132	406318.00	600519
2001-08-28	5.467	5.759	5.781	5.407	129647.79	600519
2001-08-29	5.777	5.684	5.781	5.640	53252.75	600519
2001-08-30	5.668	5.796	5.860	5.624	48013.06	600519
2001-08-31	5.804	5.782	5.877	5.749	23231.48	600519

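Output every date on which the stock closed more than 3% above its open.
Output every date on which the stock opened more than 2% below the previous day's close.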

In [18]:

df.loc[(df['close'] - df['open'])/df['open'] >  0.03 ].index

. . .

In [21]:

df.loc[(df['open'] - df['close'].shift(1))/df['close'].shift(1) < -0.02].index

. . .
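
Because the real output is elided here, the two filters can be checked on a tiny table instead. The prices below are hypothetical, made up purely so that the expected dates are easy to verify by hand.

```python
import pandas as pd

# Hypothetical prices, invented for illustration only
prices = pd.DataFrame(
    {'open': [10.0, 9.7, 10.0, 10.5], 'close': [10.0, 10.1, 10.4, 10.4]},
    index=pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07']),
)

intraday_gain = (prices['close'] - prices['open']) / prices['open']
prev_close = prices['close'].shift(1)          # yesterday's close, NaN on the first day
overnight_gap = (prices['open'] - prev_close) / prev_close

up_days = prices.loc[intraday_gain > 0.03].index
gap_down_days = prices.loc[overnight_gap < -0.02].index

print(up_days)
print(gap_down_days)
```

`shift(1)` moves every close down one row, which is what lets each open be compared with the previous day's close. The first day has no previous close, so its comparison is NaN and it is never selected.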

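Suppose that, starting on January 1, 2010, I buy 1 lot of the stock on the first trading day of every month and sell all shares on the last trading day of every year. What is my profit as of today?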

In [22]:

data = df['2010':'2020']
data.head()

. . .

In [31]:

# first trading day of each month: buy 1 lot (100 shares) at the opening price
data_monthly = data.resample('M').first()
cost_money = data_monthly['open'].sum()*100

In [32]:

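# last trading day of each year; [:-1] drops the final, unfinished year
# each year-end sale covers the 12 lots (1200 shares) bought during that year
data_yearly = data.resample('A').last()[:-1]
recv_money = data_yearly['open'].sum()*1200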

In [33]:

# value of the shares still held, priced at the latest open (this assumes 100 shares remain)
last_money = data['open'].iloc[-1] * 100

In [34]:

last_money+recv_money-cost_money

Out[34]:

567728.6999999997

Requirement: designing a dual moving-average strategy

Use the tushare package to fetch the historical quotes of a stock

In [ ]:

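Compute the 5-day and 30-day moving averages of the stock's historical data
- What is a moving average?
  - For every trading day we can compute the average over the previous N days; connecting these moving averages into a line gives the N-day moving average line. Commonly used periods are 5, 10, 30, 60, 120 and 240 days.
    - The 5-day and 10-day lines are reference indicators for short-term trading, known as daily moving averages;
    - the 30-day and 60-day lines are medium-term indicators, known as quarterly moving averages;
    - the 120-day and 240-day lines are long-term indicators, known as yearly moving averages.
- How it is computed: MA = (C1 + C2 + C3 + … + Cn) / N, where C is the closing price on a given day and N is the averaging period in days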

In [64]:

# ma stands for moving average
ma5 = df['close'].rolling(5).mean()
ma30 = df['close'].rolling(30).mean()
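
What rolling(N).mean() produces is easiest to see on a short series. This sketch uses a made-up 3-day window on five invented closing prices rather than the real data.

```python
import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
ma3 = close.rolling(3).mean()   # NaN until 3 values are available

print(ma3)
```

The first N - 1 entries are NaN because a full window does not exist yet, which is why the first rows of ma5 and ma30 are empty.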

In [65]:

# add ma5 and ma30 to the source data
df['ma5'] = ma5
df['ma30'] = ma30
df

. . .

Visualize the historical closing price and the two moving averages

In [46]:

plt.plot(ma5[50:100],c='red')
plt.plot(ma30[50:100],c='blue')

Out[46]:

[<matplotlib.lines.Line2D at 0x1b084f37550>]

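Find and output all golden-cross and death-cross dates
- In technical stock analysis, golden crosses and death crosses can be explained simply as follows:
  - An indicator has two lines: one over a short period and one over a longer period.
  - When the short-period line turns upward and crosses above the longer-period line, it is called a "golden cross";
  - when the short-period line turns downward and crosses below the longer-period line, it is called a "death cross";
  - Generally, a golden cross suggests buying and a death cross suggests selling. Of course, these crosses are only one indicator and must be combined with many others to make trading decisions more accurate.

- Suppose that, starting on January 1, 2010 with initial capital of 100,000 yuan, I buy as much as possible at every golden cross and sell everything at every death cross. What is my return as of today?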

In [68]:

df = df['2010':'2020']
df


. . .

In [74]:

sr1 = df['ma5'] < df['ma30']
sr2 = df['ma5'] >= df['ma30']


- Combine sr1 with sr2.shift(1) using AND or OR; the result pinpoints the golden crosses and death crosses

In [77]:

df.loc[sr1 & sr2.shift(1)]  # rows where a death cross occurs
death_dates = df.loc[sr1 & sr2.shift(1)].index


In [79]:

df.loc[~(sr1 | sr2.shift(1))]  # rows where a golden cross occurs
golden_dates = df.loc[~(sr1 | sr2.shift(1))].index
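
The cross detection can be traced by hand on two short made-up series. This is a sketch under stated assumptions rather than the notebook's exact code: the series are hypothetical, and the shifted mask is filled and cast to bool explicitly so that the first day (which has no yesterday) never counts as a cross.

```python
import pandas as pd

# Hypothetical moving averages, chosen so that the lines cross three times
short_ma = pd.Series([1, 2, 3, 2, 1, 2, 3], dtype=float)
long_ma = pd.Series([2, 2, 2, 2, 2, 2, 2], dtype=float)

above = short_ma >= long_ma
# Yesterday's state; the first day reuses its own state so it cannot register a cross
prev_above = above.shift(1, fill_value=bool(above.iloc[0])).astype(bool)

golden_days = above[above & ~prev_above].index    # below yesterday, at or above today
death_days = above[~above & prev_above].index     # at or above yesterday, below today

print(list(golden_days), list(death_days))
```

The logic matches the two expressions above: a death cross is "below today and not below yesterday", and a golden cross is the mirror image.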


In [80]:

golden_dates


. . .

In [96]:

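# trade on the golden and death crosses and compute the profit
first_money = 100000
money = first_money
hold = 0  # number of shares currently held

s1 = Series(1,index=golden_dates)  # 1 marks a golden-cross date
s2 = Series(0,index=death_dates)  # 0 marks a death-cross date
s = pd.concat([s1, s2])  # all golden-cross and death-cross dates (Series.append was removed in pandas 2.0)
s = s.sort_index()  # sort by date

for i in s.index:
    # use the opening price as the trade price
    price = df.loc[i]['open']
    if s[i] == 1:  # golden cross: buy
        hand_cost = 100 * price  # cost of 1 lot (100 shares)
        hand_count = money // hand_cost  # maximum number of lots we can afford
        hold = hand_count * 100  # number of shares bought
        money -= hold*price
    else:  # death cross: sell everything
        money += hold * price
        hold = 0

# if the last signal was a golden cross, the shares bought then are still held; count their value toward the total
last_money = hold * df['open'].iloc[-1]
print(money + last_money - first_money)
1501254.9999999995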

Data Analysis 03

4. Data cleaning with pandas

Handling missing data

There are two kinds of missing values:
- None
- np.nan (NaN)

In [1]:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import tushare as ts  # financial data interface package
import matplotlib.pyplot as plt

How the two kinds of missing values differ

In [2]:

type(np.nan)

Out[2]:

float

In [5]:

np.nan + 3

Out[5]:

nan

In [3]:

type(None)

Out[3]:

NoneType

None and NaN in pandas

In [10]:

df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df

Out[10]:

	0	1	2	3	4
0	44	91	92	51	55
1	23	22	92	35	83
2	21	52	40	63	29
3	94	51	24	70	59
4	27	78	1	21	17
5	94	57	5	43	22
6	87	31	58	30	82
7	93	28	54	7	93

In [12]:

df.iloc[1,2] = None
df.iloc[3,4] = None
df.iloc[4,1] = None
df.iloc[7,4] = np.nan

In [13]:

df

Out[13]:

	0	1	2	3	4
0	44	91.0	92.0	51	55.0
1	23	22.0	NaN	35	83.0
2	21	52.0	40.0	63	29.0
3	94	51.0	24.0	70	NaN
4	27	NaN	1.0	21	17.0
5	94	57.0	5.0	43	22.0
6	87	31.0	58.0	30	82.0
7	93	28.0	54.0	7	NaN

Null-handling operations in pandas

isnull
notnull
any
all
dropna
fillna

In [16]:

df.isnull()

Out[16]:

	0	1	2	3	4
0	False	False	False	False	False
1	False	False	True	False	False
2	False	False	False	False	False
3	False	False	False	False	True
4	False	True	False	False	False
5	False	False	False	False	False
6	False	False	False	False	False
7	False	False	False	False	True

Find which rows of the original data contain null values

In [20]:

df.isnull()

Out[20]:

	0	1	2	3	4
0	False	False	False	False	False
1	False	False	True	False	False
2	False	False	False	False	False
3	False	False	False	False	True
4	False	True	False	False	False
5	False	False	False	False	False
6	False	False	False	False	False
7	False	False	False	False	True

any and all help us detect which rows or columns of df contain null values
isnull -> any(axis=1)
notnull -> all(axis=1)

In [24]:

~df.isnull().any(axis=1)
df.loc[~df.isnull().any(axis=1)]

Out[24]:

	0	1	2	3	4
0	44	91.0	92.0	51	55.0
2	21	52.0	40.0	63	29.0
5	94	57.0	5.0	43	22.0
6	87	31.0	58.0	30	82.0

In [28]:

df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]

Out[28]:

	0	1	2	3	4
0	44	91.0	92.0	51	55.0
2	21	52.0	40.0	63	29.0
5	94	57.0	5.0	43	22.0
6	87	31.0	58.0	30	82.0

In [29]:

df.dropna(axis=0)  # drop the rows that contain null values

Out[29]:

	0	1	2	3	4
0	44	91.0	92.0	51	55.0
2	21	52.0	40.0	63	29.0
5	94	57.0	5.0	43	22.0
6	87	31.0	58.0	30	82.0

In [32]:

df

Out[32]:

	0	1	2	3	4
0	44	91.0	92.0	51	55.0
1	23	22.0	NaN	35	83.0
2	21	52.0	40.0	63	29.0
3	94	51.0	24.0	70	NaN
4	27	NaN	1.0	21	17.0
5	94	57.0	5.0	43	22.0
6	87	31.0	58.0	30	82.0
7	93	28.0	54.0	7	NaN

In [33]:

# fillna overwrites the null values
df.fillna(method='ffill',axis=0)  # fill each null with the neighboring value (here the one from the previous row)

Out[33]:

	0	1	2	3	4
0	44	91.0	92.0	51	55.0
1	23	22.0	92.0	35	83.0
2	21	52.0	40.0	63	29.0
3	94	51.0	24.0	70	29.0
4	27	51.0	1.0	21	17.0
5	94	57.0	5.0	43	22.0
6	87	31.0	58.0	30	82.0
7	93	28.0	54.0	7	82.0
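
Forward filling cannot repair a null in the first row, because nothing precedes it. A common remedy, sketched here on a made-up table, is to follow it with a backward fill. `df.ffill()` and `df.bfill()` are the method forms of the same operations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0, np.nan, 3.0], 'b': [5.0, np.nan, np.nan, 8.0]})

forward = df.ffill()            # copy the previous valid value downwards
both = df.ffill().bfill()       # then fill any leading gaps from below

print(forward)
print(both)
```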

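Interview question

About the data:
- The data are temperature readings from one cold store; 1-7 correspond to 7 temperature sensors, each sampled once a minute.
Goal:
- Using the 4 required sensors 1-4, build a model of the cold store's temperature field and estimate the readings of sensors 5-7.
- In the end each cold store should need only 4 sensors instead of 7.
- f(1-4) --> y(5-7)
Processing steps:
- 1. The raw data has dropped frames and needs preprocessing;
- 2. plot with matplotlib;
- 3. build a logistic regression model.
There is no standard answer; work according to your own understanding and briefly describe your process in writing. Thank you for your cooperation.
The test data is testData.xlsx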

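Handling duplicate data

Handling abnormal data

Create a data source with 1000 rows and 3 columns (A, B, C) with values in the range 0-1, then clean out the outliers in column C whose values exceed twice the column's standard deviation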
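
One possible way to approach this task is sketched below. It reads the condition literally (a value is abnormal when it is greater than twice the standard deviation of column C); the names `threshold`, `is_outlier` and `cleaned` are illustrative.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(data=np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])

threshold = 2 * df['C'].std()            # twice the standard deviation of C
is_outlier = df['C'] > threshold         # True where C counts as abnormal
cleaned = df.loc[~is_outlier]            # keep only the normal rows

print(threshold, len(cleaned))
```

The boolean mask is built once and then inverted with `~`, the same pattern used for null filtering above.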

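5. Concatenating and merging DataFrames

Concatenation

pd.concat, append

pandas provides the pd.concat function, which is similar to np.concatenate but takes a few more parameters:

objs
axis=0
keys
join='outer' / 'inner': how to concatenate. outer joins all labels (whether or not they match), while inner joins only the labels that match and drops the rest
ignore_index=False

Matching concatenation
Non-matching concatenation
- Non-matching means the indexes along the other dimension differ, e.g. the column indexes differ when concatenating vertically, or the row indexes differ when concatenating horizontally
- There are 2 join modes:
  - outer join: fills with NaN (the default)
  - inner join: joins only the matching labels
Using append (a DataFrame method, df.append(); it was removed in pandas 2.0, where pd.concat is used instead)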
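
The two join modes are easiest to compare on tables whose columns only partly overlap. The tables `left` and `right` below are made up for this sketch.

```python
import pandas as pd
from pandas import DataFrame

left = DataFrame({'A': [1, 2], 'B': [3, 4]})
right = DataFrame({'B': [5, 6], 'C': [7, 8]})

# vertical concatenation: the column indexes do not fully match
outer = pd.concat([left, right], axis=0, join='outer', ignore_index=True)
inner = pd.concat([left, right], axis=0, join='inner', ignore_index=True)

print(outer)   # columns A, B, C with NaN where a table lacks the column
print(inner)   # only the shared column B
```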

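Merging

merge differs from concat in that merge combines tables based on a shared column
When merging with pd.merge(), the column whose name appears in both tables is automatically used as the key.
Note that the elements in that column do not need to be in the same order

One-to-one merge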

In [ ]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })

In [ ]:

df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                'hire_date':[2004,2008,2012],
                })
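
The cells above only define the two tables. A minimal sketch of the merge itself follows; the result name `merged` is illustrative.

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                 'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                 'hire_date': [2004, 2008, 2012]})

# 'employee' is the only shared column, so it becomes the key automatically
merged = pd.merge(df1, df2)
print(merged)
```

The rows are matched by employee name, not by position, so the different ordering in df2 does not matter.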

One-to-many merge

In [ ]:

df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})

In [ ]:

df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                       'supervisor':['Carly','Guido','Steve']
                })

Many-to-many merge

In [ ]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})

In [ ]:

df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                'supervisor':['Carly','Guido','Steve']
                })

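Normalizing the key

When columns conflict, i.e. several column names are shared, use on= to specify which column is the key, together with suffixes to rename the conflicting columns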

In [ ]:

df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                 'group':['Accounting','Finance','Marketing']})

In [ ]:

df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                 'hire_date':[2003,2009,2012],
                'group':['Accounting','sell','ceo']})
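
With both `employee` and `group` shared, the key has to be chosen explicitly. The sketch below picks `employee`; the suffix strings are arbitrary illustrative values.

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'employee': ['Jack', 'Summer', 'Steve'],
                 'group': ['Accounting', 'Finance', 'Marketing']})
df2 = DataFrame({'employee': ['Jack', 'Bob', 'Jake'],
                 'hire_date': [2003, 2009, 2012],
                 'group': ['Accounting', 'sell', 'ceo']})

# merge on employee only; the two 'group' columns are kept apart by the suffixes
merged = pd.merge(df1, df2, on='employee', suffixes=('_left', '_right'))
print(merged)
```

Only Jack appears in both tables, so the default inner merge returns a single row.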

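When the two tables have no column in common to join on, use left_on and right_on to specify manually which column on each side of the merge serves as the join column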

In [ ]:

df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                'group':['Accounting','Product','Marketing'],
               'hire_date':[1998,2017,2018]})

In [ ]:

df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                'hire_dates':[1998,2016,2007]})

Inner and outer merges: outer takes the union, inner takes the intersection

In [ ]:

df6 = DataFrame({'name':['Peter','Paul','Mary'],
               'food':['fish','beans','bread']}
               )
df7 = DataFrame({'name':['Mary','Joseph'],
                'drink':['wine','beer']})
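
The effect of the how parameter on these two tables can be sketched as follows; the result names are illustrative.

```python
import pandas as pd
from pandas import DataFrame

df6 = DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                 'food': ['fish', 'beans', 'bread']})
df7 = DataFrame({'name': ['Mary', 'Joseph'],
                 'drink': ['wine', 'beer']})

inner = pd.merge(df6, df7, how='inner')   # intersection of names: Mary only
outer = pd.merge(df6, df7, how='outer')   # union of names, NaN where data is missing
left = pd.merge(df6, df7, how='left')     # every name from df6

print(len(inner), len(outer), len(left))
```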

In [ ]:

# merge df1 and df2
dic1={
    
    'name':['tom','jay','helly'],
    'age':[11,12,33],
    'classRoom':[1,2,3]
}
df1=DataFrame(data=dic1)
df2=DataFrame(data=np.random.randint(60,100,size=(3,3)),
              index=['jay','tom','helly'],
             columns=['java','python','c'])

### 数据分析day03 

### 4.基于pandas的数据清洗 

### 处理丢失数据

- 有两种丢失数据：
  - None
  - np.nan(NaN)

In [1]:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import tushare as ts#财经数据接口包
import matplotlib.pyplot as plt


- 两种丢失数据的区别

In [2]:

type(np.nan)


Out[2]:

float


In [5]:

np.nan + 3


Out[5]:

nan


In [3]:

type(None)


Out[3]:

NoneType


- pandas中的None和NAN

In [10]:

df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df


Out[10]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 |   91 |   92 |   51 |   55 |
|    1 |   23 |   22 |   92 |   35 |   83 |
|    2 |   21 |   52 |   40 |   63 |   29 |
|    3 |   94 |   51 |   24 |   70 |   59 |
|    4 |   27 |   78 |    1 |   21 |   17 |
|    5 |   94 |   57 |    5 |   43 |   22 |
|    6 |   87 |   31 |   58 |   30 |   82 |
|    7 |   93 |   28 |   54 |    7 |   93 |

In [12]:

df.iloc[1,2] = None
df.iloc[3,4] = None
df.iloc[4,1] = None
df.iloc[7,4] = np.nan


In [13]:


Out[13]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 | 91.0 | 92.0 |   51 | 55.0 |
|    1 |   23 | 22.0 |  NaN |   35 | 83.0 |
|    2 |   21 | 52.0 | 40.0 |   63 | 29.0 |
|    3 |   94 | 51.0 | 24.0 |   70 |  NaN |
|    4 |   27 |  NaN |  1.0 |   21 | 17.0 |
|    5 |   94 | 57.0 |  5.0 |   43 | 22.0 |
|    6 |   87 | 31.0 | 58.0 |   30 | 82.0 |
|    7 |   93 | 28.0 | 54.0 |    7 |  NaN |

### pandas处理空值操作

- isnull
- notnull
- any
- all
- dropna
- fillna

In [16]:

df.isnull()


Out[16]:

|      |     0 |     1 |     2 |     3 |     4 |
| ---: | ----: | ----: | ----: | ----: | ----: |
|    0 | False | False | False | False | False |
|    1 | False | False |  True | False | False |
|    2 | False | False | False | False | False |
|    3 | False | False | False | False |  True |
|    4 | False |  True | False | False | False |
|    5 | False | False | False | False | False |
|    6 | False | False | False | False | False |
|    7 | False | False | False | False |  True |

- Detect which rows of the original data contain null values

In [20]:

df.isnull()


Out[20]:

|      |     0 |     1 |     2 |     3 |     4 |
| ---: | ----: | ----: | ----: | ----: | ----: |
|    0 | False | False | False | False | False |
|    1 | False | False |  True | False | False |
|    2 | False | False | False | False | False |
|    3 | False | False | False | False |  True |
|    4 | False |  True | False | False | False |
|    5 | False | False | False | False | False |
|    6 | False | False | False | False | False |
|    7 | False | False | False | False |  True |

- any and all help detect which rows or columns of df contain null values
- isnull->any(axis=1)
- notnull->all(axis=1)

In [24]:

~df.isnull().any(axis=1)
df.loc[~df.isnull().any(axis=1)]


Out[24]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 | 91.0 | 92.0 |   51 | 55.0 |
|    2 |   21 | 52.0 | 40.0 |   63 | 29.0 |
|    5 |   94 | 57.0 |  5.0 |   43 | 22.0 |
|    6 |   87 | 31.0 | 58.0 |   30 | 82.0 |

In [28]:

df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]


Out[28]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 | 91.0 | 92.0 |   51 | 55.0 |
|    2 |   21 | 52.0 | 40.0 |   63 | 29.0 |
|    5 |   94 | 57.0 |  5.0 |   43 | 22.0 |
|    6 |   87 | 31.0 | 58.0 |   30 | 82.0 |

In [29]:

df.dropna(axis=0)  # drop the rows that contain null values


Out[29]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 | 91.0 | 92.0 |   51 | 55.0 |
|    2 |   21 | 52.0 | 40.0 |   63 | 29.0 |
|    5 |   94 | 57.0 |  5.0 |   43 | 22.0 |
|    6 |   87 | 31.0 | 58.0 |   30 | 82.0 |

In [32]:

df

Out[32]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 | 91.0 | 92.0 |   51 | 55.0 |
|    1 |   23 | 22.0 |  NaN |   35 | 83.0 |
|    2 |   21 | 52.0 | 40.0 |   63 | 29.0 |
|    3 |   94 | 51.0 | 24.0 |   70 |  NaN |
|    4 |   27 |  NaN |  1.0 |   21 | 17.0 |
|    5 |   94 | 57.0 |  5.0 |   43 | 22.0 |
|    6 |   87 | 31.0 | 58.0 |   30 | 82.0 |
|    7 |   93 | 28.0 | 54.0 |    7 |  NaN |

In [33]:

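# fillna overwrites null values
df.fillna(method='ffill',axis=0)  # forward fill: each null takes the nearest value above it in the same column (newer pandas prefers df.ffill())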


Out[33]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 | 91.0 | 92.0 |   51 | 55.0 |
|    1 |   23 | 22.0 | 92.0 |   35 | 83.0 |
|    2 |   21 | 52.0 | 40.0 |   63 | 29.0 |
|    3 |   94 | 51.0 | 24.0 |   70 | 29.0 |
|    4 |   27 | 51.0 |  1.0 |   21 | 17.0 |
|    5 |   94 | 57.0 |  5.0 |   43 | 22.0 |
|    6 |   87 | 31.0 | 58.0 |   30 | 82.0 |
|    7 |   93 | 28.0 | 54.0 |    7 | 82.0 |
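The fill strategies can be compared side by side on a tiny frame. This is a sketch with made-up values; `df_demo` is a hypothetical name, and `DataFrame.ffill()` / `DataFrame.bfill()` are the method forms of `fillna(method=...)`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

forward = df_demo.ffill()        # copy the nearest value above; a leading NaN stays
backward = df_demo.bfill()       # copy the nearest value below; a trailing NaN stays
both = df_demo.ffill().bfill()   # forward fill first, then back-fill what is left
constant = df_demo.fillna(0)     # replace every NaN with a fixed value
```

Chaining the two directions is what leaves no NaN behind when a gap sits in the first or last row.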

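### Interview question

- Data description:
  - The data are temperature readings from one cold-storage room; columns 1-7 correspond to 7 temperature sensors, each sampled once per minute.
- Processing goal:
  - Using the 4 required sensors (1-4), build a temperature-field model of the cold store and estimate the readings of sensors 5-7.
  - In the end each cold store needs only 4 sensors instead of 7.
  - f(1-4) --> y(5-7)
- Processing steps:
  - 1. The raw data contain dropped frames, so preprocessing is required;
  - 2. Plot with matplotlib;
  - 3. Build a logistic regression model.
- There is no standard answer; work according to your own understanding and briefly describe your procedure in writing. Thank you.
- The test data file is testData.xlsx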

In [5]:

df = pd.read_excel('./testData.xlsx')
df.drop(labels=['none','none1'],axis=1,inplace=True)


In [7]:

df.dropna(axis=0)


. . .

In [10]:

#isnull notnull any all
df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]


. . .

In [15]:

df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)


. . .

### Handling duplicate data

In [20]:

df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df.iloc[1] = [6,6,6,6,6]
df.iloc[3] = [6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6]
df


Out[20]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 |   68 |   53 |   32 |   24 |
|    1 |    6 |    6 |    6 |    6 |    6 |
|    2 |   79 |   86 |   73 |   14 |   25 |
|    3 |    6 |    6 |    6 |    6 |    6 |
|    4 |   72 |    1 |   73 |   67 |   89 |
|    5 |    6 |    6 |    6 |    6 |    6 |
|    6 |   69 |   32 |   94 |   91 |   18 |
|    7 |   47 |    7 |   77 |   11 |   67 |

In [25]:

df.drop_duplicates(keep='first')


Out[25]:

|      |    0 |    1 |    2 |    3 |    4 |
| ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 |   68 |   53 |   32 |   24 |
|    1 |    6 |    6 |    6 |    6 |    6 |
|    2 |   79 |   86 |   73 |   14 |   25 |
|    4 |   72 |    1 |   73 |   67 |   89 |
|    6 |   69 |   32 |   94 |   91 |   18 |
|    7 |   47 |    7 |   77 |   11 |   67 |
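To see which rows `drop_duplicates` would remove before removing them, `duplicated()` returns the corresponding boolean mask. A small sketch with made-up values (`df_dup` is a hypothetical name) that also compares the three `keep` options:

```python
import pandas as pd

df_dup = pd.DataFrame({"x": [1, 6, 2, 6, 6], "y": [9, 6, 8, 6, 6]})

# True for every repeat after the first occurrence
mask_first = df_dup.duplicated(keep="first")

kept_first = df_dup.drop_duplicates(keep="first")  # keeps rows 0, 1, 2
kept_last = df_dup.drop_duplicates(keep="last")    # keeps rows 0, 2, 4
kept_none = df_dup.drop_duplicates(keep=False)     # keeps rows 0, 2 only
```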

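### Handling outliers

- Build a data source with 1000 rows and 3 columns (A, B, C) whose values lie in the range 0-1, then clean out the outliers in column C, defined here as values greater than twice the column's standard deviation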

In [27]:

df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()


Out[27]:

|      |        A |        B |        C |
| ---: | -------: | -------: | -------: |
|    0 | 0.886005 | 0.831529 | 0.822956 |
|    1 | 0.391742 | 0.104039 | 0.434260 |
|    2 | 0.539745 | 0.950540 | 0.948526 |
|    3 | 0.392029 | 0.904831 | 0.680343 |
|    4 | 0.513764 | 0.979957 | 0.600518 |

In [30]:

# define the threshold used to flag outliers
twice_std = df['C'].std() * 2
twice_std


Out[30]:

0.570731429850527


In [34]:

# determine which values in column C are outliers
df['C'] > twice_std
df.loc[df['C'] > twice_std]
indexs = df.loc[df['C'] > twice_std].index  # row index labels of the outliers


In [35]:

df.drop(labels=indexs)


Out[35]:

|      |        A |        B |        C |
| ---: | -------: | -------: | -------: |
|    1 | 0.391742 | 0.104039 | 0.434260 |
|    5 | 0.512951 | 0.897948 | 0.245320 |
|    6 | 0.473968 | 0.979213 | 0.271424 |
|    7 | 0.843319 | 0.038657 | 0.184559 |
|    8 | 0.982931 | 0.881284 | 0.208147 |
|   10 | 0.784656 | 0.314898 | 0.089802 |
|   13 | 0.624428 | 0.252411 | 0.327818 |
|   15 | 0.213042 | 0.969693 | 0.567275 |
|   18 | 0.710010 | 0.534330 | 0.559441 |
|   19 | 0.026479 | 0.736259 | 0.120431 |
|   20 | 0.990039 | 0.982449 | 0.017151 |
|   23 | 0.158157 | 0.183678 | 0.428155 |
|   25 | 0.604838 | 0.950466 | 0.294742 |
|   26 | 0.304136 | 0.822809 | 0.388579 |
|   28 | 0.671559 | 0.726631 | 0.196907 |
|   30 | 0.811249 | 0.751182 | 0.467697 |
|   31 | 0.376243 | 0.805516 | 0.287484 |
|   33 | 0.570442 | 0.797945 | 0.026182 |
|   35 | 0.467125 | 0.062123 | 0.439725 |
|   36 | 0.861741 | 0.413997 | 0.543973 |
|   38 | 0.955328 | 0.817003 | 0.293787 |
|   47 | 0.458014 | 0.228608 | 0.285172 |
|   49 | 0.931513 | 0.403981 | 0.239329 |
|   51 | 0.008178 | 0.484172 | 0.021373 |
|   53 | 0.253882 | 0.300069 | 0.561118 |
|   55 | 0.752559 | 0.685649 | 0.451692 |
|   56 | 0.003363 | 0.486893 | 0.154598 |
|   57 | 0.859653 | 0.569252 | 0.007432 |
|   58 | 0.327716 | 0.419704 | 0.452710 |
|   59 | 0.068403 | 0.029346 | 0.226587 |
|  ... |      ... |      ... |      ... |
|  953 | 0.247954 | 0.072558 | 0.038834 |
|  954 | 0.199553 | 0.193049 | 0.027725 |
|  956 | 0.513195 | 0.175896 | 0.254432 |
|  957 | 0.080261 | 0.476756 | 0.521142 |
|  958 | 0.944795 | 0.550317 | 0.336043 |
|  961 | 0.464895 | 0.592027 | 0.195383 |
|  962 | 0.127469 | 0.300982 | 0.309427 |
|  963 | 0.595242 | 0.139702 | 0.450026 |
|  964 | 0.520342 | 0.639537 | 0.209403 |
|  965 | 0.372687 | 0.117984 | 0.262849 |
|  966 | 0.007270 | 0.044250 | 0.533105 |
|  967 | 0.854830 | 0.512720 | 0.173844 |
|  968 | 0.247666 | 0.972284 | 0.227422 |
|  970 | 0.047074 | 0.714412 | 0.392280 |
|  974 | 0.112649 | 0.483324 | 0.125105 |
|  975 | 0.307405 | 0.875641 | 0.432340 |
|  978 | 0.520662 | 0.003040 | 0.412422 |
|  979 | 0.337178 | 0.540283 | 0.257443 |
|  981 | 0.877978 | 0.842195 | 0.448030 |
|  982 | 0.273752 | 0.063285 | 0.291012 |
|  985 | 0.765849 | 0.974933 | 0.253099 |
|  988 | 0.139305 | 0.570496 | 0.535778 |
|  989 | 0.597190 | 0.973190 | 0.177517 |
|  990 | 0.817945 | 0.183825 | 0.330112 |
|  991 | 0.738457 | 0.578425 | 0.032489 |
|  992 | 0.159229 | 0.544980 | 0.242586 |
|  994 | 0.300998 | 0.352331 | 0.434336 |
|  996 | 0.609123 | 0.491735 | 0.045738 |
|  998 | 0.839935 | 0.181189 | 0.121180 |
|  999 | 0.798840 | 0.939869 | 0.150332 |
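The same cleaning step can be written as a single boolean filter, without collecting index labels first. The sketch below reproduces the rule used above and, as an assumption that is not part of the original exercise, adds the more common variant that measures distance from the mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_out = pd.DataFrame(rng.random(size=(1000, 3)), columns=["A", "B", "C"])

# Same rule as above: keep rows whose C value is not greater than 2 * std
twice_std = df_out["C"].std() * 2
kept = df_out.loc[~(df_out["C"] > twice_std)]

# Assumed alternative: flag values more than 2 std away from the column mean
dist = (df_out["C"] - df_out["C"].mean()).abs()
kept_centered = df_out.loc[dist <= twice_std]
```

For uniform data on 0-1 the standard deviation is about 0.29, so the rule above discards roughly the top 40% of the rows, while the mean-centered rule flags few or no rows.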







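### 5. Concatenation and merge operations on DataFrames 

### Concatenation

- pd.concat, DataFrame.append

pandas provides the pd.concat function, which is similar to np.concatenate but takes a few more parameters:

objs
axis=0
keys
join='outer' / 'inner': the join mode. outer keeps every label on the other axis and fills the gaps with NaN, while inner keeps only the labels shared by all inputs
ignore_index=False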
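A short sketch of what `ignore_index` and `keys` do to the resulting index; the two small frames are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"employee": ["Bob", "Jake"], "group": ["Accounting", "Engineering"]})
right = pd.DataFrame({"employee": ["Lisa"], "group": ["Engineering"]})

stacked = pd.concat([left, right], axis=0)                # original labels kept: 0, 1, 0
renumbered = pd.concat([left, right], ignore_index=True)  # fresh labels: 0, 1, 2
labelled = pd.concat([left, right], keys=["a", "b"])      # MultiIndex: ('a', 0), ('a', 1), ('b', 0)
```

With `keys`, each input can be recovered afterwards, for example `labelled.loc["b"]`.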


- Matching concatenation

In [1]:

import numpy as np
import pandas as pd
from pandas import DataFrame


In [2]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering'],
                 })
df1


Out[2]:

|      | employee |       group |
| ---: | -------: | ----------: |
|    0 |      Bob |  Accounting |
|    1 |     Jake | Engineering |
|    2 |     Lisa | Engineering |

In [6]:

pd.concat((df1,df1,df1),axis=1)


Out[6]:

|      | employee |       group | employee |       group | employee |       group |
| ---: | -------: | ----------: | -------: | ----------: | -------: | ----------: |
|    0 |      Bob |  Accounting |      Bob |  Accounting |      Bob |  Accounting |
|    1 |     Jake | Engineering |     Jake | Engineering |     Jake | Engineering |
|    2 |     Lisa | Engineering |     Lisa | Engineering |     Lisa | Engineering |

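- Non-matching concatenation
  - Non-matching means the labels along the other axis differ between the inputs: the column labels differ when concatenating vertically, or the row labels differ when concatenating horizontally
  - There are 2 join modes:
    - Outer join: fill with NaN (the default)
    - Inner join: keep only the matching labels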

In [7]:

df1


Out[7]:

|      | employee |       group |
| ---: | -------: | ----------: |
|    0 |      Bob |  Accounting |
|    1 |     Jake | Engineering |
|    2 |     Lisa | Engineering |

In [12]:

df2 = df1.copy()
df2.columns = ['employee','groups']
df2


Out[12]:

|      | employee |      groups |
| ---: | -------: | ----------: |
|    0 |      Bob |  Accounting |
|    1 |     Jake | Engineering |
|    2 |     Lisa | Engineering |

In [14]:

pd.concat((df1,df2),axis=0)


Out[14]:

|      | employee |       group |      groups |
| ---: | -------: | ----------: | ----------: |
|    0 |      Bob |  Accounting |         NaN |
|    1 |     Jake | Engineering |         NaN |
|    2 |     Lisa | Engineering |         NaN |
|    0 |      Bob |         NaN |  Accounting |
|    1 |     Jake |         NaN | Engineering |
|    2 |     Lisa |         NaN | Engineering |

In [16]:

pd.concat((df1,df2),axis=0,join='inner')


Out[16]:

|      | employee |
| ---: | -------: |
|    0 |      Bob |
|    1 |     Jake |
|    2 |     Lisa |
|    0 |      Bob |
|    1 |     Jake |
|    2 |     Lisa |

- Using the append function (DataFrame.append was removed in pandas 2.0; on current versions use pd.concat)

In [17]:

df1.append(df2)


Out[17]:

|      | employee |       group |      groups |
| ---: | -------: | ----------: | ----------: |
|    0 |      Bob |  Accounting |         NaN |
|    1 |     Jake | Engineering |         NaN |
|    2 |     Lisa | Engineering |         NaN |
|    0 |      Bob |         NaN |  Accounting |
|    1 |     Jake |         NaN | Engineering |
|    2 |     Lisa |         NaN | Engineering |
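On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same table can be produced with `pd.concat`. A minimal sketch using two small stand-in frames (`df_a` and `df_b` are hypothetical names):

```python
import pandas as pd

df_a = pd.DataFrame({"employee": ["Bob", "Jake"], "group": ["Accounting", "Engineering"]})
df_b = pd.DataFrame({"employee": ["Bob", "Jake"], "groups": ["Accounting", "Engineering"]})

# Vertical outer concatenation: unmatched columns are filled with NaN
appended = pd.concat([df_a, df_b])
```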

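### Merge operations: concatenation stitches tables together, while merging combines the data in the tables on a shared key

- merge differs from concat in that merge combines rows based on a common column
- When merging with pd.merge(), the columns that have the same name in both tables are automatically used as the key.
- Note that the order of the elements in each column does not need to match

#### One-to-one merge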

In [18]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering'],
                 })
df1


Out[18]:

|      | employee |       group |
| ---: | -------: | ----------: |
|    0 |      Bob |  Accounting |
|    1 |     Jake | Engineering |
|    2 |     Lisa | Engineering |

In [19]:

df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                 'hire_date':[2004,2008,2012],
                 })
df2


Out[19]:

|      | employee | hire_date |
| ---: | -------: | --------: |
|    0 |     Lisa |      2004 |
|    1 |      Bob |      2008 |
|    2 |     Jake |      2012 |

In [20]:

pd.merge(df1,df2)


Out[20]:

|      | employee |       group | hire_date |
| ---: | -------: | ----------: | --------: |
|    0 |      Bob |  Accounting |      2008 |
|    1 |     Jake | Engineering |      2012 |
|    2 |     Lisa | Engineering |      2004 |

#### One-to-many merge

In [21]:

df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})
df3


Out[21]:

|      | employee |       group | hire_date |
| ---: | -------: | ----------: | --------: |
|    0 |     Lisa |  Accounting |      2004 |
|    1 |     Jake | Engineering |      2016 |

In [22]:

df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                 'supervisor':['Carly','Guido','Steve']
                 })
df4


Out[22]:

|      |       group | supervisor |
| ---: | ----------: | ---------: |
|    0 |  Accounting |      Carly |
|    1 | Engineering |      Guido |
|    2 | Engineering |      Steve |

In [23]:

pd.merge(df3,df4)


Out[23]:

|      | employee |       group | hire_date | supervisor |
| ---: | -------: | ----------: | --------: | ---------: |
|    0 |     Lisa |  Accounting |      2004 |      Carly |
|    1 |     Jake | Engineering |      2016 |      Guido |
|    2 |     Jake | Engineering |      2016 |      Steve |

#### Many-to-many merge

In [24]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})
df1


Out[24]:

|      | employee |       group |
| ---: | -------: | ----------: |
|    0 |      Bob |  Accounting |
|    1 |     Jake | Engineering |
|    2 |     Lisa | Engineering |

In [25]:

df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                 'supervisor':['Carly','Guido','Steve']
                 })
df5


Out[25]:

|      |       group | supervisor |
| ---: | ----------: | ---------: |
|    0 | Engineering |      Carly |
|    1 | Engineering |      Guido |
|    2 |          HR |      Steve |

In [28]:

pd.merge(df1,df5)


Out[28]:

|      | employee |       group | supervisor |
| ---: | -------: | ----------: | ---------: |
|    0 |     Jake | Engineering |      Carly |
|    1 |     Jake | Engineering |      Guido |
|    2 |     Lisa | Engineering |      Carly |
|    3 |     Lisa | Engineering |      Guido |

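#### Normalizing the key

- When columns conflict, that is, when more than one column name is shared, use on= to specify which column is the key, together with suffixes to rename the conflicting columns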

In [29]:

df1 = DataFrame({'employee':['Jack','Summer','Steve'],
                 'group':['Accounting','Finance','Marketing']})
df1


Out[29]:

|      | employee |      group |
| ---: | -------: | ---------: |
|    0 |     Jack | Accounting |
|    1 |   Summer |    Finance |
|    2 |    Steve |  Marketing |

In [30]:

df2 = DataFrame({'employee':['Jack','Bob','Jake'],
                 'hire_date':[2003,2009,2012],
                 'group':['Accounting','sell','ceo']})
df2


Out[30]:

|      | employee |      group | hire_date |
| ---: | -------: | ---------: | --------: |
|    0 |     Jack | Accounting |      2003 |
|    1 |      Bob |       sell |      2009 |
|    2 |     Jake |        ceo |      2012 |

In [32]:

pd.merge(df1,df2,on='group')


Out[32]:

|      | employee_x |      group | employee_y | hire_date |
| ---: | ---------: | ---------: | ---------: | --------: |
|    0 |       Jack | Accounting |       Jack |      2003 |

- When the two tables have no column name in common, use left_on and right_on to specify which column on each side of the merge is the join column

In [33]:

df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                 'group':['Accounting','Product','Marketing'],
                 'hire_date':[1998,2017,2018]})
df1


Out[33]:

|      | employee |      group | hire_date |
| ---: | -------: | ---------: | --------: |
|    0 |     Bobs | Accounting |      1998 |
|    1 |    Linda |    Product |      2017 |
|    2 |     Bill |  Marketing |      2018 |

In [34]:

df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                 'hire_dates':[1998,2016,2007]})
df5


Out[34]:

|      | hire_dates | name |
| ---: | ---------: | ---: |
|    0 |       1998 | Lisa |
|    1 |       2016 | Bobs |
|    2 |       2007 | Bill |

In [35]:

pd.merge(df1,df5,left_on='employee',right_on='name')


Out[35]:

|      | employee |      group | hire_date | hire_dates | name |
| ---: | -------: | ---------: | --------: | ---------: | ---: |
|    0 |     Bobs | Accounting |      1998 |       2016 | Bobs |
|    1 |     Bill |  Marketing |      2018 |       2007 | Bill |

#### Inner and outer merges: outer takes the union, inner takes the intersection

In [37]:

df6 = DataFrame({'name':['Peter','Paul','Mary'],
                 'food':['fish','beans','bread']}
                )
df7 = DataFrame({'name':['Mary','Joseph'],
                 'drink':['wine','beer']})


In [38]:

df6


Out[38]:

|      |  food |  name |
| ---: | ----: | ----: |
|    0 |  fish | Peter |
|    1 | beans |  Paul |
|    2 | bread |  Mary |

In [39]:

df7


Out[39]:

|      | drink |   name |
| ---: | ----: | -----: |
|    0 |  wine |   Mary |
|    1 |  beer | Joseph |

In [43]:

pd.merge(df6,df7,how='right')


Out[43]:

|      |  food |   name | drink |
| ---: | ----: | -----: | ----: |
|    0 | bread |   Mary |  wine |
|    1 |   NaN | Joseph |  beer |
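Only `how='right'` is run above. The other join modes can be compared on the same two tables with the following sketch; the result variable names are illustrative.

```python
import pandas as pd

df6 = pd.DataFrame({"name": ["Peter", "Paul", "Mary"], "food": ["fish", "beans", "bread"]})
df7 = pd.DataFrame({"name": ["Mary", "Joseph"], "drink": ["wine", "beer"]})

inner = pd.merge(df6, df7, how="inner")  # intersection of keys: Mary only
left = pd.merge(df6, df7, how="left")    # every key from df6; drink is NaN for Peter and Paul
outer = pd.merge(df6, df7, how="outer")  # union of keys: all four names, gaps filled with NaN
```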

In [ ]:

# merge df1 and df2
dic1={

'name':['tom','jay','helly'],
'age':[11,12,33],
'classRoom':[1,2,3]

}
df1=DataFrame(data=dic1)
df2=DataFrame(data=np.random.randint(60,100,size=(3,3)),
              index=['jay','tom','helly'],
              columns=['java','python','c'])
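The exercise above is left without an answer. One possible approach, offered as a sketch rather than the author's solution, is to join the `name` column of df1 against the row index of df2 using `left_on` together with `right_index=True`:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"name": ["tom", "jay", "helly"],
                    "age": [11, 12, 33],
                    "classRoom": [1, 2, 3]})
df2 = pd.DataFrame(np.random.randint(60, 100, size=(3, 3)),
                   index=["jay", "tom", "helly"],
                   columns=["java", "python", "c"])

# Match df1['name'] against the index labels of df2
merged = pd.merge(df1, df2, left_on="name", right_index=True)
```

Each student's row now carries the three scores, regardless of the different row order in the two tables.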






 

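### 6. Population analysis case study 

- Requirements:
  - Import the files and inspect the raw data
  - Merge the population data with the state abbreviation data
  - Drop the redundant abbreviation column from the merged data
  - Find the columns that contain missing data
  - Find which state/region values leave state as NaN, and deduplicate them
  - Fill in the correct state value for those state/region entries so that the state column has no NaN left
  - Merge in the state area data (areas)
  - The area (sq. mi) column turns out to have missing data; find which rows
  - Drop the rows that contain missing data
  - Find the total population data for 2010
  - Compute the population density of each state
  - Sort, and find the five states with the highest population density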

In [1]:

import pandas as pd
from pandas import DataFrame
import numpy as np


In [10]:

abb = pd.read_csv('./data/state-abbrevs.csv')  # full names and abbreviations of the states
area = pd.read_csv('./data/state-areas.csv')  # full names and areas of the states
pop = pd.read_csv('./data/state-population.csv')  # population data


In [7]:

pop.head(2)


Out[7]:

|      | state/region |    ages | year | population |
| ---: | -----------: | ------: | ---: | ---------: |
|    0 |           AL | under18 | 2012 |  1117489.0 |
|    1 |           AL |   total | 2012 |  4817528.0 |

In [8]:

abb.head(2)


Out[8]:

|      |   state | abbreviation |
| ---: | ------: | -----------: |
|    0 | Alabama |           AL |
|    1 |  Alaska |           AK |

In [12]:

# combine (merge) the data
abb_pop = pd.merge(abb,pop,how='outer',left_on='abbreviation',right_on='state/region')
abb_pop.head()


Out[12]:

|      |   state | abbreviation | state/region |    ages | year | population |
| ---: | ------: | -----------: | -----------: | ------: | ---: | ---------: |
|    0 | Alabama |           AL |           AL | under18 | 2012 |  1117489.0 |
|    1 | Alabama |           AL |           AL |   total | 2012 |  4817528.0 |
|    2 | Alabama |           AL |           AL | under18 | 2010 |  1130966.0 |
|    3 | Alabama |           AL |           AL |   total | 2010 |  4785570.0 |
|    4 | Alabama |           AL |           AL | under18 | 2011 |  1125763.0 |

In [13]:

abb_pop.drop(labels='abbreviation',axis=1,inplace=True)


In [15]:

# find the columns that contain missing data
abb_pop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
state 2448 non-null object
state/region 2544 non-null object
ages 2544 non-null object
year 2544 non-null int64
population 2524 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB


In [16]:

# find the columns that contain missing data
abb_pop.isnull().any(axis=0)


Out[16]:

state True
state/region False
ages False
year False
population True
dtype: bool


In [19]:

# find which state/region values (state abbreviations) leave state (the full state name) as NaN, then deduplicate those abbreviations
abb_pop.head()


Out[19]:

|      |   state | state/region |    ages | year | population |
| ---: | ------: | -----------: | ------: | ---: | ---------: |
|    0 | Alabama |           AL | under18 | 2012 |  1117489.0 |
|    1 | Alabama |           AL |   total | 2012 |  4817528.0 |
|    2 | Alabama |           AL | under18 | 2010 |  1130966.0 |
|    3 | Alabama |           AL |   total | 2010 |  4785570.0 |
|    4 | Alabama |           AL | under18 | 2011 |  1125763.0 |

In [24]:

# 1. find which full names are null
abb_pop['state'].isnull()
# 2. select the rows where the full name is null
abb_pop.loc[abb_pop['state'].isnull()]
# 3. take the abbreviations of those rows
abb_pop.loc[abb_pop['state'].isnull()]['state/region']
# 4. deduplicate the resulting Series of abbreviations
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()  # unique() returns the distinct values of a Series


Out[24]:

array(['PR', 'USA'], dtype=object)


In [ ]:

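# fill in the correct state value for those state/region entries so that the state column has no NaN left
# locate the null full names for the abbreviation PR and set them to PR's full name, PUERTO RICO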


In [29]:

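abb_pop['state/region'] == 'PR'
# select the rows for the abbreviation PR
abb_pop.loc[abb_pop['state/region'] == 'PR']
# set the null state values in those rows to PR's full name
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index  # row index labels where the full name for PR is null
# note: a later merge on state is case-sensitive, so this value only matches if the area table uses the same spelling
abb_pop.loc[indexs,'state'] = 'PUERTO RICO'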


In [31]:

# likewise, set the full name for USA to United States
abb_pop['state/region'] == 'USA'
abb_pop.loc[abb_pop['state/region'] == 'USA']
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
abb_pop.loc[indexs,'state'] = 'United States'
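The two fill steps above can be shortened. The sketch below works on a four-row stand-in for abb_pop (the frame `demo` and the dict `full_names` are hypothetical) and shows a mask passed directly to `.loc`, followed by a lookup dict that fills several abbreviations at once.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"state": ["Alabama", np.nan, np.nan, np.nan],
                     "state/region": ["AL", "PR", "USA", "PR"]})

# A boolean mask can go straight into .loc; no intermediate index is needed
demo.loc[demo["state/region"] == "PR", "state"] = "Puerto Rico"

# Fill whatever is still missing from an abbreviation -> full name lookup
full_names = {"PR": "Puerto Rico", "USA": "United States"}
demo["state"] = demo["state"].fillna(demo["state/region"].map(full_names))
```

`fillna` with a Series aligns on the row index, so only the rows that are still NaN receive a value.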


In [39]:

# merge in the state area data (areas)
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head()


Out[39]:

|      |   state | state/region |    ages |   year | population | area (sq. mi) |
| ---: | ------: | -----------: | ------: | -----: | ---------: | ------------: |
|    0 | Alabama |           AL | under18 | 2012.0 |  1117489.0 |       52423.0 |
|    1 | Alabama |           AL |   total | 2012.0 |  4817528.0 |       52423.0 |
|    2 | Alabama |           AL | under18 | 2010.0 |  1130966.0 |       52423.0 |
|    3 | Alabama |           AL |   total | 2010.0 |  4785570.0 |       52423.0 |
|    4 | Alabama |           AL | under18 | 2011.0 |  1125763.0 |       52423.0 |

In [45]:

# the area (sq. mi) column turns out to have missing data; find which rows
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs


. . .

In [43]:

abb_pop_area.drop(labels=indexs,inplace=True)


In [48]:

# find the total population data for 2010; query('condition') filters rows with an expression string
abb_pop_area.query('year==2010 & ages == "total"')


. . .
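`query` is shorthand for boolean indexing. A sketch showing that both forms select the same rows, using three Alabama rows copied from the tables above (`demo` is a hypothetical name):

```python
import pandas as pd

demo = pd.DataFrame({"state": ["Alabama", "Alabama", "Alabama"],
                     "ages": ["under18", "total", "total"],
                     "year": [2010, 2010, 2012],
                     "population": [1130966.0, 4785570.0, 4817528.0]})

by_query = demo.query('year == 2010 & ages == "total"')
by_mask = demo.loc[(demo["year"] == 2010) & (demo["ages"] == "total")]
```

In the mask form each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python.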

In [ ]:

# compute the population density of each state


In [51]:

abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head()


Out[51]:

|      |   state | state/region |    ages |   year | population | area (sq. mi) |      midu |
| ---: | ------: | -----------: | ------: | -----: | ---------: | ------------: | --------: |
|    0 | Alabama |           AL | under18 | 2012.0 |  1117489.0 |       52423.0 | 21.316769 |
|    1 | Alabama |           AL |   total | 2012.0 |  4817528.0 |       52423.0 | 91.897221 |
|    2 | Alabama |           AL | under18 | 2010.0 |  1130966.0 |       52423.0 | 21.573851 |
|    3 | Alabama |           AL |   total | 2010.0 |  4785570.0 |       52423.0 | 91.287603 |
|    4 | Alabama |           AL | under18 | 2011.0 |  1125763.0 |       52423.0 | 21.474601 |

In [53]:

# sort, and find the five states with the highest population density
abb_pop_area.sort_values(by='midu',axis=0,ascending=False)


Out[53]:

|      |                state | state/region |    ages |   year | population | area (sq. mi) |        midu |
| ---: | -------------------: | -----------: | ------: | -----: | ---------: | ------------: | ----------: |
|  391 | District of Columbia |           DC |   total | 2013.0 |   646449.0 |          68.0 | 9506.602941 |
|  385 | District of Columbia |           DC |   total | 2012.0 |   633427.0 |          68.0 | 9315.102941 |
|  387 | District of Columbia |           DC |   total | 2011.0 |   619624.0 |          68.0 | 9112.117647 |
|  431 | District of Columbia |           DC |   total | 1990.0 |   605321.0 |          68.0 | 8901.779412 |
|  389 | District of Columbia |           DC |   total | 2010.0 |   605125.0 |          68.0 | 8898.897059 |
|  426 | District of Columbia |           DC |   total | 1991.0 |   600870.0 |          68.0 | 8836.323529 |
|  429 | District of Columbia |           DC |   total | 1992.0 |   597567.0 |          68.0 | 8787.750000 |
|  422 | District of Columbia |           DC |   total | 1993.0 |   595302.0 |          68.0 | 8754.441176 |
|  392 | District of Columbia |           DC |   total | 2009.0 |   592228.0 |          68.0 | 8709.235294 |
|  425 | District of Columbia |           DC |   total | 1994.0 |   589240.0 |          68.0 | 8665.294118 |
|  420 | District of Columbia |           DC |   total | 1995.0 |   580519.0 |          68.0 | 8537.044118 |
|  396 | District of Columbia |           DC |   total | 2008.0 |   580236.0 |          68.0 | 8532.882353 |
|  406 | District of Columbia |           DC |   total | 2001.0 |   574504.0 |          68.0 | 8448.588235 |
|  394 | District of Columbia |           DC |   total | 2007.0 |   574404.0 |          68.0 | 8447.117647 |
|  408 | District of Columbia |           DC |   total | 2002.0 |   573158.0 |          68.0 | 8428.794118 |
|  419 | District of Columbia |           DC |   total | 1996.0 |   572379.0 |          68.0 | 8417.338235 |
|  412 | District of Columbia |           DC |   total | 2000.0 |   572046.0 |          68.0 | 8412.441176 |
|  400 | District of Columbia |           DC |   total | 2006.0 |   570681.0 |          68.0 | 8392.367647 |
|  410 | District of Columbia |           DC |   total | 1999.0 |   570220.0 |          68.0 | 8385.588235 |
|  402 | District of Columbia |           DC |   total | 2003.0 |   568502.0 |          68.0 | 8360.323529 |
|  404 | District of Columbia |           DC |   total | 2004.0 |   567754.0 |          68.0 | 8349.323529 |
|  417 | District of Columbia |           DC |   total | 1997.0 |   567739.0 |          68.0 | 8349.102941 |
|  398 | District of Columbia |           DC |   total | 2005.0 |   567136.0 |          68.0 | 8340.235294 |
|  415 | District of Columbia |           DC |   total | 1998.0 |   565232.0 |          68.0 | 8312.235294 |
|  421 | District of Columbia |           DC | under18 | 1995.0 |   123620.0 |          68.0 | 1817.941176 |
|  424 | District of Columbia |           DC | under18 | 1994.0 |   122170.0 |          68.0 | 1796.617647 |
|  418 | District of Columbia |           DC | under18 | 1996.0 |   121210.0 |          68.0 | 1782.500000 |
|  423 | District of Columbia |           DC | under18 | 1993.0 |   120471.0 |          68.0 | 1771.632353 |
|  416 | District of Columbia |           DC | under18 | 1997.0 |   119531.0 |          68.0 | 1757.808824 |
|  428 | District of Columbia |           DC | under18 | 1992.0 |   118636.0 |          68.0 | 1744.647059 |
|  ... |                  ... |          ... |     ... |    ... |        ... |           ... |         ... |
|   53 |               Alaska |           AK |   total | 1994.0 |   603308.0 |      656425.0 |    0.919081 |
|   56 |               Alaska |           AK |   total | 1993.0 |   599434.0 |      656425.0 |    0.913180 |
|   50 |               Alaska |           AK |   total | 1992.0 |   588736.0 |      656425.0 |    0.896882 |
|   55 |               Alaska |           AK |   total | 1991.0 |   570193.0 |      656425.0 |    0.868634 |
|   48 |               Alaska |           AK |   total | 1990.0 |   553290.0 |      656425.0 |    0.842884 |
|   63 |               Alaska |           AK | under18 | 1998.0 |   192636.0 |      656425.0 |    0.293462 |
|   66 |               Alaska |           AK | under18 | 1999.0 |   191422.0 |      656425.0 |    0.291613 |
|   69 |               Alaska |           AK | under18 | 2000.0 |   190615.0 |      656425.0 |    0.290384 |
|   71 |               Alaska |           AK | under18 | 2001.0 |   188771.0 |      656425.0 |    0.287574 |
|   73 |               Alaska |           AK | under18 | 2002.0 |   188482.0 |      656425.0 |    0.287134 |
|   92 |               Alaska |           AK | under18 | 2011.0 |   188329.0 |      656425.0 |    0.286901 |
|   62 |               Alaska |           AK | under18 | 1997.0 |   188280.0 |      656425.0 |    0.286826 |
|   94 |               Alaska |           AK | under18 | 2012.0 |   188162.0 |      656425.0 |    0.286647 |
|   86 |               Alaska |           AK | under18 | 2013.0 |   188132.0 |      656425.0 |    0.286601 |
|   90 |               Alaska |           AK | under18 | 2010.0 |   187902.0 |      656425.0 |    0.286251 |
|   54 |               Alaska |           AK | under18 | 1994.0 |   187439.0 |      656425.0 |    0.285545 |
|   57 |               Alaska |           AK | under18 | 1993.0 |   187190.0 |      656425.0 |    0.285166 |
|   75 |               Alaska |           AK | under18 | 2003.0 |   186843.0 |      656425.0 |    0.284637 |
|   89 |               Alaska |           AK | under18 | 2009.0 |   186351.0 |      656425.0 |    0.283888 |
|   77 |               Alaska |           AK | under18 | 2004.0 |   186335.0 |      656425.0 |    0.283863 |
|   81 |               Alaska |           AK | under18 | 2006.0 |   185580.0 |      656425.0 |    0.282713 |
|   61 |               Alaska |           AK | under18 | 1996.0 |   185360.0 |      656425.0 |    0.282378 |
|   79 |               Alaska |           AK | under18 | 2005.0 |   185304.0 |      656425.0 |    0.282293 |
|   59 |               Alaska |           AK | under18 | 1995.0 |   184990.0 |      656425.0 |    0.281814 |
|   52 |               Alaska |           AK | under18 | 1992.0 |   184878.0 |      656425.0 |    0.281644 |
|   83 |               Alaska |           AK | under18 | 2007.0 |   184344.0 |      656425.0 |    0.280830 |
|   85 |               Alaska |           AK | under18 | 2008.0 |   183124.0 |      656425.0 |    0.278972 |
|   51 |               Alaska |           AK | under18 | 1991.0 |   182180.0 |      656425.0 |    0.277534 |
|   49 |               Alaska |           AK | under18 | 1990.0 |   177502.0 |      656425.0 |    0.270407 |
| 2544 |          Puerto Rico |          NaN |     NaN |    NaN |        NaN |        3515.0 |         NaN |
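The sorted table above ranks every (state, year, age group) row, so its top is filled with District of Columbia repeated year after year. One way to rank states instead, sketched here under the assumption that a single year's total population is the intended basis, is to filter first and then take the largest values. The rows are copied from the tables above.

```python
import pandas as pd

demo = pd.DataFrame({
    "state": ["District of Columbia", "District of Columbia", "Alabama", "Alabama"],
    "ages": ["total", "total", "total", "under18"],
    "year": [2010, 2013, 2010, 2010],
    "population": [605125.0, 646449.0, 4785570.0, 1130966.0],
    "area (sq. mi)": [68.0, 68.0, 52423.0, 52423.0],
})
demo["midu"] = demo["population"] / demo["area (sq. mi)"]

# One row per state: a single year and the total population only
one_year = demo.query('year == 2010 & ages == "total"')
top5 = one_year.nlargest(5, "midu")[["state", "midu"]]
```

With the full data set, `top5` would hold five distinct states; this stand-in has only two.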





 

### 7. Advanced pandas operations 

In [3]:

import pandas as pd
from pandas import DataFrame
import numpy as np


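### Replace operations

- Replacement works on both Series and DataFrame
- Single-value replacement
  - Plain replacement: replace every matching element: to_replace=15,value='e'
  - Single-value replacement in a specified column: to_replace={column label: value to replace} value='value'

- Multi-value replacement
  - List replacement: to_replace=[] value=[]
  - Dict replacement (recommended): to_replace={to_replace:value,to_replace:value}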

In [4]:

df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df


Out[4]:

|      |    0 |    1 |    2 |    3 |    4 |    5 |    6 |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 |   62 |    3 |   85 |   26 |   47 |   14 |
|    1 |   15 |   78 |   32 |   98 |   85 |    4 |   51 |
|    2 |   53 |   75 |   87 |   21 |   45 |    8 |   18 |
|    3 |   54 |   31 |   67 |   49 |   77 |   25 |   49 |
|    4 |   18 |   21 |   18 |   31 |   93 |   11 |    0 |
|    5 |   21 |   54 |   76 |   95 |   70 |   77 |   49 |

In [5]:

df.replace(to_replace=3,value='Three')


Out[5]:

|      |    0 |    1 |     2 |    3 |    4 |    5 |    6 |
| ---: | ---: | ---: | ----: | ---: | ---: | ---: | ---: |
|    0 |   44 |   62 | Three |   85 |   26 |   47 |   14 |
|    1 |   15 |   78 |    32 |   98 |   85 |    4 |   51 |
|    2 |   53 |   75 |    87 |   21 |   45 |    8 |   18 |
|    3 |   54 |   31 |    67 |   49 |   77 |   25 |   49 |
|    4 |   18 |   21 |    18 |   31 |   93 |   11 |    0 |
|    5 |   21 |   54 |    76 |   95 |   70 |   77 |   49 |

In [6]:

df.replace(to_replace={3:'aaa'})


Out[6]:

|      |    0 |    1 |    2 |    3 |    4 |    5 |    6 |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|    0 |   44 |   62 |  aaa |   85 |   26 |   47 |   14 |
|    1 |   15 |   78 |   32 |   98 |   85 |    4 |   51 |
|    2 |   53 |   75 |   87 |   21 |   45 |    8 |   18 |
|    3 |   54 |   31 |   67 |   49 |   77 |   25 |   49 |
|    4 |   18 |   21 |   18 |   31 |   93 |   11 |    0 |
|    5 |   21 |   54 |   76 |   95 |   70 |   77 |   49 |

In [8]:

# replace a value in a specified column
df.replace(to_replace={5:77},value='6666666')


Out[8]:

|      |    0 |    1 |    2 |    3 |    4 |       5 |    6 |
| ---: | ---: | ---: | ---: | ---: | ---: | ------: | ---: |
|    0 |   44 |   62 |    3 |   85 |   26 |      47 |   14 |
|    1 |   15 |   78 |   32 |   98 |   85 |       4 |   51 |
|    2 |   53 |   75 |   87 |   21 |   45 |       8 |   18 |
|    3 |   54 |   31 |   67 |   49 |   77 |      25 |   49 |
|    4 |   18 |   21 |   18 |   31 |   93 |      11 |    0 |
|    5 |   21 |   54 |   76 |   95 |   70 | 6666666 |   49 |
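The cells above cover single-value replacement. The multi-value forms from the list at the start of this section can be sketched like this, on a small made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})

# List form: the two lists are paired by position (1 -> 10, 2 -> 20)
by_list = demo.replace(to_replace=[1, 2], value=[10, 20])

# Dict form: the same mapping written as old: new pairs
by_dict = demo.replace(to_replace={1: 10, 2: 20})
```

Both calls return a new DataFrame and leave `demo` unchanged.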

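### Mapping operations

- Concept: create a mapping table that binds each element of values to a specific label or string (giving an element value a different representation)

- Create a df with two columns, name and salary, then give each name a corresponding Chinese name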

In [10]:

dic = {
    'name':['jay','tom','jay'],
    'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df


Out[10]:

|      | name | salary |
| ---: | ---: | -----: |
|    0 |  jay |  10000 |
|    1 |  tom |  20000 |
|    2 |  jay |  10000 |

In [14]:

# mapping table
dic = {
    'jay':'张三',
    'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df


Out[14]:

|      | name | salary | c_name |
| ---: | ---: | -----: | -----: |
|    0 |  jay |  10000 |   张三 |
|    1 |  tom |  20000 |   李四 |
|    2 |  jay |  10000 |   张三 |
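One detail the example does not show is what happens to a value that has no entry in the mapping table. The sketch below assumes a hypothetical third name, `bob`, that is missing from the dict; `map()` then yields NaN for that row, and `fillna()` is one way to fall back to the original value.

```python
import pandas as pd

df = pd.DataFrame({'name': ['jay', 'tom', 'bob'],
                   'salary': [10000, 20000, 15000]})

# Mapping table; 'bob' is deliberately missing (hypothetical extra row)
name_map = {'jay': '张三', 'tom': '李四'}

# map() looks each value up in the dict; unmatched values become NaN
df['c_name'] = df['name'].map(name_map)

# Keep the original name wherever no mapping exists
df['c_name_filled'] = df['name'].map(name_map).fillna(df['name'])
print(df)
```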

### Computation with map()

- The part of a salary above 3000 is taxed at 50%; compute each person's after-tax salary

In [16]:

def after_sal(s):
    return s - (s-3000)*0.5


In [18]:

df['after_salary'] = df['salary'].map(after_sal)
df


Out[18]:

|      | name | salary | c_name | after_salary |
| ---: | ---: | -----: | -----: | -----------: |
|    0 |  jay |  10000 |   张三 |       6500.0 |
|    1 |  tom |  20000 |   李四 |      11500.0 |
|    2 |  jay |  10000 |   张三 |       6500.0 |
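The formula above is only correct for salaries of at least 3000: for a smaller salary the "excess" is negative and the result would be higher than the salary itself. A minimal sketch that guards against this follows; the name `after_tax` and the 2000 test value are illustrative, not from the original.

```python
import pandas as pd

THRESHOLD = 3000   # tax-free amount, taken from the text
RATE = 0.5         # tax rate on the excess, taken from the text

def after_tax(salary):
    # Only the part above the threshold is taxed; a negative excess counts as 0
    taxable = max(salary - THRESHOLD, 0)
    return salary - taxable * RATE

salaries = pd.Series([10000, 20000, 2000])
result = salaries.map(after_tax)
print(result.tolist())
```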

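### Mapping index labels

- Use rename() to replace row and column index labels
- Parameters:
  - index: mapping for the row index
  - columns: mapping for the column index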

In [19]:

df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4


Out[19]:

|      |  color | value |
| ---: | -----: | ----: |
|    0 |  white |     2 |
|    1 |   gray |     5 |
|    2 | purple |     9 |
|    3 |   blue |     0 |
|    4 |  green |     1 |

In [20]:

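new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
# Pass both mappings by keyword; recent pandas rejects a positional mapper combined with columns=
df4.rename(index=new_index,columns=new_col)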


Out[20]:

|       |     cc |   vv |
| ----: | -----: | ---: |
| first |  white |    2 |
|   two |   gray |    5 |
| three | purple |    9 |
|  four |   blue |    0 |
|  five |  green |    1 |
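A short sketch of two properties of `rename()` that are easy to miss: it returns a new object rather than changing the original, and labels absent from the mapping are left untouched. The two-row frame is a cut-down stand-in for `df4`.

```python
import pandas as pd

df4 = pd.DataFrame({'color': ['white', 'gray'], 'value': [2, 5]})

renamed = df4.rename(index={0: 'first', 1: 'two'},
                     columns={'color': 'cc', 'value': 'vv'})

# Only 'color' is in the mapping, so 'value' keeps its name
partial = df4.rename(columns={'color': 'cc'})

print(renamed)
print(list(partial.columns))
print(list(df4.columns))   # the original frame is unchanged
```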

### Random sampling by reordering

- take()
- np.random.permutation()

In [22]:

df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df


. . .

In [24]:

# take() only accepts integer positions, so column labels do not work:
# df.take(['B','A','C'],axis=1)
df.take([1,0,2],axis=1)


. . .

In [32]:

np.random.permutation(3) # returns 0..2 in random order


Out[32]:

array([0, 1, 2])


In [31]:

# Shuffle both the row order and the column order
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)


. . .

In [35]:

df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]


. . .
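The take-and-slice pattern above amounts to sampling without replacement. A minimal sketch on a small hypothetical frame, using a seeded generator so the result is repeatable, with `DataFrame.sample()` shown as a built-in alternative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=['A', 'B', 'C'])

rng = np.random.default_rng(0)          # seeded generator for repeatable output
row_order = rng.permutation(len(df))    # row positions 0..9 in random order

# Shuffle the rows, then keep the first 5: a sample without replacement
sample_take = df.take(row_order, axis=0)[0:5]

# Built-in alternative that draws the same kind of sample in one call
sample_builtin = df.sample(n=5, random_state=0)
print(sample_take)
```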

### Grouping data

- The core of grouped processing:
  - the groupby() function
  - the groups attribute, which shows how the rows were grouped

In [36]:

df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
                'color':['red','yellow','yellow','green','green','green'],
                'weight':[12,20,50,30,20,44]})
df


Out[36]:

|      |  color |   item | price | weight |
| ---: | -----: | -----: | ----: | -----: |
|    0 |    red |  Apple |   4.0 |     12 |
|    1 | yellow | Banana |   3.0 |     20 |
|    2 | yellow | Orange |   3.0 |     50 |
|    3 |  green | Banana |   2.5 |     30 |
|    4 |  green | Orange |   4.0 |     20 |
|    5 |  green |  Apple |   2.0 |     44 |

In [37]:

# Group by the kind of fruit
df.groupby(by='item')


Out[37]:

<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>


In [38]:

# Use groups to see which rows fall into each group
df.groupby(by='item').groups


Out[38]:

{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}


In [40]:

# Mean price of each fruit (on pandas 2.x this form needs mean(numeric_only=True))
df.groupby(by='item').mean()['price']


Out[40]:

item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64


In [41]:

df.groupby(by='item')['price'].mean() # recommended: select the column first, then aggregate


Out[41]:

item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64


In [42]:

# Mean weight of the fruit of each color
df.groupby(by='color')['weight'].mean()


Out[42]:

color
green 31.333333
red 12.000000
yellow 35.000000
Name: weight, dtype: float64


In [44]:

# Compute the mean price of each fruit and merge it back into the original data
df


Out[44]:

|      |  color |   item | price | weight |
| ---: | -----: | -----: | ----: | -----: |
|    0 |    red |  Apple |   4.0 |     12 |
|    1 | yellow | Banana |   3.0 |     20 |
|    2 | yellow | Orange |   3.0 |     50 |
|    3 |  green | Banana |   2.5 |     30 |
|    4 |  green | Orange |   4.0 |     20 |
|    5 |  green |  Apple |   2.0 |     44 |

In [47]:

series_price = df.groupby(by='item')['price'].mean()
dic = series_price.to_dict()
dic # mapping table


Out[47]:

{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}


In [49]:

df['mean_price'] = df['item'].map(dic)
df


Out[49]:

|      |  color |   item | price | weight | mean_price |
| ---: | -----: | -----: | ----: | -----: | ---------: |
|    0 |    red |  Apple |   4.0 |     12 |       3.00 |
|    1 | yellow | Banana |   3.0 |     20 |       2.75 |
|    2 | yellow | Orange |   3.0 |     50 |       3.50 |
|    3 |  green | Banana |   2.5 |     30 |       2.75 |
|    4 |  green | Orange |   4.0 |     20 |       3.50 |
|    5 |  green |  Apple |   2.0 |     44 |       3.00 |

### Advanced aggregation

- After grouping with groupby, transform and apply accept custom functions for further computation
- df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- Both transform and apply run the function you pass in on each group
- Both also accept a lambda expression

In [56]:

def myMean(s):
    total = 0   # renamed from sum so the built-in sum() is not shadowed
    for i in s:
        total += i
    return total/len(s)


In [57]:

df.groupby(by='item')['price'].apply(myMean) # apply as an aggregation tool: one value per group


Out[57]:

item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64


In [58]:

df.groupby(by='item')['price'].transform(myMean) # transform returns one value per original row


Out[58]:

0 3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
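The two outputs above differ only in shape: `apply` with a reducing function yields one row per group, while `transform` broadcasts each group's result back onto the original rows. A minimal sketch with a lambda in place of `myMean`, also showing that the `transform` result can be assigned directly as a column without the dict-and-map step used earlier:

```python
import pandas as pd

df = pd.DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                   'price': [4, 3, 3, 2.5, 4, 2]})

grouped = df.groupby('item')['price']

# apply with a reducing lambda: one row per group
per_group = grouped.apply(lambda s: s.sum() / len(s))

# transform with the same lambda: the result is aligned to the original rows
per_row = grouped.transform(lambda s: s.sum() / len(s))

# so it can be stored as a new column right away
df['mean_price'] = per_row
print(per_group)
print(df)
```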


### Loading data

- Read the data in type-.txt

In [1]:

import pandas as pd
from pandas import DataFrame,Series
data=pd.read_csv('./data/type-.txt')
data


Out[1]:

|      | 你好-我好-他也好 |
| ---: | ---------------: |
|    0 | 也许-大概-有可能 |
|    1 | 然而-未必-不见得 |

In [2]:

data.shape


Out[2]:

(2, 1)


- Store each word in the file as a separate element of the DataFrame

In [4]:

data=pd.read_csv('./data/type-.txt',sep='-',header=None)
data


Out[4]:

|      |    0 |    1 |      2 |
| ---: | ---: | ---: | -----: |
|    0 | 你好 | 我好 | 他也好 |
|    1 | 也许 | 大概 | 有可能 |
|    2 | 然而 | 未必 | 不见得 |
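The effect of `sep` and `header` can be reproduced without the file. The sketch below feeds the three lines shown above through `io.StringIO` as a stand-in for `./data/type-.txt`:

```python
import io
import pandas as pd

# Stand-in for ./data/type-.txt, using the rows shown above
text = "你好-我好-他也好\n也许-大概-有可能\n然而-未必-不见得\n"

# Defaults (sep=',' and a header row): the whole first line becomes the column name
default = pd.read_csv(io.StringIO(text))

# sep='-' splits on hyphens; header=None keeps the first line as data
split = pd.read_csv(io.StringIO(text), sep='-', header=None)
print(default.shape, split.shape)
```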

- Read data from a database

In [6]:

# Connect to the database and get a connection object
import sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')


In [7]:

# Read the rows of a table in the database
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df


. . .

In [ ]:

# Write the values of a df into a database table (data is the frame loaded above)
data.to_sql('sql_data',conn)


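### Pivot tables

- A pivot table is a table format that lets you rearrange data dynamically and summarize it by category. Many people have used pivot tables in Excel and seen how powerful they are; in pandas the feature is called pivot_table.

- Advantages of pivot tables:
  - Flexible: the analysis can be customized freely
  - Clear structure that makes the data easy to understand
  - Practical: a very handy tool for building reports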

In [16]:

import pandas as pd
import numpy as np
df = pd.read_csv('./data/透视表-篮球赛.csv',encoding='utf8')
df.head()


Out[16]:

|      | 对手 | 胜负 | 主客场 | 命中 | 投篮数 | 投篮命中率 | 3分命中率 | 篮板 | 助攻 | 得分 |
| ---: | ---: | ---: | -----: | ---: | -----: | ---------: | --------: | ---: | ---: | ---: |
|    0 | 勇士 |   胜 |     客 |   10 |     23 |      0.435 |     0.444 |    6 |   11 |   27 |
|    1 | 国王 |   胜 |     客 |    8 |     21 |      0.381 |     0.286 |    3 |    9 |   27 |
|    2 | 小牛 |   胜 |     主 |   10 |     19 |      0.526 |     0.462 |    3 |    7 |   29 |
|    3 | 灰熊 |   负 |     主 |    8 |     20 |      0.400 |     0.250 |    5 |    8 |   22 |
|    4 | 76人 |   胜 |     客 |   10 |     20 |      0.500 |     0.250 |    3 |   13 |   27 |

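#### The four most important pivot_table parameters: index, values, columns, aggfunc

- index: the grouping condition for the summary
  - A pivot_table needs a grouping key, and index is the usual one. To see Harden's numbers against each opponent, group by opponent and compute the mean of each numeric column: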

In [14]:

df.pivot_table(index='对手')


. . .

- To see the numbers against the same opponent at home and away, group by both opponent (对手) and home/away (主客场)

In [17]:

df.pivot_table(index=['对手','主客场'])


. . .

- values: selects which columns are aggregated
  - If we only need Harden's points (得分), rebounds (篮板) and assists (助攻) by home/away and win/loss:

In [18]:

df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'])


Out[18]:

| 主客场 | 胜负 |      助攻 |      得分 |     篮板 |
| -----: | ---: | --------: | --------: | -------: |
|     主 |   胜 | 10.555556 | 34.222222 | 5.444444 |
|        |   负 |  8.666667 | 29.666667 | 5.000000 |
|     客 |   胜 |  9.000000 | 32.000000 | 4.916667 |
|        |   负 |  8.000000 | 20.000000 | 4.000000 |

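- aggfunc: the function applied when aggregating the data
  - When aggfunc is not set, it defaults to aggfunc='mean', which computes the mean.

- To also get James Harden's total points, total rebounds and total assists by home/away and win/loss: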

In [20]:

df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'],aggfunc=['sum','mean'])


Out[20]:

| 主客场 | 胜负 | sum 助攻 | sum 得分 | sum 篮板 | mean 助攻 | mean 得分 | mean 篮板 |
| -----: | ---: | -------: | -------: | -------: | --------: | --------: | --------: |
|     主 |   胜 |       95 |      308 |       49 | 10.555556 | 34.222222 |  5.444444 |
|        |   负 |       26 |       89 |       15 |  8.666667 | 29.666667 |  5.000000 |
|     客 |   胜 |      108 |      384 |       59 |  9.000000 | 32.000000 |  4.916667 |
|        |   负 |        8 |       20 |        4 |  8.000000 | 20.000000 |  4.000000 |

- columns: adds a column-level grouping field
  - It further splits the values field by category

In [33]:

# Total points across all opponents, split by home/away
df.pivot_table(index='主客场',values='得分',aggfunc='sum',fill_value=0)


Out[33]:

|        | 得分 |
| -----: | ---: |
| 主客场 |      |
|     主 |  397 |
|     客 |  404 |

In [34]:

# Total points against each opponent, split by home/away (the totals above are further split by opponent)
df.pivot_table(index='主客场',values='得分',columns='对手',aggfunc='sum',fill_value=0)


Out[34]:

|   对手 | 76人 | 勇士 | 国王 | 太阳 | 小牛 | 尼克斯 | 开拓者 | 掘金 | 步行者 | 湖人 | 灰熊 | 爵士 | 猛龙 | 篮网 | 老鹰 | 骑士 | 鹈鹕 | 黄蜂 |
| -----: | ---: | ---: | ---: | ---: | ---: | -----: | -----: | ---: | -----: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 主客场 |      |      |      |      |      |        |        |      |        |      |      |      |      |      |      |      |      |      |
|     主 |   29 |    0 |    0 |    0 |   29 |     37 |      0 |   21 |     29 |    0 |   60 |   56 |   38 |   37 |    0 |   35 |   26 |    0 |
|     客 |   27 |   27 |   27 |   48 |    0 |     31 |     48 |    0 |     26 |   36 |   49 |   29 |    0 |    0 |   29 |    0 |    0 |   27 |

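### Cross-tabulation

- A special pivot table that counts how often each combination of groups occurs
- pd.crosstab(index,columns)
  - index: the grouping data, used as the row index of the cross table
  - columns: the column index of the cross table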

In [35]:

df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
               'age':[15,23,25,17,35,57,24,31,22],
               'smoke':[True,False,False,True,True,False,False,True,False],
               'height':[168,179,181,166,173,178,188,190,160]})


- Count the smokers of each sex

In [36]:

pd.crosstab(df.smoke,df.sex)


Out[36]:

|   sex |  man | women |
| ----: | ---: | ----: |
| smoke |      |       |
| False |    2 |     3 |
|  True |    2 |     2 |

- Show smoking status by age

In [14]:

pd.crosstab(df.age,df.smoke)


Out[14]:

| smoke | False | True |
| ----: | ----: | ---: |
|   age |       |      |
|    15 |     0 |    1 |
|    17 |     0 |    1 |
|    22 |     1 |    0 |
|    23 |     1 |    0 |
|    24 |     1 |    0 |
|    25 |     1 |    0 |
|    31 |     0 |    1 |
|    35 |     0 |    1 |
|    57 |     1 |    0 |

Data Analysis 04

7. Advanced pandas operations

In [53]:

import pandas as pd
from pandas import DataFrame
import numpy as np

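Replacement

replace() works on both Series and DataFrame
Single-value replacement
- Plain replacement: replace every matching element: to_replace=15,value='e'
- Single value in a given column: to_replace={column label: value to find} value='value'
Multi-value replacement
- List form: to_replace=[] value=[]
- Dict form (recommended): to_replace={to_replace:value,to_replace:value}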
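The four forms listed above can be compared side by side on a small frame. This is a minimal sketch with made-up integer data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({0: [3, 15], 1: [15, 77], 2: [77, 3]})

# Plain single value: every 15 in the frame becomes 0
single = df.replace(to_replace=15, value=0)

# Single value limited to one column: only the 77 in column 1 is replaced
by_column = df.replace(to_replace={1: 77}, value=0)

# List form: old and new values are paired by position (3 -> 30, 77 -> 770)
by_list = df.replace(to_replace=[3, 77], value=[30, 770])

# Dict form without value=: keys are the old values, dict values the new ones
by_dict = df.replace(to_replace={3: 30, 77: 770})
print(by_column)
print(by_dict)
```

Note that the same dict syntax means two different things: with `value=` given, the keys are column labels; without it, the keys are the values to replace.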

In [4]:

df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df

Out[4]:

	0	1	2	3	4	5	6
0	44	62	3	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

In [5]:

df.replace(to_replace=3,value='Three')

Out[5]:

	0	1	2	3	4	5	6
0	44	62	Three	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

In [6]:

df.replace(to_replace={3:'aaa'})

Out[6]:

	0	1	2	3	4	5	6
0	44	62	aaa	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

In [8]:

# Replace a value within a specific column (here: 77 in column 5)
df.replace(to_replace={5:77},value='6666666')

Out[8]:

	0	1	2	3	4	5	6
0	44	62	3	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	6666666	49

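Mapping

Concept: build a mapping table that binds each element of values to a specific label or string (it gives an element value an alternative representation)
Create a df with two columns, name and salary, then give each name a corresponding Chinese name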

In [10]:

dic = {
    'name':['jay','tom','jay'],
    'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df

Out[10]:

	name	salary
0	jay	10000
1	tom	20000
2	jay	10000

In [14]:

# Mapping table
dic = {
    'jay':'张三',
    'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df

Out[14]:

	name	salary	c_name
0	jay	10000	张三
1	tom	20000	李四
2	jay	10000	张三

Computation with map()

The part of a salary above 3000 is taxed at 50%; compute each person's after-tax salary

In [16]:

def after_sal(s):
    return s - (s-3000)*0.5

In [18]:

df['after_salary'] = df['salary'].map(after_sal)
df

Out[18]:

	name	salary	c_name	after_salary
0	jay	10000	张三	6500.0
1	tom	20000	李四	11500.0
2	jay	10000	张三	6500.0

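Mapping index labels

Use rename() to replace row and column index labels
Parameters:
- index: mapping for the row index
- columns: mapping for the column index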

In [19]:

df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4

Out[19]:

	color	value
0	white	2
1	gray	5
2	purple	9
3	blue	0
4	green	1

In [20]:

new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
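# Pass both mappings by keyword; recent pandas rejects a positional mapper combined with columns=
df4.rename(index=new_index,columns=new_col)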

Out[20]:

	cc	vv
first	white	2
two	gray	5
three	purple	9
four	blue	0
five	green	1

Random sampling by reordering

take()
np.random.permutation()

In [22]:

df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df

. . .

In [24]:

# df.take(['B','A','C'],axis=1)
df.take([1,0,2],axis=1)

. . .

In [32]:

np.random.permutation(3) # returns 0..2 in random order

Out[32]:

array([0, 1, 2])

In [31]:

# Shuffle both the row order and the column order
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)

. . .

In [35]:

df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]

. . .

Grouping data

The core of grouped processing:
- the groupby() function
- the groups attribute, which shows how the rows were grouped

In [36]:

df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
df

Out[36]:

	color	item	price	weight
0	red	Apple	4.0	12
1	yellow	Banana	3.0	20
2	yellow	Orange	3.0	50
3	green	Banana	2.5	30
4	green	Orange	4.0	20
5	green	Apple	2.0	44

In [37]:

# Group by the kind of fruit
df.groupby(by='item')

Out[37]:

<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>

In [38]:

# Use groups to see which rows fall into each group
df.groupby(by='item').groups

Out[38]:

{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}

In [40]:

# Mean price of each fruit (on pandas 2.x this form needs mean(numeric_only=True))
df.groupby(by='item').mean()['price']

Out[40]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [41]:

df.groupby(by='item')['price'].mean() # recommended: select the column first, then aggregate

Out[41]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [42]:

# Mean weight of the fruit of each color
df.groupby(by='color')['weight'].mean()

Out[42]:

color
green     31.333333
red       12.000000
yellow    35.000000
Name: weight, dtype: float64

In [44]:

# Compute the mean price of each fruit and merge it back into the original data
df

Out[44]:

	color	item	price	weight
0	red	Apple	4.0	12
1	yellow	Banana	3.0	20
2	yellow	Orange	3.0	50
3	green	Banana	2.5	30
4	green	Orange	4.0	20
5	green	Apple	2.0	44

In [47]:

series_price = df.groupby(by='item')['price'].mean() 
dic = series_price.to_dict()
dic # mapping table

Out[47]:

{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}

In [49]:

df['mean_price'] = df['item'].map(dic)
df

Out[49]:

	color	item	price	weight	mean_price
0	red	Apple	4.0	12	3.00
1	yellow	Banana	3.0	20	2.75
2	yellow	Orange	3.0	50	3.50
3	green	Banana	2.5	30	2.75
4	green	Orange	4.0	20	3.50
5	green	Apple	2.0	44	3.00

Advanced aggregation

After grouping with groupby, transform and apply accept custom functions for further computation
df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
Both transform and apply run the function you pass in on each group
Both also accept a lambda expression

In [56]:

def myMean(s):
    total = 0   # renamed from sum so the built-in sum() is not shadowed
    for i in s:
        total += i
    return total/len(s)

In [57]:

df.groupby(by='item')['price'].apply(myMean) # apply as an aggregation tool: one value per group

Out[57]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [58]:

df.groupby(by='item')['price'].transform(myMean) # transform returns one value per original row

Out[58]:

0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64

Loading data

Read the data in type-.txt

In [50]:

data_1 = pd.read_csv('./data/type-.txt',sep='-',header=None)

Store each word in the file as a separate element of the DataFrame

In [ ]:

Read data from a database

In [46]:

# Connect to the database and get a connection object
import sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')

In [47]:

# Read the rows of a table in the database
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df

. . .

In [51]:

# Write the values of a df into a database table
data_1.to_sql('sql_data123',conn)

In [52]:

pd.read_sql('select * from sql_data123',conn)

Out[52]:

	index	0	1	2
0	0	你好	我好	他也好
1	1	也许	大概	有可能
2	2	然而	未必	不见得
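The write-and-read-back round trip can be tried without touching any file by using an in-memory SQLite database. In this sketch the column names `w1`..`w3` are hypothetical, and `index=False` and `if_exists='replace'` are optional choices rather than what the cells above used:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')   # in-memory database, nothing is written to disk

data_1 = pd.DataFrame([['你好', '我好', '他也好'],
                       ['也许', '大概', '有可能']],
                      columns=['w1', 'w2', 'w3'])   # hypothetical column names

# index=False skips the extra "index" column seen in the output above;
# if_exists='replace' lets the cell be re-run without a "table already exists" error
data_1.to_sql('sql_data123', conn, index=False, if_exists='replace')

back = pd.read_sql('select * from sql_data123', conn)
conn.close()
print(back)
```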

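Pivot tables

A pivot table is a table format that lets you rearrange data dynamically and summarize it by category. Many people have used pivot tables in Excel and seen how powerful they are; in pandas the feature is called pivot_table.
Advantages of pivot tables:
- Flexible: the analysis can be customized freely
- Clear structure that makes the data easy to understand
- Practical: a very handy tool for building reports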

In [6]:

import pandas as pd
import numpy as np

In [15]:

df = pd.read_csv('./data/games.csv',encoding='utf-8')
df.head()

Out[15]:

	对手	胜负	主客场	命中	投篮数	投篮命中率	3分命中率	篮板	助攻	得分
0	勇士	胜	客	10	23	0.435	0.444	6	11	27
1	国王	胜	客	8	21	0.381	0.286	3	9	27
2	小牛	胜	主	10	19	0.526	0.462	3	7	29
3	灰熊	负	主	8	20	0.400	0.250	5	8	22
4	76人	胜	客	10	20	0.500	0.250	3	13	27

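The four most important pivot_table parameters: index, values, columns, aggfunc

index: the grouping condition for the summary
- A pivot_table needs a grouping key, and index is the usual one. To see Harden's numbers against each opponent, group by opponent and compute the mean of each numeric column: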

In [16]:

df.pivot_table(index='对手')

. . .

To see the numbers against the same opponent at home and away, group by both opponent (对手) and home/away (主客场)

In [17]:

df.pivot_table(index=['对手','主客场'])

. . .

values: selects which columns are aggregated
- If we only need Harden's points (得分), rebounds (篮板) and assists (助攻) by home/away and win/loss:

In [19]:

df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'])

Out[19]:

主客场	胜负	助攻	得分	篮板
主	胜	10.555556	34.222222	5.444444
	负	8.666667	29.666667	5.000000
客	胜	9.000000	32.000000	4.916667
	负	8.000000	20.000000	4.000000

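aggfunc: the function applied when aggregating the data
- When aggfunc is not set, it defaults to aggfunc='mean', which computes the mean.
To also get James Harden's total points, total rebounds and total assists by home/away and win/loss: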

In [23]:

df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'],aggfunc='sum')

Out[23]:

主客场	胜负	助攻	得分	篮板
主	胜	95	308	49
	负	26	89	15
客	胜	108	384	59
	负	8	20	4

In [24]:

# To get James Harden's total points, mean rebounds and maximum assists by home/away and win/loss
df.pivot_table(index=['主客场','胜负'],aggfunc={'得分':'sum','篮板':'mean','助攻':'max'})

Out[24]:

主客场	胜负	助攻	得分	篮板
主	胜	17	308	5.444444
	负	11	89	5.000000
客	胜	15	384	4.916667
	负	8	20	4.000000

columns: adds a column-level grouping field
- It further splits the values field by category

In [35]:

df.pivot_table(index='主客场',values='得分',aggfunc='sum',columns='对手')

Out[35]:

对手	76人	勇士	国王	太阳	小牛	尼克斯	开拓者	掘金	步行者	湖人	灰熊	爵士	猛龙	篮网	老鹰	骑士	鹈鹕	黄蜂
主客场
主	29.0	NaN	NaN	NaN	29.0	37.0	NaN	21.0	29.0	NaN	60.0	56.0	38.0	37.0	NaN	35.0	26.0	NaN
客	27.0	27.0	27.0	48.0	NaN	31.0	48.0	NaN	26.0	36.0	49.0	29.0	NaN	NaN	29.0	NaN	NaN	27.0
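The NaN cells above mark home/away and opponent combinations with no game. A minimal sketch on a tiny made-up sample, reusing the column names of the games file, shows a per-column `aggfunc` dict and `fill_value=0` for the empty cells:

```python
import pandas as pd

# Tiny made-up sample that reuses the column names of the games file
df = pd.DataFrame({
    '对手':   ['勇士', '国王', '勇士', '国王'],
    '主客场': ['客', '客', '主', '客'],
    '得分':   [27, 27, 30, 21],
    '助攻':   [11, 9, 7, 8],
})

# One aggregation per column, given as a dict
per_venue = df.pivot_table(index='主客场', aggfunc={'得分': 'sum', '助攻': 'max'})

# columns= adds a column level; fill_value replaces the NaN of empty cells
by_opponent = df.pivot_table(index='主客场', columns='对手', values='得分',
                             aggfunc='sum', fill_value=0)
print(per_venue)
print(by_opponent)
```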

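Cross-tabulation

A special pivot table that counts how often each combination of groups occurs
pd.crosstab(index,columns)
- index: the grouping data, used as the row index of the cross table
- columns: the column index of the cross table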

In [36]:

df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
               'age':[15,23,25,17,35,57,24,31,22],
               'smoke':[True,False,False,True,True,False,False,True,False],
               'height':[168,179,181,166,173,178,188,190,160]})
df

Out[36]:

	age	height	sex	smoke
0	15	168	man	True
1	23	179	man	False
2	25	181	women	False
3	17	166	women	True
4	35	173	man	True
5	57	178	women	False
6	24	188	man	False
7	31	190	women	True
8	22	160	women	False

Count the smokers of each sex

In [37]:

pd.crosstab(df.smoke,df.sex)

Out[37]:

sex	man	women
smoke
False	2	3
True	2	2

In [38]:

pd.crosstab(df.sex,df.smoke)

Out[38]:

smoke	False	True
sex
man	2	2
women	3	2

Show smoking status by age

In [41]:

pd.crosstab(df.age,df.smoke)

Out[41]:

smoke	False	True
age
15	0	1
17	0	1
22	1	0
23	1	0
24	1	0
25	1	0
31	0	1
35	0	1
57	1	0
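The table above has one row per distinct age. To summarize by age bracket instead, the ages can first be binned with `pd.cut`. This is a sketch; the bracket boundaries and labels are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [15, 23, 25, 17, 35, 57, 24, 31, 22],
                   'smoke': [True, False, False, True, True, False, False, True, False]})

# Hypothetical age brackets: (0, 18], (18, 30], (30, 60]
age_group = pd.cut(df['age'], bins=[0, 18, 30, 60],
                   labels=['<=18', '19-30', '31-60'])

# Rows are the brackets, columns are smoke=False / smoke=True
table = pd.crosstab(age_group, df['smoke'])
print(table)
```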

8. Analysis of 2012 US presidential election campaign contributions

In [51]:

import pandas as pd
from pandas import DataFrame
import numpy as np

In [52]:

# For convenience, define the months and the candidates with their parties:
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
          'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}
parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Reform',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Libertarian',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }

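Requirements

Load the data
Inspect the basic information of the data
Select a subset of the data: keep the following fields and discard the rest
- cand_nm: candidate name
- contbr_nm: contributor name
- contbr_st: contributor's state
- contbr_employer: contributor's employer
- contbr_occupation: contributor's occupation
- contb_receipt_amt: contribution amount (USD)
- contb_receipt_dt: contribution date
Get an overview of the new data with df.info() and check for missing values
Summarize the numeric columns with descriptive statistics using df.describe()
Handle missing values. Fields may be empty because they were forgotten or withheld; fill them with NOT PROVIDE
Handle outliers. Drop the rows whose contribution amount is <= 0
Add a column party holding each candidate's party
List the distinct values in the party column
Count how often each value occurs in the party column
Show the total contributions (contb_receipt_amt) received by each party
Show the total contributions (contb_receipt_amt) received by each party on each day
Convert the dates in the table to the format 'yyyy-mm-dd'
Find out whom disabled veterans (contributor occupation DISABLED VETERAN) mainly support
For each candidate, find the occupation of the contributor who gave the largest amount, and that amount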

In [53]:

# Load the data and inspect its basic information
df = pd.read_csv('./data/usa_election.txt')
df.head()
C:\Users\laonanhai\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2728: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[53]:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city	contbr_st	contbr_zip	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	receipt_desc	memo_cd	memo_text	form_tp	file_num
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	250.0	20-JUN-11	NaN	NaN	NaN	SA17A	736166
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	50.0	23-JUN-11	NaN	NaN	NaN	SA17A	736166
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT	AL	3.68633e+08	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11	NaN	NaN	NaN	SA17A	749073
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT	AR	7.24548e+08	NONE	RETIRED	250.0	01-AUG-11	NaN	NaN	NaN	SA17A	749073
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION	AR	7.19016e+08	NONE	RETIRED	300.0	20-JUN-11	NaN	NaN	NaN	SA17A	736166

In [54]:

# Check the raw data for missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id              536041 non-null object
cand_id              536041 non-null object
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_city          536026 non-null object
contbr_st            536040 non-null object
contbr_zip           535973 non-null object
contbr_employer      525088 non-null object
contbr_occupation    530520 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
receipt_desc         8479 non-null object
memo_cd              49718 non-null object
memo_text            52740 non-null object
form_tp              536041 non-null object
file_num             536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB

In [55]:

df.describe()

Out[55]:

	contb_receipt_amt	file_num
count	5.360410e+05	536041.000000
mean	3.750373e+02	761472.107800
std	3.564436e+03	5148.893508
min	-3.080000e+04	723511.000000
25%	5.000000e+01	756218.000000
50%	1.000000e+02	763233.000000
75%	2.500000e+02	763621.000000
max	1.944042e+06	767394.000000

In [56]:

# Handle missing values: fields may be empty because they were forgotten or withheld; fill them with NOT PROVIDE
df.fillna(value='NOT PROVIDE',inplace=True)

In [57]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id              536041 non-null object
cand_id              536041 non-null object
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_city          536041 non-null object
contbr_st            536041 non-null object
contbr_zip           536041 non-null object
contbr_employer      536041 non-null object
contbr_occupation    536041 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
receipt_desc         536041 non-null object
memo_cd              536041 non-null object
memo_text            536041 non-null object
form_tp              536041 non-null object
file_num             536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB

In [58]:

# Handle outliers: drop the rows whose contribution amount is <= 0
df = df.loc[~(df['contb_receipt_amt'] <= 0)]

In [59]:

# See which candidates, and how many, took part in the race
df['cand_nm'].unique()

Out[59]:

array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)

In [60]:

df['cand_nm'].nunique()

Out[60]:

13

In [61]:

# Add a column party holding each candidate's party
df['party'] = df['cand_nm'].map(parties)
df.head()

Out[61]:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city	contbr_st	contbr_zip	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	receipt_desc	memo_cd	memo_text	form_tp	file_num	party
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	250.0	20-JUN-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	50.0	23-JUN-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT	AL	3.68633e+08	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT	AR	7.24548e+08	NONE	RETIRED	250.0	01-AUG-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION	AR	7.19016e+08	NONE	RETIRED	300.0	20-JUN-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican

In [62]:

# List the distinct values in the party column, then count how often each occurs
df['party'].unique()

Out[62]:

array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)

In [63]:

df['party'].value_counts() # value_counts() counts how often each element occurs in the Series

Out[63]:

Democrat       289999
Republican     234300
Reform           5313
Libertarian       702
Name: party, dtype: int64

In [64]:

# Total contributions (contb_receipt_amt) received by each party
df.groupby(by='party')['contb_receipt_amt'].sum()

Out[64]:

party
Democrat       8.259441e+07
Libertarian    4.132769e+05
Reform         3.429658e+05
Republican     1.251181e+08
Name: contb_receipt_amt, dtype: float64

In [65]:

# Total contributions (contb_receipt_amt) received by each party on each day
df.groupby(by=['contb_receipt_dt','party'])['contb_receipt_amt'].sum()

Out[65]:

contb_receipt_dt  party      
01-APR-11         Reform              50.00
                  Republican       12635.00
01-AUG-11         Democrat        182198.00
                  Libertarian       1000.00
                  Reform            1847.00
                  Republican      268903.02
01-DEC-11         Democrat        651982.82
                  Libertarian        725.00
                  Reform             875.00
                  Republican      505255.96
01-FEB-11         Republican         250.00
01-JAN-11         Republican        8600.00
01-JAN-12         Democrat         74303.80
                  Reform             515.00
                  Republican       76804.72
01-JUL-11         Democrat        175364.00
                  Libertarian       2000.00
                  Reform             100.00
                  Republican      125973.72
01-JUN-11         Democrat        148409.00
                  Libertarian        500.00
                  Reform              50.00
                  Republican      435609.20
01-MAR-11         Republican        1000.00
01-MAY-11         Democrat         82644.00
                  Reform             480.00
                  Republican       28663.87
01-NOV-11         Democrat        129309.87
                  Libertarian       3000.00
                  Reform            1792.00
                                    ...    
30-OCT-11         Reform            3910.00
                  Republican       46413.16
30-SEP-11         Democrat       3409587.24
                  Libertarian        550.00
                  Reform            2050.00
                  Republican     5094824.20
31-AUG-11         Democrat        375487.44
                  Libertarian      10750.00
                  Reform             450.00
                  Republican     1038330.90
31-DEC-11         Democrat       3571793.57
                  Reform             695.00
                  Republican     1165777.72
31-JAN-11         Republican        6000.00
31-JAN-12         Democrat       1421887.31
                  Reform             150.00
                  Republican      963681.41
31-JUL-11         Democrat         20305.00
                  Reform            1066.00
                  Republican       12781.02
31-MAR-11         Reform             200.00
                  Republican       74575.00
31-MAY-11         Democrat        352005.66
                  Libertarian        250.00
                  Reform             100.00
                  Republican      313839.80
31-OCT-11         Democrat        216971.87
                  Libertarian       4250.00
                  Reform            3205.00
                  Republican      751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64

In [66]:

df.columns

Out[66]:

Index(['cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
       'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
       'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc', 'memo_cd',
       'memo_text', 'form_tp', 'file_num', 'party'],
      dtype='object')

In [67]:

# Convert the dates in the table to the format 'yyyy-mm-dd'
def tranformDate(d):
    day,month,year = d.split('-')
    month = months[month]
    return '20'+year+'-'+str(month)+'-'+day
df['contb_receipt_dt'] = df['contb_receipt_dt'].map(tranformDate)
df.head()

Out[67]:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city	contbr_st	contbr_zip	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	receipt_desc	memo_cd	memo_text	form_tp	file_num	party
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	250.0	2011-6-20	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	50.0	2011-6-23	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT	AL	3.68633e+08	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	2011-7-05	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT	AR	7.24548e+08	NONE	RETIRED	250.0	2011-8-01	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION	AR	7.19016e+08	NONE	RETIRED	300.0	2011-6-20	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
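The output above contains dates such as 2011-6-20, where the month is not zero-padded and so does not quite match the 'yyyy-mm-dd' target. A minimal sketch of a padded variant follows; the name `to_iso_date` is illustrative and, like the original function, it assumes every two-digit year belongs to 20xx.

```python
months = {'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
          'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12}

def to_iso_date(d):
    # '05-JUL-11' -> '2011-07-05'; assumes every year falls in 20xx
    day, month, year = d.split('-')
    return f"20{year}-{months[month]:02d}-{int(day):02d}"

print(to_iso_date('20-JUN-11'))
print(to_iso_date('05-JUL-11'))
```

With zero padding, sorting the strings also sorts the dates chronologically. It would be applied the same way as above, for example `df['contb_receipt_dt'].map(to_iso_date)`.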

In [68]:

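# Find out whom disabled veterans (contributor occupation DISABLED VETERAN) mainly support, by amount donated

# 1. Select the rows that belong to disabled veterans
old_bing_df = df.loc[df['contbr_occupation'] == 'DISABLED VETERAN']

# 2. Group by candidate and sum the amounts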
old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()

Out[68]:

cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64

In [69]:

# For each candidate, find the occupation of the contributor who gave the largest amount, and that amount
s = df.groupby(by=['cand_nm'])['contb_receipt_amt'].max()
s

Out[69]:

cand_nm
Bachmann, Michelle                   3022.00
Cain, Herman                        10000.00
Gingrich, Newt                       5100.00
Huntsman, Jon                        5000.00
Johnson, Gary Earl                   2500.00
McCotter, Thaddeus G                 4000.00
Obama, Barack                     1944042.43
Paul, Ron                            5000.00
Pawlenty, Timothy                   10000.00
Perry, Rick                         10000.00
Roemer, Charles E. 'Buddy' III        200.00
Romney, Mitt                        12700.00
Santorum, Rick                       5000.00
Name: contb_receipt_amt, dtype: float64

In [70]:

s.index[0]

Out[70]:

'Bachmann, Michelle'

In [71]:

for i in range(len(s)):
    # %s keeps the decimals; %d would truncate 1944042.43 to 1944042 and match no row
    q_str = 'cand_nm == "%s" & contb_receipt_amt==%s'%(s.index[i],s.values[i])
    display(df.query(q_str))

. . .
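An alternative to looping over `query()` strings is `idxmax()`, which returns the row label of the largest contribution per candidate. This is a sketch on made-up rows that reuse the column names of the election data; unlike the loop, it keeps only the first row when several contributions tie for the maximum.

```python
import pandas as pd

# Made-up rows that reuse the column names of the election data
df = pd.DataFrame({
    'cand_nm': ['Obama, Barack', 'Obama, Barack', 'Paul, Ron', 'Paul, Ron'],
    'contbr_occupation': ['RETIRED', 'ATTORNEY', 'ENGINEER', 'STUDENT'],
    'contb_receipt_amt': [250.0, 1944042.43, 5000.0, 100.0],
})

# idxmax() gives, per candidate, the row label of the largest contribution
top_rows = df.groupby('cand_nm')['contb_receipt_amt'].idxmax()

# Look those rows up and keep only the columns of interest
top = df.loc[top_rows, ['cand_nm', 'contbr_occupation', 'contb_receipt_amt']]
print(top)
```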

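9. Plotting with matplotlib

plt.plot(): line charts

Draw a single line
Draw several lines
Place several plots in one grid (the subplot() function)
Set the figure size with plt.figure(figsize=(a,b))
Add a legend with legend()
Label the axes
Save the figure
- fig = plt.figure()
- plt.plot(x,y)
- fig.savefig()
Line styles and appearance (self-study)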
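
The figure-saving steps in the list above could be put together roughly as follows. This is only a sketch: the output file name `demo_line.png` and the temporary directory are assumptions, and the `Agg` backend is selected so the code also runs outside a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

# Create the figure first so that it can be saved later
fig = plt.figure(figsize=(6, 4))
plt.plot(x, y, label="line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "demo_line.png")
fig.savefig(out_path)
plt.close(fig)
```

Keeping a reference to the figure (`fig`) matters because `savefig` is called on that object, not on the `plt` module alias used for drawing.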

In [6]:

import numpy as np

In [4]:

import matplotlib.pyplot as plt
# Make sure figures are rendered inline in the notebook.
# The comment must be on its own line: text after the magic command is parsed as arguments and raises a UsageError.
%matplotlib inline

In [5]:

x = [1,2,3,4,5]
y = [5,4,3,2,1]

plt.plot(x,y)

Out[5]:

In [8]:

# Draw two lines in the same axes
xx = np.linspace(-np.pi,np.pi,num=20)
yy = xx ** 2
plt.plot(x,y)  # each additional call to plot adds another line
plt.plot(xx,yy)

Out[8]:

In [15]:

# Place several axes in one grid
ax1 = plt.subplot(2,2,1) # grid shape (rows, columns) and the position of this axes in the grid
ax1.plot(x,y)


ax2 = plt.subplot(2,2,2)
ax2.plot(xx,yy)


ax3 = plt.subplot(2,2,3)
ax3.plot(xx,yy)


ax4 = plt.subplot(2,2,4)
ax4.plot(x,y)

Out[15]:

In [17]:

#plt.figure(figsize=(a,b))
plt.figure(figsize=(4,8))
plt.plot(x,y)

. . .

In [23]:

# Set up a legend
plt.plot(xx,yy,label='aaa')
plt.plot(xx-1,yy+1,label='bbb')
plt.legend(loc=1)

Out[23]:

In [25]:

# Label the axes
plt.plot(xx-1,yy+1,label='bbb')
plt.xlabel('distance')
plt.ylabel('temp')
plt.title('aaa')


Out[25]:

Text(0.5,1,'aaa')

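Bar charts: plt.bar()

Parameters: the first is the positions (index) of the bars, the second is the data values (bar heights), and the third is the bar width.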

In [33]:

plt.bar(x,y)

Out[33]:

In [29]:

plt.barh(x,y)

Out[29]:

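Histograms

A histogram is a special kind of bar chart that shows how values are distributed; when normalized it approximates a density plot
Parameters of plt.hist()
- bins
  Either an integer number of bins or a sequence of bin edges. Default: 10
- normed
  If True, the histogram values are normalized to form a probability density. Default: False. Recent matplotlib releases have removed normed; use density=True there
- color
  The color of the histogram. It can be a single color or a sequence of colors. If several datasets are passed (for example a DataFrame), the color sequence is applied in the same order. If omitted, a default color is used
- orientation
  Set orientation to horizontal for a horizontal histogram. Default: vertical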
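
A small sketch of the parameters listed above, under the assumption of a matplotlib version in which `density=True` replaces the older `normed` argument. It checks the defining property of a density histogram: the bar areas add up to 1. The `Agg` backend and the color name are arbitrary choices made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

data = [1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 7, 7, 7, 7, 7, 7, 8]

fig, ax = plt.subplots()
# bins=7 gives 7 equal-width bins; density=True normalizes the bar heights
heights, edges, _ = ax.hist(
    data, bins=7, density=True, color="steelblue", orientation="vertical"
)

# Area of each bar = height * bin width; for a density the areas sum to 1
area = float(np.sum(heights * np.diff(edges)))
print(area)
plt.close(fig)
```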

In [40]:

x = [1,1,2,3,4,5,5,5,6,7,7,7,7,7,7,7,8]
plt.hist(x,bins=20)

Out[40]:

(array([2., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 3., 0., 0., 1., 0., 0.,
        7., 0., 1.]),
 array([1.  , 1.35, 1.7 , 2.05, 2.4 , 2.75, 3.1 , 3.45, 3.8 , 4.15, 4.5 ,
        4.85, 5.2 , 5.55, 5.9 , 6.25, 6.6 , 6.95, 7.3 , 7.65, 8.  ]),
 <a list of 20 Patch objects>)

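Pie charts

pie() needs only one required argument, x
A pie chart suits showing each part's share of the whole; a bar chart suits comparing the sizes of the parts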

In [41]:

arr=[11,22,31,15]
plt.pie(arr)

Out[41]:

([<matplotlib.patches.Wedge at 0x1a72dae13c8>,
  <matplotlib.patches.Wedge at 0x1a72dae1748>,
  <matplotlib.patches.Wedge at 0x1a72c37ea58>,
  <matplotlib.patches.Wedge at 0x1a72bc76ac8>],
 [Text(0.996424,0.465981,''),
  Text(-0.195798,1.08243,''),
  Text(-0.830021,-0.721848,''),
  Text(0.910034,-0.61793,'')])

In [42]:

arr=[0.2,0.3,0.1]
plt.pie(arr)

Out[42]:

([<matplotlib.patches.Wedge at 0x1a72ef21630>,
  <matplotlib.patches.Wedge at 0x1a72ef21b00>,
  <matplotlib.patches.Wedge at 0x1a72ef28080>],
 [Text(0.889919,0.646564,''),
  Text(-0.646564,0.889919,''),
  Text(-1.04616,-0.339919,'')])

In [43]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'])

Out[43]:

([<matplotlib.patches.Wedge at 0x1a72ef61d68>,
  <matplotlib.patches.Wedge at 0x1a72ef69278>,
  <matplotlib.patches.Wedge at 0x1a72ef697b8>,
  <matplotlib.patches.Wedge at 0x1a72ef69cf8>],
 [Text(0.996424,0.465981,'a'),
  Text(-0.195798,1.08243,'b'),
  Text(-0.830021,-0.721848,'c'),
  Text(0.910034,-0.61793,'d')])

In [44]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3)

Out[44]:

([<matplotlib.patches.Wedge at 0x1a72efb18d0>,
  <matplotlib.patches.Wedge at 0x1a72efb1da0>,
  <matplotlib.patches.Wedge at 0x1a72efbb320>,
  <matplotlib.patches.Wedge at 0x1a72efbb860>],
 [Text(0.271752,0.127086,'a'),
  Text(-0.0533994,0.295209,'b'),
  Text(-0.226369,-0.196868,'c'),
  Text(0.248191,-0.168526,'d')])

In [45]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,autopct='%.6f%%')

Out[45]:

([<matplotlib.patches.Wedge at 0x1a72f0024a8>,
  <matplotlib.patches.Wedge at 0x1a72f002ba8>,
  <matplotlib.patches.Wedge at 0x1a72f00a358>,
  <matplotlib.patches.Wedge at 0x1a72f00aac8>],
 [Text(0.271752,0.127086,'a'),
  Text(-0.0533994,0.295209,'b'),
  Text(-0.226369,-0.196868,'c'),
  Text(0.248191,-0.168526,'d')],
 [Text(0.543504,0.254171,'13.924050%'),
  Text(-0.106799,0.590419,'27.848101%'),
  Text(-0.452739,-0.393735,'39.240506%'),
  Text(0.496382,-0.337053,'18.987341%')])
In [46]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,shadow=True,explode=[0.2,0.3,0.2,0.4])

Out[46]:

([<matplotlib.patches.Wedge at 0x1a72f04e940>,
  <matplotlib.patches.Wedge at 0x1a72f056128>,
  <matplotlib.patches.Wedge at 0x1a72f056940>,
  <matplotlib.patches.Wedge at 0x1a72f062198>],
 [Text(0.45292,0.21181,'a'),
  Text(-0.106799,0.590419,'b'),
  Text(-0.377282,-0.328113,'c'),
  Text(0.579113,-0.393228,'d')])


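Scatter plots: scatter()

A scatter plot shows the rough trend of how the dependent variable changes with the independent variable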

In [49]:

x = np.array([1,2,3,4,5])
y = x ** 2
plt.scatter(x,y)

Out[49]:

<matplotlib.collections.PathCollection at 0x1a72f089438>


In [51]:

x = np.random.random(size=(20,))
y = np.random.random(size=(20,))
plt.scatter(x,y)

Out[51]:

<matplotlib.collections.PathCollection at 0x1a72fe59e10>

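Project requirements

Part 1: Data type handling

Load the data
- Field meanings:
  - user_id: user ID
  - order_dt: purchase date
  - order_product: number of products purchased
  - order_amount: purchase amount
Inspect the data
- Check the data types
- Check whether the data contains missing values
- Convert order_dt to a datetime type
- View the summary statistics
  - Compute the average quantity of products purchased across all users
  - Compute the average amount spent across all users
- Add a column to the source data that represents the month

Part 2: Monthly analysis

Total amount purchased by users each month
- Plot it as a line chart
Total number of orders placed by all users each month
Total quantity of products purchased by all users each month
Number of distinct customers each month

Part 3: Per-user consumption analysis

Summary statistics of each user's total spending and total number of orders
Scatter plot of spending versus number of orders per user
Distribution of spending per user
Distribution of the number of orders per user (for counts within 100)

Part 4: User purchasing behaviour

Distribution over time of users' first purchases, with user counts
- Plot a line chart
Distribution over time of users' last purchases, with user counts
- Plot a line chart
Share of new versus returning customers
- One purchase: new user
- Several purchases: returning user
User segmentation
- Derive each user's total purchase quantity, total spending, and most recent purchase date
User life cycle
- Split users into active users and other users
- View the monthly share of active users and other users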

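Data Analysis 05

Data analysis review

numpy

Built on one-dimensional or multi-dimensional arrays
How to create a numpy array
- np.array()
- plt.imread()
- random
- linspace
- arange
Array indexing and slicing
- Indexing:
  - arr[0]: take the first row
- Slicing:
  - arr[rows,columns]
- Reversing:
  - arr[::-1,::-1]
Concatenation
- Condition:
  - the arrays must have the same number of dimensions and compatible shapes
Reshaping:
- reshape(): change the shape of an array
Aggregation and statistics functions
- std (standard deviation), var (variance)
Matrices:
- matrix multiplication

pandas

Series: a data structure similar to a one-dimensional array
How to create one
Indexing and slicing
Arithmetic rule:
- elements whose index labels match are combined; labels without a match produce NaN
isnull, notnull, unique, nunique
DataFrame
- a df is made up of Series
  - taking a single row or a single column out of a df always returns a Series
- ways to create a df
- indexing and slicing
  - indexing:
    - df['column label']
    - df.iloc[row position]
    - df.loc[row,column]
  - slicing
    - df[row1:row4]
    - df.loc[:,col1:col4]
- Stock case study:
  - read_xxx(): read data from an external file into a df
  - to_xxx(): write the data in a df to a file
  - tushare: a financial data interface package
  - Series has a method shift(x) that moves the elements up or down by x positions
  - Resampling:
    - df.resample('A' or 'M').first()/.last()
  - Converting to a datetime type
    - pd.to_datetime(df['col'])
  - Using a given column as the row index of the source data:
    - set_index(df['col'])
  - when an operation on a df returns a boolean Series, use it directly as a row mask on the source data to select the rows where it is True
  - Series has a method rolling(n) that groups each element with the n-1 elements before it into a window of n values; rolling is usually followed by an aggregation
- Data cleaning
  - Missing values
    - isnull -> any, notnull -> all
    - dropna()
  - Duplicates
    - drop_duplicates(keep='first')
  - Outliers
    - the condition that defines an outlier
  - Difference between None and NaN:
    - NaN is a float and can take part in arithmetic
- replace(): replace elements of a df
- map: mapping
- map as a computation tool: on a Series, map behaves like apply
  - map can only compute or map over a Series
- Random sampling
  - take(): reorder the rows or columns of a df by position (used here to shuffle them)
  - random.permutation(n): returns a random permutation of the integers 0 to n-1
- Concatenation & merging
  - Concatenation: join several dfs horizontally or vertically
  - Merging: combine data on one or more merge keys
    - inner, outer (recommended), left, right
- Population analysis:
  - query: conditional queries on a df
  - value_counts: count how often each element occurs in a Series
  - info()
- Grouping and aggregation
  - groupby()
  - Advanced aggregation: computation tools applied after grouping
    - apply
    - transform
- Pivot tables:
  - index
  - values
  - aggfunc
  - columns
- apply
  - df.apply() applies a function to each row or column of a df
- applymap:
  - applies a function to every element of a df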
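
Three of the Series tools reviewed above (`shift`, `rolling` followed by an aggregation, and `map`) can be compared side by side. This is a minimal sketch on an invented five-element Series, not code taken from the earlier case studies.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# shift(1) moves every value down one position; the first slot becomes NaN
shifted = s.shift(1)

# rolling(3) builds windows of 3 values; mean() aggregates each window,
# so the first two results are NaN because their windows are incomplete
rolled = s.rolling(3).mean()

# map applies a function (or a dict lookup) to each element of a Series
mapped = s.map(lambda v: v * 10)

print(pd.DataFrame({"s": s, "shifted": shifted, "rolled": rolled, "mapped": mapped}))
```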

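Project requirements

Part 1: Data type handling

Load the data
- Field meanings:
  - user_id: user ID
  - order_dt: purchase date
  - order_product: number of products purchased
  - order_amount: purchase amount
Inspect the data
- Check the data types
- Check whether the data contains missing values
- Convert order_dt to a datetime type
- View the summary statistics
  - Compute the average quantity of products purchased across all users
  - Compute the average amount spent across all users
- Add a column to the source data that represents the month: astype('datetime64[M]')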

In [2]:

import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

In [17]:

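# The file is whitespace-separated and has no header row; a raw string avoids an invalid-escape warning for \s
df = pd.read_csv('./CDNOW_master.txt',header=None,sep=r'\s+',names=['user_id','order_dt','order_product','order_amount'])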
df.head()

Out[17]:

	user_id	order_dt	order_product	order_amount
0	1	19970101	1	11.77
1	2	19970112	1	12.00
2	2	19970112	5	77.00
3	3	19970102	2	20.76
4	3	19970330	2	20.76

In [5]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
user_id          69659 non-null int64
order_dt         69659 non-null int64
order_product    69659 non-null int64
order_amount     69659 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB

In [55]:

df.describe()

Out[55]:

	user_id	order_product	order_amount
count	69659.000000	69659.000000	69659.000000
mean	11470.854592	2.410040	35.893648
std	6819.904848	2.333924	36.281942
min	1.000000	1.000000	0.000000
25%	5506.000000	1.000000	14.490000
50%	11410.000000	2.000000	25.980000
75%	17273.000000	3.000000	43.700000
max	23570.000000	99.000000	1286.010000

In [18]:

# Convert order_dt to a datetime type
df['order_dt'] = pd.to_datetime(df['order_dt'],format='%Y%m%d')
df.head()

Out[18]:

	user_id	order_dt	order_product	order_amount
0	1	1997-01-01	1	11.77
1	2	1997-01-12	1	12.00
2	2	1997-01-12	5	77.00
3	3	1997-01-02	2	20.76
4	3	1997-03-30	2	20.76

In [10]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
user_id          69659 non-null int64
order_dt         69659 non-null datetime64[ns]
order_product    69659 non-null int64
order_amount     69659 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 2.1 MB

In [19]:

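# Add a new column that represents the month
# astype('dtype') converts the elements to the given dtype; here it is called on the underlying NumPy array (.values)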
df['month'] = df['order_dt'].values.astype('datetime64[M]')
df.head()

Out[19]:

	user_id	order_dt	order_product	order_amount	month
0	1	1997-01-01	1	11.77	1997-01-01
1	2	1997-01-12	1	12.00	1997-01-01
2	2	1997-01-12	5	77.00	1997-01-01
3	3	1997-01-02	2	20.76	1997-01-01
4	3	1997-03-30	2	20.76	1997-03-01
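
An alternative way to derive the month column is sketched below. It assumes a reasonably recent pandas version and stays within the pandas datetime accessors instead of going through the NumPy array; the three dates are taken from the first rows of the table above, and `orders` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, using three order dates from the table above
orders = pd.DataFrame(
    {"order_dt": pd.to_datetime(["19970101", "19970112", "19970330"], format="%Y%m%d")}
)

# to_period('M') truncates each date to its month;
# to_timestamp() converts the period back to the first day of that month
orders["month"] = orders["order_dt"].dt.to_period("M").dt.to_timestamp()
print(orders)
```

Both approaches yield the first day of the month as a timestamp, so later `groupby(by='month')` calls behave the same way.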

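Part 2: Monthly analysis

Total amount spent by users each month
- Plot it as a line chart
Total quantity of products purchased by all users each month
Total number of orders placed by all users each month
Number of distinct customers each month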

In [23]:

# Total amount spent by users each month
month_amount_series = df.groupby(by='month')['order_amount'].sum()
month_amount_series

. . .

In [24]:

month_amount_series.plot()

Out[24]:

In [26]:

# Total quantity of products purchased by all users each month
df.groupby(by='month')['order_product'].sum().plot()

Out[26]:

In [29]:

# Total number of orders placed by all users each month
df.groupby(by='month')['user_id'].count().plot()


Out[29]:

In [30]:

# Number of customers each month (each user counted once)
df.groupby(by='month')['user_id'].nunique()


. . .

In [36]:

# The same result through an advanced aggregation: apply a custom function to each group
df.groupby(by='month')['user_id'].apply(lambda x:len(x.drop_duplicates()))


Out[36]:

month
1997-01-01 7846
1997-02-01 9633
1997-03-01 9524
1997-04-01 2822
1997-05-01 2214
1997-06-01 2339
1997-07-01 2180
1997-08-01 1772
1997-09-01 1739
1997-10-01 1839
1997-11-01 2028
1997-12-01 1864
1998-01-01 1537
1998-02-01 1551
1998-03-01 2060
1998-04-01 1437
1998-05-01 1488
1998-06-01 1506
Name: user_id, dtype: int64
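
The difference between the monthly order count and the monthly customer count can be seen on a tiny example. The sketch below uses an invented table (`toy`) in which user 2 orders twice in the same month; `count()` counts rows while `nunique()` counts distinct users.

```python
import pandas as pd

# Invented data: user 2 places two orders in 1997-01
toy = pd.DataFrame({
    "month": ["1997-01", "1997-01", "1997-01", "1997-02"],
    "user_id": [1, 2, 2, 3],
})

orders_per_month = toy.groupby("month")["user_id"].count()    # rows per month
buyers_per_month = toy.groupby("month")["user_id"].nunique()  # distinct users per month

print(orders_per_month)
print(buyers_per_month)
```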


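### Part 3: Per-user consumption analysis

- Summary statistics of each user's total spending and total number of orders
- Scatter plot of spending versus number of orders per user
- Histogram of each user's total spending (for totals within 1000)
- Histogram of each user's total purchase quantity (for quantities within 100)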

In [37]:

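# Summary statistics of spending and purchases (note: describe() here summarizes the individual order rows, not per-user totals)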
df.describe()


Out[37]:

|       |      user_id | order_product | order_amount |
| ----: | -----------: | ------------: | -----------: |
| count | 69659.000000 |  69659.000000 | 69659.000000 |
|  mean | 11470.854592 |      2.410040 |    35.893648 |
|   std |  6819.904848 |      2.333924 |    36.281942 |
|   min |     1.000000 |      1.000000 |     0.000000 |
|   25% |  5506.000000 |      1.000000 |    14.490000 |
|   50% | 11410.000000 |      2.000000 |    25.980000 |
|   75% | 17273.000000 |      3.000000 |    43.700000 |
|   max | 23570.000000 |     99.000000 |  1286.010000 |
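
Since `df.describe()` summarizes individual orders, per-user statistics need an aggregation step first. The following is a sketch under the assumption that "total spending" means the sum of `order_amount` and "number of orders" means the row count per user; `toy` reuses the first five rows shown earlier, and the column names `total_amount` and `order_count` are made up for this example.

```python
import pandas as pd

# First five rows of the dataset as shown earlier
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "order_product": [1, 1, 5, 2, 2],
    "order_amount": [11.77, 12.00, 77.00, 20.76, 20.76],
})

# Named aggregation: one row per user with total spending and number of orders
per_user = toy.groupby("user_id").agg(
    total_amount=("order_amount", "sum"),
    order_count=("order_amount", "count"),
)

# describe() now summarizes users instead of individual orders
summary = per_user.describe()
print(per_user)
print(summary)
```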

In [39]:

# Scatter plot of spending versus number of orders per user
user_amount = df.groupby(by='user_id')['order_amount'].sum()
user_oder_count = df.groupby(by='user_id')['order_product'].count()


In [41]:

# Draw the scatter plot
plt.scatter(user_oder_count,user_amount)
plt.xlabel('count')
plt.ylabel('amount')


Out[41]:

Text(0,0.5,'amount')

 
In [47]:

# Histogram of each user's total spending (for totals within 1000)
user_amount_1000 = df.groupby(by='user_id').sum().query('order_amount <= 1000')['order_amount']
plt.hist(user_amount_1000)


. . .

In [51]:

# Histogram of each user's total purchase quantity (for quantities within 100)
user_product_count = df.groupby(by='user_id').sum().query('order_product <= 100')['order_product']
plt.hist(user_product_count)


Out[51]:

(array([19543.,  2330.,   830.,   328.,   185.,   116.,    57.,    42.,
           39.,    21.]),
 array([ 1. , 10.7, 20.4, 30.1, 39.8, 49.5, 59.2, 68.9, 78.6, 88.3, 98. ]),
 <a list of 10 Patch objects>)




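### Part 4: User purchasing behaviour

- Distribution by month of users' first purchases, with user counts
  - Plot a line chart
- Distribution over time of users' last purchases, with user counts
  - Plot a line chart
- Share of new versus returning customers
  - One purchase: new user
  - Several purchases: returning user
    - Find each user's first and last purchase dates
      - agg(['func1','func2']): apply the named aggregations to the grouped result
    - Work out the ratio of new to returning customers
- User segmentation
  - Build a table rfm with each user's total purchase quantity, total spending, and most recent purchase date
  - RFM model design
    - R is the time elapsed since the customer's most recent transaction.
      - /np.timedelta64(1,'D'): strips the days unit and leaves a plain number
    - F is the total quantity of products the customer bought. A larger F means the customer trades more often; a smaller F means the customer is less active.
    - M is the customer's transaction amount. A larger M means higher customer value; a smaller M means lower customer value.
    - Add R, F and M to the rfm table
  - Segment users by value into:
    - important value customers
    - important retention customers
    - important win-back customers
    - important development customers
    - general value customers
    - general retention customers
    - general win-back customers
    - general development customers
      - use the existing segmentation function rfm_func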

In [60]:

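# Distribution by month of users' first purchases, with user counts
# Approach: take each user's earliest purchase month, then count users per month
df.groupby(by='user_id')['month'].min().value_counts()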


Out[60]:

1997-02-01 8476
1997-01-01 7846
1997-03-01 7248
Name: month, dtype: int64


In [62]:

df.groupby(by='user_id')['month'].min().value_counts().plot()


Out[62]:

<matplotlib.axes._subplots.AxesSubplot at 0x232e7770e48>




In [65]:

# Distribution by month of users' last purchases, with user counts
df.groupby(by='user_id')['month'].max().value_counts()


. . .

In [66]:

df.groupby(by='user_id')['month'].max().value_counts().plot()


Out[66]:

<matplotlib.axes._subplots.AxesSubplot at 0x232e7a78320>




In [ ]:

# Work out the ratio of new to returning customers


In [69]:

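# New user: the first and last purchase dates are identical, which is taken to mean the user bought only once
# Returning user: anyone else
df_dt_min_max = df.groupby(by='user_id')['order_dt'].agg(['min','max'])  # agg(['func1','func2']) applies several aggregations to the grouped data
df_dt_min_max.head()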


Out[69]:

|         |        min |        max |
| ------: | ---------: | ---------: |
| user_id |            |            |
|       1 | 1997-01-01 | 1997-01-01 |
|       2 | 1997-01-12 | 1997-01-12 |
|       3 | 1997-01-02 | 1998-05-28 |
|       4 | 1997-01-01 | 1997-12-12 |
|       5 | 1997-01-01 | 1998-01-03 |

In [71]:

(df_dt_min_max['min'] == df_dt_min_max['max']).value_counts()


Out[71]:

True 12054
False 11516
dtype: int64
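
The counts above can be turned into a share, and the rule itself has one caveat worth seeing on a small example: a user who orders twice on the same day has identical first and last dates and is therefore still counted as new. The sketch below uses invented rows modelled on users 1 to 3 from the tables above; comparing against an order count is offered as one possible alternative, not as the method used in this analysis.

```python
import pandas as pd

# Invented rows: user 2 orders twice on the same day, user 3 on two different days
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "order_dt": pd.to_datetime(
        ["1997-01-01", "1997-01-12", "1997-01-12", "1997-01-02", "1998-05-28"]
    ),
})

# Rule from the text: first date == last date means "new"
span = toy.groupby("user_id")["order_dt"].agg(["min", "max"])
is_new = span["min"] == span["max"]
new_share = is_new.mean()  # the mean of a boolean Series is the fraction of True values

# Alternative rule: exactly one order row
once_only_share = (toy.groupby("user_id")["order_dt"].count() == 1).mean()

print(new_share, once_only_share)
```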


In [83]:

# Build the table rfm with each user's total purchase quantity, total spending, and most recent (last) purchase date
rfm = df.pivot_table(index='user_id',aggfunc={'order_product':'sum','order_amount':'sum','order_dt':'max'})
rfm.head()


Out[83]:

|         | order_amount |   order_dt | order_product |
| ------: | -----------: | ---------: | ------------: |
| user_id |              |            |               |
|       1 |        11.77 | 1997-01-01 |             1 |
|       2 |        89.00 | 1997-01-12 |             6 |
|       3 |       156.46 | 1998-05-28 |            16 |
|       4 |       100.50 | 1997-12-12 |             7 |
|       5 |       385.61 | 1998-01-03 |            29 |

In [84]:

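# R is the number of days between each customer's last purchase and the latest date in the data
# Dividing by np.timedelta64(1,'D') converts the timedelta to a plain number of days
rfm['R'] = -(rfm['order_dt'] - rfm['order_dt'].max())/np.timedelta64(1,'D')
rfm.head()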


Out[84]:

|         | order_amount |   order_dt | order_product |     R |
| ------: | -----------: | ---------: | ------------: | ----: |
| user_id |              |            |               |       |
|       1 |        11.77 | 1997-01-01 |             1 | 545.0 |
|       2 |        89.00 | 1997-01-12 |             6 | 534.0 |
|       3 |       156.46 | 1998-05-28 |            16 |  33.0 |
|       4 |       100.50 | 1997-12-12 |             7 | 200.0 |
|       5 |       385.61 | 1998-01-03 |            29 | 178.0 |

In [85]:

# Assemble the columns needed for the RFM model
# R: days since the last purchase
# F: total quantity of products purchased
# M: total amount spent
rfm = rfm[['order_amount','order_product','R']]
rfm.rename(columns={'order_amount':'M','order_product':'F'},inplace=True)
rfm.head()


Out[85]:

|         |      M |    F |     R |
| ------: | -----: | ---: | ----: |
| user_id |        |      |       |
|       1 |  11.77 |    1 | 545.0 |
|       2 |  89.00 |    6 | 534.0 |
|       3 | 156.46 |   16 |  33.0 |
|       4 | 100.50 |    7 | 200.0 |
|       5 | 385.61 |   29 | 178.0 |

In [86]:

def rfm_func(x):  # x is one row of mean-centered values, e.g. -94.310426 -6.122656 177.778362
    # level holds three strings, each '0' or '1'
    level = x.map(lambda v: '1' if v >= 0 else '0')
    label = level['R'] + level['F'] + level['M']
    d = {
        '111':'important value customer',
        '011':'important retention customer',
        '101':'important win-back customer',
        '001':'important development customer',
        '110':'general value customer',
        '010':'general retention customer',
        '100':'general win-back customer',
        '000':'general development customer'
    }
    result = d[label]
    return result

# df.apply(func) applies func to each row or column of the df
rfm['label'] = rfm.apply(lambda x : x - x.mean()).apply(rfm_func,axis = 1)
rfm.head()


Out[86]:

|         |      M |    F |     R |                        label |
| ------: | -----: | ---: | ----: | ---------------------------: |
| user_id |        |      |       |                              |
|       1 |  11.77 |    1 | 545.0 |    general win-back customer |
|       2 |  89.00 |    6 | 534.0 |    general win-back customer |
|       3 | 156.46 |   16 |  33.0 | important retention customer |
|       4 | 100.50 |    7 | 200.0 | general development customer |
|       5 | 385.61 |   29 | 178.0 | important retention customer |
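
The row-wise labelling can also be expressed with column operations. The sketch below is an alternative formulation, not the code used above: it builds the same three-character R/F/M code by comparing each column with its mean, on an invented three-user table (`rfm_toy`). The dictionary `names` holds English renderings of the eight segment names, and because the toy means differ from those of the full dataset, the resulting labels differ from the table above.

```python
import pandas as pd

# Invented three-user table with the same column layout as rfm
rfm_toy = pd.DataFrame(
    {"M": [11.77, 89.00, 156.46], "F": [1, 6, 16], "R": [545.0, 534.0, 33.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# English renderings of the segment names (keys are R, F, M flags in that order)
names = {
    "111": "important value customer",
    "011": "important retention customer",
    "101": "important win-back customer",
    "001": "important development customer",
    "110": "general value customer",
    "010": "general retention customer",
    "100": "general win-back customer",
    "000": "general development customer",
}

# '1' where the value is at or above the column mean, otherwise '0'
flags = (rfm_toy >= rfm_toy.mean()).astype(int).astype(str)

# Concatenate the flags column-wise into one code per user, then map to a name
code = flags["R"] + flags["F"] + flags["M"]
labels = code.map(names)
print(pd.concat([rfm_toy, code.rename("code"), labels.rename("label")], axis=1))
```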

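#### User life cycle

- Split users into active users and other users
  - Count each user's orders in every month
  - Record for every user and month whether a purchase was made: 1 if so, otherwise 0
    - Key point: the difference between DataFrame apply and applymap
      - applymap: returns a df
      - it applies a function to every element of the DataFrame
      - apply: returns a Series
      - apply() applies a function to each row or column of the DataFrame
  - Classify each user in each month as:
    - unreg: prospective user (no purchase in the first two months and a first purchase in the third: the user is unreg in the first two months)
    - unactive: after the first purchase, any later month without a purchase marks the user as inactive in that month
    - new: a user making the first purchase in the current month is a new user in that month
    - active: a user buying in consecutive months is active in those months
    - return: the first month in which a user buys again after a gap of n months marks the user as returning in that month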

In [94]:

# Count each user's orders in every month
df_purchase = df.pivot_table(index='user_id',aggfunc='count',values='order_dt',columns='month').fillna(0)
df_purchase.head(5)


Out[94]:

|   month | 1997-01-01 00:00:00 | 1997-02-01 00:00:00 | 1997-03-01 00:00:00 | 1997-04-01 00:00:00 | 1997-05-01 00:00:00 | 1997-06-01 00:00:00 | 1997-07-01 00:00:00 | 1997-08-01 00:00:00 | 1997-09-01 00:00:00 | 1997-10-01 00:00:00 | 1997-11-01 00:00:00 | 1997-12-01 00:00:00 | 1998-01-01 00:00:00 | 1998-02-01 00:00:00 | 1998-03-01 00:00:00 | 1998-04-01 00:00:00 | 1998-05-01 00:00:00 | 1998-06-01 00:00:00 |
| ------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: |
| user_id |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |
|       1 |                 1.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |
|       2 |                 2.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |
|       3 |                 1.0 |                 0.0 |                 1.0 |                 1.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 2.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 1.0 |                 0.0 |
|       4 |                 2.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 1.0 |                 0.0 |                 0.0 |                 0.0 |                 1.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |
|       5 |                 2.0 |                 1.0 |                 0.0 |                 1.0 |                 1.0 |                 1.0 |                 1.0 |                 0.0 |                 1.0 |                 0.0 |                 0.0 |                 2.0 |                 1.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |

In [96]:

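# Record for every user and month whether a purchase was made: 1 if so, otherwise 0
df_purchase = df_purchase.applymap(lambda x:1 if x > 0 else 0)
# df.applymap(func) applies func to every element of the df (newer pandas versions name this DataFrame.map)
df_purchase.head(7)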


Out[96]:

|   month | 1997-01-01 00:00:00 | 1997-02-01 00:00:00 | 1997-03-01 00:00:00 | 1997-04-01 00:00:00 | 1997-05-01 00:00:00 | 1997-06-01 00:00:00 | 1997-07-01 00:00:00 | 1997-08-01 00:00:00 | 1997-09-01 00:00:00 | 1997-10-01 00:00:00 | 1997-11-01 00:00:00 | 1997-12-01 00:00:00 | 1998-01-01 00:00:00 | 1998-02-01 00:00:00 | 1998-03-01 00:00:00 | 1998-04-01 00:00:00 | 1998-05-01 00:00:00 | 1998-06-01 00:00:00 |
| ------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: |
| user_id |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |
|       1 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |
|       2 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |
|       3 |                   1 |                   0 |                   1 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   1 |                   0 |
|       4 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   1 |                   0 |                   0 |                   0 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |
|       5 |                   1 |                   1 |                   0 |                   1 |                   1 |                   1 |                   1 |                   0 |                   1 |                   0 |                   0 |                   1 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |
|       6 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |
|       7 |                   1 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   0 |                   1 |                   0 |                   0 |                   0 |                   0 |                   1 |                   0 |                   0 |                   0 |

In [ ]:

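Classify each user in each month as:
unreg: prospective user (no purchase in the first two months and a first purchase in the third: the user is unreg in the first two months)
unactive: after the first purchase, any later month without a purchase marks the user as inactive in that month
new: a user making the first purchase in the current month is a new user in that month
active: a user buying in consecutive months is active in those months
return: the first month in which a user buys again after a gap of n months marks the user as returning in that month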


In [97]:

# Replace the 0/1 values in df_purchase with new, unactive, ...; the resulting df is stored in pivoted_status
# Fixed algorithm
def active_status(data):  # data is one row of df_purchase (a sequence of 0s and 1s)
    status = []  # this user's status in each month
    for i in range(18):  # 18 monthly columns

        # no purchase this month
        if data.iloc[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                status.append('unreg')

        # a purchase this month
        else:
            if len(status) == 0:
                status.append('new')
            else:
                if status[i-1] == 'unactive':
                    status.append('return')
                elif status[i-1] == 'unreg':
                    status.append('new')
                else:
                    status.append('active')
    # Returning a Series keeps the month labels, so apply() builds a DataFrame
    return pd.Series(status, index=data.index)

pivoted_status = df_purchase.apply(active_status,axis = 1)
pivoted_status.head()


Out[97]:

|   month | 1997-01-01 00:00:00 | 1997-02-01 00:00:00 | 1997-03-01 00:00:00 | 1997-04-01 00:00:00 | 1997-05-01 00:00:00 | 1997-06-01 00:00:00 | 1997-07-01 00:00:00 | 1997-08-01 00:00:00 | 1997-09-01 00:00:00 | 1997-10-01 00:00:00 | 1997-11-01 00:00:00 | 1997-12-01 00:00:00 | 1998-01-01 00:00:00 | 1998-02-01 00:00:00 | 1998-03-01 00:00:00 | 1998-04-01 00:00:00 | 1998-05-01 00:00:00 | 1998-06-01 00:00:00 |
| ------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: |
| user_id |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |                     |
|       1 |                 new |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |
|       2 |                 new |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |
|       3 |                 new |            unactive |              return |              active |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |              return |            unactive |            unactive |            unactive |            unactive |            unactive |              return |            unactive |
|       4 |                 new |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |              return |            unactive |            unactive |            unactive |              return |            unactive |            unactive |            unactive |            unactive |            unactive |            unactive |
|       5 |                 new |              active |            unactive |              return |              active |              active |              active |            unactive |              return |            unactive |            unactive |              return |              active |            unactive |            unactive |            unactive |            unactive |            unactive |

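- Count the users in each status for every month
  - purchase_status_ct = df_purchase_new.apply(lambda x : pd.value_counts(x)).fillna(0), where df_purchase_new is the status table (named pivoted_status in the code)
  - Transpose the result for the final view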

In [99]:

pivoted_status.apply(lambda x:pd.value_counts(x)).fillna(0)


Out[99]:

|    month | 1997-01-01 00:00:00 | 1997-02-01 00:00:00 | 1997-03-01 00:00:00 | 1997-04-01 00:00:00 | 1997-05-01 00:00:00 | 1997-06-01 00:00:00 | 1997-07-01 00:00:00 | 1997-08-01 00:00:00 | 1997-09-01 00:00:00 | 1997-10-01 00:00:00 | 1997-11-01 00:00:00 | 1997-12-01 00:00:00 | 1998-01-01 00:00:00 | 1998-02-01 00:00:00 | 1998-03-01 00:00:00 | 1998-04-01 00:00:00 | 1998-05-01 00:00:00 | 1998-06-01 00:00:00 |
| -------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: | ------------------: |
|   active |                 0.0 |              1157.0 |              1681.0 |              1773.0 |               852.0 |               747.0 |               746.0 |               604.0 |               528.0 |               532.0 |               624.0 |               632.0 |               512.0 |               472.0 |               571.0 |               518.0 |               459.0 |               446.0 |
|      new |              7846.0 |              8476.0 |              7248.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |
|   return |                 0.0 |                 0.0 |               595.0 |              1049.0 |              1362.0 |              1592.0 |              1434.0 |              1168.0 |              1211.0 |              1307.0 |              1404.0 |              1232.0 |              1025.0 |              1079.0 |              1489.0 |               919.0 |              1029.0 |              1060.0 |
| unactive |                 0.0 |              6689.0 |             14046.0 |             20748.0 |             21356.0 |             21231.0 |             21390.0 |             21798.0 |             21831.0 |             21731.0 |             21542.0 |             21706.0 |             22033.0 |             22019.0 |             21510.0 |             22133.0 |             22082.0 |             22064.0 |
|    unreg |             15724.0 |              7248.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |                 0.0 |

In [100]:

# Transpose so that each row is a month, then view the result
pivoted_status.apply(lambda x: x.value_counts()).fillna(0).T


Out[100]:

|      month | active |    new | return | unactive |   unreg |
| ---------: | -----: | -----: | -----: | -------: | ------: |
| 1997-01-01 |    0.0 | 7846.0 |    0.0 |      0.0 | 15724.0 |
| 1997-02-01 | 1157.0 | 8476.0 |    0.0 |   6689.0 |  7248.0 |
| 1997-03-01 | 1681.0 | 7248.0 |  595.0 |  14046.0 |     0.0 |
| 1997-04-01 | 1773.0 |    0.0 | 1049.0 |  20748.0 |     0.0 |
| 1997-05-01 |  852.0 |    0.0 | 1362.0 |  21356.0 |     0.0 |
| 1997-06-01 |  747.0 |    0.0 | 1592.0 |  21231.0 |     0.0 |
| 1997-07-01 |  746.0 |    0.0 | 1434.0 |  21390.0 |     0.0 |
| 1997-08-01 |  604.0 |    0.0 | 1168.0 |  21798.0 |     0.0 |
| 1997-09-01 |  528.0 |    0.0 | 1211.0 |  21831.0 |     0.0 |
| 1997-10-01 |  532.0 |    0.0 | 1307.0 |  21731.0 |     0.0 |
| 1997-11-01 |  624.0 |    0.0 | 1404.0 |  21542.0 |     0.0 |
| 1997-12-01 |  632.0 |    0.0 | 1232.0 |  21706.0 |     0.0 |
| 1998-01-01 |  512.0 |    0.0 | 1025.0 |  22033.0 |     0.0 |
| 1998-02-01 |  472.0 |    0.0 | 1079.0 |  22019.0 |     0.0 |
| 1998-03-01 |  571.0 |    0.0 | 1489.0 |  21510.0 |     0.0 |
| 1998-04-01 |  518.0 |    0.0 |  919.0 |  22133.0 |     0.0 |
| 1998-05-01 |  459.0 |    0.0 | 1029.0 |  22082.0 |     0.0 |
| 1998-06-01 |  446.0 |    0.0 | 1060.0 |  22064.0 |     0.0 |
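
The count-then-transpose step above can be tried on a small table. The sketch below is only an illustration: `toy_status` and its values are made up and are not the article's data, but it has the same shape as `pivoted_status` (one row per user, one column per month, a status label in each cell).

```python
import pandas as pd

# Hypothetical toy data (not the article's dataset):
# rows are users, columns are months, values are status labels.
toy_status = pd.DataFrame(
    {
        "1997-01": ["new", "unreg", "new"],
        "1997-02": ["active", "new", "unactive"],
        "1997-03": ["unactive", "active", "return"],
    }
)

# Count every status inside each month column.
# A status that does not occur in a month yields NaN, which fillna(0) turns into 0.
status_counts = toy_status.apply(lambda col: col.value_counts()).fillna(0)

# Transpose so that each row is a month and each column is a status.
status_by_month = status_counts.T
print(status_by_month)
```

Each row of `status_by_month` sums to the number of users, because every user has exactly one status per month. The same check can be applied to the full table as a quick sanity test.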