Python Data Analysis: Introduction and Practice (Notes)

mysteryflower

Last modified: 2022-02-24 09:39:22

First published: 2022-02-16 13:01:11

Article link: https://blog.csdn.net/mysteryflower/article/details/122942349

Chapter 1: Setting Up the Lab Environment

This chapter introduces Anaconda and Jupyter Notebook: how to install Anaconda on Windows, Mac, and Linux, and how to start and use Jupyter Notebook.

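1-1 Introductory video

Data science and machine learning

Data science workflow

Course outline:

Chapter 1: Setting up the lab environment
Chapter 2: Getting started with Numpy
Chapter 3: Getting started with Pandas
Chapter 4: Working with data in Pandas
Chapter 5: Plotting and visualization with Matplotlib
Chapter 6: Plotting and visualization with Seaborn
Chapter 7: Hands-on data analysis project
Chapter 8: Summary

Who this course is for:

Learners with some ability to study independently and work hands-on
Learners with basic Python knowledge
Learners who want to work in data analysis or machine learning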

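1-2 Introduction to Anaconda and Jupyter Notebook

Anaconda/Jupyter notebook: Open Data Science Platform

What is Anaconda?

The best-known Python data science platform
750+ popular Python and R packages
Cross-platform: Windows, Mac, Linux
conda: an extensible package manager
Free to distribute
A very active community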

Installing Anaconda

Download links

Current: https://www.anaconda.com/products/individual
Previously: https://www.anaconda.com/download/

Check that the installation is correct:

cd ~/anaconda
bin/conda --version
conda 4.3.21

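Conda: package and environment management

Install packages
Update packages
Create sandboxes: conda environments

Environment management with conda
Create a new environment

conda create --name python34 python=3.4

Activate an environment

activate python34 # for Windows
source activate python34 # for Linux & Mac
# conda 4.4 and later: conda activate python34 (all platforms)

Deactivate an environment

deactivate python34 # for Windows
source deactivate python34 # for Linux & Mac
# conda 4.4 and later: conda deactivate (all platforms)

Remove an environment

conda remove --name python34 --all

Package management with conda
Conda's package management is somewhat similar to pip
Install a Python package

conda install numpy

List installed Python packages

conda list
conda list -n python34 # list the packages installed in a specific environment

Remove a Python package

conda remove --name python34 numpy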

Data Science IDE vs Developer IDE

Data Science IDEs in Anaconda

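From IPython to Jupyter

What is IPython?

A powerful interactive shell
The kernel behind Jupyter
Supports interactive data analysis and visualization

IPython kernel

Mainly responsible for running user code
Communicates with the IPython shell over stdin/stdout
Communicates with the notebook using JSON messages over ZeroMQ

What is Jupyter Notebook?

Formerly IPython Notebook
An open-source web application
Lets you create and share documents containing code, visualizations, and notes
Used for statistics, data analysis, modeling, machine learning, and related fields

Interaction between the notebook and the kernel

The Notebook server is the core
The Notebook server loads and saves notebooks

Notebook file format (.ipynb)

A JSON-based format defined by IPython Notebook
Can read online data and CSV/XLS files
Can be converted to other formats (py, html, pdf, md, etc.)

NBViewer

An online viewer for notebooks in ipynb format
Notebooks can be shared by URL
GitHub integrates NBViewer
Easy to integrate into blogs, emails, wikis, and books via converters

Lab environment for this course

Install Anaconda on Windows/Mac/Linux
Use Python 3.6 as the base environment
Use Jupyter Notebook as the programming IDE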

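1-3 Installing Anaconda on Mac (demo)

Download the macOS installer, Python 3.6+ 64-bit (as of 2022-02-15 the installer ships Python 3.9)
Anaconda3-2021.11-MacOSX-x86_64.pkg
Choose "Install for me only" and keep the defaults for nearly everything else
Changing the install directory is not recommended (the install needs 1.44 GB)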

~] ls
~] pwd
~] cd anaconda/
anaconda] ls
anaconda] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./conda list
bin] ./jupyter notebook # opens the browser

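1-4 Installing Anaconda on Windows (demo)

Download the Windows installer, Python 3.6+ 64-bit (as of 2022-02-15 the installer ships Python 3.9)
Anaconda3-2021.11-Windows-x86_64.exe
Choose "Just Me (recommended)" and keep the defaults for nearly everything else
The installed Anaconda3 then appears in the Start menu
Open Jupyter Notebook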

1-5 Installing Anaconda on Linux (demo)

Download the Linux installer, Python 3.6+ 64-bit (as of 2022-02-15 the installer ships Python 3.9)
Copy the installer link

~] wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
~] ls
Anaconda3-2021.11-Linux-x86_64.sh
~] ls -lh
~] sh Anaconda3-2021.11-Linux-x86_64.sh # accept the default options
~] pwd
/home/centos
~] cd anaconda3
anaconda3] ls
anaconda3] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./jupyter notebook --no-browser # copy the URL it prints

On the local terminal

~ ssh -N -f -L localhost:888:localhost:8888 gitlab-demo-ci
~ ssh -N -f -L localhost:888:localhost:8888 root@gitlab-demo-ci

Open a browser and paste the copied URL

!ifconfig  # runs the Linux ifconfig command from a notebook cell

1-6 Using Jupyter Notebook (demo)

cd anaconda3
cd jupyter-notebook/python-data-science
python-data-science git:(master) ls
README.md    demo.ipynb
python-data-science git:(master) xx/bin/jupyter notebook # starts the notebook

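Chapter 2: Getting Started with Numpy

This chapter introduces Numpy, the most fundamental library in the Python data science ecosystem. It reviews the basics of matrix operations, introduces the key data structure, Array, and shows how to perform array and matrix operations with Numpy.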

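2-1 Five commonly used Python libraries for data science

Numpy
Scipy
Pandas
Matplotlib
Scikit-learn

Numpy

N-dimensional arrays (matrices); fast and efficient; vectorized arithmetic
Efficient indexing, with no explicit loops needed
Open source, free, and cross-platform; performance comparable to C/Matlab
NumPy Chinese documentation

Scipy

Depends on Numpy
Designed for science and engineering
Implements many common scientific computations: linear algebra, Fourier transforms, signal and image processing

Pandas

A powerful tool for structured data analysis (depends on Numpy)
Provides several high-level data structures: Time-Series, DataFrame, Panel (Panel was removed in pandas 0.25)
Powerful data indexing and processing

Matplotlib

The most widely used Python package for 2D plotting
Can largely replace Matlab's plotting features (scatter plots, curves, bar charts, etc.)
Can draw polished 3D plots through mplot3d

Scikit-learn

A Python module for machine learning
Built on Scipy; provides common machine learning algorithms such as clustering and regression
A simple, easy-to-learn API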

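2-2 Math review: matrix operations

Basic concepts

Matrix: a rectangular arrangement of data, i.e. a two-dimensional array. Vectors and scalars are special cases of matrices
Vector: a 1xn or nx1 matrix
Scalar: a 1x1 matrix
Array: an N-dimensional array, a generalization of the matrix

Special matrices

All-zeros and all-ones matrices

Identity matrix

Matrix addition and subtraction

The two matrices being added or subtracted must have the same number of rows and columns
Elements at the same row and column are added or subtracted

Array multiplication (element-wise)

Array multiplication (element-wise) multiplies the elements at corresponding positions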
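A minimal sketch of element-wise multiplication on two small arrays. The values are made up for illustration and are not from the course material.

```python
import numpy as np

# Two arrays with the same shape (hypothetical example values).
a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# For ndarrays, * multiplies the entries that share the same position.
elementwise = a * b
print(elementwise)
```

Both operands need the same shape (or shapes that NumPy can broadcast), otherwise the multiplication fails.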

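Matrix multiplication

Let A be an mxp matrix and B a pxn matrix. The mxn matrix C is the product of A and B, written C=AB, where the element in row i and column j of C is: $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$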
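The definition of C=AB can be checked numerically. The sketch below builds each entry of C as the sum over k of A[i, k] * B[k, j] and compares the result with NumPy's `@` operator. The helper name `matmul_by_definition` and the sample matrices are illustrative assumptions.

```python
import numpy as np

def matmul_by_definition(A, B):
    """Multiply an (m x p) array by a (p x n) array using explicit loops."""
    m, p = A.shape
    p2, n = B.shape
    if p != p2:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                # c_ij accumulates a_ik * b_kj over k
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1, 2, 3], [4, 5, 6]])    # 2x3
B = np.array([[1, 0], [0, 1], [2, 2]])  # 3x2
C = matmul_by_definition(A, B)
print(C)
```

The loops are only for illustration; in practice `A @ B` computes the same product far faster.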

Other linear algebra topics

Linear Algebra, published by Tsinghua University Press
http://bs.szu.edu.cn/sljr/Up/day_110824/201108240409437708.pdf

2-3 Creating and accessing arrays

In Jupyter Notebook, create a new file: Array.ipynb

# Creating and accessing arrays
import numpy as np
# create from python list
list_1 = [1, 2, 3, 4]
list_1
#  [1, 2, 3, 4]

array_1 = np.array(list_1)
array_1
# array([1, 2, 3, 4])

list_2 = [5, 6, 7, 8]
array_2 = np.array([list_1,list_2])
array_2
# array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

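array_2.shape  # array dimensions (n, m): n rows, m columns
# (2, 4)

# Additional attributes
# array_2.ndim - the number of axes (dimensions) of the array. In the Python world, the number of dimensions is referred to as rank.
# array_2.itemsize - the size in bytes of each element of the array
# array_2.data - the buffer containing the actual elements of the array


array_2.size  # total number of elements, equal to the product of the entries of shape
# 8

array_2.dtype  # an object describing the type of the elements
# dtype('int32') or dtype('int64'), depending on the platform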

array_3 = np.array([[1.0,2,3],[4.0,5,6]])
array_3.dtype
# dtype('float64')

array_4 = np.arange(1,10)
array_4
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])

array_4 = np.arange(1, 10, 2)
array_4
# array([1, 3, 5, 7, 9])

np.zeros(5)
# array([0., 0., 0., 0., 0.])    # array of zeros

np.zeros([2,3])  # two-dimensional zero matrix with 2 rows and 3 columns
# array([[0., 0., 0.],
       [0., 0., 0.]])

np.eye(5)    # identity matrix with n=5
# array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

np.eye(5).dtype
# dtype('float64')

a = np.arange(1,10)
a
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])

a[1]    
# 2 (the second element of the array)

a[1:5]    
# array([2, 3, 4, 5]) the second through fifth elements of the array

b = np.array([[1,2,3],[4,5,6]])
b
# array([[1, 2, 3],
       [4, 5, 6]])

b[1][0]
# 4

b[1,0]
# 4

c = np.array([[1,2,3],[4,5,6],[7,8,9]])
c
# array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
c[:2,1:]
# array([[2, 3],
       [5, 6]])
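The additional attributes mentioned in the comments above (`ndim`, `itemsize`) can be inspected together with `shape` and `size`. This is a small sketch; the dtype is fixed to `int64` here so the byte counts do not depend on the platform.

```python
import numpy as np

# Same values as array_2 above, with an explicit dtype (an assumption made
# here so that itemsize is the same on every platform).
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.int64)

info = {
    "shape": arr.shape,        # (rows, columns)
    "ndim": arr.ndim,          # number of axes
    "size": arr.size,          # total number of elements
    "itemsize": arr.itemsize,  # bytes per element
    "nbytes": arr.nbytes,      # size * itemsize
}
print(info)
```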

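2-4 Array and matrix operations

In Jupyter Notebook, create a new file: 数组与矩阵运算.ipynb

# Creating arrays quickly
import numpy as np
np.random.randn(10)        # 1-D array of 10 floats drawn from the standard normal distribution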
# array([ 0.26674666, -0.91111093,  0.30684449, -0.80206634, -0.89176532,
        0.7950014 , -0.53259808, -0.09981816,  1.2960139 , -0.9668373 ])
np.random.randint(10)    # 0
np.random.randint(10,size=(2,3))    # 2x3 two-dimensional array with integer elements in [0, 9]
# array([[7, 5, 8],
       [1, 5, 8]])
np.random.randint(10,size=20)        # 1-D array of 20 integer elements in [0, 9]
# array([5, 6, 4, 8, 0, 9, 6, 2, 2, 9, 2, 1, 4, 6, 1, 5, 8, 2, 3, 4])
np.random.randint(10,size=20).reshape(4,5)    # reshape a 20-element 1-D array into a 4x5 two-dimensional array, elements in [0, 9]
# array([[7, 1, 0, 5, 7],
       [8, 0, 3, 7, 9],
       [9, 0, 7, 3, 2],
       [9, 1, 5, 8, 7]])

# Array operations
a = np.random.randint(10,size=20).reshape(4,5)
b = np.random.randint(10,size=20).reshape(4,5)
a
# array([[2, 3, 8, 4, 8],
       [0, 7, 9, 9, 9],
       [1, 8, 1, 8, 6],
       [3, 4, 7, 5, 1]])
b
# array([[8, 4, 3, 1, 6],
       [4, 4, 6, 2, 9],
       [9, 4, 8, 5, 8],
       [6, 2, 5, 5, 8]])
a + b
# array([[10,  7, 11,  5, 14],
       [ 4, 11, 15, 11, 18],
       [10, 12,  9, 13, 14],
       [ 9,  6, 12, 10,  9]])
a - b
# array([[-6, -1,  5,  3,  2],
       [-4,  3,  3,  7,  0],
       [-8,  4, -7,  3, -2],
       [-3,  2,  2,  0, -7]])
a * b
# array([[16, 12, 24,  4, 48],
       [ 0, 28, 54, 18, 81],
       [ 9, 32,  8, 40, 48],
       [18,  8, 35, 25,  8]])
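a / b
# Does not raise an exception: where b contains 0, NumPy emits a RuntimeWarning and the result holds inf (or nan for 0/0)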
array([[0.25      , 0.75      , 2.66666667, 4.        , 1.33333333],
       [0.        , 1.75      , 1.5       , 4.5       , 1.        ],
       [0.11111111, 2.        , 0.125     , 1.6       , 0.75      ],
       [0.5       , 2.        , 1.4       , 1.        , 0.125     ]])
np.mat([[1,2,3],[4,5,6]])
# matrix([[1, 2, 3],
        [4, 5, 6]])
a
# array([[2, 3, 8, 4, 8],
       [0, 7, 9, 9, 9],
       [1, 8, 1, 8, 6],
       [3, 4, 7, 5, 1]])
np.mat(a)
# 
matrix([[2, 3, 8, 4, 8],
        [0, 7, 9, 9, 9],
        [1, 8, 1, 8, 6],
        [3, 4, 7, 5, 1]])
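Regarding the `a / b` step above: a small sketch of how division by zero behaves for integer arrays, and one possible way to avoid the `inf` entries with `np.divide` and its `where` argument. The arrays are hypothetical, and filling the undefined positions with 0 is an arbitrary choice made for this example.

```python
import numpy as np

a = np.array([[2, 3], [0, 7]])
b = np.array([[8, 0], [4, 0]])

# Plain division: entries with a zero divisor become inf.
# errstate only silences the RuntimeWarning for this block.
with np.errstate(divide="ignore", invalid="ignore"):
    raw = a / b

# Divide only where b is non-zero; other positions keep the value from `out`.
safe = np.divide(a, b, out=np.zeros(a.shape, dtype=float), where=b != 0)
print(raw)
print(safe)
```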

# Matrix operations
A = np.mat(a)
B = np.mat(b)
A
# matrix([[2, 3, 8, 4, 8],
        [0, 7, 9, 9, 9],
        [1, 8, 1, 8, 6],
        [3, 4, 7, 5, 1]])
B
# matrix([[8, 4, 3, 1, 6],
        [4, 4, 6, 2, 9],
        [9, 4, 8, 5, 8],
        [6, 2, 5, 5, 8]])
A + B
# matrix([[10,  7, 11,  5, 14],
        [ 4, 11, 15, 11, 18],
        [10, 12,  9, 13, 14],
        [ 9,  6, 12, 10,  9]])
A - B
# matrix([[-6, -1,  5,  3,  2],
        [-4,  3,  3,  7,  0],
        [-8,  4, -7,  3, -2],
        [-3,  2,  2,  0, -7]])
A * B    # raises ValueError: for matrices * is matrix multiplication, and A has 5 columns while B has 4 rows

a = np.mat(np.random.randint(10,size=20).reshape(4,5))
b = np.mat(np.random.randint(10,size=20).reshape(5,4))
a
# matrix([[9, 9, 3, 0, 5],
        [9, 4, 6, 4, 5],
        [9, 0, 7, 0, 9],
        [7, 2, 6, 0, 6]])
b
# matrix([[2, 2, 6, 4],
        [8, 9, 8, 0],
        [2, 1, 3, 9],
        [3, 1, 0, 2],
        [9, 3, 1, 4]])    
a * b
# matrix([[141, 117, 140,  83],
        [119,  79, 109, 118],
        [113,  52,  84, 135],
        [ 96,  56,  82, 106]])
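The notes use `np.mat` for matrix products. As an alternative, the same product can be computed on plain ndarrays with the `@` operator, which the NumPy documentation recommends over the matrix class. The sketch below uses a seeded generator so the run is repeatable; the seed and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded for repeatability
a = rng.integers(0, 10, size=(4, 5))     # 4x5 integers in [0, 9]
b = rng.integers(0, 10, size=(5, 4))     # 5x4 integers in [0, 9]

product = a @ b          # matrix product, result is 4x4
elementwise = a * a      # on ndarrays, * stays element-wise
print(product.shape, type(product).__name__)
```

With ndarrays, `*` always means element-wise multiplication and `@` always means the matrix product, so the two cannot be confused.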

# Common array functions
a = np.random.randint(10,size=20).reshape(4,5)
np.unique(a)    # the distinct values among all elements of a
# array([0, 1, 2, 3, 4, 5, 6, 8, 9])
a
# array([[4, 2, 8, 4, 2],
       [6, 9, 6, 4, 0],
       [9, 2, 6, 9, 0],
       [1, 3, 8, 5, 9]])
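sum(a)        # the built-in sum adds the rows together, giving the sum of each column
# array([20, 16, 28, 22, 11])
sum(a[0])    # sum of the first row of a
# 20
sum(a[:,0])    # sum of the first column of a
# 20
a.max()        # largest value in a
# 9
max(a[0])    # largest value in the first row of a
# 8
max(a[:,0])    # largest value in the first column of a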
# 9
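The built-in `sum` and `max` used above work row by row. NumPy's own methods take an `axis` argument that states the direction explicitly. A short sketch using the same array `a` as in the output above:

```python
import numpy as np

a = np.array([[4, 2, 8, 4, 2],
              [6, 9, 6, 4, 0],
              [9, 2, 6, 9, 0],
              [1, 3, 8, 5, 9]])

total = a.sum()            # sum of every element
col_sums = a.sum(axis=0)   # collapse the rows: one sum per column
row_sums = a.sum(axis=1)   # collapse the columns: one sum per row
row_max = a.max(axis=1)    # largest value in each row
print(total, col_sums, row_sums, row_max)
```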

2-5 Array input and output

In Jupyter Notebook, create a new file: Array的input和output.ipynb

# Serializing a numpy array with pickle
import pickle
import numpy as np
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
f = open('x.pk1','wb')
pickle.dump(x, f)
f.close()    # close the file so the data is flushed to disk
!ls        # on Windows, use !dir
# Array.ipynb            Array的input和output.ipynb
  x.pk1                    数组与矩阵运算.ipynb
f = open('x.pk1','rb')
pickle.load(f)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.save('one_array', x)
!ls
# Array.ipynb            Array的input和output.ipynb
  x.pk1                    one_array.npy
  数组与矩阵运算.ipynb
np.load('one_array.npy')
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.arange(20)
y
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
np.savez('two_array.npz', a=x, b=y)
!ls
# Array.ipynb                        two_array.npz
  Array的input和output.ipynb        x.pk1
  one_array.npy                        数组与矩阵运算.ipynb
np.load('two_array.npz')
# <numpy.lib.npyio.NpzFile at 0x17033c77df0>
c = np.load('two_array.npz')
c['a']
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
c['b']
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
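The same round trips can be written with `with` blocks so that file handles are closed automatically. This is a sketch under assumptions: it writes into a temporary directory instead of the notebook folder, and the file names are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np

x = np.arange(10)
y = np.arange(20)

with tempfile.TemporaryDirectory() as tmp:
    # pickle round trip; each file is closed when its with block ends
    pkl_path = os.path.join(tmp, "x.pkl")
    with open(pkl_path, "wb") as f:
        pickle.dump(x, f)
    with open(pkl_path, "rb") as f:
        x_restored = pickle.load(f)

    # several arrays in one .npz archive, accessed by keyword name
    npz_path = os.path.join(tmp, "two_array.npz")
    np.savez(npz_path, a=x, b=y)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        a_restored = archive["a"]
        b_restored = archive["b"]

print(names, x_restored, b_restored.shape)
```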

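SciPy documentation

Current: https://docs.scipy.org/doc/scipy/getting_started.html
Previously: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

Chapter 3: Getting Started with Pandas

This chapter introduces Pandas, the most important data analysis library in the Python data science ecosystem. It starts with the two most important pandas data structures, Series and DataFrame, covers how to create them and their basic operations, and uses hands-on practice to clarify how Series and DataFrame relate.

3-1 Pandas Series

In Jupyter Notebook, create a new file: Series.ipynb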

import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1
# 0    1
  1    2
  2    3
  3    4
  dtype: int64
s1.values
# array([1, 2, 3, 4], dtype=int64)
s1.index
# RangeIndex(start=0, stop=4, step=1)
s2 = pd.Series(np.arange(10))
s2            # the dtype is int32 or int64 depending on the platform
# 0    0
  1    1
  2    2
  3    3
  4    4
  5    5
  6    6
  7    7
  8    8
  9    9
  dtype: int32
s3 = pd.Series({'1':1, '2':2, '3':3})
s3
# 1    1
  2    2
  3    3
  dtype: int64
s3.values
# array([1, 2, 3], dtype=int64)
s3.index
# Index(['1', '2', '3'], dtype='object')
s4 = pd.Series([1,2,3,4],index=['A','B','C','D'])
s4
# A    1
  B    2
  C    3
  D    4
  dtype: int64
s4.values
# array([1, 2, 3, 4], dtype=int64)
s4.index
# Index(['A', 'B', 'C', 'D'], dtype='object')
s4['A']
# 1
s4[s4>2]
# C    3
  D    4
  dtype: int64
s4
# A    1
  B    2
  C    3
  D    4
  dtype: int64
s4.to_dict()
# {'A': 1, 'B': 2, 'C': 3, 'D': 4} 
s5 = pd.Series(s4.to_dict())
s5
# A    1
  B    2
  C    3
  D    4
  dtype: int64
index_1 = ['A', 'B', 'C', 'D','E']
s6 = pd.Series(s5,index=index_1)
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  dtype: float64
pd.isnull(s6)
# A    False
  B    False
  C    False
  D    False
  E     True
dtype: bool
pd.notnull(s6)
# A     True
  B     True
  C     True
  D     True
  E    False
  dtype: bool
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  dtype: float64
s6.name = 'demo'
s6
# A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  Name: demo, dtype: float64
s6.index.name = 'demo index'
s6
# demo index
  A    1.0
  B    2.0
  C    3.0
  D    4.0
  E    NaN
  Name: demo, dtype: float64
s6.index
# Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
s6.values
# array([ 1.,  2.,  3.,  4., nan])
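The step from `s5` to `s6` above extends the index, which introduces a `NaN` and turns the dtype into `float64`. The sketch below reproduces that with `reindex` and then fills the gap. Filling with 0 is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Reindexing onto a longer index adds the missing label "E" with NaN.
s_ext = s.reindex(["A", "B", "C", "D", "E"])
missing = s_ext.isnull()

# Replace the NaN and convert back to integers.
filled = s_ext.fillna(0).astype("int64")
print(s_ext.dtype, int(missing.sum()), filled.tolist())
```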

3-2 Pandas DataFrame

In Jupyter Notebook, create a new file: DataFrame.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import webbrowser
link = 'https://www.tiobe.com/tiobe-index/'
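webbrowser.open(link)        # opens the link in the browser
True
df = pd.read_clipboard()    # first copy 10 rows of the table on the page, including the header row
# read_clipboard splits on whitespace by default, so the "Programming Language" header becomes two columns and most values shift one column to the left
df
# Output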
Position    Programming    Language    Ratings
0    21    SAS    0.66%    None
1    22    Scratch    0.64%    None
2    23    Fortran    0.58%    None
3    24    Rust    0.54%    None
4    25    (Visual)    FoxPro    0.52%
5    26    COBOL    0.42%    None
6    27    Dart    0.42%    None
7    28    Kotlin    0.41%    None
8    29    Lua    0.40%    None
9    30    Julia    0.40%    None

type(df)
# pandas.core.frame.DataFrame
df.columns
# Index(['Position', 'Programming', 'Language', 'Ratings'], dtype='object')
df.Ratings
#
0     None
1     None
2     None
3     None
4    0.52%
5     None
6     None
7     None
8     None
9     None
Name: Ratings, dtype: object

df_new = DataFrame(df,columns=['Programming','Language'])
df_new
# Output
Programming    Language
0    SAS    0.66%
1    Scratch    0.64%
2    Fortran    0.58%
3    Rust    0.54%
4    (Visual)    FoxPro
5    COBOL    0.42%
6    Dart    0.42%
7    Kotlin    0.41%
8    Lua    0.40%
9    Julia    0.40%

df['Position']
#
0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: Position, dtype: int64

type(df['Position'])
pandas.core.series.Series
df_new = DataFrame(df,columns=['Programming','Language','Language1'])
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    NaN
2    Fortran    0.58%    NaN
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN

# Three ways to fill the column
df_new['Language1'] = range(0,10)
# df_new['Language1'] = np.arange(0,10)
# df_new['Language1'] = pd.Series(np.arange(0,10))
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    0
1    Scratch    0.64%    1
2    Fortran    0.58%    2
3    Rust    0.54%    3
4    (Visual)    FoxPro    4
5    COBOL    0.42%    5
6    Dart    0.42%    6
7    Kotlin    0.41%    7
8    Lua    0.40%    8
9    Julia    0.40%    9

df_new['Language1'] = pd.Series([100,200], index=[1,2])
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    100.0
2    Fortran    0.58%    200.0
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN
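The last assignment above shows that a Series assigned to a column is aligned by index label, and rows without a matching label get `NaN`. A minimal sketch of that behaviour on a smaller, hypothetical frame, contrasted with assigning a plain list of the wrong length:

```python
import pandas as pd

df = pd.DataFrame({"lang": ["SAS", "Scratch", "Fortran", "Rust"]})

# A Series is matched on index labels 1 and 2; rows 0 and 3 become NaN.
df["score"] = pd.Series([100, 200], index=[1, 2])

# A plain list has no labels, so its length must match the frame.
try:
    df["bad"] = [1, 2]
    list_error = None
except ValueError as exc:
    list_error = str(exc)

print(df)
print(list_error)
```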

3-3 Understanding Series and DataFrame in depth

In Jupyter Notebook, create a new file: 深入理解Series和Dataframe.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

data = {'Country':['Belgium', 'India', 'Brazil'],
       'Capital':['Brussels','New Delhi', 'Brasilia'],
       'Population':[11190846, 1303171035, 207847528]}

#Series
s1 = pd.Series(data['Country'])
s1
# Output
0    Belgium
1      India
2     Brazil
dtype: object

s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# RangeIndex(start=0, stop=3, step=1)
s1 = pd.Series(data['Country'],index=['A','B','C'])
s1
# Output
A    Belgium
B      India
C     Brazil
dtype: object

s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# Index(['A', 'B', 'C'], dtype='object')

#Dataframe
df1 = pd.DataFrame(data)
df1
# Output
    Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df1['Country']
# Output
0    Belgium
1      India
2     Brazil
Name: Country, dtype: object

cou = df1['Country']
type(cou)
# pandas.core.series.Series
df1.iterrows()
# <generator object DataFrame.iterrows at 0x0000018DD44C59E0>

for row in df1.iterrows():
    print(row)
    print(type(row))
    print(len(row))
# Output
(0, Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object)
<class 'tuple'>
2
(1, Country            India
Capital        New Delhi
Population    1303171035
Name: 1, dtype: object)
<class 'tuple'>
2
(2, Country          Brazil
Capital        Brasilia
Population    207847528
Name: 2, dtype: object)
<class 'tuple'>
2

for row in df1.iterrows():
    print(type(row[0]),row[0],row[1])
    break
# Output
<class 'int'> 0 Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object

# Note: some pandas versions report the index label as <class 'numpy.int64'> instead of <class 'int'>


for row in df1.iterrows():
    print(type(row[0]),type(row[1]))
    break
# Output
<class 'int'> <class 'pandas.core.series.Series'>

# Note: the first type may be shown as <class 'numpy.int64'> in some pandas versions
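Since `iterrows()` yields `(index label, row Series)` tuples, the loop can unpack both parts directly. The sketch below rebuilds the same three-country frame and also shows `itertuples()`, which is a commonly used alternative. The variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasilia"],
    "Population": [11190846, 1303171035, 207847528],
})

# Unpack each tuple into the index label and the row (a Series).
capitals = {}
for idx, row in df1.iterrows():
    capitals[row["Country"]] = row["Capital"]

# itertuples() yields namedtuples, with attribute access per column.
populations = [t.Population for t in df1.itertuples(index=False)]

first_idx, first_row = next(df1.iterrows())
print(first_idx, type(first_row).__name__, capitals, populations)
```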


df1
# Output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

data
# Output
{'Country': ['Belgium', 'India', 'Brazil'],
 'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
 'Population': [11190846, 1303171035, 207847528]}


s1 = pd.Series(data['Country'])
s2 = pd.Series(data['Capital'])
s3 = pd.Series(data['Population'])
df_new = pd.DataFrame([s1,s2,s3])
df_new
# Output
    0    1    2
0    Belgium    India    Brazil
1    Brussels    New Delhi    Brasilia
2    11190846    1303171035    207847528

df1
# Output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df_new = df_new.T
df_new
# Output
    0    1    2
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528

df_new = pd.DataFrame([s1,s2,s3], index=['Country','Capital','Population'])
df_new
# Output
        0    1    2
Country    Belgium    India    Brazil
Capital    Brussels    New Delhi    Brasilia
Population    11190846    1303171035    207847528

df_new = df_new.T
df_new
# Output
    Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528
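The list-of-Series approach above treats each Series as a row, which is why a transpose is needed. As a sketch of an alternative, named Series can be placed side by side as columns with `pd.concat(axis=1)` or a dict. The names given to the Series here are assumptions that mirror the keys of `data`.

```python
import pandas as pd

s1 = pd.Series(["Belgium", "India", "Brazil"], name="Country")
s2 = pd.Series(["Brussels", "New Delhi", "Brasilia"], name="Capital")
s3 = pd.Series([11190846, 1303171035, 207847528], name="Population")

# Each Series becomes one column; no transpose is needed.
df_concat = pd.concat([s1, s2, s3], axis=1)
df_dict = pd.DataFrame({"Country": s1, "Capital": s2, "Population": s3})

# For comparison: the row-wise construction followed by a transpose.
via_transpose = pd.DataFrame([s1, s2, s3]).T

print(df_concat.dtypes)
print(via_transpose.dtypes)
```

One practical difference is that the column-wise construction keeps `Population` as an integer column, while the transposed frame stores it with the generic `object` dtype.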

3-4 Pandas DataFrame I/O

In Jupyter Notebook, create a new file: DataFrame IO.ipynb

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

import webbrowser

link = 'http://pandas.pydata.org/pandas-docs/version/0.20/io.html'
webbrowser.open(link)    # opens the browser and returns True; then copy the table from the page
# True

df1 = pd.read_clipboard()
df1
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_clipboard()
df1
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_csv('df1.csv')
!ls   # on Windows, use !dir
# DataFrame IO.ipynb    df1.csv

!more df1.csv
# Output
,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,HDF5 Format,read_hdf,to_hdf
6,binary,Feather Format,read_feather,to_feather
7,binary,Msgpack,read_msgpack,to_msgpack
8,binary,Stata,read_stata,to_stata
9,binary,SAS,read_sas, 
10,binary,Python Pickle Format,read_pickle,to_pickle
11,SQL,SQL,read_sql,to_sql
12,SQL,Google Big Query,read_gbq,to_gbq

df1.to_csv('df1.csv',index=False)    # omit the index column
!ls
# DataFrame IO.ipynb    df1.csv

!more df1.csv
# Output
Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,to_json
text,HTML,read_html,to_html
text,Local clipboard,read_clipboard,to_clipboard
binary,MS Excel,read_excel,to_excel
binary,HDF5 Format,read_hdf,to_hdf
binary,Feather Format,read_feather,to_feather
binary,Msgpack,read_msgpack,to_msgpack
binary,Stata,read_stata,to_stata
binary,SAS,read_sas, 
binary,Python Pickle Format,read_pickle,to_pickle
SQL,SQL,read_sql,to_sql
SQL,Google Big Query,read_gbq,to_gbq

df2 = pd.read_csv('df1.csv')
df2
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq

df1.to_json()
# Output
'{"Format Type":{"0":"text","1":"text","2":"text","3":"text","4":"binary","5":"binary","6":"binary","7":"binary","8":"binary","9":"binary","10":"binary","11":"SQL","12":"SQL"},"Data Description":{"0":"CSV","1":"JSON","2":"HTML","3":"Local clipboard","4":"MS Excel","5":"HDF5 Format","6":"Feather Format","7":"Msgpack","8":"Stata","9":"SAS","10":"Python Pickle Format","11":"SQL","12":"Google Big Query"},"Reader":{"0":"read_csv","1":"read_json","2":"read_html","3":"read_clipboard","4":"read_excel","5":"read_hdf","6":"read_feather","7":"read_msgpack","8":"read_stata","9":"read_sas","10":"read_pickle","11":"read_sql","12":"read_gbq"},"Writer":{"0":"to_csv","1":"to_json","2":"to_html","3":"to_clipboard","4":"to_excel","5":"to_hdf","6":"to_feather","7":"to_msgpack","8":"to_stata","9":" ","10":"to_pickle","11":"to_sql","12":"to_gbq"}}'

pd.read_json(df1.to_json())
# Output
    Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas    
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq


df1.to_html()
# Output
'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Format Type</th>\n      <th>Data Description</th>\n      <th>Reader</th>\n      <th>Writer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>text</td>\n      <td>CSV</td>\n      <td>read_csv</td>\n      <td>to_csv</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>text</td>\n      <td>JSON</td>\n      <td>read_json</td>\n      <td>to_json</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>text</td>\n      <td>HTML</td>\n      <td>read_html</td>\n      <td>to_html</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>text</td>\n      <td>Local clipboard</td>\n      <td>read_clipboard</td>\n      <td>to_clipboard</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>binary</td>\n      <td>MS Excel</td>\n      <td>read_excel</td>\n      <td>to_excel</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>binary</td>\n      <td>HDF5 Format</td>\n      <td>read_hdf</td>\n      <td>to_hdf</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>binary</td>\n      <td>Feather Format</td>\n      <td>read_feather</td>\n      <td>to_feather</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>binary</td>\n      <td>Msgpack</td>\n      <td>read_msgpack</td>\n      <td>to_msgpack</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>binary</td>\n      <td>Stata</td>\n      <td>read_stata</td>\n      <td>to_stata</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>binary</td>\n      <td>SAS</td>\n      <td>read_sas</td>\n      <td></td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>binary</td>\n      <td>Python Pickle Format</td>\n      <td>read_pickle</td>\n      <td>to_pickle</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>SQL</td>\n      <td>SQL</td>\n      <td>read_sql</td>\n      <td>to_sql</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>SQL</td>\n      <td>Google Big Query</td>\n      
<td>read_gbq</td>\n      <td>to_gbq</td>\n    </tr>\n  </tbody>\n</table>'

df1.to_html('df1.html')
!ls
# DataFrame IO.ipynb    df1.csv        df1.html
df1.to_excel('df1.xlsx')    # writing .xlsx needs an Excel engine such as openpyxl to be installed
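The CSV and JSON round trips above can be tried without the clipboard or files on disk by going through in-memory text buffers. This is a sketch on a small hypothetical frame with a few rows taken from the table above. Wrapping the text in `io.StringIO` is used here because recent pandas versions warn when `read_json` is given a literal JSON string.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Format Type": ["text", "text", "binary"],
    "Reader": ["read_csv", "read_json", "read_excel"],
    "Writer": ["to_csv", "to_json", "to_excel"],
})

# CSV: index=False leaves out the row labels, as in the second to_csv call.
csv_text = df.to_csv(index=False)
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: to_json without a path returns a string.
json_text = df.to_json()
df_json = pd.read_json(io.StringIO(json_text))

print(csv_text.splitlines()[0])
print(df_json)
```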

3-5 DataFrame selecting and indexing

In Jupyter Notebook, create a new file: Selecting and indexing.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

!pwd    # on Windows, use cd (chdir) instead of pwd
# /Users/xxx/xx

!ls /Users/xxx/xx/homework    # on Windows, use dir instead of ls
# movie_metadata.csv

imdb = pd.read_csv('/Users/xxx/xx/homework/movie_metadata.csv')
imdb
# Output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
5043 rows × 28 columns

imdb.shape
# (5043, 28)

imdb.head()
# Output
    color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
5 rows × 28 columns

imdb.tail(10)
# Output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
5033    Color    Shane Carruth    143.0    77.0    291.0    8.0    David Sullivan    291.0    424760.0    Drama|Sci-Fi|Thriller    ...    371.0    English    USA    PG-13    7000.0    2004.0    45.0    7.0    1.85    19000
5034    Color    Neill Dela Llana    35.0    80.0    0.0    0.0    Edgar Tancangco    0.0    70071.0    Thriller    ...    35.0    English    Philippines    Not Rated    7000.0    2005.0    0.0    6.3    NaN    74
5035    Color    Robert Rodriguez    56.0    81.0    0.0    6.0    Peter Marquardt    121.0    2040920.0    Action|Crime|Drama|Romance|Thriller    ...    130.0    Spanish    USA    R    7000.0    1992.0    20.0    6.9    1.37    0
5036    Color    Anthony Vallone    NaN    84.0    2.0    2.0    John Considine    45.0    NaN    Crime|Drama    ...    1.0    English    USA    PG-13    3250.0    2005.0    44.0    7.8    NaN    4
5037    Color    Edward Burns    14.0    95.0    0.0    133.0    Caitlin FitzGerald    296.0    4584.0    Comedy|Drama    ...    14.0    English    USA    Not Rated    9000.0    2011.0    205.0    6.4    NaN    413
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
10 rows × 28 columns

imdb['color']
# Output
0       Color
1       Color
2       Color
3       Color
4         NaN
        ...  
5038    Color
5039    Color
5040    Color
5041    Color
5042    Color
Name: color, Length: 5043, dtype: object

imdb['color'][0]
# 'Color'
imdb['color'][1]
# 'Color'

imdb[['color','director_name']]
# Output
    color    director_name
0    Color    James Cameron
1    Color    Gore Verbinski
2    Color    Sam Mendes
3    Color    Christopher Nolan
4    NaN    Doug Walker
...    ...    ...
5038    Color    Scott Smith
5039    Color    NaN
5040    Color    Benjamin Roberds
5041    Color    Daniel Hsia
5042    Color    Jon Gunn
5043 rows × 2 columns

sub_df = imdb[['director_name','movie_title','imdb_score']]
sub_df
# Output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1
...    ...    ...    ...
5038    Scott Smith    Signed Sealed Delivered    7.7
5039    NaN    The Following    7.5
5040    Benjamin Roberds    A Plague So Pleasant    6.3
5041    Daniel Hsia    Shanghai Calling    6.3
5042    Jon Gunn    My Date with Drew    6.6
5043 rows × 3 columns

sub_df.head()
# Output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.head(5)
# Output
    director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.iloc[10:20,:]
# Output
    director_name    movie_title    imdb_score
10    Zack Snyder    Batman v Superman: Dawn of Justice    6.9
11    Bryan Singer    Superman Returns    6.1
12    Marc Forster    Quantum of Solace    6.7
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest    7.3
14    Gore Verbinski    The Lone Ranger    6.5
15    Zack Snyder    Man of Steel    7.2
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian    6.6
17    Joss Whedon    The Avengers    8.1
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides    6.7
19    Barry Sonnenfeld    Men in Black 3    6.8

sub_df.iloc[10:20,0:2]
# Output
    director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df = sub_df.iloc[10:20,0:2]
tmp_df
# Output
    director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df.iloc[2:4,:]
# Output
    director_name    movie_title
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest

tmp_df.loc[15:17,:]
# Output
    director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'movie_title']
# Output
    director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'director_name']
# Output
    director_name
15    Zack Snyder
16    Andrew Adamson
17    Joss Whedon
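
The difference between iloc and loc in the cells above is easy to miss: iloc slices by position and excludes the end, while loc slices by label and includes it. A minimal sketch, assuming a made-up frame `demo` (not the IMDB data) whose integer labels run from 10 to 19 like tmp_df:

```python
import pandas as pd

# Hypothetical stand-in for tmp_df: integer labels 10..19, two columns.
demo = pd.DataFrame(
    {
        "director_name": [f"director_{i}" for i in range(10, 20)],
        "movie_title": [f"title_{i}" for i in range(10, 20)],
    },
    index=range(10, 20),
)

by_position = demo.iloc[2:4, :]  # positions 2 and 3; the end position is excluded
by_label = demo.loc[15:17, :]    # labels 15, 16 and 17; the end label is included

print(list(by_position.index))   # [12, 13]
print(list(by_label.index))      # [15, 16, 17]

# Column slices by label are inclusive as well.
only_director = demo.loc[15:17, :"director_name"]
print(list(only_director.columns))  # ['director_name']
```

This is why tmp_df.iloc[2:4,:] returns two rows while tmp_df.loc[15:17,:] returns three.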

3-6 Reindexing Series and DataFrame

Create a new Jupyter notebook file: Reindexing Series and DataFrame.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# series reindex
s1 = Series([1,2,3,4], index=['A','B','C','D'])
s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

# s1.reindex()    # Place the cursor on the method and press Shift+Tab to pop up its documentation; press repeatedly to show more detail
s1.reindex(index=['A','B','C','D','E'])
# Output
A    1.0
B    2.0
C    3.0
D    4.0
E    NaN
dtype: float64

s1.reindex(index=['A','B','C','D','E'],fill_value=0)
# Output
A    1
B    2
C    3
D    4
E    0
dtype: int64

s1.reindex(index=['A','B','C','D','E'],fill_value=10)
# Output
A     1
B     2
C     3
D     4
E    10
dtype: int64

s2 = Series(['A','B','C'], index=[1,5,10])
s2
# Output
1     A
5     B
10    C
dtype: object

s2.reindex(index=range(15))
# Output
0     NaN
1       A
2     NaN
3     NaN
4     NaN
5       B
6     NaN
7     NaN
8     NaN
9     NaN
10      C
11    NaN
12    NaN
13    NaN
14    NaN
dtype: object

s2.reindex(index=range(15),method='ffill')
# Output
0     NaN
1       A
2       A
3       A
4       A
5       B
6       B
7       B
8       B
9       B
10      C
11      C
12      C
13      C
14      C
dtype: object
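
The outputs above also show a side effect worth noting: when reindex introduces NaN, an integer Series becomes float64, while fill_value keeps the integer dtype. A short sketch that checks this and the forward fill, using the same values as s1 and s2 (the variable names here are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

with_nan = s.reindex(["A", "B", "C", "D", "E"])                 # E is missing, so it becomes NaN
with_fill = s.reindex(["A", "B", "C", "D", "E"], fill_value=0)  # E becomes 0 instead

print(with_nan.dtype)   # float64, because NaN is a float
print(with_fill.dtype)  # integer dtype is kept, no NaN was introduced

# method='ffill' copies the value of the closest earlier label forward.
# It requires a monotonic index, which [1, 5, 10] is.
s2 = pd.Series(["A", "B", "C"], index=[1, 5, 10])
filled = s2.reindex(range(15), method="ffill")
print(filled[0], filled[4], filled[14])  # nan A C
```

Label 0 stays NaN because there is no earlier label to fill from.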

# reindex dataframe
df1 = DataFrame(np.random.rand(25).reshape([5,5]))
df1
# Output
    0    1    2    3    4
0    0.255424    0.315708    0.951327    0.423676    0.975377
1    0.087594    0.192460    0.502268    0.534926    0.423024
2    0.817002    0.113410    0.468270    0.410297    0.278942
3    0.315239    0.018933    0.133764    0.240001    0.910754
4    0.267342    0.451077    0.282865    0.170235    0.898429


df1 = DataFrame(np.random.rand(25).reshape([5,5]), index=['A','B','D','E','F'], columns=['c1','c2','c3','c4','c5'])
df1
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B','C','D','E','F'])
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
C    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(columns=['c1','c2','c3','c4','c5','c6'])
# Output
    c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN

df1.reindex(index=['A','B','C','D','E','F'],columns=['c1','c2','c3','c4','c5','c6'])
# Output
    c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
C    NaN    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN


s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

s1.reindex(index=['A','B'])
# Output
A    1
B    2
dtype: int64


df1
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B'])
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878

s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

s1.drop('A')
# Output
B    2
C    3
D    4
dtype: int64

df1
# Output
    c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('A',axis=0)
# Output
    c1    c2    c3    c4    c5
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('c1',axis=0)
# Raises KeyError: 'c1' is a column label, so it is not found among the row labels

df1.drop('c1',axis=1)
# Output
    c2    c3    c4    c5
A    0.894546    0.932129    0.178442    0.303684
B    0.260677    0.708358    0.275914    0.369878
D    0.125907    0.191987    0.338194    0.009877
E    0.909661    0.227301    0.343989    0.610203
F    0.306472    0.197467    0.063800    0.813786
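
Note that drop returns a new object each time; df1 is printed unchanged between the calls above. A small sketch of the same behavior on a hypothetical 2x3 frame, including the `columns=` keyword form as an alternative to `axis=1`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["A", "B"],
    columns=["c1", "c2", "c3"],
)

no_row_a = df.drop("A", axis=0)    # drop a row label
no_col_c1 = df.drop("c1", axis=1)  # drop a column label
same_result = df.drop(columns="c1")  # keyword form, no axis argument needed

print(df.shape)         # (2, 3): the original frame is unchanged
print(no_row_a.shape)   # (1, 3)
print(no_col_c1.shape)  # (2, 2)

# Looking for a column label among the row labels raises KeyError.
try:
    df.drop("c1", axis=0)
    raised = False
except KeyError:
    raised = True
print(raised)           # True
```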

3-7 A few words about NaN

Create a new Jupyter notebook file: 谈一谈NaN.ipynb

# NaN - means Not a Number
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

n = np.nan
type(n)
# float

m = 1
m + n
# nan
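
Two more properties of NaN are useful to know before moving on to Series. The sketch below checks that NaN is not equal to itself (so `==` cannot be used to detect it) and that pandas reductions skip NaN by default:

```python
import math

import numpy as np
import pandas as pd

n = np.nan

print(n == n)          # False: NaN is not equal to anything, including itself
print(math.isnan(n))   # True
print(pd.isna(n))      # True

# Plain arithmetic with NaN propagates NaN ...
print(np.isnan(1 + n))  # True

# ... but pandas reductions skip NaN by default (skipna=True).
s = pd.Series([1, 2, np.nan, 3, 4])
total = s.sum()
total_strict = s.sum(skipna=False)
print(total)         # 10.0
print(total_strict)  # nan
```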


# NaN in Series
s1 = Series([1, 2, np.nan, 3, 4], index=['A','B','C','D','E'])
s1
# Output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.isnull()
# Output
A    False
B    False
C     True
D    False
E    False
dtype: bool

s1.notnull()
# Output
A     True
B     True
C    False
D     True
E     True
dtype: bool

s1
# Output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.dropna()
# Output
A    1.0
B    2.0
D    3.0
E    4.0
dtype: float64

# NaN in DataFrame
dframe = DataFrame([[1,2,3],[np.nan,5,6],[7,np.nan,9],[np.nan,np.nan,np.nan]])
dframe
# Output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0
3    NaN    NaN    NaN

dframe.isnull()
# Output
    0    1    2
0    False    False    False
1    True    False    False
2    False    True    False
3    True    True    True

dframe.notnull()
# Output
    0    1    2
0    True    True    True
1    False    True    True
2    True    False    True
3    False    False    False

df1 = dframe.dropna(axis=0)
df1
# Output
    0    1    2
0    1.0    2.0    3.0


df1 = dframe.dropna(axis=1)
df1
# Output
0
1
2
3

df1 = dframe.dropna(axis=1,how='any')
df1
# Output
0
1
2
3

df1 = dframe.dropna(axis=0,how='any')
df1
# Output
    0    1    2
0    1.0    2.0    3.0

df1 = dframe.dropna(axis=0,how='all')
df1
# Output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0

dframe2 = DataFrame([[1,2,3,np.nan],[2,np.nan,5,6],[np.nan,7,np.nan,9],[1,np.nan,np.nan,np.nan]])
dframe2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

df2 = dframe2.dropna(thresh=None)
df2
# Output
0    1    2    3

df2 = dframe2.dropna(thresh=2)
df2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0

dframe2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

dframe2.fillna(value=1)
# Output
    0    1    2    3
0    1.0    2.0    3.0    1.0
1    2.0    1.0    5.0    6.0
2    1.0    7.0    1.0    9.0
3    1.0    1.0    1.0    1.0

dframe2.fillna(value={0:0,1:1,2:2,3:3})    # fill per column: each key is a column label, each value fills that column's NaN
# Output
    0    1    2    3
0    1.0    2.0    3.0    3.0
1    2.0    1.0    5.0    6.0
2    0.0    7.0    2.0    9.0
3    1.0    1.0    2.0    3.0

df1
# Output
    0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0

df2
# Output
    0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0

df1.dropna()
# Output
    0    1    2
0    1.0    2.0    3.0

df1.fillna(1)
# Output
    0    1    2
0    1.0    2.0    3.0
1    1.0    5.0    6.0
2    7.0    1.0    9.0
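
The how and thresh arguments of dropna are easiest to understand by counting the non-null values in each row. A sketch that rebuilds the same values as dframe2 (under the illustrative name `frame`) and compares the counts with the rows each call keeps:

```python
import numpy as np
import pandas as pd

nan = np.nan
frame = pd.DataFrame([
    [1, 2, 3, nan],
    [2, nan, 5, 6],
    [nan, 7, nan, 9],
    [1, nan, nan, nan],
])

# Number of non-null values in each row.
non_null_per_row = frame.notnull().sum(axis=1)
print(list(non_null_per_row))                # [3, 3, 2, 1]

print(list(frame.dropna(how="any").index))   # []           every row contains a NaN
print(list(frame.dropna(how="all").index))   # [0, 1, 2, 3] no row is entirely NaN
print(list(frame.dropna(thresh=2).index))    # [0, 1, 2]    rows with at least 2 values
print(list(frame.dropna(thresh=3).index))    # [0, 1]       rows with at least 3 values
```

So thresh is the minimum number of non-null values a row needs in order to be kept.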

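Chapter 5: Plotting and Visualization with Matplotlib

Data visualization is a very important part of data analysis. This chapter covers the basics of Matplotlib, including how to plot pandas Series and DataFrame objects and how to set figure styles and display modes.

5-1 Introduction to Matplotlib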

Why plot with Python?

GUIs are too complicated
Excel is too much of a headache
Python is simple and free (sorry, Matlab)

What is matplotlib?

A Python package
Used for 2D plotting
Very powerful and popular
Has many extensions

Hello world in matplotlib

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.linspace(0,2*np.pi,100)
y = np.sin(x)
plt.plot(x,y)

Matplotlib

Backend: decides where the figure is drawn and where it is displayed
Artist: decides what the figure looks like
Scripting: pyplot, the Python syntax and API

5-2 Simple plotting with matplotlib: plot

Create a new Jupyter notebook file: matplotlib的简单绘图-plot.ipynb

import numpy as np
import matplotlib.pyplot as plt

a = [1, 2, 3]
plt.plot(a)
# [<matplotlib.lines.Line2D at 0x20619607f40>]
# plt.show()  # newer versions display the figure directly

a = [1, 2, 3]
b = [4, 5, 6]
plt.plot(a, b)

a = [1, 2, 3]
b = [4, 5, 6, 7]
plt.plot(a, b)
# ValueError: x and y must have same first dimension, but have shapes (3,) and (4,)

a = [1, 2, 3]
b = [4, 5, 6]
# %matplotlib inline  # older versions need this line to display plots without plt.show(); newer versions do not
plt.plot(a, b)

%timeit np.arange(10)
# 464 ns ± 25.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

plt.plot(a,b,'*')

plt.plot(a,b,'--')

plt.plot(a,b,'r--')

plt.plot(a,b,'w--')  # the line is white, so it cannot be seen on a white background

c = [10, 8, 6]
d = [1, 8, 3]
plt.plot(a, b, c, d)

c = [10, 8, 6]
d = [1, 8, 3]
plt.plot(a, b, 'r--', c, d, 'b*')

t = np.arange(0.0, 2.0, 0.1)
t
# array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
       1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
t.size
# 20
s = np.sin(t*np.pi)
s
# Output
array([ 0.00000000e+00,  3.09016994e-01,  5.87785252e-01,  8.09016994e-01,
        9.51056516e-01,  1.00000000e+00,  9.51056516e-01,  8.09016994e-01,
        5.87785252e-01,  3.09016994e-01,  1.22464680e-16, -3.09016994e-01,
       -5.87785252e-01, -8.09016994e-01, -9.51056516e-01, -1.00000000e+00,
       -9.51056516e-01, -8.09016994e-01, -5.87785252e-01, -3.09016994e-01])

plt.plot(t, s)

plt.plot(t, s, 'r--')

plt.plot(t, s, 'r--', t*2, s)

plt.plot(t, s, 'r--', t*2, s, '--')

plt.plot(t, s, 'r--', t*2, s, '--')
plt.xlabel('this is x')
plt.ylabel('this is y')
plt.title('this is a demo')

plt.plot(t, s, 'r--', label='aaaa')
plt.plot(t*2, s, 'b--', label='bbbb')
plt.xlabel('this is x')
plt.ylabel('this is y')
plt.title('this is a demo')
plt.legend()
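
The format strings used above combine a color letter with a line or marker style. The sketch below repeats the labelled plot through matplotlib's object-oriented interface and then inspects the resulting lines. The Agg backend is an assumption made here so that the code also runs outside Jupyter:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window or notebook is needed
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 2.0, 0.1)
s = np.sin(t * np.pi)

fig, ax = plt.subplots()
# 'r--' is a red dashed line; 'b*' draws blue star markers with no connecting line.
ax.plot(t, s, "r--", label="aaaa")
ax.plot(t * 2, s, "b*", label="bbbb")
ax.set_xlabel("this is x")
ax.set_ylabel("this is y")
ax.set_title("this is a demo")
ax.legend()

print(len(ax.lines))                # 2
print(ax.lines[0].get_linestyle())  # --
print(ax.lines[1].get_marker())     # *
```

The plt.xlabel, plt.ylabel and plt.title calls in the notebook act on the current axes; the ax.set_* methods do the same on an explicit axes object.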

5-3 Simple plotting with matplotlib: subplot

Create a new Jupyter notebook file: matplotlib简单绘图之subplot.ipynb

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.0, 5.0)
y1 = np.sin(np.pi*x)
y2 = np.sin(np.pi*x*2)

plt.plot(x, y1, 'b--', label='sin(pi*x)')
plt.ylabel('y1 value')
plt.plot(x, y2, 'r--', label='sin(pi*2x)')
plt.ylabel('y2 value')
plt.xlabel('x value')
plt.title('this is x-y value')
plt.legend()

plt.subplot(2, 1, 1)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(2, 1, 2)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')

plt.subplot(2, 2, 1)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(2, 2, 2)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(2, 2, 3)
plt.plot(x, y1, 'r*')
plt.subplot(2, 2, 4)
plt.plot(x, y1, 'b*')

plt.subplot(221)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(222)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(223)
plt.plot(x, y1, 'r*')
plt.subplot(224)
plt.plot(x, y1, 'b*')

a = plt.subplots()
a

type(a)
# tuple
a[0]

a[1]
# <AxesSubplot:>
figure, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5])

figure, ax = plt.subplots(2, 2)
ax.plot([1, 2, 3, 4, 5])
# AttributeError: 'numpy.ndarray' object has no attribute 'plot'

figure, ax = plt.subplots(2, 2)
ax

ax[0][0].plot(x, y1)
ax[0][1].plot(x, y2)
# [<matplotlib.lines.Line2D at 0x23c99d1df70>]
figure    # older versions: plt.show()
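
The AttributeError above occurs because plt.subplots(2, 2) returns a 2x2 NumPy array of axes rather than a single axes object. A sketch of two ways to address the panels, again assuming the Agg backend so it runs outside Jupyter:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, an assumption for running outside Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 5.0)
y1 = np.sin(np.pi * x)
y2 = np.sin(np.pi * x * 2)

fig, axes = plt.subplots(2, 2)
print(type(axes).__name__, axes.shape)  # ndarray (2, 2)

# Index a single panel by row and column.
axes[0, 0].plot(x, y1, "b--")
axes[0, 1].plot(x, y2, "r--")

# axes.flat walks the grid row by row, which is handy for touching every panel.
for i, ax in enumerate(axes.flat):
    ax.set_title(f"panel {i}")

fig.tight_layout()  # avoid overlapping titles and tick labels
```

axes[0][0] and axes[0, 0] refer to the same panel; the second form is the usual NumPy indexing.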

5-4 Plotting with pandas: Series

Create a new Jupyter notebook file: Pandas绘图之Series.ipynb

import numpy as np
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt

s = Series([1, 2, 3, 4, 5])
s
# Output
0    1
1    2
2    3
3    4
4    5
dtype: int64

s.cumsum()
# Output
0     1
1     3
2     6
3    10
4    15
dtype: int64

s1 = Series(np.random.randn(100)).cumsum()
s1.plot()

s1.plot(kind='bar')

s1 = Series(np.random.randn(1000)).cumsum()
s1.plot(kind='bar')

s1.plot(kind='line')

s1.plot(kind='line', grid=True)

s1.plot(kind='line', grid=True, label='S1', title='This is Series')
plt.legend()

s1.plot(kind='line', grid=True, label='S1', title='This is Series', style='--')
plt.legend()

s1 = Series(np.random.randn(1000)).cumsum()
s2 = Series(np.random.randn(1000)).cumsum()
s1.plot(kind='line', grid=True, label='S1', title='This is Series')
s2.plot(label='S2')

fig, ax = plt.subplots(2,1)
ax
ax[0].plot(s1)
ax[1].plot(s2)

s1.plot(ax=ax[0], label='S1')
s2.plot(ax=ax[1], label='S2')
fig

s1[0:10].plot(ax=ax[0], label='S1', kind='bar')
fig

s1 = Series(np.random.randn(10)).cumsum()
s2 = Series(np.random.randn(10)).cumsum()
fig, ax = plt.subplots(2,1)
s1[0:10].plot(ax=ax[0], label='S1', kind='bar')
s2.plot(ax=ax[1], label='S2')
plt.show()
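
The ax argument is what lets a Series draw into a specific subplot. A sketch with a seeded random generator (the seed and variable names are assumptions for reproducibility) that counts the artists each call adds:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, an assumption for running outside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s1 = pd.Series(rng.standard_normal(10)).cumsum()
s2 = pd.Series(rng.standard_normal(10)).cumsum()

fig, (top, bottom) = plt.subplots(2, 1)
s1.plot(ax=top, kind="bar", label="S1")                 # bars go into the upper panel
s2.plot(ax=bottom, kind="line", label="S2", grid=True)  # line goes into the lower panel
bottom.legend()

print(len(top.patches))   # 10, one rectangle per value
print(len(bottom.lines))  # 1
```

Without ax, each plot call draws on the current axes, which is why the two Series end up in the same figure earlier in this section.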

5-5 Plotting with pandas: DataFrame

Create a new Jupyter notebook file: Pandas绘图之DataFrame.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame

df = DataFrame(
    np.random.randint(1, 10, 40).reshape(10,4),
    columns=['A','B','C','D']
)
df
# Output
    A	B	C	D
0	6	9	5	4
1	6	3	3	3
2	9	3	8	1
3	4	3	9	9
4	4	5	3	7
5	2	2	4	4
6	5	8	9	9
7	6	8	3	1
8	1	5	6	6
9	5	1	5	1

df.plot()

df.plot(kind='bar')

df.plot(kind='barh')

df.plot(kind='bar', stacked=True)

df.plot(kind='area')

df.iloc[5]
# Output
A    2
B    2
C    4
D    4
Name: 5, dtype: int32

a = df.iloc[5]
type(a)
# pandas.core.series.Series

df.iloc[5].plot()

for i in df.index:
    df.iloc[i].plot(label=str(i))
plt.legend()
plt.show()

df['A']
# Output
0    6
1    6
2    9
3    4
4    4
5    2
6    5
7    6
8    1
9    5
Name: A, dtype: int32

df['A'].plot()

df.plot()

df
# Output
    A	B	C	D
0	6	9	5	4
1	6	3	3	3
2	9	3	8	1
3	4	3	9	9
4	4	5	3	7
5	2	2	4	4
6	5	8	9	9
7	6	8	3	1
8	1	5	6	6
9	5	1	5	1

df.T
# Output
    0	1	2	3	4	5	6	7	8	9
A	6	6	9	4	4	2	5	6	1	5
B	9	3	3	3	5	2	8	8	5	1
C	5	3	8	9	3	4	9	3	6	5
D	4	3	1	9	7	4	9	1	6	1

df.T.plot()
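
DataFrame.plot draws one line per column, so transposing is a shortcut for the row-by-row loop shown earlier. A sketch on a small hypothetical 3x4 frame that counts the lines in each case:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, an assumption for running outside Jupyter
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

ax_cols = df.plot()    # one line per column: A, B, C, D
ax_rows = df.T.plot()  # after transposing, one line per original row: 0, 1, 2

print(len(ax_cols.lines))  # 4
print(len(ax_rows.lines))  # 3
```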

5-6 Histograms and density plots

Create a new Jupyter notebook file: matplotlib里的直方图和密度图.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame

# Histogram
s = Series(np.random.randn(1000))
plt.hist(s)

plt.hist(s, rwidth=0.9)

a = np.arange(10)
a
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

plt.hist(a, rwidth=0.9)

s
# Output
0      1.234429
1     -0.157434
2     -0.587175
3      0.583882
4     -0.612201
         ...   
995   -0.155292
996   -1.126121
997   -0.173816
998    0.608625
999   -0.862248
Length: 1000, dtype: float64

re = plt.hist(s, rwidth=0.9)

re
# Output
(array([  6.,  22.,  55., 127., 210., 234., 171., 115.,  43.,  17.]),
 array([-3.24995597, -2.65054559, -2.0511352 , -1.45172482, -0.85231444,
        -0.25290405,  0.34650633,  0.94591672,  1.5453271 ,  2.14473748,
         2.74414787]),
 <BarContainer object of 10 artists>)

type(re)
# tuple

len(re)
# 3

re[0]
# array([  6.,  22.,  55., 127., 210., 234., 171., 115.,  43.,  17.])

re[1]
# array([-3.24995597, -2.65054559, -2.0511352 , -1.45172482, -0.85231444,
       -0.25290405,  0.34650633,  0.94591672,  1.5453271 ,  2.14473748,
        2.74414787])

re[2]
# <BarContainer object of 10 artists>

plt.hist(s)

plt.hist(s, rwidth=0.9)

plt.hist(s, rwidth=0.9, bins=20)

plt.hist(s, rwidth=0.9, bins=200)

plt.hist(s, rwidth=0.9, bins=20, color='r')

s.plot()

s.plot(kind='kde')
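
The first two elements of the tuple returned by plt.hist are the bin counts and the bin edges. The same numbers can be computed without drawing anything via np.histogram. A sketch with a seeded generator (the seed is an assumption, so the counts differ from the notebook output above):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1000)

counts, edges = np.histogram(values, bins=10)

print(len(counts), len(edges))  # 10 11: n bins are delimited by n + 1 edges
print(counts.sum())             # 1000: every value falls into exactly one bin

# With an integer `bins`, all bins have the same width.
equal_width = np.allclose(np.diff(edges), edges[1] - edges[0])
print(equal_width)              # True
```

This matches re[0] having 10 entries and re[1] having 11 entries in the output above.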

https://matplotlib.org/stable/tutorials/index

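Chapter 6: Plotting and Visualization with Seaborn

Seaborn is a higher-level wrapper around Matplotlib. Its powerful color palette features and wide range of built-in plot types have made it one of the most popular plotting tools in data science today. This chapter introduces the basics of Seaborn and compares its features with matplotlib.

6-1 Introduction to seaborn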

Seaborn - Powerful Matplotlib Extension
Statistical data visualization

What are the advantages of Seaborn?

Create a new Jupyter notebook file: Seaborn和matplotlib对比.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
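# Optional in newer versions. The note is kept on its own line because text after a magic command can be parsed as its arguments.
%matplotlib inline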

iris = pd.read_csv('../homework/iris.csv')
iris.head()
# Output
	SepalLength	SepalWidth	PetalLength	PetalWidth	Name
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

# Goal: draw a scatter plot of petal length against sepal length, with point colors distinguishing the iris species
iris.Name.unique()
# array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)


color_map = dict(zip(iris.Name.unique(), ['blue', 'green', 'red']))
for species, group in iris.groupby('Name'):
    plt.scatter(group['PetalLength'], group['SepalLength'],
               color=color_map[species],
               alpha=0.3, edgecolor=None,
               label=species)

plt.legend(frameon=True, title='Name')
plt.xlabel('petalLength')
plt.ylabel('sepalLength')

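# Keyword arguments are used here because recent seaborn versions no longer accept x, y and data positionally
sns.lmplot(x='PetalLength', y='SepalLength', data=iris, hue='Name', fit_reg=False)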

https://www.youtube.com/watch?v=FytuB8nFHPQ

6-2 Histograms and density plots with seaborn

Create a new Jupyter notebook file: seaborn实现直方图和密度图.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
%matplotlib inline
import seaborn as sns

s1 = Series(np.random.randn(100))
plt.hist(s1)

s1.plot(kind='kde')

sns.distplot(s1, hist=True, kde=True)

sns.distplot(s1, hist=True, kde=False)

sns.distplot(s1, hist=False, kde=True)

sns.distplot(s1, hist=False, kde=True, rug=True)

sns.distplot(s1, bins=20, hist=False, kde=True, rug=True)

sns.distplot(s1, bins=20, hist=True, kde=False, rug=True)

sns.kdeplot(s1)

sns.kdeplot(s1, shade=True)

sns.kdeplot(s1, shade=True, color='r')

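sns.plt.plot(s1)  # equivalent to plt.plot(s1); worked in old versions, does not seem to work in newer ones
sns.plt.hist(s1)  # equivalent to plt.hist(s1); worked in old versions, does not seem to work in newer ones

# Homework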
sns.rugplot(s1)

6-3 Bar plots and heatmaps with seaborn

Create a new Jupyter notebook file: seaborn实现柱状图和热力图.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas import Series, DataFrame
%matplotlib inline

df = sns.load_dataset('flights')  # the source of the flights dataset is linked below
df.head()
# Output
	year	month	passengers
0	1949	Jan	112
1	1949	Feb	118
2	1949	Mar	132
3	1949	Apr	129
4	1949	May	121

df.shape
# (144, 3)

https://github.com/

https://github.com/mwaskom/seaborn-data

df = df.pivot(index='month', columns='year', values='passengers')
df
# Output
year	1949	1950	1951	1952	1953	1954	1955	1956	1957	1958	1959	1960
month												
Jan	112	115	145	171	196	204	242	284	315	340	360	417
Feb	118	126	150	180	196	188	233	277	301	318	342	391
Mar	132	141	178	193	236	235	267	317	356	362	406	419
Apr	129	135	163	181	235	227	269	313	348	348	396	461
May	121	125	172	183	229	234	270	318	355	363	420	472
Jun	135	149	178	218	243	264	315	374	422	435	472	535
Jul	148	170	199	230	264	302	364	413	465	491	548	622
Aug	148	170	199	242	272	293	347	405	467	505	559	606
Sep	136	158	184	209	237	259	312	355	404	404	463	508
Oct	119	133	162	191	211	229	274	306	347	359	407	461
Nov	104	114	146	172	180	203	237	271	305	310	362	390
Dec	118	140	166	194	201	229	278	306	336	337	405	432
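
pivot turns the long table (one row per year and month) into the wide month-by-year grid that heatmap expects. A sketch on four rows copied from the table above; the frame names `long_df` and `wide` are illustrative:

```python
import pandas as pd

# Four rows in long format: one row per (year, month) pair.
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Row labels come from `month`, column labels from `year`, cell values from `passengers`.
wide = long_df.pivot(index="month", columns="year", values="passengers")

print(wide.shape)             # (2, 2)
print(wide.loc["Jan", 1950])  # 115
print(wide.sum().tolist())    # [230, 241], the column sums per year
```

pivot requires each (index, columns) pair to be unique; duplicated pairs raise a ValueError.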

sns.heatmap(df)

df.plot()

sns.heatmap(df, annot=True)

sns.heatmap(df, annot=True, fmt='d')

df.sum()
# Output
year
1949    1520
1950    1676
1951    2042
1952    2364
1953    2700
1954    2867
1955    3408
1956    3939
1957    4421
1958    4572
1959    5140
1960    5714
dtype: int64

s = df.sum()
s.index
# Int64Index([1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
            1960],
           dtype='int64', name='year')

s.values
# array([1520, 1676, 2042, 2364, 2700, 2867, 3408, 3939, 4421, 4572, 5140,
       5714], dtype=int64)

sns.barplot(x=s.index, y=s.values)

type(s)
# pandas.core.series.Series

s.plot(kind='bar')

6-4 Configuring figure appearance in seaborn

Create a new Jupyter notebook file: seaborn设置图形显示效果.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. axes_style and set_style
%matplotlib inline

x = np.linspace(0, 14, 100)
y1 = np.sin(x)
y2 = np.sin(x+2)*1.25
plt.plot(x, y1)
plt.plot(x, y2)

def sinplot():
    plt.plot(x, y1)
    plt.plot(x, y2)

sinplot()

import seaborn as sns
sinplot()

style = ['darkgrid', 'dark', 'white', 'whitegrid', 'ticks']
sns.set_style(style[0])    # try each of the styles in the list
sinplot()

sns.axes_style()

sns.set_style(style[0], {'grid.color':'red'})
sinplot()

sns.axes_style()

sns.set()
sinplot()

sns.set_style('white')
sinplot()

sns.set()
sinplot()

# 2. plotting_context() and set_context()
context = ['paper', 'notebook', 'talk', 'poster']
sns.set_context(context[3])
sinplot()

sns.plotting_context()

sns.set_context(context[1], rc={'grid.linewidth':1.0})
sinplot()

sns.set_context(context[1], rc={'grid.linewidth':3.0})
sinplot()

sns.plotting_context()
sns.set()
sns.plotting_context()

6-5 Seaborn's powerful color palette features

Create a new Jupyter notebook file: seaborn强大的调色功能.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

def sinplot():
    x = np.linspace(0, 14, 100)
    plt.figure(figsize=(8, 6))
    for i in range(4):
        plt.plot(x, np.sin(x+i)*(i+0.75), label="sin(x+%s)*(%s+0.75)" % (i,i))
    plt.legend()

sinplot()

import seaborn as sns
sinplot()

import seaborn as sns
sns.set()
sinplot()

sns.color_palette()  # RGB

sns.palplot(sns.color_palette())

pal_style = ['deep', 'muted', 'pastel', 'bright', 'dark', 'colorblind']
sns.palplot(sns.color_palette('dark'))

sns.set_palette(sns.color_palette('dark'))
sns.color_palette()

sinplot()

sns.set()
sinplot()

with sns.color_palette('dark'):
    sinplot()

sinplot()

sns.color_palette()

pal1 = sns.color_palette([(0.5, 0.1, 0.7),(0.3, 0.1, 0.9)])
sns.palplot(pal1)

sns.color_palette('hls', 8)
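
A palette is a list of RGB tuples, and using one in a with block changes the default colors only temporarily. A short sketch that checks both points, assuming seaborn is installed (variable names are illustrative):

```python
import seaborn as sns

hls_pal = sns.color_palette("hls", 8)  # 8 evenly spaced hues
print(len(hls_pal))                                # 8
print(all(len(rgb) == 3 for rgb in hls_pal))       # True, each color is an (r, g, b) tuple

before = list(sns.color_palette())     # the current default palette
with sns.color_palette("dark"):
    inside = list(sns.color_palette()) # the 'dark' palette is active only in this block
after = list(sns.color_palette())      # restored on exit

print(before == after)                 # True
```

sns.set_palette, by contrast, changes the default until it is changed again or sns.set() resets it.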

http://seaborn.pydata.org/tutorial
Choosing color palettes

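Chapter 7: A Hands-on Data Analysis Project

After the first six chapters, we have a basic grasp of the main tools used in data analysis. This chapter works through a hands-on stock market analysis project, applying what we have learned to analyze data and extract useful information.

7-1 Project preparation

Data science workflow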

Bilibili site-wide video information crawler: https://github.com/chenjiandongx/bili-spider
What public data sources are there for data analysis and data mining?: https://www.zhihu.com/question/19969760
https://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public
https://www.kaggle.com/datasets

Google Public Data Explorer: https://www.google.com/publicdata/directory
AWS Public Datasets: https://aws.amazon.com/public-datasets/

7-2 Stock market analysis project: getting the data

Create a new Jupyter notebook file: 股票市场分析实战--数据获取.ipynb

http://finance.yahoo.com/
Search for Alibaba: BABA

https://pandas-datareader.readthedocs.io/en/latest/

conda install pandas-datareader  # install pandas-datareader

import pandas_datareader as pdr
alibaba = pdr.get_data_yahoo('BABA')
alibaba.head()
# Output
	High	Low	Open	Close	Volume	Adj Close
Date						
2017-02-22	105.199997	102.419998	102.480003	104.199997	15779400	104.199997
2017-02-23	104.860001	101.820000	104.720001	102.459999	10095600	102.459999
2017-02-24	103.000000	101.300003	101.389999	102.949997	7356600	102.949997
2017-02-27	103.824997	102.220001	102.500000	103.599998	6865900	103.599998
2017-02-28	103.989998	102.029999	103.889999	102.900002	8017500	102.900002

alibaba.shape
# (1259, 6)

alibaba.tail()
# Output
    High	Low	Open	Close	Volume	Adj Close
Date						
2022-02-14	122.384003	119.370003	120.559998	121.919998	13154900	121.919998
2022-02-15	126.805000	123.377998	123.769997	126.239998	14634700	126.239998
2022-02-16	127.580002	124.510002	125.610001	125.559998	17993300	125.559998
2022-02-17	129.399994	124.056000	125.000000	124.430000	15906000	124.430000
2022-02-18	120.879997	117.199997	120.879997	118.989998	21165800	118.989998

alibaba.describe()
# Output
    High	Low	Open	Close	Volume	Adj Close
count	1259.000000	1259.000000	1259.000000	1259.000000	1.259000e+03	1259.000000
mean	189.265396	184.532867	187.110629	186.942844	1.848556e+07	186.942844
std	43.814601	42.927590	43.471982	43.413098	1.035043e+07	43.413098
min	103.000000	101.300003	101.389999	102.309998	4.120700e+06	102.309998
25%	163.034500	158.815002	161.314995	160.599998	1.227665e+07	160.599998
50%	183.100006	178.520004	180.800003	180.839996	1.617560e+07	180.839996
75%	213.695000	209.366501	211.195000	211.480003	2.129500e+07	211.480003
max	319.320007	308.910004	313.500000	317.140015	1.418300e+08	317.140015

alibaba.info
# Output (without parentheses, this shows the bound method instead of calling it)
<bound method DataFrame.info of                   High         Low        Open       Close    Volume  \
Date                                                                   
2017-02-22  105.199997  102.419998  102.480003  104.199997  15779400   
2017-02-23  104.860001  101.820000  104.720001  102.459999  10095600   
2017-02-24  103.000000  101.300003  101.389999  102.949997   7356600   
2017-02-27  103.824997  102.220001  102.500000  103.599998   6865900   
2017-02-28  103.989998  102.029999  103.889999  102.900002   8017500   
...                ...         ...         ...         ...       ...   
2022-02-14  122.384003  119.370003  120.559998  121.919998  13154900   
2022-02-15  126.805000  123.377998  123.769997  126.239998  14634700   
2022-02-16  127.580002  124.510002  125.610001  125.559998  17993300   
2022-02-17  129.399994  124.056000  125.000000  124.430000  15906000   
2022-02-18  120.879997  117.199997  120.879997  118.989998  21165800   

             Adj Close  
Date                    
2017-02-22  104.199997  
2017-02-23  102.459999  
2017-02-24  102.949997  
2017-02-27  103.599998  
2017-02-28  102.900002  
...                ...  
2022-02-14  121.919998  
2022-02-15  126.239998  
2022-02-16  125.559998  
2022-02-17  124.430000  
2022-02-18  118.989998  

[1259 rows x 6 columns]>

alibaba.info()
# Output
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2017-02-22 to 2022-02-18
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   High       1259 non-null   float64
 1   Low        1259 non-null   float64
 2   Open       1259 non-null   float64
 3   Close      1259 non-null   float64
 4   Volume     1259 non-null   int64  
 5   Adj Close  1259 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 68.9 KB

7-3 Stock market analysis project: historical trend analysis

Create a new Jupyter notebook file: 股票市场分析实战--历史趋势分析.ipynb

# Basics
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Reading stock data
import pandas_datareader as pdr

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# time
from datetime import datetime

start = datetime(2015,9,20)
alibaba = pdr.get_data_yahoo('BABA',start=start)
amazon = pdr.get_data_yahoo('AMZN',start=start)

# alibaba.to_csv('../homework/BABA.csv')
# amazon.to_csv('../homework/AMZN.csv')

alibaba.head()
# Output
	High	Low	Open	Close	Volume	Adj Close
Date						
2015-09-21	66.400002	62.959999	65.379997	63.900002	22355100	63.900002
2015-09-22	63.270000	61.580002	62.939999	61.900002	14897900	61.900002
2015-09-23	62.299999	59.680000	61.959999	60.000000	22684600	60.000000
2015-09-24	60.340000	58.209999	59.419998	59.919998	20645700	59.919998
2015-09-25	60.840000	58.919998	60.630001	59.240002	17009100	59.240002


alibaba['Adj Close'].plot()

alibaba['Adj Close'].plot(legend=True)

alibaba['Volume'].plot(legend=True)

alibaba['Adj Close'].plot()
amazon['Adj Close'].plot()

alibaba.head()
# Output
	High	Low	Open	Close	Volume	Adj Close
Date						
2015-09-21	66.400002	62.959999	65.379997	63.900002	22355100	63.900002
2015-09-22	63.270000	61.580002	62.939999	61.900002	14897900	61.900002
2015-09-23	62.299999	59.680000	61.959999	60.000000	22684600	60.000000
2015-09-24	60.340000	58.209999	59.419998	59.919998	20645700	59.919998
2015-09-25	60.840000	58.919998	60.630001	59.240002	17009100	59.240002

alibaba['high-low'] = alibaba['High'] - alibaba['Low']
alibaba['high-low'].plot()

alibaba.head()
# Output
	High	Low	Open	Close	Volume	Adj Close	high-low
Date							
2015-09-21	66.400002	62.959999	65.379997	63.900002	22355100	63.900002	3.440002
2015-09-22	63.270000	61.580002	62.939999	61.900002	14897900	61.900002	1.689999
2015-09-23	62.299999	59.680000	61.959999	60.000000	22684600	60.000000	2.619999
2015-09-24	60.340000	58.209999	59.419998	59.919998	20645700	59.919998	2.130001
2015-09-25	60.840000	58.919998	60.630001	59.240002	17009100	59.240002	1.920002

# daily return
alibaba['daily-return'] = alibaba['Adj Close'].pct_change()
alibaba['daily-return']
# Output
Date
2015-09-21         NaN
2015-09-22   -0.031299
2015-09-23   -0.030695
2015-09-24   -0.001333
2015-09-25   -0.011348
                ...   
2022-02-14   -0.002699
2022-02-15    0.035433
2022-02-16   -0.005387
2022-02-17   -0.009000
2022-02-18   -0.043719
Name: daily-return, Length: 1617, dtype: float64
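The daily return computed by pct_change is the relative change from the previous row, that is, price today divided by price yesterday, minus 1. The first row has no previous value, which is why it is NaN. A small sketch that checks this by hand, using three adjusted closes copied from the alibaba.head() output above (the variable names are illustrative):

```python
import pandas as pd

# Three adjusted closes copied from the alibaba.head() output above.
adj_close = pd.Series(
    [63.900002, 61.900002, 60.000000],
    index=pd.to_datetime(["2015-09-21", "2015-09-22", "2015-09-23"]),
    name="Adj Close",
)

daily_return = adj_close.pct_change()

# The same quantity written out by hand: p_t / p_(t-1) - 1.
manual = adj_close / adj_close.shift(1) - 1

print(daily_return.round(6))
print(manual.round(6))
```

Both series agree with the first rows of the daily-return column shown above (about -0.031299 and -0.030695).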

alibaba['daily-return'].plot()

alibaba['daily-return'].plot(figsize=(10,4), linestyle='--', marker='o')

alibaba['daily-return'].plot(kind='hist')

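# Note: distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(alibaba['daily-return'].dropna(), bins=100, color='purple', kde=True)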

7-4 Stock Market Analysis in Practice: Risk Analysis

In Jupyter Notebook, create a new file named 股票市场分析实战--风险分析.ipynb

# Basic setup
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Reading stock data
import pandas_datareader as pdr

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# time
from datetime import datetime

start = datetime(2015,1,1)
company = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'FB']
top_tech_df = pdr.get_data_yahoo(company, start=start)['Adj Close']
# top_tech_df.to_csv('../homework/top5.csv')
top_tech_df.head()
# Output
Symbols	AAPL	GOOG	MSFT	AMZN	FB
Date					
2014-12-31	24.951860	524.958740	40.836304	310.350006	78.019997
2015-01-02	24.714508	523.373108	41.108841	308.519989	78.449997
2015-01-05	24.018263	512.463013	40.730816	302.190002	77.190002
2015-01-06	24.020521	500.585632	40.132992	295.290009	76.150002
2015-01-07	24.357340	499.727997	40.642887	298.420013	76.150002

top_tech_dr = top_tech_df.pct_change()
top_tech_dr.head()
# Output
Symbols	AAPL	GOOG	MSFT	AMZN	FB
Date					
2014-12-31	NaN	NaN	NaN	NaN	NaN
2015-01-02	-0.009512	-0.003020	0.006674	-0.005897	0.005511
2015-01-05	-0.028172	-0.020846	-0.009196	-0.020517	-0.016061
2015-01-06	0.000094	-0.023177	-0.014677	-0.022833	-0.013473
2015-01-07	0.014022	-0.001713	0.012705	0.010600	0.000000

top_tech_df.plot()

top_tech_df[['AAPL', 'FB', 'MSFT']].plot()

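# x, y and data are passed by keyword; recent seaborn releases no longer accept them positionally
sns.jointplot(x='GOOG', y='GOOG', data=top_tech_dr, kind='scatter')

sns.jointplot(x='AMZN', y='GOOG', data=top_tech_dr, kind='scatter')

sns.jointplot(x='MSFT', y='FB', data=top_tech_dr, kind='scatter')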

sns.pairplot(top_tech_dr.dropna())
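The joint plots and the pair plot above are a visual check of how strongly the daily returns of two stocks move together. A numeric companion is the pairwise correlation matrix from DataFrame.corr(). The sketch below uses synthetic returns (made-up data with hypothetical column names, not real prices) so that the expected pattern is known in advance:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_days = 1000

# Synthetic daily returns: stock_b is built from stock_a plus noise, stock_c is independent.
stock_a = rng.normal(0.0, 0.02, n_days)
stock_b = stock_a + 0.5 * rng.normal(0.0, 0.02, n_days)
stock_c = rng.normal(0.0, 0.02, n_days)

returns = pd.DataFrame({"stock_a": stock_a, "stock_b": stock_b, "stock_c": stock_c})

# Pearson correlation between every pair of columns.
corr = returns.corr()
print(corr.round(2))
```

A pair whose scatter plot hugs a straight line, like stock_a and stock_b here, shows a correlation close to 1, while an unrelated pair stays close to 0.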

Reference on quantiles: https://en.wikipedia.org/wiki/Quantile

top_tech_dr['AAPL'].quantile(0.52)
# 0.001484101307547557

top_tech_dr['MSFT'].quantile(0.05)
# -0.025949460040733018

vips =  pdr.get_data_yahoo('VIPS', start=start)['Adj Close']  # Vipshop
vips.plot()

vips.pct_change().quantile(0.2)
# -0.02404166247703312
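One way to read the quantile calls above as a risk measure: quantile(0.05) of the daily returns is the level that the return fell below on roughly 5% of days, so a value of about -0.026 means that on 95% of days the loss was smaller than 2.6%. This reading is often called historical (empirical) value at risk; the notes themselves do not use that term. A sketch with synthetic returns (random data and an assumed position size, not real prices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily returns standing in for something like top_tech_dr['MSFT'].
returns = pd.Series(rng.normal(loc=0.0005, scale=0.02, size=2000))

# 5% quantile: the daily return that only about 5% of days fall below.
q05 = returns.quantile(0.05)

# Share of days that were actually worse than that level.
share_below = (returns < q05).mean()

# Hypothetical position size, used only to turn the quantile into a currency amount.
position = 1_000_000
loss_at_q05 = -q05 * position

print(f"5% quantile of daily returns: {q05:.4f}")
print(f"share of days below it: {share_below:.3f}")
print(f"daily loss exceeded on about 5% of days: {loss_at_q05:,.0f}")
```

The estimate only describes the sample it was computed from, so a different start date can change it noticeably.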

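Chapter 8: Course Summary

This chapter is not a recap of the preceding seven chapters; instead it points out a path for continued study and practice. Keep practicing and the results will come.

8-1 Summary

Four PDFs (worth printing out and reviewing often):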
   Numpy_Python_cheat_Sheet.pdf
   Pandas_cheat_sheet.pdf
   Python_Matplotlib_Cheat_Sheet.pdf
   Seaborn_cheat_sheet.pdf

The Kaggle website:

1. Read plenty of tutorial examples
https://www.kaggle.com/datasets: provides a number of public datasets
Bitcoin Historical Data: historical Bitcoin data
Kernels: where the data science work is done
Previously: https://www.kaggle.com/kernels
Now: https://www.kaggle.com/code
Data Science Tutorial for Beginners
Written as a Jupyter Notebook, with a fork feature
You can fork a notebook directly, and Kaggle will open an online Jupyter Notebook
