Data Analysis Course Notes

salmon1802

Last modified 2022-07-27 01:11:07


Category: Notes. Tags: data analysis, python, data mining

First published 2022-06-08 02:20:49

Article link: https://blog.csdn.net/Salmon1122/article/details/125175922


Anaconda

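Anaconda is a Python distribution for data science. It bundles conda, Python, and more than 150 scientific packages together with their dependencies. It lets you install different versions of a package (and its dependencies) on the same machine and switch between isolated environments. Think of it as a souped-up npm.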

Jupyter

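Jupyter Notebook is essentially a web application for creating and sharing documents that combine live code, mathematical equations, visualizations, and Markdown. Typical uses include data cleaning and transformation, numerical simulation, statistical modeling, and machine learning. In short, it is a notebook and a workbench in one.

To start it, open CMD in the directory you want to use as the root and run jupyter notebook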

NumPy

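NumPy (Numerical Python) is an open-source numerical computing library for Python. It stores and processes large matrices and high-dimensional arrays, includes a large set of functions needed for scientific computing, and runs far faster than equivalent pure-Python code.
NumPy does not store data the way a native Python list does, and much of its code is implemented in C, which is why it is more efficient than pure Python.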

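The ndarray object

NumPy defines an n-dimensional array object, ndarray (N-dimensional array). Unlike a Python list, it only stores elements of a single type.

numpy.array accepts an explicitly written list or tuple, or another sequence such as a range. (A bare generator is not expanded by np.array; wrap it in list() or use np.fromiter first.)
When the elements have different types, they are all converted to the most general type among them.
Mixing integers with a single float, for example, turns every element into a float:

import numpy as np

print(np.array([1, 2, 3, 4.5]))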

Output:
[1.  2.  3.  4.5]
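
The promotion rule can be inspected directly through the dtype attribute. The following is a small sketch; the variable names are made up for illustration and are not from the course material.

```python
import numpy as np

ints_and_float = np.array([1, 2, 3, 4.5])
numbers_and_str = np.array([1, 2.5, "3"])

print(ints_and_float.dtype)        # float64
print(numbers_and_str.dtype.kind)  # U (every element became a unicode string)
print(np.result_type(np.int32, np.float64))  # float64
```

np.result_type reports which type NumPy would pick for a given mix without building an array.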

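ndim reports the number of dimensions of an array. When the nested sequences have different lengths, NumPy cannot build a rectangular 2-D array. Older NumPy versions silently fell back to a 1-D array of Python objects; NumPy 1.24 and later raise a ValueError unless dtype=object is passed explicitly:

array1 = np.array([[1,2,3,4],(1,2,3)], dtype=object)
print(array1)
print(array1.ndim)

Output:
[list([1, 2, 3, 4]) (1, 2, 3)]
1 # one dimension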

array1 = np.array([[1,2,3,4],(1,2,3,4)])
print(array1)
print(array1.ndim)

Output:
[[1 2 3 4]
 [1 2 3 4]]
2 # two dimensions

The dtype parameter forces the elements of the array to a given type, for example:

array2 = np.array([1, 2.5, 3, 2, 4.7], dtype=int)
print(array2)

Output:
[1 2 3 2 4] # truncated, the same result as int()

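The copy parameter defaults to True, meaning that np.array makes a copy of its input:

Because a copy is made, array2_backup_backup = np.array(array2) allocates new memory to store array2_backup_backup.

With copy=False, no new memory is allocated when the input is already a compatible ndarray, so the result shares its data with the original. (NumPy 2.x raises a ValueError if a copy cannot be avoided; older versions silently copied in that case.)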
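
Whether two arrays share memory can be checked with np.shares_memory. This is a minimal sketch with illustrative variable names; it passes an existing ndarray, so copy=False behaves the same way on old and new NumPy versions.

```python
import numpy as np

original = np.array([1, 2, 3])
copied = np.array(original)              # copy=True is the default
shared = np.array(original, copy=False)  # input is already an ndarray, no copy needed

original[0] = 99
print(copied)   # [1 2 3]
print(shared)   # [99  2  3]
print(np.shares_memory(original, copied), np.shares_memory(original, shared))  # False True
```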

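ndmin is an optional parameter that sets the minimum number of dimensions of the resulting array.
subok is an optional parameter that defaults to False, in which case the result is always a plain base-class ndarray. With subok=True, an ndarray subclass passed in keeps its own type. (False: NumPy's own type wins; True: the input's type wins.)

In the example, array3 and array5 have the same type: the type of the input is kept.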
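
Both parameters can be tried on a tiny example. TaggedArray below is a hypothetical ndarray subclass defined only for this demonstration; it does not appear in the original notes.

```python
import numpy as np

class TaggedArray(np.ndarray):
    """Hypothetical ndarray subclass used only for this demonstration."""

tagged = np.arange(3).view(TaggedArray)

as_base = np.array(tagged)              # subok=False (default): plain ndarray
as_sub = np.array(tagged, subok=True)   # subok=True: the subclass is preserved
at_least_2d = np.array([1, 2, 3], ndmin=2)

print(type(as_base).__name__)   # ndarray
print(type(as_sub).__name__)    # TaggedArray
print(at_least_2d.shape)        # (1, 3)
```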

arange()

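When some arguments are omitted, say explicitly which parameter each remaining value belongs to; otherwise the result may not be what you expect.

When an array is too long, print does not show all of its elements.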
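
A short sketch of the common arange call forms, assuming the usual (start, stop, step) parameter order; the variable names are illustrative.

```python
import numpy as np

only_stop = np.arange(5)               # start defaults to 0, step to 1
start_stop_step = np.arange(2, 10, 3)  # positional: start, stop, step
with_keyword = np.arange(0, 10, step=5)
long_text = str(np.arange(2000))       # more than 1000 elements: printed in abbreviated form

print(only_stop)            # [0 1 2 3 4]
print(start_stop_step)      # [2 5 8]
print(with_keyword)         # [0 5]
print("..." in long_text)   # True
```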

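linspace(): arithmetic sequences

Note that with endpoint=True (the default), linspace does not use the usual half-open interval [start, stop); both ends are included.
retstep=True additionally returns the common difference d of the sequence.

In that case the return value is a tuple of (array, step).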
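
The effect of endpoint and retstep can be seen side by side in a small sketch (variable names are illustrative):

```python
import numpy as np

closed = np.linspace(0, 10, 5)                      # endpoint=True by default
half_open = np.linspace(0, 10, 5, endpoint=False)   # stop is excluded
values, step = np.linspace(0, 10, 5, retstep=True)  # a tuple: (array, step)

print(closed)      # [ 0.   2.5  5.   7.5 10. ]
print(half_open)   # [0. 2. 4. 6. 8.]
print(step)        # 2.5
```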

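logspace(): geometric sequences

Like linspace, it uses a closed interval rather than the usual half-open one.
When base is not given it defaults to 10.

Let's look at how the arguments of logspace work together, for example: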

array7 = np.logspace(1,5,3)
print(array7)

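Output:
[1.e+01 1.e+03 1.e+05]
that is: 10 1000 100000

The arguments passed in are start = 1, stop = 5, num = 3. logspace therefore builds an array of three elements with the default base 10, a starting exponent of 1 and a final exponent of 5. It spaces three exponents evenly between 1 and 5, giving [1, 3, 5], so the result is [$10^{1}$, $10^{3}$, $10^{5}$].

The following example may make this clearer: it takes an arithmetic sequence of 10 values between 1 and 2 and uses each value as the exponent of the base.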
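
The relationship between logspace and linspace can be checked directly. A minimal sketch, with illustrative variable names:

```python
import numpy as np

exponents = np.linspace(1, 2, 10)
by_logspace = np.logspace(1, 2, 10)      # base defaults to 10
by_hand = 10 ** exponents                # same values, computed manually
base2 = np.logspace(1, 5, 3, base=2)     # 2**1, 2**3, 2**5

print(np.allclose(by_logspace, by_hand))  # True
print(base2)
```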

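zeros: arrays of all zeros

dtype defaults to float and can be set to another data type.

Note that line 32 of the code passes only (2), and zeros returns a one-dimensional float array of zeros.

For a three-dimensional array the shape is (blocks, rows, columns), which can also be read as the mathematical coordinates (Z, X, Y)

zeros_like returns an array of zeros with the same shape and dtype as the given array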
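
A short sketch of the three cases described above (1-D, 3-D, and zeros_like); the variable names are illustrative.

```python
import numpy as np

flat = np.zeros(2)                      # (2) is just the integer 2, so this is 1-D
cube = np.zeros((2, 3, 4), dtype=int)   # 2 blocks, 3 rows, 4 columns
template = np.array([[1.5, 2.5], [3.5, 4.5]])
like = np.zeros_like(template)          # same shape and dtype as template

print(flat, flat.dtype)        # [0. 0.] float64
print(cube.shape)              # (2, 3, 4)
print(like.shape, like.dtype)  # (2, 2) float64
```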

ones(): arrays of all ones

Works the same way as zeros(), so the details are not repeated here. Example code:

array6 = np.ones(9)
array7 = np.ones((3,3))
array8 = np.ones_like([1,2,3])
print('----------------')
print(array6)
print('----------------')
print(array7)
print('----------------')
print(array8)

Output:
----------------
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
----------------
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
----------------
[1 1 1]

ndarray attributes

Example code:

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3],
              [4, 5, 6]])
c = np.array([
    [
        [1, 2, 3],
        [4, 5, 6]
    ],
    [
        [6, 5, 4],
        [3, 2, 1]
    ]
    ])
print(a.shape,a.ndim,a.size,a.dtype,a.itemsize)
print(b.shape,b.ndim,b.size,b.dtype,b.itemsize)
print(c.shape,c.ndim,c.size,c.dtype,c.itemsize)



Output:
(3,) 1 3 int32 4
(2, 3) 2 6 int32 4
(2, 2, 3) 3 12 int32 4

reshape: changing the dimensions of an array

array = np.arange(10)
print(array)
print(array.reshape((2,5)))

Output:
[0 1 2 3 4 5 6 7 8 9]
[[0 1 2 3 4]
 [5 6 7 8 9]]

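resize also changes the shape, but unlike reshape the new shape does not have to match the number of elements; missing positions are filled automatically.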

array = np.array([[1,2],[3,4]])
print(array)
print('np.resize:', )
print(np.resize(array,(2,3)))
# print('np.reshape:',) cannot be run: the number of elements in array does not fit the requested shape
# print( np.reshape(array,(2,3)))
print("array.resize:", )
array.resize(2,3)
print(array)

Output:
[[1 2]
 [3 4]]
np.resize:
[[1 2 3]
 [4 1 2]]
array.resize:
[[1 2 3]
 [4 0 0]]

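As you can see, the NumPy function np.resize and the resize method of an array object behave differently: the former fills the extra positions with repeated copies of the array itself, the latter fills them with 0.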

Slicing and indexing

With a plain Python list we get the following result.
Run:
A = [1,2,3,4]
B = A
print(A,B)
B[0] = 10
print(A,B)
C = A[::]
print(C)
C[0] = 1
print(A,B,C)

Output:
[1, 2, 3, 4] [1, 2, 3, 4]
[10, 2, 3, 4] [10, 2, 3, 4]
[10, 2, 3, 4]
[10, 2, 3, 4] [10, 2, 3, 4] [1, 2, 3, 4]
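So slicing a list (C = A[::]) makes a copy rather than a reference to the same object.

With an ndarray, however, slicing no longer makes a copy; the slice refers directly to the original data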
A = np.arange(1, 10)
print(A)
B = A[::]
print(A, B)
B[0] = 10
print(A, B)

Output:
[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9] [1 2 3 4 5 6 7 8 9]
[10  2  3  4  5  6  7  8  9] [10  2  3  4  5  6  7  8  9]
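
When an independent copy of a slice is needed, .copy() can be called explicitly. A minimal sketch with illustrative variable names:

```python
import numpy as np

A = np.arange(1, 10)
view = A[2:5]                 # a view: shares memory with A
independent = A[2:5].copy()   # an explicit copy: owns its data

view[0] = -1                  # also changes A[2]
print(A)                      # [ 1  2 -1  4  5  6  7  8  9]
print(independent)            # [3 4 5]
print(np.shares_memory(A, view), np.shares_memory(A, independent))  # True False
```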

Slicing also works on multi-dimensional arrays

array = np.arange(20).reshape(4,5)
print(array)
print(array[2:])

Output:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
 

[[10 11 12 13 14]
 [15 16 17 18 19]]

... can also be used to mean "take everything"

array = np.arange(20).reshape(4,5)
print(array)
print(array[...,1])

Output:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

[ 1  6 11 16]   # the second column

Columns can be sliced in the same way with one small change

array = np.arange(20).reshape(4,5)
print(array)
print(array[...,1:])

Output:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

[[ 1  2  3  4]
 [ 6  7  8  9]
 [11 12 13 14]
 [16 17 18 19]]

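Note that [?,?] and [?][?] are not the same thing.
[A,B] goes straight to the element at that pair of indices, whereas [A][B] first takes sub-array A and then takes element B of that result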


array = np.arange(20).reshape(4,5)
print(array)
print(array[1,2])
print(array[1][2])
print(array[...,1])
print(array[...][1])


[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

7
7

[ 1  6 11 16]
[5 6 7 8 9] # array[...] first takes all rows of the array, then [1] takes the element at index 1.

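Besides slicing, NumPy also offers more convenient operations such as boolean-mask indexing:

array = np.arange(1, 21).reshape(4, 5)
print(array)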
print(array[array > 10])
print(array[(array > 3) & (array < 7)])
print(array[(array < 3) | (array > 7)])
array[array % 2 == 1] = -1
print(array)



Output:
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]

[11 12 13 14 15 16 17 18 19 20]

[4 5 6]

[ 1  2  8  9 10 11 12 13 14 15 16 17 18 19 20]

[[-1  2 -1  4 -1]
 [ 6 -1  8 -1 10]
 [-1 12 -1 14 -1]
 [16 -1 18 -1 20]]

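NumPy broadcasting

Broadcasting applies when two arrays (matrices) of different shapes are combined in an operation: the array with fewer dimensions is automatically expanded so that it can operate with the higher-dimensional one.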

A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
B = np.array([1,2,3])
print(A + B)
print(A * B)

Output:
[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]
 
[[ 1  4  9]
 [ 4 10 18]
 [ 7 16 27]]


A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
B = np.array([1,2,3])
print(np.shape(A + B))
print(np.shape(A * B))

Output:
(3, 3)
(3, 3)

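Here A has shape (3,3) and B has shape (3,). Right-aligned, they are:
3  3
   3
---------
3  3   every aligned pair is either equal or has a missing entry, so the precondition for broadcasting holds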

A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
B = np.array([1])
print(np.shape(A + B))
print(np.shape(A * B))
print(np.shape(B))

Output:
(3, 3)
(3, 3)
(1,)
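Here A has shape (3,3) and B has shape (1,). Right-aligned, they are:
3  3
   1
---------
3  3   every aligned pair contains a 1 or a missing entry, so the precondition for broadcasting holds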

When this precondition is violated, an error is raised:

A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
B = np.array([1,2])
print(np.shape(A + B))
print(np.shape(A * B))
print(np.shape(B))

Output:
Traceback (most recent call last):
ValueError: operands could not be broadcast together with shapes (3,3) (2,)
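
The same alignment rule explains how to broadcast along columns instead of rows: give the smaller array a trailing dimension of size 1. The sketch below uses illustrative names, and np.broadcast_shapes assumes NumPy 1.20 or newer.

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)
row = np.array([10, 20, 30])   # shape (3,):   added to every row
col = row[:, np.newaxis]       # shape (3, 1): added to every column

print(A + row)
print(A + col)
print(np.broadcast_shapes(A.shape, col.shape))   # (3, 3)

try:
    A + np.array([1, 2])       # shapes (3, 3) and (2,) do not align
except ValueError as err:
    print("cannot broadcast:", err)
```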

mean()

This function computes the mean of all elements. To average along a single dimension only, set the axis parameter.

A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
print(A.mean())
print(A.mean(axis=0)) # axis=0: for each column, average over all rows
print(A.mean(axis=1)) # axis=1: for each row, average over all columns

Output:
5.0
[4. 5. 6.]
[2. 5. 8.]

median

The median, also called the middle value

A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
print(np.median(A))

Output:
5.0

std (standard deviation)

studentsGradeA = np.array([95,100,40,50,60,80,75,80,60,90])
studentsGradeB = np.array([75,85,70,80,70,65,65,75,80,85])
print(studentsGradeA.std())
print(studentsGradeB.std())
print(np.std(studentsGradeA))
print(np.std(studentsGradeB))

Output:
18.867962264113206
7.0710678118654755
18.867962264113206
7.0710678118654755
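
The value returned by std can be reproduced by hand, which also shows that NumPy divides by N by default (the population standard deviation). A minimal sketch; ddof=1 is the standard NumPy option for the sample version.

```python
import numpy as np

grades = np.array([95, 100, 40, 50, 60, 80, 75, 80, 60, 90])
mean = grades.mean()
population_std = np.sqrt(np.mean((grades - mean) ** 2))  # what np.std computes by default
sample_std = grades.std(ddof=1)                          # divides by N - 1 instead of N

print(population_std)
print(np.isclose(population_std, grades.std()))   # True
print(sample_std > population_std)                # True
```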

var

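Variance is the square of the standard deviation. Its unit is the square of the data's unit, whereas the standard deviation has the same unit as the data. Variance is therefore used less often for reporting, because its values do not match human intuition.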

studentsGradeA = np.array([95,100,40,50,60,80,75,80,60,90])
studentsGradeB = np.array([75,85,70,80,70,65,65,75,80,85])
print(studentsGradeA.var())
print(studentsGradeB.var())

Output:
356.0
50.0

max and min

The maximum and minimum values.

A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])
print(A.max())
print(A.max(axis=0))
print(A.max(axis=1))

Output:
9
[7 8 9]
[3 6 9]

average

The weighted average.

Ming = np.array([80,90,95])
Gang = np.array([95,90,80])
weight = np.array([0.2,0.3,0.5])
print(np.average(Ming,weights=weight))
print(np.average(Gang,weights=weight))

Output:
90.5
86.0
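
The weighted average is the sum of value times weight divided by the sum of the weights. A small sketch that checks this against np.average, reusing Ming's scores from above:

```python
import numpy as np

scores = np.array([80, 90, 95])
weights = np.array([0.2, 0.3, 0.5])

by_hand = np.sum(scores * weights) / np.sum(weights)
print(by_hand)
print(np.isclose(by_hand, np.average(scores, weights=weights)))   # True
```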

Data types

File I/O: loadtxt

- loadtxt reads a text file and converts its contents into an array that can be processed
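
A minimal sketch of loadtxt. To stay self-contained it reads from an in-memory io.StringIO object instead of a file on disk; the column names and values are invented for illustration.

```python
import io
import numpy as np

csv_text = io.StringIO("height,weight\n170,60.5\n182,75.0\n165,52.3\n")
data = np.loadtxt(csv_text, delimiter=",", skiprows=1)   # skip the header line
print(data.shape)    # (3, 2)

# unpack=True returns one array per column
heights, weights = np.loadtxt(io.StringIO("170,60.5\n182,75.0"), delimiter=",", unpack=True)
print(heights)       # [170. 182.]
```

With a real file, the first argument would simply be the file path.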

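Random numbers: numpy.random

rand and randn differ in that the former draws from the uniform distribution on [0, 1), while the latter draws from the standard normal distribution.

import matplotlib.pyplot as plt

nums1 = np.random.rand(10000)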
plt.hist(nums1)
plt.show()

nums2 = np.random.randn(10000)
plt.hist(nums2)
plt.show()

Output: [figures]

nums3 = np.random.randint(1, 10, size=(3,3))
print(nums3)

Output:
[[8 8 7]
 [3 9 9]
 [4 4 6]]

sample takes no distribution parameters; it simply returns random floats drawn uniformly from [0.0, 1.0)

num = np.random.sample((3,3))
print(num)

Output:
[[0.4458495  0.9152457  0.72226558]
 [0.16033389 0.89143945 0.68896212]
 [0.41554979 0.03056996 0.44517876]]

Random seed

np.random.seed(10)
nums1 = np.random.randn(3,3)
print(nums1)

np.random.seed(10)
nums2 = np.random.randn(3,3)
print(nums2)

Output:
[[ 1.3315865   0.71527897 -1.54540029]
 [-0.00838385  0.62133597 -0.72008556]
 [ 0.26551159  0.10854853  0.00429143]]
 
[[ 1.3315865   0.71527897 -1.54540029]
 [-0.00838385  0.62133597 -0.72008556]
 [ 0.26551159  0.10854853  0.00429143]]
The two arrays are identical. Note, however, that each "sowing" of the seed yields only one "harvest", as shown below:

np.random.seed(10)
nums1 = np.random.randn(3,3)
print(nums1)

# np.random.seed(10)
nums2 = np.random.randn(3,3)
print(nums2)

Output:
[[ 1.3315865   0.71527897 -1.54540029]
 [-0.00838385  0.62133597 -0.72008556]
 [ 0.26551159  0.10854853  0.00429143]]
 
[[-0.17460021  0.43302619  1.20303737]
 [-0.96506567  1.02827408  0.22863013]
 [ 0.44513761 -1.13660221  0.13513688]]
This time the two arrays differ.
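
The same behavior can be reproduced with NumPy's newer Generator interface, np.random.default_rng, which keeps the seed inside an object instead of in global state. This is an alternative to the np.random.seed calls above, not what the course used.

```python
import numpy as np

rng_a = np.random.default_rng(10)
rng_b = np.random.default_rng(10)

first_a = rng_a.standard_normal((3, 3))
first_b = rng_b.standard_normal((3, 3))
second_a = rng_a.standard_normal((3, 3))   # the stream has advanced

print(np.array_equal(first_a, first_b))    # True: same seed, same first draw
print(np.array_equal(first_a, second_a))   # False: second draw from the same generator
```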

normal: a normal distribution with custom parameters

nums = np.random.normal(10, 1, 100000)
plt.hist(nums)
plt.show()

Output: [figure]
The axis of symmetry is at x = 10

Other functions

append

array1 = np.arange(1,10).reshape(3,3)
print(array1)
array2 = np.append(array1, [1,2,3])
print(array2)
array3 = np.append(array1, [[1,1,1],[1,1,1]],axis=0)
print(array3)
array4 = np.append(array1, [[1,1,1],[1,1,1],[1,1,1]],axis=1)
print(array4)

Output:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
 
[1 2 3 4 5 6 7 8 9 1 2 3]

[[1 2 3]
 [4 5 6]
 [7 8 9]
 [1 1 1]
 [1 1 1]]
 
[[1 2 3 1 1 1]
 [4 5 6 1 1 1]
 [7 8 9 1 1 1]]

insert

array1 = np.arange(1, 10).reshape(3, 3)
print(array1)
array2 = np.insert(array1, 1, [0, 0, 0])
print(array2)
array3 = np.insert(array1, 1, [[1, 1, 1]], axis=0)
print(array3)
array4 = np.insert(array1, 1, [[1, 1, 1]], axis=1)
print(array4)

Output:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
 
[1 0 0 0 2 3 4 5 6 7 8 9]

[[1 2 3]
 [1 1 1]
 [4 5 6]
 [7 8 9]]
 
[[1 1 2 3]
 [4 1 5 6]
 [7 1 8 9]]

delete

Here obj is the index (or indices) of the elements to delete

array1 = np.arange(1, 10).reshape(3, 3)
print(array1)
array2 = np.delete(array1, 1)
print(array2)
array3 = np.delete(array1, [1,2], axis=0)
print(array3)
array4 = np.delete(array1, [1,2], axis=1)
print(array4)

Output:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
 
[1 3 4 5 6 7 8 9]

[[1 2 3]]

[[1]
 [4]
 [7]]

argwhere

array = np.arange(0, 4).reshape(2, 2)
array[1, 1] = 0
print(array)
print(np.argwhere(array >= 1))

Output:
[[0 1]
 [2 0]]
[[0 1]
 [1 0]]

unique
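An old friend. It removes duplicate elements from an array and returns the result as a sorted array.

The function returns the array of unique elements of the input. It can also return index arrays associated with the unique values; which ones are returned depends on the return_* arguments passed in the call.

numpy.unique(arr, return_index, return_inverse, return_counts)

arr: the input array. If it is not one-dimensional, it is flattened
return_index: if True, also return the indices in the input array of the elements that were kept
return_inverse: if True, also return, for each element of the input array, its index in the unique array
return_counts: if True, also return how many times each element of the unique array occurs in the input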

array = np.array([1,1,2,2,3,3,4])
print(array)
uniqueArray = np.unique(array)
print(uniqueArray)

uniqueArray = np.unique(array, return_index=True)
print(uniqueArray)
uniqueArray = np.unique(array, return_inverse=True)
print(uniqueArray)
uniqueArray = np.unique(array, return_counts=True)
print(uniqueArray)

Output:
[1 1 2 2 3 3 4]
[1 2 3 4]
(array([1, 2, 3, 4]), array([0, 2, 4, 6], dtype=int64))
(array([1, 2, 3, 4]), array([0, 0, 1, 1, 2, 2, 3], dtype=int64))
(array([1, 2, 3, 4]), array([2, 2, 2, 1], dtype=int64))

numpy.argmax and argmin return the index of the largest or smallest element

array = np.array([1,1,2,3,3,3,4])
print(np.argmax(array))
print(array[np.argmax(array)])
print('-'*20)
print(np.argmin(array))
print(array[np.argmin(array)])

6
4
--------------------
0
1

sort排序

Note that a sorted copy of the array is returned.

print('-'*20)
array = np.array([[1,5,2], [9,3,4]])
print(np.sort(array))
print(array)
print(np.sort(array,axis=0))
print(np.sort(array,axis=1))
print('-'*20)
dtype = np.dtype([('name', 'S10'),('age', int)])
people = np.array([("salmon",22),("monsal",11),("almson",33)],dtype=dtype)
print(np.sort(people,order='age'))

Output:
--------------------
[[1 2 5]
 [3 4 9]]
[[1 5 2]
 [9 3 4]]
[[1 3 2]
 [9 5 4]]
[[1 2 5]
 [3 4 9]]
--------------------
[(b'monsal', 11) (b'salmon', 22) (b'almson', 33)]


array = np.array([5,3,2,2,7,3,1])
sorted_index = np.argsort(array)
print(sorted_index)

Output:
[6 2 3 1 5 0 4]
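
The indices returned by argsort are mostly useful for reordering the array itself or a second, parallel array. A minimal sketch; the labels array is invented for illustration, and kind="stable" is passed so that equal values keep their original order.

```python
import numpy as np

array = np.array([5, 3, 2, 2, 7, 3, 1])
labels = np.array(["e", "c", "b", "b2", "g", "c2", "a"])   # hypothetical parallel data

order = np.argsort(array, kind="stable")   # indices that would sort the array
print(array[order])          # [1 2 2 3 3 5 7]
print(array[order[::-1]])    # [7 5 3 3 2 2 1]
print(labels[order])         # labels reordered to follow the sorted values
```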

matplotlib

plot, title, xlabel, ylabel

plot: draws a line chart
title: sets the chart title
xlabel: sets the x-axis label
ylabel: sets the y-axis label

x = np.arange(-50, 51)
y = x ** 2
plt.plot(x, y)
plt.title('y = x^2')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()

Output: [figure]

fontsize, linewidth

fontsize: sets the text size
linewidth: sets the width of the plotted line

x = np.arange(-50, 51)
y = x ** 2
plt.plot(x, y, linewidth=5)
plt.title('y = x^2')
plt.xlabel('x-axis', fontsize=15)
plt.ylabel('y-axis', fontsize=15)
plt.show()

Output: [figure]

Drawing several curves

x = np.arange(-50, 51)
y1 = x ** 2
y2 = x
plt.plot(x, y1, linewidth=3)
plt.title('y = x^2')
plt.xlabel('x-axis', fontsize=15)
plt.ylabel('y-axis', fontsize=15)
plt.plot(x, y2, linewidth=3)
plt.show()


xticks


times = ['2020-1-1', '2020-1-2', '2020-1-3', '2020-1-4', '2020-1-5', '2020-1-6', '2020-1-7', '2020-1-8', '2020-1-9','2022-1-1', '2022-1-2', '2022-1-3', '2022-1-4', '2022-1-5']
sales = np.random.randint(0, 2000, size=len(times))
plt.plot(times, sales)
plt.show()

Output: [figure]
The x-axis labels overlap here; xticks is meant to solve exactly this kind of problem

times = ['2020-1-1', '2020-1-2', '2020-1-3', '2020-1-4', '2020-1-5', '2020-1-6', '2020-1-7', '2020-1-8', '2020-1-9',
         '2022-1-1', '2022-1-2', '2022-1-3', '2022-1-4', '2022-1-5']
sales = np.random.randint(0, 2000, size=len(times))
plt.plot(times, sales)
plt.xticks(range(1, len(times), 2), labels=[1, 2, 3, 4, 5, 6, 7], rotation=45)  # show a tick at every second position, relabel the ticks 1, 2, 3, 4, ... and rotate them 45 degrees
plt.show()

Output: [figure]

legend

Labels what each line means when several lines are plotted

x_axis = [1,2,3,4,5]
nums1 = np.random.randint(500,2000,size=len(x_axis))
nums2 = np.random.randint(0,1500,size=len(x_axis))
plt.xticks(rotation=45)
plt.plot(x_axis,nums1,label='income')
plt.plot(x_axis,nums2,label='expenditure')
plt.legend()
plt.show()

Output: [figure]

The position of the legend can also be set manually, through the loc argument of legend()

text


x_axis = [1,2,3,4,5]
nums1 = np.random.randint(500,2000,size=len(x_axis))
nums2 = np.random.randint(0,1500,size=len(x_axis))
plt.xticks(rotation=45)
plt.plot(x_axis,nums1,label='income')
plt.plot(x_axis,nums2,label='expenditure')
for x,y in zip(x_axis,nums1):
    plt.text(x,y,y)
for x,y in zip(x_axis,nums2):
    plt.text(x,y,y)
plt.legend()
plt.show()
Output: [figure]

grid

linestyle = '--' gives dashed grid lines; the default is a solid line

x = np.linspace(-np.pi,np.pi,512,endpoint=True)
cos = np.cos(x)
sin = np.sin(x)
plt.plot(x,sin)
plt.plot(x,cos)
plt.grid(True)
plt.show()
Output: [figure]

gca (get current axes)

The axes object returned by this function can be used to move the axis lines

x = np.arange(-100, 100)
y = x**2
plt.plot(x,y)
plt.show()
Output: [figure]
This is not the axis style we want here, so we convert it to the X/Y axes commonly used in mathematics

x = np.arange(-100, 100)
y = x ** 2
plt.plot(x, y)
ax = plt.gca()
ax.spines['right'].set_color('none')  # hide the right spine
ax.spines['top'].set_color('none')    # hide the top spine
ax.spines['left'].set_position(('axes', 0.5))
ax.spines['bottom'].set_position(('data', 0.0))
# set_position supports three position types: 'data', 'axes' and 'outward'; 'data' moves the spine to the given data coordinate
# 'axes' moves the spine to the given fraction of the axes length
plt.show()

Output: [figure]

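Figure resolution and size

Miscellaneous plot parameters

color: a hexadecimal color code, an English color name, or its abbreviation
alpha: line transparency (between 0 and 1)
linestyle: line style
marker: data-point marker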

x = np.arange(-10, 11)
y = x ** 2
plt.plot(x, y, color='r', linewidth=1, marker='H', linestyle='--', label='y = x^2')
plt.legend()
plt.show()
Output: [figure]

Creating a figure object

x = np.arange(-10, 11)
y = x ** 2
figure = plt.figure('y = x^2', figsize=(4, 2), dpi=100)
plt.plot(x, y)
plt.show()
Output: [figure]

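Drawing multiple subplots

add_axes() creates an axes region on the canvas; its argument is [left, bottom, width, height], given as fractions of the figure size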

x = np.arange(-10, 11)
y = x ** 2
figure = plt.figure('y = x^2', figsize=(4, 2), dpi=100)
axes1 = figure.add_axes([0.3, 0.3, 0.5, 0.5])
plt.plot(x,y)
axes2 = figure.add_axes([0, 0.5, 0.3, 0.3])
axes2.plot([1,2,3,4],[1,2,3,4]) # calling plot on the axes has the same effect as plt.plot
plt.show()
Output: [figure]

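subplot() divides the canvas evenly into several regions

As an aside, the x in plot(x,y) can be omitted; it then defaults to 0 to N-1 in steps of 1, where N is the number of elements in y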

x = np.arange(0, 11)
y = x ** 2
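plt.subplot(2, 1, 1) # the canvas is split into two rows and one column; work on the first region
plt.plot(x, y, marker='o')

plt.subplot(212) # the canvas is split into two rows and one column; work on the second region
plt.plot(np.arange(11) ** 2)
plt.show()
Output: [figure]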

x = np.arange(0, 11)
y = x ** 2
plt.subplot(2, 1, 1, title='y = x ^ 2', xlabel='This is the X axis')  # two rows, one column; work on the first region
plt.plot(x, y, marker='o')

plt.subplot(212, title='y = x ^ 2', xlabel='This is the X axis')  # two rows, one column; work on the second region
plt.plot(np.arange(11) ** 2)
# plt.tight_layout()
plt.show()

Note that without plt.tight_layout() the subplots overlap; uncommenting that call removes the overlap.

subplots()

x = np.arange(1, 102)
y1 = x ** 2
y2 = np.sqrt(x)
y3 = np.exp(x)
y4 = np.log2(x)
fig, axes = plt.subplots(2, 2)  # split one canvas into a grid of two rows and two columns
axes[0][0].plot(x, y1, marker='o',markevery=25)
axes[0][0].set_title('square')
axes[0][1].plot(x, y2, marker='o',markevery=25)
axes[0][1].set_title('sqrt')
axes[1][0].plot(x, y3, marker='o',markevery=25)
axes[1][0].set_title('e')
axes[1][1].plot(x, y4, marker='o',markevery=25)
axes[1][1].set_title('log')
plt.tight_layout()
plt.show()
Output: [figure]

Bar charts

x = np.arange(5)
data = [5, 25, 20, 15, 10]
plt.bar(x,data,color=['r','g','b'])
plt.show()
Output: [figure]

Grouped bar chart (several bars per position)

start_position = np.arange(5)
countries = ['China', 'Germany', 'America', 'England', 'Russia']
gold_medal = [16, 4, 12, 7, 10]
silver_medal = [8, 10, 12, 5, 7]
bronze_medal = [13, 5, 2, 7, 6]
# set the x positions
width = 0.25
gold_position = start_position
silver_position = gold_position + width
bronze_position = silver_position + width
# draw the bars
plt.bar(gold_position, gold_medal, color='gold', width=width, label='Gold number')
plt.bar(silver_position, silver_medal, color='silver', width=width, label='Silver number')
plt.bar(bronze_position, bronze_medal, color='g', width=width, label='Bronze number')
# relabel the x axis
plt.xticks(start_position + width, labels=countries)
# add the value above each bar and adjust the alignment
for i in np.arange(len(start_position)):
    plt.text(gold_position[i], gold_medal[i], gold_medal[i], va='bottom', ha='center')
    plt.text(silver_position[i], silver_medal[i], silver_medal[i], va='bottom', ha='center')
    plt.text(bronze_position[i], bronze_medal[i], bronze_medal[i], va='bottom', ha='center')
# add the legend
plt.legend()
plt.show()
Output: [figure]

Stacked bar chart

start_position = np.arange(5)
countries = ['China', 'Germany', 'America', 'England', 'Russia']
gold_medal = np.array([16, 4, 12, 7, 10])
silver_medal = np.array([8, 10, 12, 5, 7])
bronze_medal = np.array([13, 5, 2, 7, 6])
# set the bar width
width = 0.25
plt.bar(start_position, gold_medal, color='gold', width=width, label='Gold number', bottom=silver_medal+bronze_medal)
plt.bar(start_position, silver_medal, color='silver', width=width, label='Silver number', bottom=bronze_medal)
plt.bar(start_position, bronze_medal, color='g', width=width, label='Bronze number')
plt.xticks(start_position, labels=countries)
plt.show()
Output: [figure]

Horizontal bar chart

Here left plays the role of bottom, and height plays the role of width

plt.rcParams['figure.figsize']=(14.5, 4)
movie = ['The Shawshank Redemption', 'Forrest Gump', 'Green Book']
boxOffice_day1 = np.array([9842, 9531, 5489])
boxOffice_day2 = np.array([8652, 9248, 6890])
boxOffice_day3 = np.array([2356, 6982, 8904])
sum_boxOffice = boxOffice_day1+boxOffice_day2+boxOffice_day3
day2_position = boxOffice_day1
day3_position = boxOffice_day1 + boxOffice_day2
height = 0.25
plt.barh(movie,boxOffice_day1,height=height)
plt.barh(movie,boxOffice_day2,height=height,left=day2_position)
plt.barh(movie,boxOffice_day3,height=height,left=day3_position)
for i in range(len(movie)):
    plt.text(sum_boxOffice[i],movie[i],sum_boxOffice[i],va='center',ha='left')
plt.show()


# the rest of the code is the same as above
y_start_position = np.arange(len(movie))
plt.barh(y_start_position, boxOffice_day1, height=height)
plt.barh(y_start_position + height, boxOffice_day2, height=height)
plt.barh(y_start_position + height * 2, boxOffice_day3, height=height)
plt.yticks(y_start_position + height, movie)
for i in range(len(movie)):
    plt.text(boxOffice_day1[i], y_start_position[i], boxOffice_day1[i], va='center', ha='left')
    plt.text(boxOffice_day2[i], y_start_position[i] + height, boxOffice_day2[i], va='center', ha='left')
    plt.text(boxOffice_day3[i], y_start_position[i] + height * 2, boxOffice_day3[i], va='center', ha='left')
plt.show()


Histograms

values = np.random.randint(140, 200, 100)
plt.hist(values, bins=10, edgecolor='white')
plt.title('Height/Rate data statistics')
plt.xlabel('Height')
plt.ylabel('Rate')
plt.show()


The hist function also returns three values.

values = np.random.randint(140, 200, 100)
nums,bins,patches = plt.hist(values, bins=10, edgecolor='white')
print(nums)
print(bins)
print(patches)
for i in patches:
    print(i)

Output:
[19.  9.  9.  6. 10. 12.  4.  8. 13. 10.]
[140.  145.9 151.8 157.7 163.6 169.5 175.4 181.3 187.2 193.1 199. ]
<BarContainer object of 10 artists>
Rectangle(xy=(140, 0), width=5.9, height=19, angle=0)
Rectangle(xy=(145.9, 0), width=5.9, height=9, angle=0)
Rectangle(xy=(151.8, 0), width=5.9, height=9, angle=0)
Rectangle(xy=(157.7, 0), width=5.9, height=6, angle=0)
Rectangle(xy=(163.6, 0), width=5.9, height=10, angle=0)
Rectangle(xy=(169.5, 0), width=5.9, height=12, angle=0)
Rectangle(xy=(175.4, 0), width=5.9, height=4, angle=0)
Rectangle(xy=(181.3, 0), width=5.9, height=8, angle=0)
Rectangle(xy=(187.2, 0), width=5.9, height=13, angle=0)
Rectangle(xy=(193.1, 0), width=5.9, height=10, angle=0)

Adding a line overlay

fig, ax = plt.subplots()
values = np.random.randint(140, 200, 100)
nums, bins, patches = plt.hist(values, bins=10, edgecolor='white')
plt.title('Height/Rate data statistics')
plt.xlabel('Height')
plt.ylabel('Rate')
# move the line's markers to the center of each bin
array = (bins[:10] + bins[1:]) / 2
ax.plot(array, nums, marker='o')
plt.show()

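
The bin-center calculation can be checked without plotting. The sketch below uses np.histogram, which returns counts and bin edges much like the first two return values of plt.hist; the sample values are made up for illustration.

```python
import numpy as np

values = np.array([140, 150, 150, 160, 170, 170, 170, 180, 190, 199])
counts, edges = np.histogram(values, bins=3)

# n bins have n + 1 edges; each center is the midpoint of two neighboring edges
centers = (edges[:-1] + edges[1:]) / 2

print(counts)    # [3 4 3]
print(len(edges), len(centers))   # 4 3
```

Writing edges[:-1] instead of a fixed bins[:10] keeps the expression valid for any number of bins.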

Histogram with unequal bin widths
This can be done by passing a custom bins array

x = np.random.normal(0, 1, 100)
bins = [-10, -1, 0, 1, 10]
plt.hist(x,bins,edgecolor='white')
plt.title('Unequal grouping')
plt.show()


Histogram with several data sets

# build a list of three 1-D arrays of different lengths, i.e. three data sets
x_multi = [np.random.randn(i) for i in [10000,4000,7000]]
plt.hist(x_multi, bins=10, edgecolor='white',label=['A','B','C'])
plt.title('multi-type')
plt.legend()
plt.show()

# build a list of three 1-D arrays of different lengths, i.e. three data sets
x_multi = [np.random.randn(i) for i in [10000,4000,7000]]
# plt.hist(x_multi, bins=10, edgecolor='white',label=['A','B','C'])
# with stacked=True (default False) the data sets are stacked on top of each other
plt.hist(x_multi, bins=10, edgecolor='white',label=['A','B','C'],stacked=True) 
plt.title('multi-type')
plt.legend()
plt.show()


Pie charts

x = np.random.randint(100, size=10)
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
explode = [0.5, 0.02, 0.03, 0.04, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]
plt.pie(x, labels=labels,autopct='%.2f%%',explode=explode)
plt.show()


x = np.random.randint(100, size=10)
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
explode = [0.5, 0.02, 0.03, 0.04, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]
plt.pie(x, labels=labels,autopct='%.2f%%',explode=explode,labeldistance=1.1,pctdistance=1.3)
plt.show()


Scatter plots

x = np.random.randint(100, size=100)
y = np.random.randint(100, size=100)
colors = np.random.rand(100)
s = np.random.randint(1000, size=100)
plt.scatter(x, y, c=colors, alpha=0.5, s=s)
plt.show()


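Box plots

plt.boxplot()

This plot draws a box from the lower quartile to the upper quartile and runs a line through it, extending up to the maximum at the upper whisker and down to the minimum at the lower whisker. (By default matplotlib limits each whisker to 1.5 times the interquartile range from the box and draws points beyond that as outliers.)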

x = np.array([1, 20, 99, 100, 30])
box = {'linestyle': '--', 'linewidth': 1, 'color': 'b'}
mean = {'marker': 'o', 'markerfacecolor': 'pink', 'markersize': 2}
plt.boxplot(x, meanline=True, showmeans=True, boxprops=box, meanprops=mean)
plt.grid()
plt.show()


Word clouds

A word cloud visualizes how frequently keywords occur

WordCloud parameters

from wordcloud import WordCloud
import jieba

with open('D:/code/研一/test/雨霖铃·寒蝉凄切.txt',encoding='utf8') as file:
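    # read the data file
    data = file.read()
    data = data.replace('\r', '')
    data = data.replace('\n', '')
    data = data.replace('\t', '')
    # if the data file contains Chinese text, font_path must point to a suitable font; in testing some fonts still failed to render, so STZHONGS is used here
    # collocations defaults to True; it controls whether two-word collocations are included as single entries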
    wordcloud = WordCloud(font_path='D:/code/研一/test/STZHONGS.TTF'
                          , collocations=False
                          , background_color='black'
                          , width=1000
                          , height=800
                          , max_words=10).generate(data)

    img = wordcloud.to_image()
    img.show()

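The words appear in whole groups. To get the same per-word effect as with English text, the text has to be segmented with jieba.

Full mode lists every character combination in a sentence that could be a word; accurate mode lists only the words that are likely to be needed. Accurate mode is the default.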

with open('D:/code/研一/test/论十大关系.txt',encoding='utf8') as file:
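    # read the data file
    data = file.read()
    # segment the text with jieba
    data = jieba.cut(data)
    # join the segmented words into one space-separated string
    data = ' '.join(data)
    # if the data file contains Chinese text, font_path must point to a suitable font; in testing some fonts still failed to render, so STZHONGS is used here
    # collocations defaults to True; it controls whether two-word collocations are included as single entries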
    wordcloud = WordCloud(font_path='D:/code/研一/test/STZHONGS.TTF'
                          , collocations=False
                          , background_color='black'
                          , width=1000
                          , height=800).generate(data)

    img = wordcloud.to_image()
    img.show()

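The result contains many words we do not need, such as prepositions and modal particles. The following approach filters further.

Note that in addition to import jieba you also need import jieba.analyse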

import jieba.analyse
with open('D:/code/研一/test/论十大关系.txt',encoding='utf8') as file:
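    # read the data file
    data = file.read()
    # extract keywords with jieba, keeping only verbs ('v') and nouns ('n')
    data = jieba.analyse.extract_tags(data,allowPOS=('v','n'))
    # join the extracted words into one space-separated string
    data = ' '.join(data)
    # if the data file contains Chinese text, font_path must point to a suitable font; in testing some fonts still failed to render, so STZHONGS is used here
    # collocations defaults to True; it controls whether two-word collocations are included as single entries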
    wordcloud = WordCloud(font_path='D:/code/研一/test/STZHONGS.TTF'
                          , collocations=False
                          , background_color='black'
                          , width=1000
                          , height=800).generate(data)

    img = wordcloud.to_image()
    img.show()


Saving a figure

import os

x = np.random.randint(100, size=100)
y = np.random.randint(100, size=100)
colors = np.random.rand(100)
s = np.random.randint(1000, size=100)
plt.scatter(x, y, c=colors, alpha=0.5, s=s)
# if the target directory does not exist yet, create it with the os module
os.makedirs('D:/code/研一/test', exist_ok=True)
plt.savefig('D:/code/研一/test/scatterDiagram.jpg')
plt.show()

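
The same save step can be sketched in a way that runs on any machine. The example assumes the non-interactive Agg backend and writes to a temporary folder; the folder and file names are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")             # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

out_dir = os.path.join(tempfile.mkdtemp(), "figures")   # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)                     # no error if it already exists

fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(np.arange(10), np.arange(10) ** 2)
out_path = os.path.join(out_dir, "scatter.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)

print(os.path.exists(out_path))   # True
```

Calling savefig before show() is the safer order, because some interactive backends leave an empty figure after the window is closed.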

pandas

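Series

When you access an index label that does not exist, an error is raised, but assigning to a new label simply adds that label and its value.

When all index labels are strings, the data can be accessed either through the declared labels or through the default integer positions (as with a RangeIndex).

When the index contains numeric labels, the default integer positions can no longer be used to access the data.

If a position and a label coincide, the label is used.

This feels much like ndarray, so I won't take detailed notes here and will only record the differences.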
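
The ambiguity between labels and positions can be avoided by using .loc (labels) and .iloc (positions) explicitly; recent pandas versions also deprecate the positional fallback of plain [] on a Series. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s.loc["b"])     # 20, by label
print(s.iloc[1])      # 20, by position
s.loc["d"] = 40       # assigning to a new label adds it

try:
    s.loc["z"]        # reading a missing label raises KeyError
except KeyError as err:
    print("missing label:", err)

numeric = pd.Series(["x", "y", "z"], index=[2, 1, 0])
print(numeric.loc[0])    # z (label 0)
print(numeric.iloc[0])   # x (position 0)
```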

import numpy as np
import pandas as pd

# a Series can be sliced by label; unlike positional slicing, a label slice is closed on both ends rather than half-open
series_list = pd.Series(np.array(np.arange(0, 3)),index=['a', 'b', 'c'])
print(series_list)
print(series_list['b':'c'])

Output:
a    0
b    1
c    2
dtype: int32
b    1
c    2
dtype: int32

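head(): shows the first n rows, 5 by default; n can also be specified
tail(): the same, showing the last 5 rows by default
reindex(): sets a new index; wherever the new index has a label the old one lacks, the value becomes NaN
Adding two Series directly aligns them on the index and adds the values with matching labels; labels that exist in only one Series get NaN
drop(): removes the given label and its value and returns the modified Series, leaving the original unchanged. It has an inplace parameter that defaults to False; with inplace=True the original Series is modified as well
To add an element, simply assign: series[?] = ?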
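
A short sketch of reindex, label-aligned addition, and drop; the two Series are invented for illustration.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

reindexed = left.reindex(["a", "c", "x"])   # 'x' is new, so its value is NaN
total = left + right                        # aligned on labels; 'a' and 'd' become NaN
dropped = left.drop("a")                    # returns a new Series

print(reindexed)
print(total)
print(dropped.index.tolist(), left.index.tolist())   # ['b', 'c'] ['a', 'b', 'c']
```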

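DataFrame

Both a two-dimensional array (list of lists) and a dictionary can be converted to a DataFrame, and the two can be nested when creating one.
A dictionary supplies columns; a list supplies rows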

Data1 = [['a', 1], ['b', 2], ['c', 3]]
Data2 = [{'d': 4}, {'e': 5}, {'d': 4, 'e': 5, 'f': 6}]
Data3 = {'name': ['甲', '乙', '丙'], 'age': [20, 30, 40]}
DF1 = pd.DataFrame(Data1)
DF2 = pd.DataFrame(Data2)
DF3 = pd.DataFrame(Data3)
print(DF1)
print(DF2)
print(DF3)

Output:
   0  1
0  a  1
1  b  2
2  c  3

     d    e    f
0  4.0  NaN  NaN
1  NaN  5.0  NaN
2  4.0  5.0  6.0

  name  age
0    甲   20
1    乙   30
2    丙   40

A DataFrame can also be created from Series objects, which makes it possible to set the type of each column.

Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40],dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
print(DF4)

Output:
  name   age  sex  salary
0    魑  20.0    男  4000.0
1    魅  30.0    女  6000.0
2    魍  40.0    男  8000.0
3    魉   NaN  NaN     NaN

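Note that the salary column was converted to float even though its type was set to int. The missing fourth value is filled with NaN, and NaN is a float.

A single column can be accessed with DF4['column name'], and several columns with DF4[['name', 'age', 'salary']]. Column labels cannot be sliced this way.

To add a column, simply assign to it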

Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000,10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
DF4['sum'] = DF4['age'] + DF4['salary']
print(DF4)

  name   age  sex  salary     sum
0    魑  20.0    男    4000  4020.0
1    魅  30.0    女    6000  6030.0
2    魍  40.0    男    8000  8040.0
3    魉   NaN  NaN   10000     NaN


Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
DF4.insert(4, 'sum', DF4['age'] + DF4['salary'])
print(DF4)

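The output is the same as above

Columns can also be deleted with del or pop(). The difference is that pop returns the removed column, while del simply deletes it.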

Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
DF4.insert(4, 'sum', DF4['age'] + DF4['salary'])
del DF4['sum']
DF4.pop('age')
print(DF4)

  name  sex  salary
0    魑    男    4000
1    魅    女    6000
2    魍    男    8000
3    魉  NaN   10000

loc: selecting a row

Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
print(DF4)
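# print one row
print(DF4.loc[0])
# print a single cell selected by row label and column label; slices also work here (a label slice includes its end label)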
print(DF4.loc[1,'name'])
print(DF4.loc[0:3,'name'])

  name   age  sex  salary
0    魑  20.0    男    4000
1    魅  30.0    女    6000
2    魍  40.0    男    8000
3    魉   NaN  NaN   10000

name         魑
age       20.0
sex          男
salary    4000
Name: 0, dtype: object

魅

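0    魑
1    魅
2    魍
3    魉
Name: name, dtype: object

pandas also has iloc, which finds rows by integer position, whereas loc finds them by label. When looking up a single cell with iloc, both arguments must therefore be integers, as in DF.iloc[1,1].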
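
The difference between the two is easiest to see when the row labels are not integers. A minimal sketch with an invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "salary": [4000, 6000, 8000, 10000]},
    index=["r1", "r2", "r3", "r4"],
)

by_label = df.loc["r1":"r3", "name"]   # label slice: both ends included
by_position = df.iloc[0:3, 0]          # position slice: end excluded

print(by_label.tolist())      # ['A', 'B', 'C']
print(by_position.tolist())   # ['A', 'B', 'C']
print(df.iloc[1, 1])          # 6000
```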


Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
print(DF4)
print('================================' * 10)
data = {'name': '妖', 'age': 10, 'sex': '女', 'salary': 10000}
# appending a dict without specifying an index raises an error; passing ignore_index=True assigns a new index automatically
DF4 = DF4.append(data, ignore_index=True)
# an index label can also be supplied explicitly
DF4 = DF4.append(pd.Series(data, name='rank5'))
print(DF4)

  name   age  sex  salary
0    魑  20.0    男    4000
1    魅  30.0    女    6000
2    魍  40.0    男    8000
3    魉   NaN  NaN   10000
============================================================================================
      name   age  sex  salary
0        魑  20.0    男    4000
1        魅  30.0    女    6000
2        魍  40.0    男    8000
3        魉   NaN  NaN   10000
4        妖  10.0    女   10000
rank5    妖  10.0    女   10000


Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
print(DF4)
print('================================' * 10)
data1 = ['魔', 0, '男', 0]
data2 = [['妖', 0, '男', 0], ['鬼', 0, '男', 0]]
data3 = [[[1, 2, 3, 4]]]
DF5 = DF4.append(data1)
print(DF5)
print('================================' * 10)
DF6 = DF4.append(pd.DataFrame(data2,columns=['name', 'age', 'sex', 'salary']), ignore_index=True)
print(DF6)
print('================================' * 10)
DF7 = DF4.append(data3)
print(DF7)


  name   age  sex  salary
0    魑  20.0    男    4000
1    魅  30.0    女    6000
2    魍  40.0    男    8000
3    魉   NaN  NaN   10000
============================================================================================
  name   age  sex   salary    0
0    魑  20.0    男   4000.0  NaN
1    魅  30.0    女   6000.0  NaN
2    魍  40.0    男   8000.0  NaN
3    魉   NaN  NaN  10000.0  NaN
0  NaN   NaN  NaN      NaN    魔
1  NaN   NaN  NaN      NaN    0
2  NaN   NaN  NaN      NaN    男
3  NaN   NaN  NaN      NaN    0
============================================================================================
  name   age  sex  salary
0    魑  20.0    男    4000
1    魅  30.0    女    6000
2    魍  40.0    男    8000
3    魉   NaN  NaN   10000
4    妖   0.0    男       0
5    鬼   0.0    男       0
============================================================================================
  name   age  sex   salary             0
0    魑  20.0    男   4000.0           NaN
1    魅  30.0    女   6000.0           NaN
2    魍  40.0    男   8000.0           NaN
3    魉   NaN  NaN  10000.0           NaN
0  NaN   NaN  NaN      NaN  [1, 2, 3, 4]
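The extra column 0 in the first and third outputs is consistent with the list being converted to a DataFrame before it is combined with DF4. A flat list becomes a single column named 0, so none of its values line up with name, age, sex or salary. The following sketch only inspects that conversion; the variable names are illustrative.

```python
import pandas as pd

columns = ['name', 'age', 'sex', 'salary']

# A flat list is read as four rows of one unnamed column (labelled 0)
flat = pd.DataFrame(['魔', 0, '男', 0])

# A nested list with explicit column names is read as one row of four columns
row = pd.DataFrame([['魔', 0, '男', 0]], columns=columns)

print(flat.shape, list(flat.columns))
print(row.shape, list(row.columns))
```

This is why only the second call, which passes a DataFrame with matching column names, produces rows that align with the existing data.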

Common DataFrame attributes and methods

Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
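print(DF4.T)  # transpose: rows become columns
print('================================' * 10)
print(DF4.axes)  # list holding the row index and the column index
print('================================' * 10)
print(DF4.empty)  # True only if the frame has no elements
print('================================' * 10)
print(DF4.columns)  # column labels
print(DF4.columns.size)  # number of columns
print('================================' * 10)
print(DF4.shape)  # (number of rows, number of columns)
print('================================' * 10)
print(DF4.values)  # the data as a NumPy array
print('================================' * 10)
print(DF4.head(1))  # first row
print(DF4.tail(1))  # last row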

================================================================================================================================================================================================================================================================================================================================
           0     1     2      3
name       魑     魅     魍      魉
age     20.0  30.0  40.0    NaN
sex        男     女     男    NaN
salary  4000  6000  8000  10000
================================================================================================================================================================================================================================================================================================================================
[RangeIndex(start=0, stop=4, step=1), Index(['name', 'age', 'sex', 'salary'], dtype='object')]
================================================================================================================================================================================================================================================================================================================================
False
================================================================================================================================================================================================================================================================================================================================
Index(['name', 'age', 'sex', 'salary'], dtype='object')
4
================================================================================================================================================================================================================================================================================================================================
(4, 4)
================================================================================================================================================================================================================================================================================================================================
[['魑' 20.0 '男' 4000]
 ['魅' 30.0 '女' 6000]
 ['魍' 40.0 '男' 8000]
 ['魉' nan nan 10000]]
================================================================================================================================================================================================================================================================================================================================
  name   age sex  salary
0    魑  20.0   男    4000
  name  age  sex  salary
3    魉  NaN  NaN   10000


Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 30, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
DF5 = DF4.rename(index={0: 'rank1', 1:'rank2', 2:'rank3', 3:'rank4'})
print(DF5)
print('================================' * 10)
DF6 = DF4.sort_index(axis=1)
print(DF6)

============================================================================================
      name   age  sex  salary
rank1    魑  20.0    男    4000
rank2    魅  30.0    女    6000
rank3    魍  40.0    男    8000
rank4    魉   NaN  NaN   10000
============================================================================================
    age name  salary  sex
0  20.0    魑    4000    男
1  30.0    魅    6000    女
2  40.0    魍    8000    男
3   NaN    魉   10000  NaN
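Both calls above return a new frame and leave DF4 untouched. sort_index(axis=1) orders the column labels, not the values. The following is a small sketch on a reduced, made-up frame.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['魑', '魅'],
    'age': [20.0, 30.0],
    'salary': [4000, 6000],
})

# rename maps old row labels to new ones and returns a copy
renamed = df.rename(index={0: 'rank1', 1: 'rank2'})

# axis=1 sorts the column labels alphabetically; the row order is unchanged
by_column = df.sort_index(axis=1)

print(list(renamed.index))
print(list(by_column.columns))
print(list(df.index), list(df.columns))
```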


Data4 = {
    'name': pd.Series(['魑', '魅', '魍', '魉']),
    'age': pd.Series([20, 20, 40], dtype=float),
    'sex': pd.Series(['男', '女', '男']),
    'salary': pd.Series([4000, 6000, 8000, 10000], dtype=int)
}
DF4 = pd.DataFrame(Data4)
DF5 = DF4.sort_values(axis=0, by=['age', 'salary'], ascending=[False, True])
print(DF5)
print('============================================================')
DF6 = DF4.sort_values(axis=1, by=[0, 1], ascending=[False, True])
print(DF6)

============================================================================================
  name   age  sex  salary
2    魍  40.0    男    8000
0    魑  20.0    男    4000
1    魅  20.0    女    6000
3    魉   NaN  NaN   10000
============================================================
  name  sex  salary   age
0    魑    男    4000  20.0
1    魅    女    6000  20.0
2    魍    男    8000  40.0
3    魉  NaN   10000   NaN
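In the first output above, the rows are ordered by age descending, ties on age are broken by salary ascending, and the row with a missing age comes last. The following sketch reproduces that row order on a reduced frame; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['魑', '魅', '魍', '魉'],
    'age': [20.0, 20.0, 40.0, float('nan')],
    'salary': [4000, 6000, 8000, 10000],
})

# One ascending flag per key: age descending, then salary ascending
ordered = df.sort_values(by=['age', 'salary'], ascending=[False, True])

print(ordered)
print(list(ordered.index))
```

Missing values go last by default; the na_position parameter of sort_values can move them to the front.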

Time modules

time

import time

print('============================================================')
# Get the current local time as a struct_time
tmp_time = time.localtime()
print(tmp_time)
# Format the struct_time as a human-readable string
print(time.strftime('%Y-%m-%d %H:%M:%S',tmp_time))
# Without a struct_time argument, strftime formats the current local time
print(time.strftime('%Y-%m-%d %H:%M:%S'))
# The inverse: parse a formatted string back into a struct_time
print(time.strptime('2022-07-14 02:20:20','%Y-%m-%d %H:%M:%S'))
# Get the current Unix timestamp
now = time.time()
print(now)
# Convert the struct_time from localtime() back into a timestamp (whole seconds only)
before = time.mktime(tmp_time)
print(before)
# struct_time has one-second resolution, so most of this difference is the truncated fraction of a second
print('Elapsed time:', now - before)

============================================================
time.struct_time(tm_year=2022, tm_mon=7, tm_mday=14, tm_hour=2, tm_min=27, tm_sec=57, tm_wday=3, tm_yday=195, tm_isdst=0)
2022-07-14 02:27:57
2022-07-14 02:27:57
time.struct_time(tm_year=2022, tm_mon=7, tm_mday=14, tm_hour=2, tm_min=20, tm_sec=20, tm_wday=3, tm_yday=195, tm_isdst=-1)
1657736877.1353724
1657736877.0
Elapsed time: 0.13537240028381348
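The last figure printed above is mostly the fraction of a second that mktime discards, not the run time of the script. For timing a piece of code, time.perf_counter is usually a better fit because it is a high-resolution clock that never goes backwards. The following is a minimal sketch, assuming a hypothetical workload function `busy_work`.

```python
import time


def busy_work(n):
    # Hypothetical workload used only for illustration
    return sum(i * i for i in range(n))


start = time.perf_counter()
result = busy_work(100_000)
elapsed = time.perf_counter() - start

print(f'Elapsed time: {elapsed:.6f} s')
```

Only the difference between two perf_counter readings is meaningful; the individual values have no fixed reference point.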

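The fields of the struct_time shown above (tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, tm_isdst) can be read directly as attributes.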
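As a short illustration of reading those fields, the sketch below parses the same date string used earlier and pulls out individual attributes. The variable name `parsed` is illustrative.

```python
import time

parsed = time.strptime('2022-07-14 02:20:20', '%Y-%m-%d %H:%M:%S')

# Each field of the struct_time is available as an attribute
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
print(parsed.tm_hour, parsed.tm_min, parsed.tm_sec)
print(parsed.tm_wday)  # weekday, with Monday = 0
print(parsed.tm_yday)  # day of the year, starting at 1
```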

datetime
datetime.datetime
datetime.timedelta

import datetime

now = datetime.datetime.now()
print(now)
yesterday = now - datetime.timedelta(days=1)
print(yesterday)
tomorrow = now + datetime.timedelta(days=1)
print(tomorrow)

2022-07-16 02:08:18.565168
2022-07-15 02:08:18.565168
2022-07-17 02:08:18.565168
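Subtracting one datetime from another goes the other way and yields a timedelta. The following sketch uses fixed, made-up dates so the result is reproducible.

```python
import datetime

start = datetime.datetime(2022, 7, 16, 2, 8, 18)
end = start + datetime.timedelta(days=1, hours=2)

# datetime - datetime gives a timedelta
delta = end - start

print(delta)
print(delta.days, delta.seconds)
print(delta.total_seconds())
```

A timedelta stores days and the remaining seconds separately, so total_seconds() is the convenient way to get the whole span as one number.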


pd.Timestamp
pd.Timedelta

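import pandas as pd

now = pd.Timestamp.now()
print(now)
print(now - pd.Timedelta(1, 'D')) # one day
print(now - pd.Timedelta(1, 'W')) # one week
print(now - pd.Timedelta(1, 'min')) # one minute ('T' is a deprecated alias in recent pandas)
print(now - pd.Timedelta(1, 's')) # one second ('S' is a deprecated alias in recent pandas)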

2022-07-18 01:58:27.455146
2022-07-17 01:58:27.455146
2022-07-11 01:58:27.455146
2022-07-18 01:57:27.455146
2022-07-18 01:58:26.455146
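pd.Timedelta also accepts keyword arguments, which avoids having to remember the unit abbreviations. The following sketch uses a fixed timestamp, not now(), so the results are reproducible. The variable names are illustrative.

```python
import pandas as pd

moment = pd.Timestamp('2022-07-18 01:58:27')

# Keyword arguments spell out the unit explicitly
day_before = moment - pd.Timedelta(days=1)
week_before = moment - pd.Timedelta(weeks=1)
minute_before = moment - pd.Timedelta(minutes=1)
second_before = moment - pd.Timedelta(seconds=1)

print(day_before)
print(week_before)
print(minute_before)
print(second_before)
```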