Three Foundational Packages for Data Analysis

归去来兮★

Article link: https://blog.csdn.net/zhouhe_/article/details/114890999

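Introduction

This post collects notes on the three foundational Python packages for data analysis: NumPy, pandas, and matplotlib.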

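NumPy

Introduction
NumPy is used for fast processing of arrays with any number of dimensions.

NumPy provides one core data structure: the N-dimensional array type, ndarray.
Import convention: import numpy as np
Creating arrays and basic array methods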
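- Creating an ordinary array
  - arr = np.array(array_like, dtype), where array_like is a (possibly nested) sequence or an existing N-dimensional array
- Creating arrays of zeros and ones
  - arr = np.zeros(shape, dtype)
  - arr = np.ones(shape, dtype)
    
    shape = (3, 4): three rows and four columns; shape takes a tuple of dimension sizes (or a single int for a one-dimensional array)
    
    dtype = np.int32: the element type; optional, and float64 by default for zeros and ones
- Creating from an existing array
  - new = np.array(old): copies the data by default (deep copy)
  - new = np.copy(old): deep copy
  - new = np.asarray(old): makes no copy when old is already an ndarray of the requested dtype, so the result shares data with old
  - arr.flatten(): returns a one-dimensional copy (arr.ravel() returns a view where possible)
  - arr.view(): shallow copy, a new array object that shares the same data
  - arr.copy(): deep copy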
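- Creating ordered arrays over a fixed range
  - np.linspace(start, stop, num, dtype=...): num evenly spaced values, with stop included by default
  - np.arange(start, stop, step): values spaced by step, with stop excluded
    
    Both return one-dimensional arrays
- Creating random arrays
  - np.random.rand(d0, d1, ...): uniform samples from [0, 1); the dimensions are passed as separate arguments, not as a tuple
  - np.random.uniform(low, high, size): uniform samples from [low, high)
- Creating normally distributed random arrays (used much like the uniform functions above)
  - np.random.randn(d0, d1, ...): samples from the standard normal distribution (mean 0, standard deviation 1)
  - np.random.normal(loc, scale, size): samples from a normal distribution with mean loc and standard deviation scale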
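- ndarray attributes
  
  ndarray.shape: the size of the array along each dimension, as a tuple; it can also be assigned to reshape the array in place, e.g. ndarray.shape = (3, 3)
  
  ndarray.ndim: the number of dimensions
  
  ndarray.size: the total number of elements
  
  ndarray.itemsize: the size of one element in bytes
  
  ndarray.dtype: the type of the elements
- Indexing and slicing an ndarray
  
  Slice a two-dimensional array with ndarray[rows, columns], e.g. ndarray[:, :]
  
  Within each dimension the rules match list indexing and slicing, so they are not repeated here; note that a basic slice of an ndarray is a view of the data, not a copy as with lists
- Changing the shape of an array
  - arr.reshape(shape): returns a reshaped array and leaves arr unchanged
  - arr.resize(shape): changes arr in place and returns None
  - arr.T: the transpose
- Changing the data type of an array
  - arr.astype(dtype): returns a copy converted to dtype
  - arr.tobytes(): returns the raw data as a Python bytes object
- Removing duplicates
  - np.unique(arr): returns the sorted unique values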
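The difference between copies and views is easiest to see by modifying the original array and checking which derived arrays change. The following is a minimal sketch; the variable names are illustrative and not from the original notes.

```python
import numpy as np

base = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]

copied = np.array(base)      # independent copy of the data
aliased = np.asarray(base)   # no copy: base is already an ndarray
viewed = base.view()         # new array object, shared data
flat_copy = base.flatten()   # always a copy
flat_view = base.ravel()     # a view here, because base is contiguous

base[0, 0] = 99              # modify the original

print(copied[0, 0])      # 0: the copy is unaffected
print(aliased[0, 0])     # 99
print(viewed[0, 0])      # 99
print(flat_copy[0])      # 0
print(flat_view[0])      # 99
print(np.shares_memory(base, viewed))   # True
```

np.shares_memory is a convenient way to check whether two arrays overlap in memory when it is unclear whether an operation returned a view.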
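ndarray operations
- Logical operations
  - arr > 10 returns a boolean array
  - arr[bool_array] = value assigns value wherever bool_array is True
    
    Boolean indexing: ndarray[ndarray > 5]
- General test functions
  - np.all(bool_array): returns False if any element is False
  - np.any(bool_array): returns True if any element is True
- Statistics
  - np.max(arr[, axis=])
  - np.min(arr[, axis=])
  - np.mean(arr[, axis=])
  - np.median(arr[, axis=])
  - np.var(arr[, axis=])
  - np.std(arr[, axis=])
  - np.argmax(arr[, axis=]): index of the maximum
  - np.argmin(arr[, axis=]): index of the minimum
  - Indices of the elements that satisfy a condition: np.argwhere(ndarray > 0)
    
    axis selects the dimension that is reduced: for a two-dimensional array, axis=0 aggregates down the rows and gives one result per column, and axis=1 aggregates across the columns and gives one result per row
- Arithmetic on arrays
  - Between an array and a scalar (a single number): applied to every element
  - Between arrays: applied element-wise; the shapes must match or be broadcastable
- Joining and splitting
  - Joining
    - np.hstack((a, b, ...))
    - np.vstack((a, b, ...))
    - np.concatenate((a, b, ...), axis=)
  - Splitting
    - np.split(ndarray, indices_or_sections, axis=0)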
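As a small worked example of boolean masks, the axis argument, and joining and splitting, consider the sketch below. The array scores and the other names are made up for illustration.

```python
import numpy as np

scores = np.array([[1, 8, 3],
                   [7, 2, 9]])

mask = scores > 5                  # boolean array with the same shape
high = scores[mask]                # selected values as a 1-D array: 8, 7, 9
capped = scores.copy()
capped[capped > 5] = 5             # assignment through a boolean mask

col_max = np.max(scores, axis=0)   # one value per column: 7, 8, 9
row_max = np.max(scores, axis=1)   # one value per row: 8, 9
row_argmax = np.argmax(scores, axis=1)   # column index of each row maximum: 1, 2

stacked = np.vstack((scores, scores))    # shape (4, 3)
side = np.hstack((scores, scores))       # shape (2, 6)
top, bottom = np.split(stacked, 2, axis=0)   # two arrays of shape (2, 3)
```

Reading axis as "the dimension that disappears" avoids the common confusion between rows and columns: reducing a (2, 3) array along axis=0 leaves 3 values.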
I/O

np.genfromtxt(fname, delimiter=...): loads a delimited text file into an array; delimiter is the separator string, for example ","
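A minimal sketch of np.genfromtxt follows. To stay self-contained it reads from an in-memory io.StringIO object instead of a file on disk; the sample text is invented for illustration.

```python
import io
import numpy as np

# Two rows of comma-separated values; the middle field of row 2 is empty
text = io.StringIO("1,2,3\n4,,6\n")
table = np.genfromtxt(text, delimiter=",")

print(table.shape)   # (2, 3)
print(table[1, 1])   # nan: the empty field is read as a missing value
```

With the default float dtype, missing fields come back as nan, which fits the way nan is used for missing data elsewhere in these notes.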
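Universal functions

A universal function (ufunc) is a function that operates on all elements of an array at once.

Common universal functions:

Unary functions take one array.

Binary functions take two arrays.

The result is also an array.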
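- Unary functions
  - abs(): absolute value
  - sqrt(): square root
  - square(): square
  - exp(): e raised to the power of each element
  - log(): natural logarithm
  - sign(): sign of each element (-1, 0, or 1)
  - ceil(): ceiling of each element
  - floor(): floor of each element
  - rint(): each element rounded to the nearest integer (halves round to the nearest even integer)
  - modf(): returns the fractional and integer parts of each element as two arrays
  - isnan(): boolean array marking the elements that are NaN
  - isinf(): boolean array marking the elements that are infinite
  - sin(), cos(), tan(): trigonometric functions
- Binary functions
  - add(): addition
  - subtract(): subtraction
  - multiply(): multiplication
  - divide(): true division; floor_divide() performs floor division
  - power(): raises each element of the first array to the corresponding element of the second
  - maximum(): element-wise maximum
  - minimum(): element-wise minimum
  - mod(): modulo
    - Note: special float values
      
      nan: not equal to any float, including itself
      
      inf: greater than any finite float
      
      Create the special values with np.nan and np.inf
      
      In data analysis, nan is commonly used to represent missing data
- Tip: swapping elements inside an ndarray by assignment
  
  a[[0,1]] = a[[1,0]] swaps the first two rows of a (the first two elements if a is one-dimensional)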
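The sketch below exercises a few of the universal functions listed above, the special float values, and the swap idiom. The sample arrays are illustrative only.

```python
import numpy as np

values = np.array([-1.5, 0.5, 2.5])

rounded = np.rint(values)          # -2, 0, 2: halves go to the nearest even integer
frac, whole = np.modf(values)      # fractional parts first, then integer parts
clipped = np.maximum(values, 0)    # element-wise maximum against a scalar: 0, 0.5, 2.5
squared = np.power(values, 2)      # 2.25, 0.25, 6.25

print(np.nan == np.nan)            # False: nan never compares equal, even to itself
data = np.array([1.0, np.nan, np.inf])
nan_mask = np.isnan(data)          # False, True, False
inf_mask = np.isinf(data)          # False, False, True

rows = np.array([[1, 2], [3, 4], [5, 6]])
rows[[0, 1]] = rows[[1, 0]]        # swap the first two rows
print(rows[0], rows[1])            # [3 4] [1 2]
```

The swap works because the fancy index on the right-hand side produces a copy before the assignment happens, so no temporary variable is needed.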

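pandas

Introduction
- Positioning: a toolkit for data analysis
- Features:
  - The DataFrame and Series data structures
  - Built-in time series functionality
  - A rich set of mathematical operations
  - Flexible handling of missing data
- Import
  
  import pandas as pd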
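Series

A Series is a one-dimensional, array-like object made up of a sequence of values and an associated sequence of data labels (the index).
- Creation
  - Default integer index (from a list): pd.Series(list)
  - Custom index: pd.Series(list, index=["a", "b", "c", "d"])
  - Index taken from the keys (from a dict): pd.Series(dict)
  - Default value plus custom index: pd.Series(0, index=["a", "b", "c"])
- Handling missing data
  - True where a value is missing: Series.isnull()
  - True where a value is not missing: Series.notnull()
  - Filter out the NaN values: Series[Series.notnull()]
- Array-like features of a Series:
  - Create from an ndarray: Series(ndarray)
  - With a scalar: Series * 2
  - Two Series: Series + Series (values are aligned by index label)
  - Boolean filtering: sr[sr > 0]
  - Statistical functions: mean(), sum(), cumsum()
- Dict-like features:
  - Create from a dict: Series(dict)
  - Membership test: in (tests the index labels)
  - Indexing by key: Series["a"]
  - Slicing by key: Series["a":"c"] (both endpoints are included)
  - Other functions: Series.get("a", default=0)
- Integer indexes
  - When the index labels are themselves integers, slicing with integers works by position, but indexing a single element with an integer is interpreted as a label, which is easy to get wrong
  - Solution: state the intent explicitly
    
    Method 1: interpret the integer as a position, Series.iloc[1]
    
    Method 2: interpret the integer as a label, Series.loc[0]
- Arithmetic methods: add, sub, div, mul
  
  s5 = s2.add(s3, fill_value=0)  # a label missing from one Series is treated as 0 before adding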
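Index alignment and the position-versus-label distinction can be illustrated as follows. This is a sketch with made-up data; s2 and s3 reuse the names from the note above, and the other names are hypothetical.

```python
import pandas as pd

s2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s3 = pd.Series([10, 20], index=["b", "d"])

plain = s2 + s3                     # labels found in only one Series give NaN
filled = s2.add(s3, fill_value=0)   # the missing side counts as 0 instead

# Integer labels: say explicitly whether an integer is a position or a label
sr = pd.Series([10, 20, 30], index=[1, 2, 0])
by_position = sr.iloc[1]   # second element: 20
by_label = sr.loc[0]       # element whose label is 0: 30
```

Here plain holds a value only for "b" (12.0), while filled holds 1.0, 12.0, 3.0, and 20.0 for "a" through "d".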
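DataFrame
- Creation
  - From a dict of equal-length lists
    
    pd.DataFrame({
        "one": [1, 2, 3, 4],
        "two": [4, 3, 2, 1]
    })
    
    or from a nested list with explicit row and column labels
    
    pd.DataFrame([
        [1, 2, 3],
        [2, 3, 4],
        [5, 6, 7],
    ],
    index=list("abc"),
    columns=list("ABC")
    )
  - From a dict of NumPy arrays
    
    pd.DataFrame({
        "one": np.array([1, 2, 3, 4]),
        "two": np.array([2, 3, 4, 5])
    })
    
    or from a two-dimensional array
    
    pd.DataFrame(np.random.randn(3, 3), index=list("abc"), columns=list("ABC"))
  - From a dict of Series (rows are aligned by index label; a label missing from one Series becomes NaN in that column)
    
    pd.DataFrame({
        "one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
        "two": pd.Series([1, 2, 3], index=["a", "b", "c"])
    })
    
    or from a list of Series, where each Series becomes one row
    
    pd.DataFrame(
        [pd.Series([1, 2, 3, 4], index=list("abcd")),
         pd.Series([1, 2, 3, 3], index=list("abcd"))],
        index=list("de"),
        columns=list("abcd")
    )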
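When the Series in a dict have different indexes, the constructor takes the union of the labels and fills the gaps with NaN. A brief sketch using the same data as the dict-of-Series example:

```python
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
    "two": pd.Series([1, 2, 3], index=["a", "b", "c"]),
})

print(df.shape)                   # (4, 2)
print(df["two"].isnull().sum())   # 1: row "d" has no value in column "two"
print(df["two"].dtype)            # float64, because NaN forces a float column
```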
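Common attributes and methods
- index: the row index
- columns: the column index
- values: the data as a NumPy array
- head(n): the first n rows
- tail(n): the last n rows
- T: the transpose
- describe(): quick summary statistics
- Indexing and slicing
  - Two ways to select a single value
  Method 1: df["A"][0] (chained indexing: the column first, then the row)
  
  Method 2: df.loc[0, "A"] (by row label and column label)
  
  df.iloc[0, 0] (the positional counterpart of Method 2: iloc accepts integer positions only, so a column name such as "A" is not allowed)
- Row and column indexers can be plain labels or positions, slices, boolean arrays, or fancy (list) indexes, in any combination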
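The following sketch contrasts label-based and position-based selection on a small frame with string row labels. The frame mirrors the nested-list example from the creation section; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [5, 6, 7]],
                  index=list("abc"), columns=list("ABC"))

by_label = df.loc["a", "A"]                  # row label "a", column label "A": 1
by_position = df.iloc[0, 0]                  # first row, first column: 1
label_slice = df.loc["a":"b", ["A", "C"]]    # label slices include both endpoints
position_slice = df.iloc[0:2, [0, 2]]        # position slices exclude the stop
filtered = df[df["A"] > 1]                   # boolean indexing on rows: "b" and "c"
```

Both slices select the same 2 x 2 block, which shows how a slice and a list index can be mixed in one call.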
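Handling time objects
- Flexible parsing of a single time string: the dateutil package
  
  dateutil.parser.parse("2019 jan 2nd")
- Handling time objects in bulk: pandas
  
  ind = pd.to_datetime(["2018-01-01", "2019-3-3"])
  
  The result (a DatetimeIndex) can be used as the index of a Series
  
  It can be converted to an ndarray of datetime objects: ind.to_pydatetime()
- Generating an array of time objects: date_range
  
  pd.date_range("2019-1-1", "2019-2-2")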
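A short sketch of how a DatetimeIndex behaves as a Series index. The dates are written in one consistent format here to avoid relying on format inference, and the values are invented.

```python
import pandas as pd

ind = pd.to_datetime(["2018-01-01", "2019-03-03"])
sr = pd.Series([1.0, 2.0], index=ind)

days = pd.date_range("2019-01-01", "2019-02-02")   # daily by default, both ends included
print(len(days))                                   # 33

by_year = sr.loc["2019"]        # partial-string selection: every row in 2019
as_objects = ind.to_pydatetime()   # ndarray of datetime.datetime objects
```

Selecting by a partial date string such as "2019" is one of the main conveniences of indexing a Series by time.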
Grouping and aggregating data
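The original notes leave this heading without a body. As a minimal sketch of what grouping and aggregation usually look like in pandas, with an invented sales table and hypothetical column names:

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["A", "B", "A", "B"],
    "amount": [10, 20, 30, 40],
})

# Split the rows by city, then reduce each group to a single number
totals = sales.groupby("city")["amount"].sum()                  # A: 40, B: 60

# Several aggregations at once give one column per function
summary = sales.groupby("city")["amount"].agg(["mean", "max"])
```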

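matplotlib

Introduction: a powerful Python package for plotting and data visualization
Import: import matplotlib.pyplot as plt
matplotlib in detail
- Four-layer structure
  - canvas (drawing board): the bottom layer, set up automatically when the module is imported
  - figure (the sheet you draw on): built on top of the canvas; parameters are set from this layer up
  - axes (subplots): divide the figure into separate regions for multi-panel plots
  - chart elements: add or modify the graphical information on the axes to improve how the chart is displayed
- Steps for drawing a plot
  - Create the figure:
    
    pic = plt.figure(figsize=(10,10),dpi = 80)
    
    Creates a 10 x 10 inch figure at 80 dots per inch (800 x 800 pixels)
  - Create a subplot
    
    ax1 = pic.add_subplot(2,2,1)
    
    Divides the figure into a 2 x 2 grid and selects the first cell
  - Add content to the figure (immediately after selecting the subplot)
    
    Title, axis labels, axis ranges, ticks, legend
  - Save and show the figure
    
    plt.savefig("image name + file extension")
    
    plt.show()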
Plot styles

List the available styles with plt.style.available

Apply one with plt.style.use("classic")

Basic usage

plt.plot(list(ndarray))

plt.show()

Overview of the plot() function:

Line style: linestyle

Marker style: marker

Color: color
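A small sketch of the three plot() parameters. It assumes a headless setup, so it selects the non-interactive Agg backend and writes the image into memory; in an interactive session the backend line can be dropped and plt.show() used instead.

```python
import io

import matplotlib
matplotlib.use("Agg")   # assumption: no display is available
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 20)
fig, ax = plt.subplots()

# Dashed red line with a circle marker at every data point
(line,) = ax.plot(x, x**2, linestyle="--", marker="o", color="red", label="y = x^2")
ax.legend()

buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # save to memory instead of a file
plt.close(fig)
```

Common linestyle values include "-", "--", "-.", and ":"; markers such as "o", "s", and "^" draw circles, squares, and triangles.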

Using plt:

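import matplotlib.pyplot as plt
import numpy as np

# Create the figure: 10 x 10 inches at 80 dpi
fig = plt.figure(figsize=(10,10),dpi = 80)
x = np.linspace(0,1,1000)
# Select the top cell of a 2 x 1 grid; the plt calls below draw on it
ax1 = fig.add_subplot(2,1,1)
plt.title("y = x^2 or y = x")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.xlim(0,1)
plt.ylim(0,1)
# Tick positions; with the limits above only 0 and 1 are visible
plt.xticks([ i for i in range(0,3)])
plt.yticks([ i for i in range(0,3)])
plt.plot(x,x**2)
plt.plot(x,x)
plt.legend(["y = x^2","y = x"])
# Save before calling show()
plt.savefig("plotting_workflow.png")
plt.show()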

# Create a 2 x 2 grid of subplots and title each one
fig,axes = plt.subplots(nrows= 2,ncols = 2)
axes[0,0].set(title = "upper left")
axes[0,1].set(title = "upper right")
axes[1,0].set(title = "lower left")
axes[1,1].set(title = "lower right")

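Line plot

# Written as expressions of x
x = np.arange(-100,100)
y1 = x  # write the expression in x directly
y2 = x**2
plt.plot(x,y1,label="y = x")
plt.plot(x,y2,label="y = x^2")
plt.legend()  # display the labels passed to plot()
plt.show()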

Bar chart

data = [20,30,10,88]
lis = ["a","b","c","d"]
plt.xticks([1,2,3,4],lis)
plt.bar([1,2,3,4],data)
plt.show()

Horizontal bar chart

plt.yticks([1,2,3,4],lis)
plt.barh([1,2,3,4],data)
plt.show()

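Charts from a DataFrame

import pandas as pd

df = pd.DataFrame({
    "one":pd.Series([20,50,60],index=list("abc")),
    "two":pd.Series([10,40,90],index=list("abc")),
    "three":pd.Series([40,10,0],index=list("abc"))
})
df.plot.bar()                 # grouped vertical bars, one group per row label
df.plot.barh(stacked = True)  # stacked horizontal bars, drawn in a second figure
plt.show()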

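Pie chart

plt.pie(
    [20,50,20,10],
    labels = list("abcd"),
    autopct = "%.2f%%",      # print each share as a percentage with two decimals
    explode = [0.05,0,0,0]   # pull the first wedge slightly away from the center
)
plt.axis("equal")  # equal aspect ratio so the pie is drawn as a circle
plt.show()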

Scatter plot

x = np.random.randn(1000)
y = np.random.randn(1000)
plt.scatter(x,y)
plt.show()

Histogram

data = np.random.randn(100)
plt.hist(data)
plt.show()

Step plot

# Step plot
x = [1,2,3,4,5]
y = [2,4,6,8,10]
plt.step(x,y)
plt.show()

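Working with data

Importing and saving data
- Common
  - pd.read_csv() & df.to_csv()
  - pd.read_table(); there is no df.to_table(), so write tab-separated output with df.to_csv(sep="\t")
    
    These functions read delimited text files in a specific format, not spreadsheet files
    
    Common parameters: the file path, sep, names, na_values
    
    Limiting the rows shown: set pd.options.display.max_rows = int before displaying; to read only the first n rows, pass nrows=n to the reader
- Less common
  - pd.read_pickle() & df.to_pickle()
  - pd.read_hdf() & df.to_hdf()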
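The common read_csv parameters can be tried without touching the file system by reading from an in-memory buffer. A sketch with invented data; the column names are hypothetical.

```python
import io
import pandas as pd

# A headerless file in which -999 marks a missing measurement
raw = io.StringIO("1,-999,x\n2,5,y\n")
df = pd.read_csv(raw, sep=",", names=["id", "value", "tag"], na_values=["-999"])

print(df["value"].isnull().tolist())   # [True, False]

# Write the frame back out as tab-separated text, without the row index
out = io.StringIO()
df.to_csv(out, sep="\t", index=False)
text = out.getvalue()
```

names supplies the column labels when the file has no header row, and na_values lists the extra strings that should be read as NaN.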
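Data processing
- Missing values
  
  isnull(), notnull(), dropna()
  - Available on a Series
  - On a DataFrame
    
    dropna(): drops every row that contains a missing value
    
    dropna(how="all"): drops rows in which all values are missing
    
    dropna(axis=1, how="all"): drops columns in which all values are missing
    
    fillna(0): replaces missing values with 0
    
    fillna(method="ffill"): replaces each missing value with the last valid value before it (forward fill); recent pandas versions deprecate the method argument in favor of df.ffill()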
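The effect of each option is easiest to compare on one small frame. A sketch with invented data; it uses df.ffill() for forward filling instead of the method argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

complete_rows = df.dropna()                         # empty: every row has a gap
not_all_missing = df.dropna(how="all")              # drops only the middle row
without_empty_cols = df.dropna(axis=1, how="all")   # drops column "c"
zero_filled = df.fillna(0)
forward_filled = df.ffill()                         # carries the last valid value down
```

Forward filling cannot fill a gap at the top of a column, so forward_filled still has NaN in the first two rows of "b" and in all of "c".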
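Data transformation

df.duplicated(): returns a boolean Series indicating whether each row duplicates an earlier row

df.drop_duplicates(): drops the duplicate rows

Replacement

df.replace([-999,…],np.nan): sets several values to nan

df.replace({-999:np.nan,-1000:0}): replaces each key with the value it maps to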
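A brief sketch that combines duplicate removal with dict-based replacement. The frame and its sentinel values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2], "v": [-999, -999, -1000]})

dup_mask = df.duplicated()       # False, True, False: the second row repeats the first
deduped = df.drop_duplicates()   # keeps the first row of each duplicate group

# Map each sentinel to its replacement in a single call
cleaned = deduped.replace({-999: np.nan, -1000: 0})
```

After the replacement, column "v" holds NaN for the first row and 0 for the last, while column "k" is unchanged.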
