Data Science Notes

Sofice小司

Tags: machine learning, python, numpy

Article link: https://blog.csdn.net/samsara_of_ice/article/details/106359720

@Sofice

Data Science

NumPy

The foundational package for numerical computing in Python.

Import: import numpy as np

ndarray: the multidimensional array object

Creation

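Function	Description
array	Convert a list, tuple, array, or other sequence to an ndarray
arange	Array-valued version of the built-in range
ones	All ones, with the given shape and dtype
ones_like	All ones, with the same shape as the given array
zeros, zeros_like	All zeros
empty, empty_like	Uninitialized array (contents are arbitrary, not guaranteed to be zero)
full, full_like	Filled with a specified value
eye, identity	Identity matrix (ones on the main diagonal, zeros elsewhere)
reshape	Change the shape of an array
linspace(0, 1, 5)	5 evenly spaced numbers from 0 to 1, endpoints included
random.random((3, 3))	3x3 array of random numbers drawn uniformly from [0, 1)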

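Attributes

shape: the size of the array along each dimension

dtype: the data type (all elements share the same type)

ndim: the number of dimensions

Arithmetic

Arithmetic operations with a scalar apply the scalar to every element of the array.

Operations between arrays of different shapes rely on broadcasting.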
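A minimal sketch of scalar arithmetic and broadcasting. The array values are made up for illustration; the point is that a shape (3,) array is stretched across rows and a shape (2, 1) array across columns.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])       # shape (3,)
col = np.array([[100], [200]])     # shape (2, 1)

scaled = arr * 2        # the scalar is applied to every element
plus_row = arr + row    # row is broadcast down both rows of arr
plus_col = arr + col    # col is broadcast across all three columns

print(scaled)
print(plus_row)
print(plus_col)
```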

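Indexing

Slicing:

A slice is a view, not a copy (use arr[5:8].copy() to get a copy)

arr[:, i:i+1] selects column i as a 2-D column; arr[:, i] returns the same column as a 1-D array

arr[i] selects row i

Assigning to a slice sets every element in the slice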
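The view-versus-copy behavior above can be checked directly. This is a small sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]              # a view: shares memory with arr
view[:] = 12                 # writes through to arr

snapshot = arr[5:8].copy()   # an independent copy
snapshot[:] = -1             # arr is not affected

grid = np.arange(12).reshape(3, 4)
col_2d = grid[:, 1:2]        # column 1 kept as a (3, 1) array
col_1d = grid[:, 1]          # column 1 flattened to shape (3,)

print(arr)
print(col_2d.shape, col_1d.shape)
```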

Boolean indexing:

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
# names == 'Bob' yields the boolean array array([ True, False, False, True, False, False, False])
data[names == 'Bob']
# NOT
data[~(names == 'Bob')]
# OR
data[(names == 'Bob') | (names == 'Will')]

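Fancy indexing: always copies the data into a new array

arr = np.arange(32).reshape((8, 4))
# Select rows 4, 3, 0, 6 in that order
arr[[4, 3, 0, 6]]
# Paired indices: selects the elements (1, 0), (5, 3), (7, 1), (2, 2)
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
# Select rows 1, 5, 7, 2, then reorder the columns within each selected row
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]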

Universal functions (ufuncs)

Fast element-wise array functions

# Square root
np.sqrt(arr)
# Return the fractional and integer parts
remainder, whole_part = np.modf(arr)

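Unary functions

Function	Description
abs, fabs	Absolute value (fabs does not handle complex input)
sqrt, square	Square root, square
exp	e raised to the power x
log, log10, log2	Logarithm base e, 10, 2
sign	Sign function (1, 0, or -1)
ceil, floor	Round up, round down
rint	Round to the nearest integer, preserving the dtype
modf	Return the fractional and integer parts
isnan, isinf	Whether each element is NaN, whether it is infinite
sin, cos, arcsin, arccos	Trigonometric and inverse trigonometric functions
logical_not	Element-wise logical NOT (same as ~arr for a boolean array)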

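Binary functions

Function	Description
add, subtract, multiply, divide (floor_divide)	Add, subtract, multiply, divide (floor_divide discards the remainder)
power	Raise to a power
maximum, fmax, minimum, fmin	Element-wise maximum and minimum; fmax and fmin ignore NaN
mod	Modulus (remainder)
copysign	Copy the sign of the second argument onto the first
greater, greater_equal, less, less_equal, equal, not_equal	>, >=, <, <=, ==, !=
logical_and, logical_or, logical_xor	&, |, ^
concatenate([x, y])	Join arrays
x1, x2, x3 = np.split(x, [3, 5])	Split at the given indices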

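Array-oriented operations

Conditional logic

# Vectorized form of: x if condition else y
result = np.where(cond, x, y)
# Take from the first array where True, from the second where False
np.where([True, False], [1, 2], [3, 4])
# Set positive values to 2 and all other values to -2
np.where(arr > 0, 2, -2)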

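Mathematical and statistical methods

# Two equivalent ways to compute
arr.mean()
np.mean(arr)
# Down the rows: one result per column
arr.mean(axis=0)
# Across the columns: one result per row
arr.mean(axis=1)

Method	Description
sum	Sum
mean	Mean
std, var	Standard deviation, variance
min, max	Minimum, maximum
argmin, argmax	Index of the minimum, index of the maximum
cumsum, cumprod	Cumulative sum, cumulative product
sort, argsort	Sort; indices that would sort the array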

Boolean arrays

bools = np.array([False, False, True, False])
# Is any value True?
bools.any()
# Are all values True?
bools.all()
# Element-wise AND of two conditions, then count the True values
np.sum((inches > 0.5) & (inches < 1))
# Masking
x[x < 5]

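Set operations

Function	Description
unique(x)	Sorted unique values
intersect1d(x, y)	Sorted intersection
union1d(x, y)	Sorted union
in1d(x, y)	Boolean array: whether each element of x is in y (newer NumPy prefers np.isin)
setdiff1d(x, y)	Set difference: elements in x but not in y
setxor1d(x, y)	Symmetric difference: elements in the union but not in the intersection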
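A short sketch of the set routines in the table, on two made-up arrays. It uses np.isin for the membership test, since np.in1d is deprecated in recent NumPy releases.

```python
import numpy as np

x = np.array([3, 1, 2, 2, 5])
y = np.array([2, 5, 7])

uniq = np.unique(x)            # sorted, duplicates removed
common = np.intersect1d(x, y)  # values present in both
merged = np.union1d(x, y)      # values present in either
member = np.isin(x, y)         # one boolean per element of x
only_x = np.setdiff1d(x, y)    # in x but not in y
exclusive = np.setxor1d(x, y)  # in exactly one of the two

print(uniq, common, merged, member, only_x, exclusive)
```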

Saving and loading

# Save (the .npy extension is appended automatically)
np.save('arrays', arr)
# Load
np.load('arrays.npy')

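Linear algebra

Transpose: arr.T

Functions in numpy and numpy.linalg

Function	Description
diag	Return the diagonal elements (np.diag)
dot	Matrix multiplication (np.dot)
trace	Sum of the diagonal elements (np.trace)
det	Determinant
eig	Eigenvalues and eigenvectors
inv	Inverse matrix
solve	Solve Ax = b
lstsq	Least-squares solution of Ax = b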
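A minimal sketch of solving Ax = b with numpy.linalg. The 2x2 system is made up so that the solution is easy to verify by hand (3*2 + 1*3 = 9 and 1*2 + 2*3 = 8).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)         # preferred over forming the inverse
det = np.linalg.det(A)            # 3*2 - 1*1
x_via_inv = np.linalg.inv(A) @ b  # same answer, but usually slower and less accurate

print(x, det)
```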

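Random numbers

numpy.random

Function	Description
seed, RandomState	Set the random seed (once, at the start); RandomState creates an independent generator with its own seed
permutation	Return a random permutation of a sequence
shuffle	Shuffle a sequence in place
rand	Uniform on [0, 1); the arguments are the dimensions
randint	Random integers drawn uniformly from a given range
randn	Normal distribution with mean 0 and variance 1
binomial	Binomial distribution
normal	Normal (Gaussian) distribution
beta	Beta distribution
chisquare	Chi-square distribution
gamma	Gamma distribution
uniform	Uniform distribution, [0, 1) by default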
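A small sketch of why seeding matters: two RandomState generators created with the same seed produce the same stream. The seed value 42 is arbitrary.

```python
import numpy as np

rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)

a = rng_a.rand(3)   # three draws from uniform [0, 1)
b = rng_b.rand(3)   # identical, because the seed is identical

perm = rng_a.permutation(5)          # a shuffled copy of 0..4
ints = rng_a.randint(0, 10, size=4)  # integers in [0, 10)

print(a, b, perm, ints)
```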

Structured arrays

# Create
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
# Fill one field
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
# Get a single record
data[0]

Pandas

import pandas as pd

from pandas import Series, DataFrame

Series

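A specialized dictionary with automatic data alignment; supports slicing

Attributes:

values: the values

index: the index

name, index.name: the names

# Create a Series from a list or a dict
obj = pd.Series([4, 7, -5, 3])
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj3 = pd.Series(sdata, index=states) # sdata is a dict; without index, the dict keys become the index (older pandas versions sorted them)
# Index by label
obj2[['c', 'a', 'd']]
# Arithmetic
obj2 * 2
np.exp(obj2)
# Check whether a label is in the index
'b' in obj2
# Check for missing values
obj4.isnull()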
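The data-alignment property mentioned above can be seen when adding two Series with partly different labels. This is a sketch with made-up values: labels present in only one operand produce NaN.

```python
import pandas as pd

s1 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000})
s2 = pd.Series({'Texas': 1000, 'Oregon': 2000, 'Utah': 5000})

# Values are matched by label, not by position
total = s1 + s2

print(total)
```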

DataFrame

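A two-dimensional table with labeled rows and columns

Attributes:

index, columns: row and column labels

values: returns a two-dimensional ndarray

index.name, columns.name: the names

# Build from a dict of equal-length lists or NumPy arrays; column and index order can be specified
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
# Build from a dict of dicts: outer keys become columns, inner keys become the row index
data3 = {'Nevada': {2001: 2.4, 2002: 2.9},
         'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(data3)
# Access
frame2['state']      # by column, returns a Series
frame2.values[1]     # second row as an ndarray
frame2.loc['three']  # by row label, returns a Series
# Assign
frame2['debt'] = 16.5
frame2['debt'] = np.arange(6.)
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val  # aligned on the index; rows without a match become NaN
# Delete a column
frame2['eastern'] = frame2['state'] == 'Ohio'
del frame2['eastern']
# Transpose
frame3.T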

Index

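Index objects

Immutable
May contain duplicate labels

Method	Description
append	Concatenate additional Index objects, producing a new Index
difference	Set difference
intersection	Intersection (also written & in older pandas)
union	Union (also written | in older pandas)
isin	Boolean array indicating whether each value is in the passed collection
delete	Delete the element at a given position
drop	Delete the given values
insert	Insert at a given position
is_monotonic	Whether the values are non-decreasing (is_monotonic_increasing in pandas 2.x)
is_unique	Whether all values are unique
unique	Return the unique values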
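A brief sketch of the Index methods listed above, on two made-up indexes. It uses the method forms of the set operations and shows that item assignment on an Index raises a TypeError.

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

both = a.intersection(b)     # labels in both
either = a.union(b)          # labels in either
only_a = a.difference(b)     # labels in a but not in b
member = a.isin(['a', 'c'])  # one boolean per label of a

# Index objects are immutable: item assignment is rejected
try:
    a[0] = 'z'
    immutable = False
except TypeError:
    immutable = True

print(list(both), list(either), list(only_a), member, immutable)
```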

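Basic functionality

Indexers

These avoid the ambiguity between label-based and position-based lookups on an integer index

loc: explicit, label-based
iloc: implicit, position-based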
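The ambiguity shows up when the index itself holds integers. In this sketch (made-up data), the label 1 and the position 1 refer to different elements, and label slices include the endpoint while position slices do not.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]            # the element whose label is 1
by_position = s.iloc[1]        # the element at position 1

label_slice = s.loc[1:3]       # labels 1 through 3, endpoint included
position_slice = s.iloc[1:3]   # positions 1 and 2, endpoint excluded

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```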

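Reindexing

obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
# loc can also select by label, but in current pandas every requested label must already exist
frame.loc[['a', 'b', 'c', 'd'], states]

reindex parameters

Parameter	Description
index	New sequence to use as the index
method	Fill method: ffill fills forward, bfill fills backward
fill_value	Value to substitute for missing data introduced by reindexing
limit	When filling, the maximum gap size (in number of elements) to fill
tolerance	When filling, the maximum gap size (in absolute numeric distance) for inexact matches
level	Match a simple Index against a level of a MultiIndex
copy	If True, always copy the data, even when the new index equals the old one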
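A minimal sketch of the method and fill_value parameters, using made-up Series. Forward fill carries the last known value into the new gaps; fill_value substitutes a constant instead.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
filled = colors.reindex(range(6), method='ffill')  # gaps take the previous value

nums = pd.Series([1.0, 2.0], index=['a', 'c'])
with_nan = nums.reindex(['a', 'b', 'c'])                   # 'b' becomes NaN
with_zero = nums.reindex(['a', 'b', 'c'], fill_value=0.0)  # 'b' becomes 0.0

print(filled.tolist())
print(with_zero.tolist())
```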

Dropping entries from an axis

# Series
obj.drop(['d', 'c'])
# DataFrame: drop a column
data.drop('two', axis=1)
# inplace=True modifies the original object instead of returning a new one
obj.drop('c', inplace=True)

Slicing

# Series
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
# a 0.0
# b 1.0
# c 2.0
# d 3.0
obj['b'] # same element as obj.iloc[1]
obj[2:4] # positions 2 and 3; same as obj['c':'d'], because label slices include the endpoint
obj[obj < 2]
# DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# Select rows first, then columns
# loc: axis labels
data.loc['Colorado', ['two', 'three']]
data.loc[:'Utah', 'two']
# iloc: integer positions
data.iloc[2]
data.iloc[[1, 2], [3, 0, 1]]

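Missing values

isnull(): returns a boolean mask marking the missing values.

notnull(): the opposite of isnull().

dropna(): returns the data with missing values removed; axis selects rows or columns.

fillna(): returns a copy of the data with missing values filled in.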
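The four methods above, applied to a small made-up DataFrame. None of them modifies the original object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.nan],
                   'c': [7.0, 8.0, 9.0]})

mask = df.isnull()             # True where a value is missing
rows_kept = df.dropna()        # drop every row containing a NaN
cols_kept = df.dropna(axis=1)  # drop every column containing a NaN
filled = df.fillna(0)          # replace each NaN with 0

print(rows_kept)
print(cols_kept)
print(filled)
```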

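Combining

pd.concat([ser1, ser2])
pd.concat([df3, df4], axis='columns')

Parameter	Description
verify_integrity=True	Raise an error if the result would have duplicate index labels
ignore_index=True	Discard the original labels and create a new integer index
keys=['x', 'y']	Build a hierarchical (multi-level) index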
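A short sketch of the three concat parameters on two made-up Series that share index labels, which is the case where the parameters make a difference.

```python
import pandas as pd

ser1 = pd.Series(['A', 'B'], index=[1, 2])
ser2 = pd.Series(['C', 'D'], index=[1, 2])

stacked = pd.concat([ser1, ser2])                        # duplicate labels are kept
renumbered = pd.concat([ser1, ser2], ignore_index=True)  # fresh 0..n-1 index
keyed = pd.concat([ser1, ser2], keys=['x', 'y'])         # outer level 'x' / 'y'

# verify_integrity turns duplicate labels into an error
try:
    pd.concat([ser1, ser2], verify_integrity=True)
    duplicate_error = False
except ValueError:
    duplicate_error = True

print(stacked.index.tolist(), renumbered.index.tolist(), duplicate_error)
```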

Matplotlib

Machine Learning

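Understanding data with the help of mathematical models

Supervised learning: modeling the relationship between measured features of the data and the labels (classes) associated with the data
- Classification
- Regression
Unsupervised learning: modeling the features of a dataset without reference to any label, often described as "letting the dataset speak for itself"
- Clustering
- Dimensionality reduction
Semi-supervised learning: used when only incomplete labels are available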

Scikit-Learn

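Features matrix: conventionally stored in a variable named X; a two-dimensional matrix with shape [n_samples, n_features]

Samples (the rows) refer to the individual objects described by the dataset

Features (the columns) refer to the distinct quantitative observations recorded for every sample

Target array: conventionally named y; usually one-dimensional, with length n_samples

The usual steps for using the Scikit-Learn estimator API are as follows (the examples that follow all use them).

Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
Choose the model hyperparameters by instantiating this class with the desired values.
Arrange the data into a features matrix and a target array, as described above.
Fit the model to the data by calling the fit() method of the model instance.
Apply the model to new data:

For supervised learning, labels for new data are usually predicted with the predict() method;
For unsupervised learning, properties of the data are usually transformed or inferred with the transform() or predict() method.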
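The steps above apply to unsupervised estimators as well. The following is a sketch using PCA for dimensionality reduction; the random input data and the choice of two components are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA  # choose the model class

rng = np.random.RandomState(0)
X = rng.rand(100, 3)           # features matrix: 100 samples, 3 features

model = PCA(n_components=2)    # instantiate with a hyperparameter
model.fit(X)                   # fit; no target array is needed
X_2d = model.transform(X)      # apply: project onto 2 components

print(X_2d.shape)
```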

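# Simple linear regression (supervised regression)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate random data
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
# Instantiate the model with its hyperparameters
model = LinearRegression(fit_intercept=True)
# Reshape x into an [n_samples, n_features] matrix
X = x[:, np.newaxis]
# Fit
model.fit(X, y)
# Print the fitted parameters
print(model.coef_)
print(model.intercept_)
# Predict
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

plt.scatter(x, y)
plt.plot(xfit, yfit)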

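# Gaussian naive Bayes classification (supervised classification)
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

iris = sns.load_dataset('iris')

sns.pairplot(iris, hue='species', height=1.5)  # height replaces the older size argument

X_iris = iris.drop('species', axis=1)
y_iris = iris['species']
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)

model = GaussianNB() # instantiate the model
model.fit(Xtrain, ytrain) # fit the model to the data
y_model = model.predict(Xtest) # predict on new data

print(accuracy_score(ytest, y_model))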

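Model validation

After choosing a model and its hyperparameters, fit it to the training data and compare its predictions on data with known labels against the actual values

Holdout sets

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split the data in half: one half for training, one for testing
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)
# Fit the model on the training half
model.fit(X1, y1)
# Evaluate the accuracy on the held-out half
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)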

Cross-validation

from sklearn.model_selection import cross_val_score 
cross_val_score(model, X, y, cv=5)
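A self-contained sketch of five-fold cross-validation. The iris data bundled with Scikit-Learn and the GaussianNB model are assumptions chosen to match the earlier example; any estimator and dataset could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Each of the 5 folds is held out once while the other 4 are used for fitting
scores = cross_val_score(GaussianNB(), X, y, cv=5)
mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```

Averaging the fold scores gives a more stable estimate than a single holdout split, at the cost of fitting the model once per fold.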

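Selecting the best model

Underfitting: the model has too little flexibility and high bias; its performance on the validation set is similar to its performance on the training set
Overfitting: the model has too much flexibility and high variance; its performance on the validation set is far worse than its performance on the training set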
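One way to observe this trade-off is to fit polynomial models of increasing degree and compare training and validation scores. The sketch below uses synthetic sine data and the degrees 1, 3, and 9, all of which are assumptions for illustration. The training score can only improve as the degree grows, while the validation score typically peaks and then falls.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.rand(40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.randn(40)
X = x[:, np.newaxis]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, random_state=0, train_size=0.5)

scores = {}
for degree in (1, 3, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # R^2 on the data the model saw, and on the data it did not see
    scores[degree] = (model.score(X_train, y_train),
                      model.score(X_val, y_val))

for degree, (train_r2, val_r2) in scores.items():
    print(degree, round(train_r2, 3), round(val_r2, 3))
```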

[Figure: schematic of a validation curve]

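Feature engineering

Take whatever information is available about the problem and turn it into numbers that can be used to build the features matrix

Categorical features

Non-numerical categorical data: use one-hot encoding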

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
from sklearn.feature_extraction import DictVectorizer
# sparse=True returns a sparse matrix instead
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
# array([[ 0, 1, 0, 850000, 4],
#        [ 1, 0, 0, 700000, 3],
#        [ 0, 0, 1, 650000, 3],
#        [ 1, 0, 0, 600000, 2]], dtype=int64)
# List the feature names (get_feature_names() in scikit-learn versions before 1.0)
vec.get_feature_names_out()
# ['neighborhood=Fremont',
# 'neighborhood=Queen Anne',
# 'neighborhood=Wallingford',
# 'price',
# 'rooms']

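Naive Bayes classification

The goal is to find the probability that a sample with certain features belongs to a given label, usually written P(L | features)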

$$P(L \mid \text{features}) = \frac{P(\text{features} \mid L)\,P(L)}{P(\text{features})}$$

To decide between two labels, L1 and L2, one approach is to compute the ratio of their posterior probabilities:

$$\frac{P(L_1 \mid \text{features})}{P(L_2 \mid \text{features})} = \frac{P(\text{features} \mid L_1)\,P(L_1)}{P(\text{features} \mid L_2)\,P(L_2)}$$

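from sklearn.naive_bayes import GaussianNB
# X (shape [n_samples, 2]) and y are assumed to be an existing labeled dataset with two features
model = GaussianNB()
model.fit(X, y)

# Generate 2000 new points covering [-6, 8) x [-14, 4); the seed is arbitrary
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)
# Probability of each sample belonging to each label
yprob = model.predict_proba(Xnew)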

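Multinomial naive Bayes

Assumes the features are generated from a simple multinomial distribution. The multinomial distribution describes the probability of observing counts among a number of categories, so multinomial naive Bayes is well suited to features that represent counts or count rates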
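A tiny sketch of multinomial naive Bayes on count features. The 4x3 count matrix is made up: each row can be read as word counts for one document, with class 0 dominated by the first word and class 1 by the second.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are samples, columns are counts of three hypothetical words
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])

model = MultinomialNB()
model.fit(X, y)

X_new = np.array([[2, 0, 1],
                  [0, 2, 1]])
labels = model.predict(X_new)
probs = model.predict_proba(X_new)   # one row of class probabilities per sample

print(labels)
print(probs)
```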

# Classifying newsgroup text
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

categories = ['talk.religion.misc', 'soc.religion.christian', 'sci.space', 
 'comp.graphics'] 
train = fetch_20newsgroups(subset='train', categories=categories) 
test = fetch_20newsgroups(subset='test', categories=categories)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target) 
labels = model.predict(test.data)
    
mat = confusion_matrix(test.target, labels) 
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, 
 xticklabels=train.target_names, yticklabels=train.target_names) 
plt.xlabel('true label') 
plt.ylabel('predicted label')

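Advantages:

Extremely fast for both training and prediction
Provides straightforward probabilistic predictions
Often very easy to interpret
Very few (if any) tunable parameters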

Linear regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);
# Slope: model.coef_[0]; intercept: model.intercept_

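Basis function regression

Transform the original data with basis functions so that a linear regression model can fit nonlinear relationships between variables; the model itself stays linear in its coefficients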

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline 
# 7th-degree polynomial regression model
poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())

rng = np.random.RandomState(1) 
x = 10 * rng.rand(50) 
y = np.sin(x) + 0.1 * rng.randn(50) 
poly_model.fit(x[:, np.newaxis], y) 
yfit = poly_model.predict(xfit[:, np.newaxis]) 
plt.scatter(x, y) 
plt.plot(xfit, yfit);
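To see what the polynomial basis actually does, it helps to look at the transformed features. In this sketch (made-up inputs, degree 3 chosen for readability), each value x becomes the row [1, x, x^2, x^3], and an ordinary linear regression on those columns recovers a noise-free cubic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Each input value x is expanded to [1, x, x**2, x**3]
expanded = PolynomialFeatures(3).fit_transform(np.array([[2.0], [3.0]]))
print(expanded)

# A linear model on the expanded features fits a cubic exactly
x = np.linspace(-2, 2, 20)
y = 1 + 2 * x - x ** 3
cubic_model = make_pipeline(PolynomialFeatures(3), LinearRegression())
cubic_model.fit(x[:, np.newaxis], y)
y_pred = cubic_model.predict(x[:, np.newaxis])
```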

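Support vector machines

Rather than drawing a zero-width line between the classes, draw a margin of some width around the line, extending to the nearest points. A support vector machine is a maximum-margin estimator.

from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)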
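A self-contained sketch of the same model on six made-up, linearly separable points. After fitting, the points that define the margin are available in support_vectors_; the very large C approximates a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters (illustrative data)
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 4.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

pred = model.predict(np.array([[0.5, 0.5], [3.5, 3.5]]))
margin_points = model.support_vectors_   # training points lying on the margin

print(pred)
print(margin_points)
```

Only the support vectors influence the fitted boundary; moving the other points without crossing the margin leaves the model unchanged.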
