Python Notes

angnuan123

Last modified 2024-02-04 00:46:37

First published 2021-08-11 12:53:21

Article link: https://blog.csdn.net/angnuan123/article/details/119603127

Showing multiple outputs from a single Jupyter Notebook cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pandas

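Excel sheets

import pandas as pd

xl = pd.ExcelFile(excel_path)  # excel_path: path to your Excel file
sheet_names = xl.sheet_names  # names of all sheets
df = xl.parse(sheet_name)  # read the sheet called sheet_name into a DataFrame

Inspecting a DataFrame

# Summary statistics: count, mean, std, min, 25%, 50%, 75%, max
df.describe()
# Row index
df.index
# Column index
df.columns
# Shape as (rows, columns)
df.shape
# Underlying values as a NumPy array
df.values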

sort

Sorting by index

df.sort_index(
    ascending=True,  # True (default) sorts in ascending order
    inplace=True,    # if True, modify df in place instead of returning a new object
    axis=0,          # 0 (default) sorts by row index; 1 sorts by column index
    key=None         # None (default); otherwise a function applied to the index values before sorting
)
df.sort_index(key=lambda x: x.str.lower())

Sorting by column values

# Single column
df.sort_values(
    by='age',
    ascending=True,
    inplace=True
)
# Several columns
df.sort_values(
    by=['age', 'group'],
    ascending=[True, False],
    inplace=True
)

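Changing column dtypes

df['colname'] = df['colname'].astype('int')
df['col2'] = df['col2'].astype('float64')
print(df.dtypes)

Deleting

# Delete a column (here: the column labelled 0):
del df[0]

drop

# Drop columns:
df.drop(colname, axis=1)
df.drop(colnameList, axis=1)
df.drop(df.columns[[1, 2]], axis=1, inplace=True)  # selecting several columns by position requires a list

# Drop rows (axis defaults to 0, i.e. rows):
df.drop(0)          # raises KeyError if 0 is not in the index
df.drop([1, 2, 3])  # raises KeyError if any of these labels is not in the index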

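Handling missing values (NaN)

Locating missing values

# Boolean mask with the same shape as df; True marks a missing value
df.isnull().values
# Rows containing missing values; a row with n missing columns is returned n times
df[df.isnull().values]
# Rows where colA is missing
df[df[colA].isnull().values]
# Which columns contain missing values (True = at least one missing value)
df.isnull().any()

df.dropna(axis)

df.dropna(axis=0)        # drop every row that contains a NaN
df.dropna(axis=1)        # drop every column that contains a NaN
df.dropna(how='all')     # drop rows whose values are all NaN
df.dropna(subset=['C'])  # drop rows with a NaN in column C

fillna(target_value)

store_items.fillna(0)  # replace every NaN with 0
store_items.ffill(axis=1)  # forward fill along each row with the previous known value (older pandas: fillna(method='ffill', axis=1))
store_items.bfill(axis=0)  # backward fill down each column with the next known value in that column (older pandas: fillna(method='backfill', axis=0))
store_items.interpolate(method='linear', axis=0)  # replace NaN by linear interpolation along the given axis

Filling missing values with sklearn

import numpy as np
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22

imr = SimpleImputer(missing_values=np.nan, strategy='mean')  # column-wise mean; SimpleImputer has no axis argument
imr = imr.fit(df.values)
transformed_data = imr.transform(df.values)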

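Handling duplicates

Finding duplicates

df.duplicated()                  # by default compares all columns
df.duplicated('col1')            # duplicates judged on col1 only
df.duplicated(['col1', 'col2'])  # duplicates judged on col1 and col2 together

# keep='first' (default): the first occurrence is not flagged
# keep='last': the last occurrence is not flagged
# keep=False: every member of a duplicate group is flagged (the boolean False, not the string 'False')
df.duplicated('col1', keep='last')  # in the author's sample data, rows 1, 3 and 4 are flagged

# Flag duplicates by index
df.index.duplicated(keep='last')       # in the author's sample data, rows 1 to 4 are flagged
df[df.index.duplicated()]              # rows whose index value is a repeat
df[~df.index.duplicated(keep='last')]  # rows left after de-duplicating the index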
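
The three keep settings are easiest to compare on a tiny frame. The following is a minimal sketch on made-up data (the frame and its values are assumptions for illustration, not the author's sample data):

```python
import pandas as pd

# Hypothetical sample data: 'a' appears three times in col1.
df = pd.DataFrame({"col1": ["a", "b", "a", "a"], "col2": [1, 2, 3, 4]})

first = df.duplicated("col1", keep="first").tolist()
last = df.duplicated("col1", keep="last").tolist()
every = df.duplicated("col1", keep=False).tolist()

print(first)  # [False, False, True, True]
print(last)   # [True, False, True, False]
print(every)  # [True, False, True, True]
```

Rows flagged True are the ones drop_duplicates would remove under the same keep setting.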

Removing duplicates

df.drop_duplicates()
df.drop_duplicates('col1')  # drops the rows flagged by df.duplicated('col1')
df.drop_duplicates('col1', keep='last', inplace=True)  # inplace=True modifies the original DataFrame
df = df.loc[~df.index.duplicated(keep='last'), :]  # drop rows with a repeated index

Writing multiple sheets

Method 1

writer = pd.ExcelWriter(r'd:test.xlsx')
df1.to_excel(writer, sheet_name="df1")
df2.to_excel(writer, sheet_name="df2")
writer.close()  # writer.save() was removed in pandas 2.0

Method 2

with pd.ExcelWriter(r"d:test.xlsx") as xlsx:
    df1.to_excel(xlsx, sheet_name="df1")
    df2.to_excel(xlsx, sheet_name="df2")

Renaming

# Rename columns with a mapping:
df.rename(columns={"A": "a", "B": "c"})
# Rename index labels with a mapping:
df.rename(index={0: "x", 1: "y", 2: "z"})
#    A  B
# x  1  4
# y  2  5
# z  3  6

# Axis-style arguments
df.rename(str.lower, axis='columns')  # or axis='index'
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6

Filtering rows by condition

# Value is in a list
all_data = test[test['item_sku_id'].isin(LISTA)]
# Several conditions
all_data[(all_data['User_id'] == 1439408) & (all_data['Date'].isna())]

Changing some values in a DataFrame column

Fill the color column from color_tmp: red becomes 1, yellow becomes 2, green becomes 3.

df['color'] = 1
df.loc[df['color_tmp'] == 'yellow', 'color'] = 2  # .loc takes a boolean mask plus a column label; .iloc would reject the label
df.loc[df['color_tmp'] == 'green', 'color'] = 3
# Alternatively, with a mapping (values missing from the mapping become NaN):
df['color'] = df['color_tmp'].map({'red': 1, 'yellow': 2, 'green': 3})

The map method

With a function

user_required = all_data['User_id'].map(lambda x: x == 1439408)
date_required = all_data['Date'].map(lambda x: pd.isna(x))  # pd.isna also handles non-numeric values, unlike np.isnan
some = all_data[user_required & date_required]
print(some)

With a dictionary

size_dic = {'M': 1, 'L': 2, 'XL': 3}
df['size'] = df['size'].map(size_dic)  # replace each key found in the size column with its value
size_dic_inv = {v: k for k, v in size_dic.items()}  # inverse dictionary
df['size'] = df['size'].map(size_dic_inv)  # map the size column back

Random sampling with pandas

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

'''
n: number of rows to draw (n=20000 draws 20,000 rows).
frac: fraction of rows to draw, for when the proportion matters rather than the exact count (frac=0.8 draws 80%). n and frac cannot be given together.
replace: whether to sample with replacement; replace=True samples with replacement.
axis: whether to sample rows or columns. axis=0 draws n random rows; axis=1 draws n random columns.
'''
# example
df.sample(n=20000)
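
A short sketch of the parameters described above, on a made-up 100-row frame (the data and the random_state values are assumptions chosen only to make the run repeatable):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})  # hypothetical data

by_count = df.sample(n=10, random_state=0)              # exactly 10 rows
by_frac = df.sample(frac=0.8, random_state=0)           # 80% of the rows
boot = df.sample(n=150, replace=True, random_state=0)   # more rows than df has, so replace=True is needed
one_col = df.sample(n=1, axis=1, random_state=0)        # one randomly chosen column

print(len(by_count), len(by_frac), len(boot), one_col.shape)
```

Without replace=True, asking for more rows than the frame contains raises a ValueError.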

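pandas datetime types

pd.to_datetime(data['ColA'], format="%Y-%m-%d")
Converts a string column to datetime.

data.index = pd.to_datetime(data['ColA'], format="%Y-%m-%d")
data.loc['2016']               # rows from 2016 (row selection with data['2016'] was removed in pandas 2.0)
data.loc['2016-07':'2016-09']  # rows from July through September 2016, inclusive

Combining DataFrames

pandas offers two families of combining operations:
Concatenation: pd.concat (the original also listed append; DataFrame.append was removed in pandas 2.0)
Merging: pd.merge, DataFrame.join

concat

See: https://www.cnblogs.com/bilx/p/11535559.html

pd.concat([df1, df2])          # stacks rows by default
pd.concat([df1, df2], axis=1)  # places the frames side by side (columns)

# Row concatenation, labelling the source table of each block:
result = pd.concat(dfs, keys=['table1', 'table2', 'table3'])

# Column concatenation keeping only the intersection of the row indexes
pd.concat([df1, df2], join='inner', axis=1)

# Use one DataFrame's columns as the columns of the result
# (the join_axes argument was removed in pandas 1.0; reindex does the same job)
pd.concat([df1, df2]).reindex(columns=df2.columns)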

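merge

Parameters:
how: join type, one of inner, left, right, outer; default inner.
on: name(s) of the column(s) to join on; must exist in both DataFrames. If on is omitted and no other key argument is given, the intersection of the two DataFrames' column names is used as the join key.
left_on: join-key column(s) in the left DataFrame; useful when the key columns have different names on the two sides but the same meaning.
right_on: join-key column(s) in the right DataFrame.
left_index: use the left DataFrame's row index as the join key.
right_index: use the right DataFrame's row index as the join key.
sort: default False; if True, the result is sorted by the join keys, which costs some performance.
suffixes: a pair of strings appended to column names that appear in both DataFrames; default ('_x', '_y').
copy: default True, which always copies the data; False can avoid some copies.
indicator: adds a column showing which side each row of the result came from.

# With no key specified, the merge uses all shared columns
# To be safe, name the key explicitly
pd.merge(df3, df4, on='employee')
pd.merge(df3, df4, on='group', suffixes=('_A', '_B'))  # suffixes for the other shared columns
pd.merge(df3, df4, left_on='Team', right_on='group')   # left column Team joined to right column group

df1.merge(df2, how='outer')  # outer join
df1.merge(df2, how='right')  # keep every row of the right table
df1.merge(df2, how='inner')  # intersection

Note: NaN keys come up often when matching and combining data. merge treats NaN keys as equal to each other, so each NaN in the key column of table A is paired with every NaN in the key column of table B, and all of those combinations appear in the result.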
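
The NaN cross-matching is easy to reproduce. The following is a minimal sketch on made-up frames (left, right and their columns are assumptions for illustration); dropping NaN keys first is one possible workaround, not the only one:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: two NaN keys on each side.
left = pd.DataFrame({"key": [1.0, np.nan, np.nan], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [np.nan, np.nan, 2.0], "r": ["x", "y", "z"]})

merged = pd.merge(left, right, on="key")  # inner join
print(len(merged))  # 4: each of the 2 left NaN keys pairs with each of the 2 right NaN keys

# One way to avoid the cross-matching: drop NaN keys before merging.
clean = pd.merge(
    left.dropna(subset=["key"]),
    right.dropna(subset=["key"]),
    on="key",
)
print(len(clean))  # 0: keys 1.0 and 2.0 have no partner on the other side
```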

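One-hot encoding

pd.get_dummies(df[cols])
# By default get_dummies() encodes only string (object) and categorical columns; other dtypes are left unchanged
# The new columns are named <original column>_<value>

cut

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')  # signature as of pandas 0.23

x: array-like; the continuous values to bin. Must be one-dimensional.
bins: the binning criterion; an int, a sequence of scalars, or a pandas.IntervalIndex.

int:
Splits the range of x into that many equal-width bins. The range is extended by 0.1% on each side so that the minimum and maximum of x are included.
sequence of scalars:
The bin edges; the range of x is not extended.
IntervalIndex: the exact intervals to use.

right: bool, default True
Whether each bin includes its right edge. With bins [1, 2, 3, 4] the intervals are (1,2], (2,3], (3,4]. With right=False they are closed on the left instead: [1,2), [2,3), [3,4).

labels: array or bool, optional
Labels for the bins; the length must equal the number of bins, so three intervals such as (1,2], (2,3], (3,4] need three labels. If False, only the integer bin codes are returned.

retbins: bool, default False (short for "return bins")
Whether to also return the bin edges. If True, cut returns two values, the second being something like array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).

precision: int, default 3; the precision used when storing and displaying the bin labels.
include_lowest: bool, default False
Whether the first interval includes its left edge. With bins np.arange(0, 101, 10) the first bin is (0, 10] and excludes 0 by default. With include_lowest=True, 0 is included and the first bin is shown as (-0.001, 10.0].

duplicates: {'raise' (default), 'drop'}, optional
If the bin edges are not unique, either raise a ValueError or drop the non-unique edges.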
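
To make right, labels and retbins concrete, here is a small sketch on made-up values (ages and edges are assumed names, chosen so that one value sits exactly on a bin edge):

```python
import pandas as pd

ages = pd.Series([1, 5, 10])  # hypothetical values; 5 and 10 lie on bin edges
edges = [0, 5, 10]

right_closed = pd.cut(ages, bins=edges)               # bins (0, 5] and (5, 10]
left_closed = pd.cut(ages, bins=edges, right=False)   # bins [0, 5) and [5, 10)
named = pd.cut(ages, bins=edges, labels=["low", "high"])
codes, returned_edges = pd.cut(ages, bins=edges, labels=False, retbins=True)

print(right_closed.cat.codes.tolist())  # [0, 0, 1]
print(left_closed.cat.codes.tolist())   # [0, 1, -1]  (10 falls outside [5, 10), so it becomes NaN)
print(named.tolist())                   # ['low', 'low', 'high']
print(codes.tolist(), list(returned_edges))
```

The value 5 moves from the first bin to the second when right=False, which is exactly the edge behaviour the right parameter controls.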

qcut

pd.qcut(x, q, labels, duplicates)

q: the number of quantile bins.
q=10 cuts at the deciles, i.e. at 10%, 20%, ...
q=4 cuts at the quartiles, i.e. at 25%, 50% and 75%.

df['colA_cut'] = pd.qcut(df['colA'], q=4, labels=['(0~25%]', '(25%~50%]', '(50%~75%]', '(75%~100%]'])
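
Unlike cut, qcut chooses the edges so that each bin holds roughly the same number of values. A minimal sketch on made-up data (the series and the Q1 to Q4 labels are assumptions for illustration):

```python
import pandas as pd

values = pd.Series(range(1, 9))  # hypothetical data: 1, 2, ..., 8
quartile = pd.qcut(values, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(quartile.tolist())
# ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3', 'Q4', 'Q4']
print(quartile.value_counts().sort_index().tolist())
# [2, 2, 2, 2]
```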

group by

https://www.cnblogs.com/chendongblog/p/10848270.html

Distribution after group by

quants = np.arange(.1, 1, .1)
pd.concat([df.groupby('state')['sales'].quantile(x) for x in quants], axis=1, keys=[str(x) for x in quants])

Grouping by column A

grp1 = df1.groupby('A')
for name, group in grp1:
    print(name)   # one distinct value of column A
    print(group)  # the sub-DataFrame for that value
print(grp1.get_group('001'))  # '001' is one of the values in column A

Grouping by several columns

grp1 = df1.groupby(['A', 'B'])

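Flattening the index after group by
By default the group keys become the index of the result. With as_index=False they stay as an ordinary column and the index is a plain integer range.

grp1 = df1.groupby('A', as_index=False).agg('sum')
This is equivalent to:
grp1 = df1.groupby('A').agg('sum').reset_index()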
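
A quick check of that equivalence on a made-up frame (df1 and its values are assumptions for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["x", "y", "x"], "B": [1, 2, 3]})  # hypothetical data

keyed = df1.groupby("A").agg("sum")                  # A becomes the index
flat = df1.groupby("A", as_index=False).agg("sum")   # A stays a column
flat2 = df1.groupby("A").agg("sum").reset_index()    # same result via reset_index

print(keyed.index.tolist())    # ['x', 'y']
print(flat.columns.tolist())   # ['A', 'B']
print(flat.values.tolist() == flat2.values.tolist())  # True
```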

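Applying functions after group by

# Here grp1 = df1.groupby('A'), and B and C stand for value columns of the frame.
# Recent pandas versions may reject a dict aggregation that names the grouping key itself.
print('# Mean of one column, three equivalent spellings')
print(grp1.B.mean())
print(grp1.B.agg('mean'))       # built-in aggregations are passed by name as a string; custom functions are passed directly
print(grp1.agg({'B': 'mean'}))  # names the column in a dict; returns a DataFrame rather than a Series

print('# Custom aggregation on one column')
def my_func(x):
    return max(x) - min(x)
print(grp1.B.agg(my_func))
print(grp1.B.agg(lambda x: max(x) - min(x)))  # the same with an anonymous function

print('# Element-wise computation on one column')
print(grp1.B.apply(lambda x: x + 100))

# Aggregating chosen columns, with different functions per column:
df.groupby('A').agg({'B': [fun1, fun2]})               # aggregate column B only
df.groupby('A').agg([fun1, fun2])                      # fun1 and fun2 for every column
df.groupby('A').agg({'B': [fun1, fun2], 'C': [fun3]})  # fun1 and fun2 on B, fun3 on C

Common groupby functions

mean
sum
count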

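pivot_table

df_FM = (
    dsw.pivot_table(index=['X1', 'X2'], columns='X3', values='Y', aggfunc='sum')
    .reset_index()
    .rename_axis(None, axis=1)
)

This works like a spreadsheet pivot table: index supplies the rows, columns supplies the columns, and values supplies the cells, combined with aggfunc.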
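
A small sketch of the same pattern on made-up data (the frame contents are assumptions; only one index column is used to keep the output short):

```python
import pandas as pd

# Hypothetical long-format data.
dsw = pd.DataFrame({
    "X1": ["a", "a", "a", "b"],
    "X3": ["p", "p", "q", "p"],
    "Y": [1, 2, 3, 4],
})

wide = (
    dsw.pivot_table(index="X1", columns="X3", values="Y", aggfunc="sum")
    .reset_index()              # turn X1 back into an ordinary column
    .rename_axis(None, axis=1)  # drop the leftover "X3" name on the column axis
)
print(wide)
```

The cell for X1 = b, X3 = q is NaN because that combination never occurs in the input.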

Saving data and models

Saving NumPy arrays

a = np.zeros([3, 1])
np.save("a.npy", a)
a_load = np.load("a.npy")

Saving a model

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
joblib.dump(lr_model, "model.m")
lr_model_load = joblib.load("model.m")

sklearn

plot_tree

https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html

import matplotlib.pyplot as plt
from sklearn import tree

fig = plt.figure(figsize=(80, 80))
pic = tree.plot_tree(
    tree_clf,       # a fitted decision tree estimator
    node_ids=True,  # show node ids
    filled=True,    # fill the nodes with colour
    rounded=True,   # draw node boxes with rounded corners
    fontsize=12,    # font size of the text inside the nodes
)
plt.savefig(FigOutputPATH)

Computing class_weight

Checking the balancing weights in an imbalanced problem

https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
y = [1, 1, 1, 1, 0, 0]
compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
#array([1.5 , 0.75])

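Feature selection: forward / backward

https://scikit-learn.org/1.0/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector

Given a target number n of features, the selector greedily picks n of them. Forward selection starts from no features and adds one at a time; backward selection starts from all features and removes one at a time.
The estimator is passed in unfitted, so no model has to be trained beforehand, but the model type must be chosen.
With an unsupervised estimator the information in y is not used, even though fit(X, y) still accepts it.

n_features_to_select: the number of features to select.
direction: {'forward', 'backward'}, default='forward'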
==========

>>> from sklearn.feature_selection import SequentialFeatureSelector
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> knn = KNeighborsClassifier(n_neighbors=3)
>>> sfs = SequentialFeatureSelector(knn, n_features_to_select=3)
>>> sfs.fit(X, y)
SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),
                          n_features_to_select=3)
>>> sfs.get_support()
array([ True, False,  True,  True])
>>> sfs.transform(X).shape
(150, 3)
============================
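# ridge, X, y and feature_names are assumed to be defined as in the scikit-learn example this snippet comes from
>>> from time import time
>>> tic_bwd = time()
>>> sfs_backward = SequentialFeatureSelector(
...     ridge, n_features_to_select=2, direction="backward"
... ).fit(X, y)
>>> toc_bwd = time()
>>> print(
...     "Features selected by backward sequential selection: "
...     f"{feature_names[sfs_backward.get_support()]}"
... )
>>> print(f"Done in {toc_bwd - tic_bwd:.3f}s")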

Features selected by backward sequential selection: ['bmi' 's5']
Done in 0.351s

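Feature selection: variance inflation factor, VIF (multicollinearity check)

The VIF of an explanatory variable is derived from the coefficient of determination (R^2) of an auxiliary regression of that variable on the other explanatory variables: VIF = 1 / (1 - R^2). For example:

Suppose the dependent variable is y and the explanatory variables are A, B and C, and we suspect collinearity among A, B and C. To detect it, set y aside and regress A, B and C on one another.

Regress A on B and C to get the R^2 of that regression, and from it the VIF of A; regress B on A and C to get the VIF of B; the VIF of C is obtained the same way.
A VIF of 10 is commonly taken as the threshold for weak multicollinearity and 100 for strong multicollinearity.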
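
The auxiliary-regression recipe can be written out directly with NumPy. This is a sketch under stated assumptions, not the statsmodels implementation: the function name and the synthetic data are made up, and an intercept is added to every auxiliary regression (statsmodels regresses on the columns as given, so its numbers can differ unless a constant column is included).

```python
import numpy as np

def vif_by_auxiliary_regression(X):
    """VIF of each column of X, from a least-squares regression (with
    intercept) of that column on all the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()  # R^2 of the auxiliary regression
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Synthetic data: c is almost a linear combination of a and b; d is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + 0.05 * rng.normal(size=200)
d = rng.normal(size=200)

vifs = vif_by_auxiliary_regression(np.column_stack([a, b, c, d]))
print([round(v, 1) for v in vifs])
```

The collinear columns a, b and c get very large VIFs, while d stays close to 1.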

variance_inflation_factor(df.values, i)  # VIF of column i of the values relative to the other columns

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

def calculate_vif(df):
    vif = pd.DataFrame()
    vif['index'] = df.columns
    vif['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif

# Remove variables one at a time in a while loop

# Compute the VIF of every feature (the last column is assumed to be the target), then recompute after each removal
vif = calculate_vif(df.iloc[:, :-1])
while (vif['VIF'] > 10).any():
    remove = vif.sort_values(by='VIF', ascending=False)['index'][:1].values[0]
    df.drop(remove, axis=1, inplace=True)
    vif = calculate_vif(df.iloc[:, :-1])  # keep excluding the target column

vif

Evaluating a model (recall, feature importance)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print("Train accuracy: {0:.2f}, test accuracy: {1:.2f}".format(train_acc, test_acc))

# Other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average="binary")
print("Precision: {0:.2f}, recall: {1:.2f}, F1 score: {2:.2f}".format(precision, recall, F1))

# Feature importances
features = list(X_test.columns)
importances = dtree.feature_importances_
indices = np.argsort(importances)[::-1]
num_features = len(importances)

# Show the feature importances as a bar chart
plt.figure()
plt.title("Feature importances")
plt.bar(range(num_features), importances[indices], color="g", align="center")
plt.xticks(range(num_features), [features[i] for i in indices], rotation=45)
plt.xlim([-1, num_features])
plt.show()

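Classification

Function                          Purpose
metrics.accuracy_score            accuracy
metrics.precision_score           precision
metrics.recall_score              recall
metrics.f1_score                  F1 score

metrics.balanced_accuracy_score   balanced accuracy for imbalanced datasets (the average of per-class recall)
metrics.top_k_accuracy_score      top-k accuracy: a prediction counts as correct if the true label is among the k highest-scoring classes
metrics.average_precision_score   average precision (AP) computed from prediction scores
metrics.brier_score_loss          Brier score loss
metrics.log_loss                  cross-entropy loss
metrics.jaccard_score             Jaccard similarity coefficient score
metrics.roc_auc_score             area under the Receiver Operating Characteristic curve (ROC AUC) computed from prediction scores
metrics.cohen_kappa_score         Cohen's kappa, a statistic measuring inter-annotator agreement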

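Regression

Function                           Purpose
metrics.explained_variance_score   explained variance regression score
metrics.mean_absolute_error        mean absolute error
metrics.mean_squared_error         mean squared error
metrics.mean_squared_log_error     mean squared logarithmic error
metrics.median_absolute_error      median absolute error
metrics.r2_score                   R^2 (coefficient of determination)
(R-squared, also called the coefficient of determination, measures the proportion of the variation in the dependent variable that is explained by the independent variables. For a least-squares multiple regression with an intercept, evaluated on the data it was fitted to, it lies in [0, 1]; the closer to 1, the better the fit and the stronger the explanatory power. On held-out data r2_score can be negative. Adjusted R-squared is a correction of R-squared that accounts for the number of explanatory variables.)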
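
R-squared and its adjusted form can be computed by hand in a few lines. This is a minimal sketch using the standard textbook formulas; the function names and the sample numbers are assumptions for illustration:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalises R^2 for the number of explanatory variables."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical targets and predictions from a one-feature model.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_samples=4, n_features=1)
print(round(r2, 4), round(adj, 4))  # 0.98 0.97
```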

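Once a regression model is built, the first question is whether the model holds at all, which calls for a significance test of the model, mainly the F-test. The regression output of some libraries reports F-statistic (the F-test statistic) and Prob(F-statistic) (the p-value of the F-test).
If Prob < 0.05, the regression model can be considered significant at the 95% confidence level; if Prob > 0.1, the model as a whole fails the significance test and needs further adjustment.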

sklearn.metrics.mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared='deprecated')

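WOE encoding

Categorical variables are usually converted with one-hot encoding into a sparse binary matrix. This can make the data very high-dimensional and sparse, which is hard on the model, much as an ill-suited data structure causes performance problems in a program.
A common remedy is to embed these high-dimensional sparse features, but embeddings lose some interpretability, much as optimizing for performance can cost readability.
This is where WOE (weight of evidence) encoding comes in. It turns a categorical feature into a numeric one while staying interpretable: each category is encoded by the logarithm of the ratio between its share of events and its share of non-events. The result is a more informative and interpretable representation of the categorical feature.
Just as one picks the data structure best suited to a task, WOE encoding gives scorecard models a feature representation that balances model performance and interpretability, which is why it is so often used in scorecard modelling.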
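
The arithmetic behind WOE and the information value (IV) fits in a few lines. This is a minimal sketch on made-up counts, using the usual definitions WOE = ln(share of positives / share of negatives) and IV = sum of (share of positives - share of negatives) * WOE, without any smoothing term:

```python
import math

# Hypothetical per-bin counts of positives (events) and negatives (non-events).
pos = [30, 10]
neg = [10, 30]

pos_total, neg_total = sum(pos), sum(neg)
woe = []
iv = 0.0
for p, n in zip(pos, neg):
    pos_share = p / pos_total  # share of all positives that fall in this bin
    neg_share = n / neg_total  # share of all negatives that fall in this bin
    w = math.log(pos_share / neg_share)
    woe.append(w)
    iv += (pos_share - neg_share) * w

print([round(w, 4) for w in woe], round(iv, 4))  # [1.0986, -1.0986] 1.0986
```

A positive WOE marks a bin where events are over-represented, a negative WOE one where they are under-represented.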

import numpy as np
import pandas as pd
import copy
def calculate_woe_iv(dataset):
    """
    Compute WOE and IV for a feature that has already been binned.
    :param dataset: DataFrame of per-bin counts with columns 'neg' and 'pos' (the output of binning)
    :return:
        iv: float, the IV value
        df: DataFrame with the WOE and IV of each bin

    Example
    -----------------------------------------------------------------
    >>> import random
    >>> data = pd.DataFrame([[random.random(),random.randint(0,1)] for _ in range(500)],columns=['feature','label'])
    >>> df = cut_width(dataset=data,inputcol='feature',labelcol='label',bins=10)
    >>> df.rename(columns={0:'neg',1:'pos'},inplace=True)
    >>> iv, woe_iv_df = calculate_woe_iv(dataset=df)
    >>> iv
    0.037619588549634465
    >>> woe_iv_df
    label               neg  pos  pos_rate  neg_rate       woe        iv
    feature
    (-0.000313, 0.103]   23   27  0.104869  0.103004  0.017940  0.000033
    (0.103, 0.206]       23   27  0.104869  0.103004  0.017940  0.000033
    (0.206, 0.312]       29   21  0.082397  0.128755 -0.446365  0.020693
    (0.312, 0.418]       22   28  0.108614  0.098712  0.095591  0.000947
    (0.418, 0.535]       19   31  0.119850  0.085837  0.333793  0.011353
    (0.535, 0.614]       22   28  0.108614  0.098712  0.095591  0.000947
    (0.614, 0.705]       24   26  0.101124  0.107296 -0.059249  0.000366
    (0.705, 0.8]         24   26  0.101124  0.107296 -0.059249  0.000366
    (0.8, 0.891]         22   28  0.108614  0.098712  0.095591  0.000947
    (0.891, 0.991]       25   25  0.097378  0.111588 -0.136210  0.001936
    """
    df = copy.copy(dataset)
    df['pos_rate'] = (df['pos'] + 1) / df['pos'].sum()  # share of all responders (Y=1) in each bin; the +1 keeps the WOE ratio from having a zero numerator or denominator
    df['neg_rate'] = (df['neg'] + 1) / df['neg'].sum()  # share of all non-responders (Y=0) in each bin
    df['woe'] = np.log(df['pos_rate'] / df['neg_rate'])  # WOE of each bin
    df['iv'] = (df['pos_rate'] - df['neg_rate']) * df['woe']  # IV contribution of each bin
    iv = df['iv'].sum()
    return iv, df
    
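def cut_width(dataset, inputcol, labelcol='label', bins=10):
    """
    Bin a feature and count the labels in each bin.
    Despite the name, the binning is equal-frequency: the body uses pd.qcut (pd.cut would give equal-width bins).
    :param dataset: DataFrame, the input data
    :param inputcol: str, name of the column to bin
    :param labelcol: str, name of the target column
    :param bins: int, positive number of bins
    :return:
        df: DataFrame, label counts per bin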

    Example
    -----------------------------------------------------------------
    >>> import random
    >>> data = pd.DataFrame([[random.random(),random.randint(0,1)] for _ in range(500)],columns=['feature','label'])
    >>> df = cut_width(data,inputcol='feature',labelcol='label',bins=10)
    >>> df
        label                             good  bad
    feature
    (-0.0009308000000000001, 0.0968]    23   27
    (0.0968, 0.188]                     27   23
    (0.188, 0.29]                       25   25
    (0.29, 0.385]                       32   18
    (0.385, 0.472]                      31   19
    (0.472, 0.567]                      24   26
    (0.567, 0.686]                      24   26
    (0.686, 0.778]                      24   26
    (0.778, 0.912]                      26   24
    (0.912, 0.999]                      29   21
    """
    df = copy.copy(dataset)
    df[inputcol] = pd.qcut(x=df[inputcol], q=bins)
    df = pd.crosstab(index=df[inputcol], columns=df[labelcol], margins=False)
    return df
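
# Usage sketch: data has 'feature' and 'label' columns, as in the docstring examples
binned = cut_width(data, inputcol='feature', labelcol='label', bins=10).rename(columns={0: 'neg', 1: 'pos'})
iv, woe = calculate_woe_iv(binned)
woe_map = woe['woe'].to_dict()  # bin interval -> WOE (the bins are the index of woe, not a column)
df['ID'] = df['ID'].replace(woe_map)  # df['ID'] is assumed to hold the bin of each row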

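Container types

set

Creating a set

a = set()
a = set([1, 2, "aa"])

Adding an element

a.add('aa')

Set operations

# Intersection
set(li_1) & set(li_2)

# Union
set(li_1) | set(li_3)

# Difference
set(li_2) - set(li_3)

# Is li_3 a proper subset / a subset of li_1?
set(li_1) > set(li_3)
set(li_1) >= set(li_3)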

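Save

# Save
d = {'a': 1, 'b': 2, 'c': 3}  # not named dict, which would shadow the built-in
np.save('my_file.npy', d)  # remember the file extension
# Load
load_dict = np.load('my_file.npy', allow_pickle=True).item()  # a dict is stored as a pickled object array, so allow_pickle=True is required
print(load_dict['a'])

Counter

# Most frequent elements:
from collections import Counter
word_counts = Counter(words)
print(word_counts.most_common(3))

# Sorting
C2 = sorted(C1.items(), key=lambda x: x[1], reverse=True)  # sort by value, largest first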

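List

a = ['two', 3, 'four', 'five', 6]
del a[2:4]  # delete the elements at indices 2 and 3 (start inclusive, end exclusive)
>>> a
['two', 3, 6]

Flattening:

Lis = sum(Lis, [])
# Effect: [[a],[b,v],[x,c]] -----> [a,b,v,x,c]
# [[a],[b,v],[[x,c]]] -----> [a,b,v,[x,c]]
# Removes one level of [] per call; raises an error if the list contains non-list elements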

str (strings)

split: splitting a string
strA.split(sep, maxsplit)
sep: the separator; by default any whitespace, including spaces, newlines (\n) and tabs (\t).
maxsplit: the maximum number of splits.

s = "Line1-abcdef \nLine2-abc \nLine4-abcd"
print(s.split())         # ['Line1-abcdef', 'Line2-abc', 'Line4-abcd']
print(s.split()[1])      # Line2-abc
print(s.split('\n', 1))  # ['Line1-abcdef ', 'Line2-abc \nLine4-abcd']

replace
s.replace(old, new, count)
s: the source string;
old: the substring to be replaced;
new: the replacement string;
count: optional; at most this many occurrences are replaced, counting from the left
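
A short illustration of the count argument (the date string is just a made-up example):

```python
s = "2021-08-11"

all_replaced = s.replace("-", "/")       # every occurrence
first_only = s.replace("-", "/", 1)      # at most one occurrence, from the left

print(all_replaced)  # 2021/08/11
print(first_only)    # 2021/08-11
print(s)             # 2021-08-11 (strings are immutable, so s itself is unchanged)
```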
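
Converting strings to time values

import time

timeArray = time.strptime(string, format)
where format must match the layout of the string;

a = "2013-10-10 23:40:00"
timeArray = time.strptime(a, "%Y-%m-%d %H:%M:%S")
# the string has now been parsed into a time.struct_time

strArray = time.strftime(format, timeArray)
where format is the target layout of the output string;

otherStyleTime = time.strftime("%Y/%m/%d %H:%M:%S", timeArray)
# the time value is now a string formatted as year/month/day hour:minute:second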

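format: formatted output

Passing arguments

# By position
"{1} {0} {1}".format("hello", "world")
# 'world hello world'
# From a dictionary
site = {"name": "菜鸟教程", "url": "www.runoob.com"}
print("Site name: {name}, URL {url}".format(**site))

Format types

s: string.
d: decimal integer.
f or F: fixed-point float (6 digits after the decimal point by default).
%: percentage (6 digits after the decimal point by default).
c: converts an integer to the corresponding Unicode character.
e or E: scientific notation.

^, < and > mean centered, left-aligned and right-aligned, and are followed by the width. An optional fill character goes right after the colon, before the alignment symbol; it must be a single character and defaults to a space.
+ shows a sign on both positive and negative numbers; a space puts a leading space before positive numbers. b, d, o and x are binary, decimal, octal and hexadecimal. Braces are escaped by doubling them ({{ and }}). Examples: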

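Number	Format	Output	Description
3.1415926	{:.2f}	3.14	two decimal places
-3.1415926	{:+.2f}	-3.14	signed, two decimal places
2.71828	{:.0f}	3	no decimal places
1000000	{:,}	1,000,000	comma as thousands separator
0.25	{:.2%}	25.00%	percentage
1000000000	{:.2e}	1.00e+09	exponent notation

13	{:>10d}	        13	right-aligned (the default for numbers, width 10)
13	{:<10d}	13	left-aligned (width 10)
13	{:^10d}	    13	centered (width 10)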
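
The fill, sign, base and brace-escaping rules can be checked directly. A few illustrative calls (the values are arbitrary):

```python
centered = "{:*^10d}".format(13)        # fill with '*', centered, width 10
left = "{:<10d}|".format(13)            # left-aligned; '|' marks the end of the field
signed = "{:+.2f}".format(3.1415926)    # explicit sign on a positive number
spaced = "{: d}".format(13)             # leading space instead of a sign
bases = "{:b} {:o} {:x}".format(11, 11, 11)         # binary, octal, hexadecimal
escaped = "{{}} is literal, {} is a field".format(42)  # doubled braces are literal

print(centered)  # ****13****
print(left)      # 13        |
print(signed)    # +3.14
print(repr(spaced))  # ' 13'
print(bases)     # 1011 13 b
print(escaped)   # {} is literal, 42 is a field
```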

numpy

Matrices

shape and reshape

a = np.array([1, 2, 3])
np.shape(a)           # (3,)
np.shape(a[None, :])  # (1, 3), same as b = a.reshape(1, 3)
np.shape(a[:, None])  # (3, 1), same as b = a.reshape(3, 1)

Matrix operations
A * B         # element-wise product
np.dot(A, B)  # matrix product

Locating elements: np.where

np.where(ListA > 1)  # ListA must be an ndarray; returns a tuple of index arrays
np.where(np.isnan(df['colname']))

Saving

ndarray
np.savetxt("result.txt", numpy_data)

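OS

import os
import shutil

Paths

# Path of the current file (__file__)
os.path.realpath(__file__)
# Directory containing the current file
os.path.dirname(os.path.realpath(__file__))
# '/Users/heningfeng'
# Split a path into directory + file name
os.path.split(os.path.realpath("./2w.csv"))
# ('/Users/heningfeng', '2w.csv')

Creating a directory

os.mkdir(path)

Listing the files in a directory

os.listdir(datapath)

Moving a file (or directory)

shutil.move("oldpos", "newpos")

Checking whether a file exists

# Before writing, check whether the file exists and delete it if so
if os.path.exists("dest.txt"):
    os.remove("dest.txt")

Reading and writing files

with open('/path/to/file', 'r') as f:
    data = f.read()  # read the whole file as one string
with open('/path/to/file', 'r') as f:
    lines = f.readlines()  # read the file into a list of strings, one per line (after read() on the same handle this would return [])

with open("dest.txt", 'w') as file_write_obj:
    for var in mylist:
        file_write_obj.write(var)
        file_write_obj.write('\n')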

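map, lambda, filter

filter:
filter(function, sequence)

Calls function(item) on each item of sequence and keeps the items for which the result is true. In Python 3 the result is a lazy iterator, so wrap it in list() to get a list (Python 2 returned a list, string or tuple matching the type of sequence):

def f(x): return x % 2 != 0 and x % 3 != 0
list(filter(f, range(2, 25)))
# [5, 7, 11, 13, 17, 19, 23]

map(function, sequence)
Calls function(item) on each item of sequence and yields the results; in Python 3 wrap it in list() as well.

def cube(x): return x*x*x
list(map(cube, range(1, 11)))
# [1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]

def double(x): return x + x
list(map(double, "abcde"))
# ['aa', 'bb', 'cc', 'dd', 'ee']

# map also accepts several sequences, in which case function must take that many arguments:
def add(x, y): return x+y
list(map(add, range(8), range(8)))
# [0, 2, 4, 6, 8, 10, 12, 14]

lambda

lambda x, y, z, ...: expression

A lambda defines a minimal one-line function on the spot, loosely comparable to a macro in C. The idea is borrowed from LISP, and a lambda can be used anywhere a function is expected:

g = lambda x: x * 2
g(3)  # 6
(lambda x: x * 2)(3)  # 6
lambda x, y: x*y
list(map(lambda x, y: x+y, l1, l2))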

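from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
# Split the original data into training and validation sets
X_tr, X_vld, lab_tr, lab_vld = train_test_split(X_train, labels_train,
test_size=0.3, random_state = 123)

clf = LogisticRegression().fit(X_train, y_train)
# Extract the coefficients and intercept of the fitted learner
coefficients = clf.coef_[0]
intercept = clf.intercept_[0]
r = recall_score(target, pred, labels=labels, average="weighted")  # argument order is (y_true, y_pred)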

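Precision: the proportion of samples predicted positive that are actually positive. Formula: TP/(TP+FP).

Recall: the proportion of actually positive samples that are predicted positive. Formula: TP/(TP+FN).

F1 score: the harmonic mean of precision and recall, a combined measure that balances the two. Formula: 2*precision*recall/(precision+recall).

Accuracy: the proportion of all samples that are predicted correctly. Formula: (TP+TN)/(TP+FP+FN+TN).

These metrics evaluate different aspects of a classifier's performance and so guide how to improve it. The confusion matrix also shows how the model behaves for each combination of actual and predicted class, for example whether it is biased toward one class or has a high misclassification rate. That makes the confusion matrix one of the key tools for evaluating classification models.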
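
The four formulas applied to one made-up confusion matrix (the counts are assumptions chosen for easy arithmetic):

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 0.8 0.6667 0.7273 0.7
```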

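III. ROC curve
The ROC (Receiver Operating Characteristic) curve is an important tool for evaluating binary classifiers. It plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis, with one point for every possible decision threshold.

The ROC curve shows the overall behaviour of the classifier across thresholds. The area under the curve (AUC) is the usual summary figure and measures how well the model separates positive from negative samples. The larger the AUC, the better the model.

Formulas: TPR = TP / (TP + FN); FPR = FP / (FP + TN)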
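
The construction can be sketched in plain Python: sort by score, lower the threshold one sample at a time, and record (FPR, TPR) after each step. The function names and data are assumptions for illustration, and the sketch assumes binary labels with no tied scores; in practice a library routine such as sklearn.metrics.roc_curve would be used.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by using each score as a threshold.
    Assumes binary labels (1 = positive) and no tied scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]

points = roc_points(labels, scores)
area = auc(points)
print(points[-1])      # (1.0, 1.0)
print(round(area, 4))  # 0.8889
```

The AUC of 8/9 matches the fraction of (positive, negative) pairs in which the positive sample has the higher score: 8 of the 9 pairs here.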

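IV. KS value / KS curve
The KS (Kolmogorov-Smirnov) statistic tests whether two samples come from the same distribution: it is the largest gap between the two samples' cumulative distribution functions (CDFs). In classification, the predicted probabilities of the positive samples and of the negative samples serve as the two samples, and the KS value is obtained by comparing their CDF curves. Formula: KS = max(TPR − FPR)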
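
Following the formula KS = max(TPR − FPR), the statistic can be computed in one pass over the samples sorted by score. A minimal sketch under the same assumptions as before (made-up data, binary labels, no tied scores; the function name is hypothetical):

```python
def ks_statistic(labels, scores):
    """Largest gap between TPR and FPR over all thresholds.
    Assumes binary labels (1 = positive) and no tied scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    best = 0.0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        best = max(best, tp / n_pos - fp / n_neg)
    return best

# Hypothetical labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]

ks = ks_statistic(labels, scores)
print(round(ks, 4))  # 0.6667
```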