Data Mining
winner8881
Common pandas idioms
1. Set every value equal to '\\N' (the literal string \N) to None, column by column:
    data.loc[data[i] == '\\N', i] = None
2. LabelEncoder:
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    data[cat] = le.fit_transform(data[cat])
Original · 2020-08-03 09:53:31 · 944 views · 0 comments
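The two idioms above can be combined into one small pass. This is a minimal sketch on a toy frame; the column names `city` and `os` and the `"missing"` placeholder are assumptions for illustration, not part of the original post. Missing values are filled with a string before encoding so that the encoder only ever sees one type.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the column names are hypothetical.
data = pd.DataFrame({
    "city": ["bj", "\\N", "sh", "bj"],
    "os": ["ios", "android", "\\N", "ios"],
})

# 1. Turn the '\N' placeholder into a real missing value in every column.
data = data.replace("\\N", np.nan)
missing_per_column = data.isna().sum()

# 2. Label-encode each categorical column. Missing values are filled with a
#    string first so the encoder sorts a single type.
for cat in ["city", "os"]:
    le = LabelEncoder()
    data[cat] = le.fit_transform(data[cat].fillna("missing"))

print(missing_per_column.to_dict())
print(data)
```

`LabelEncoder` assigns codes in sorted order of the class labels, so the codes carry no meaning beyond identity.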
IP processing
    import numpy as np
    import pandas as pd
    a=np.load('ip_explain_by_geoip2_china.npy',allow_pickle=True)
    ip_exp=a.item()
    temp = pd.DataFrame(list(ip_exp.items()), columns=['ip', 'ip_exp'])
    temp[['country','province_exp','c...
Original · 2019-08-13 12:26:43 · 154 views · 0 comments
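The excerpt is cut off at the point where the looked-up value is split into columns, so here is a rough sketch of the whole round trip under stated assumptions: the dictionary maps each IP to a `(country, province, city)` tuple, and the file name and column names are made up for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical lookup table: ip -> (country, province, city).
ip_exp = {
    "1.2.3.4": ("China", "Zhejiang", "Hangzhou"),
    "5.6.7.8": ("China", "Beijing", "Beijing"),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ip_lookup.npy")
    # A dict is saved as a pickled 0-d object array.
    np.save(path, ip_exp)
    # allow_pickle=True is required to read it back; .item() unwraps the dict.
    loaded = np.load(path, allow_pickle=True).item()

# One row per IP, then expand the tuple into separate columns.
temp = pd.DataFrame(list(loaded.items()), columns=["ip", "ip_exp"])
parts = pd.DataFrame(
    temp["ip_exp"].tolist(),
    columns=["country", "province", "city"],
    index=temp.index,
)
temp = pd.concat([temp[["ip"]], parts], axis=1)
print(temp)
```

Only load pickled `.npy` files from sources you trust, since unpickling can execute arbitrary code.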
GBDT, XGBoost, LightGBM, and CatBoost papers
1. GBDT vs. XGBoost comparison: https://wenku.baidu.com/view/f3da60b4951ea76e58fafab069dc5022aaea463e.html
2. XGBoost paper: https://arxiv.org/pdf/1603.02754.pdf
3. LightGBM paper: http://papers.nips.cc/paper/6907-lightg...
Original · 2019-08-13 12:16:30 · 833 views · 0 comments
Data mining: CTR features
    def ctr_fea(train,test,feature):
        for fea in feature:
            print(fea)
            temp = train[['label',fea]].groupby(fea)['label'].agg({fea+'_sum':sum, ...
Original · 2019-08-22 13:19:46 · 767 views · 0 comments
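Since the excerpt stops mid-call, the following is only a sketch of what a CTR feature of this kind usually looks like, not the post's actual function. It assumes the statistics are the per-value click sum, impression count, and their ratio, computed on the training set and merged into both sets. It uses named aggregation, because pandas 1.0 and later reject the dict-renaming form of `agg` on a single column.

```python
import pandas as pd


def ctr_features(train, test, features, label="label"):
    """Add <fea>_sum, <fea>_count and <fea>_ctr columns computed on train."""
    for fea in features:
        stats = (
            train.groupby(fea)[label]
            .agg(**{f"{fea}_sum": "sum", f"{fea}_count": "count"})
            .reset_index()
        )
        stats[f"{fea}_ctr"] = stats[f"{fea}_sum"] / stats[f"{fea}_count"]
        # Left merges keep the row order; unseen test values get NaN.
        train = train.merge(stats, on=fea, how="left")
        test = test.merge(stats, on=fea, how="left")
    return train, test


train = pd.DataFrame({"ad": ["a", "a", "b", "b", "b"], "label": [1, 0, 1, 1, 0]})
test = pd.DataFrame({"ad": ["a", "c"]})
train_out, test_out = ctr_features(train, test, ["ad"])
print(train_out)
print(test_out)
```

Computing the rate on the same rows the model trains on leaks the label into the feature; an out-of-fold variant is the usual remedy.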
Data mining: statistical features
    def cnt_fea(data,feature,train_num):
        data['flag'] = '-'
        for fea in feature:
            print(fea)
            data[fea] = data[fea].map(data[fea].value_counts())
        for i in range(len(feature)-1):...
Original · 2019-08-22 13:19:32 · 524 views · 0 comments
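A small sketch of the count-encoding idea, with two deliberate differences from the excerpt: counts go into new `_cnt` columns instead of overwriting the originals, and the truncated second loop is guessed to build pairwise counts. Both choices, and the column names, are assumptions.

```python
from itertools import combinations

import pandas as pd


def count_features(data, features):
    """Return a copy with per-value and per-pair occurrence counts."""
    out = data.copy()
    for fea in features:
        # Map each value to how often it occurs in the column.
        out[fea + "_cnt"] = out[fea].map(out[fea].value_counts())
    for a, b in combinations(features, 2):
        # Size of each (a, b) group, broadcast back to the rows.
        out[f"{a}_{b}_cnt"] = out.groupby([a, b])[a].transform("size")
    return out


data = pd.DataFrame({
    "os": ["ios", "ios", "android", "ios"],
    "city": ["bj", "bj", "bj", "sh"],
})
result = count_features(data, ["os", "city"])
print(result)
```

Counting over train and test together, as the single `data` argument in the excerpt suggests, keeps the encoding consistent between the two sets.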
Data mining: feature difference encoding
Quick recipe for difference encoding:
1. Take set() of the values
2. Build a pd.DataFrame
3. merge()
    arrs = ['adidmd5', 'imeimd5', 'macmd5', 'openudidmd5', 'ip']
    val = []
    for i in range(len(arrs)):
        val.append(list(set(train[arrs[i]].unique()) & s...
Original · 2019-08-13 12:27:39 · 359 views · 0 comments
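The three steps can be sketched as follows. The excerpt is truncated right after the set intersection, so what happens next is an assumption: here each row gets a 0/1 flag saying whether its value occurs in both train and test. The `_in_both` suffix and the helper name are hypothetical.

```python
import pandas as pd


def add_overlap_flag(df, flag, col):
    """Left-merge the flag table and fill unmatched rows with 0."""
    merged = df.merge(flag, on=col, how="left")
    merged[col + "_in_both"] = merged[col + "_in_both"].fillna(0).astype(int)
    return merged


train = pd.DataFrame({"ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3"]})
test = pd.DataFrame({"ip": ["2.2.2.2", "4.4.4.4"]})

for col in ["ip"]:
    # 1. set(): values that occur in both train and test.
    shared = set(train[col].unique()) & set(test[col].unique())
    # 2. DataFrame: one row per shared value plus a constant flag column.
    flag = pd.DataFrame({col: sorted(shared), col + "_in_both": 1})
    # 3. merge(): attach the flag to every row of both sets.
    train = add_overlap_flag(train, flag, col)
    test = add_overlap_flag(test, flag, col)

print(train)
print(test)
```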
Data mining: mode
    # Mode
    def get_mode(arr):
        mode = []
        arr_appear = dict((a, arr.count(a)) for a in arr)  # count how often each element occurs
        if max(arr_appear.values()) == 1:  # if no element occurs more than once
            return  # there is no mode
        else:
            ...
Original · 2019-08-10 00:10:10 · 378 views · 0 comments
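Because the body of the `else` branch is not shown, here is one possible completion rather than the original one. It swaps `arr.count(a)` inside a loop, which is quadratic, for `collections.Counter`, and it returns an empty list instead of `None` when there is no mode; both are choices made for this sketch.

```python
from collections import Counter


def get_modes(values):
    """Return every most-frequent value, or [] if all values occur once."""
    if not values:
        return []
    counts = Counter(values)  # single pass over the data
    top = max(counts.values())
    if top == 1:
        return []  # every value is unique, so there is no mode
    return [value for value, count in counts.items() if count == top]


print(get_modes([1, 2, 2, 3, 3]))
print(get_modes([1, 2, 3]))
```

Returning a list keeps ties visible, which a single-value mode would hide.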
leetcode: manual LabelEncoder
    for col in obj_cols:
        data[col] = data[col].fillna('-1')
        data[col] = data[col].map(dict(zip(data[col].unique(),list(range(data[col].nunique())))))
        print(col+' over...')
Original · 2019-08-17 17:22:38 · 217 views · 0 comments
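To see what the mapping does, here is the same loop on a toy column next to `pd.factorize`, which produces the same first-seen ordering. The `os` column is made up for the example.

```python
import pandas as pd

data = pd.DataFrame({"os": ["ios", None, "android", "ios"]})
obj_cols = ["os"]

for col in obj_cols:
    # Missing values become their own category.
    filled = data[col].fillna("-1")
    # Codes follow the order in which values first appear.
    mapping = {value: code for code, value in enumerate(filled.unique())}
    data[col] = filled.map(mapping)

# pd.factorize yields the same codes for an already-filled column.
codes, uniques = pd.factorize(pd.Series(["ios", "-1", "android", "ios"]))

print(data["os"].tolist())
print(list(codes), list(uniques))
```

Unlike scikit-learn's `LabelEncoder`, which sorts the classes, both variants here number values by first appearance.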
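Huawei Elite Algorithm Competition finals: summary
1. Huawei competition summary
    1. Top-2 contestant: EDA. The first step in a competition is EDA, to find strong features; concretely, look at how a given variable is distributed against the label.
    2. Top-1 contestant: competition theory.
    3. Self-review: dig deeper into the theory, such as how the lgb model and neural networks work, instead of counting on luck. Don't be lazy in competitions or about filling theory gaps. Don't be dependent; you can't rely on teammates alone. Technical work takes calm focus. Don't shy away from difficult things.
2. CTR summary
    1. EDA: inspect the features, for example uid_value_cou...
Original · 2019-08-26 23:06:43 · 522 views · 0 comments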
Data mining: the geoip2 tool
    import geoip2.database
    import sys
    # ip = input()
    ip = '210.32.149.0'
    reader = geoip2.database.Reader('./GeoLite2-City.mmdb')
    data = reader.city(ip)
    def ip_explain(ip):
        data = reader.city(ip)
        ...
Original · 2019-08-13 12:26:09 · 189 views · 0 comments
Data mining: plotting and saving the train and test sets
    # train = data[data.label!=-1]
    # test = data[data.label==-1]
    # train = train.dropna()
    # test = test.dropna()
    # # for i in data.columns:
    # for i in ['city','lan', 'os', 'osv', 'ver', 'orientation', 'ca...
Original · 2019-08-02 16:56:37 · 500 views · 0 comments
Data mining: plotting and saving positive and negative samples
    # train_pos = data[data['label']==1]
    # train_neg = data[data['label']==0]
    # train = train.dropna()
    # test = test.dropna()
    # for i in ['city','lan', 'os', 'osv', 'ver', 'orientation', 'carrier', 'ntt',...
Original · 2019-08-02 16:55:28 · 386 views · 0 comments
Data mining: clustering numeric features
    cols = ['area','location', 'pv/uv', 'totalFloor', 'pv', 'shi']
    cols_kmeans = []
    for i in cols:
        data[i+'_kmeans'] = (data[i]- data[i].min())/(data[i].max() - data[i].min())
        cols_kmeans.append(i...
Original · 2019-06-02 03:03:50 · 521 views · 1 comment
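The excerpt ends before any clustering call, so the step after scaling is a guess. A plausible reading is that the min-max scaled columns are fed jointly into k-means and the cluster id becomes a new feature. The sketch below does that with scikit-learn's `KMeans`; the toy data, the `cluster` column name, and the choice of two clusters are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({
    "area": [50.0, 55.0, 60.0, 300.0, 310.0, 305.0],
    "pv": [1.0, 2.0, 1.0, 100.0, 98.0, 99.0],
})
cols = ["area", "pv"]

cols_kmeans = []
for col in cols:
    # Min-max scale to [0, 1] so no column dominates the distance.
    span = data[col].max() - data[col].min()
    data[col + "_kmeans"] = (data[col] - data[col].min()) / span
    cols_kmeans.append(col + "_kmeans")

# Cluster on all scaled columns at once; the label is the new feature.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
data["cluster"] = km.fit_predict(data[cols_kmeans])
print(data)
```

A constant column makes `span` zero and the scaled values NaN, so such columns should be dropped or handled before scaling.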
Data mining: plotting distributions
    import seaborn as sns
    import matplotlib.pyplot as plt
    for i in train.columns:
        try:
            g = sns.kdeplot(train[i], color="Red", shade = True)
            g = sns.kdeplot(test[i], ax =g, color="Blue"...
Original · 2019-06-12 08:51:02 · 1164 views · 0 comments
Data mining: stratified sampling
    # Stratified sampling
    gbr = data.groupby("area")
    gbr.groups
    typicalFracDict = {
        1: 0.2,
        2: 0.4,
        3: 0.6
    }
    def typicalSampling(group, typicalFracDict):
        name = group.name
        frac = typicalFracDi...
Repost · 2019-06-12 09:37:40 · 780 views · 1 comment
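For comparison, the same per-stratum sampling can be written without `groupby().apply()`, by looping over the groups and concatenating the samples. This is a sketch on made-up data; it reuses the fractions from the excerpt but is not the reposted function itself.

```python
import pandas as pd

data = pd.DataFrame({
    "area": [1] * 10 + [2] * 10 + [3] * 10,
    "value": range(30),
})
# Sampling fraction for each stratum, keyed by the value of "area".
frac_by_area = {1: 0.2, 2: 0.4, 3: 0.6}

samples = [
    group.sample(frac=frac_by_area[name], random_state=0)
    for name, group in data.groupby("area")
]
sampled = pd.concat(samples)
print(sampled["area"].value_counts().sort_index())
```

Fixing `random_state` makes the draw reproducible; drop it for a fresh sample on each run.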
Looking up information by IP
    import requests
    import IPy
    def get_location(ip):
        url = 'https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php?co=&resource_id=6006&t=1529895387942&ie=utf8&oe=gbk&c...
Original · 2019-08-02 16:43:01 · 441 views · 0 comments
Data mining: trimming the long tail
    # def cut_col(data, col_name, cut_list):
    #     print('cutting', col_name)
    #     def _trans(array):
    #         count = array['box_counts']
    #         for box in cut_list:
    #             if count <= bo...
Original · 2019-08-02 16:45:09 · 477 views · 0 comments
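The commented-out function is truncated inside its threshold loop, so the sketch below is one interpretation, not a reconstruction: values whose frequency falls at or below a threshold in `cut_list` are replaced by a bucket label for the smallest such threshold, and frequent values are left alone. The function name and the `count<=N` labels are invented for the example.

```python
import pandas as pd


def cut_long_tail(series, cut_list):
    """Replace rare values with a bucket named after their count threshold."""
    counts = series.map(series.value_counts())
    out = series.astype(object).copy()
    remaining = pd.Series(True, index=series.index)
    for box in sorted(cut_list):
        # Rows not yet bucketed whose value occurs at most `box` times.
        hit = remaining & (counts <= box)
        out[hit] = f"count<={box}"
        remaining &= ~hit
    return out


col = pd.Series(["a"] * 5 + ["b"] * 2 + ["c"])
trimmed = cut_long_tail(col, cut_list=[1, 3])
print(trimmed.tolist())
```

Merging rare values this way shrinks the cardinality of a categorical column before encoding it.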
Data mining: common idioms (continuously upd...)
1. Sorting
    train_new.sort_values(by='imeimd5')
    train_new.sort_values(by='imeimd5')['imeimd5'].max()
    train_ime = train_new['imeimd5'].unique()
2. Progress bar for iterators: tqdm
    cnt = 0
    for i in tqdm.tqdm_notebook(test_ne...
Original · 2019-08-13 12:25:46 · 200 views · 0 comments
Building a fraud blacklist
    import numpy as np
    # a=np.load('ip_dict.npy',allow_pickle=True)
    # data=a.item()
    temp = train[['ip','label']].groupby('ip')['label'].agg({'mean_label':'mean','count_label':'count','sum_label':'sum'}...
Original · 2019-08-12 14:05:55 · 150 views · 0 comments
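What happens after the aggregation is not shown, so the filtering step here is an assumption: an IP goes on the blacklist when it has enough rows and a high enough fraud rate. The thresholds `MIN_COUNT` and `MIN_RATE` are placeholders, and the aggregation is written with named aggregation, which current pandas accepts.

```python
import pandas as pd

train = pd.DataFrame({
    "ip": ["a"] * 4 + ["b"] * 4 + ["c"],
    "label": [1, 1, 1, 1, 0, 1, 0, 0, 1],
})

# Per-IP fraud rate, row count, and number of fraud rows.
stats = (
    train.groupby("ip")["label"]
    .agg(mean_label="mean", count_label="count", sum_label="sum")
    .reset_index()
)

# Hypothetical thresholds; tune them on validation data.
MIN_COUNT = 3
MIN_RATE = 0.9
mask = (stats["count_label"] >= MIN_COUNT) & (stats["mean_label"] >= MIN_RATE)
blacklist = stats.loc[mask, "ip"].tolist()
print(stats)
print(blacklist)
```

The count threshold keeps IPs seen only once or twice off the list, where a rate of 1.0 says very little.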
Data mining: feature_importance
    # Feature importance
    import matplotlib.pyplot as plt
    import seaborn as sns
    cols = (feature_importance_data[["feature", "importance"]]
            .groupby("feature")
            .mean()
            .sort_values(by="importance...
Original · 2019-08-02 16:51:06 · 275 views · 0 comments
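The chain is cut off inside `sort_values`, so the rest is assumed here: sort descending, keep the top N feature names, then select their per-fold rows for plotting. The toy importances and `TOP_N` are made up, and the plotting call is left out so the sketch runs without a display.

```python
import pandas as pd

# One row per (feature, fold), as collected during cross-validation.
feature_importance_data = pd.DataFrame({
    "feature": ["ip", "os", "city", "ip", "os", "city"],
    "importance": [120, 30, 60, 100, 50, 40],
    "fold": [1, 1, 1, 2, 2, 2],
})

TOP_N = 2  # hypothetical cutoff
cols = (
    feature_importance_data[["feature", "importance"]]
    .groupby("feature")
    .mean()  # average importance across folds
    .sort_values(by="importance", ascending=False)[:TOP_N]
    .index
)

# Keep the per-fold rows of the selected features, for example for a bar plot.
best_features = feature_importance_data[
    feature_importance_data["feature"].isin(cols)
]
print(list(cols))
print(best_features)
```

Keeping the per-fold rows rather than the means lets a bar plot show the spread across folds.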
Simple grid search for LightGBM
    folds = KFold(n_splits=5, shuffle=True, random_state=1333)
    oof_lgb = np.zeros(len(train))
    predictions_lgb = np.zeros(len(test))
    feature_importance_data = pd.DataFrame()
    best_score = 0
    learning_rate ...
Original · 2019-06-02 02:59:14 · 2149 views · 0 comments