OverTheMoon - CSDN Blog

Original: Introduction to Algorithms, First Steps

1. Sort a dictionary by value while also outputting its keys
fre_dict = dict()
sorted(fre_dict.items(), key=lambda x: x[1])
sorted(fre_dict.keys(), key=lambda x: (fre_dict[x], x))
2. Count the occurrences of each element in a list in O(n)
from collections import...
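A minimal sketch of both tips on hypothetical sample data. The excerpt's import is cut off, so the use of `Counter` here is an assumption about what the post goes on to import.

```python
from collections import Counter

words = ["b", "a", "c", "a", "b", "a"]  # hypothetical sample data

# Count occurrences in a single O(n) pass (Counter is assumed, see above).
fre_dict = Counter(words)

# Sort (key, value) pairs by value, ascending.
by_value = sorted(fre_dict.items(), key=lambda kv: kv[1])

# Sort keys by value, breaking ties by the key itself.
keys_by_value = sorted(fre_dict.keys(), key=lambda k: (fre_dict[k], k))

print(by_value)
print(keys_by_value)
```

The tuple key `(fre_dict[k], k)` makes the order deterministic when two keys share the same count.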

2018-06-02 18:32:24 284

Algorithm index: Naive Bayes, AutoEncoder
Naive Bayes
It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors.
$$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$$
where c is the target and x is
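A small worked example of the formula, with made-up probabilities chosen only for illustration. The function name and the spam scenario are hypothetical.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x

# Hypothetical numbers: 30% of messages are spam; the word "free"
# appears in 60% of spam and in 10% of non-spam messages.
p_spam = 0.3
p_free_given_spam = 0.6
p_free_given_ham = 0.1

# P(x) by the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 2))
```

With these numbers P(x) = 0.25, so the posterior is 0.18 / 0.25 = 0.72.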

2021-07-02 14:50:43 198

Original: Notes from My Second Round of Practice Problems

Establish DP lists:
dp = [[0]*n for _ in range(n)]
Usage of range:
range(5) means range(0, 5, 1)
To reverse it, use range(4, -1, -1) or reversed(range(5)); range(0, 5, -1) is empty.
If the output has a boundary:
if result >= -2**31 and result <= 2**31 - 1: return result
else: return
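The three tips can be sketched as follows. The helper name `int32_or_zero` and its fallback value of 0 are assumptions, since the excerpt is cut off after `else: return`.

```python
n = 3

# Each row is a separate list, so writing one cell leaves other rows alone.
dp = [[0] * n for _ in range(n)]
dp[0][0] = 1

# Pitfall: [[0] * n] * n repeats the SAME row object n times.
aliased = [[0] * n] * n
aliased[0][0] = 1  # every row now shows 1 in column 0

forward = list(range(5))           # counts up from 0 to 4
backward = list(range(4, -1, -1))  # counts down from 4 to 0
empty = list(range(0, 5, -1))      # a negative step from 0 toward 5 yields nothing


def int32_or_zero(result):
    """Return result if it fits in a signed 32-bit integer, else 0.

    The fallback of 0 is an assumption; the original excerpt is truncated.
    """
    int_min, int_max = -2**31, 2**31 - 1
    return result if int_min <= result <= int_max else 0
```

The list comprehension is what makes the DP table safe to fill cell by cell.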

2021-01-16 15:40:25 190

Original: Notes on Deep Learning Architectures

Deep Learning
Contents: RNN, Standard RNN, LSTM, Attention, Self-attention, Multi-head Attention, CNN, Text-CNN, Transformer, BERT
RNN
Standard RNN
LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Attention
Attentio...

2019-10-30 15:51:51 414

Original: Recommender Systems and Federated Learning

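Recommender Systems and Federated Learning
Contents: popularity-based recommendation; Thompson sampling; collaborative-filtering recommendation (UserCF and ItemCF); content-based recommendation; model-based recommendation; hybrid recommendation
Popularity-based recommendation
Popularity-based algorithms are simple and blunt. Like the trending lists on news sites and Weibo, they rank items by some popularity measure built from data such as PV, UV, average daily PV, or share rate, and recommend the top items to users.
Note: what is the difference between unique visitors (UV) and visit count (VV)?
① Visit count (VV): records, within 1 day...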

2019-10-30 14:59:32 1847

Original: Chatbot Notes

Chatbot
Paper script
CoChat: Enabling Bot and Human Collaboration for Task Completion
First use supervised learning for initialization, then...

2019-06-20 17:14:10 223

Original: An introduction to reinforcement learning

After a quick look at several overviews of reinforcement learning, I wrote these notes to summarize the key concepts and points and to help myself understand reinforcement learn...

2019-06-07 13:41:26 903

Original: Collected Interview Questions

Collected Interview Questions
Contents: Models; regression trees and classification trees; binary classifiers
Models
This section collects single models and multi...

2019-04-12 11:22:48 463

Original: Feature Engineering in Python

1. LabelEncoder
Simply put, LabelEncoder assigns integer codes to non-consecutive numbers or to text.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit([1, 5, 67, 100])
le.transform([1, 1, 100, 67, 5])
Output: array([0, 0, 3, 2, 1])
2. O...

2019-01-31 10:28:57 509

Original: How to Explain a Black-Box Model to a Layperson

http://blog.datadive.net/interpreting-random-forests/
https://github.com/andosa/treeinterpreter
https://www.kaggle.com/learn/machine-learning-explainability
2019/02/20 review: today I talked with the business team about IsolationForest anomaly...

2018-11-13 16:13:03 2380

Original: Working with Timestamps

Using timestamps in Python
import time
x = '2019/9/5'
# change the string time into a Unix timestamp ("binlog time")
time.mktime(time.strptime(x, '%Y/%m/%d'))
y = 1567612800
# change the timestamp back into a string
time.strftime(...
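A round trip between a date string and a timestamp might look like the sketch below. The excerpt's `time.strftime(` call is cut off, so the format string and the `time.localtime` step are assumptions rather than the post's own code.

```python
import time

date_text = "2019/9/5"

# Parse the string into a struct_time, then into seconds since the epoch.
# mktime interprets the struct_time in the local time zone.
timestamp = time.mktime(time.strptime(date_text, "%Y/%m/%d"))

# Convert the timestamp back into a string, again in the local time zone.
round_trip = time.strftime("%Y/%m/%d", time.localtime(timestamp))

print(timestamp, round_trip)
```

Because both directions use local time, the numeric timestamp depends on the machine's time zone, while the round-tripped date does not. `%m` and `%d` zero-pad, so the result reads `2019/09/05`.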

2018-09-27 11:11:27 770

Original: Thoughts on Hyperparameter Tuning

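1. GridSearchCV
Watch out for one pitfall: for classifiers, the default sample-splitting method is StratifiedKFold, not KFold. A friend of mine wrote a sample generator to work around this:
from sklearn.model_selection import KFold
myCV = []
for train_index, test_index in KFold(5,shuffle=Tru...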

2018-08-15 15:27:22 397

Original: About Git

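I used Git for the first time today. First download and install Git locally. Then right-click in the folder where the Git project should live and open bash. Some commands:
git clone <url>
# create a new branch
git checkout -b <branch_name>
# after making changes
git add --all
git commit -m "some comments"
git push origin <branch...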

2018-03-07 17:34:34 186

Original: Some Shell Commands

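Amateur-level stuff: a running record of commands I come across.
ls -l    print information about all files in the current directory
cp -R /notebook/yuke/models/* /notebook/models/yuke    copy everything in the folder on the left into the directory on the right
mkdir yuke    create a new folder in the current directory
pwd    show the current directory
ln -s /notebook/models...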

2017-12-26 12:07:40 271 1

Original: Variable Selection for LR

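Python has no built-in forward/backward stepwise method (as of this post). Use the RFE package instead.
How it works: you set the number of variables to keep as a parameter, and each round the least important variables are dropped.
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
Takeaway: consider using grid search to tune the number of features (n_features_to_select).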
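The takeaway above can be sketched as follows. The synthetic dataset, the candidate grid, and the choice of `LogisticRegression` as the base estimator are all assumptions made for illustration, not details from the post.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical): 10 features, 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# RFE refits the estimator repeatedly, dropping the least important
# feature each round until n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=1000))

# Let grid search pick how many features to keep.
search = GridSearchCV(
    selector, param_grid={"n_features_to_select": [2, 4, 6, 8]}, cv=5
)
search.fit(X, y)

best_k = search.best_params_["n_features_to_select"]
selected_mask = search.best_estimator_.support_  # True for kept features
print(best_k, selected_mask)
```

Grid search scores each candidate through the wrapped estimator, so the selected feature count is the one with the best cross-validated accuracy on this data.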

2017-12-20 14:49:59 865

Original: Some Statistics

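Pearson correlation coefficient
Spearman correlation coefficient
Pearson chi-square statistic
· Measures the association between two categorical variables; it is computed from the frequency counts in a contingency table
Likelihood-ratio test statistic
F test
· Measures the association between a continuous variable and a nominal variable
Gini variance
· Three cases: 1) a continuous variable and a nominal or ordinal variable; 2) two nominal variables; 3) two ordinal variables
Entropy variance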
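As an illustration of the Pearson chi-square statistic listed above, here is a minimal pure-Python sketch that computes it from a contingency table. The function name and the example tables are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


associated = chi_square_statistic([[10, 20], [20, 10]])
independent = chi_square_statistic([[10, 20], [20, 40]])
print(associated, independent)
```

The second table has proportional rows, so every observed count equals its expected count and the statistic is zero.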

2017-12-06 17:40:32 406

Original: Notes on Using the imbalanced-learn Package

Documentation: http://contrib.scikit-learn.org/imbalanced-learn/stable/
This time I used under-sampling. The class ratio was roughly 200:1.
# Resampled
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(ran

2017-12-01 11:19:00 6268

Original: Common Operations for Modeling in Python

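1. Splitting into training and test sets
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=45)
2. cross...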

2017-11-30 15:59:51 692

Original: Basic Pandas Operations

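Mostly about DataFrame.
1. Reading data
score_df = pd.read_csv('D:\\task1\\Data\\cleaned\\text', sep='\t', header=None)
Omitting header means the first row is used as the column names.
Note that if the path contains escape sequences such as '\t' or '\r', the backslashes must be doubled.
2. Renaming and adding column names
score_df.colum...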

2017-11-30 15:49:16 294

Original: Summary of My First Work Task

1. Examine the data and clarify the requirements
2. Decide on y
3. Sample
4. Build the model
5. Evaluate the model (K-S value, cross validation)

2017-11-30 15:30:57 268

Original: The K-S Value

from scipy.stats import ks_2samp
# y_pred and y_true must support boolean-mask indexing (e.g. NumPy arrays)
get_ks = lambda y_pred, y_true: ks_2samp(y_pred[y_true == 1], y_pred[y_true != 1]).statistic
get_ks(x, y)
https://www.cnblogs.com/bregman/p/6279261.html
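To show what the statistic measures, here is a pure-Python sketch of the two-sample K-S statistic: the largest gap between the empirical CDFs of the positive-class and negative-class scores. It is an illustrative reimplementation under that definition, not scipy's code, and the scores and labels are made up.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    def ecdf(sample, x):
        # Fraction of the sample that is <= x.
        return sum(v <= x for v in sample) / len(sample)

    # The gap can only change at observed values, so check those points.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


# Hypothetical model scores and true labels.
y_pred = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
y_true = [1, 1, 1, 0, 0, 0, 0, 1]

positives = [p for p, t in zip(y_pred, y_true) if t == 1]
negatives = [p for p, t in zip(y_pred, y_true) if t != 1]

ks = ks_statistic(positives, negatives)
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(ks, overlapping)
```

In the first case the two classes are perfectly separated by score, so the statistic reaches its maximum of 1; overlapping score distributions give a smaller value.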

2017-11-30 15:09:18 3485

Original: About Hive and SQL

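The following have been tested and work:
1. select name, max(id) from t1 group by name;
The result is fairly obvious: it selects the largest id for each name.
Taking a complement (set difference) in SQL
1. select s1.mobile from table1 s1 where s1.mobile not in (select s2.phone from table2 s2);
2. sele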
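The complement query can be tried out with Python's built-in `sqlite3` module, as in the sketch below. It runs on SQLite rather than Hive, the sample rows are made up, and the `LEFT JOIN ... IS NULL` variant is a common alternative shown for comparison; it is not claimed to be the post's truncated second query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (mobile TEXT);
    CREATE TABLE table2 (phone TEXT);
    INSERT INTO table1 VALUES ('111'), ('222'), ('333');
    INSERT INTO table2 VALUES ('222');
""")

# Rows of table1 whose mobile does not appear in table2 (NOT IN form).
not_in_rows = conn.execute(
    "SELECT s1.mobile FROM table1 s1 "
    "WHERE s1.mobile NOT IN (SELECT s2.phone FROM table2 s2) "
    "ORDER BY s1.mobile"
).fetchall()

# The same complement written as an anti-join (assumed alternative).
anti_join_rows = conn.execute(
    "SELECT s1.mobile FROM table1 s1 "
    "LEFT JOIN table2 s2 ON s1.mobile = s2.phone "
    "WHERE s2.phone IS NULL "
    "ORDER BY s1.mobile"
).fetchall()

conn.close()
print(not_in_rows, anti_join_rows)
```

One design point worth knowing: if the subquery column contains a NULL, `NOT IN` returns no rows at all under standard SQL semantics, while the anti-join form is unaffected.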

2017-09-08 15:48:52 280

Original: The Entropy Method

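The entropy method scores multiple alternatives. The weight of each indicator in the evaluation is determined by that indicator's entropy.
https://wenku.baidu.com/view/2c70a61c0b4e767f5acfce4d.html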
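The excerpt does not give the formulas, so the sketch below follows the commonly used form of the entropy weight method as an assumption: convert each indicator column to shares, compute its normalized entropy, and weight indicators by one minus that entropy. The function name and data are hypothetical, and all values are assumed to be positive, benefit-type indicators.

```python
import math


def entropy_weights(matrix):
    """Indicator weights from column entropies.

    matrix: rows are alternatives, columns are indicators (positive values).
    Assumes at least one column is non-uniform, otherwise all divergences
    are zero and the weights are undefined.
    """
    n = len(matrix)
    divergences = []
    for col in zip(*matrix):
        total = sum(col)
        shares = [v / total for v in col]
        # Normalized entropy in [0, 1]; 1 means the column is uniform.
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        divergences.append(1 - entropy)
    total_divergence = sum(divergences)
    return [d / total_divergence for d in divergences]


# Hypothetical data: indicator 0 is identical across alternatives,
# indicator 1 varies a lot.
weights = entropy_weights([[2, 1], [2, 5], [2, 9]])
print(weights)
```

An indicator that takes the same value for every alternative carries no information for ranking them, so nearly all of the weight goes to the indicator that varies.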

2017-08-30 17:23:23 7620

Original: Ridge Regression, LASSO, and Elastic Net

Today I reviewed the differences among these three. Ridge regression first. The formula for ordinary least squares regression is: $$\beta = (X^TX)^{-1}X^Ty$$
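For comparison with the least-squares formula above, ridge regression adds an L2 penalty to the objective, which gives the well-known closed form below. The penalty strength $\lambda \ge 0$ is a symbol introduced here, not taken from the excerpt:

$$\beta_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 = (X^TX + \lambda I)^{-1}X^Ty$$

Setting $\lambda = 0$ recovers the least-squares solution.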

2017-08-29 15:21:29 1465

Original: Pipeline: chaining a PCA and a logistic regression

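http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py
The tutorials I read after this were not as valuable as the previous article, but I still have to work through them one by one. There is a tool called pipeline: http://scikit-learn.org/s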
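A minimal sketch of chaining PCA and logistic regression with scikit-learn's `Pipeline`, in the spirit of the linked example. The number of components, `max_iter`, and the train/test split are assumptions chosen for illustration, not values from the post.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each step is a (name, estimator) pair; fit() runs PCA first, then
# trains the classifier on the reduced features.
pipe = Pipeline([
    ("pca", PCA(n_components=20)),
    ("logistic", LogisticRegression(max_iter=2000)),
])
pipe.fit(X_train, y_train)

# score() applies the fitted PCA to the test data before classifying.
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Wrapping both steps in one object means the PCA is fitted only on training data, which avoids leaking test information when the pipeline is later used inside cross-validation or grid search.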

2017-08-25 14:50:59 393

Original: Classifier Comparison in Python

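http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py
This article is information-dense: it applies many classification methods at once with very little detailed explanation, so you have to work through it yourself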

2017-08-24 14:46:51 802

Original: Random Forests in Python

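http://www.cnblogs.com/downtjs/archive/2013/08/28/3288203.html
http://blog.yhat.com/posts/random-forests-in-python.html
I followed the first link; the second link is the original article. In the original's random forest code, the Factor function is no longer available because of pandas updates. Line 9 “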

2017-08-24 14:14:53 396
