March 2021_FocusOneThread

Original: its rank is undefined, but the layer requires a defined rank

The input tensor passed to tf.layers.dense needs a defined shape; set it explicitly with tf.reshape first.

2021-03-30 09:49:37 826

Original: CTR model inputs boil down to three types of features

int list: an id list; float list: features such as price and sales volume; fixed float list: a fixed embedding.
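To make the three input types concrete, here is a minimal sketch of one sample as a plain Python dataclass. The class name `CtrSample`, its field names, and all values are hypothetical and only illustrate the shapes involved; reading "fixed" as "precomputed and not updated" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CtrSample:
    """One sample holding the three input types (all names are hypothetical)."""
    id_list: List[int]            # int list: ids, typically looked up in an embedding table
    float_list: List[float]       # float list: numeric features such as price or sales volume
    fixed_embedding: List[float]  # fixed float list: an embedding assumed to be precomputed


sample = CtrSample(
    id_list=[3, 4, 5],
    float_list=[19.9, 1200.0],
    fixed_embedding=[0.12, -0.40, 0.33, 0.05],
)
print(sample)
```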

2021-03-26 10:11:10 833

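Original: Spark: concat several columns into one new column (to join on) instead of mapping over all columns

import org.apache.spark.sql.functions.{col, concat_ws}
dataFrame = dataFrame.withColumn("the_key", concat_ws("-", col("column1"), col("column2")))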

2021-03-17 18:02:55 855

Original: Spark java.lang.ClassCastException

Full error message: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1588) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:

2021-03-17 11:38:18 365

Original: Spark: an example of filling in default values

dataFrame.na.fill(Map("column1" -> "0", "column2" -> "-1"))

2021-03-16 16:30:47 844

Original: Spark: use a udf to operate on only a few DataFrame columns instead of mapping over all columns

Define the UDF:
import org.apache.spark.sql.functions.udf
def theUDF = udf((inputColumn1: String, inputColumn2: BigInt) => {
  var resultColumn = 0
  inputColumn1.split(",").foreach(item => {
    if (java.lang.Long.valueOf(item).equals(inputColumn2)) {

2021-03-16 16:23:44 550

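Original: Two ways to embed features for CTR tasks

Take the click-through rate of a user on a POI: the user recently bought items 1, 2, 3, 4, and the POI's best-selling items are 3, 4, 5, 6. The first embedding approach gives each item its own embedding, 6 embeddings in total. The second approach gives each pair of items its own embedding, 4 × 4 = 16 embeddings in total...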
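The two counts can be checked with a short sketch. It assumes that "each pair of items" means one user item crossed with one POI item; the variable names are made up for illustration.

```python
from itertools import product

user_items = [1, 2, 3, 4]  # items the user recently bought
poi_items = [3, 4, 5, 6]   # the POI's best-selling items

# Approach 1: one embedding per distinct item id.
single_keys = sorted(set(user_items) | set(poi_items))

# Approach 2: one embedding per (user item, POI item) pair.
pair_keys = list(product(user_items, poi_items))

print(len(single_keys))  # 6
print(len(pair_keys))    # 16
```

The first approach shares the embeddings of items 3 and 4 between the two sides, while the second grows with the product of the two list lengths.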

2021-03-16 14:13:45 506

Original: The best way to learn mathematics is to do mathematics

The only way to learn mathematics is to do mathematics. “Doing mathematics” means a lot more than writing a solution to a math problem - it means thinking deeply about math, struggling with math, communicating about math, practicing math skills, and trying

2021-03-16 09:28:10 125

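Original: ctr, cxr, and rpm in recommender systems

ctr: clicks divided by impressions; cxr: orders divided by impressions; rpm: revenue divided by impressions.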
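A minimal sketch of the three ratios, following the definitions above as written (no per-thousand scaling is applied to rpm). The helper `rate` and the sample numbers are hypothetical.

```python
def rate(numerator: float, impressions: int) -> float:
    """Divide by the impression count, returning 0.0 when nothing was shown."""
    return numerator / impressions if impressions else 0.0


# Made-up totals for one time window.
impressions = 10_000
clicks = 250
orders = 40
revenue = 1_200.0

ctr = rate(clicks, impressions)   # clicks / impressions
cxr = rate(orders, impressions)   # orders / impressions
rpm = rate(revenue, impressions)  # revenue / impressions

print(ctr, cxr, rpm)
```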

2021-03-15 11:41:58 1606

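Original: pv events, mv events, mc events

pv (page view): the number of times a page is opened, recorded whether or not the user views the sub-modules inside it. mv (module view): the number of times a user, with the page open, views a sub-module in the page. mc (module click): the number of times a user, with the page open, clicks a sub-module in the page. ...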
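As a rough illustration, the three counters can be tallied from an event log. The log format below (a list of dicts with `user` and `type` keys) is an assumption made for this sketch, not a real tracking schema.

```python
from collections import Counter

# Hypothetical event log: u2 opens the page but never reaches the module,
# so that visit produces a pv without a matching mv.
events = [
    {"user": "u1", "type": "pv"},
    {"user": "u1", "type": "mv"},
    {"user": "u1", "type": "mc"},
    {"user": "u2", "type": "pv"},
    {"user": "u3", "type": "pv"},
    {"user": "u3", "type": "mv"},
]

counts = Counter(event["type"] for event in events)
print(counts["pv"], counts["mv"], counts["mc"])  # 3 2 1
```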

2021-03-15 11:40:09 1656

Original: Printing AUC with tensorflow tf.estimator

auc, auc_op = tf.metrics.auc(labels=labels, predictions=tf.sigmoid(logits))
if mode == tf.estimator.ModeKeys.EVAL:
    eval_metrics = {"auc": (auc, auc_op)}
    output_spec = tf.estimator.EstimatorSpec(
        mode=mode,
        loss=total

2021-03-12 15:28:55 1017

Original: Passing arguments to roc_auc_score

from sklearn.metrics import roc_auc_score
y_true = [0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.7, 0.8, 0.9]
print(roc_auc_score(y_true, y_score))
y_score = [0.7, 0.8, 0.9, 0.1, 0.2]
print(roc_auc_score(y_true, y_score))
Output:
1.0
0.33333
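The two printed values can be reproduced without sklearn by counting, over all (positive, negative) pairs, how often the positive sample gets the higher score. `pairwise_auc` is a hypothetical helper written for this sketch, not part of sklearn; it shows that each score is matched to its label by position.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1, 1]
print(pairwise_auc(y_true, [0.1, 0.2, 0.7, 0.8, 0.9]))  # every positive outranks every negative
print(pairwise_auc(y_true, [0.7, 0.8, 0.9, 0.1, 0.2]))  # only 2 of the 6 pairs are ranked correctly
```

In the second call the positives receive 0.9, 0.1, 0.2 and the negatives 0.7, 0.8, so only the pairs (0.9, 0.7) and (0.9, 0.8) are ordered correctly, giving 2/6.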

2021-03-12 11:00:03 1317

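Original: [Notes] Preparing positive and negative samples for recommender-system CTR models

www.zhihu.com/question/324986054
1. In feed scenarios, when using exposure (impression) logs, choose the logs from the app's SDK event tracking rather than the logs of what the server's Web API returned, because the Web API logs add many invalid negative samples. For example, the Web API returns 10 items per request, but the app screen can show at most 3. The remaining 7 count as truly exposed only after the user scrolls the feed, and many users do not scroll or scroll only slightly, so those 7 items are never actually exposed in the app.
2. Regarding the same content being exposed to the same user at different times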
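The effect described in the first point can be sketched with toy data: labeling everything the server returned yields far more negatives than labeling only what the SDK logged as shown. The function `build_samples` and the id lists are hypothetical.

```python
def build_samples(exposed_ids, clicked_ids):
    """Label each exposed item 1 if it was clicked, else 0."""
    clicked = set(clicked_ids)
    return [(item_id, 1 if item_id in clicked else 0) for item_id in exposed_ids]


server_returned = list(range(1, 11))  # the Web API returned 10 items
sdk_exposed = [1, 2, 3]               # only 3 were actually shown on screen
clicked = [2]

from_server = build_samples(server_returned, clicked)
from_sdk = build_samples(sdk_exposed, clicked)

negatives_server = sum(1 for _, label in from_server if label == 0)
negatives_sdk = sum(1 for _, label in from_sdk if label == 0)
print(negatives_server, negatives_sdk)  # 9 2
```

Seven of the nine server-side negatives are items the user never saw, which is the invalid-negative problem the note warns about.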

2021-03-12 10:29:56 857

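Original: Invalid argument: Key: XXX. Can't parse serialized Example.

This usually means the dimensions do not match: the dimensions stored in the tfrecord differ from the dimensions declared in the code.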

2021-03-11 15:37:25 1267

Original: vim: garbled text when pasting with ctrl+v

Run :set paste first, then paste.

2021-03-11 10:12:07 385

Original: What feature index means as input to a CTR model

Suppose the set of all feature values is 0.1 0.2 0.5 0.7; the feature index maps these values to 0 1 2 3. For example, if a row of data is 0.5 0.7 0.1, the feature index of that row is 2 3 0.
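The mapping above can be sketched in a few lines. Assigning indices in sorted order of the values is an assumption that happens to match the example; the names are illustrative.

```python
values = [0.1, 0.2, 0.5, 0.7]  # the set of all feature values

# Build the value -> index vocabulary (sorted order is an assumption).
value_to_index = {value: index for index, value in enumerate(sorted(set(values)))}

row = [0.5, 0.7, 0.1]
row_index = [value_to_index[value] for value in row]
print(row_index)  # [2, 3, 0]
```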

2021-03-04 14:36:11 402 1

Original: Indexing a tensor in tensorflow

This example shows a way to index a tensor without using another tensor as the index.
import tensorflow as tf
input_tensor = tf.random_uniform([2,4,3])
index = tf.placeholder(tf.int32)
index2 = tf.placeholder(tf.int32)
print(input_tensor[:,0:2].shape) # one way
output = input_tensor[:,index:index2] # another way
sess

2021-03-03 09:43:14 371

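Original: BORT reading notes

"Optimal Subarchitecture Extraction For BERT" uses neural architecture search to find an optimal BERT. The optimal BERT finally found: where D is the number of transformer encoder layers, A is the number of attention heads, H is the hidden size, and I is the intermediate layer size. It compares ordinary pre-training with distillation-based pre-training (the second and third columns below): ...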

2021-03-02 10:43:21 157