[Original] pandas: coloring table cells based on a filter
psi_df2.style.applymap(lambda x: 'background-color:rgb(%d,0,0)' % (250 - x/0.1*30) if x > 0.1 else 'background-color:white')
2021-06-16 15:36:22 1294
[Repost] Hadoop Streaming: multiple output paths
https://www.cnblogs.com/shapherd/archive/2012/12/21/2827860.html
2020-09-14 19:12:54 159
[Original] Common shell commands
# Scheduled cleanup of logs older than 7 days
find $temp_path -mtime +7 -name "*.txt" -exec rm -rf {} \;
2020-07-07 12:53:40 143
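The same cleanup can be expressed in Python; a minimal sketch under my own naming (the helper and its defaults are illustrative, and it only lists candidates rather than deleting them):

```python
import time
from pathlib import Path

def old_logs(root, days=7, pattern='*.txt'):
    # files under root matching pattern whose mtime is older than
    # `days` days; mirrors: find $temp_path -mtime +7 -name "*.txt"
    cutoff = time.time() - days * 86400
    return [p for p in Path(root).rglob(pattern) if p.stat().st_mtime < cutoff]

# to actually delete, iterate over the result and call p.unlink()
```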
[Original] Common Python time functions
import datetime
old_date = datetime.datetime.strptime("20200530", "%Y%m%d")
for i in range(0, 15):
    event_day = (old_date + datetime.timedelta(days=i)).strftime("%Y-%m-%d")
2020-06-16 18:34:01 177
[Original] NLP word2vec
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import nltk.data
# nltk.download()
# from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from g...
2019-06-13 14:36:31 224
原创 spark submit
export PYSPARK_PYTHON=*/bin/python $SPARK_HOME/bin/spark-submit --conf "spark.yarn.dist.archives=/home/anaconda2.tar.gz" --conf "spark.executorEnv.PYSPARK_PYTHON=/home/anaconda2.tar.gz/anaconda...
2018-12-28 17:39:37 168
[Original] Running commands remotely over ssh
ssh name@10.1.1.1 sudo su - name -c 'sh ./qingyu/*.sh'
ssh name@10.1.1.1 sudo -i -u name sh ./qingyu/r
2018-12-28 14:08:11 227
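Quoting is the error-prone part of these one-liners, so here is a hedged sketch that builds the ssh argv in Python; the script path run.sh is a hypothetical stand-in for the glob in the post:

```python
import shlex

def remote_cmd(host, user, script, run_as=None):
    # build the argv for `ssh user@host ...`; when run_as is given,
    # wrap the command in `sudo su - run_as -c '...'` as in the post
    remote = 'sh %s' % shlex.quote(script)
    if run_as:
        remote = 'sudo su - %s -c %s' % (run_as, shlex.quote(remote))
    return ['ssh', '%s@%s' % (user, host), remote]

# run.sh is a hypothetical script name
argv = remote_cmd('10.1.1.1', 'name', './qingyu/run.sh', run_as='name')
# subprocess.run(argv, check=True)  # would execute over the network
```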
[Original] pyspark: deduplicating on a field, keeping the most recent record
See the official docs: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
from pyspark.sql import Window
cj_spouse_false = cj_spouse_false.withColumn("row_number", \ ...
2018-11-16 11:31:56 1424
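The Spark snippet above is cut off; the same keep-the-latest-per-key dedupe can be sketched in pandas so it runs without a cluster (the column names are illustrative):

```python
import pandas as pd

# toy records: an id may appear several times with different dates
df = pd.DataFrame({
    'id':   [1, 1, 2, 2],
    'date': ['2018-01-01', '2018-03-01', '2018-05-01', '2018-02-01'],
    'val':  ['a', 'b', 'c', 'd'],
})

# sort newest-first inside each id, then keep the first row per id;
# the same effect as filtering row_number() == 1 over a Window
latest = (df.sort_values('date', ascending=False)
            .drop_duplicates('id')
            .sort_values('id')
            .reset_index(drop=True))
```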
[Original] pyhive: connecting to Hive from Python
from pyhive import hive
conn = hive.Connection(host='1*.30', auth='LDAP', port=10000, username='*', password='*', database='*')
cursor = conn.cursor()
cursor.execute("select * from dp_ods.* where etl_date=...
2018-11-15 17:04:18 1103
[Original] Syncing a Hive table to Elasticsearch
For the setup I referred to hdfs://cluster/user/hive/warehouse/dm_userimage.db/f_baseinfo_online_t. A few caveats: 1. Queries after insertion can throw type-conversion exceptions caused by ES/Hive type mismatches; when creating the ES-backed table, change int to bigint, date to timestamp, and float to double. 2. ...
2018-11-14 10:55:42 1054
[Original] Invoking a pyspark script
Inside the script:
spark = SparkSession.builder.master('yarn-client').config('spark.executor.instances', 10).config('spark.executor.cores', 2). \
    config('spark.executor.memory', '4g').config('spark.sql.sh...
2018-11-13 15:21:01 1173
[Original] Hive query: adding row numbers within groups
base_info = spark.sql("select index, type, id, score, idcardinput, banknoteorremit, idcard, \
    accounttype, accountopendate, nameinput, subaccount, usablebalance, querydate as querydate_base, balance as ...
2018-10-22 15:54:54 964
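The row-numbering pattern behind this post can be shown with the stdlib sqlite3 module (window functions need SQLite 3.25+; the table and columns are illustrative, echoing names from the snippet):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table accounts (id int, querydate text)')
cur.executemany('insert into accounts values (?, ?)', [
    (1, '2018-01-01'), (1, '2018-02-01'), (2, '2018-01-15'),
])
# number rows inside each id group, newest first; the same
# row_number() over (partition by ... order by ...) shape as in HiveQL
rows = cur.execute('''
    select id, querydate,
           row_number() over (partition by id order by querydate desc) as rn
    from accounts
    order by id, rn
''').fetchall()
```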
[Original] Adding HDFS data directly to a Hive table partition via msck
/home/user_image/hadoop-2.7.2/bin/hadoop fs -mkdir hdfs://cluster/user/hive/warehouse/dm_userimage.db/f_userimage_messageinfo/etl_date=$yesterday
/home/user_image/hadoop-2.7.2/bin/hadoop fs -cp /NS2/...
2018-10-19 15:06:58 1225
[Original] Jupyter: fixing garbled Chinese by setting the encoding without breaking console output
stdi, stdo, stde = sys.stdin, sys.stdout, sys.stderr
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdin, sys.stdout, sys.stderr = stdi, stdo, stde
2018-10-17 13:43:13 5606
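The snippet above is a Python 2 idiom; reload(sys) and sys.setdefaultencoding are gone in Python 3. A Python 3 sketch of forcing UTF-8 onto a byte stream (in a live session you would wrap sys.stdout.buffer, or call sys.stdout.reconfigure(encoding='utf-8') on 3.7+):

```python
import io

# wrap a byte stream with an explicit UTF-8 text layer; BytesIO here
# stands in for sys.stdout.buffer so the sketch is self-contained
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf-8')
out.write(u'中文输出正常')
out.flush()
```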
[Original] pyspark: saving a temp table into Hive
yf_u3.registerTempTable('yf_u3')
spark.sql('DROP TABLE IF EXISTS dm_userimage.yf_u3')
spark.sql('create table dm_userimage.yf_u3 select * from yf_u3')
To write into a specific partition:
INSERT OVERWRITE TABLE employees PARTI...
2018-10-16 11:26:29 2985
[Original] Loading HDFS data directly into a Hive partition
load data inpath '/user/yjy_research/zhangyd/phonelist_result/2017-01-01' overwrite into table f_userimage_phonelist partition(etl_date='2017-01-01');
2018-10-12 17:33:34 1787
[Original] Creating a partitioned Hive table
SET mapreduce.job.queuename=yjy;
SET hive.cli.print.header=TRUE;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
use dm_userimage;
create table dm_userimage.f_u...
2018-10-12 17:32:32 136