I. Using path wildcards (glob patterns) in Spark when a job needs many input paths
1. Suppose Spark has to process the past half month of data, and each day's data is stored in its own dated directory on HDFS, which means close to 15 input directories. A brace-style glob pattern keeps the code simple:
import datetime

def produce_half_month(thedate):
    # Build a comma-separated string of the given day plus the 14 days
    # before it, e.g. "20190505,20190504,...,20190421".
    current_datetime = datetime.datetime.strptime(thedate, "%Y%m%d")
    days = [thedate]
    for i in range(1, 15):
        days.append((current_datetime - datetime.timedelta(days=i)).strftime("%Y%m%d"))
    match_days = ",".join(days)
    print(match_days)
    return match_days

thedate = "20190505"
hdfs_path = "/home/workdir/hdfs/log/hourly/{%s}*/click_exposure/*" % produce_half_month(thedate)
>>> print(hdfs_path)
/home/workdir/hdfs/log/hourly/{20190505,20190504,20190503,20190502,20190501,20190430,20190429,20190428,20190427,20190426,20190425,20190424,20190423,20190422,20190421}*/click_exposure/*
# Spark reads the whole half month of data from this single pattern:
data_rdd = sc.textFile(hdfs_path, use_unicode=False)
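As a side note, sc.textFile also accepts a comma-separated list of paths directly, so the same half month can be read without a brace glob at all. A minimal sketch, reusing produce_half_month from above (the path layout is the one assumed in this example):

# Alternative: expand the day list ourselves and hand sc.textFile a
# comma-separated list of per-day patterns, which it accepts natively.
days = produce_half_month("20190505").split(",")
paths = ",".join("/home/workdir/hdfs/log/hourly/%s*/click_exposure/*" % d
                 for d in days)
data_rdd = sc.textFile(paths, use_unicode=False)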
# The hadoop CLI can match the same half month of log data in one command:
hadoop dfs -ls /home/workdir/hdfs/log/hourly/{20190505,20190504,20190503,20190502,20190501,20190430,20190429,20190428,20190427,20190426,20190425,20190424,20190423,20190422,20190421}*/click_exposure/*
2. Other path-matching forms
[work@datazhe/project]$ hadoop dfs -ls /home/workdir/yao/tmp/user_app_list/new_{20190401,20190403}
Found 2 items
-rw-r--r-- 3 work work 0 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/_SUCCESS
-rw-r--r-- 3 work work 1420044 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/part-00000.gz
Found 2 items
-rw-r--r-- 3 work work 0 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/_SUCCESS
-rw-r--r-- 3 work work 656201 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/part-00000.gz
[work@datazhe/project]$ hadoop dfs -ls /home/workdir/yao/tmp/user_app_list/new_201904{01,03}
Found 2 items
-rw-r--r-- 3 work work 0 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/_SUCCESS
-rw-r--r-- 3 work work 1420044 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/part-00000.gz
Found 2 items
-rw-r--r-- 3 work work 0 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/_SUCCESS
-rw-r--r-- 3 work work 656201 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/part-00000.gz
[work@datazhe/project]$ hadoop dfs -ls /home/workdir/yao/tmp/user_app_list/new_2019040[1-3]
Found 2 items
-rw-r--r-- 3 work work 0 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/_SUCCESS
-rw-r--r-- 3 work work 1420044 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/part-00000.gz
Found 2 items
-rw-r--r-- 3 work work 0 2019-05-19 19:46 /home/workdir/yao/tmp/user_app_list/new_20190402/_SUCCESS
-rw-r--r-- 3 work work 1291293 2019-05-19 19:46 /home/workdir/yao/tmp/user_app_list/new_20190402/part-00000.gz
Found 2 items
-rw-r--r-- 3 work work 0 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/_SUCCESS
-rw-r--r-- 3 work work 656201 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/part-00000.gz
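Before submitting a job, it can be worth dry-running a pattern against HDFS to confirm it matches what you expect. A small helper sketched for this write-up (glob_preview is a hypothetical convenience function, not part of any API); it simply shells out to hadoop fs -ls:

import subprocess

def glob_preview(pattern):
    # Dry-run a path pattern against HDFS before launching a job.
    # Returns the exit code of `hadoop fs -ls`; a nonzero code
    # typically means nothing matched.
    return subprocess.call(["hadoop", "fs", "-ls", pattern])

glob_preview("/home/workdir/yao/tmp/user_app_list/new_2019040[1-3]")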
II. A glob in a Hadoop input path being split into multiple arguments
Running the command:
hadoop jar wordcount.jar com.WordCount /inpath/*{beijing,shanghai,guangzhou}* /outpath/
The shell performs brace expansion on /inpath/*{beijing,shanghai,guangzhou}* before hadoop ever sees it, splitting it into several input-path arguments, so the driver's second positional argument is no longer the output path.
The fix is to quote the pattern so it reaches Hadoop intact and is expanded by Hadoop's own path globbing:
hadoop jar wordcount.jar com.WordCount /inpath/'{*beijing*,*shanghai*,*guangzhou*}' /outpath/
With the pattern quoted, the job runs as intended.
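An alternative, if the job is launched from a script rather than typed into a shell: invoking hadoop through Python's subprocess with an argument list bypasses the shell entirely, so no brace expansion happens and no quoting is needed. A minimal sketch (wordcount.jar and com.WordCount as in the example above):

import subprocess

# Passing an argument list (no shell=True) means no shell touches the
# arguments, so the braces reach Hadoop as one intact pattern.
subprocess.call([
    "hadoop", "jar", "wordcount.jar", "com.WordCount",
    "/inpath/{*beijing*,*shanghai*,*guangzhou*}",  # stays a single argument
    "/outpath/",
])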