hive处理文本数据时需要指定分隔符,一般来说都是用逗号来做分隔,当某个字段的内容是字符串时,特别是有"{}"双引号括起来的json那种,hive处理时会直接将某个字段中的字符串内容中逗号也当成分隔符来处理,造成hive表格字段内容的异常,这里就需要用将字符串中的逗号替换掉。代码如下:
# -*- coding: utf-8 -*-
import re,os,sys
def alter(file_path, new_str):
#pat = re.compile(r"\"[a-zA-Z0-9_\-\+\s/\"\(\);]+,+[a-zA-Z0-9_\-\+\s,/\"\(\);]*\"")
pat = re.compile(r"\"[a-zA-Z0-9_\-\+\s/\"\(\);\.]+,+[a-zA-Z0-9_\-\+\s,/\"\(\);\.]*\"")
with open(file_path, 'r') as f, open("%s.bak" % file_path, "a") as f2:
for line in f:
res = pat.findall(line)
if len(res) == 0:
f2.write(line)
for r in res:
print r
r = r.replace(',', new_str)
print r
f2.write(re.sub(pat,