Hive: Using a Python UDF to Tokenize and Count Words in Large Volumes of Logs

In the Hive command line, you can run
add file /path/python/script.py
to add the script.

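Hive feeds the query result to the script's standard input. During the map phase, the Python script reads from standard input, processes the rows line by line, and writes the results to standard output.
For example:
select TRANSFORM(col1, col2) using 'python script.py'  as (newcol1, newcol2, newcol3) from tb 
Use "\t" to separate the output columns.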
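To make the stdin/stdout contract concrete, here is a minimal sketch of a TRANSFORM script, assuming two input columns (col1, col2) and three output columns as in the query above. The function name `transform_line` and the way the three output values are derived are hypothetical, chosen only for illustration.

```python
import io
import sys


def transform_line(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    # Hive sends each row as a single line, with columns separated by "\t".
    col1, col2 = line.rstrip("\n").split("\t", 1)
    # Hypothetical third column: the combined length of both inputs.
    newcol3 = str(len(col1) + len(col2))
    # Output columns are joined with "\t" so Hive can split them again.
    return "\t".join([col1.upper(), col2.lower(), newcol3])


def main(stream=sys.stdin):
    # One output line per input line.
    for line in stream:
        print(transform_line(line))


if __name__ == "__main__":
    # Demo with an in-memory row; in the Hive job, call main() with no
    # arguments so that it reads from sys.stdin.
    main(io.StringIO("Ab\tCD\n"))
```

A script like this can also be checked locally, before running it in Hive, by piping a few tab-separated lines into it.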

The following code uses NLTK's multi-word expression tokenizer (MWETokenizer) to tokenize the logs:



import sys

from nltk.tokenize import MWETokenizer

# Each entry lists the canonical phrase first, followed by its variants.
Synonyms = [
    ['iq scripts', 'i-q scripts'],
    ['acl lab', 'acl mosaiq lab']
]

# Handle phrases: a matching token sequence is merged into one token joined by "_".
tokenizer = MWETokenizer()
tokenizer.add_mwe(('windows', '7'))
tokenizer.add_mwe(('service', 'pack'))
tokenizer.add_mwe(('shanghai', 'team'))
# Note: '.' is replaced with a space below, so the token '2.64' never
# reaches the tokenizer and this entry cannot match as written.
tokenizer.add_mwe(('2.64', '104'))
tokenizer.add_mwe(('tier', '5'))
tokenizer.add_mwe(('tx', 'plans'))
tokenizer.add_mwe(('acl', 'lab'))
tokenizer.add_mwe(('iq', 'scripts'))
tokenizer.add_mwe(('order', 'set'))

# Cannot use nltk.corpus.stopwords here, so the stop-word list is inlined.
# from nltk.corpus import stopwords
# stop_words = set(stopwords.words('english'))
# A set is used so that membership tests are fast.
stop_words = set([u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u"you're", u"you've", u"you'll", u"you'd", u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u"that'll", u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u"don't", u'should', u"should've", u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u"aren't", u'couldn', u"couldn't", u'didn', u"didn't", u'doesn', u"doesn't", u'hadn', u"hadn't", u'hasn', u"hasn't", u'haven', u"haven't", u'isn', u"isn't", u'ma', u'mightn', u"mightn't", u'mustn', u"mustn't", u'needn', u"needn't", u'shan', u"shan't", u'shouldn', u"shouldn't", u'wasn', u"wasn't", u'weren', u"weren't", u'won', u"won't", u'wouldn', u"wouldn't"])

for line in sys.stdin:
    line = line.lower()

    # Handle symbols (cannot use re): replace each one with a space.
    # The order matters: "'s" must be replaced before the bare "'".
    for symbol in ['{', '.', '[', ']', '}', '(', ')', '/', '?', ',', "'s", "'", '"']:
        line = line.replace(symbol, ' ')

    # Handle synonyms: rewrite every variant to its canonical phrase.
    for s in Synonyms:
        for i in s[1:]:
            line = line.replace(i, s[0])

    # Tokenize, merge multi-word expressions, and drop stop words.
    words = [w for w in tokenizer.tokenize(line.split()) if w not in stop_words]

    # Count the occurrences of each word in this line.
    rs = {}
    for w in words:
        rs[w] = rs.get(w, 0) + 1

    # Flatten the counts to [word1, count1, word2, count2, ...].
    l = []
    for word, count in rs.items():
        l.append(word)
        l.append(count)

    # Print the list without its surrounding brackets. The parentheses keep
    # this statement valid on both Python 2 and Python 3.
    print(str(l)[1:-1])
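The script above prints each line's counts as a single comma-separated string, so Hive receives them as one column. As an alternative sketch (not the author's method), the counts could be emitted one word per row, with a tab between the word and its count, so that a query ending in something like `as (word, cnt)` maps them to two columns. The function name `count_rows` and the simplified whitespace tokenization are assumptions made for illustration; a real script would keep the MWETokenizer and stop-word filtering from the code above.

```python
import io
import sys
from collections import Counter


def count_rows(line):
    """Return one "word<TAB>count" string per distinct word in the line."""
    # Simplified tokenization (assumption): lowercase, split on whitespace.
    counts = Counter(line.lower().split())
    # Sort by word so the output order is deterministic.
    return ["%s\t%d" % (word, n) for word, n in sorted(counts.items())]


def main(stream=sys.stdin):
    # Emit every (word, count) pair as its own tab-separated row.
    for line in stream:
        for row in count_rows(line):
            print(row)


if __name__ == "__main__":
    # Demo with an in-memory log line; in the Hive job, call main() with
    # no arguments so that it reads from sys.stdin.
    main(io.StringIO("Order set order SET failed\n"))
```

With one word per row, the per-line counts can then be summed across the whole table with an ordinary GROUP BY in Hive.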

     


--------------------- This article is from 爱知菜's CSDN blog. Full text: https://blog.csdn.net/rav009/art ... 735?utm_source=copy
