nlp
yangdelu855
Algorithm Engineer
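How to run BERT tokenization with Hadoop
The task is to count word frequencies after BERT tokenization. It needs three pieces of code: a mapper, a reducer, and a shell script. The first step is to implement the BERT tokenizer so that it reads the file through sys.stdin and prints the result directly.
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version ..
Original · 2020-07-02 11:24:32 · 1544 views · 0 comments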
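The mapper/reducer split described in this excerpt can be sketched roughly as below. This is a minimal illustration of the usual streaming word-count pattern, not the post's actual code: `tokenize` is a hypothetical whitespace placeholder standing in for the BERT tokenizer (whose implementation is truncated in the excerpt), and the `map`/`reduce` command-line argument is likewise an assumption.

```python
import sys
from itertools import groupby


def tokenize(text):
    # Placeholder (assumption): whitespace split stands in for the BERT tokenizer.
    return text.strip().split()


def map_lines(lines):
    """Mapper: emit one 'token<TAB>1' record per token."""
    for line in lines:
        for token in tokenize(line):
            yield f"{token}\t1"


def reduce_pairs(lines):
    """Reducer: sum the counts per token; input must already be sorted by token."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for token, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{token}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical CLI: pass 'map' or 'reduce' as the first argument.
    step = map_lines if sys.argv[1] == "map" else reduce_pairs
    for record in step(sys.stdin):
        print(record)
```

The reducer relies on its input arriving grouped by key, which is what the sort between the map and reduce phases provides. Locally, the same flow can be imitated by piping the mapper output through `sort` before the reducer.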
YDL | Tips for NLP-1
I wanted to run a baseline and found that its input is a json file, so the txt file has to be converted first. Code:
# -*- coding: utf-8 -*-
import json
import random
f1 = open('new_train_withintro.txt', encoding='utf-8')
c = open('temp.json', 'w', encoding='utf-8')
readline1 ...
Original · 2018-06-27 23:54:43 · 175 views · 0 comments
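Since the excerpt is cut off before the conversion logic, the following is only a guess at the general shape of such a txt-to-json step, not the post's code. It assumes each non-empty line becomes one record with a hypothetical `"text"` field, that `random` is used to shuffle the records, and that the output is one JSON object per line.

```python
import json
import random


def lines_to_records(lines, seed=None):
    """Turn non-empty text lines into dict records and shuffle them."""
    # The "text" field name is an assumption made for illustration.
    records = [{"text": line.strip()} for line in lines if line.strip()]
    random.Random(seed).shuffle(records)
    return records


def dump_json_lines(records):
    """Serialize records as one JSON object per line, keeping non-ASCII readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Passing `ensure_ascii=False` keeps Chinese text readable in the output file instead of writing `\uXXXX` escapes, which matters when the output is opened with `encoding='utf-8'` as in the excerpt.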
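First attempt at web crawling | Scraping article URLs from Yiche (易车网)
Target site: news.bitauto.com/
Because the "load more" button on the recommended page is awkward to drive, I crawl a single category page instead, for example the New Cars page. Right-click on the page, choose Inspect, and locate the target element at /html/body/div[3]/div/div[1]/div[3]/div/div/h2/a (the XPath Helper extension is recommended, since it lets you copy the XPath directly).
#coding: utf8
from selenium impo...
Original · 2019-08-19 17:50:45 · 1062 views · 2 comments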
Regex matching for Taobao passcodes (淘口令)
Match queries that contain a Taobao passcode.
# coding:utf-8
import re
f=open("query","r")
w=open("tkl_in_query","w")
readline=f.readlines()
pat_list=["₳","$","¢","₴","€","₤","¥","$","《"]
patt=[]
for key in pat_list: pat=re.co...
Original · 2019-08-28 15:49:34 · 6178 views · 0 comments
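The excerpt stops right where the patterns are compiled, so the sketch below only illustrates one plausible way to build a pattern per delimiter symbol. It assumes, purely for illustration, that a passcode is a run of 8 to 12 ASCII letters or digits wrapped in the same symbol on both sides; the real format used in the post is not shown. `has_passcode` is a hypothetical helper name.

```python
import re

# Delimiter symbols taken from the excerpt's pat_list (duplicates dropped).
SYMBOLS = ["₳", "$", "¢", "₴", "€", "₤", "¥", "《"]

# Assumption: symbol + 8-12 alphanumeric characters + the same symbol.
# re.escape is needed because "$" is a regex metacharacter.
PATTERNS = [
    re.compile(re.escape(s) + r"[A-Za-z0-9]{8,12}" + re.escape(s))
    for s in SYMBOLS
]


def has_passcode(query):
    """Return True if the query contains something shaped like a passcode."""
    return any(p.search(query) for p in PATTERNS)
```

Escaping each symbol before compiling is the key detail: an unescaped `$` would be read as an end-of-line anchor rather than a literal character.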