Recently I needed to write crawlers for some online dictionaries; here is a walkthrough of the steps.
Dictionaries from two websites need to be crawled (I won't name the sites; this just illustrates the approach):
(1)https://aaaaaa/bbbbbb
(2)https://xxxxxx/yyyyyy
The workflow consists of 5 steps:
(1) First download all the dictionary pages, then parse them and generate the base word list (one crawler script per website):
ResortCrawler_aaaaaa.py
ResortCrawler_xxxxxx.py (the crawler is shown below)
#!/usr/bin/python
# -*- coding: UTF-8 -*-
########################################################
## This file: crawl the dictionary
########################################################
## My regex notes
## Regex reference: https://www.runoob.com/python/python-reg-expressions.html
## pattern = re.compile("^<a{2}[^a-z\s\S]*>[.]</a>?")
## 1. ^: (1) at the start it anchors the match, so the string must begin with <; (2) inside [] it means negation
## 2. {2}: quantifier, match exactly twice
## 3. []: a character class matching a single character; \s: a whitespace character, \S: a non-whitespace character
## 4. *: quantifier, match 0 or more times, up to the > that follows
## 5. +: quantifier, match at least once
## 6. ?: after a quantifier makes it lazy, stopping at the first match; without ? it matches greedily up to the last one
## 7. (): a capture group:
# s = '<li><a href="https://en.bab.la">A</a></li>'
# urls = re.findall(re.compile(pattern), s)
# (1) pattern = re.compile('<a href="https:[\s\S]+?">A</a>')
# result: ['<a href="https://en.bab.la">A</a>']
# (2) pattern = re.compile('<a href="(https:[\s\S]+?)">A</a>')
# result: ['https://en.bab.la']
########################################################
## Crawler notes:
## 1. Find the page encoding - in the browser Console - document.charset
## 2. Directory of the current script: os.path.dirname(os.path.realpath(__file__))
## 3. Python 2 and Python 3 set the default encoding differently; since this is a tool it has to support both (see IsPython2())
## 4. Regex: find every matching string -- whatever ([\S\s]+?) captures is returned
# expr = re.compile('class="result-container">([\S\s]+?)<')
# words = re.findall(expr, string)
## 5. HTML parsing with BeautifulSoup
# soup = BeautifulSoup(str(data), 'lxml') ## parse into a BeautifulSoup object
# (1) select the elements whose class is 'letter-nav' (select returns a list, since there may be several)
# divList = soup.select('.letter-nav')[0]
# (2) multiple selectors can also be chained as descendants: 4 inside 3 inside 2 inside 1
# aFileDivList = soup.select('.content-column .content .dict-select-wrapper .dict-select-column')
## 6. The tool has to run on Mac/Windows (on Windows usually from PyCharm), so when one script launches another, use absolute paths
## 7. pandas: the crawled dictionary is usually written to a local file (Excel) and then processed in several passes; pandas makes that Excel handling convenient (see the pandas sketch after this script)
########################################################
import re
import os
import sys
import random
import requests
from optparse import OptionParser
from bs4 import BeautifulSoup
import platform
def IsPython2():
return platform.python_version().startswith('2.7')
if IsPython2():
reload(sys)
sys.setdefaultencoding('utf-8')
global m_ForceDownload2
m_ForceDownload2 = True
# pick a random request header
USER_AGENT = [
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Mobile Safari/537.36"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]
def getheaders():
len_ = len(USER_AGENT)
index_random = random.randint(0, len_-1)
agent = USER_AGENT[index_random]
headers = {'User-Agent': agent}
return headers
def getHtml(url):
headers = getheaders() # 1. random User-Agent header
html = ""
try:
page = requests.get(url, headers=headers, timeout=50)
html = page.content.decode("UTF-8")
# print("能用ip:"+ip)
except Exception as e:
print("请求error:"+str(e))
pass
return html
def read(fileName):
if IsPython2():
f = open(fileName, "r")
else:
f = open(fileName, "r", encoding="utf-8")
str_ = f.read()
f.close()
return str_
def readlines(fileName):
if IsPython2():
f = open(fileName, "r")
else:
f = open(fileName, "r", encoding="utf-8")
str_ = f.readlines()
f.close()
return str_
def write(fileName, str_):
if IsPython2():
f = open(fileName, "w")
else:
f = open(fileName, "w", encoding="utf-8")
str_ = f.write(str_)
f.close()
def printRed(stri):
print("\033[1;35;40m" + str(stri))
def NeedDownload2(needCount):
global m_ForceDownload2
if m_ForceDownload2:
m_ForceDownload2 = False
return True
hasCount = 0
for fn in os.listdir(PATH_2_A_Z):
if '.html' in fn:
hasCount += 1
return hasCount != needCount
def GetWord(word): # strip leading and trailing spaces
# strip leading spaces
while word[0] == ' ':
word = word[1:]
# strip trailing spaces
while word[-1] == ' ':
word = word[0:-1]
return word
def Url3DownloadedCount():
hasCount = 0
for fn in os.listdir(PATH_3_A_Z):
if '.html' in fn:
hasCount += 1
return hasCount
def NeedDownload3(needCount):
hasCount = Url3DownloadedCount()
print('level-3 urls------[' + str(hasCount) + '/' + str(needCount) + ']')
return hasCount != needCount
def DownloadParseDictionary():
# 1. *************************************** download the level-1 url ***************************************
print("checking whether to download......url......first: " + m_RUL_PRE)
filename = os.path.join(PATH_1_A_Z, "1.html")
if os.path.isfile(filename):
print("文件已存在-不需再次下载:"+filename)
else:
html = getHtml(m_RUL_PRE)
if len(str(html)) > 1:
write(filename, html)
print("babla_down_success(First):"+m_RUL_PRE)
else:
print("babla_down_fail(First):"+m_RUL_PRE)
html = read(filename)
soup = BeautifulSoup(html, 'lxml')
data = str(soup.select('.letter-nav')[0])
print(data)
urls = re.findall(re.compile('(http[\S\s]+?)"'), data)
m_urls = []
for url in urls:
if '0-9' not in url:
m_urls.append(url)
print('total number of letters--[' + str(len(m_urls)) + ']')
if len(m_urls) < 5:
print('problem downloading the first page------contact lxz')
sys.exit()
# 2. *************************************** download the level-2 urls ***************************************
filename2s = []
while NeedDownload2(len(m_urls)):
filename2s = []
for url in m_urls:
_ = url.split('/')
filename = os.path.join(PATH_2_A_Z, _[-2] + '_' + _[-1] + '.html')
filename2s.append(filename)
if os.path.isfile(filename):
print("不需下载。。。。。。url。。。。。。second。。。。。。:" + url)
continue
print("下载。。。。。。url。。。。。。second。。。。。。:" + url)
html = getHtml(url)
if len(str(html)) > 500:
write(filename, html)
write(filename.replace(PATH_2_A_Z, PATH_3_A_Z), html)
print("babla_down_success(Second):"+url)
else:
print("babla_down_fail(Second):"+url)
# 3. *************************************** parse out the level-3 urls ***************************************
url3s = []
for filename in filename2s:
_char = filename.split(os.path.sep)[-1].split('_')[0]
data = read(filename)
soup = BeautifulSoup(data, 'lxml')
print("解析出三级url。。。。。。" + filename)
data = str(soup.select('.dict-pag')[0])
# print('dict-pag......' + data)
# 上面找出url部分
urls = re.findall(re.compile('href="([\S\s]+?)"'), data)
if len(urls) == 0 or len(urls) == 1:
url3s.append(m_RUL_PRE + _char + '/' + str(1))
else:
print(urls)
maxIndexData = urls[len(urls) - 1].split('/')
maxIndex = maxIndexData[-1]
for i in range(1, int(maxIndex) + 1):
url3s.append(m_RUL_PRE + _char + '/' + str(i))
urlss = ''
for url in url3s:
urlss += (url + '\n')
write(os.path.join(PATH_2_A_Z, "1onlySee3url.csv"), urlss)
# 4. *************************************** download the level-3 urls ***************************************
while NeedDownload3(len(url3s)):
_url3HasCount = Url3DownloadedCount()
for url in url3s:
filename = url.split('/')[-2] + '_' + url.split('/')[-1] + '.html'
filename = os.path.join(PATH_3_A_Z, filename)
print(filename)
if os.path.isfile(filename):
print("无需下载。。。。。。url。。。。。。third。。。。。。:" + url)
continue
print("正在下载。。。。。。url。。。。。。third。。。。。。:" + url)
html = getHtml(url)
if len(str(html)) > 200:
write(filename, html)
_url3HasCount += 1
print("babla_down_success(Third):"+url)
else:
print("babla_down_fail(Third):"+url)
print("下载进度------------url------------3------------:[" + str(_url3HasCount) + '/' + str(len(url3s)) + ']')
# 5. *************************************** 解析出三级url中的单词 ***************************************
all_words = []
_parse_count = 0
for url in url3s:
_parse_count += 1
filename = url.split('/')[-2] + '_' + url.split('/')[-1] + '.html'
if _parse_count % 20 == 0 or _parse_count == len(url3s):
print("正在解析------------3级网页------------[" + filename + "]------[" + str(_parse_count) + '/' + str(len(url3s)) + ']')
filename = os.path.join(PATH_3_A_Z, filename)
soup = BeautifulSoup(read(filename), 'lxml')
aFileDivList = soup.select('.content-column .content .dict-select-wrapper .dict-select-column')
for aDiv in aFileDivList:
aDivWords = re.findall(re.compile('</span>([\s\S]+?)</a>'), str(aDiv))
for aWord in aDivWords:
aWord = GetWord(aWord)
if ' ' not in aWord:
all_words.append(aWord)
dic_words = ''
for aWord in all_words:
dic_words += (aWord + '\n')
write(m_RESULT, dic_words)
if __name__ == '__main__':
parser = OptionParser()
parser.add_option(
"-t",
"--transLan",
dest="TransLan",
default="french-english",
help="翻译选项[french-english/russian-english]")
(opts, args) = parser.parse_args()
TransLan = opts.TransLan
## configuration / setup
RootPath = os.path.dirname(os.path.realpath(__file__))
print(RootPath)
PATH_1_A_Z = os.path.join(RootPath, "babla", TransLan, "1_a_z")
PATH_2_A_Z = os.path.join(RootPath, "babla", TransLan, "2_a_z")
PATH_3_A_Z = os.path.join(RootPath, "babla", TransLan, "3_a_z")
PATH_4_A_Z = os.path.join(RootPath, "babla", TransLan, "4_a_z")
## used as globals -- start
m_RUL_PRE = "https://en.bab.la/dictionary/" + TransLan + "/"
m_RESULT = os.path.join(RootPath, "babla", TransLan, "Dictionary.csv")
## used as globals -- end
if not os.path.exists(PATH_1_A_Z):
os.makedirs(PATH_1_A_Z)
if not os.path.exists(PATH_2_A_Z):
os.makedirs(PATH_2_A_Z)
if not os.path.exists(PATH_3_A_Z):
os.makedirs(PATH_3_A_Z)
# 1. download and parse out all the words
DownloadParseDictionary()
# 2. deduplicate the words
print("Remove-Same-Word")
os.system(os.path.join(RootPath, 'RemoveSameWord.py -p babla', TransLan, 'Dictionary.csv'))
# 3. remove illegal words
print("Remove-Illegal-Word")
os.system(os.path.join(RootPath, 'RemoveIllegalWords.py -p babla', TransLan))
# 4. Google translation
google_lan_dic = {
'indonesian-english': 'id',
'french-english': 'fr',
'german-english': 'de',
'italian-english': 'it',
'spanish-english': 'es',
'russian-english': 'ru',
'portuguese-english': 'pt',
'norwegian-english': 'no',
'czech-english': 'cs',
'danish-english': 'da',
'dutch-english': 'nl',
'polish-english': 'pl',
'swedish-english': 'sv',
'japanese-english': 'ja',
}
if TransLan not in google_lan_dic.keys():
sys.exit()
print("--------GooggleTranslate语言:" + google_lan_dic[TransLan])
print("GoogleTranslate-Word")
os.system(os.path.join(RootPath, 'GoogleTranslate.py -f ' + google_lan_dic[TransLan] + ' -p babla', TransLan))
# 5. remove sensitive words
print("Remove-Sensitive-Word")
os.system(os.path.join(RootPath, 'RemoveSensitiveWord.py -p babla', TransLan))
os.makedirs(os.path.join(RootPath, "babla", TransLan, TransLan+"-dic"))
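As mentioned in crawler note 7 above, pandas is handy for the local post-processing. A minimal sketch of loading the final word/translation pairs and writing an Excel copy; the file and column names below are only illustrative assumptions, not part of the pipeline:

import pandas as pd

# the pipeline ends with one "word,translation" pair per line and no header row
df = pd.read_csv("babla/french-english/Dictionary_legal_google.csv",
                 header=None, names=["word", "translation"])
df = df.drop_duplicates(subset="word")              # keep the first occurrence of each word
df = df.sort_values("word").reset_index(drop=True)  # stable alphabetical order
df.to_excel("babla/french-english/Dictionary_legal_google.xlsx", index=False)  # needs openpyxl installed
print(df.head())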
(2) Deduplicate the base word list (a faster set-based variant is sketched after this script):
RemoveSameWord.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
import os
import sys
import random
import requests
from optparse import OptionParser
reload(sys)
sys.setdefaultencoding('utf-8')
DICTIONARY = "Dictionary.csv"
DICTIONARY_LEGAL = "Dictionary_legal.csv"
RootPath = os.path.dirname(os.path.realpath(__file__))
print(RootPath)
def read(fileName):
f = open(fileName, "r")
str_ = f.read()
f.close()
return str_
def readlines(fileName):
f = open(fileName, "r")
str_ = f.readlines()
f.close()
return str_
def write(fileName, str_):
f = open(fileName, "w")
str_ = f.write(str_)
f.close()
if __name__ == '__main__':
parser = OptionParser()
parser.add_option(
"-p",
"--dicPath",
dest="dicPath",
default="cambridge/indonesian-english/Dictionary_legal_google.csv",
help="单词去重路径-去重后依然写入原文件[cambridge/indonesian-english/Dictionary_legal.csv]")
(opts, args) = parser.parse_args()
DicPath = opts.dicPath
DicPath = os.path.join(RootPath, DicPath)
print(DicPath)
if not os.path.exists(DicPath):
print("没有该字典路径:" + DicPath)
sys.exit(1)
# 1. deduplicate
data = readlines(DicPath)
m_words = []
words = ''
for word in data:
word = word.replace('\n', '')
if word not in m_words:
m_words.append(word)
words += (word + '\n')
write(DicPath, words)
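The "word not in m_words" test above scans a list, so the whole dedup pass is quadratic. A sketch of an order-preserving, set-based alternative, assuming the same one-word-per-line file layout (the helper name dedup_file is my own):

import io

def dedup_file(path):
    # keep the first occurrence of each word, in the original order
    seen = set()
    kept = []
    with io.open(path, "r", encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word and word not in seen:
                seen.add(word)
                kept.append(word)
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"\n".join(kept) + u"\n")

# example: dedup_file("babla/french-english/Dictionary.csv")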
(3) Apply a set of rules to generate the desired legal word list (an equivalent single-regex check is sketched after this script):
RemoveIllegalWords.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
import os
import sys
import random
import requests
from optparse import OptionParser
reload(sys)
sys.setdefaultencoding('utf-8')
DICTIONARY = "Dictionary.csv"
DICTIONARY_LEGAL = "Dictionary_legal.csv"
RootPath = os.path.dirname(os.path.realpath(__file__))
print(RootPath)
def read(fileName):
f = open(fileName, "r")
str_ = f.read()
f.close()
return str_
def readlines(fileName):
f = open(fileName, "r")
str_ = f.readlines()
f.close()
return str_
def write(fileName, str_):
f = open(fileName, "w")
str_ = f.write(str_)
f.close()
def printRed(stri):
print("\033[1;35;40m" + str(stri))
def FilterLegalWord():
# characters to exclude: "," "." "/" "?" ";" "@" "#" "%" "-" "!" "(" ")" digits 0-9 and uppercase letters
illegalChars = [' ', ',', '.', '/', '\\', '?', ';', '@', '#', '%', '-', '!', '(', ')', '$', "|"]
data = readlines(os.path.join(DicPath, DICTIONARY))
words = ''
for word in data:
word = word.replace('\n', '')
# 1. length 3-7
if len(word) < 3 or len(word) > 7:
printRed('illegal word:-length-[' + word + '](' + str(len(word)) + ')')
continue
# 2. must not contain digits
if bool(re.search(r'\d', word)):
printRed('illegal word:-contains digits-[' + word + ']')
continue
# 3. must not contain uppercase letters
if not word.islower():
printRed('illegal word:-contains uppercase-[' + word + ']')
continue
# 4. must not contain illegal characters
hasIllegalChar = False
for illegalChar in illegalChars:
if illegalChar in word:
printRed('illegal word:-contains illegal character-[' + word + '](' + illegalChar +')')
hasIllegalChar = True
break
if hasIllegalChar:
continue
# everything left is legal
words += (word + '\n')
write(os.path.join(DicPath, DICTIONARY_LEGAL), words)
if __name__ == '__main__':
parser = OptionParser()
parser.add_option(
"-p",
"--dicPath",
dest="dicPath",
default="",
help="爬下来的词典路径[cambridge/indonesian-english -- 用来拼成Dictionary.csv路径]")
(opts, args) = parser.parse_args()
DicPath = opts.dicPath
DicPath = os.path.join(RootPath, DicPath)
print(DicPath)
if not os.path.exists(DicPath):
printRed("没有该字典路径:" + DicPath)
sys.exit(1)
FilterLegalWord()
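For reference, the four rules above can be collapsed into one check. A sketch only: the character class is copied from illegalChars, and is_legal is my own name for the helper:

import re

# space plus the blacklisted punctuation, or any digit
ILLEGAL = re.compile(r'[ ,./\\?;@#%\-!()$|]|\d')

def is_legal(word):
    # length 3-7, no digits, no uppercase letters, none of the blacklisted characters
    return 3 <= len(word) <= 7 and word.islower() and not ILLEGAL.search(word)

# is_legal('maison')  -> True
# is_legal('Mai-son') -> False (uppercase letter and '-')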
(4) Use Google Translate (the translation is also a hand-written crawler) to turn the legal word list into Chinese, which saves translating by hand (a quick usage check of the Translate helper is sketched after this script):
GoogleTranslate.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
import re
import os
import time
from optparse import OptionParser
if (sys.version_info[0] < 3):
import urllib2
import urllib
import HTMLParser
else:
import html
import urllib.request
import urllib.parse
if (sys.version_info[0] < 3): # reload/setdefaultencoding only exist on Python 2
reload(sys)
sys.setdefaultencoding('utf-8')
def read(fileName):
f = open(fileName, "r")
str_ = f.read()
f.close()
return str_
def readlines(fileName):
f = open(fileName, "r")
str_ = f.readlines()
f.close()
return str_
def write(fileName, str_):
f = open(fileName, "w")
str_ = f.write(str_)
f.close()
def printRed(stri):
print("\033[1;35;40m" + str(stri))
agent = {'User-Agent':
"Mozilla/4.0 (\
compatible;\
MSIE 6.0;\
Windows NT 5.1;\
SV1;\
.NET CLR 1.1.4322;\
.NET CLR 2.0.50727;\
.NET CLR 3.0.04506.30\
)"}
def unescape(text):
if (sys.version_info[0] < 3):
parser = HTMLParser.HTMLParser()
else:
parser = html
return (parser.unescape(text))
def Translate(to_translate, from_language="auto", to_language="auto"):
base_link = "http://translate.google.cn/m?tl=%s&sl=%s&q=%s"
if (sys.version_info[0] < 3):
to_translate = urllib.quote_plus(to_translate)
link = base_link % (to_language, from_language, to_translate)
request = urllib2.Request(link, headers=agent)
raw_data = urllib2.urlopen(request).read()
else:
to_translate = urllib.parse.quote(to_translate)
link = base_link % (to_language, from_language, to_translate)
request = urllib.request.Request(link, headers=agent)
raw_data = urllib.request.urlopen(request).read()
data = raw_data.decode("utf-8")
expr = r'(?s)class="(?:t0|result-container)">(.*?)<'
re_result = re.findall(expr, data)
if (len(re_result) == 0):
result = ""
else:
result = unescape(re_result[0])
return (result)
def NeedTranslate():
for word in m_WordDictionary:
if m_WordDictionary[word] == '':
return True
return False
def GoogleTranslate(): # several words can be translated per request, but with large batches many come back untranslated, so the batch size sometimes has to be reduced
# Google Translate accepts at most 5000 characters per request
global m_onceTranslateWord
TranslateCount = 0
while NeedTranslate() and TranslateCount < 20:
TranslateCount += 1
if TranslateCount == 20 and m_onceTranslateWord != 2:
m_onceTranslateWord = 1 # shrink the batch size and keep retrying
TranslateCount = 18
count_debug = 0
mergeWordList = []
for word in m_WordDictionary:
print("已经翻译完成[" + str(count_debug) + '/' + str(len(m_WordDictionary)) + ']')
if m_WordDictionary[word] != '':
print("已经翻译了:" + word)
count_debug += 1
continue
if len(mergeWordList) == 0:
mergeWordList.append([])
if len(mergeWordList[len(mergeWordList) - 1]) >= m_onceTranslateWord:
mergeWordList.append([])
mergeWordList[len(mergeWordList) - 1].append(word)
for aWordList in mergeWordList:
# issue this request
words = ''
for i in range(0, len(aWordList)):
word = aWordList[i]
if i == 0:
words += word
else:
words += ('|' + word)
words_translate = Translate(words, Lan_From, Lan_To)
if words_translate != '':
word_list = words.split('|')
word_translate_list = words_translate.split('|')
if len(word_list) != len(word_translate_list):
continue
print("word-------------" + words)
print("words_translate--" + words_translate)
for i in range(0, len(word_list)):
if hasChar(word_translate_list[i]):
continue
m_WordDictionary[word_list[i]] = word_translate_list[i]
count_debug += 1
print("已经翻译完成[" + str(count_debug) + '/' + str(len(m_WordDictionary)) + ']')
WriteTranslateToDictionary()
print('GoogleTranslate rounds completed: [' + str(TranslateCount) + ']')
def WriteTranslateToDictionary():
words = ''
for word in m_WordList:
words += (word + ',' + m_WordDictionary[word] + '\n')
write(os.path.join(SavePath, "Dictionary_legal_google.csv"), words)
def hasChar(word):
return (len(re.findall(r'[a-zA-Z]', word)) > 0 and word.islower())
if __name__ == '__main__':
parser = OptionParser()
parser.add_option(
"-f",
"--FromLan",
dest="FromLan",
default="en",
help="翻译-原始语言[en/fr/de/id/zh-CN]")
parser.add_option(
"-t",
"--ToLan",
dest="ToLan",
default="zh-CN",
help="翻译-目标语言[en/fr/de/id/zh-CN]")
parser.add_option(
"-p",
"--savePath",
dest="savePath",
default="cambridge/indonesian-english",
help="存储路径[cambridge/indonesian-english]")
(opts, args) = parser.parse_args()
Lan_From = opts.FromLan
Lan_To = opts.ToLan
# Lan_Str = opts.string
SavePath = opts.savePath
## configuration / setup
RootPath = os.path.dirname(os.path.realpath(__file__))
SavePath = os.path.join(RootPath, SavePath)
if not os.path.exists(SavePath):
print("SavePath路径不存在:" + SavePath)
sys.exit(1)
print(SavePath)
#################### globals ####################
m_WordList = [] # word list -- keeps the original order
m_WordDictionary = {} # word dictionary -- [key: word, value: translation]
m_onceTranslateWord = 20 # maximum number of words per translation request
#################### globals ####################
# 1. create the translation table
if not os.path.isfile(os.path.join(SavePath, "Dictionary_legal_google.csv")):
data = readlines(os.path.join(SavePath, "Dictionary_legal.csv"))
words = ''
for word in data:
words += (word.replace('\n', '') + ',' + '\n')
write(os.path.join(SavePath, "Dictionary_legal_google.csv"), words)
# 2. build the list and dict of all words
data = readlines(os.path.join(SavePath, "Dictionary_legal_google.csv"))
for word in data:
word = word.replace('\n', '')
wod_ = word.split(',')
m_WordList.append(wod_[0])
m_WordDictionary[wod_[0]] = wod_[1]
# 3. start the Google translation
GoogleTranslate()
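A quick sanity check of the Translate helper defined above, run inside this script (e.g. temporarily at the bottom of __main__). This assumes the translate.google.cn mobile endpoint is still reachable from your network; if it is blocked or rate-limited, an empty string comes back:

# expected: a Chinese translation such as 你好
print(Translate("bonjour", "fr", "zh-CN"))
# batched words separated by '|', exactly as GoogleTranslate() sends them
print(Translate("maison|fromage", "fr", "zh-CN"))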
(5) Remove sensitive words from the Google-translated entries (using the SensitiveWords.csv blacklist); the result is the final dictionary we want (a note on the matching behaviour follows the script):
RemoveSensitiveWord.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
import os
import sys
import random
import requests
from optparse import OptionParser
reload(sys)
sys.setdefaultencoding('utf-8')
DICTIONARY = "Dictionary.csv"
DICTIONARY_LEGAL = "Dictionary_legal.csv"
RootPath = os.path.dirname(os.path.realpath(__file__))
print(RootPath)
def read(fileName):
f = open(fileName, "r")
str_ = f.read()
f.close()
return str_
def readlines(fileName):
f = open(fileName, "r")
str_ = f.readlines()
f.close()
return str_
def write(fileName, str_):
f = open(fileName, "w")
str_ = f.write(str_)
f.close()
def printRed(stri):
print("\033[1;35;40m" + str(stri))
def FilterLegalWord():
# characters to exclude: "," "." "/" "?" ";" "@" "#" "%" "-" "!" "(" ")" digits 0-9 and uppercase letters
illegalChars = [' ', ',', '.', '/', '\\', '?', ';', '@', '#', '%', '-', '!', '(', ')', '$', "|"]
data = readlines(os.path.join(DicPath, DICTIONARY))
words = ''
for word in data:
word = word.replace('\n', '')
# 1. length 3-7
if len(word) < 3 or len(word) > 7:
printRed('illegal word:-length-[' + word + '](' + str(len(word)) + ')')
continue
# 2. must not contain digits
if bool(re.search(r'\d', word)):
printRed('illegal word:-contains digits-[' + word + ']')
continue
# 3. must not contain uppercase letters
if not word.islower():
printRed('illegal word:-contains uppercase-[' + word + ']')
continue
# 4. must not contain illegal characters
hasIllegalChar = False
for illegalChar in illegalChars:
if illegalChar in word:
printRed('illegal word:-contains illegal character-[' + word + '](' + illegalChar +')')
hasIllegalChar = True
break
if hasIllegalChar:
continue
# everything left is legal
words += (word + '\n')
write(os.path.join(DicPath, DICTIONARY_LEGAL), words)
def IsSensitiveWord(checkWord):
for sensitiveWord in m_SensitiveWords:
if sensitiveWord in checkWord:
return True
return False
if __name__ == '__main__':
parser = OptionParser()
parser.add_option(
"-p",
"--dicPath",
dest="dicPath",
default="cambridge/indonesian-english",
help="爬下来的词典路径[cambridge/indonesian-english -- 用来拼成Dictionary_legal_google.csv路径]")
(opts, args) = parser.parse_args()
DicPath = opts.dicPath
DicPath = os.path.join(RootPath, DicPath)
print(DicPath)
if not os.path.exists(DicPath):
print("没有该字典路径:" + DicPath)
sys.exit(1)
######### globals #########
m_SensitiveWords = []
m_words = ''
######### globals #########
filename_sensitive = os.path.join(RootPath, "SensitiveWords.csv")
filename_from = os.path.join(DicPath, "Dictionary_legal_google.csv")
filename_to = os.path.join(DicPath, "Dictionary_legal_google_RemoveSensitive.csv")
# 1. load the sensitive words
data = readlines(filename_sensitive)
for word in data:
m_SensitiveWords.append(word.replace('\n', ''))
# 2. drop sensitive entries from the dictionary
data = readlines(filename_from)
for checkword in data:
checkword = checkword.replace('\n', '')
if not IsSensitiveWord(checkword):
m_words += (checkword + '\n')
else:
print("删除敏感词:" + checkword)
write(filename_to, m_words)
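Note that IsSensitiveWord does substring matching against the whole "word,translation" line, so a single blacklist entry removes every line that contains it anywhere, in either the word or its Chinese translation. A tiny illustration with made-up values (both the blacklist entry and the test lines are hypothetical):

# hypothetical data, only to show the substring semantics of IsSensitiveWord
m_SensitiveWords = ["damn"]
print(IsSensitiveWord("damnation,some translation"))  # True  -> line is dropped
print(IsSensitiveWord("maison,some translation"))     # False -> line is kept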