A Simple Crawler Example

I recently needed to write crawlers for some online dictionaries, so here is a walkthrough of the steps.
Dictionaries from two websites need to be crawled (the sites are anonymized; the point is just to illustrate the approach):
(1)https://aaaaaa/bbbbbb
(2)https://xxxxxx/yyyyyy

The workflow consists of 5 steps:
(1) First download all the dictionary pages, then parse them and generate the base word list (one dictionary script per website):
        ResortCrawler_aaaaaa.py
        ResortCrawler_xxxxxx.py (the crawler is shown below)

#!/usr/bin/python
# -*- coding: UTF-8 -*-

########################################################
## Purpose of this file: crawl the dictionary
########################################################
## My regex notes
## Regex reference: https://www.runoob.com/python/python-reg-expressions.html
## pattern = re.compile("^<a{2}[^a-z\s\S]*>[.]</a>?")
## 1. ^ (1) at the start: the string must begin with <. (2) inside [] it means negation
## 2. {2}: quantifier, match exactly twice
## 3. []: matches one character from the set; \s: a whitespace character, \S: a non-whitespace character
## 4. *: quantifier, match 0 or more times, up to the > that follows the *
## 5. +: quantifier, match at least once
## 6. ?: quantifier, makes the match lazy (stop at the first possible match); without it the match is greedy and runs to the last one
## 7. (): a capturing group:
#       s = '<li><a href="https://en.bab.la">A</a></li>'
#       urls = re.findall(re.compile(pattern), s)
#       (1) pattern = re.compile('<a href="https:[\s\S]+?">A</a>')
#           result: ['<a href="https://en.bab.la">A</a>']
#       (2) pattern = re.compile('<a href="(https:[\s\S]+?)">A</a>')
#           result: ['https://en.bab.la']
########################################################
## Crawler lessons learned:
## 1. Find out the page encoding - in the browser Console - document.charset
## 2. Path of the running script: os.path.dirname(os.path.realpath(__file__))
## 3. Python 2 and Python 3 set the default encoding differently; since this is a tool it has to support both (helper: IsPython2())
## 4. Regex: find every matching substring -- whatever ([\S\s]+?) matches is returned
#       expr = re.compile('class="result-container">([\S\s]+?)<')
#       words = re.findall(expr, string)
## 5. HTML parsing with BeautifulSoup
#       soup = BeautifulSoup(str(data), 'lxml')     ## parse into a BeautifulSoup object
#   (1) Find the parts whose class is 'letter-nav' (select() returns a list, since there may be several 'letter-nav' elements)
#       divList = soup.select('.letter-nav')[0]
#   (2) Several selectors can be combined (descendant selectors: 4 inside 3 inside 2 inside 1)
#       aFileDivList = soup.select('.content-column .content .dict-select-wrapper .dict-select-column')
## 6. Since this tool runs on both Mac and Windows (many Windows users run it from PyCharm), use absolute paths when one script runs another
## 7. pandas: the crawled dictionary is usually written to a local file (Excel) and then goes through several processing steps; pandas makes the Excel handling convenient
########################################################





import re
import os
import sys
import random
import requests
from optparse import OptionParser
from bs4 import BeautifulSoup
import platform
def IsPython2():
    return platform.python_version().startswith('2.7')


if IsPython2():
    reload(sys)
    sys.setdefaultencoding('utf-8')


global m_ForceDownload2
m_ForceDownload2 = True


# Pick a random request header (User-Agent)
USER_AGENT = [
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Mobile Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]




def getheaders():
    len_ = len(USER_AGENT)
    index_random = random.randint(0, len_-1)
    agent = USER_AGENT[index_random]
    headers = {'User-Agent': agent}
    return headers

def getHtml(url):
    headers = getheaders()  # 1. random User-Agent header
    html = ""
    try:
        page = requests.get(url, headers=headers, timeout=50)
        html = page.content.decode("UTF-8")
        # print("usable ip:" + ip)
    except Exception as e:
        print("request error: " + str(e))
        pass
    return html


def read(fileName):
    if IsPython2():
        f = open(fileName, "r")
    else:
        f = open(fileName, "r", encoding="utf-8")
    str_ = f.read()
    f.close()
    return str_


def readlines(fileName):
    if IsPython2():
        f = open(fileName, "r")
    else:
        f = open(fileName, "r", encoding="utf-8")
    str_ = f.readlines()
    f.close()
    return str_


def write(fileName, str_):
    if IsPython2():
        f = open(fileName, "w")
    else:
        f = open(fileName, "w", encoding="utf-8")
    str_ = f.write(str_)
    f.close()


def printRed(stri):
    print("\033[1;35;40m" + str(stri))


def NeedDownload2(needCount):
    global m_ForceDownload2
    if m_ForceDownload2:
        m_ForceDownload2 = False
        return True
    hasCount = 0
    for fn in os.listdir(PATH_2_A_Z):
        if '.html' in fn:
            hasCount += 1
    return hasCount != needCount

def GetWord(word):  # strip leading/trailing spaces
    # 1. strip leading spaces
    while word[0] == ' ':
        word = word[1:]
    # 2. strip trailing spaces (overall equivalent to word.strip(' '))
    while word[-1] == ' ':
        word = word[0:-1]
    return word

def Url3DownloadedCount():
    hasCount = 0
    for fn in os.listdir(PATH_3_A_Z):
        if '.html' in fn:
            hasCount += 1
    return hasCount

def NeedDownload3(needCount):
    hasCount = Url3DownloadedCount()
    print('third-level urls------[' + str(hasCount) + '/' + str(needCount) + ']')
    return hasCount != needCount


def DownloadParseDictionary():
    # 1. *************************************** Download the first-level url ***************************************
    print("Checking whether to download......url......first:" + m_RUL_PRE)
    filename = os.path.join(PATH_1_A_Z, "1.html")
    if os.path.isfile(filename):
        print("文件已存在-不需再次下载:"+filename)
    else:
        html = getHtml(m_RUL_PRE)
        if len(str(html)) > 1:
            write(filename, html)
            print("babla_down_success(First):"+m_RUL_PRE)
        else:
            print("babla_down_fail(First):"+m_RUL_PRE)
    html = read(filename)
    soup = BeautifulSoup(html, 'lxml')
    data = str(soup.select('.letter-nav')[0])
    print(data)
    urls = re.findall(re.compile('(http[\S\s]+?)"'), data)
    m_urls = []
    for url in urls:
        if '0-9' not in url:
            m_urls.append(url)
    print('Total number of letters--[' + str(len(m_urls)) + ']')
    if len(m_urls) < 5:
        print('Something went wrong downloading the first page------contact lxz')
        sys.exit()
    
    # 2. *************************************** Download the second-level urls ***************************************
    filename2s = []
    while NeedDownload2(len(m_urls)):
        filename2s = []
        for url in m_urls:
            _ = url.split('/')
            filename = os.path.join(PATH_2_A_Z, _[-2] + '_' + _[-1] + '.html')
            filename2s.append(filename)
            if os.path.isfile(filename):
                print("不需下载。。。。。。url。。。。。。second。。。。。。:" + url)
                continue
            print("下载。。。。。。url。。。。。。second。。。。。。:" + url)
            html = getHtml(url)
            if len(str(html)) > 500:
                write(filename, html)
                write(filename.replace(PATH_2_A_Z, PATH_3_A_Z), html)
                print("babla_down_success(Second):"+url)
            else:
                print("babla_down_fail(Second):"+url)
    # 3. *************************************** Parse out the third-level urls ***************************************
    url3s = []
    for filename in filename2s:
        _char = filename.split(os.path.sep)[-1].split('_')[0]
        data = read(filename)
        soup = BeautifulSoup(data, 'lxml')
        print("解析出三级url。。。。。。" + filename)
        data = str(soup.select('.dict-pag')[0])
        # print('dict-pag......' + data)
        # 上面找出url部分
        urls = re.findall(re.compile('href="([\S\s]+?)"'), data)
        if len(urls) == 0 or len(urls) == 1:
            url3s.append(m_RUL_PRE + _char + '/' + str(1))
        else:
            print(urls)
            maxIndexData = urls[len(urls) - 1].split('/')
            maxIndex = maxIndexData[-1]
            for i in range(1, int(maxIndex) + 1):
                url3s.append(m_RUL_PRE + _char + '/' + str(i))
    urlss = ''
    for url in url3s:
        urlss += (url + '\n')
    write(os.path.join(PATH_2_A_Z, "1onlySee3url.csv"), urlss)
    # 4. *************************************** Download the third-level urls ***************************************
    while NeedDownload3(len(url3s)):
        _url3HasCount = Url3DownloadedCount()
        for url in url3s:
            filename = url.split('/')[-2] + '_' + url.split('/')[-1] + '.html'
            filename = os.path.join(PATH_3_A_Z, filename)
            print(filename)
            if os.path.isfile(filename):
                print("无需下载。。。。。。url。。。。。。third。。。。。。:" + url)
                continue
            print("正在下载。。。。。。url。。。。。。third。。。。。。:" + url)
            html = getHtml(url)
            if len(str(html)) > 200:
                write(filename, html)
                _url3HasCount += 1
                print("babla_down_success(Third):"+url)
            else:
                print("babla_down_fail(Third):"+url)
            print("下载进度------------url------------3------------:[" + str(_url3HasCount) + '/' + str(len(url3s)) + ']')
    # 5. *************************************** 解析出三级url中的单词 ***************************************
    all_words = []
    _parse_count = 0
    for url in url3s:
        _parse_count += 1
        filename = url.split('/')[-2] + '_' + url.split('/')[-1] + '.html'
        if _parse_count % 20 == 0 or _parse_count == len(url3s):
            print("正在解析------------3级网页------------[" + filename + "]------[" + str(_parse_count) + '/' + str(len(url3s)) + ']')
        filename = os.path.join(PATH_3_A_Z, filename)
        soup = BeautifulSoup(read(filename), 'lxml')
        aFileDivList = soup.select('.content-column .content .dict-select-wrapper .dict-select-column')
        for aDiv in aFileDivList:
            aDivWords = re.findall(re.compile('</span>([\s\S]+?)</a>'), str(aDiv))
            for aWord in aDivWords:
                aWord = GetWord(aWord)
                if ' ' not in aWord:
                    all_words.append(aWord)
    dic_words = ''
    for aWord in all_words:
        dic_words += (aWord + '\n')
    write(m_RESULT, dic_words)
    

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option(
        "-t",
        "--transLan",
        dest="TransLan",
        default="french-english",
        help="翻译选项[french-english/russian-english]")
    (opts, args) = parser.parse_args()
    TransLan = opts.TransLan
    ## Configuration / setup
    RootPath = os.path.dirname(os.path.realpath(__file__))
    print(RootPath)
    PATH_1_A_Z = os.path.join(RootPath, "babla", TransLan, "1_a_z")
    PATH_2_A_Z = os.path.join(RootPath, "babla", TransLan, "2_a_z")
    PATH_3_A_Z = os.path.join(RootPath, "babla", TransLan, "3_a_z")
    PATH_4_A_Z = os.path.join(RootPath, "babla", TransLan, "4_a_z")
    
    ## used as globals -- start
    m_RUL_PRE = "https://en.bab.la/dictionary/" + TransLan + "/"
    m_RESULT = os.path.join(RootPath, "babla", TransLan, "Dictionary.csv")
    ## used as globals -- end

    if not os.path.exists(PATH_1_A_Z):
        os.makedirs(PATH_1_A_Z)
    if not os.path.exists(PATH_2_A_Z):
        os.makedirs(PATH_2_A_Z)
    if not os.path.exists(PATH_3_A_Z):
        os.makedirs(PATH_3_A_Z)
    


    # 1. Download and parse out all the words
    DownloadParseDictionary()

    # 2. Deduplicate the words
    print("Remove-Same-Word")
    os.system(os.path.join(RootPath, 'RemoveSameWord.py -p babla', TransLan, 'Dictionary.csv'))

    # 3. Remove illegal words
    print("Remove-Illegal-Word")
    os.system(os.path.join(RootPath, 'RemoveIllegalWords.py -p babla', TransLan))

    # 4. Google translation
    google_lan_dic = {
        'indonesian-english': 'id',
        'french-english': 'fr',
        'german-english': 'de',
        'italian-english': 'it',
        'spanish-english': 'es',
        'russian-english': 'ru',
        'portuguese-english': 'pt',
        'norwegian-english': 'no',
        'czech-english': 'cs',
        'danish-english': 'da',
        'dutch-english': 'nl',
        'polish-english': 'pl',
        'swedish-english': 'sv',
        'japanese-english': 'ja',
    }
    if TransLan not in google_lan_dic.keys():
        sys.exit()
    print("--------GooggleTranslate语言:" + google_lan_dic[TransLan])
    print("GoogleTranslate-Word")
    os.system(os.path.join(RootPath, 'GoogleTranslate.py -f ' + google_lan_dic[TransLan] + ' -p babla', TransLan))

    # 5. Remove sensitive words
    print("Remove-Sensitive-Word")
    os.system(os.path.join(RootPath, 'RemoveSensitiveWord.py -p babla', TransLan))
    if not os.path.exists(os.path.join(RootPath, "babla", TransLan, TransLan+"-dic")):
        os.makedirs(os.path.join(RootPath, "babla", TransLan, TransLan+"-dic"))
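
A note on the chaining above: the os.system calls build one string out of the script path and its arguments with os.path.join, which relies on the .py files being directly executable. Below is a sketch of the same step 2-5 chain using subprocess and the current interpreter, which tends to be more portable between Mac and Windows; the TransLan value is just an example and this driver is my own addition, not part of the original tool.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Hypothetical driver: runs steps 2-5 with the current Python interpreter
# instead of relying on the .py files being executable (see tip 6 above).
import os
import subprocess
import sys

RootPath = os.path.dirname(os.path.realpath(__file__))
TransLan = "french-english"                       # example value
DicDir = os.path.join("babla", TransLan)

subprocess.check_call([sys.executable, os.path.join(RootPath, "RemoveSameWord.py"),
                       "-p", os.path.join(DicDir, "Dictionary.csv")])
subprocess.check_call([sys.executable, os.path.join(RootPath, "RemoveIllegalWords.py"),
                       "-p", DicDir])
subprocess.check_call([sys.executable, os.path.join(RootPath, "GoogleTranslate.py"),
                       "-f", "fr", "-p", DicDir])
subprocess.check_call([sys.executable, os.path.join(RootPath, "RemoveSensitiveWord.py"),
                       "-p", DicDir])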

(2) Deduplicate the words in the base word list:
        RemoveSameWord.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-


import re
import os
import sys
import random
import requests
from optparse import OptionParser
if sys.version_info[0] < 3:
    reload(sys)
    sys.setdefaultencoding('utf-8')


DICTIONARY = "Dictionary.csv"
DICTIONARY_LEGAL = "Dictionary_legal.csv"

RootPath = os.path.dirname(os.path.realpath(__file__))
print(RootPath)

def read(fileName):
    f = open(fileName, "r")
    str_ = f.read()
    f.close()
    return str_


def readlines(fileName):
    f = open(fileName, "r")
    str_ = f.readlines()
    f.close()
    return str_


def write(fileName, str_):
    f = open(fileName, "w")
    str_ = f.write(str_)
    f.close()

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option(
        "-p",
        "--dicPath",
        dest="dicPath",
        default="cambridge/indonesian-english/Dictionary_legal_google.csv",
        help="单词去重路径-去重后依然写入原文件[cambridge/indonesian-english/Dictionary_legal.csv]")
    (opts, args) = parser.parse_args()
    DicPath = opts.dicPath
    DicPath = os.path.join(RootPath, DicPath)
    print(DicPath)
    if not os.path.exists(DicPath):
        print("没有该字典路径:" + DicPath)
        sys.exit(1)

    # 1. Deduplicate
    data = readlines(DicPath)
    m_words = []
    words = ''
    for word in data:
        word = word.replace('\n', '')
        if word not in m_words:
            m_words.append(word)
            words += (word + '\n')
    write(DicPath, words)
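
One note on the loop above: `word not in m_words` scans the whole list for every word, which becomes slow on large dictionaries. Below is a sketch of an order-preserving variant of the same block that uses a set for the membership test; it is a drop-in replacement for the deduplication loop, same input and output.

    # 1. Deduplicate (order-preserving, set-based membership test)
    data = readlines(DicPath)
    seen = set()
    words = ''
    for word in data:
        word = word.replace('\n', '')
        if word not in seen:
            seen.add(word)
            words += (word + '\n')
    write(DicPath, words)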

(3) Generate the desired legal word list according to a set of rules:
        RemoveIllegalWords.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-


import re
import os
import sys
import random
import requests
from optparse import OptionParser
if sys.version_info[0] < 3:
    reload(sys)
    sys.setdefaultencoding('utf-8')


DICTIONARY = "Dictionary.csv"
DICTIONARY_LEGAL = "Dictionary_legal.csv"

RootPath = os.path.dirname(os.path.realpath(__file__))
print(RootPath)

def read(fileName):
    f = open(fileName, "r")
    str_ = f.read()
    f.close()
    return str_


def readlines(fileName):
    f = open(fileName, "r")
    str_ = f.readlines()
    f.close()
    return str_


def write(fileName, str_):
    f = open(fileName, "w")
    str_ = f.write(str_)
    f.close()

def printRed(stri):
    print("\033[1;35;40m" + str(stri))

def FilterLegalWord():
    # Exclude words containing special characters: "," "." "/" "?" ";" "@" "#" "%" "-" "!" "(" ")" etc., digits, or upper-case letters
    illegalChars = [' ', ',', '.', '/', '\\', '?', ';', '@', '#', '%', '-', '!', '(', ')', '$', "|"]
    data = readlines(os.path.join(DicPath, DICTIONARY))
    words = ''
    for word in data:
        word = word.replace('\n', '')
        # 1. length 3-7
        if len(word) < 3 or len(word) > 7:
            printRed('Illegal word:-length-[' + word + '](' + str(len(word)) + ')')
            continue
        # 2. no digits
        if bool(re.search(r'\d', word)):
            printRed('Illegal word:-contains digit-[' + word + ']')
            continue
        # 3. no upper case
        if not word.islower():
            printRed('Illegal word:-contains upper case-[' + word + ']')
            continue
        # 4. no illegal characters
        hasIllegalChar = False
        for illegalChar in illegalChars:
            if illegalChar in word:
                printRed('Illegal word:-contains illegal character-[' + word + '](' + illegalChar +')')
                hasIllegalChar = True
                break
        if hasIllegalChar:
            continue
        # whatever is left is legal
        words += (word + '\n')
    write(os.path.join(DicPath, DICTIONARY_LEGAL), words)

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option(
        "-p",
        "--dicPath",
        dest="dicPath",
        default="",
        help="爬下来的词典路径[cambridge/indonesian-english -- 用来拼成Dictionary.csv路径]")
    (opts, args) = parser.parse_args()
    DicPath = opts.dicPath
    DicPath = os.path.join(RootPath, DicPath)
    print(DicPath)
    if not os.path.exists(DicPath):
        printRed("没有该字典路径:" + DicPath)
        sys.exit(1)
    FilterLegalWord()
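
The four rules in FilterLegalWord (length 3-7, no digits, no upper case, none of the listed special characters) can also be packaged as a reusable predicate. A small sketch of that equivalence follows, assuming exactly the same rules; the per-check version above is kept because its printRed messages show why a word was rejected.

import re

ILLEGAL_CHARS = set(' ,./\\?;@#%-!()$|')

def IsLegalWord(word):
    # Same rules as FilterLegalWord: length 3-7, no digits, all lower case,
    # and none of the special characters listed above.
    if len(word) < 3 or len(word) > 7:
        return False
    if re.search(r'\d', word):
        return False
    if not word.islower():
        return False
    return not any(ch in ILLEGAL_CHARS for ch in word)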

(4) Translate the legal word list into Chinese with Google Translate (the Google Translate client is also a hand-written crawler), which saves the trouble of translating manually:
        GoogleTranslate.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import sys
import re
import os
import time
from optparse import OptionParser
if (sys.version_info[0] < 3):
    import urllib2
    import urllib
    import HTMLParser
else:
    import html
    import urllib.request
    import urllib.parse
if (sys.version_info[0] < 3):
    reload(sys)
    sys.setdefaultencoding('utf-8')


def read(fileName):
    f = open(fileName, "r")
    str_ = f.read()
    f.close()
    return str_


def readlines(fileName):
    f = open(fileName, "r")
    str_ = f.readlines()
    f.close()
    return str_


def write(fileName, str_):
    f = open(fileName, "w")
    str_ = f.write(str_)
    f.close()


def printRed(stri):
    print("\033[1;35;40m" + str(stri))


agent = {'User-Agent':
         "Mozilla/4.0 (\
compatible;\
MSIE 6.0;\
Windows NT 5.1;\
SV1;\
.NET CLR 1.1.4322;\
.NET CLR 2.0.50727;\
.NET CLR 3.0.04506.30\
)"}


def unescape(text):
    if (sys.version_info[0] < 3):
        parser = HTMLParser.HTMLParser()
    else:
        parser = html
    return (parser.unescape(text))


def Translate(to_translate, from_language="auto", to_language="auto"):
    base_link = "http://translate.google.cn/m?tl=%s&sl=%s&q=%s"
    if (sys.version_info[0] < 3):
        to_translate = urllib.quote_plus(to_translate)
        link = base_link % (to_language, from_language, to_translate)
        request = urllib2.Request(link, headers=agent)
        raw_data = urllib2.urlopen(request).read()
    else:
        to_translate = urllib.parse.quote(to_translate)
        link = base_link % (to_language, from_language, to_translate)
        request = urllib.request.Request(link, headers=agent)
        raw_data = urllib.request.urlopen(request).read()
    data = raw_data.decode("utf-8")
    expr = r'(?s)class="(?:t0|result-container)">(.*?)<'
    re_result = re.findall(expr, data)
    if (len(re_result) == 0):
        result = ""
    else:
        result = unescape(re_result[0])
    return (result)

def NeedTranslate():
    for word in m_WordDictionary:
        if m_WordDictionary[word] == '':
            return True
    return False

def GoogleTranslate():  # Several words can be translated per request, but large batches often fail to translate, so the batch size sometimes has to be reduced
    # Google Translate accepts at most 5000 characters per request
    global m_onceTranslateWord
    TranslateCount = 0
    while NeedTranslate() and TranslateCount < 20:
        TranslateCount += 1
        if TranslateCount == 20 and m_onceTranslateWord != 1:
            m_onceTranslateWord = 1  # fall back to one word per request for the final rounds
            TranslateCount = 18
        count_debug = 0
        mergeWordList = []
        for word in m_WordDictionary:
            print("已经翻译完成[" + str(count_debug) + '/' + str(len(m_WordDictionary)) + ']')
            if m_WordDictionary[word] != '':
                print("已经翻译了:" + word)
                count_debug += 1
                continue
            if len(mergeWordList) == 0:
                mergeWordList.append([])
            if len(mergeWordList[len(mergeWordList) - 1]) >= m_onceTranslateWord:
                mergeWordList.append([])
            mergeWordList[len(mergeWordList) - 1].append(word)
        for aWordList in mergeWordList:
            # start this request
            words = ''
            for i in range(0, len(aWordList)):
                word = aWordList[i]
                if i == 0:
                    words += word
                else:
                    words += ('|' + word)
            words_translate = Translate(words, Lan_From, Lan_To)
            if words_translate != '':
                word_list = words.split('|')
                word_translate_list = words_translate.split('|')
                if len(word_list) != len(word_translate_list):
                    continue
                print("word-------------" + words)
                print("words_translate--" + words_translate)
                for i in range(0, len(word_list)):
                    if hasChar(word_translate_list[i]):
                        continue
                    m_WordDictionary[word_list[i]] = word_translate_list[i]
                    count_debug += 1
                    print("已经翻译完成[" + str(count_debug) + '/' + str(len(m_WordDictionary)) + ']')
                WriteTranslateToDictionary()
        print('GoogleTranslate rounds completed: [' + str(TranslateCount) + ']')

def WriteTranslateToDictionary():
    words = ''
    for word in m_WordList:
        words += (word + ',' + m_WordDictionary[word] + '\n')
    write(os.path.join(SavePath, "Dictionary_legal_google.csv"), words)


def hasChar(word):
    return (len(re.findall(r'[a-zA-Z]', word)) > 0 and word.islower())


if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option(
        "-f",
        "--FromLan",
        dest="FromLan",
        default="en",
        help="翻译-原始语言[en/fr/de/id/zh-CN]")
    parser.add_option(
        "-t",
        "--ToLan",
        dest="ToLan",
        default="zh-CN",
        help="翻译-目标语言[en/fr/de/id/zh-CN]")
    parser.add_option(
        "-p",
        "--savePath",
        dest="savePath",
        default="cambridge/indonesian-english",
        help="存储路径[cambridge/indonesian-english]")
    (opts, args) = parser.parse_args()
    Lan_From = opts.FromLan
    Lan_To = opts.ToLan
    # Lan_Str = opts.string
    SavePath = opts.savePath
    ## Configuration / setup
    RootPath = os.path.dirname(os.path.realpath(__file__))
    SavePath = os.path.join(RootPath, SavePath)
    if not os.path.exists(SavePath):
        print("SavePath路径不存在:" + SavePath)
        sys.exit(1)
    print(SavePath)
    
    ####################  globals  ####################
    m_WordList = []             # word list -- keeps the original order
    m_WordDictionary = {}       # word dictionary -- [key: word, value: translation]
    m_onceTranslateWord = 20    # maximum number of words per translation request
    ####################  globals  ####################

    # 1. Create the translation table
    if not os.path.isfile(os.path.join(SavePath, "Dictionary_legal_google.csv")):
        data = readlines(os.path.join(SavePath, "Dictionary_legal.csv"))
        words = ''
        for word in data:
            words += (word.replace('\n', '') + ',' + '\n')
        write(os.path.join(SavePath, "Dictionary_legal_google.csv"), words)
    # 2. Build the list and dict of all words
    data = readlines(os.path.join(SavePath, "Dictionary_legal_google.csv"))
    for word in data:
        word = word.replace('\n', '')
        wod_ = word.split(',')
        m_WordList.append(wod_[0])
        m_WordDictionary[wod_[0]] = wod_[1]
    # 3. Start the Google translation
    GoogleTranslate()
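
Before running the whole batch loop it can be worth sanity-checking the Translate() helper on its own. A small usage sketch follows; the words and languages are examples only, and the translate.google.cn endpoint plus the result regex can break whenever Google changes the mobile page.

# Quick check of the Translate() helper defined above (example words only).
print(Translate("bonjour", from_language="fr", to_language="zh-CN"))

# A batched request mirrors what GoogleTranslate() does: join words with '|',
# translate once, then split the response back into per-word translations.
batch = "chien|chat|maison"
print(Translate(batch, "fr", "zh-CN").split('|'))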

(5) Remove sensitive words (from the sensitive-word list SensitiveWords.csv) from the Google-translated Chinese to get the final word list we want:
        RemoveSensitiveWord.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-



import re
import os
import sys
import random
import requests
from optparse import OptionParser
if sys.version_info[0] < 3:
    reload(sys)
    sys.setdefaultencoding('utf-8')



DICTIONARY = "Dictionary.csv"
DICTIONARY_LEGAL = "Dictionary_legal.csv"

RootPath = os.path.dirname(os.path.realpath(__file__))
print(RootPath)

def read(fileName):
    f = open(fileName, "r")
    str_ = f.read()
    f.close()
    return str_


def readlines(fileName):
    f = open(fileName, "r")
    str_ = f.readlines()
    f.close()
    return str_


def write(fileName, str_):
    f = open(fileName, "w")
    str_ = f.write(str_)
    f.close()

def printRed(stri):
    print("\033[1;35;40m" + str(stri))

def FilterLegalWord():
    # Exclude words containing special characters: "," "." "/" "?" ";" "@" "#" "%" "-" "!" "(" ")" etc., digits, or upper-case letters
    illegalChars = [' ', ',', '.', '/', '\\', '?', ';', '@', '#', '%', '-', '!', '(', ')', '$', "|"]
    data = readlines(os.path.join(DicPath, DICTIONARY))
    words = ''
    for word in data:
        word = word.replace('\n', '')
        # 1. length 3-7
        if len(word) < 3 or len(word) > 7:
            printRed('Illegal word:-length-[' + word + '](' + str(len(word)) + ')')
            continue
        # 2. no digits
        if bool(re.search(r'\d', word)):
            printRed('Illegal word:-contains digit-[' + word + ']')
            continue
        # 3. no upper case
        if not word.islower():
            printRed('Illegal word:-contains upper case-[' + word + ']')
            continue
        # 4. no illegal characters
        hasIllegalChar = False
        for illegalChar in illegalChars:
            if illegalChar in word:
                printRed('Illegal word:-contains illegal character-[' + word + '](' + illegalChar +')')
                hasIllegalChar = True
                break
        if hasIllegalChar:
            continue
        # whatever is left is legal
        words += (word + '\n')
    write(os.path.join(DicPath, DICTIONARY_LEGAL), words)

def IsSensitiveWord(checkWord):
    for sensitiveWord in m_SensitiveWords:
        if sensitiveWord in checkWord:
            return True
    return False

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option(
        "-p",
        "--dicPath",
        dest="dicPath",
        default="cambridge/indonesian-english",
        help="爬下来的词典路径[cambridge/indonesian-english -- 用来拼成Dictionary_legal_google.csv路径]")
    (opts, args) = parser.parse_args()
    DicPath = opts.dicPath
    DicPath = os.path.join(RootPath, DicPath)
    print(DicPath)
    if not os.path.exists(DicPath):
        print("没有该字典路径:" + DicPath)
        sys.exit(1)
    
    ######### globals #########
    m_SensitiveWords = []
    m_words = ''
    ######### globals #########
    filename_sensitive = os.path.join(RootPath, "SensitiveWords.csv")
    filename_from = os.path.join(DicPath, "Dictionary_legal_google.csv")
    filename_to = os.path.join(DicPath, "Dictionary_legal_google_RemoveSensitive.csv")
    # 1. Load the sensitive words
    data = readlines(filename_sensitive)
    for word in data:
        m_SensitiveWords.append(word.replace('\n', ''))
    # 2. Remove sensitive words from the word list
    data = readlines(filename_from)
    for checkword in data:
        checkword = checkword.replace('\n', '')
        if not IsSensitiveWord(checkword):
            m_words += (checkword + '\n')
        else:
            print("删除敏感词:" + checkword)
    write(filename_to, m_words)
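
Tip 7 in the ResortCrawler header mentions pandas for the Excel side of the workflow, although the scripts above stick to plain CSV. A minimal sketch of loading the final word list with pandas and exporting it to Excel for manual review follows; the paths follow the pipeline above, the .xlsx name is my own choice, and it assumes the translations contain no extra commas.

# Minimal pandas sketch: load the final word,translation CSV and export it
# to Excel for review. Requires pandas and openpyxl.
import pandas as pd

csv_path = "babla/french-english/Dictionary_legal_google_RemoveSensitive.csv"
df = pd.read_csv(csv_path, header=None, names=["word", "translation"])
print(df.head())
df.to_excel("babla/french-english/Dictionary_review.xlsx", index=False)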
