Implementing Google Translate with a Python Crawler

Copyright notice: This is an original article by the blogger; please credit the source when reposting. http://blog.csdn.net/yingshukun https://blog.csdn.net/yingshukun/article/details/53470424

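I have tried a number of translation tools and found Google Translate to be the most accurate, but it no longer offers a free API. The crawler write-ups I found online are badly out of date: Google changed the mechanism long ago, so they no longer work at all. Here is a simple implementation; extend it if you need more features.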


Google Translate is accessible from mainland China. I did not use any VPN or proxy here and had no trouble reaching it.

1. Capture the traffic and analyze it


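The request turns out to be a GET, which makes things simpler: we can just assemble the URL. Looking at the parameters, the key one is q, which carries the text to translate. I only need English-to-Chinese translation here, so I copied the other parameters as they were. If you need something different, change the relevant parameters.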

         

http://translate.google.cn/translate_a/single?client=t&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&otf=1&srcrom=0&ssel=0&tsel=0&kc=5&tk=196711.345729&q=<text to translate>
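
The query string above can also be assembled in code rather than by hand. The following is only a minimal sketch using the standard library: the helper name build_url and its arguments are my own for illustration, and the parameter values are simply the ones copied from the captured request.

```python
from urllib.parse import urlencode

BASE_URL = "http://translate.google.cn/translate_a/single"

# dt appears several times in the captured request, so it is kept as a list.
DT_VALUES = ["at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t"]


def build_url(text, tk, sl="en", tl="zh-CN"):
    """Assemble the GET URL (build_url is a hypothetical helper name)."""
    params = [("client", "t"), ("sl", sl), ("tl", tl), ("hl", "zh-CN")]
    params += [("dt", value) for value in DT_VALUES]
    params += [
        ("ie", "UTF-8"), ("oe", "UTF-8"), ("otf", "1"),
        ("srcrom", "0"), ("ssel", "0"), ("tsel", "0"), ("kc", "5"),
        ("tk", tk), ("q", text),
    ]
    # urlencode percent-encodes the values, so spaces and non-ASCII text are safe.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_url("hello world", "196711.345729"))
```

A list of pairs is used instead of a dict because a dict cannot hold the repeated dt key.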


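2. Opening the assembled URL in a browser fails. After repeated tests on the web page, it became clear that Google had indeed put some protection in place. There is a key parameter, tk, which is computed by JavaScript code, and its value differs for each text being translated. The only option was to study the JS code, but I am not very good at JavaScript and the tk algorithm is quite complex, so in the end I fell back on Google search. Sure enough, someone had already extracted the code for the tk algorithm. GitHub: https://github.com/cocoa520/Google_TK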

Now for the main part. Here is the Python code (saved as HandleJs.py, the module that the translation script imports):

import execjs

class Py4Js():
    
    def __init__(self):
        self.ctx = execjs.compile("""
        function TL(a) {
        var k = "";
        var b = 406644;
        var b1 = 3293161072;
        
        var jd = ".";
        var $b = "+-a^+6";
        var Zb = "+-3^+b+-f";
    
        for (var e = [], f = 0, g = 0; g < a.length; g++) {
            var m = a.charCodeAt(g);
            128 > m ? e[f++] = m : (2048 > m ? e[f++] = m >> 6 | 192 : (55296 == (m & 64512) && g + 1 < a.length && 56320 == (a.charCodeAt(g + 1) & 64512) ? (m = 65536 + ((m & 1023) << 10) + (a.charCodeAt(++g) & 1023),
            e[f++] = m >> 18 | 240,
            e[f++] = m >> 12 & 63 | 128) : e[f++] = m >> 12 | 224,
            e[f++] = m >> 6 & 63 | 128),
            e[f++] = m & 63 | 128)
        }
        a = b;
        for (f = 0; f < e.length; f++) a += e[f],
        a = RL(a, $b);
        a = RL(a, Zb);
        a ^= b1 || 0;
        0 > a && (a = (a & 2147483647) + 2147483648);
        a %= 1E6;
        return a.toString() + jd + (a ^ b)
    };
    
    function RL(a, b) {
        var t = "a";
        var Yb = "+";
        for (var c = 0; c < b.length - 2; c += 3) {
            var d = b.charAt(c + 2),
            d = d >= t ? d.charCodeAt(0) - 87 : Number(d),
            d = b.charAt(c + 1) == Yb ? a >>> d: a << d;
            a = b.charAt(c) == Yb ? a + d & 4294967295 : a ^ d
        }
        return a
    }
    """)
        
    def getTk(self,text):
        # Call the JS function TL to compute the tk value for the given text
        return self.ctx.call("TL",text)

  

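There are quite a few ways to call JS code from Python. The PyV8 library is one, but it does not currently support Python 3. Microsoft's ScriptControl is another and only needs Python's win32 library installed, but its API is somewhat cumbersome. Here I use PyExecJS, which has the simplest API: once it is installed, just import the execjs module. (GitHub: https://github.com/doloopwhile/PyExecJS)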

The class above wraps a method, getTk, that returns the tk value; its argument is the string to translate.
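
If no JS runtime is available, the same arithmetic can in principle be written directly in Python. The following is my own hedged port of the TL and RL functions above, not code from the linked repository; the names get_tk and _rl are made up for illustration, and it has not been checked against the live service. It assumes well-formed text and emulates JavaScript's 32-bit integer operations by masking.

```python
MASK = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def _rl(a, ops):
    """Port of RL(): apply the shift/add/xor steps encoded three chars at a time."""
    for i in range(0, len(ops) - 2, 3):
        ch = ops[i + 2]
        shift = ord(ch) - 87 if ch >= "a" else int(ch)
        # '+' in the middle position means unsigned right shift, otherwise left shift.
        d = a >> shift if ops[i + 1] == "+" else (a << shift) & MASK
        # '+' in the first position means 32-bit addition, otherwise XOR.
        a = (a + d) & MASK if ops[i] == "+" else a ^ d
    return a


def get_tk(text, b=406644, b1=3293161072):
    """Port of TL(); b and b1 are the constants hard-coded in the JS above."""
    a = b
    # The JS loop builds the UTF-8 bytes by hand; encode() does the same here.
    for byte in text.encode("utf-8"):
        a = _rl((a + byte) & MASK, "+-a^+6")
    a = _rl(a, "+-3^+b+-f")
    a = (a ^ b1) % 1000000
    return "%d.%d" % (a, a ^ b)


if __name__ == "__main__":
    print(get_tk("hello"))
```

Values are kept as unsigned 32-bit integers throughout, which makes the JS step that converts a negative result to its unsigned form unnecessary.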


Next is the actual Python 3 implementation:

import urllib.parse
import urllib.request
from HandleJs import Py4Js

def open_url(url):
    # Send a browser-like User-Agent and return the decoded response body
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url = url,headers=headers)
    response = urllib.request.urlopen(req)
    data = response.read().decode('utf-8')
    return data

def translate(content,tk):
    if len(content) > 4891:
        print("The text to translate exceeds the length limit!")
        return

    # Percent-encode the text so it is safe to place in the URL
    content = urllib.parse.quote(content)

    # Parentheses join the pieces, avoiding fragile backslash continuations
    url = ("http://translate.google.cn/translate_a/single?client=t"
           "&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca"
           "&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&clearbtn=1&otf=1&pc=1"
           "&srcrom=0&ssel=0&tsel=0&kc=2&tk=%s&q=%s" % (tk,content))

    # The response is a multi-level nested list in string form, and parsing it is quite tedious.
    # I wrote a few regular expressions and was not happy with the results; regexes only make
    # a simple job complicated here, so plain slicing is enough.
    result = open_url(url)

    end = result.find("\",")
    if end > 4:
        print(result[4:end])

def main():
    js = Py4Js()

    while True:
        content = input("Enter text to translate: ")

        # Typing q! quits the loop
        if content == 'q!':
            break

        tk = js.getTk(content)
        translate(content,tk)

if __name__ == "__main__":
    main()


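Note that with a GET request the amount of data a URL can carry is limited. In my tests, the text to translate could be roughly 4891 characters long.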
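
Longer input can be cut into pieces that each stay under that limit and translated one piece at a time. The following is a minimal sketch: the helper name split_text is my own, and the default of 4891 is just the figure measured above, not a documented limit. It counts raw characters, as the length check in translate() does.

```python
def split_text(text, limit=4891):
    """Split text into pieces of at most `limit` characters, preferring line breaks."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size slices.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Each piece then needs its own tk value, since tk depends on the text being sent.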


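At this point the basic functionality is in place. If you have other needs, you can build on it, for example reading the text from a file and saving the returned translation to a text file for easier reading, which is especially handy for translating English documentation.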
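
The file-based workflow just mentioned could be sketched like this. It assumes a translation function that returns its result instead of printing it, which the translate() in this article does not do; translate_fn and translate_file are placeholder names of my own.

```python
def translate_file(src_path, dst_path, translate_fn):
    """Read src_path, translate it line by line, and write the result to dst_path.

    translate_fn is a placeholder for any callable that takes a string and
    returns its translation; it is not defined in this article.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            stripped = line.strip()
            # Blank lines are copied through so paragraph breaks survive.
            dst.write((translate_fn(stripped) if stripped else "") + "\n")
```

Passing the translator in as an argument keeps the file handling independent of how the request is made.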

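In response to questions some readers raised, I have revised the code. Originally I wanted to keep the material focused on one thing, so I did not use the powerful requests library; I also had not looked closely at the response. My mistake: the response is indeed a JSON array, which can be parsed directly into a Python list.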

import requests
from HandleJs import Py4Js

def translate(tk,content):
    if len(content) > 4891:
        print("The text to translate exceeds the length limit!")
        return

    # requests percent-encodes these and appends them to the query string
    param = {'tk': tk, 'q': content}

    # Adjacent string literals are joined, so the URL contains no line breaks or spaces
    url = ("http://translate.google.cn/translate_a/single?client=t&sl=en"
           "&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss"
           "&dt=t&ie=UTF-8&oe=UTF-8&clearbtn=1&otf=1&pc=1&srcrom=0&ssel=0&tsel=0&kc=2")
    result = requests.get(url, params=param)

    # The response is JSON; parse it into a nested list
    for text in result.json():
        print(text)

    
def main():    
    js = Py4Js()    
         
    content = """Beautiful is better than ugly.
        Explicit is better than implicit.
        Simple is better than complex.
        Complex is better than complicated.
        Flat is better than nested.
        Sparse is better than dense.
        Readability counts.
        Special cases aren't special enough to break the rules.
        Although practicality beats purity.
        Errors should never pass silently.
        Unless explicitly silenced.
        In the face of ambiguity, refuse the temptation to guess.
        There should be one-- and preferably only one --obvious way to do it.
        Although that way may not be obvious at first unless you're Dutch.
        Now is better than never.
        Although never is often better than *right* now.
        If the implementation is hard to explain, it's a bad idea.
        If the implementation is easy to explain, it may be a good idea.
        Namespaces are one honking great idea -- let's do more of those!"""

    tk = js.getTk(content)    
    translate(tk,content)    
        
if __name__ == "__main__":    
    main()
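
The loop in translate() prints every top-level element of the response. To pull out only the translated text, something like the following sketch may help. It assumes that the first element of the array is a list of segments whose first item is the translated string, which is what the slicing in the first version (text starting at offset 4) suggests; the sample payload is made up for illustration and extract_translation is a hypothetical name.

```python
import json


def extract_translation(payload):
    """Join the translated segments from a response body (layout is an assumption)."""
    data = json.loads(payload)
    segments = data[0] or []
    # Each segment is assumed to look like [translated, original, ...].
    return "".join(seg[0] for seg in segments if seg and seg[0])


if __name__ == "__main__":
    # Hand-written sample, not a captured response.
    sample = '[[["你好","hello",null,null,1]],null,"en"]'
    print(extract_translation(sample))  # prints 你好
```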



