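Fetching Google Translate results with a Python crawler
Update (2021-09): the Google Translate interface has changed, so the code below may no longer return translations. The newer interface is described at https://blog.csdn.net/qq_40846669/article/details/119926079
I'm leaving the original post below as it was; feel free to ignore it.
Reference: https://blog.csdn.net/yingshukun/article/details/53470424
Code for this has been floating around online for a long time, but every version I found had problems of one size or another and could not be used as-is, so I'm collecting a working version here as a backup:
P.S. This post keeps things simple and does not handle the length limit on the text to translate. According to the reference post above, a single request can carry roughly 4800 characters, so for larger jobs consider slicing the content before submitting it.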
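The slicing mentioned above could look like the sketch below. It is not part of the original script: split_text is a hypothetical helper, the 4800 default only mirrors the approximate figure from the referenced post, and cutting at sentence punctuation is my own assumption about what makes a sensible boundary.

```python
import re


def split_text(text, limit=4800):
    """Split text into pieces of at most `limit` characters.

    Hypothetical helper: prefers to cut after sentence-ending punctuation
    (Chinese or Western) or a newline, and hard-cuts any single sentence
    that is longer than the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Each match is one sentence with its terminator (if any) attached.
    sentences = re.findall(r"[^。!?.!?\n]*[。!?.!?\n]?", text)
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A sentence longer than the limit cannot be kept whole.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        # Start a new chunk when this sentence would overflow the current one.
        if len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned piece can then be passed to translate() in turn and the results joined. Joining the pieces themselves always reproduces the original text.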
Source code:
import requests
import urllib.parse
import json
import sys
import execjs  # install with: pip install PyExecJS; used to run the JS snippet
class Py4Js():
    def __init__(self):
        # Compile the token-generation JS extracted from the Google Translate page.
        self.ctx = execjs.compile("""
            xo=function(a,b){
            for(var c=0;c<b.length-2;c+=3)
            {var d=b.charAt(c+2);d="a"<=d?d.charCodeAt(0)-87:Number(d);d="+"==b.charAt(c+1)?a>>>d:a<<d;a="+"==b.charAt(c)?a+d&4294967295:a^d}return a}
            function TL(a){
            var wo=function(a){return function(){return a}}
            b=wo(String.fromCharCode(84));
            var c=wo(String.fromCharCode(75));
            b=[b(),b()];b[1]=c();
            b="750.0";
            var d=wo(String.fromCharCode(116));
            c=wo(String.fromCharCode(107));
            d=[d(),d()];
            c="&"+d.join("")+ "=";
            d=b.split(".");
            b=6;
            for(var e=[],f=0,g=0;g<a.length;g++)
            {var k=a.charCodeAt(g);128>k?e[f++]=k:(2048>k?e[f++]=k>>6|192:(55296==(k&64512)&&g+1<a.length&&56320==(a.charCodeAt(g+1)&64512)?(k=65536+((k&1023)<<10)+(a.charCodeAt(++g)&1023),e[f++]=k>>18|240,e[f++]=k>>12&63|128):e[f++]=k>>12|224,e[f++]=k>>6&63|128),e[f++]=k&63|128)}a=b;for(f=0;f<e.length;f++)a+=e[f],a=xo(a,"+-a^+6");a=xo(a,"+-3^+b+-f");a^=Number(d[1])||0;0>a&&(a=(a&2147483647)+2147483648);a%=1E6;
            return c+(a.toString()+"."+ (a^b))}
        """)

    def getTk(self, text):
        # Run TL() on the text to get the value used for the tk parameter.
        # Note: as written, TL() returns the token with a "&tt=" prefix,
        # not the bare token.
        return self.ctx.call("TL", text)

def buildUrl(text, tk):
    baseUrl = "https://translate.google.cn/translate_a/single?client=webapp&sl=zh-CN&tl=en&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&swap=1&otf=2&ssel=5&tsel=5&kc=1&"
    baseUrl += 'tk=' + str(tk) + '&'
    baseUrl += 'q=' + urllib.parse.quote(text)
    print(baseUrl)
    return baseUrl

def translate(js, text):
    # Headers copied from a browser session; the cookie may well have expired.
    header = {
        'cookie': 'NID=188=Nx_B7MPjOKKUBKu4LByiqdUEwcO4goXhVKB0vtqhvJycCD3TIPTgA7HU80AQ4LJXfrAjV8gvawvSDMKgS52MkV3JB44kgzNq9aHp41EuL8-2Cns1re4xCgQvPr1jMI9JPZxFU9fdHtymXto3qCv64HVBIkQ8vfBRMxKeZl0XS4g',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    url = buildUrl(text, js.getTk(text))
    res = ''
    try:
        r = requests.get(url, headers=header)
        result = json.loads(r.content.decode("utf-8"))
        # The translated text of the first segment sits at result[0][0][0].
        res = result[0][0][0]
    except Exception as e:
        res = ''
        print(url)
        print("Translation failed: " + text)
        print(e)
    return res

if __name__ == '__main__':
    text = "中文内容"  # Chinese sample input to be translated into English
    js = Py4Js()
    res = translate(js, text)
    print(res)
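Notes:
1. I originally assumed the code circulating online did not work, so I extracted the JS part of the code above from Google Translate's own JS myself. It turns out the JS found online works too, so that effort was wasted. For reference, here is the JS commonly seen online: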
self.ctx = execjs.compile("""
    function TL(a) {
        var k = "";
        var b = 406644;
        var b1 = 3293161072;
        var jd = ".";
        var $b = "+-a^+6";
        var Zb = "+-3^+b+-f";
        for (var e = [], f = 0, g = 0; g < a.length; g++) {
            var m = a.charCodeAt(g);
            128 > m ? e[f++] = m : (2048 > m ? e[f++] = m >> 6 | 192 : (55296 == (m & 64512) && g + 1 < a.length && 56320 == (a.charCodeAt(g + 1) & 64512) ? (m = 65536 + ((m & 1023) << 10) + (a.charCodeAt(++g) & 1023),
            e[f++] = m >> 18 | 240,
            e[f++] = m >> 12 & 63 | 128) : e[f++] = m >> 12 | 224,
            e[f++] = m >> 6 & 63 | 128),
            e[f++] = m & 63 | 128)
        }
        a = b;
        for (f = 0; f < e.length; f++) a += e[f],
        a = RL(a, $b);
        a = RL(a, Zb);
        a ^= b1 || 0;
        0 > a && (a = (a & 2147483647) + 2147483648);
        a %= 1E6;
        return a.toString() + jd + (a ^ b)
    };
    function RL(a, b) {
        var t = "a";
        var Yb = "+";
        for (var c = 0; c < b.length - 2; c += 3) {
            var d = b.charAt(c + 2),
            d = d >= t ? d.charCodeAt(0) - 87 : Number(d),
            d = b.charAt(c + 1) == Yb ? a >>> d: a << d;
            a = b.charAt(c) == Yb ? a + d & 4294967295 : a ^ d
        }
        return a
    }
""")
-
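The code above only translates Chinese to English. For other directions, replace the relevant parameters in buildUrl (sl is the source language, tl the target language). Below is the URL for English to Chinese; the tk and q parameters appended after it stay the same.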
https://translate.google.cn/translate_a/single?client=webapp&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&otf=2&ssel=5&tsel=5&kc=1&
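Rather than editing the URL string by hand for each direction, the language pair could be passed in as arguments. The sketch below is one way to do that with urllib.parse.urlencode. The function name build_url and its source/target arguments are my own, and the parameter list simply mirrors the English-to-Chinese URL above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://translate.google.cn/translate_a/single"


def build_url(text, tk, source="zh-CN", target="en"):
    """Assemble the request URL for an arbitrary language pair (hypothetical helper)."""
    params = [("client", "webapp"), ("sl", source), ("tl", target), ("hl", "zh-CN")]
    # The dt parameter is repeated once per requested data type.
    params += [("dt", v) for v in ("at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t")]
    params += [("otf", "2"), ("ssel", "5"), ("tsel", "5"), ("kc", "1"),
               ("tk", tk), ("q", text)]
    # quote_via=quote encodes spaces as %20, matching urllib.parse.quote in buildUrl.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)
```

For example, build_url("hello world", tk, source="en", target="zh-CN") gives the English-to-Chinese form, with the text percent-encoded in the q parameter.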
-
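Google Translate returns its result as JSON, which json.loads() turns into nested lists, so you can pick out whatever you need by index. Splitting the raw text on double quotes also works, but it gets awkward once you want anything other than the translation itself.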
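As a small illustration of indexing into the parsed response, the sketch below joins the translated text of every segment instead of taking only result[0][0][0]. The sample payload is hypothetical and only imitates the layout I assume here: a list of segments in the first element, each starting with the translated text, and possibly a trailing entry whose first field is null. The real response may differ.

```python
import json

# Hypothetical payload imitating the assumed layout; not a captured response.
SAMPLE = (
    '[[["Hello. ","你好。",null,null,1],'
    '["How are you?","你好吗?",null,null,1],'
    '[null,null,null,"Nǐ hǎo"]],null,"zh-CN"]'
)


def extract_translation(raw):
    """Join the translated text of all segments in the first element."""
    data = json.loads(raw)
    segments = data[0] or []
    # Skip entries without a translated string (first field is null).
    return "".join(seg[0] for seg in segments if seg and seg[0])


print(extract_translation(SAMPLE))
```

This matters for longer inputs, where the text appears to come back split into several segments and result[0][0][0] would return only the first one.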