A Small Youdao Translation Tool in Python (selenium Automation)
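1. Foreword
This post describes a small translation tool built with Python and the selenium automation module, using the Youdao Translate web page to do the translating.
Background: the data I crawled contained English text that had to be translated into Chinese, so I first tried the Youdao translation API: https://ai.youdao.com/gw.s. The API is a paid service, however, and although Youdao gives away 50 in free credit, that was not enough for the volume of data involved.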
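2. Analysis
Although the API is limited, translation through the web page can be used any number of times.
Capturing the traffic showed that Youdao encrypts the returned data, which is cumbersome to handle, so I chose a different approach.
With python-selenium we can work much like an automated crawler: simulate a person typing the text into the page, then read back the translation result.
Youdao also has a new AI translation feature that can be tried out, and its results are decent.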
3. Implementation
Environment: Python 3+, plus the Chrome browser and a driver (chromedriver) that matches the browser version.
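Because the driver has to match the browser version, a quick check before running the tool can save some debugging. The following is a minimal sketch, assuming the version strings look like the output of the browser's and the driver's `--version` flag (for example `Google Chrome 104.0.5112.102` and `ChromeDriver 104.0.5112.79 (...)`). The helper names `major_version` and `versions_match` are hypothetical and not part of the tool below.

```python
import re


def major_version(version_output):
    """Return the major version number found in a version string, or None."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(chrome_output, driver_output):
    """True when both strings carry the same, parseable major version."""
    chrome_major = major_version(chrome_output)
    driver_major = major_version(driver_output)
    return chrome_major is not None and chrome_major == driver_major


if __name__ == "__main__":
    # Example strings only; paste the real output of the two commands here.
    print(versions_match("Google Chrome 104.0.5112.102",
                         "ChromeDriver 104.0.5112.79"))
```

Only the major version is compared here, which is an assumption about how strict the match needs to be.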
Code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
# Build the selenium Chrome driver
def get_driver(proxies=None, headless=True, strategy=None):
    chrome_options = Options()
    # Optional page load strategy: 'none' or 'eager'
    if strategy in ('none', 'eager'):
        chrome_options.page_load_strategy = strategy
    # chrome_options.binary_location = "E:\\Chrome\\chrome.exe"
    chrome_options.add_argument("--start-maximized")
    # Headless mode (no browser window)
    if headless:
        chrome_options.add_argument("--headless")
    # Proxy settings
    if proxies:
        chrome_options.add_argument("--proxy-server=" + proxies.get("https"))
    # Path to the chromedriver that matches the installed Chrome version
    driver_service = Service('E:/桌面/ChromeDerive104/chromedriver')
    driver = webdriver.Chrome(options=chrome_options, service=driver_service)
    # driver = webdriver.Chrome(options=chrome_options, executable_path="E:/桌面/ChromeDerive104/chromedriver")
    # driver = webdriver.Chrome(options=chrome_options)
    # Implicit wait: element lookups retry for up to 5 seconds
    driver.implicitly_wait(5)
    return driver
# Main Youdao translation function
# original_text : the text to translate
# ai_switch : AI translation switch, off by default
def get_translated_text(original_text, ai_switch=False):
    # Cookies obtained after logging in; only used for AI translation.
    # Replace each "xxx" placeholder with your own value.
    cookie = {
        "DICT_SESS": "xxx",
        "DICT_LOGIN": "xxx",
        "DICT_DOCTRANS_SESSION_ID": "xxx"
    }
    driver = get_driver(headless=True)
    try:
        driver.get('https://fanyi.youdao.com/indexLLM.html#/')
        if ai_switch:
            # Cookies can only be added after the site has been opened; reload afterwards
            for name in cookie:
                cookie_dict = {"domain": ".youdao.com", "path": "/", "name": name, "value": cookie[name]}
                driver.add_cookie(cookie_dict=cookie_dict)
            driver.get('https://fanyi.youdao.com/indexLLM.html#/')
        input_box = driver.find_element(By.XPATH, '//*[@id="js_fanyi_input"]')
        input_box.send_keys(original_text)
        output_text = ''
        if ai_switch:
            # Trigger the AI translation
            driver.find_element(By.XPATH, '//*[@id="bottom"]/div/div[2]/span[2]').click()
            while True:
                try:
                    # This element is present while the AI result is still being generated
                    driver.find_element(By.XPATH,
                                        '//*[@class="menu-item disabled color_text_5 disabled generating color_text_5"]')
                except Exception:
                    # The element is gone, so generation has finished: read the result
                    output_text = driver.find_element(By.XPATH, '//*[@class="origin-text color_text_1"]').get_attribute(
                        'innerHTML')
                    break
        else:
            # Give the ordinary translation time to appear, then join all result spans
            time.sleep(2)
            output_texts = driver.find_elements(By.XPATH, '//*[@id="js_fanyi_output_resultOutput"]/p/span')
            for i in output_texts:
                output_text += i.get_attribute('innerHTML')
        return output_text
    finally:
        # Always close the browser, even if a lookup fails
        driver.quit()
if __name__ == '__main__':
    text = get_translated_text(
        "[{'title': 'Acute toxicity', 'conclusion': 'LD50 Oral - Rat - 1.560 mg/kg|| Remarks: Behavioral:Coma.'}, {'title': 'Skin corrosion/irritation', 'conclusion': 'Skin - Rabbit|| Result: Severe skin irritation - 24 h (Draize Test)'}, {'title': 'Serious eye damage/eye irritation', 'conclusion': 'No data available'}, {'title': 'Respiratory or skin sensitisation', 'conclusion': 'No data available'}, {'title': 'Germ cell mutagenicity', 'conclusion': 'No data available'}, {'title': 'Carcinogenicity', 'conclusion': 'IARC: No component of this product present at levels greater than or equal to 0.1% is identified as probable, possible or confirmed human carcinogen by IARC.'}, {'title': 'Reproductive toxicity', 'conclusion': 'No data available'}, {'title': 'Specific target organ toxicity - single exposure', 'conclusion': 'Inhalation - May cause respiratory irritation.'}, {'title': 'Specific target organ toxicity - repeated exposure', 'conclusion': 'No data available'}, {'title': 'Aspiration hazard', 'conclusion': 'No data available'}, {'title': 'Additional Information', 'conclusion': 'RTECS: SL7875000|| To the best of our knowledge, the chemical, physical, and toxicological properties have not been thoroughly investigated.'}, {'title': 'Toxicity', 'conclusion': 'LD50 orally in rats: 1560 mg/kg (Jenner)'}]", ai_switch=True)
    print(text)
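The `while True` loop in the AI branch has no upper bound, so it would keep spinning if the page never finished generating. One possible way to bound it is a small polling helper with a timeout. This is only a sketch and not part of the original tool: `poll_until` and its parameters are hypothetical names, and the usage shown in the trailing comment assumes a variable `generating_xpath` holding the "generating" XPath from the code above.

```python
import time


def poll_until(condition, timeout=30.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Hypothetical usage inside get_translated_text. find_elements returns an
# empty list instead of raising when nothing matches:
#
#   poll_until(lambda: not driver.find_elements(By.XPATH, generating_xpath))
```

selenium's own `WebDriverWait(driver, timeout).until(...)` covers the same need. The plain-Python version is shown only to make the timeout logic explicit.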
Note: the cookie values must be replaced with the ones taken from the web page after logging in with your own account.
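One way to collect these values is to copy the `Cookie` request header from the browser's developer tools after logging in, then pick out the three names the code uses. A minimal sketch, assuming the header is a plain `name=value; name=value` string. `pick_cookies` is a hypothetical helper, and the sample values are placeholders.

```python
WANTED = ("DICT_SESS", "DICT_LOGIN", "DICT_DOCTRANS_SESSION_ID")


def pick_cookies(cookie_header, wanted=WANTED):
    """Parse a 'k=v; k=v' Cookie header and keep only the wanted names."""
    found = {}
    for pair in cookie_header.split(";"):
        # partition splits on the first '=' only, so values may contain '='
        name, sep, value = pair.strip().partition("=")
        if sep and name in wanted:
            found[name] = value
    return found


if __name__ == "__main__":
    # Placeholder header; paste the one copied from your own browser instead.
    header = "other=placeholder; DICT_SESS=aaa; DICT_LOGIN=bbb"
    print(pick_cookies(header))
```

The resulting dictionary has the same shape as the `cookie` dictionary in `get_translated_text`, so it could be used in its place.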
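Taking my current crawler project as an example, the result looks as follows (turning on AI translation helps keep the formatting unchanged):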