A further improved version (still low-tech) -- a Python translation crawler explained

This update adds support for querying short passages.

Here is the process:

First, find a short passage:

Today, my mother has something to do, so she needs to meet her friends. I make a promise that I will clean the house when she returns. But I become lazy soon. I start to play computer games and forget my promise. When I remember, I start my work quickly. I feel so bad to almost let my mother down. Luckily, I finish my job and my mother feels happy. 

Paste it into the Youdao translation site, and the generated URL is:

http://www.youdao.com/w/Today%2C%20my%20mother%20has%20something%20to%20do%2C%20so%20she%20needs%20to%20meet%20her%20friends.%20I%20make%20a%20promise%20that%20I%20will%20clean%20the%20house%20when%20she%20returns.%20But%20I%20become%20lazy%20soon.%20I%20start%20to%20play%20computer%20games%20and%20forget%20my%20promise.%20When%20I%20remember%2C%20I%20start%20my%20work%20quickly.%20I%20feel%20so%20bad%20to%20almost%20let%20my%20mother%20down.%20Luckily%2C%20I%20finish%20my%20job%20and%20my%20mother%20feels%20happy.%20/#keyfrom=dict2.top

A careful comparison with the URL generated by the site shows that every space has been replaced with "%20" and every ',' with "%2C".

So a simple for loop over the characters is enough to generate a URL in the required format.
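For reference, Python's standard library already provides this encoding, so the hand-rolled loop can be cross-checked against it. A minimal sketch (not part of the original script) using urllib.parse.quote:

```python
from urllib.parse import quote

text = "Today, my mother has something to do"
# quote() percent-encodes reserved characters: ' ' becomes %20, ',' becomes %2C
print(quote(text))  # Today%2C%20my%20mother%20has%20something%20to%20do
```

Note that quote() also encodes characters the simple loop ignores (such as '?' or '&'), which makes it the safer choice for arbitrary input.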

Now, to compare the generated URL against the site's URL, we run a C++ program that checks the two strings character by character:

#include <cstdio>
#include <cstring>
using namespace std;
int main()
{
    char a[1005], b[1005];
    scanf("%s%s", a, b); // safe here: an encoded URL contains no whitespace
    if (strlen(a) != strlen(b)) // strings of different lengths can never match
    {
        printf("NO\n");
        return 0;
    }
    bool flag = true;
    for (int i = 0; a[i]; i ++)
    {
        if (a[i] != b[i])
        {
            printf("%d\n", i); // index of the first mismatching character
            flag = false;
            break;
        }
    }
    if (flag) printf("YES\n");
    else printf("NO\n");
    return 0;
}

It prints YES! The guess is correct: the two URLs are identical.
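If you would rather stay in one language, the same check is easy to sketch in Python (a hypothetical helper, not part of the original post):

```python
def first_mismatch(a, b):
    """Return the index of the first differing position, or None if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # If one string is a proper prefix of the other, they differ at the shorter length
    return None if len(a) == len(b) else min(len(a), len(b))

print(first_mismatch("http://a/x", "http://a/x"))  # None -> the URLs match
```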

Next, read the page's HTML source.

We quickly locate the useful content in the <p> tags under <div id="fanyiToggle">, and from there we can use bs4 to filter and scrape it.
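To illustrate the extraction step in isolation, here is a sketch run against a trimmed, hand-written fragment of the page (the markup below is assumed for illustration; only the div id comes from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the real page structure
html = '''
<div id="fanyiToggle">
  <div class="trans-container">
    <p>I feel so bad to almost let my mother down.</p>
    <p>差点让妈妈失望,我感觉很糟糕。</p>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
paras = soup.find("div", id="fanyiToggle")("p")  # calling a Tag is shorthand for find_all
print(paras[1].get_text())  # the second <p> carries the translation
```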

Below is the source code, wrapped into functions. It is fairly fault-tolerant overall, but it cannot handle input that contains blank lines.

That wraps up this first little home-made crawler, and I'm quite happy with it. One thing that could still be added is translation between multiple languages; I'll get to that later if I feel like it.

Thanks to Youdao Translate!!! For learning purposes only; commercial use is prohibited.

import requests
from bs4 import BeautifulSoup
import bs4

url1 = "http://www.youdao.com/w/"
url2 = "/#keyfrom=dict2.top"
kv = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'}

def getword(url):  # fetch the HTML page for a single word
    try:
        r = requests.get(url, headers=kv, timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        print("Word lookup failed")
        return ""

def printword(text):  # print the useful part of a word page
    if not text:
        return
    soup = BeautifulSoup(text, "html.parser")
    tab = soup.find('div', id="phrsListTab")
    if tab is None:
        print("No result found")
        return
    for word in tab.children:
        if isinstance(word, bs4.element.Tag):
            tds = word('li')  # calling a Tag is shorthand for find_all
            for td in tds:
                print(td.contents[0])

def getarticleurl(text):  # build the query URL for a short passage
    # Percent-encode the two characters observed in the site's URL: ' ' and ','
    encoded = ""
    for ch in text:
        if ch == ' ':
            encoded += "%20"
        elif ch == ',':
            encoded += "%2C"
        else:
            encoded += ch
    return url1 + encoded + url2

def getarticle(url):  # fetch the HTML page for a short passage
    try:
        r = requests.get(url, headers=kv, timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        print("Passage translation failed")
        return ""

def printarticle(text):  # print the translated passage
    if not text:
        return
    soup = BeautifulSoup(text, "html.parser")
    toggle = soup.find('div', id="fanyiToggle")
    if toggle is None:
        print("No result found")
        return
    for word in toggle.children:
        if isinstance(word, bs4.element.Tag):
            tds = word('p')
            if tds:
                print(tds[1].contents[0])  # the second <p> holds the translation

def main():  # entry point
    print()
    print("Press 1 to look up a word, 2 to translate a short passage (one paragraph, no blank lines)")
    key = input().strip()
    if key == '1':
        print("Enter the word to translate:")
        printword(getword(url1 + input() + url2))
    elif key == '2':
        print("Enter the passage to translate:")
        printarticle(getarticle(getarticleurl(input())))
    else:
        print("Please enter a valid command (1 or 2)!")

while 1:
    main()

 
