A further improved version (still low-tech) -- a Python translation crawler explained

This update adds support for querying short passages.

Here is the process:

First, find a short passage:

Today, my mother has something to do, so she needs to meet her friends. I make a promise that I will clean the house when she returns. But I become lazy soon. I start to play computer games and forget my promise. When I remember, I start my work quickly. I feel so bad to almost let my mother down. Luckily, I finish my job and my mother feels happy. 

Paste it into the Youdao translation site, and the generated URL is:

http://www.youdao.com/w/Today%2C%20my%20mother%20has%20something%20to%20do%2C%20so%20she%20needs%20to%20meet%20her%20friends.%20I%20make%20a%20promise%20that%20I%20will%20clean%20the%20house%20when%20she%20returns.%20But%20I%20become%20lazy%20soon.%20I%20start%20to%20play%20computer%20games%20and%20forget%20my%20promise.%20When%20I%20remember%2C%20I%20start%20my%20work%20quickly.%20I%20feel%20so%20bad%20to%20almost%20let%20my%20mother%20down.%20Luckily%2C%20I%20finish%20my%20job%20and%20my%20mother%20feels%20happy.%20/#keyfrom=dict2.top

A careful comparison with the URL generated by the site shows that every space has been replaced with "%20" and every ',' with "%2C".

So a simple for loop over the characters is enough to generate a URL in the required format.
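For reference, Python's standard library already provides this encoding, so the hand-rolled loop can be cross-checked against it. A minimal sketch (not part of the original script) using urllib.parse.quote:

```python
from urllib.parse import quote

text = "Today, my mother has something to do"
# quote() percent-encodes reserved characters: ' ' becomes %20, ',' becomes %2C
print(quote(text))  # Today%2C%20my%20mother%20has%20something%20to%20do
```

Note that quote() also encodes characters the simple loop ignores (such as '?' or '&'), which makes it the safer choice for arbitrary input.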

Now, to compare the generated URL against the site's URL, we run a C++ program that checks the two strings character by character:

#include <cstdio>
#include <cstring>
using namespace std;
int main()
{
    char a[1005], b[1005];
    scanf("%s%s", a, b); // safe here: an encoded URL contains no whitespace
    if (strlen(a) != strlen(b)) // strings of different lengths can never match
    {
        printf("NO\n");
        return 0;
    }
    bool flag = true;
    for (int i = 0; a[i]; i ++)
    {
        if (a[i] != b[i])
        {
            printf("%d\n", i); // index of the first mismatching character
            flag = false;
            break;
        }
    }
    if (flag) printf("YES\n");
    else printf("NO\n");
    return 0;
}

It prints YES! The guess is correct: the two URLs are identical.
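If you would rather stay in one language, the same check is easy to sketch in Python (a hypothetical helper, not part of the original post):

```python
def first_mismatch(a, b):
    """Return the index of the first differing position, or None if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # If one string is a proper prefix of the other, they differ at the shorter length
    return None if len(a) == len(b) else min(len(a), len(b))

print(first_mismatch("http://a/x", "http://a/x"))  # None -> the URLs match
```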

Next, read the page's HTML source.

We quickly locate the useful content in the <p> tags under <div id="fanyiToggle">, and from there we can use bs4 to filter and scrape it.
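To illustrate the extraction step in isolation, here is a sketch run against a trimmed, hand-written fragment of the page (the markup below is assumed for illustration; only the div id comes from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the real page structure
html = '''
<div id="fanyiToggle">
  <div class="trans-container">
    <p>I feel so bad to almost let my mother down.</p>
    <p>差点让妈妈失望,我感觉很糟糕。</p>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
paras = soup.find("div", id="fanyiToggle")("p")  # calling a Tag is shorthand for find_all
print(paras[1].get_text())  # the second <p> carries the translation
```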

Below is the source code, wrapped into functions. It is fairly fault-tolerant overall, but it cannot handle input that contains blank lines.

That wraps up this first little home-made crawler, and I'm quite happy with it. One thing that could still be added is translation between multiple languages; I'll get to that later if I feel like it.

Thanks to Youdao Translate!!! For learning purposes only; commercial use is prohibited.

import requests
from bs4 import BeautifulSoup
import bs4

url1 = "http://www.youdao.com/w/"
url2 = "/#keyfrom=dict2.top"
kv = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'}

def getword(url):  # fetch the HTML page for a single word
    try:
        r = requests.get(url, headers=kv, timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        print("Word lookup failed")
        return ""

def printword(text):  # print the useful part of a word page
    if not text:
        return
    soup = BeautifulSoup(text, "html.parser")
    tab = soup.find('div', id="phrsListTab")
    if tab is None:
        print("No result found")
        return
    for word in tab.children:
        if isinstance(word, bs4.element.Tag):
            tds = word('li')  # calling a Tag is shorthand for find_all
            for td in tds:
                print(td.contents[0])

def getarticleurl(text):  # build the query URL for a short passage
    # Percent-encode the two characters observed in the site's URL: ' ' and ','
    encoded = ""
    for ch in text:
        if ch == ' ':
            encoded += "%20"
        elif ch == ',':
            encoded += "%2C"
        else:
            encoded += ch
    return url1 + encoded + url2

def getarticle(url):  # fetch the HTML page for a short passage
    try:
        r = requests.get(url, headers=kv, timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        print("Passage translation failed")
        return ""

def printarticle(text):  # print the translated passage
    if not text:
        return
    soup = BeautifulSoup(text, "html.parser")
    toggle = soup.find('div', id="fanyiToggle")
    if toggle is None:
        print("No result found")
        return
    for word in toggle.children:
        if isinstance(word, bs4.element.Tag):
            tds = word('p')
            if tds:
                print(tds[1].contents[0])  # the second <p> holds the translation

def main():  # entry point
    print()
    print("Press 1 to look up a word, 2 to translate a short passage (one paragraph, no blank lines)")
    key = input().strip()
    if key == '1':
        print("Enter the word to translate:")
        printword(getword(url1 + input() + url2))
    elif key == '2':
        print("Enter the passage to translate:")
        printarticle(getarticle(getarticleurl(input())))
    else:
        print("Please enter a valid command (1 or 2)!")

while 1:
    main()

 
