Scraping Kaoyan (Postgraduate Entrance Exam) Vocabulary with Synonyms, Antonyms, and Example Sentences in Python

Latest recommended article published on 2024-05-02 06:13:14

是强筱华哇！

Article link: https://blog.csdn.net/hua_you_qiang/article/details/115224398

Preparation

Environment: Jupyter Notebook or PyCharm
Python version: Python 3.x
Browser: Chrome
Libraries used:

requests
bs4
os
enchant
json
time

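Libraries you may need to install (lxml is included because the code below passes 'lxml' to BeautifulSoup as its parser)

pip install requests -i https://pypi.tsinghua.edu.cn/simple
pip install beautifulsoup4 -i https://pypi.tsinghua.edu.cn/simple
pip install lxml -i https://pypi.tsinghua.edu.cn/simple
pip install pyenchant -i https://pypi.tsinghua.edu.cn/simple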

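Approach

First, find a site that lists the kaoyan vocabulary and scrape the words.
Then look up each scraped word on a dictionary site and scrape its synonyms, antonyms, and example sentences.
Finally, save everything to a JSON file.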

Scraping workflow

Scraping the vocabulary

Source URL: http://word.iciba.com/?action=courses&classid=13

Import the required libraries

import requests
from bs4 import BeautifulSoup
import os
import enchant  # listed by the author; not referenced in the code shown in this post
import json
from time import sleep

# Browser-like request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
    "cookie": "UM_distinctid=1785a4cc9f7303-069f2552a41de7-5771031-144000-1785a4cc9f856b",
    "upgrade-insecure-requests": "1",
}

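Fetching the page source

# Fetch the page source; return '' on any failure
def get_html(url):
    try:
        r = requests.get(url=url, headers=headers)
        r.raise_for_status()  # raise for 4xx/5xx responses
        print("status code:", r.status_code)
        r.encoding = r.apparent_encoding  # use the encoding guessed from the content
        return r.text
    except Exception as result:
        print("error 0:", result)
        return ''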

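Extracting the words and their meanings

This is the core of the vocabulary scraper, so let's analyze it step by step.
Open the URL and you will see a vocabulary list.

Press Ctrl+Shift+I to open the developer tools, press Ctrl+Shift+C, and hover over the list on the page.

This quickly locates the markup we need to analyze.
Notice that course_id changes in a regular pattern from one list entry to the next.
Open the first entry in the list.

The URL also contains a course parameter. Change course=1 to course=2
and the page jumps to the second page. That is the pattern: iterating over the course value lets us scrape every word.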
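The paging pattern can be sketched as a small URL generator. This is only an illustration: the URL template is the one that appears in the full script later in this post, while the helper name course_urls is made up for the example.

```python
# URL template taken from the full script in this post; only `course` changes.
BASE = "http://word.iciba.com/?action=words&class=13&course={}"


def course_urls(first, last):
    """Return the list-page URLs for course numbers first..last (inclusive).

    `course_urls` is a hypothetical helper, not part of the original script.
    """
    return [BASE.format(i) for i in range(first, last + 1)]


urls = course_urls(1, 3)
for u in urls:
    print(u)
```

Building the URLs up front also makes it easy to scrape a small range first and check the output before committing to the whole list.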

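Next, scrape each word and its meaning.
Press Ctrl+Shift+I again and locate one of the words.
You will see a list of <li> elements; expand one to see what it contains.

What we want is inside the <span> elements (the phonetic transcription is in <strong>; it is not scraped here, but you can add code for it if you need it).
The analysis leads to this conclusion:

Changing the value of course moves between pages. Each word and its meaning sit in one <li>, inside that <li>'s <span> elements.

So the code can be written like this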

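# Extract each word and its meaning
def get_words(text):
    word_dict = dict()
    soup = BeautifulSoup(text, 'lxml')
    for each in soup.find_all('li'):
        span = each.select('span')
        # Skip <li> elements that are not word entries (fewer than two <span>, or no title)
        if len(span) < 2 or 'title' not in span[0].attrs or 'title' not in span[1].attrs:
            continue
        word_dict[span[0].attrs['title']] = span[1].attrs['title']
    return word_dict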

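A small detail
span.text.strip() would also return text, so why does the code read span.attrs['title'] instead?
The original post answers this with a screenshot of the element, which is not reproduced here.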
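One plausible reading of that detail, and it is an assumption because the screenshot is unavailable, is that the visible text of a <span> is shortened while its title attribute holds the full string. The markup below is invented for the demonstration and uses html.parser so that it needs no extra parser; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the visible text is cut short, the title is complete.
SAMPLE = """
<ul>
  <li>
    <span title="abandon">abandon</span>
    <strong>[phonetic]</strong>
    <span title="vt. to give up; to desert completely">vt. to give up...</span>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
spans = soup.find("li").select("span")

visible = spans[1].text.strip()   # what the browser shows
full = spans[1].attrs["title"]    # what the title attribute stores

print("text :", visible)
print("title:", full)
```

If the two values always matched on the real site, either access would work; reading title is simply the safer choice when they can differ.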
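With the core of the vocabulary scraper solved, the next question is how to organize the rest.
Option 1:
Scrape and save all the vocabulary first. Then read the words back from disk, look them up one by one on the dictionary site, and save the results at the end.
Option 2:
Scrape one page of the vocabulary list, look up that page's words on the dictionary site, and save the results before moving to the next page.

Both options have pros and cons; I have tried both. The first time I used Option 1, forgot to add a sleep, scraped too fast, and got my IP banned. So this time I use Option 2: alternate between the two sites and add a sleep to lower the request rate.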
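The idea of spacing out requests can be sketched independently of any site. In this minimal example fetch_politely is a hypothetical helper; the fetch function and the sleep function are passed in so the demonstration runs without touching the network or actually waiting. The 2-second default matches the sleep(2) used in the full script.

```python
import time


def fetch_politely(fetch, urls, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    Hypothetical helper for illustration; `fetch` could be get_html.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results[url] = fetch(url)
    return results


# Stand-ins: a fake fetcher and a list that records the requested pauses.
pauses = []
pages = fetch_politely(lambda u: f"<html>{u}</html>", ["a", "b", "c"],
                       delay=2.0, sleep=pauses.append)
print(pauses)
```

In real use you would call fetch_politely(get_html, urls) and leave sleep at its default.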

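Scraping synonyms, antonyms, and example sentences

URL: http://dict.cn/
Type in any word.

After searching, the URL is simply the site address followed by the searched word, so the synonyms, antonyms, and example sentences of every word can also be fetched in a loop.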
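A lookup URL can therefore be built by appending the word to the site address. The sketch below percent-encodes the word with urllib.parse.quote, which is a precaution added here for entries that contain spaces or other special characters; the original script concatenates the word directly, and lookup_url is a hypothetical name.

```python
from urllib.parse import quote


def lookup_url(word):
    """Build the dict.cn lookup URL for a word (hypothetical helper)."""
    # quote() leaves plain ASCII letters untouched and escapes spaces etc.
    return "http://dict.cn/" + quote(word.strip())


print(lookup_url("data"))
print(lookup_url(" ad hoc "))
```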

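Analyzing the example sentences

Inspecting the page shows that the example sentences are the <li> items under the <ol> inside the <div class="layout sort"> container.
The core code can be written like this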

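# soup is the BeautifulSoup object for the dict.cn result page
div = soup.find('div', class_="layout sort")
if div is not None:  # some pages may have no example-sentence block
    for li in div.select('li'):
        print(li.text)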
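The same extraction can be tried offline on a small sample. The HTML below is invented to match the structure described above (a layout sort container with an ol of li items), extract_sentences is a hypothetical helper, and html.parser is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the structure described in the text.
SAMPLE = """
<div class="layout sort">
  <ol>
    <li>We need more data.</li>
    <li>The data is stored locally.</li>
  </ol>
</div>
"""


def extract_sentences(html):
    """Return the example sentences, or [] if the container is missing."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="layout sort")
    if container is None:
        return []
    return [li.get_text(" ", strip=True) for li in container.select("li")]


sentences = extract_sentences(SAMPLE)
print(sentences)
```

Returning an empty list when the container is absent lets the caller save the word anyway instead of failing on it.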

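Scraping synonyms and antonyms

Locate the synonym and antonym section.
A closer look shows that all of these words are inside <div class="layout nfo">, with synonyms and antonyms each stored in their own <ul>. Deciding which is which by the position of the <ul> could go wrong: the word data, for example, has only synonyms.
So we need another approach.
Inside <div class="layout nfo"> there are also <div> elements that hold the labels 【近义词】 (synonyms) and 【反义词】 (antonyms). We can start from a label and follow the document forward until we reach a <ul> node, then collect all the <li> children of that <ul>.
Each <li> contains an <a> that usually holds both English and Chinese text, but for the word good the synonyms and antonyms have no Chinese.

So each entry is stored as a list.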

word_li = []
for i in soup.find_all('li'):
    if i.text.strip():
        word_li.append(i.text.strip().split())
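The label-first lookup can be sketched end to end on sample markup. Everything here is illustrative: the HTML is invented to match the description (only a synonym block, as for the word data), words_after_label is a hypothetical helper, and find_next('ul') is used as one way to "follow the document forward until a <ul>".

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a synonym label followed by its <ul>, no antonyms.
SAMPLE = """
<div class="layout nfo">
  <div>【近义词】</div>
  <ul>
    <li><a href="#">information 信息</a></li>
    <li><a href="#">facts</a></li>
  </ul>
</div>
"""


def words_after_label(html, label):
    """Return the entries of the first <ul> after the <div> holding `label`."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", class_="layout nfo")
    if block is None:
        return []
    for div in block.find_all("div"):
        if label in div.get_text():
            ul = div.find_next("ul")  # first <ul> after the label
            if ul is None:
                return []
            return [li.get_text().split()
                    for li in ul.find_all("li") if li.get_text().strip()]
    return []


synonyms = words_after_label(SAMPLE, "近义词")
antonyms = words_after_label(SAMPLE, "反义词")
print(synonyms, antonyms)
```

Because the lookup starts from the label rather than from the position of the <ul>, a word with only synonyms yields an empty antonym list instead of a mislabeled one.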

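With the core parts analyzed, the next step is to design the storage structure.

word_all_dict[word] = {"translate": "",
                       "homoionym": [lists of English words],
                       "antonym": [lists of English words],
                       "sentence": [English sentences with Chinese translations]}

This is my storage structure.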
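As a concrete illustration, one record in this layout can be serialized to a single line of JSON and read back unchanged, which is what makes the one-record-per-line file used later workable. The values are invented sample data; ensure_ascii=False is an optional choice made here so the Chinese text stays readable in the file.

```python
import json

# Invented sample record in the storage layout described above.
record = {
    "data": {
        "translate": "n. 数据",
        "homoionym": [["information", "信息"], ["facts"]],
        "antonym": [],
        "sentence": ["We need more data."],
    }
}

# One record becomes exactly one line of text.
line = json.dumps(record, ensure_ascii=False)
restored = json.loads(line)

print(line)
```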

The full code

import requests
from bs4 import BeautifulSoup
import os
import enchant  # listed by the author; not referenced in the code shown in this post
import json
from time import sleep

# Browser-like request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
    "cookie": "UM_distinctid=1785a4cc9f7303-069f2552a41de7-5771031-144000-1785a4cc9f856b",
    "upgrade-insecure-requests": "1",
}

# Fetch the page source; return '' on any failure
def get_html(url):
    try:
        r = requests.get(url=url, headers=headers)
        r.raise_for_status()  # raise for 4xx/5xx responses
        print("status code:", r.status_code)
        r.encoding = r.apparent_encoding  # use the encoding guessed from the content
        return r.text
    except Exception as result:
        print("error 0:", result)
        return ''

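# Extract each word and its meaning
def get_words(text):
    word_dict = dict()
    soup = BeautifulSoup(text, 'lxml')
    for each in soup.find_all('li'):
        span = each.select('span')
        # Skip <li> elements that are not word entries (fewer than two <span>, or no title)
        if len(span) < 2 or 'title' not in span[0].attrs or 'title' not in span[1].attrs:
            continue
        word_dict[span[0].attrs['title']] = span[1].attrs['title']
    return word_dict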

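# Helper that collects the synonym or antonym entries from one <ul>
def check_word(tag):
    soup = BeautifulSoup(str(tag), 'lxml')
    word_li = []
    for i in soup.find_all('li'):
        if i.text.strip():
            # Each entry becomes a list: [English word] or [English word, Chinese gloss, ...]
            word_li.append(i.text.strip().split())
    return word_li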

def get_word_all(word, translate, search_text):
    """
    word: the headword
    translate: the Chinese meaning of word
    search_text: the HTML of the dict.cn page for word
    """
    # Result layout:
    #     word_all_dict[word] = {"translate": "",
    #                            "homoionym": [lists of English words],
    #                            "antonym": [lists of English words],
    #                            "sentence": [English sentences with Chinese translations]}
    word_all_dict = dict()
    word_all_dict[word] = {"translate": translate, "homoionym": list(), "antonym": list(), "sentence": list()}
    # Make the soup
    soup = BeautifulSoup(search_text, 'lxml')
    # Example sentences (the block may be missing for some words)
    sort_div = soup.find('div', class_="layout sort")
    if sort_div is not None:
        for each in sort_div.select('li'):
            word_all_dict[word]["sentence"].append(each.text)
    # Synonyms and antonyms (the block may be missing for some words)
    nfo_div = soup.find('div', class_="layout nfo")
    if nfo_div is not None:
        for tmp in nfo_div.select('div'):
            # Start from the label and take the first <ul> that follows it
            if "近义词" in tmp.text:
                word_all_dict[word]["homoionym"] = check_word(tmp.find_next('ul'))
            if "反义词" in tmp.text:
                word_all_dict[word]["antonym"] = check_word(tmp.find_next('ul'))
    return word_all_dict

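# Append one word record to the file, one JSON object per line
def save_word_json(word, word_dict):
    path = 'D:/English/'
    if not os.path.exists(path):
        os.makedirs(path)
    # ensure_ascii=False keeps the Chinese text readable in the file
    str_dict = json.dumps(word_dict, ensure_ascii=False)
    with open(path + 'Words.json', 'a+', encoding='utf-8') as fp:
        fp.write(str_dict)
        fp.write('\n')
    print(f"{word} saved")

# Read the saved file back, one JSON object per line
def load_json():
    path = 'D:/English/Words.json'  # the same file that save_word_json writes
    with open(path, 'r', encoding='utf-8') as fp:
        data = fp.readlines()
    for each in data:
        word_dict = json.loads(each)
        print(word_dict)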

def main():
    count = 0
    for i in range(1, 275):
        print(f"Scraping word-list page {i}")
        url = f'http://word.iciba.com/?action=words&class=13&course={i}'
        text = get_html(url)
        all_word_dict = get_words(text)
        for word in all_word_dict:
            print(f"Scraping word {count+1}: {word}")
            wurl = 'http://dict.cn/' + word
            search_text = get_html(wurl)
            sleep(2)  # pause after every lookup, including failed ones
            if not search_text:
                continue
            try:
                word_all = get_word_all(word, all_word_dict[word], search_text)
                save_word_json(word, word_all)
            except Exception as result:
                print(f"Failed to save {word}; error:", result)
            count += 1

if __name__ == '__main__':
    main()

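Notes:

To read the results back, call load_json().
The results are saved under D:/English/; if the directory does not exist, the code creates it.
The file is opened in append mode, so do not run the script twice over the same range: new output is appended rather than overwriting the old.
You can change the numbers in for i in range(1, 275) to scrape in stages.
A JSON file can be opened in an editor such as Visual Studio Code.
A full run takes about two hours.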
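Since the file is append-only, one way to make staged or repeated runs safer is to read back the words already saved and skip them. This is a sketch of that idea, not part of the original script: saved_words is a hypothetical helper, and the demonstration writes to a temporary directory rather than D:/English/.

```python
import json
import os
import tempfile


def saved_words(path):
    """Return the set of headwords already stored, one JSON object per line."""
    if not os.path.exists(path):
        return set()
    words = set()
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:  # ignore blank lines
                words.update(json.loads(line).keys())
    return words


# Demonstration in a temporary directory with one saved record.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "Words.json")
    with open(demo_path, "a", encoding="utf-8") as fp:
        fp.write(json.dumps({"data": {"translate": "n."}}) + "\n")
    done = saved_words(demo_path)
    missing = saved_words(os.path.join(folder, "absent.json"))

# Only words that are not yet saved remain to be scraped.
todo = [w for w in ["data", "good"] if w not in done]
print(todo)
```

In the main loop, the equivalent check would be to skip any word already in the set before requesting its dict.cn page.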
