Python in Practice | Scraping 37,000+ Four-Character Idioms with BeautifulSoup + requests + Multithreading

丶Xylon

Article link: https://blog.csdn.net/Xylon_/article/details/99288764

GitHub project: https://github.com/xylon666/idiom

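Requirements

IDE: PyCharm

Third-party libraries: requests, BeautifulSoup (the code below also uses the lxml parser)

Browser: Chrome

Scraping target:

All four-character idioms on the idiom site (成语大全网): http://chengyu.tqnxs.com/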

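I. Analyzing the page

The site indexes idioms by first letter, then by pinyin syllable, and finally lists every idiom under that syllable.

Some syllables have a large number of idioms, so their results are split across several pages.

Our scraping plan is therefore:

1. Visit http://chengyu.tqnxs.com + first letter to get all pinyin syllables under that letter.

2. Visit http://chengyu.tqnxs.com + '/' + first letter + '/' + pinyin to get the total page count and build the complete links.

3. Visit each complete link and download all the idioms.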
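The three steps above correspond to three URL shapes. The following is a minimal sketch of that mapping, not code from the original project: the helper names `letter_url`, `pinyin_url`, and `page_url` are hypothetical, and the `/<n>.html` suffix for pages after the first is taken from the download loop shown later in this post.

```python
BASE_URL = 'http://chengyu.tqnxs.com/'

def letter_url(letter):
    """Step 1: index page for one first letter, e.g. 'A'."""
    return BASE_URL + letter.upper() + '/'

def pinyin_url(letter, pinyin):
    """Step 2: first page of idioms for one pinyin syllable."""
    return letter_url(letter) + pinyin

def page_url(letter, pinyin, page):
    """Step 3: URL of a given result page (page 1 has no suffix)."""
    base = pinyin_url(letter, pinyin)
    return base if page == 1 else base + '/' + str(page) + '.html'

print(letter_url('A'))
print(pinyin_url('A', 'an'))
print(page_url('A', 'an', 3))
```

Keeping URL construction in small pure functions makes it easy to check the patterns against the site before any request is sent.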

II. Initial setup

Imports and URLs

import os
import concurrent.futures

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
}
url = 'http://chengyu.tqnxs.com/'
Linkfile = 'linkfile.txt'       # file that stores the collected links
Wordfile = 'wordfile.txt'       # file that stores the downloaded idioms

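Because we need to request pages many times, the request logic is wrapped in its own function. It returns the page text on success and None on any failure:

def WordHtml(url):
    try:
        # timeout so that a stalled request cannot hang a worker thread
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None         # request failed or status was not 200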

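III. Getting the first letters and pinyin

The first letters can be visited directly with a loop from A to Z.

Right-click a pinyin entry and choose Inspect: the entries sit inside the element with class='yingList clearfix'.

The a tags underneath contain direct links, and BeautifulSoup could read their href attributes directly. However, because only a few idioms start with "ang", they are displayed directly on the original page, and we still need to visit the dedicated page to get the idioms starting with "ang".

So we stick with the original plan and build the links from the text content of the tags.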

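def WordLink(url):
    if os.path.exists(Linkfile):            # link file already exists; nothing to do
        return
    for i in range(26):
        turl = url + chr(65 + i) + '/'      # index page for each letter, A to Z
        html = WordHtml(turl)               # fetch the page
        if not html:
            continue
        soup = BeautifulSoup(html, 'lxml')  # parse the page
        box = soup.find('div', class_='yingList clearfix')   # container with the pinyin list
        if box is None:
            continue
        pinyins = box.text.split()          # split the text on whitespace into a list
        print(pinyins)
        with open(Linkfile, 'a', encoding='utf-8') as f:     # append the links to the file
            for p in pinyins:
                f.write(turl + p + '\n')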
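To make the text-to-links step easier to follow, here is a small standalone sketch using only the standard library. The function name `links_for_letter` and the sample text are hypothetical; the sample only imitates the shape of the whitespace-separated text that `.text` returns for the pinyin container.

```python
import os
import string
import tempfile

def links_for_letter(letter_page_url, pinyin_text):
    """Turn whitespace-separated pinyin text into one link per syllable."""
    return [letter_page_url + p for p in pinyin_text.split()]

# Hypothetical text, shaped like the output of .text on the pinyin container.
sample_text = '\n a \n ai \n an \n ang \n ao \n'
links = links_for_letter('http://chengyu.tqnxs.com/A/', sample_text)

# chr(65 + i) for i in range(26) walks the same letters as string.ascii_uppercase.
letters = [chr(65 + i) for i in range(26)]
assert letters == list(string.ascii_uppercase)

# Write the links one per line, then read them back the way the crawler does.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'linkfile.txt')
    with open(path, 'a', encoding='utf-8') as f:
        for link in links:
            f.write(link + '\n')
    with open(path, encoding='utf-8') as f:
        restored = [line.strip() for line in f if line.strip()]

print(restored)
```

Calling `split()` with no arguments discards all surrounding whitespace and newlines, which is why the raw tag text can be used without further cleanup.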

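IV. Getting the links to every page

Inspecting the pagination buttons shows that paginated pages contain an element with class="pages", while single-page results do not.

We can therefore look for this class to tell whether a result is paginated, and then read the last page number, which is the content of the tag just before "下一页" (next page): '15' in this example.

The code for getting the complete links looks like this: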

for i in lines:                 # lines: the links read back from Linkfile
    html = WordHtml(str(i))
    if html:
        soup = BeautifulSoup(html, 'lxml')
        pages = soup.find('div', class_='pages')    # present only on paginated results
        if pages:
            page = soup.find(class_='pages').find_all('a')[-2].string   # last page number
            # print(page)

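After parsing the page with BeautifulSoup, we look up class='pages' and then find all the a tags inside it:

page = soup.find(class_='pages').find_all('a')[-2].string

Index [-2] selects the second-to-last tag, and .string gives its content.

Looping from page 2 up to this maximum page number then yields the complete set of URLs.

At the same time, we use a thread pool to make the download more efficient: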
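The [-2] rule and the page loop can be illustrated without touching the network. The sketch below works on a plain list of anchor texts; `last_page` and `page_links` are hypothetical helper names, and the sample pager is made up to match the description above (last page number directly before "下一页").

```python
def last_page(anchor_texts):
    """Return the highest page number, given the text of each <a> in the pager."""
    if len(anchor_texts) < 2:
        return 1                                    # no pager: single page
    candidate = (anchor_texts[-2] or '').strip()    # tag just before "next page"
    return int(candidate) if candidate.isdigit() else 1

def page_links(link, anchor_texts):
    """First page plus link/2.html ... link/N.html."""
    pages = [link]
    for p in range(2, last_page(anchor_texts) + 1):
        pages.append(link + '/' + str(p) + '.html')
    return pages

anchors = ['1', '2', '3', '15', '下一页']            # hypothetical pager contents
print(last_page(anchors))
print(page_links('http://chengyu.tqnxs.com/A/an', ['1', '2', '3', '下一页']))
```

Falling back to 1 when the second-to-last entry is not a number keeps a layout change on the site from raising a ValueError in `int()`.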

def WordRun():
    WordLink(url)
    with open(Linkfile, 'r', encoding='utf-8') as f:     # read the links collected earlier
        lines = [line.strip() for line in f if line.strip()]
    print(lines)
    # Cap the pool size: one thread per link is excessive, and 0 workers raises ValueError
    workers = max(1, min(32, len(lines)))
    with concurrent.futures.ThreadPoolExecutor(workers) as x:
        for link in lines:                       # visit each letter + pinyin link
            html = WordHtml(link)
            if not html:
                continue
            x.submit(Download, link)             # download the first page
            soup = BeautifulSoup(html, 'lxml')
            pages = soup.find('div', class_='pages')
            if pages:
                page = pages.find_all('a')[-2].string    # last page number
                if page and page.isdigit():
                    for p in range(2, int(page) + 1):
                        x.submit(Download, link + '/' + str(p) + '.html')   # download remaining pages
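One detail of `submit` is worth showing: an exception raised inside a submitted task is stored on the returned future and only surfaces when `result()` is called. The sketch below keeps the futures and collects their results; `fake_download` is a hypothetical stand-in that does no network access, and the pool size of 4 is an arbitrary choice for illustration.

```python
import concurrent.futures

def fake_download(url):
    """Stand-in for the real download step; returns the URL and its length."""
    return url, len(url)

urls = ['http://chengyu.tqnxs.com/A/an/%d.html' % p for p in range(2, 6)]

results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fake_download, u) for u in urls]
    for fut in concurrent.futures.as_completed(futures):
        url, size = fut.result()    # re-raises any exception from the worker
        results[url] = size

print(len(results))
```

Leaving the `with` block waits for all submitted tasks, so even without collecting futures every download has finished by the time the pool is closed.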

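V. Downloading the idioms

Inspecting an idiom shows that the idioms are stored in the <li>/<a> children of the ul tag with class='ulLi120 fsc16'.

We also keep only the entries that are exactly four characters long before writing them to the file.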

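def Download(url):
    html = WordHtml(url)
    if not html:                                    # request failed; skip this page
        return
    soup = BeautifulSoup(html, 'lxml')
    ul = soup.find('ul', class_='ulLi120 fsc16')    # list that holds the idioms
    if ul is None:
        return
    found = []
    for item in ul.find_all('li'):
        a = item.find('a')
        word = a.string if a else None
        print(word)
        if word and len(word) == 4:                 # keep only four-character idioms
            found.append(word)
    if found:
        with open(Wordfile, 'a', encoding='utf-8') as f:    # append this page's idioms
            f.write('\n'.join(found) + '\n')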
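Since several threads append to the same output file, one cautious option is to serialize the writes with a lock. The sketch below is an assumption-laden illustration rather than the project's code: `four_char_only`, `save_words`, and `write_lock` are hypothetical names, and the sample batches are made up.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()       # shared by all worker threads

def four_char_only(words):
    """Keep entries that are non-empty and exactly four characters long."""
    return [w for w in words if w and len(w) == 4]

def save_words(path, words):
    """Append the filtered words, one per line, holding the lock while writing."""
    with write_lock:
        with open(path, 'a', encoding='utf-8') as f:
            for w in four_char_only(words):
                f.write(w + '\n')

# Hypothetical batches, as two pages might return them.
batches = [['一心一意', '不入虎穴，焉得虎子'], ['画蛇添足', None, '三思']]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'wordfile.txt')
    with ThreadPoolExecutor(max_workers=2) as pool:
        for batch in batches:
            pool.submit(save_words, path, batch)
    with open(path, encoding='utf-8') as f:
        saved = [line.strip() for line in f]

print(saved)
```

The order of lines in the file depends on which thread finishes first, so any later processing should not rely on it.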