Asynchronously Scraping a Novel

惬霑

Last modified: 2024-07-13 16:02:38

Category: Python crawlers. Tags: crawler, python

First published: 2024-07-13 16:00:42

Article link: https://blog.csdn.net/weixin_62861064/article/details/140394346

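Asynchrony

Crawling is an I/O-bound task. When you scrape a site with the requests library, the program sends a request and then has to wait for the server's response before it can continue; while it waits, the whole crawler sits idle and does nothing useful. Can we do better?
An asynchronous crawler does not block after sending a request. While one response is pending it carries on with other work, which makes the crawl more efficient.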
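
To make the difference concrete, here is a minimal sketch that uses asyncio.sleep as a stand-in for network latency. No real requests are made, and the function names and the 0.2-second delay are made up for illustration. Run one after another, three simulated requests take roughly three times the delay; run concurrently, they take roughly one.

```python
import asyncio
import time


async def fake_request(name, delay=0.2):
    # asyncio.sleep stands in for waiting on a server response
    await asyncio.sleep(delay)
    return name


async def run_sequentially(names):
    # Each request finishes before the next one starts
    return [await fake_request(n) for n in names]


async def run_concurrently(names):
    # All requests are waiting at the same time
    return await asyncio.gather(*(fake_request(n) for n in names))


def timed(coro_func, names):
    # Run one of the coroutines above and return (result, elapsed seconds)
    start = time.perf_counter()
    result = asyncio.run(coro_func(names))
    return result, time.perf_counter() - start


if __name__ == '__main__':
    pages = ['page1', 'page2', 'page3']
    print(timed(run_sequentially, pages))
    print(timed(run_concurrently, pages))
```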

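Parsing the page

The target here is the novel 我的模拟长生路 (https://www.hafuktxt.com/book/91817982/).
Right-click the page and choose Inspect (or press F12).
Say you want the contents of one chapter: locate the specific tag that holds it. As I see it, scraping a page comes down to locating the right tags, extracting their contents, cleaning out the useless data, and finally presenting the result.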
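
The locate, extract, clean sequence can be sketched on a tiny hand-written page. This is only an illustration: the markup below is a hypothetical stand-in for a chapter page, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. ElementTree only accepts well-formed markup, so real pages still need an HTML parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a chapter page
SAMPLE = """
<html><body>
  <h1>Chapter 1</h1>
  <article class="content">
    <p>\u3000\u3000First paragraph.</p>
    <p>\u3000\u3000Second paragraph.</p>
    <p>   </p>
  </article>
</body></html>
"""


def parse_chapter(markup):
    root = ET.fromstring(markup.strip())
    # 1. Locate the tags
    title = root.find('.//h1')
    paragraphs = root.findall(".//article[@class='content']/p")
    # 2. Extract the text and 3. clean it (drop full-width spaces and empty lines)
    lines = [(p.text or '').replace('\u3000', '').strip() for p in paragraphs]
    lines = [line for line in lines if line]
    return (title.text if title is not None else ''), lines


if __name__ == '__main__':
    print(parse_chapter(SAMPLE))
```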

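Fetching pages one by one with requests is too slow. One alternative is to use multiple threads, but the number of threads is limited by the machine's performance, so you cannot start too many.
Multiprocessing takes advantage of a multi-core CPU to run several tasks in parallel at the same time, which can greatly improve throughput.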

import requests


def fetch(url, headers):
    response = requests.get(url, headers=headers)
    # Decode the raw bytes as GBK
    text = response.content.decode('gbk')
    print(text)


if __name__ == '__main__':
    url = 'https://www.hafuktxt.com/book/91817982/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0'
    }
    fetch(url, headers)

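Fetching chapter URLs and chapter content

Replace the requests call with aiohttp: send requests with the get method of aiohttp's ClientSession class, and extract the tag contents with XPath.
Calling a function defined with async def does not run it. The call returns a coroutine object, which only executes once it is awaited or scheduled on the event loop.

An object is awaitable if it can be used in an await expression. Many asyncio APIs are designed to accept awaitables.

There are three main kinds of awaitable: coroutines, Tasks, and Futures.
The tasks list holds several coroutine objects. asyncio.gather runs multiple coroutines concurrently and collects their results. Writing *tasks unpacks the list into positional arguments, so asyncio.gather receives each coroutine as a separate argument.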
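
A small, network-free sketch of these ideas follows. The square coroutine is a made-up example, with asyncio.sleep standing in for I/O. It shows a bare coroutine object, a Task, and gather called with an unpacked list.

```python
import asyncio


async def square(n):
    await asyncio.sleep(0.01)  # pretend to wait on I/O
    return n * n


async def main():
    coro = square(2)                       # a coroutine object; nothing has run yet
    task = asyncio.create_task(square(3))  # a Task, scheduled on the running loop
    first = await coro
    second = await task

    tasks = [square(n) for n in range(4, 7)]
    # *tasks unpacks the list into positional arguments;
    # gather returns the results in the same order as its arguments
    rest = await asyncio.gather(*tasks)
    return first, second, rest


if __name__ == '__main__':
    print(asyncio.run(main()))  # (4, 9, [16, 25, 36])
```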

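import asyncio
import time  # used by the main block for timing
from urllib.parse import urljoin

import aiohttp
from lxml import etree

BASE_URL = 'https://www.hafuktxt.com/'
# Caps concurrent requests (20 matches CONCURRENCY in the rate-limiting section);
# `headers` is the module-level dict defined in the main block
semaphore = asyncio.Semaphore(20)


async def fetch(session, url):
    async with semaphore:
        async with session.get(url, headers=headers) as response:
            # Decode as GBK, dropping bytes that fail to decode
            text = await response.text(encoding='gbk', errors='ignore')
            print(url)
            return etree.HTML(text)


async def get_data_async(url):
    # ssl=False turns off certificate verification; trust_env=True honours proxy environment variables
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
        title_urls = (await fetch(session, url)).xpath(
            '//div[@class="container border3-2 mt8 mb20"]//div[@class="info-chapters flex flex-wrap"]/a/@href')
        # The code assumes every chapter has a second page at x_2.html;
        # interleave x.html and x_2.html so the pages stay in reading order
        str_list1 = [i.replace('.html', '_2.html') for i in title_urls]
        str_list2 = [temp for i in zip(title_urls, str_list1) for temp in i]

        async def get_title_and_content(title_url):
            html = await fetch(session, urljoin(BASE_URL, title_url))
            content_title = html.xpath('//h1/text()')
            content_text = html.xpath('//article[@class="content"]/p//text()')
            # Guard against pages without an <h1> before indexing
            title = content_title[0] if content_title else ''
            print(f'{title} downloaded')
            return title, [t.replace('\u3000', '') for t in content_text]

        tasks = [get_title_and_content(title_url) for title_url in str_list2]
        results = await asyncio.gather(*tasks)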

Write the results to title.txt (this continues the body of get_data_async):

        with open('title.txt', 'w', encoding='utf-8') as f:
            for title, contents in results:
                f.write(title + '\n')
                for content in contents:
                    f.write(content + '\n')

Run the script from the main block

if __name__ == '__main__':
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    }
    url = 'https://www.hafuktxt.com/book/91817982/'

    start = time.time()
    # asyncio.run(get_data_async(url)) is the recommended entry point on current Python;
    # the get_event_loop() form below is the older style and triggers a
    # DeprecationWarning on recent versions
    asyncio.get_event_loop().run_until_complete(get_data_async(url))
    end = time.time()
    print(f'Elapsed: {end - start}')

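At this point you will find that the program runs for a while and then fails with aiohttp.client_exceptions.ClientOSError: [WinError 10054] 远程主机强迫关闭了一个现有的连接 (an existing connection was forcibly closed by the remote host). The site's protection mechanism has kicked in: to guard against high concurrency, it restricts requests by IP.

The page source shows that when requests come too frequently the site blocks the IP, and it takes 30 s before you can access it again. Proxy IPs are one way around the block.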

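Limiting the request rate

Since I don't know the site's actual limit, I assume a limit of 3000 requests per 30 s. You can adjust the request rate yourself.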

# Assumed limits (the site's real limits are unknown)
MAX_REQUESTS_PER_30S = 3000
CONCURRENCY = 20
REQUEST_INTERVAL = 30 / MAX_REQUESTS_PER_30S  # minimum gap between requests, in seconds

# Cap the number of in-flight requests
semaphore = asyncio.Semaphore(CONCURRENCY)
# Serialise the pacing logic; without it, coroutines started together all read
# the same timestamp, sleep the same amount and then fire at once
rate_lock = asyncio.Lock()
request_count = 0
last_request_time = time.time()


async def fetch(session, url):
    global request_count, last_request_time

    # Rate limiting
    async with rate_lock:
        elapsed_time = time.time() - last_request_time
        if elapsed_time < REQUEST_INTERVAL:
            await asyncio.sleep(REQUEST_INTERVAL - elapsed_time)
        request_count += 1
        if request_count % MAX_REQUESTS_PER_30S == 0:
            print("Reached the request limit for this 30-second window; waiting for the next one...")
            await asyncio.sleep(30)  # wait for the next 30-second window
        last_request_time = time.time()  # timestamp of the request about to be sent
    async with semaphore:
        async with session.get(url, headers=headers) as response:
            text = await response.text(encoding='gbk', errors='ignore')
            print(url)
            return etree.HTML(text)
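
The pacing idea can be tried without touching the network. The sketch below is an illustrative variant, not the code above: Pacer and demo are hypothetical names, and the worker only records a timestamp where a real crawler would send its request. A lock makes the callers take turns, so each one is released roughly one interval after the previous one.

```python
import asyncio
import time


class Pacer:
    """Hypothetical helper: lets callers through at most once per `interval` seconds."""

    def __init__(self, interval):
        # Create the Pacer inside a coroutine so the lock belongs to the running loop
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            if now < self._next_slot:
                await asyncio.sleep(self._next_slot - now)
            # Reserve the following slot for the next caller
            self._next_slot = max(now, self._next_slot) + self.interval


async def demo(n=5, interval=0.05):
    pacer = Pacer(interval)
    stamps = []

    async def worker():
        await pacer.wait()
        stamps.append(time.monotonic())  # a real crawler would send its request here

    await asyncio.gather(*(worker() for _ in range(n)))
    return stamps


if __name__ == '__main__':
    times = asyncio.run(demo())
    print([round(t - times[0], 2) for t in times])
```

With the limit assumed above, the interval would be 30 / 3000 = 0.01 s, that is, at most about 100 requests per second.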

Getting proxy IPs

Using http://www.ip3366.net as an example:

import requests
from bs4 import BeautifulSoup


def get_proxies():
    # Collected proxy addresses
    proxy_ips = []
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    headers = {"User-Agent": ua}
    # Walk through pages 1 to 6 of the listing
    for page in range(1, 7):
        # print(f"Fetching page {page}")
        url = f"http://www.ip3366.net/?stype=1&page={page}"
        # Another free list, not used here: https://www.89ip.cn/index_{page}.html
        res = requests.get(url, headers=headers)
        # Parse the page text with BeautifulSoup's built-in html.parser
        soup = BeautifulSoup(res.text, "html.parser")
        # The proxy table carries the classes "table table-bordered table-striped"
        table = soup.find_all(class_="table table-bordered table-striped")[0]
        trs = table.find_all("tr")
        # One tr per row; the first row is the header, so skip it
        for tr in trs[1:]:
            tds = tr.find_all("td")
            ip = tds[0].text.strip()
            port = tds[1].text.strip()
            # Build the proxy address and store it
            proxy_ips.append(f"http://{ip}:{port}")
    return proxy_ips

Test whether a proxy is usable (free proxies rarely work):

import requests


def test_proxy(ip):
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    }
    # Validate the proxy by requesting httpbin through it
    url = "http://httpbin.org/get"
    # url = "https://www.baidu.com"
    # Only http traffic is routed through the proxy, matching the http test URL
    proxies = {"http": ip}
    try:
        res = requests.get(url, headers=headers, proxies=proxies, timeout=3)
    except requests.exceptions.Timeout:
        # No response within 3 seconds
        print(f"Request through {ip} timed out")
        result_code = 0
    except requests.exceptions.RequestException as e:
        # Refused connections, proxy errors and similar failures
        print(f"Request through {ip} failed: {e}")
        result_code = 0
    else:
        if res.status_code == 200:
            print(f"Proxy {ip} works")
            result_code = 200
            print(res.text)
        else:
            print(f"Proxy {ip} does not work, status {res.status_code}")
            result_code = res.status_code
    # Return the status code (0 when the request failed)
    return result_code
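
Because most free proxies fail, checking them one at a time with a 3-second timeout is slow. A possible way to speed this up is to run the checks in a thread pool. In the sketch below, filter_working is a hypothetical helper. It assumes the check function returns 200 for a working proxy, as test_proxy does, and does not raise.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=10):
    """Return the proxies for which check(proxy) == 200, keeping their order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the checks in parallel and yields results in input order
        codes = list(pool.map(check, proxies))
    return [proxy for proxy, code in zip(proxies, codes) if code == 200]
```

It could then be called as filter_working(get_proxies(), test_proxy).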

Using proxies

Using a proxy with requests

import requests

proxy = '127.0.0.1:7890'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
}
# Route both http and https traffic through the same HTTP proxy
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
try:
    response = requests.get('http://www.httpbin.org/get', headers=headers, proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
Using a proxy in async code (aiohttp)

import asyncio

import aiohttp

proxy = 'http://127.0.0.1:7890'

async def main():
    async with aiohttp.ClientSession() as session:
        # aiohttp takes the proxy per request through the proxy argument
        async with session.get('https://www.httpbin.org/get', proxy=proxy) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
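
To combine this with a list of scraped proxies, one option is to rotate through the list and move on to the next proxy when a request fails. The following is only a sketch under stated assumptions: make_proxy_picker and fetch_via_proxy are hypothetical helpers, session is expected to be an aiohttp.ClientSession, and the broad except clause is a placeholder that real code would narrow to aiohttp.ClientError and asyncio.TimeoutError.

```python
import asyncio
import itertools


def make_proxy_picker(proxies):
    """Return a function that hands out the given proxies round-robin."""
    if not proxies:
        raise ValueError('need at least one proxy')
    pool = itertools.cycle(proxies)
    return lambda: next(pool)


async def fetch_via_proxy(session, url, pick_proxy, retries=3):
    """Try the request through up to `retries` proxies and return the body text."""
    last_error = None
    for _ in range(retries):
        proxy = pick_proxy()
        try:
            async with session.get(url, proxy=proxy) as response:
                return await response.text()
        except Exception as exc:  # placeholder: narrow this in real code
            last_error = exc
    raise RuntimeError(f'all {retries} attempts failed for {url}') from last_error
```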

Using a proxy with Selenium

from selenium import webdriver

proxy = '127.0.0.1:7890'
options = webdriver.ChromeOptions()
# The Chrome flag is --proxy-server=<scheme>://<host>:<port>
options.add_argument('--proxy-server=http://' + proxy)
browser = webdriver.Chrome(options=options)
browser.get("https://www.httpbin.org/get")
print(browser.page_source)
browser.close()
