Web Scraping Practice — Scraping Biquge (笔趣阁)

Task

(Screenshot: the four novels to scrape)

  • Scrape the four novels shown above
  • Use the requests library
  • Do not miss a single chapter
  • Finish the scrape within a limited amount of time
  • Save each chapter in the format shown below
    (Screenshot: the desired file layout)

Setting up an IP proxy

Free IP proxy sites:

Proxy test site:

Testing whether a proxy works

import requests

proxy = ['221.131.158.246:8888','183.245.8.185:80','218.7.171.91:3128',
         '223.82.106.253:3128','58.250.21.56:3128','221.6.201.18:9999',
         '27.220.51.34:9000','123.149.136.187:9999','125.108.127.160:9000',
         '1.197.203.254:9999','42.7.30.35:9999','175.43.56.24:9999',
         '125.123.154.223:3000','27.43.189.161:9999','123.169.121.100:9999']
for i in proxy:
    proxies = {
            'http':'http://'+i,
            'https':'https://'+i
        }
    print(proxies)
    try:
        # Send the request through the candidate proxy; a short timeout
        # keeps dead proxies from hanging the loop.
        response = requests.get("http://httpbin.org/", proxies=proxies, timeout=5)
        print(response.text)
    except requests.exceptions.RequestException as e:
        print('Error', e.args)
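
Even when the request succeeds, it is worth confirming that it actually went out through the proxy rather than your own connection. httpbin's /ip endpoint echoes the caller's address, so you can compare it against the proxy you set. A minimal sketch (proxy_is_used is a hypothetical helper, not part of the original code; anonymous proxies may report a different outbound address, so treat this only as a rough check):

import requests

def proxy_is_used(proxy, timeout=5):
    # httpbin.org/ip returns {"origin": "<caller ip>"}; if the proxy took
    # effect, the origin should contain the proxy's IP rather than yours.
    proxies = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
        return proxy.split(':')[0] in r.json().get('origin', '')
    except requests.exceptions.RequestException:
        return False

print(proxy_is_used('221.131.158.246:8888'))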

Randomly picking one IP

import requests
from random import choice

def get_proxy():
    proxy = ['221.131.158.246:8888','183.245.8.185:80','218.7.171.91:3128',
         '223.82.106.253:3128','58.250.21.56:3128','221.6.201.18:9999',
         '27.220.51.34:9000','123.149.136.187:9999','125.108.127.160:9000',
         '1.197.203.254:9999','42.7.30.35:9999','175.43.56.24:9999',
         '125.123.154.223:3000','27.43.189.161:9999','123.169.121.100:9999']
    
    return choice(proxy)

proxy = get_proxy()

proxies = {
        'http':'http://'+proxy,
        'https':'https://'+proxy
    }
print(proxies)
try:
    # Send the test request through the randomly chosen proxy.
    response = requests.get("http://httpbin.org/", proxies=proxies, timeout=5)
    print(response.text)
except requests.exceptions.RequestException as e:
    print('Error', e.args)

Complete code

import requests
import re
import os
import threading
from random import choice

def get_proxy():
    # Return a randomly chosen proxy IP
    proxy = ['221.131.158.246:8888','218.7.171.91:3128','58.250.21.56:3128']
    return choice(proxy)

def getHTMLText(url,timeout = 100):
    # Fetch a page through a random proxy and return the decoded HTML
    try:
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
        }
        proxy = get_proxy()
        print(proxy)
        proxies = {
            'http':'http://'+proxy,
            'https':'https://'+proxy
        }
        r = requests.get(url,headers=headers,proxies=proxies,timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'error'

def write_file(file,content):
    # Extract the chapter title and body from the page HTML
    title_content = re.findall(r'<h1>(.*?)</h1>[\s\S]*?<div id="content">([\s\S]*?)<p>',content)
    for title,content in title_content:
        # Clean up the chapter body
        content = content.replace('&nbsp;',' ').replace('<br />','\n')
        #print(title,content)
        with open(file,'w',encoding='utf-8') as f:
            f.write('\t\t\t\t'+title+'\n\n\n\n')
            f.write(content)

def download(book,title,href):
    '''
    book: name of the novel
    title: chapter title
    href: URL of the chapter page
    '''
    content = getHTMLText(href)
    write_file(book+"\\"+title+'.txt',content)

def main():
    threads = []
    url = "http://www.xbiquge.la"
    html = getHTMLText(url)
    # Extract each novel's name and its table-of-contents URL
    novel_info = re.findall(r'<div class="item">[\s\S]*?<dt>.*?<a href="(.*?)">(.*?)</a>',html)
    for href,book in novel_info:
        print(href,book)
        # ---------------------------------------------------------- #
        #            Create a folder named after the novel
        if not os.path.exists(book):
            os.mkdir(book)
        # ---------------------------------------------------------- #
        novel = getHTMLText(href)
        # Extract each chapter's URL and title
        chapter_info = re.findall(r"<dd><a href='(.*?)' >(.*?)</a>",novel)
        # e.g. http://www.xbiquge.la/10/10489/4534454.html
        for href,title in chapter_info:
            href = url + href
            print(href,title)
            # ---------------------------------------------------------- #
            #               Scrape with multiple threads
            T = threading.Thread(target=download,args=(book,title,href))
            T.daemon = False  # non-daemon: the program waits for the thread to finish
            T.start()
            threads.append(T)
            # ---------------------------------------------------------- #
            #download(book,title,href)  # single-threaded alternative

    for T in threads:
        T.join()
            
        
if __name__ == "__main__":
    main()
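
The loop above starts one thread per chapter, which for a long novel can mean thousands of threads running at once. A bounded thread pool is a gentler way to stay fast without overwhelming the site or the local machine. Below is a minimal sketch reusing the download function above; download_all, tasks and the pool size of 16 are my own names and choices, not part of the original post:

from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(tasks, max_workers=16):
    # tasks: an iterable of (book, title, href) tuples collected in main()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, book, title, href): title
                   for book, title, href in tasks}
        for future in as_completed(futures):
            title = futures[future]
            try:
                future.result()   # re-raises any exception from the worker
            except Exception as exc:
                print('failed:', title, exc)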

Result

(Screenshot: the downloaded chapter files)

Summary

  • Free IP proxies are unreliable
  • The program's robustness is poor

Solutions

  1. For this task, skip the IP proxy entirely, or use paid proxies and build your own proxy pool
  2. Add timeout handling (a sketch combining both ideas follows this list)
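
A minimal sketch of both fixes, assuming a get_proxy() like the one above that returns 'ip:port' strings; the fetch name, retry count and timeout values are arbitrary choices of mine, not from the original program:

import requests

def fetch(url, retries=3, timeout=10):
    # Retry a few times, switching to a different proxy after each failure.
    headers = {"user-agent": "Mozilla/5.0"}
    for attempt in range(retries):
        proxy = get_proxy()                      # assumed: returns 'ip:port'
        proxies = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
        try:
            r = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.exceptions.RequestException as e:
            print('attempt', attempt + 1, 'failed:', e)
    return ''                                    # give up after all retries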