Python web scraping: downloading free novels from Qidian (起点中文网) - a hands-on project

Preface:

While scrolling through Douyin one evening, I came across a video about Python web scraping and decided to brush up on the scraping basics I had learned before, so I would not forget them. I picked a novel site, Qidian (起点中文网), as the target. Since my skills are not up to handling the VIP interface, this project only scrapes the free novels.

Site URL: https://www.qidian.com/free/all/ (the free-novels listing page on Qidian)

Step 1: I defined four lists to store the scraped data.

a = []  # novel detail-page URLs
b = []  # novel titles
c = []  # chapter URLs
d = []  # chapter titles
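
These four lists are parallel: a[i] is the detail-page URL of the title stored in b[i], and c[j] is the URL of the chapter title stored in d[j]. An alternative layout (not used in this project, shown only as a sketch) keeps each record together so the lists cannot drift out of sync:

# Sketch of an alternative data layout (illustrative only, not used below):
# one dict per novel / per chapter instead of four parallel lists.
books = []     # e.g. [{'title': 'Some Novel', 'url': 'https://book.qidian.com/info/...'}]
chapters = []  # e.g. [{'title': '第1章 ...', 'url': 'https://www.qidian.com/chapter/...'}]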

Step 2: I defined the first function, which scrapes the novel titles and detail-page links from the listing pages, storing the links in list a and the titles in list b for later use.

The code is as follows:

def qd():
    m = int(input('Enter the number of listing pages to scrape: '))
    for n in range(1, m + 1):  # listing page numbers in the URL start at 1, not 0
        url = 'https://www.qidian.com/free/all/page{}/'.format(n)
        headers = {
            'Cookie': '_yep_uuid=d8b52da3-2e99-6aa7-6f40-b764179d5cd7; supportwebp=true; x-waf-captcha-referer=; _csrfToken=zOBoyYeRJQmHZ9hoQUBiTL2WMoK5X07zciU9oBYG; newstatisticUUID=1718464987_786877310; fu=1199187808; traffic_utm_referer=; Hm_lvt_f00f67093ce2f38f215010b699629083=1718464988; Hm_lpvt_f00f67093ce2f38f215010b699629083=1718464988; _ga_FZMMH98S83=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga_PFYW0QLV3P=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga=GA1.2.1668981357.1718464988; _gid=GA1.2.1024042542.1718464988; _gat_gtag_UA_199934072_2=1; w_tsfp=ltvgWVEE2utBvS0Q6KLtkk+nHj47Z2R7xFw0D+M9Os09CKcnV56F1Yd5vNfldCyCt5Mxutrd9MVxYnGJUtAgfREQRM+Sb5tH1VPHx8NlntdKRQJtA8iOXlEYcr1yujNPKG9ccBS02j4rcdBCxLBkg1AOtiJ937ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgWmesA7uLi0lX+ZX0jPShH9KD20hlwOnZrgPa0Pvfp/zSucirTPzwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZVignqx3I/v4baVg5CMTTq8dGgoW/QsctbQ9rxZHDSHpMyPdBPIo5AMAT/YK/8j6a2Y=',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'
        }
        response = requests.get(url, headers=headers)
        html = BeautifulSoup(response.text, 'lxml')
        books = html.select('div.main-content-wrap div.all-book-list div.book-img-text ul li h2 a')
        for i in books:
            a.append('https:' + i['href'])  # detail-page URL
            b.append(i['title'])            # novel title
            print('Title: ' + i['title'] + '   ---|---   ' + 'Link: https:' + i['href'])
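
One thing worth noting: the same Cookie and User-Agent dictionary is repeated in every function. A possible cleanup, shown below only as a sketch (the HEADERS constant, the shared session, and the fetch helper are not part of the original script, and the cookie value is a placeholder for one copied from your own browser), is to send every request through a single requests.Session:

import requests

# Placeholder headers: substitute the Cookie string copied from your own
# logged-in browser session.
HEADERS = {
    'Cookie': '<your Qidian cookie string>',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
}

# A Session reuses the underlying connection and sends the same headers on
# every request made through it, so each function no longer needs its own copy.
session = requests.Session()
session.headers.update(HEADERS)

def fetch(url):
    """GET a page through the shared session and return its HTML text."""
    response = session.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text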

Step 3: I defined a second function that lets the user choose the novel to download. It scrapes the chosen novel's chapter links and chapter titles, storing the links in list c and the titles in list d. In this step a folder named after the selected novel is also created.

The code is as follows:

def xz1():
    url = input('Paste the URL of the novel you want to download here: ')
    headers = {
        'Cookie': '_yep_uuid=d8b52da3-2e99-6aa7-6f40-b764179d5cd7; supportwebp=true; x-waf-captcha-referer=; _csrfToken=zOBoyYeRJQmHZ9hoQUBiTL2WMoK5X07zciU9oBYG; newstatisticUUID=1718464987_786877310; fu=1199187808; traffic_utm_referer=; Hm_lvt_f00f67093ce2f38f215010b699629083=1718464988; Hm_lpvt_f00f67093ce2f38f215010b699629083=1718464988; _ga_FZMMH98S83=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga_PFYW0QLV3P=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga=GA1.2.1668981357.1718464988; _gid=GA1.2.1024042542.1718464988; _gat_gtag_UA_199934072_2=1; w_tsfp=ltvgWVEE2utBvS0Q6KLtkk+nHj47Z2R7xFw0D+M9Os09CKcnV56F1Yd5vNfldCyCt5Mxutrd9MVxYnGJUtAgfREQRM+Sb5tH1VPHx8NlntdKRQJtA8iOXlEYcr1yujNPKG9ccBS02j4rcdBCxLBkg1AOtiJ937ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgWmesA7uLi0lX+ZX0jPShH9KD20hlwOnZrgPa0Pvfp/zSucirTPzwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZVignqx3I/v4baVg5CMTTq8dGgoW/QsctbQ9rxZHDSHpMyPdBPIo5AMAT/YK/8j6a2Y=',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'
    }
    response = requests.get(url, headers=headers)
    html = BeautifulSoup(response.text, 'lxml')
    chapters = html.select('div.catalog-volume ul li a')
    print('The chapters of the selected novel are:')
    for h in chapters:
        print('Chapter: ' + h.text + '  ---+---  ' + 'Link: https:' + h['href'])
        c.append('https:' + h['href'])  # chapter URL
        d.append(h.text)                # chapter title
    global urls
    urls = a.index(url)  # index of the chosen novel; the pasted URL must match one printed by qd()
    if not os.path.exists(b[urls]):
        os.makedirs(b[urls])
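
One pitfall with this step: the chapter titles in d are later used directly as file names, and a title containing characters such as ? : or * will make open() fail on Windows. A small helper like the hypothetical safe_filename below (an addition, not part of the original code) could be applied to h.text before it is appended to d:

import re

def safe_filename(name):
    """Replace characters that are not allowed in Windows file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

# Example: safe_filename('第1章 你是谁?') -> '第1章 你是谁_'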

Step 4: I created a third function that downloads the novel content into the folder created in the previous step, saving each chapter as a .txt file named after the chapter.

The code is as follows:

def xz2():
    headers = {
        'Cookie': '_yep_uuid=d8b52da3-2e99-6aa7-6f40-b764179d5cd7; supportwebp=true; x-waf-captcha-referer=; _csrfToken=zOBoyYeRJQmHZ9hoQUBiTL2WMoK5X07zciU9oBYG; newstatisticUUID=1718464987_786877310; fu=1199187808; traffic_utm_referer=; Hm_lvt_f00f67093ce2f38f215010b699629083=1718464988; Hm_lpvt_f00f67093ce2f38f215010b699629083=1718464988; _ga_FZMMH98S83=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga_PFYW0QLV3P=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga=GA1.2.1668981357.1718464988; _gid=GA1.2.1024042542.1718464988; _gat_gtag_UA_199934072_2=1; w_tsfp=ltvgWVEE2utBvS0Q6KLtkk+nHj47Z2R7xFw0D+M9Os09CKcnV56F1Yd5vNfldCyCt5Mxutrd9MVxYnGJUtAgfREQRM+Sb5tH1VPHx8NlntdKRQJtA8iOXlEYcr1yujNPKG9ccBS02j4rcdBCxLBkg1AOtiJ937ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgWmesA7uLi0lX+ZX0jPShH9KD20hlwOnZrgPa0Pvfp/zSucirTPzwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZVignqx3I/v4baVg5CMTTq8dGgoW/QsctbQ9rxZHDSHpMyPdBPIo5AMAT/YK/8j6a2Y=',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'
    }
    for url, name in zip(c, d):  # pair each chapter URL with its chapter title
        response = requests.get(url, headers=headers)
        html = BeautifulSoup(response.text, 'lxml')
        paragraphs = html.select('div.print main p')
        # write every paragraph of this chapter into the chapter's own .txt file
        with open(f'{b[urls]}/{name}.txt', 'w', encoding='utf-8') as f:
            for j in paragraphs:
                f.write(j.text + '\n')
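
For a slightly more robust download loop, the sketch below (an illustrative variant, not the author's code; the function name and parameters are chosen here for demonstration) adds a request timeout, skips chapters whose request fails, and pauses briefly between chapters so the site is not hit too quickly:

import time

import requests
from bs4 import BeautifulSoup

def download_chapters(chapter_urls, chapter_names, folder, headers):
    """Save each chapter as folder/<chapter name>.txt, skipping failed requests."""
    for url, name in zip(chapter_urls, chapter_names):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f'Skipping "{name}": {exc}')
            continue
        soup = BeautifulSoup(response.text, 'lxml')
        paragraphs = soup.select('div.print main p')
        with open(f'{folder}/{name}.txt', 'w', encoding='utf-8') as f:
            for p in paragraphs:
                f.write(p.text + '\n')
        time.sleep(1)  # be polite: roughly one request per second

With a helper like this, xz2() would reduce to a single call such as download_chapters(c, d, b[urls], headers).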

Finally, call the functions:

if __name__ == '__main__':
    qd()
    xz1()
    xz2()

Now let's look at the results:

Here I chose to scrape two pages.

The novel I selected is 综漫:从无职转生开始 (its free online reading page).

At this point the download is in progress.

Full code:

import requests
from bs4 import BeautifulSoup
import os

a = []  # novel detail-page URLs
b = []  # novel titles
c = []  # chapter URLs
d = []  # chapter titles


def qd():
    m = int(input('Enter the number of listing pages to scrape: '))
    for n in range(1, m + 1):  # listing page numbers in the URL start at 1, not 0
        url = 'https://www.qidian.com/free/all/page{}/'.format(n)
        headers = {
            'Cookie': '_yep_uuid=d8b52da3-2e99-6aa7-6f40-b764179d5cd7; supportwebp=true; x-waf-captcha-referer=; _csrfToken=zOBoyYeRJQmHZ9hoQUBiTL2WMoK5X07zciU9oBYG; newstatisticUUID=1718464987_786877310; fu=1199187808; traffic_utm_referer=; Hm_lvt_f00f67093ce2f38f215010b699629083=1718464988; Hm_lpvt_f00f67093ce2f38f215010b699629083=1718464988; _ga_FZMMH98S83=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga_PFYW0QLV3P=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga=GA1.2.1668981357.1718464988; _gid=GA1.2.1024042542.1718464988; _gat_gtag_UA_199934072_2=1; w_tsfp=ltvgWVEE2utBvS0Q6KLtkk+nHj47Z2R7xFw0D+M9Os09CKcnV56F1Yd5vNfldCyCt5Mxutrd9MVxYnGJUtAgfREQRM+Sb5tH1VPHx8NlntdKRQJtA8iOXlEYcr1yujNPKG9ccBS02j4rcdBCxLBkg1AOtiJ937ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgWmesA7uLi0lX+ZX0jPShH9KD20hlwOnZrgPa0Pvfp/zSucirTPzwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZVignqx3I/v4baVg5CMTTq8dGgoW/QsctbQ9rxZHDSHpMyPdBPIo5AMAT/YK/8j6a2Y=',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'
        }
        response = requests.get(url, headers=headers)
        html = BeautifulSoup(response.text, 'lxml')
        books = html.select('div.main-content-wrap div.all-book-list div.book-img-text ul li h2 a')
        for i in books:
            a.append('https:' + i['href'])  # detail-page URL
            b.append(i['title'])            # novel title
            print('Title: ' + i['title'] + '   ---|---   ' + 'Link: https:' + i['href'])


def xz1():
    url = input('Paste the URL of the novel you want to download here: ')
    headers = {
        'Cookie': '_yep_uuid=d8b52da3-2e99-6aa7-6f40-b764179d5cd7; supportwebp=true; x-waf-captcha-referer=; _csrfToken=zOBoyYeRJQmHZ9hoQUBiTL2WMoK5X07zciU9oBYG; newstatisticUUID=1718464987_786877310; fu=1199187808; traffic_utm_referer=; Hm_lvt_f00f67093ce2f38f215010b699629083=1718464988; Hm_lpvt_f00f67093ce2f38f215010b699629083=1718464988; _ga_FZMMH98S83=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga_PFYW0QLV3P=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga=GA1.2.1668981357.1718464988; _gid=GA1.2.1024042542.1718464988; _gat_gtag_UA_199934072_2=1; w_tsfp=ltvgWVEE2utBvS0Q6KLtkk+nHj47Z2R7xFw0D+M9Os09CKcnV56F1Yd5vNfldCyCt5Mxutrd9MVxYnGJUtAgfREQRM+Sb5tH1VPHx8NlntdKRQJtA8iOXlEYcr1yujNPKG9ccBS02j4rcdBCxLBkg1AOtiJ937ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgWmesA7uLi0lX+ZX0jPShH9KD20hlwOnZrgPa0Pvfp/zSucirTPzwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZVignqx3I/v4baVg5CMTTq8dGgoW/QsctbQ9rxZHDSHpMyPdBPIo5AMAT/YK/8j6a2Y=',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'
    }
    response = requests.get(url, headers=headers)
    html = BeautifulSoup(response.text, 'lxml')
    chapters = html.select('div.catalog-volume ul li a')
    print('The chapters of the selected novel are:')
    for h in chapters:
        print('Chapter: ' + h.text + '  ---+---  ' + 'Link: https:' + h['href'])
        c.append('https:' + h['href'])  # chapter URL
        d.append(h.text)                # chapter title
    global urls
    urls = a.index(url)  # index of the chosen novel; the pasted URL must match one printed by qd()
    if not os.path.exists(b[urls]):
        os.makedirs(b[urls])


def xz2():
    headers = {
        'Cookie': '_yep_uuid=d8b52da3-2e99-6aa7-6f40-b764179d5cd7; supportwebp=true; x-waf-captcha-referer=; _csrfToken=zOBoyYeRJQmHZ9hoQUBiTL2WMoK5X07zciU9oBYG; newstatisticUUID=1718464987_786877310; fu=1199187808; traffic_utm_referer=; Hm_lvt_f00f67093ce2f38f215010b699629083=1718464988; Hm_lpvt_f00f67093ce2f38f215010b699629083=1718464988; _ga_FZMMH98S83=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga_PFYW0QLV3P=GS1.1.1718464988.1.0.1718464988.0.0.0; _ga=GA1.2.1668981357.1718464988; _gid=GA1.2.1024042542.1718464988; _gat_gtag_UA_199934072_2=1; w_tsfp=ltvgWVEE2utBvS0Q6KLtkk+nHj47Z2R7xFw0D+M9Os09CKcnV56F1Yd5vNfldCyCt5Mxutrd9MVxYnGJUtAgfREQRM+Sb5tH1VPHx8NlntdKRQJtA8iOXlEYcr1yujNPKG9ccBS02j4rcdBCxLBkg1AOtiJ937ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgWmesA7uLi0lX+ZX0jPShH9KD20hlwOnZrgPa0Pvfp/zSucirTPzwjn3apCs2RYj4VA3sB49AtX02TXKL3ZEIAtrZVignqx3I/v4baVg5CMTTq8dGgoW/QsctbQ9rxZHDSHpMyPdBPIo5AMAT/YK/8j6a2Y=',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'
    }
    for url, name in zip(c, d):  # pair each chapter URL with its chapter title
        response = requests.get(url, headers=headers)
        html = BeautifulSoup(response.text, 'lxml')
        paragraphs = html.select('div.print main p')
        # write every paragraph of this chapter into the chapter's own .txt file
        with open(f'{b[urls]}/{name}.txt', 'w', encoding='utf-8') as f:
            for j in paragraphs:
                f.write(j.text + '\n')


if __name__ == '__main__':
    qd()
    xz1()
    xz2()

I hope this helps.
