A Simple Analysis of a Web Crawler

Latest recommended article published on 2024-04-07 09:10:09

别内卷了

Tags: python, web crawler

Article link: https://blog.csdn.net/m0_56937298/article/details/124840129

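Web Crawlers

What is a crawler? Put simply, it is a program or script that imitates a person viewing a page and then copies the page's content. The purpose of a crawler is to find the target content on pages where information is dense and hard to extract by hand, and to download it into a target directory.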

Flowchart

Python libraries

The urllib and requests modules:

For details, see the official documentation:

urllib: URL handling modules (Python 3.7.13 documentation)

Requests: HTTP for Humans (Requests 2.18.1 documentation)
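As a rough illustration of the standard-library side, the sketch below builds a GET request with urllib and a custom User-Agent header, without sending it. The User-Agent string and the helper name `build_request` are made up for the example; the URL is the one used later in this article.

```python
import urllib.request


def build_request(url, user_agent):
    """Build (but do not send) a GET request that carries a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# "Mozilla/5.0 (demo)" is a placeholder value, not a real browser string.
req = build_request("https://so.gushiwen.cn/mingjus/", "Mozilla/5.0 (demo)")

print(req.get_method())              # the HTTP method urllib would use
print(req.host)                      # host parsed from the URL
print(req.get_header("User-agent"))  # urllib stores header names capitalized this way
```

Passing `req` to `urllib.request.urlopen` would actually send it. With requests, the comparable call is `requests.get(url, headers=...)`.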

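Analyzing the page requests

Column 1, Name: the name of the request.

Column 2, Status: the status code of the response.

Column 3, Type: the document type of the requested resource.

Column 4, Initiator: the object or process that initiated the request.

Column 5, Size: the size of the resource returned by the server.

Column 6, Time: the total time from sending the request to receiving the response.

Column 7, Waterfall: a visual waterfall chart of the network requests.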
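Several of these columns can be approximated from code. The sketch below is only an illustration: it starts a throwaway local server (the `DemoHandler` class and the `PAGE` body are invented for the demo), requests one page from it with urllib, and collects rough equivalents of Name, Status, Type, Size and Time.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><a href='/x'>hello</a></body></html>"  # made-up page body


class DemoHandler(BaseHTTPRequestHandler):
    """Serves the same small HTML page for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_request(url):
    """Return rough Network-panel style fields for one GET request."""
    # An empty ProxyHandler makes sure the local request bypasses any proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.perf_counter()
    with opener.open(url, timeout=5) as resp:
        body = resp.read()
        info = {
            "name": url.rsplit("/", 1)[-1] or "/",
            "status": resp.status,
            "type": resp.headers.get_content_type(),
            "size": len(body),
        }
    info["time_ms"] = (time.perf_counter() - start) * 1000
    return info


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    row = inspect_request(f"http://127.0.0.1:{server.server_port}/index.html")
finally:
    server.shutdown()
    server.server_close()

print(row)
```

Initiator and Waterfall have no direct counterpart here, since they describe how the browser schedules requests rather than a single response.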

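The two most common HTTP request methods are GET and POST.

GET is the default way to submit form data: the data is passed in the URL.

Characteristics: should not carry sensitive data such as passwords, because the data is visible in the URL

cannot carry large amounts of data: browsers and servers cap URL length (the often-quoted 1024-byte figure is one such practical limit, not part of the HTTP specification)

cannot upload attachments

POST does not pass data through the address bar; the data is sent in the request body to the program that handles it on the server.

Characteristics: relatively safer, since the data is not exposed in the URL

can carry large amounts of data

can upload attachments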
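The difference in where the data travels can be seen without sending anything. In this minimal sketch the form fields and the `example.com` URL are placeholders chosen for illustration; urllib picks GET when a request has no body and POST when `data` is supplied.

```python
import urllib.parse
import urllib.request

form = {"q": "poem", "page": "2"}        # made-up form fields
encoded = urllib.parse.urlencode(form)   # URL-encoded "key=value&key=value" string

# GET: the form data becomes the query string of the URL itself.
get_req = urllib.request.Request("https://example.com/search?" + encoded)

# POST: the URL stays clean and the same data travels in the request body.
post_req = urllib.request.Request(
    "https://example.com/search", data=encoded.encode("utf-8")
)

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```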

The target page is requested with GET.

A GET request carries no data in a request body, so it can be constructed directly.

import requests
from fake_useragent import UserAgent

def Netwrok():  # fetch the whole page
    heade = {
        "User-Agent": UserAgent().random  # pick a random browser User-Agent
    }
    url = 'https://so.gushiwen.cn/mingjus/'
    req = requests.get(url, headers=heade, timeout=10)  # send a GET request
    rs = req.content.decode()
    print(rs)  # print the full HTML

Netwrok()

Filtering out the desired content with a regular expression

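import re
# regex rule: capture the text of the quote link and of the source link that follows it
re_find = r'href="/mingju/juv_.*?\.aspx">(.*?)</a>.*?aspx">(.*?)</a>'
# html is the page text fetched in the previous step; re.S lets . match newlines
data_list = re.findall(re_find, html, re.S)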
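To see what the pattern captures, it can be tried on a small hand-written snippet. The HTML below is an assumption about the page's general shape, written only so that the pattern matches; the real page markup may differ.

```python
import re

# Hypothetical markup: each entry has a quote link followed by a source link.
sample_html = """
<div class="cont">
<a target="_blank" href="/mingju/juv_aaa111.aspx">Quote one</a>
<span>--</span><a href="/shiwenv_bbb222.aspx">Author, Title One</a>
</div>
<div class="cont">
<a target="_blank" href="/mingju/juv_ccc333.aspx">Quote two</a>
<span>--</span><a href="/shiwenv_ddd444.aspx">Author, Title Two</a>
</div>
"""

pattern = r'href="/mingju/juv_.*?\.aspx">(.*?)</a>.*?aspx">(.*?)</a>'

# re.S lets "." match newlines, so one match can span the two lines of an entry.
pairs = re.findall(pattern, sample_html, re.S)

for quote, source in pairs:
    print(quote, "|", source)
```

With two capture groups, `re.findall` returns a list of 2-tuples, which is why the full example indexes each result as `data[0]` and `data[1]`.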

A complete example, using a classical Chinese poetry website as the target:

import os
import re
import urllib.request
import requests
from fake_useragent import UserAgent
class Spider(object):
    def __init__(self):
        self.head = {'User-Agent': UserAgent().random}  # pick a random browser User-Agent
        self.url = 'https://so.gushiwen.cn/mingjus/'
        self.Netwrok()
        # self.Urllib_requst()
    def Netwrok(self):  # fetch the page content with the requests module
        try:
            req = requests.get(url=self.url, headers=self.head, timeout=10)
            if req.status_code == 200:
                return self.Filter(req.content.decode())
            print('Unexpected status code:', req.status_code)
        except requests.RequestException:  # only catch network-related errors
            print('Network request failed')
    def Filter(self, html):  # extract (quote, source) pairs with a regex
        re_find = r'href="/mingju/juv_.*?\.aspx">(.*?)</a>.*?aspx">(.*?)</a>'
        data_list = re.findall(re_find, html, re.S)
        self.Generate_file(data_list)
    def Generate_file(self, data_list):  # write the results to a text file
        with open('D:\\爬虫\\古诗词.txt', 'w', encoding='UTF-8') as f:
            for data in data_list:
                f.write(data[0] + '\t\t' + data[1] + '\n')
                f.write('\n')
            print('over')
    # def Urllib_requst(self):  # alternative: fetch the page content with urllib
    #     req = urllib.request.Request(url=self.url, headers=self.head)
    #     res = urllib.request.urlopen(req)
    #     html = res.read().decode('UTF-8')
    #     return self.Filter(html)
if __name__ == '__main__':
    # create the D:\爬虫 folder if it does not exist yet; it holds the scraped data
    os.makedirs('D:\\爬虫', exist_ok=True)
    db = Spider()  # run the spider whether or not the folder already existed

The results are written to a file in the D:\爬虫 folder.
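The save step can be tried on its own, without the network or a D: drive. The following is a minimal sketch assuming a temporary folder and an invented helper `save_pairs`; it writes pairs in the same tab-separated layout as the example above.

```python
import os
import tempfile


def save_pairs(pairs, folder, filename="quotes.txt"):
    """Write (quote, source) pairs to folder/filename, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf-8") as f:
        for quote, source in pairs:
            f.write(quote + "\t\t" + source + "\n\n")
    return path


# A temporary folder stands in for the hard-coded output directory.
demo_folder = os.path.join(tempfile.mkdtemp(), "crawler")
out_path = save_pairs(
    [("Quote one", "Title One"), ("Quote two", "Title Two")], demo_folder
)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Building the path with `os.path.join` keeps the same code usable on systems other than Windows.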