01. A Basic Introduction to Web Crawlers

和光同尘

Published 2023-03-06 12:35:33


Article link: https://blog.csdn.net/uuuuty/article/details/129359319


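1. Introduction to crawlers

    *Crawler* (also known as a web spider or web robot)

1.1 The essence of a crawler:

A crawler imitates a client (an ordinary user): it sends network requests and collects the corresponding response data.

In theory, any data that an ordinary user can see or reach can also be fetched (crawled) by a crawler.
If you can see it, you can crawl it.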
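
The request/response idea can be tried without touching a real website. The sketch below is only an illustration under stated assumptions: `DemoHandler`, `fetch`, and `run_demo` are hypothetical names, the "website" is a throwaway server on 127.0.0.1, and it uses the standard library's `urllib` rather than any particular crawling library.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for a real website: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>hello crawler</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Send one GET request the way a client would; return (status, text)."""
    request = Request(url, headers={"User-Agent": "demo-client/0.1"})
    opener = build_opener(ProxyHandler({}))  # ignore system proxies for this local demo
    with opener.open(request, timeout=5) as response:
        return response.status, response.read().decode("utf-8")


def run_demo():
    """Start the local server, fetch its only page once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return fetch(f"http://127.0.0.1:{port}/")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns the status code and the HTML text, which is all a crawler sees: whatever the server would have sent to a browser.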

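1.2 What makes crawling hard:

Difficulty one: whether the data can be fetched successfully at all.
Anti-crawling: the more valuable the data, the stronger the anti-crawling measures tend to be.

The crawler imitates a client (an ordinary user) and sends network requests to the server.
The server (back end) > its anti-crawling logic identifies the crawler and then blocks its access.
Why do servers bother with anti-crawling?
–1. To protect data, for example price data (think of Walmart).
–2. A site's back end runs on cloud servers, so concurrency is a real limit; traffic can reach tens of millions of concurrent requests.
Anti-crawling also helps guard against network attacks.
Consider Alibaba Cloud's data centers (cloud servers) across the country on Double Eleven.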
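
To make "the server identifies the crawler" a little more concrete, here is one way a back end might flag a client by request rate. This is a hedged sketch, not how any particular site works: `SlidingWindowLimiter` is a hypothetical class, and the threshold passed to it is whatever the operator chooses.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Flags a client that sends more than `max_requests` within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests
        self.banned = set()

    def allow(self, client_id, now):
        """Record one request at time `now` (in seconds); return False if blocked."""
        if client_id in self.banned:
            return False
        hits = self._hits[client_id]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.banned.add(client_id)  # once flagged, the client stays blocked
            return False
        return True
```

With `SlidingWindowLimiter(100, 3.0)`, a client sending a request every 10 ms is blocked on its 101st request, while a client that sends one request every few seconds is never flagged. Real systems usually combine a rate signal like this with other signals such as headers and login state.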

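1.3 Countering anti-crawling:

A crawler that hits someone's site a hundred times in 3 seconds gets banned and is not allowed to continue.
The countermeasure is to change faces: client A presents itself as A1, A2, A3 ~ A100, so that each request still looks like an ordinary client (an ordinary user) sending a network request and receiving the corresponding response.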
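
On the crawler side, "changing faces" often starts with two simple habits: varying the identity sent with each request and not firing requests in a machine-like rhythm. The sketch below assumes that the identity in question is the `User-Agent` header; `identity_cycle` and `polite_delay` are hypothetical helper names, and the delay values are arbitrary.

```python
import itertools
import random


def identity_cycle(user_agents):
    """Yield request headers forever, moving to the next User-Agent each time."""
    for user_agent in itertools.cycle(user_agents):
        yield {"User-Agent": user_agent}


def polite_delay(base_seconds=1.0, jitter_seconds=0.5, rng=random):
    """Return a randomized pause length so requests are not evenly spaced."""
    return base_seconds + rng.uniform(0, jitter_seconds)
```

A crawl loop would then call `next(identities)` to get the headers for each request and `time.sleep(polite_delay())` between requests. Slowing down is usually the more effective of the two, since it also keeps the load on the target site low.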

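iQIYI VIP videos

An ordinary user logs in, pays, becomes a VIP member, and watches the VIP videos.
A crawler fetches the video data and saves it locally.

庆余年 (Joy of Life)
Baidu, the "B" in BAT
Baidu >> a crawler giant
Cracking Baidu Netdisk
No law explicitly states that crawling in itself is illegal.

Walmart
The earliest data collectors walked into supermarkets with a little notebook > manual crawlers.
Collecting Walmart's product prices with a crawler >> far more efficient.

==A crawler is not a hacker but a law-abiding citizen==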

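"""
Crawler >> data collection >>> once the collected data is used for your own profit, or harms someone else's interests, it crosses into illegal behavior.
Ctrip >> ticket grabbing >> crawler
Ctrip's ticket grabbing also uses crawlers to make money -> 40: ticket price, insurance fee, service fee, speed-up pack. Their legal awareness is very strong: "we do not charge for the crawler, we charge a service fee."
"""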

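2. The basic workflow of a crawler

url: Uniform Resource Locator, the address of a resource on the network
e.g. www.baidu.com and www.sina.com; a complete URL also includes a scheme such as https://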
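
A quick way to see what a URL is made of is to split one with the standard library. This is just an illustrative sketch; `describe_url` is a hypothetical helper, not part of any crawling API.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Break a URL into the parts a crawler usually cares about."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # e.g. "https"
        "host": parts.hostname,   # None when the URL has no "//host" part
        "path": parts.path,
        "query": parts.query,
    }
```

For `"https://www.douyu.com/g_yz"` this gives scheme `https`, host `www.douyu.com`, and path `/g_yz`. A bare `"www.baidu.com"` has no scheme, so the whole string is treated as a path and the host is `None`, which is why a crawler should be given complete URLs.
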
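1. Confirm the target URL (address).

2. Send a network request (imitating an ordinary user) and get the corresponding response data.

3. Extract the specific data you need.

4. Save it locally or store it in a database.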
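
The four steps above can be sketched as one small function. Everything here is an assumption made for illustration: `crawl` and `TitleCollector` are hypothetical names, the "specific data" is taken to be the `title` attribute of `<h3>` tags, the result is saved as a JSON file, and the network call is passed in as a `fetch` callable so the flow can be tried offline.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TitleCollector(HTMLParser):
    """Collects the title attribute of every <h3> tag (step 3: extract)."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)


def crawl(url, fetch, out_path):
    """Run the four steps for one page and return the extracted titles."""
    # Step 1: the target URL is passed in by the caller.
    # Step 2: `fetch` is any callable that takes a URL and returns HTML text.
    html = fetch(url)
    # Step 3: pull out only the data we care about.
    collector = TitleCollector()
    collector.feed(html)
    # Step 4: save the result locally (a JSON file here; a database works too).
    Path(out_path).write_text(
        json.dumps(collector.titles, ensure_ascii=False), encoding="utf-8"
    )
    return collector.titles
```

Keeping `fetch` separate means the extraction and saving logic can be checked against a fixed HTML string before any real request is sent.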

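3. The robots protocol

3.1 Meaning

The robots protocol: through the Robots protocol, a website tells search engines which pages may be crawled and which may not. It is only a general convention of the internet, though, and nothing technically enforces it.

"""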

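3.2 How to view a website's robots protocol:

Think of it as a sign hung at the door of the web server, telling crawlers what they may fetch and what they may not.

domain/robots.txt
www.taobao.com/robots.txt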
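
Since the file always sits at the root of the domain, its address can be derived from any page URL on the same site. A minimal sketch, where `robots_url` is a hypothetical helper and `https` is assumed when the input has no scheme:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url, default_scheme="https"):
    """Return the robots.txt address for the site that serves `page_url`."""
    if "://" not in page_url:
        # Bare hosts such as "www.taobao.com" need a scheme before parsing.
        page_url = f"{default_scheme}://{page_url}"
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Both `robots_url("www.taobao.com")` and a deep link on the same site resolve to the same `/robots.txt` address at the domain root.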

User-agent: Baiduspider    # User-agent names the crawler; Baiduspider is Baidu's crawler
Disallow:

User-agent: baiduspider
Disallow:

3.3 Example:

"""
Douyu's robots protocol
User-agent: Baiduspider    # Baidu's crawler
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: Bytespider    # ByteDance's crawler
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: Sosospider    # Soso's crawler
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: Sogou    # Sogou's crawler
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: YodaoBot
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: Googlebot
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: Bingbot
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: Slurp
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: MSNBot
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: 360Spider
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: YisouSpider
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: Chinasospider
Disallow: /api/*
Disallow: /member*
Disallow: /admin/*
Disallow: /room/*
Disallow: /search/*
Disallow: /cms/*

User-agent: *    # all other crawlers
Disallow: /
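
Rules such as `/api/*` use `*` as a wildcard. The sketch below shows one way to check a path against a list of `Disallow` patterns. It is a simplification under stated assumptions: `pattern_to_regex` and `is_disallowed` are hypothetical helpers, only `Disallow` lines are considered (no `Allow` precedence), `*` is taken to match any run of characters, and a trailing `$` is taken to anchor the end of the path.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, disallow_patterns):
    """Return True if `path` matches any non-empty Disallow pattern."""
    # An empty pattern ("Disallow:" with no value) blocks nothing.
    return any(p and pattern_to_regex(p).match(path) for p in disallow_patterns)


# The six patterns that appear under each named crawler in the listing above.
NAMED_CRAWLER_RULES = ["/api/*", "/member*", "/admin/*", "/room/*", "/search/*", "/cms/*"]
```

Under these rules a path like `/api/v1/list` or `/member/profile` is off limits for the named crawlers, while `/g_yz` is not. For every other crawler the single pattern `/` blocks the whole site.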

==For awareness only; no need to memorize this==

User-agent: Baiduspider
Allow: /

User-agent: Baiduspider-image
Allow: /

User-agent: Baiduspider-video
Allow: /

User-agent: Baiduspider-news
Allow: /

User-agent: Googlebot
Allow: /

User-agent: MSNBot
Allow: /

User-agent: YoudaoBot
Allow: /

User-agent: Sogou web spider
Allow: /

User-agent: Sogou inst spider
Allow: /

User-agent: Sogou spider2
Allow: /

User-agent: Sogou blog
Allow: /

User-agent: Sogou News Spider
Allow: /

User-agent: Sogou Orion spider
Allow: /

User-agent: JikeSpider
Allow: /

User-agent: Sosospider
Allow: /

User-agent: *
Disallow: /
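
Python's standard library can read rules like these directly. The sketch below feeds a shortened copy of the listing above into `urllib.robotparser` and asks whether a given crawler may fetch a page. `RULES` and `build_parser` are names made up for this example, and `https://example.com/page` is a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above: two named crawlers are allowed,
# every other crawler is shut out.
RULES = """\
User-agent: Baiduspider
Allow: /

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(RULES)
# Named crawlers match their own group; anything else falls back to "*".
baidu_ok = parser.can_fetch("Baiduspider", "https://example.com/page")
other_ok = parser.can_fetch("MyCrawler", "https://example.com/page")
```

Here `baidu_ok` is `True` and `other_ok` is `False`: a crawler that does not appear by name falls under `User-agent: *` and is asked to stay away from the whole site.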

4.演示一个最简单的爬虫

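The script below fetches a Douyu listing page and prints the title of each cover item; the image download step is left commented out.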

import os

import requests
from lxml import etree

if __name__ == '__main__':
    # 1. Confirm the target URL
    url_ = "https://www.douyu.com/g_yz"

    # Set the User-Agent so the request looks like it comes from a browser, i.e. an ordinary user
    headers_ = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
    }

    # 2. Send the request and get the response
    response_ = requests.get(url_, headers=headers_, timeout=10)

    # Decode the response body into a string
    str_data = response_.content.decode()

    # Parse the string into an HTML element tree
    html_obj = etree.HTML(str_data)

    # 3. Select the li element of every cover image on the page
    item_list = html_obj.xpath('//li[@class="layout-Cover-item"]')

    # Build the output folder with os.path.join so the path also works outside Windows
    new_dir = os.path.join(os.getcwd(), "PICTURE")
    if not os.path.exists(new_dir):    # does the folder already exist?
        os.makedirs(new_dir)           # if not, create it

    # Loop over each li element, which holds an image URL and an image title
    for i in item_list:
        # Extract the image URL (xpath returns a list)
        url = i.xpath('.//img[@class="DyImg-content is-normal"]/@src')

        # Extract the image title (also a list)
        name = i.xpath('.//h3/@title')
        print(name)
        # # Send a request for the image URL
        # response = requests.get(url[0], headers=headers_)
        #
        # # 4. Save the image
        # with open(f"{new_dir}/{name[0].replace('/','')}.jpg","wb") as f:
        #     f.write(response.content)
        #
        # print(f'<<{name[0]}>> has been downloaded....')
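
If the commented-out save step is switched on, two details are worth handling first: `xpath` returns a list that may be empty, and a title may contain characters that are not valid in file names. The helpers below are a sketch under those assumptions; `first_or_none` and `image_path` are hypothetical names, and the set of stripped characters is the one Windows rejects in file names.

```python
import re
from pathlib import Path


def first_or_none(values):
    """Return the first item of an xpath-style result list, or None if it is empty."""
    return values[0] if values else None


def image_path(folder, title, extension=".jpg"):
    """Build a save path for `title`, dropping characters that file names cannot contain."""
    safe = re.sub(r'[\\/:*?"<>|]', "", title).strip() or "untitled"
    return Path(folder) / f"{safe}{extension}"
```

In the loop, `first_or_none(url)` and `first_or_none(name)` can be checked before downloading, and `image_path(new_dir, title)` can replace the hand-built f-string path.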