Python web scraper example and summary (with detailed explanatory comments)

Author: 不想想了

Last modified: 2022-03-15 10:08:31


First published: 2022-03-14 20:34:48

Article link: https://blog.csdn.net/weixin_45627194/article/details/123487327


Notes

Development environment: Anaconda (Spyder)
Related post: Do you still need PyCharm after installing Anaconda (Spyder)? (Choosing a Python IDE)

Code with detailed comments

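# Import the requests module. It sends the HTTP request for each page and lets us check the response status code.
import requests

# Import BeautifulSoup from the bs4 module.
# Once we have the page source, BeautifulSoup parses it so that we can extract the nodes we need.
# Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the
# parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document
# (per the official docs). It lets us pick out specific content without writing regular expressions by hand.
from bs4 import BeautifulSoup

# Store the User-Agent as a key-value pair in the headers dict; see https://blog.csdn.net/yeyuanxiaoxin/article/details/104345734
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"}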
    
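# Loop over the numbers 0-9 produced by range() to walk through the 10 result pages
for i in range(0, 10):

    # Multiply the page index by 25 to get the start offset of that page, and assign it to page
    page = i * 25

    # Build the URL: "https://movie.douban.com/top250?start=" + the offset as a string + "&filter=".
    # The loop variable is an int and cannot be concatenated with a string directly, hence str(page).
    url = "https://movie.douban.com/top250?start=" + str(page) + "&filter="

    # Pass the headers dict to requests.get() through the headers parameter; assign the result to response
    response = requests.get(url, headers=headers)

    # Get the response body as a string and assign it to html
    html = response.text

    # Pass html and the lxml parser to BeautifulSoup(); assign the result to soup
    soup = BeautifulSoup(html, "lxml")

    # Use find_all() to collect the nodes in soup with class="pic" (filtering by a string); assign to content_all
    content_all = soup.find_all(class_="pic")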
    # find_all(tag, attributes, recursive, text, limit, keywords)
    # (tag, attributes, recursive search, text, result limit, keyword filters)
    # find(tag, attributes, recursive, text, keywords)

    # Loop over content_all and handle (save) each poster separately
    for content in content_all:

        # Use find() to get the img tag inside content; assign to imgContent
        imgContent = content.find(name="img")

        # Read the alt attribute via .attrs; assign to imgName
        imgName = imgContent.attrs["alt"]

        # Read the src attribute via .attrs; assign to imgUrl
        imgUrl = imgContent.attrs["src"]

        # Use replace() to swap s_ratio_poster in the link for m; assign to imgUrlHd
        imgUrlHd = imgUrl.replace("s_ratio_poster", "m")
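        # The thumbnail link on the list page and the HD image link are nearly identical:
        # replacing the s_ratio_poster segment of the link with m gives the HD image link.
        # The image suffix is .webp, an image format developed by Google that lets sites
        # serve smaller images with richer detail.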
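        # Pass the HD link to requests.get(); assign the result to imgResponse
        imgResponse = requests.get(imgUrlHd)

        # .content returns the response body as raw bytes (the image data); assign to img
        img = imgResponse.content

        # Use with + open() to open the file for binary writing ("wb").
        # The r prefix makes the path a raw string, so backslashes are not treated as escape characters.
        # format() joins the image name with the .jpg extension; the ./pic directory must already exist.
        # The opened file is bound to f
        with open(r"./pic/{0}.jpg".format(imgName), "wb") as f:
            # Write the image bytes with write()
            f.write(img)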
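
The save step above assumes that ./pic already exists and that every alt text is a valid file name. A minimal sketch of a more defensive version follows. The helper names safe_filename and save_image are hypothetical (they are not in the original post), and using an underscore as the replacement character is an assumption.

```python
import os
import re


def safe_filename(name, replacement="_"):
    """Replace characters that are not allowed in file names on common systems."""
    # Characters such as / \ : * ? " < > | are invalid on Windows or act as path separators.
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a placeholder so we never produce an empty file name.
    return cleaned or "unnamed"


def save_image(data, name, directory="./pic"):
    """Write image bytes to directory/name.jpg and return the path."""
    # Create the target directory if it does not exist yet.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(name) + ".jpg")
    with open(path, "wb") as f:
        f.write(data)
    return path
```

In the loop above, a call such as save_image(img, imgName) could stand in for the with open(...) block.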

Code without the explanatory comments

import requests
from bs4 import BeautifulSoup

# 1. Fetch the pages
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"}
for i in range(0, 10):
    page = i * 25
    url = "https://movie.douban.com/top250?start=" + str(page) + "&filter="
    response = requests.get(url, headers=headers)
    html = response.text

    # 2. Parse the data item by item
    soup = BeautifulSoup(html, "lxml")
    content_all = soup.find_all(class_="pic")  # class is a Python keyword, so the argument is spelled class_
    for content in content_all:

        imgContent = content.find(name="img")

        imgName = imgContent.attrs["alt"]
        imgUrl = imgContent.attrs["src"]
        imgUrlHd = imgUrl.replace("s_ratio_poster", "m")
        imgResponse = requests.get(imgUrlHd)
        img = imgResponse.content
        # 3. Save the data
        with open(r"./pic/{0}.jpg".format(imgName), "wb") as f:
            f.write(img)
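
To see what find_all() and find() return without hitting the network, the same calls can be run on an inline HTML fragment. This is only a minimal sketch: the fragment below is made up for illustration (it merely imitates the class="pic" layout used above), and it uses Python's built-in html.parser instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates two poster entries; not copied from the real page.
sample_html = """
<div class="pic"><a href="#"><img alt="Movie A" src="https://example.com/s_ratio_poster/a.webp"></a></div>
<div class="pic"><a href="#"><img alt="Movie B" src="https://example.com/s_ratio_poster/b.webp"></a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching node; class_ is used because class is a Python keyword.
pics = soup.find_all(class_="pic")

posters = []
for pic in pics:
    # find() returns only the first match inside this node (or None if there is none).
    img = pic.find(name="img")
    posters.append((img.attrs["alt"], img.attrs["src"].replace("s_ratio_poster", "m")))

print(posters)
```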

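Naming images by year and number of ratings

The most common tools for extracting data are BeautifulSoup (used above) and re.
Here re is added to extract the data with regular expressions.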

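# Name each image by year and number of ratings
import re

import requests
from bs4 import BeautifulSoup

# 1. Fetch the pages (same as above)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"}
for i in range(0, 10):
    page = i * 25
    url = "https://movie.douban.com/top250?start=" + str(page) + "&filter="
    response = requests.get(url, headers=headers)
    html = response.text

    # 2. Parse the data item by item
    soup = BeautifulSoup(html, "lxml")
    content_all = soup.find_all(class_="item")
    # print(content_all)
    for content in content_all:
        imgContent = content.find('img')
        imgUrl = imgContent.attrs["src"]
        imgUrlHd = imgUrl.replace("s_ratio_poster", "m")
        imgResponse = requests.get(imgUrlHd)
        img = imgResponse.content
        content = str(content)
        # print(content)
        # Pull out the fields we need
        # Year: grab the <p class=""> block that contains it
        ye = re.compile(r'<p class="">(.*?)</p>', re.S).findall(content)
        # Remove noise: str() of a list writes non-breaking spaces as \xa0, and the trailing 0 would match as a number
        ye = re.sub(r"xa0", "", str(ye))
        # Extract the numbers; the first one is taken as the year
        cye = re.compile(r"-?\d+\.?\d*", re.S).findall(ye)
        # print(cye)
        # Extract the number of ratings
        judgeNum = re.compile(r'<span>(\d*)人评价</span>').findall(content)
        # print(judgeNum)
        # 3. Save the data
        with open(r"./pic/{0}-{1}.jpg".format(cye[0],judgeNum[0]), "wb") as f:
            f.write(img)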
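
The regular expressions above can be tried on a small string before running them against the live page. The following is a minimal sketch under stated assumptions: the fragment is invented and only shaped like one class="item" entry, and the year pattern is a stricter four-digit variant rather than the original expression, so that digits elsewhere in the block are less likely to be picked up.

```python
import re

# Invented fragment imitating one entry; \xa0 is a non-breaking space.
sample_item = (
    '<div class="item"><p class="">\n'
    'Director: Someone\xa0\xa0\xa0Cast: Someone Else<br/>\n'
    '1994\xa0/\xa0Country\xa0/\xa0Genre\n'
    '</p><span>123456人评价</span></div>'
)

# re.S lets "." match newlines, since the <p> block spans several lines.
info = re.search(r'<p class="">(.*?)</p>', sample_item, re.S)

# Look for a standalone four-digit year starting with 19 or 20.
year = re.search(r"\b(?:19|20)\d{2}\b", info.group(1))

# Capture the digits directly in front of the rating label.
votes = re.search(r"<span>(\d+)人评价</span>", sample_item)

file_name = "{0}-{1}.jpg".format(year.group(0), votes.group(1))
print(file_name)
```

Working on the matched text directly, instead of on str() of a list, also avoids having to strip the escaped xa0 sequences afterwards.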

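From 25 to 250

The principle is simply to find the pattern and wrap it in a for loop. A scraper works not because it is sophisticated, but because large volumes of data are laid out according to a pattern. Find the pattern, then use it to fetch the data.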

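for i in range(0, 2):  # two pages for a quick test; range(0, 10) covers all 250 entries
    page = i * 25  # start offset of page i: 0, 25, ...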
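
The pattern here is that only the start parameter changes between pages, in steps of 25. A minimal sketch that generates all ten page URLs is shown below; the function name page_urls and its parameters are illustrative and do not come from the original post.

```python
BASE_URL = "https://movie.douban.com/top250?start={start}&filter="


def page_urls(total=250, per_page=25):
    """Return the list-page URLs needed to cover `total` entries."""
    # range(0, total, per_page) yields the offsets 0, 25, 50, ..., 225.
    return [BASE_URL.format(start=start) for start in range(0, total, per_page)]


urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```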

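Python scraping summary

Basic approach:

1. Fetch the pages
2. Parse the data
Parsing options: text processing, JSON parsing, BeautifulSoup, and so on
3. Save the data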
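
The three steps map naturally onto three small functions. The sketch below is an illustration under stated assumptions, not the code from this post: the function names, the JSON payload, and the CSV output are all made up, and the fetch step is a stub so that the example runs offline. In a real scraper the stub would return something like requests.get(url, headers=headers).text.

```python
import csv
import io
import json


def fetch(url):
    """Step 1: fetch the page. Stubbed with a fixed JSON string for illustration."""
    return '[{"title": "Movie A", "year": 1994}, {"title": "Movie B", "year": 2001}]'


def parse(raw):
    """Step 2: parse the data. Here the payload is JSON, so json.loads is enough."""
    return [(item["title"], item["year"]) for item in json.loads(raw)]


def save(rows, out):
    """Step 3: save the data as CSV to any writable text file object."""
    writer = csv.writer(out)
    writer.writerow(["title", "year"])
    writer.writerows(rows)


buffer = io.StringIO()
save(parse(fetch("https://example.com/api")), buffer)
print(buffer.getvalue())
```

Keeping the steps separate means the parser can be swapped (JSON, BeautifulSoup, re) without touching the fetch or save code.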

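Python scraping and machine learning

Using a Python scraper to collect and organize the data and resources that people produce online is the first step of machine learning and deep learning.
Yet scraping, the entry-level skill for machine learning, often becomes the first roadblock on the way to mastering AI techniques:
The target site shows its content only after login.
The target site detects the scraper and bans the IP.
The target site returns dirty data that cannot be interpreted.
The target site uses CAPTCHAs, so the resources cannot be fetched.
The target site returns encrypted data.
The target site renders its data with JavaScript, so a plain request cannot capture it.
How can data inside apps and mini programs be obtained?
Limited machine performance makes scraping inefficient.
These are all problems that beginners run into while scraping, and any one of them can prevent the data from being collected. Without solving these scraping problems it is hard to build the basic skills for machine learning and deep learning, and this has become a significant reason why working professionals fail to earn higher pay through AI skills.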