Web Scraping: Parsing Web Pages

Latest recommended article published on 2024-04-12 02:17:38

xiaomu_347

Category: Deep Learning

Article link: https://blog.csdn.net/xiaomu_347/article/details/104706969

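Press F12 in the browser to view a page's source. By comparing the rendered page with that source, you can work out which markup holds each piece of content, as shown in the figure.

Whenever you scrape data from a site, you will have to parse the page content. Below is a summary of the approaches I use most often: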

(1) BeautifulSoup

https://www.bilibili.com/video/av93140655?from=search&seid=18437810415575324694

The BeautifulSoup class in the bs4 library makes page parsing very simple. A few of its commonly used methods deserve attention:

soup.find
soup.find_all
soup.select_one
soup.select

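There are also attributes such as soup.head.contents (a list of child nodes) and soup.head.children (an iterator over the same nodes). For a single tag a, note a.attrs (a dict of all its attributes) and a.string (its text). a.string equals a.get_text() only when the tag contains a single piece of text; if the tag has several children, a.string is None while a.get_text() returns the text of all descendants joined together.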
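As a minimal sketch of the methods and attributes listed above, the following runs them against a small HTML string. The snippet and its tag contents are made up for illustration; they do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration.
html = """
<html><head><title>Demo</title></head>
<body>
  <a id="first" class="link" href="/a">Alpha</a>
  <a class="link" href="/b">Beta <b>bold</b></a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                # first matching tag, or None
all_links = soup.find_all("a")        # list of every matching tag
by_css = soup.select_one("a#first")   # first match for a CSS selector, or None
css_all = soup.select("a.link")       # list of all matches for a CSS selector

head_nodes = soup.head.contents       # list of the direct children of <head>

print(first.attrs)                    # dict of attributes; "class" is a list
print(first.string)                   # the tag's only child is text, so that text
second = all_links[1]
print(second.string)                  # None, because the tag has two children
print(second.get_text())              # text of all descendants, joined
```

find and select_one return a single tag (or None), while find_all and select return lists, so only the latter two can be iterated directly.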

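Note: in CSS, a tag name is written with no decoration, a class name (the text inside the quotes of class="className") is prefixed with a dot, and an id (the text inside the quotes of id="idName") is prefixed with #. The same syntax can be used here to filter elements through soup.select(), which returns a list. See https://www.cnblogs.com/kangblog/p/9153871.html. The :nth-of-type(n) selector matches the n-th sibling among elements of the same type. It is similar to :nth-child(n), which matches an element only if it is the n-th child of its parent, counting siblings of every type.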
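The difference between :nth-of-type and :nth-child is easiest to see on a parent whose first child is a different tag. The sketch below uses an invented snippet; the id "box" and class "note" are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the first child of #box is an <h2>, not a <p>.
html = """
<div id="box">
  <h2>Title</h2>
  <p class="note">first paragraph</p>
  <p>second paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")        # tag name: no prefix
by_class = soup.select(".note")  # class name: leading dot
by_id = soup.select("#box")      # id: leading #

# The first <p> among the <p> siblings.
nth_type = [t.get_text() for t in soup.select("#box > p:nth-of-type(1)")]
# A <p> that is also the first child overall: none, the first child is <h2>.
nth_child_1 = [t.get_text() for t in soup.select("#box > p:nth-child(1)")]
# A <p> that is the second child overall.
nth_child_2 = [t.get_text() for t in soup.select("#box > p:nth-child(2)")]

print(nth_type, nth_child_1, nth_child_2)
```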

import requests
from bs4 import BeautifulSoup

class Douban:
    def __init__(self):
        self.URL = 'https://movie.douban.com/top250'
        # Page offsets 0, 25, ..., 225: ten pages of 25 movies each
        self.starnum = list(range(0, 250, 25))
        self.header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

    def get_top250(self):
        for start in self.starnum:
            html = requests.get(self.URL, params={'start': str(start)}, headers=self.header)
            soup = BeautifulSoup(html.text, "html.parser")
            # ">" selects direct children; keep a space on each side of it
            names = soup.select('#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-of-type(1)')
            for name in names:
                print(name.get_text())

if __name__ == "__main__":
    douban = Douban()
    douban.get_top250()

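soup.select and soup.find_all() differ mainly in how you describe the target. select takes a CSS selector, which suits matching along the tree path (for example a path copied from the browser), while find_all filters on local features such as the tag name and attributes. For the differences in usage, see https://www.cnblogs.com/suancaipaofan/p/11786046.html and pick whichever you prefer.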
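To make the comparison concrete, here is a small sketch that extracts the same elements three ways. The markup is hypothetical and only loosely modeled on a movie list; it is not the real Douban page, and the class names "hd", "title" and "other" are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this comparison.
html = """
<ol>
  <li><div class="hd"><a href="/1"><span class="title">Movie A</span><span class="other">alias</span></a></div></li>
  <li><div class="hd"><a href="/2"><span class="title">Movie B</span></a></div></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# select with a full tree path
full_path = soup.select("ol > li > div.hd > a > span:nth-of-type(1)")
# select with a short selector
short_css = soup.select("span.title")
# find_all with a tag name and a class filter
via_find = soup.find_all("span", class_="title")

names = [[t.get_text() for t in result] for result in (full_path, short_css, via_find)]
print(names)
```

All three return the same tags here. A long path breaks more easily when the page layout changes, whereas a short selector or find_all filter depends on the class name staying stable.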

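(2) re

A regular expression is a text pattern made up of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"). It is typically used to match, search, replace, and split text that conforms to a given pattern (rule).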

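Single characters:
    .  : any character except a newline
    [] : any one character from the set, e.g. [aoe] or [a-w]
    \d : a digit, [0-9]
    \D : a non-digit
    \w : a digit, letter, or underscore; for str patterns in Python 3 also other word characters such as Chinese
    \W : anything \w does not match
    \s : any whitespace character, including space, tab, form feed, and so on; for ASCII, equivalent to [ \f\n\r\t\v]
    \S : a non-whitespace character

Quantifiers:
    *     : any number of times, >= 0
    +     : at least once, >= 1
    ?     : optional, 0 or 1 time
    {m}   : exactly m times
    {m,}  : at least m times, e.g. hello{3,} (the "o" repeated 3 or more times)
    {m,n} : between m and n times

Anchors:
    $ : the match must end here (end of string)
    ^ : the match must start here (start of string)

Grouping:
    (ab)

Greedy mode: .*
Non-greedy (lazy) mode: .*?

Flags:
    re.I : ignore case
    re.M : multi-line mode; ^ and $ also match at the start and end of every line
    re.S : "single-line" (DOTALL) mode; . also matches newlines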
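The greedy versus lazy distinction and the three flags are easier to judge from a run than from the table. The following is a small sketch using only the standard re module; the input strings are invented.

```python
import re

# Greedy vs. lazy: .* grabs as much as possible, .*? as little as possible.
text = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>(.*)</b>", text)
lazy = re.findall(r"<b>(.*?)</b>", text)

# re.I and re.M: case-insensitive matching, and ^ matching at every line start.
multi = "Title: A\ntitle: B"
no_flags = re.findall(r"^title: (\w)", multi)
ignore_case = re.findall(r"^title: (\w)", multi, re.I)
both = re.findall(r"^title: (\w)", multi, re.I | re.M)

# re.S: without it, . stops at a newline and the pattern cannot span lines.
block = "<div>\nline\n</div>"
without_s = re.findall(r"<div>(.*?)</div>", block)
with_s = re.findall(r"<div>(.*?)</div>", block, re.S)

print(greedy, lazy)
print(no_flags, ignore_case, both)
print(without_s, with_s)
```

Page source almost always contains newlines, which is why scraping patterns usually combine the lazy form .*? with re.S.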

re.sub(pattern, replacement, string) returns a copy of the string with every match of the pattern replaced.
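A few common variations of re.sub are sketched below: a plain replacement, a limited number of replacements via count, a backreference, and a function as the replacement. The sample strings are made up.

```python
import re

text = "call 010-1234 or 021-5678"

# Replace every run of exactly four digits.
masked = re.sub(r"\d{4}", "****", text)

# count=1 replaces only the first match.
masked_once = re.sub(r"\d{4}", "****", text, count=1)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{3})-(\d{4})", r"\2-\1", text)

# The replacement can also be a function that receives the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(masked)
print(masked_once)
print(swapped)
print(doubled)
```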

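Examples:

import re

# Extract 170 (the string reads "I like girls who are 170 tall")
string = '我喜欢身高为170的女孩'
re.findall(r'\d+', string)   # ['170']

# Extract http:// and https://
key = 'http://www.baidu.com and https://boob.com'
re.findall(r'https?://', key)   # ['http://', 'https://']

# Extract hello; because the pattern has a group, findall returns only the group
key = 'lalala<hTml>hello</HtMl>hahah'
re.findall(r'<[Hh][Tt][mM][lL]>(.*)</[Hh][Tt][mM][lL]>', key)   # ['hello']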



### Scrape images from qiushibaike and save them
import requests
import re
import os

# Create the output folder if it does not exist yet
os.makedirs('./qiutuLibs', exist_ok=True)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# A generic URL template; %d is filled in with the page number
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5185803'

for page in range(1, 36):
    new_url = url % page
    page_text = requests.get(url=new_url, headers=headers).text

    # Parse the data (the image addresses)
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    # re.S lets . match newlines, which the page source contains
    src_list = re.findall(ex, page_text, re.S)

    # The src value is not a complete URL: the scheme is missing
    for src in src_list:
        src = 'https:' + src
        # Request each image URL separately; .content returns the response body as bytes
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]
        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded successfully!')
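The script above repairs the scheme-less src values by prepending 'https:'. As an alternative sketch (not what the original script does), urllib.parse.urljoin from the standard library resolves scheme-relative, root-relative and absolute URLs against the page URL in one call. The helper names absolute_url and file_name, and the example.com hosts, are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

def absolute_url(page_url, src):
    """Resolve an img src against the URL of the page it was found on."""
    return urljoin(page_url, src)

def file_name(url):
    """Take the last path segment, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)

page_url = "https://www.qiushibaike.com/pic/page/1/"

scheme_relative = absolute_url(page_url, "//pic.example.com/a/b/img1.jpg")
root_relative = absolute_url(page_url, "/static/logo.png")
already_absolute = absolute_url(page_url, "https://cdn.example.com/x.jpg?v=2")

print(scheme_relative)
print(root_relative)
print(file_name(already_absolute))
```

Taking the file name from the parsed path rather than from src.split('/')[-1] keeps a query string such as ?v=2 out of the saved file name.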

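(3) lxml

Use lxml to parse the downloaded page into an element tree, then use XPath to filter its content. Some details:

/ versus //: / selects only direct children, while // selects descendants at any depth. // is used more often in practice, but it depends on the situation.
contains: when an attribute holds several values (for example a class list), use the contains function to match one of them.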
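Both points can be checked on a small document. The sketch below assumes lxml is installed; the HTML and its class names ("list main", "sidebar") are invented for illustration.

```python
from lxml import etree

# Hypothetical HTML: one <a> directly under the div, one nested inside a <p>.
html = """
<html><body>
  <div class="list main">
    <a href="/top">top</a>
    <p><a href="/nested">nested</a></p>
  </div>
  <div class="sidebar"><a href="/side">side</a></div>
</body></html>
"""
page = etree.HTML(html)

# "/" after the div: only <a> elements that are direct children.
direct = page.xpath('//div[contains(@class, "main")]/a/@href')
# "//" after the div: <a> elements at any depth below it.
anywhere = page.xpath('//div[contains(@class, "main")]//a/@href')
# An exact comparison fails because the attribute value is "list main".
exact = page.xpath('//div[@class="main"]')
# text() returns the text nodes of the selected elements.
texts = page.xpath('//div[contains(@class, "sidebar")]/a/text()')

print(direct, anywhere, len(exact), texts)
```

Note that contains does a plain substring test, so contains(@class, "main") would also match a class such as "mainly"; that is acceptable for a quick scrape but worth keeping in mind.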

The basic form looks like this:

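from lxml import etree

# Stand-in for the raw bytes of a downloaded page (e.g. response.content)
html = b'<html><body><a href="/x">x</a></body></html>'
page = etree.HTML(html.decode('utf-8'))

# <a> tags: every <a> directly under <body>, which is directly under <html>
tags = page.xpath('/html/body/a')
print(tags)
# Result: [<Element a at 0x34b1f08>, ...]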

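Example: scraping the top 20 most popular languages:

# Import the required libraries
import urllib.request as urlrequest
import pandas as pd
from lxml import etree

# Fetch the HTML
url = 'https://www.tiobe.com/tiobe-index/'
page = urlrequest.urlopen(url).read()
# Build the lxml element tree
html = etree.HTML(page)

# Parse the HTML and select the text of every row in the top-20 table
df = html.xpath('//table[contains(@class, "table-top20")]/tbody/tr//text()')
# Group the flat list of text nodes into rows of 5 and load them into a DataFrame
# (5 text nodes per row reflects the page layout when this was written)
tmp = []
for i in range(0, len(df), 5):
    tmp.append(df[i: i+5])
df = pd.DataFrame(tmp)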

All three approaches above can parse page content and extract what you want to scrape. To finish, here is a bonus script that puts them side by side:

"""
Download all the images from meizitu
"""
import os
import re
import requests
from bs4 import BeautifulSoup

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
    "cookie": "__cfduid=db8ad4a271aefd9fe52709ba6d3d94f561583915551; UM_distinctid=170c8b96544233-0c58b81612557b-404b032d-100200-170c8b96545354" ,
    "accept": "text/html,application/xhtml+xml,application/xml;q=0."
}

# Make sure the output folder exists
os.makedirs("./meinv", exist_ok=True)
cnt = 48
for page in range(2):
    response = requests.get(f"https://www.meizitu.com/a/list_1_{page+1}.html", headers=header, stream=True, timeout=(3, 7))
    response.encoding = "gb2312"
    if response.status_code != 200:
        print(f"Page {page+1} failed with status {response.status_code}")
        continue
    # Three interchangeable ways to extract the image URLs:
    # result = re.findall("""<a target='_blank' href=".*?"><img src="(.*?)" alt="(.*?)"></a>""", response.text)  # regex
    result2 = BeautifulSoup(response.text, 'html.parser').find_all("img")  # BeautifulSoup
    # result3 = etree.HTML(response.text).xpath("//img/@src")  # lxml (needs: from lxml import etree)
    for i in result2:
        path = i.attrs.get("src")
        if not path:
            continue
        try:
            img = requests.get(path, headers=header, stream=True, timeout=(3, 7))
            with open(f"./meinv/{cnt}.jpg", "wb") as f:
                # .content is the response body as bytes
                f.write(img.content)
            cnt += 1
        except requests.RequestException:
            # Skip images that fail to download
            pass
    print(f"Page {page+1} fetched successfully; enjoy")
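The three alternatives that appear in the script above (regex, BeautifulSoup, lxml) can be compared offline on the same input. The sketch below assumes bs4 and lxml are installed; the HTML is an invented stand-in for a list page, and the img.example.com URLs are placeholders.

```python
import re
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical list-page fragment, made up for this comparison.
html = """
<ul>
  <li><a target='_blank' href="/a/1.html"><img src="https://img.example.com/1.jpg" alt="first"></a></li>
  <li><a target='_blank' href="/a/2.html"><img src="https://img.example.com/2.jpg" alt="second"></a></li>
</ul>
"""

# Regex: depends on the exact attribute order and quoting in the source.
via_re = [src for src, alt in re.findall(r'<img src="(.*?)" alt="(.*?)">', html)]

# BeautifulSoup: find the tags, then read the attribute.
via_bs4 = [img["src"] for img in BeautifulSoup(html, "html.parser").find_all("img")]

# lxml: select the attribute values directly with XPath.
via_lxml = list(etree.HTML(html).xpath("//img/@src"))

print(via_re)
print(via_bs4)
print(via_lxml)
```

On this input all three give the same list. The regex is the most fragile of the three, since reordering the src and alt attributes in the page would silently produce no matches.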

To be continued!