HTML Page Parsing

哈麻闲人

Last modified: 2022-08-04 17:26:12

Category: Python crawlers. Tags: html, servlet, front end, python, crawler

First published: 2022-08-04 17:23:36

Article link: https://blog.csdn.net/hgfyyhg/article/details/126163280

I. Regular Expressions

1. Import the re module

import re

2. Write the regular expression

obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)'
                 r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp;/&nbsp;'
                 r'(?P<place>.*?)&nbsp;.*?</p>', re.S)

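(?P<group_name>pattern) defines a named group: the text matched by pattern can later be retrieved by group_name.

With the re.S flag (also spelled re.DOTALL), "." also matches the newline character "\n", so a pattern such as .*? can match across line breaks and the whole string is searched as one unit. Without the flag, "." matches any character except a newline.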
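The effect of re.S is easiest to see on a string that contains a line break. The following is a minimal sketch; the sample text and the group name body are made up for illustration.

```python
import re

# Hypothetical sample: the content of the <span> is split over two lines.
text = "<span>first line\nsecond line</span>"

# Without re.S, "." cannot cross the "\n", so the pattern never reaches </span>.
no_flag = re.search(r"<span>(?P<body>.*?)</span>", text)

# With re.S, "." also matches "\n", so the named group spans both lines.
with_flag = re.search(r"<span>(?P<body>.*?)</span>", text, re.S)

print(no_flag)                         # no match object is returned
print(repr(with_flag.group("body")))   # captured text, newline included
```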

3. Parse the HTML page

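# resp is assumed to be the response object returned by requests.get(...)
result = obj.finditer(resp.text)    # returns an iterator of match objects

result = obj.match(resp.text)    # matches only at the start of the string; returns None if there is no match

result = obj.search(resp.text)    # returns the first match, or None if there is none

result = obj.findall(resp.text)    # returns a list (of tuples when the pattern has several groups)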
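The four methods differ mainly in what they return. The sketch below compares them on a small made-up string; the pattern and the names key and value are illustrative only and are not part of the original crawler.

```python
import re

# Hypothetical pattern: a word, an equals sign, then digits.
pattern = re.compile(r"(?P<key>\w+)=(?P<value>\d+)")
text = "a=1, b=22, c=333"

# finditer: lazily yields one match object per hit.
matches = list(pattern.finditer(text))

# match: only tries position 0 of the string.
at_start = pattern.match(text)
not_at_start = pattern.match(", b=22")

# search: scans forward and returns the first hit anywhere in the string.
first_anywhere = pattern.search(", b=22")

# findall: returns plain data; with two groups, each item is a tuple.
pairs = pattern.findall(text)

print([m.group("key") for m in matches])
print(at_start.group("value"), not_at_start, first_anywhere.group("key"))
print(pairs)
```

finditer is usually the most convenient choice when named groups are needed, because findall discards the group names.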

4. Get the group values

# result must come from obj.finditer(); each item is a match object
for it in result:
    print(it.group('name'))
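Putting the steps together, here is a minimal end-to-end sketch. It reuses the pattern from step 2, but runs it on a hand-written HTML fragment instead of resp.text; the fragment is an assumption about what the target page might look like, not a copy of it. Because the year group starts right after <br>, it captures the line break and indentation too, so the values are stripped.

```python
import re

obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)'
                 r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp;/&nbsp;'
                 r'(?P<place>.*?)&nbsp;.*?</p>', re.S)

# Hypothetical page fragment, standing in for resp.text.
sample_html = """
<li>
  <span class="title">Example Film</span>
  <p class="">
    Director: Someone<br>
    1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Drama
  </p>
</li>
<li>
  <span class="title">Another Film</span>
  <p class="">
    Director: Someone Else<br>
    2001&nbsp;/&nbsp;Japan&nbsp;/&nbsp;Animation
  </p>
</li>
"""

rows = []
for it in obj.finditer(sample_html):
    # groupdict() maps every named group to its captured text.
    row = {name: value.strip() for name, value in it.groupdict().items()}
    rows.append(row)

print(rows)
```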

II. BeautifulSoup

1. Import BeautifulSoup

import requests
from bs4 import BeautifulSoup    # import BeautifulSoup

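find(tag, attribute=value) returns the first matching tag, or None if nothing matches.

find_all(tag, attribute=value) returns a list of all matching tags.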
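A short sketch of the difference between the two calls, run on a made-up fragment (the markup and class names are hypothetical). Since class is a reserved word in Python, bs4 accepts the keyword class_ for filtering on the class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration.
sample_html = """
<ul>
  <li class="item"><a href="/a">A</a></li>
  <li class="item"><a href="/b">B</a></li>
  <li class="other">C</li>
</ul>
"""
page = BeautifulSoup(sample_html, "html.parser")

first_item = page.find("li", class_="item")      # first matching tag only
all_items = page.find_all("li", class_="item")   # every matching tag
missing = page.find("table")                     # nothing matches

print(first_item.text)
print(len(all_items))
print(missing)
```

Checking the result of find for None before using it avoids an AttributeError when the page layout changes.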

2. Parse the page with bs4

# Parse the page with bs4
page = BeautifulSoup(resp.text, 'html.parser')
pageList = page.find_all('a', style="display: block;", target="_blank")
for i in pageList:
    print(i.get('href'))    # get the value of an attribute; use .text for the content enclosed by the tag
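The same filtering can be tried offline. This sketch applies the attribute filters from the snippet above to a hand-written fragment; the URLs and link texts are placeholders, not data from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the first and third links carry the style attribute.
sample_html = """
<div>
  <a href="https://example.com/1" style="display: block;" target="_blank">First</a>
  <a href="https://example.com/2" target="_blank">Second</a>
  <a href="https://example.com/3" style="display: block;" target="_blank">Third</a>
</div>
"""
page = BeautifulSoup(sample_html, "html.parser")

# Both keyword filters must match for a tag to be returned.
links = page.find_all("a", style="display: block;", target="_blank")

# get() reads an attribute value; .text reads the content between the tags.
pairs = [(a.get("href"), a.text) for a in links]
print(pairs)
```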

III. XPath

1. Import etree

import requests
from lxml import etree    # import etree

2. Parse the HTML with XPath

html = etree.HTML(resp.text)
divs = html.xpath('//*[@id="utopia_widget_6"]/div/div[1]/div')

for div in divs:
    # A leading "./" makes the path relative to the current div
    price = div.xpath('./div[4]/span/text()')[0]
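A self-contained sketch of the same pattern: select the row elements with an absolute path, then use relative paths inside the loop. The fragment, the id market, and the column layout are assumptions made for illustration. Note that xpath() returns a list, which is empty when nothing matches, so indexing with [0] directly raises IndexError on such rows; the sketch checks for that first.

```python
from lxml import etree

# Hypothetical fragment standing in for resp.text.
sample_html = """
<div id="market">
  <div class="row"><span>Cabbage</span><span>1.5</span></div>
  <div class="row"><span>Potato</span><span>2.0</span></div>
  <div class="row"><span>No price yet</span></div>
</div>
"""
html = etree.HTML(sample_html)
rows = html.xpath('//*[@id="market"]/div')

items = []
for row in rows:
    # "./" starts from the current row instead of the document root.
    name = row.xpath('./span[1]/text()')
    price = row.xpath('./span[2]/text()')
    if name and price:   # skip rows where either lookup found nothing
        items.append((name[0], price[0]))

print(items)
```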