Python Web Scraping: Parsing a Page to Get "百度一下"

错过人间飞鸿

Last modified 2023-08-12 21:58:00

Tags: python, web scraping

First published 2023-07-24 20:01:26

Article link: https://blog.csdn.net/m0_63757342/article/details/131903754

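Summary (generated automatically by CSDN): This article shows how to use Python's urllib.request module to send an HTTP request and fetch a page's source, then parse the HTML with the lxml library's etree.HTML() method and extract specific elements with an XPath expression. In the example, the code requests the Baidu home page and extracts the value attribute of the input tag whose id is su.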

Parsing a server response: etree.HTML()

tree = etree.HTML(response.read().decode('utf-8'))
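To see what etree.HTML() does without touching the network, here is a minimal sketch that parses an inline string. The SAMPLE_HTML markup is a hypothetical stand-in written for this illustration, not Baidu's real page source; only the id "su" and the button text are taken from the article.

```python
from lxml import etree

# Hypothetical stand-in for a downloaded page (NOT Baidu's real markup).
SAMPLE_HTML = """
<html>
  <body>
    <form>
      <input id="kw" name="wd" value="">
      <input id="su" type="submit" value="百度一下">
    </form>
  </body>
</html>
"""

# etree.HTML() takes the HTML text and returns the root element of the parsed tree.
tree = etree.HTML(SAMPLE_HTML)

# xpath() on that element returns a list of matches, here the attribute values.
values = tree.xpath('//input[@id="su"]/@value')
print(values)
```

The same two calls are what the rest of the article applies to the text returned by response.read().decode('utf-8').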

First, fetch the page source

import urllib.request

url = 'https://www.baidu.com/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

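# Build a customized request object that carries the headers
request = urllib.request.Request(url, headers=headers)
# Send the request to the server as if from a browser
response = urllib.request.urlopen(request)
# Read the page source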
content = response.read().decode('utf-8')
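The fetch step above can also be wrapped in small functions, which makes it easier to reuse and to inspect the request before it is sent. This is only a sketch: build_request and fetch_html are hypothetical helper names introduced here, and the timeout value and the fallback to UTF-8 when the server declares no charset are assumptions, not something the article specifies.

```python
import urllib.request

# Same User-Agent string as in the article.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    )
}


def build_request(url, headers=None):
    """Hypothetical helper: create the customized Request without sending it."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url, timeout=10):
    """Hypothetical helper: send the request and return the decoded page source."""
    request = build_request(url)
    # The with-block closes the connection even if reading fails.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: use the charset from the Content-Type header, else UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Inspect the request locally; nothing is sent over the network here.
request = build_request("https://www.baidu.com/")
print(request.get_method())
print(request.get_header("User-agent"))
```

Calling fetch_html('https://www.baidu.com/') would then return the same kind of string that the article stores in content. Note that urllib stores header names in capitalized form, which is why the lookup uses "User-agent".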

Press Ctrl+Shift+X to open the XPath browser plugin and check that the expression selects the tag whose data you want.
Then parse the page source to extract the data we want

from lxml import etree
# Parse the server response
tree = etree.HTML(content)
# Extract the data: xpath() returns a list, so index it to get the bare string instead of ['...']
result = tree.xpath('//input[@id="su"]/@value')[0]
print(result)
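Because xpath() returns a list, indexing with [0] raises IndexError when the expression matches nothing, for example if the page layout changes or the server returns a different page. A small guard can avoid that. This is a sketch under assumptions: first_or_default is a hypothetical helper name, the inline HTML is made up for the demonstration, and the helper is only meant for expressions that select nodes or attributes (which return a list).

```python
from lxml import etree


def first_or_default(tree, expression, default=None):
    """Hypothetical helper: return the first XPath match, or default if none."""
    matches = tree.xpath(expression)
    return matches[0] if matches else default


# Made-up markup for the demonstration, not Baidu's real page source.
tree = etree.HTML('<html><body><input id="su" value="百度一下"></body></html>')

found = first_or_default(tree, '//input[@id="su"]/@value')
missing = first_or_default(tree, '//input[@id="nope"]/@value', default='')

print(found)
print(repr(missing))
```

Whether to fall back to a default or to let the error surface depends on the scraper; failing loudly is often preferable when a missing element means the page was not what you expected.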

Full code:

import urllib.request
from lxml import etree

url = 'https://www.baidu.com/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

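# Build a customized request object that carries the headers
request = urllib.request.Request(url, headers=headers)
# Send the request to the server as if from a browser
response = urllib.request.urlopen(request)
# Read the page source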
content = response.read().decode('utf-8')

# Parse the server response to extract the data we want
tree = etree.HTML(content)
# Extract the data: xpath() returns a list, so index it to get the bare string instead of ['...']
result = tree.xpath('//input[@id="su"]/@value')[0]
print(result)