Web Crawler Technical Notes: Scraping Static Web Pages

3sin2x

Last modified: 2023-05-20 10:32:58

First published: 2023-05-18 17:07:21

Article link: https://blog.csdn.net/weixin_68874096/article/details/130729855

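Static web pages

Definition: a page written in plain HTML, with no backend database, no server-side programs, and no interactivity.

How to view the source: right-click the page and choose "View page source".

The <html> and </html> tags delimit the root HTML element.

Inside it are the <head> and <body> elements.

<body> contains elements such as <p>.

A <p> element can in turn contain further nested tags.

The whole document therefore forms a tree.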
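
To make the tree structure concrete, here is a minimal sketch that prints the nesting of tags using Python's built-in html.parser module. The sample markup and the TreePrinter class are illustrative assumptions and are not part of the original notes.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Two spaces of indentation per nesting level
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves one level back up the tree
        self.depth -= 1


# Hypothetical sample page; it has no void tags such as <br>,
# which this simple depth counter does not handle.
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)

printer = TreePrinter()
printer.feed(sample)
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one parent-child step in the tree, which is the structure that the path expressions used later in these notes walk through.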

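Basic crawler workflow (editor: PyCharm; environment: Python)

Sending a request: the requests library

import requests
# Target URL to scrape
url='http://tipdm.com/'
# Set the request headers
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42'}
# Send the HTTP GET request with the headers and a 2-second timeout; if the server does not respond within 2 s, requests raises a Timeout exception
rq=requests.get(url,headers=headers,timeout=2)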
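
Because a timeout or a bad status code would otherwise stop the script, it can help to wrap the request in error handling. The following is a minimal sketch under stated assumptions: the fetch_page helper and the DEFAULT_HEADERS placeholder are hypothetical names introduced here, not part of the original notes.

```python
import requests

# Assumed placeholder; in practice use a full browser User-Agent string
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_page(url, headers=None, timeout=2):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(
            url, headers=headers or DEFAULT_HEADERS, timeout=timeout
        )
        # Turn 4xx/5xx status codes into an HTTPError exception
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class for Timeout, ConnectionError,
        # HTTPError, invalid-URL errors, and the other requests exceptions
        print(f"Request failed: {exc}")
        return None
    return response.text
```

A call such as fetch_page('http://tipdm.com/') would then return the HTML text on success and None on failure, so the caller can decide whether to retry or skip the page.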

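Getting the response content

# Check the status code
rq.status_code
# View the page source (the response body as text)
rq.text
# View the response headers (the request headers are in rq.request.headers)
rq.headers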

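Parsing the content

Method 1:

lxml.etree (parses the content) + XPath (locates nodes)

from lxml import etree

# Read the local sample page
with open('html_doc.html') as f:
    html_doc = f.read()
print(html_doc)

# Parse the HTML source string into an element tree with etree.HTML()
dom=etree.HTML(html_doc)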


# Find the title
# Locate the node with an absolute path
dom.xpath('/html/head/title')
# [<Element title at 0x1eb53997d80>]
# Relative paths (// matches descendants at any depth), two ways
dom.xpath('//title')
# [<Element title at 0x1eb53997d80>]
dom.xpath('/html//title')
# [<Element title at 0x1eb53997d80>]
# Get the text content, three ways
dom.xpath('/html/head/title/text()')
# ["\n            The Dormouse's story\n        "]
dom.xpath('//title/text()')
# ["\n            The Dormouse's story\n        "]
dom.xpath('/html//title/text()')
# ["\n            The Dormouse's story\n        "]


# Find Elsie
# This query returns all three names
dom.xpath('//body/p/a/text()')
# ['\n                Elsie\n            ', '\n                Lacie\n            ', '\n                Tillie\n            ']
# Two ways to pinpoint the target
# By position (XPath indices start at 1)
dom.xpath('//body/p/a[1]/text()')
# ['\n                Elsie\n            ']
# By attribute value
dom.xpath('//body/p/a[@id="link1"]/text()')
# ['\n                Elsie\n            ']
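
The text results above still carry the surrounding whitespace, and links are often needed alongside the names. The sketch below shows one way to handle both with XPath. Since html_doc.html is not included in these notes, the inline html_doc string is an assumed reconstruction of a similar page, not the author's file.

```python
from lxml import etree

# Assumed stand-in for the contents of html_doc.html
html_doc = """
<html><head><title>
    The Dormouse's story
</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"> Elsie </a>
<a href="http://example.com/lacie" class="sister" id="link2"> Lacie </a>
<a href="http://example.com/tillie" class="sister" id="link3"> Tillie </a>
</p>
</body></html>
"""

dom = etree.HTML(html_doc)

# text() returns raw strings, so strip the whitespace in Python
names = [text.strip() for text in dom.xpath('//body/p/a/text()')]

# @href selects the attribute value instead of the element
links = dom.xpath('//body/p/a/@href')

# normalize-space() trims and collapses whitespace inside XPath itself
title = dom.xpath('normalize-space(//title)')

print(names)
print(links)
print(title)
```

Selecting @href directly avoids a second loop over the elements, and normalize-space() is convenient when only a single cleaned string is needed.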

Method 2:

Beautiful Soup (parses the page) + CSS selectors (locates nodes)

# Parse the page with Beautiful Soup; it needs an underlying parser (lxml here)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')
type(soup)
# <class 'bs4.BeautifulSoup'>
# Access a tag directly as an attribute, no path needed; this returns the first matching tag
soup.title
# <title>
#             The Dormouse's story
#         </title>
soup.p
# <p class="title">
# <b>
#                 The Dormouse's story
#             </b>
# </p>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">
#                 Elsie
#             </a>


# Precise targeting
# Absolute path (> selects direct children)
soup.select('html>head>title')
# [<title>
#             The Dormouse's story
#         </title>]

# Relative path (a space selects descendants at any depth)
soup.select('body  a')
# [<a class="sister" href="http://example.com/elsie" id="link1">
#                 Elsie
#             </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#                 Lacie
#             </a>, <a class="sister" href="http://example.com/tillie" id="link2">
#                 Tillie
#             </a>]
# Most precise targeting
# nth-child is a CSS pseudo-class: a:nth-child(1) matches an <a> that is the first child element of its parent
soup.select('p>a:nth-child(1)')
# [<a class="sister" href="http://example.com/elsie" id="link1">
#                 Elsie
#             </a>]
# Select by attribute value
soup.select('p>a[id="link1"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">
#                 Elsie
#             </a>]
# Index into the result list to get the tag's text
soup.select('p>a:nth-child(1)')[0].text
# '\n                Elsie\n            '
# Get the tag's hyperlink (the attribute name is href)
soup.select('p>a:nth-child(1)')[0].get('href')
# 'http://example.com/elsie'
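
The same selectors can be combined to collect every name and link in one pass. This is a minimal sketch, with two assumptions stated plainly: the inline html_doc is a stand-in for the author's html_doc.html, and the built-in 'html.parser' is swapped in for 'lxml' so the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the contents of html_doc.html
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"> Elsie </a>
<a href="http://example.com/lacie" class="sister" id="link2"> Lacie </a>
<a href="http://example.com/tillie" class="sister" id="link3"> Tillie </a>
</p>
</body></html>
"""

# 'html.parser' ships with Python; the notes above use 'lxml' instead
soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's cleaned text to its href attribute
links = {
    a.get_text(strip=True): a.get("href")
    for a in soup.select("p.story > a")
}

# select_one returns the first match (or None) instead of a list
tillie = soup.select_one("a#link3")

# .get returns None for an attribute the tag does not have
missing = tillie.get("title")

print(links)
print(tillie.get_text(strip=True))
print(missing)
```

Because .get returns None rather than raising an error for an absent attribute, a misspelled attribute name fails silently, so it is worth checking the result before using it.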

Saving the data

with open('web_data.txt','w',encoding='utf-8') as f:
    f.write(rq.text)
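
Saving the raw page text works for archiving, but parsed results are usually easier to reuse in a structured file. The following is a rough sketch using the standard csv module; the save_links helper, the sample rows, and the links.csv file name are hypothetical and only illustrate the idea.

```python
import csv
import tempfile
from pathlib import Path


def save_links(rows, path):
    """Write (name, url) pairs to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "url"])
        writer.writerows(rows)


# Hypothetical parsed results, e.g. from the XPath or CSS selector steps
rows = [
    ("Elsie", "http://example.com/elsie"),
    ("Lacie", "http://example.com/lacie"),
    ("Tillie", "http://example.com/tillie"),
]

# Written to the system temp directory so the sketch leaves no stray files
out_path = Path(tempfile.gettempdir()) / "links.csv"
save_links(rows, out_path)

# Read the file back to confirm what was written
with open(out_path, encoding="utf-8", newline="") as f:
    loaded = list(csv.reader(f))
print(loaded)
```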

Additional tools

The detect function in the chardet library estimates the encoding of a given byte string

import chardet
chardet.detect(rq.content)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
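
The detected encoding is typically used to decode the raw bytes into text. Here is a minimal sketch of that step; the decode_bytes helper and its utf-8 fallback are assumptions made for illustration, not part of the original notes.

```python
import chardet


def decode_bytes(raw, fallback="utf-8"):
    """Decode bytes using the encoding chardet guesses, else a fallback."""
    guess = chardet.detect(raw)
    # detect reports None as the encoding when it cannot decide
    encoding = guess["encoding"] or fallback
    # errors="replace" avoids an exception if the guess is slightly off
    return raw.decode(encoding, errors="replace")


print(decode_bytes(b"hello world"))
```

With a requests response, the call would be decode_bytes(rq.content). requests also exposes a similar guess as rq.apparent_encoding, which can be assigned to rq.encoding before reading rq.text.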