Crawling Website URLs with HackRequests + BeautifulSoup + re

Latest recommended article published on 2023-04-17 02:40:04

蒙奇奇

Tags: crawler, python, regular expressions, security

Article link: https://blog.csdn.net/weixin_44174581/article/details/120201597

First, use search-engine query syntax to find the sites you want to crawl.
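Click through the result pages and capture the traffic to inspect the request headers. Capture several requests and load them into a comparer tool to identify the page-number parameter.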
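The comparison step can also be done in a few lines of Python. The following is a minimal sketch, not the tool used in the article: the two URLs are made-up stand-ins for captured requests, and `differing_params` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs


def differing_params(url_a, url_b):
    """Return the query parameters whose values differ between two URLs."""
    qa = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (qa.get(key), qb.get(key))
        for key in sorted(set(qa) | set(qb))
        if qa.get(key) != qb.get(key)
    }


# Hypothetical captured request URLs for two different result pages.
page_a = "http://www.baidu.com/s?wd=example&pn=0"
page_b = "http://www.baidu.com/s?wd=example&pn=10"

print(differing_params(page_a, page_b))
```

Any parameter that changes between pages while the search terms stay fixed is a candidate for the page-number parameter.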

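The page parameter turns out to be pn: it is 0 for the first page, 10 for the second, and 40 for the fifth, so pn increases by 10 per page. With that, we can write the Python script.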
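The observed pattern amounts to pn = (page - 1) × 10. A small sketch of that arithmetic, where `pn_for_page` and `page_size` are illustrative names and not part of the original script:

```python
def pn_for_page(page, page_size=10):
    """Map a 1-based page number to the pn offset observed in the requests."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * page_size


# Offsets for the first five pages.
print([pn_for_page(p) for p in range(1, 6)])
```

Looping over these offsets and substituting each one into the captured request is one way to walk through the result pages.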

import re
from bs4 import BeautifulSoup as BS
import HackRequests as hack

def tomcat(raw):
    hh = hack.httpraw(raw=raw)
    soup = BS(hh.text(), features="html.parser")

    # Match the URLs with a regex; from inspection, they are <a> tags whose href follows this format
    links = soup.findAll(name='a', attrs={'href': re.compile('http://www.baidu.com/link\?u
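The link-filtering idea, selecting only `<a>` tags whose `href` matches a compiled regex, can be sketched in a self-contained way. Several things here are assumptions: the pattern `link\?url=` is a guess at the kind of redirect link being matched, `sample_html` is invented, and `LinkCollector` and `extract_links` are hypothetical names. To run without third-party packages, the sketch uses the standard library's `html.parser` in place of BeautifulSoup.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for search-result redirect links (not taken from the article).
LINK_PATTERN = re.compile(r"^http://www\.baidu\.com/link\?url=")


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match a regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self.pattern.search(value):
                self.links.append(value)


def extract_links(html, pattern=LINK_PATTERN):
    collector = LinkCollector(pattern)
    collector.feed(html)
    collector.close()
    return collector.links


# Invented sample markup standing in for a fetched result page.
sample_html = """
<div>
  <a href="http://www.baidu.com/link?url=abc123">result 1</a>
  <a href="/about">about</a>
  <a href="http://www.baidu.com/link?url=def456">result 2</a>
</div>
"""

print(extract_links(sample_html))
```

With BeautifulSoup, the same compiled pattern would be passed as the `href` filter, as the article's code does.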
