Python3 Crawler: Extracting All the Real URLs from a Requested Page with BeautifulSoup

Latest recommended article published 2023-08-16 14:15:04

lemon_tree1002

Categories: URL redirection, Python. Tags: Python crawler

Article link: https://blog.csdn.net/weixin_39568072/article/details/106998222


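This article shows how to write a Python3 web crawler that uses the BeautifulSoup library to extract all the real URLs from a requested page. It covers two approaches: the find_all function and the CSS selector method select.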


In HTML, <a href='xx'> denotes a hyperlink, so extracting a page's URLs comes down to extracting the 'xx' value of each href attribute.
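To make the idea concrete before hitting a live site, here is a minimal sketch that pulls href values out of a small inline HTML string with both approaches the article names, find_all and select. The HTML snippet and the variable names are made up for illustration, and the built-in 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, only for illustration.
html = """
<div>
  <h3><a href="https://example.com/a">First</a></h3>
  <h3><a href="/relative/b">Second</a></h3>
  <a name="anchor-without-href">No link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all: every <a> tag that actually carries an href attribute
links_find_all = [a["href"] for a in soup.find_all("a", href=True)]

# select: the equivalent CSS attribute selector
links_select = [a["href"] for a in soup.select("a[href]")]

print(links_find_all)
print(links_select)
```

Passing href=True (or using the a[href] selector) skips anchors that have no href at all, which avoids a KeyError or a None value later in the loop.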

Method 1: find_all

import urllib
import requests
from urllib.parse import urlparse
from urllib import request, parse
from bs4 import BeautifulSoup

word = '周杰伦'
# word is the search keyword; pn is the parameter Baidu uses for paging
url = 'http://www.baidu.com.cn/s?wd=' + urllib.parse.quote(word) + '&pn=0'
print(url)
# Get the domain name from the url
res = urlparse(url)
domain = res.netloc
print(domain)
print('- - '*30)

response = request.urlopen(url)
page = response.read()
soup = BeautifulSoup(page, 'lxml')
# tagh3 = soup.find_all('h3')  # returns a list
tagh3 = soup.find_all('a')  # get every <a> tag; returns a list
# Named out_file rather than all, so the built-in all() is not shadowed
out_file = open(r'F:\security\web\output\report\test.txt', 'w+')
hrefs = []
for h3 in tagh3:
    # href = h3.find('a').get('href')
    try:
        href = h3
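The listing breaks off inside the loop, so what the author does with each tag is not shown. As a rough sketch of one plausible next step, assuming each raw value comes from something like tag.get('href'), the helper below turns the raw values into absolute http(s) URLs and drops duplicates. The function name clean_hrefs and the sample data are hypothetical, and only the standard library is used; it does not follow redirects, so it is not a reconstruction of the author's method.

```python
from urllib.parse import urljoin, urlparse


def clean_hrefs(hrefs, base_url):
    """Resolve hrefs against base_url, keep http(s) links, drop duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # tag.get('href') returns None when the attribute is missing
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # skip empty values and same-page fragment links
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip javascript:, mailto: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical raw values, as they might come out of the loop over <a> tags.
raw = ["/link?url=abc", "https://example.com/page", "javascript:;",
       None, "#top", "/link?url=abc"]
base = "http://www.baidu.com.cn/s?wd=test&pn=0"
cleaned = clean_hrefs(raw, base)
for link in cleaned:
    print(link)
```

Keeping the cleanup in a separate function means the same code can be reused whether the tags were collected with find_all or with select.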
