Using Python's urllib together with BeautifulSoup

ZhenY.Yu

Category: python

Article link: https://blog.csdn.net/yzyolala123/article/details/117393839

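This post shows how to combine Python's urllib library with BeautifulSoup to scrape web page content. urllib fetches the page source, and BeautifulSoup then parses the HTML and extracts the information you need, which is enough to build a simple web crawler.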

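import urllib.request, urllib.parse, urllib.error
# Import the BeautifulSoup class from the bs4 package
from bs4 import BeautifulSoup
# Disable SSL certificate verification to avoid certificate errors (copy this block as-is)
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://py4e-data.dr-chuck.net/comments_1205404.html'
# Download the page as bytes; context=ctx applies the SSL settings defined above
b = urllib.request.urlopen(url, context=ctx).read()
# Instantiate BeautifulSoup; 'html.parser' selects Python's built-in HTML parser
soup = BeautifulSoup(b, 'html.parser')
# soup('span') returns every <span> tag; to find which tag to search for,
# right-click the page in a browser and inspect its source
tags = soup('span')
for i in tags:
    print(i)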
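
The loop above prints each tag with its markup included. As a minimal sketch of the usual next step, the snippet below pulls only the text out of each tag. It runs on a small inline HTML string rather than the live page; the sample markup and the helper name span_texts are assumptions made for illustration, not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td>Alice</td><td><span class="comments">97</span></td></tr>
  <tr><td>Bob</td><td><span class="comments">3</span></td></tr>
</table>
"""


def span_texts(html):
    """Return the stripped text content of every <span> in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # Calling soup('span') is shorthand for soup.find_all('span').
    return [tag.get_text(strip=True) for tag in soup('span')]


texts = span_texts(SAMPLE_HTML)
print(texts)  # ['97', '3']
```

The same function should work on the bytes returned by urlopen(...).read(), since BeautifulSoup accepts both str and bytes input.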

import urllib.request,urllib.parse,urllib.error
from bs4 import BeautifulSoup
import ssl
import re

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url='https://movie.douban.com/top250'
# Send browser-like request headers so the request looks like it comes from a browser
headers={'User-Agent&#
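
A minimal sketch of sending a browser-like User-Agent header with urllib.request.Request. The header value 'Mozilla/5.0' is a placeholder, and the helper names build_request and fetch are hypothetical; none of them are taken from the original script. The snippet only builds the request object and does not contact the network unless fetch is called.

```python
import urllib.request


def build_request(url, user_agent='Mozilla/5.0'):
    """Build a Request that carries a browser-like User-Agent header.

    The default user_agent is a placeholder; substitute the string your
    own browser sends if the site rejects this one.
    """
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Download the page body as bytes (performs a real network request)."""
    req = build_request(url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


req = build_request('https://movie.douban.com/top250')
# Request stores header names capitalized as 'User-agent'.
print(req.get_header('User-agent'))  # Mozilla/5.0
```

Calling fetch('https://movie.douban.com/top250') would return the raw HTML, which can then be passed to BeautifulSoup(html, 'html.parser') as in the first script.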