Scraping News Titles and Content (11.24)

Latest recommended article published on 2024-08-03 15:58:00

花开未满

Original link: https://blog.csdn.net/holiday0/article/details/103327397

1. Importing the libraries

from urllib.request import urlopen
from bs4 import BeautifulSoup  
from urllib import parse
import requests

2. Scraping the news titles

html = urlopen("http://xgxy.hbue.edu.cn/")  # open the page to scrape
bs = BeautifulSoup(html, 'html.parser')  # parse the page with BeautifulSoup

p1 = bs.findAll('div', {'class': 'news_tit'})  # find the tags that hold the news titles

for each in p1:
    titles = each.select('a')[0]['title']  # the title attribute of the first <a> tag
    print(titles)
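
The title lookup can also be tried offline against a small HTML string, which makes it easier to see what each call returns. This is a minimal sketch, assuming markup shaped like the page's news_tit blocks. The sample HTML and the helper name extract_titles are made up for illustration, and find_all is used as the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<div class="news_tit"><a href="/2019/1124/page.htm" title="First headline">First...</a></div>
<div class="news_tit"><a href="/2019/1123/page.htm" title="Second headline">Second...</a></div>
<div class="other"><a href="/about.htm" title="About">About</a></div>
"""

def extract_titles(html):
    """Return the title attribute of the first <a> in every div.news_tit."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for block in soup.find_all("div", {"class": "news_tit"}):
        link = block.find("a")
        # Skip blocks that have no link or no title attribute.
        if link is not None and link.has_attr("title"):
            titles.append(link["title"])
    return titles

titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Checking for a missing link or attribute before indexing avoids the IndexError or KeyError that `each.select('a')[0]['title']` would raise on an unexpected block.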

3. Scraping the news content

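page_url = "http://xgxy.hbue.edu.cn/"
news_full_urls = []

p1 = bs.findAll('div', {'class': 'news_tit'})
# collect the links to the news articles
for each in p1:
    news_url = each.select('a')[0]['href']
    new_full_url = parse.urljoin(page_url, news_url)  # join the relative path into an absolute URL
    news_full_urls.append(new_full_url)

for url in news_full_urls:
    result1 = requests.get(url)  # fetch the news page
    result1.encoding = 'utf-8'  # treat the page as UTF-8 so Chinese text displays correctly
    content1 = result1.content
    # parse the page; bs4 spells this argument from_encoding, not fromEncoding
    soup1 = BeautifulSoup(content1, 'html.parser', from_encoding=result1.encoding)
    main_article = soup1.find('div', {"class": "wp_articlecontent"})  # the tag that holds the news body
    if main_article is not None:  # skip pages that lack this container
        print(main_article.text)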
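
When a page has no wp_articlecontent container, find returns None and reading .text from it raises AttributeError. The sketch below shows one way to make that case explicit. The helper name article_text and both HTML snippets are hypothetical, chosen only to exercise the two branches.

```python
from bs4 import BeautifulSoup

def article_text(html):
    """Return the article body as plain text, or None if the container is absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "wp_articlecontent"})
    if container is None:
        return None
    # Strip each text fragment and join the fragments with newlines.
    return container.get_text(separator="\n", strip=True)

# Hypothetical snippets standing in for downloaded pages.
with_body = '<div class="wp_articlecontent"><p>Line one.</p><p>Line two.</p></div>'
without_body = '<div class="sidebar">No article here</div>'

print(article_text(with_body))
print(article_text(without_body))
```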

Additional notes

parse.urljoin: joins a base URL and a relative URL into an absolute URL, resolving the relative part against the base.
url=parse.urljoin(old_url, new_url)
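
A few cases show how urljoin resolves different kinds of relative address. This is a small sketch; the base path and the relative paths are invented for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# Hypothetical base URL; only the host comes from the article.
base = "http://xgxy.hbue.edu.cn/news/index.htm"

sibling = urljoin(base, "page2.htm")               # relative to the base's directory
rooted = urljoin(base, "/2019/1124/page.htm")      # leading slash: relative to the host root
parent = urljoin(base, "../about.htm")             # ".." climbs one directory
absolute = urljoin(base, "https://example.com/x")  # an absolute URL replaces the base

for result in (sibling, rooted, parent, absolute):
    print(result)
```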
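findAll: the older spelling of find_all in Beautiful Soup 4, with the same behavior. It returns every tag in the document that matches the criteria, or an empty list when nothing matches. find behaves like findAll with limit=1, except that it returns the tag itself (or None when nothing matches) rather than a list.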
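
The difference in return types is easy to check on a two-paragraph snippet. A minimal sketch, using a made-up HTML string and the find_all spelling:

```python
from bs4 import BeautifulSoup

# Made-up snippet with two <p> tags and no <table>.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first = soup.find("p")                 # a single Tag
limited = soup.find_all("p", limit=1)  # a list holding one Tag
missing_one = soup.find("table")       # None when nothing matches
missing_all = soup.find_all("table")   # empty list when nothing matches

print(first.text, len(limited), missing_one, len(missing_all))
```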
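The selenium library can also be used to scrape data. It drives a browser to load the site automatically and can even take screenshots of the page, but it does not ship with a browser of its own and has to be paired with a third-party browser to run.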