On Baidu's hot-news page, the first 6 headlines are scraped from under strong > a. The next 30 items, as well as each of the later sub-sections (domestic, international, local, entertainment, sports, and so on), are matched by the value of the mon attribute on the a tag: c= is the section name and pn= is the item's position within that section. Each section shows 12 items (the local-news section shows 8); a look at the page source confirms this.
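Since the mon value is just a URL query string, its structure can be made explicit by decoding it with the standard library. A small sketch (the mon string follows the pattern described above; the decoding step itself is only illustrative, the crawler below matches the raw string directly):

```python
from urllib.parse import parse_qs

# A mon value as it appears on the page: section "top", 5th item
mon = "ct=1&a=2&c=top&pn=5"
params = parse_qs(mon)

# c= is the section name, pn= is the position within the section
print(params['c'][0], params['pn'][0])  # → top 5
```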
The complete code is as follows:
import requests
from bs4 import BeautifulSoup
import time

url = 'http://news.baidu.com/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')

print('Baidu News - Python crawler')
print('Top headlines')
# The first 6 headlines sit under strong > a
sel_a = soup.select('strong a')
for i in range(6):
    print(sel_a[i].get_text())
    print(sel_a[i].get('href'))

print('Hot news')
titles_b = []
titlew = ""
# The next 30 items are matched by the mon attribute: section c=top, positions pn=1..30
for i in range(1, 31):
    sel_b = soup.find_all('a', mon="ct=1&a=2&c=top&pn=" + str(i))
    titles_b.append(sel_b[0])
for i in range(30):
    print(titles_b[i].get_text())
    print(titles_b[i].get('href'))
    titlew = titlew + titles_b[i].get_text() + "\n"

# Get the current date
now = time.strftime('%Y-%m-%d', time.localtime(time.time()))
# Append to a date-stamped output file
with open('news' + now + '.txt', 'a', encoding='utf-8') as file:
    file.write(titlew)  # titles only
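The output file is named from time.strftime, so each day's run appends to its own file. A quick sketch of what that produces (the filename variable is just for illustration):

```python
import time

# Same format string as in the crawler: year-month-day
now = time.strftime('%Y-%m-%d', time.localtime(time.time()))
filename = 'news' + now + '.txt'

# Produces names like news2019-05-01.txt; opening with mode 'a'
# means repeated runs on the same day accumulate in one file.
print(filename)
```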
While you are still experimenting, you can download the page to a local file and debug against that instead of hitting the site each time:
with open('local file path', encoding='utf-8') as f:
    # print(f.read())
    soup = BeautifulSoup(f, 'lxml')
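To round out the local-debugging tip, here is a self-contained sketch of the full round trip. The file name and HTML snippet are made up for illustration, and 'html.parser' is used in place of 'lxml' so no extra parser install is needed:

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Stand-in for a saved copy of the page (written on the fly here)
sample = '<html><body><strong><a href="http://example.com/top">Top story</a></strong></body></html>'
path = os.path.join(tempfile.gettempdir(), 'baidu_news_sample.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Same parsing code as against the live page, but fed from the local file
with open(path, encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup.select('strong a')[0].get_text())  # → Top story
```

Once the selectors behave on the local copy, point the same code back at requests.get(url).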