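This post introduces a hot-search board I wrote in Python. It aggregates trending topics from Baidu, Toutiao, Weibo, Zhihu, and CSDN. The tool runs in a terminal such as cmder, PowerShell, or Git Bash, which makes it a must-have for slacking off at work.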
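1. The Tool in Action
1.1 Project code
The project is hosted on Gitee at https://gitee.com/shawn_chen_rtz/hot_billboard.git. Stars are welcome.
Code structure: app.py is the startup file, and each site has its own module (baidu_hot, toutiao_hot, weibo_hot, zhihu_hot, csdn_hot).
Run python app.py and follow the prompts.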
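1.2 What it looks like
On startup the tool prints a menu.
Enter a number to open the corresponding site's hot-search list, or enter q or Q to quit.
For example, entering 3 opens the Weibo hot-search list.
Once a list has been printed, enter an item's number to get its link.
The CSDN hot list is displayed the same way.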
1.3 The app.py startup file
The app.py program:
# -*- coding:utf-8 -*-
from baidu_hot import get_baidu_hot
from toutiao_hot import get_toutiao_hot
from weibo_hot import get_weibo_hot
from zhihu_hot import get_zhihu_hot
from csdn_hot import get_csdn_hot
import time

print("Welcome back! Enter a number to browse a hot list.")
on = True
while on:
    user_input = input("1-baidu; 2-toutiao; 3-weibo; 4-zhihu; 5-CSDN; q/Q-quit; your choice: ")
    if user_input == '1':
        get_baidu_hot()
    elif user_input == '2':
        get_toutiao_hot()
    elif user_input == '3':
        get_weibo_hot()
    elif user_input == '4':
        get_zhihu_hot()
    elif user_input == '5':
        get_csdn_hot()
    elif user_input == 'q' or user_input == 'Q':
        on = False
    else:
        print("Invalid input. Refreshing in 3s, then choose again.")
        time.sleep(3)
print("Exited. See you next time!")
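The program is a single while loop. Each iteration reads the user's input, checks it against the menu options, and calls the matching function. Entering q or Q ends the loop.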
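The if/elif chain can also be written as a lookup table. The following is only a sketch of that alternative, not the project's code: run_menu, the read/write parameters, and the stand-in handlers are names made up here so the loop can be exercised without a terminal or network access. Unlike the original, it strips whitespace from the input and does not sleep after invalid input.

```python
def run_menu(handlers, read=input, write=print):
    """Loop until the user enters q/Q, calling the handler mapped to each valid choice."""
    while True:
        choice = read("1-baidu; 2-toutiao; 3-weibo; 4-zhihu; 5-CSDN; q/Q-quit; your choice: ").strip()
        if choice.lower() == "q":
            break
        handler = handlers.get(choice)
        if handler is None:
            write("Invalid input, choose again.")
            continue
        handler()
    write("Exited.")


# Stand-in handlers (hypothetical) that only record which site was picked.
calls = []
handlers = {
    "1": lambda: calls.append("baidu"),
    "3": lambda: calls.append("weibo"),
}

# Scripted input replaces the keyboard: pick 3, type garbage, pick 1, then quit.
scripted = iter(["3", "x", "1", "Q"])
run_menu(handlers, read=lambda prompt: next(scripted), write=lambda message: None)
print(calls)
```

In the real tool the dictionary would map "1" through "5" to get_baidu_hot, get_toutiao_hot, and so on, so adding a site means adding one entry rather than another elif branch.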
2. Baidu Hot Search
2.1 Modules used
The Baidu hot-search function imports requests, BeautifulSoup, re, and time.
2.2 Endpoint
Baidu hot-search URL:
https://top.baidu.com/board?tab=realtime
2.3 Implementation
The implementation:
import requests
from bs4 import BeautifulSoup
import re  # imported in the original; not used below
import time


def get_baidu_hot():
    while True:
        baidu_top = "https://top.baidu.com/board?tab=realtime"
        resp = requests.get(baidu_top)
        resp.encoding = 'utf-8'
        html = resp.text
        soup = BeautifulSoup(html, 'html.parser')
        # Each hot-search entry sits in an element with this class
        news = soup.find_all(class_="content_1YWBm")
        # Reverse the page order so the first entry on the page is printed last
        news.reverse()
        i = 0
        news_ls = []
        for new in news:
            i = i + 1
            url = new.find('a').attrs['href']
            text = new.find(class_="c-single-text-ellipsis").text
            news_ls.append({"text": text.strip(), "url": url})
            print(('\033[1;37m' + str(i) + '\033[0m').center(50, "*"))
            print("\033[1;36m" + text.strip() + "\033[0m")
        # news_ls.reverse()
        user_input = input("Enter an item number to get its link, q/Q to quit, r/R to refresh: ")
        if user_input == 'q' or user_input == 'Q':
            break
        elif user_input == 'r' or user_input == 'R':
            continue
        elif user_input in [str(i) for i in range(1, len(news_ls) + 1)]:
            # The input is already known to be a number in range, so int() replaces eval()
            news_index = int(user_input) - 1
            print(news_ls[news_index].get('url'))
            print("\033[1;33m" + "Hold Ctrl and click the link to open it" + "\033[0m")
            print('\033[5;31m' + 'The list refreshes automatically in 10s' + '\033[0m')
            time.sleep(10)
            continue
        else:
            print("Invalid User Input.")
            print('\033[5;31m' + "The list refreshes automatically in 3s" + '\033[0m')
            time.sleep(3)
            continue
    print("Over, leaving Baidu hot search!")
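Note that how you use BeautifulSoup depends on the structure of the page the endpoint returns. The class names used above (content_1YWBm and c-single-text-ellipsis) come from the page's markup and would need updating if that markup changes.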
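To show what "depends on the page structure" means without hitting the live page, here is a rough sketch that uses only the standard library's html.parser as a stand-in for BeautifulSoup (it is not what the tool uses). The HTML fragment is hand-written and only mimics the two class names from the code above; HotItemParser, SAMPLE_HTML, and the example.com links are made up for illustration.

```python
from html.parser import HTMLParser

# Hand-written fragment that imitates the structure the Baidu code expects.
SAMPLE_HTML = """
<div class="content_1YWBm">
  <a href="https://example.com/item-1">
    <div class="c-single-text-ellipsis"> First headline </div>
  </a>
</div>
<div class="content_1YWBm">
  <a href="https://example.com/item-2">
    <div class="c-single-text-ellipsis"> Second headline </div>
  </a>
</div>
"""


class HotItemParser(HTMLParser):
    """Collect {'text', 'url'} for every element with class content_1YWBm.

    Simplification: assumes every tag inside an item has a closing tag
    (no void elements such as <img>), which holds for the sample above.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0           # nesting depth inside the current item, 0 = outside
        self._title_depth = None  # depth of the title element while it is open
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._depth == 0:
            if "content_1YWBm" in classes:
                self._depth = 1
                self._current = {"text": "", "url": None}
            return
        self._depth += 1
        if tag == "a" and self._current["url"] is None:
            self._current["url"] = attrs.get("href")  # first link inside the item
        if "c-single-text-ellipsis" in classes and self._title_depth is None:
            self._title_depth = self._depth

    def handle_data(self, data):
        if self._title_depth is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._depth == 0:
            return
        if self._title_depth == self._depth:
            self._title_depth = None
        self._depth -= 1
        if self._depth == 0:  # the item container itself just closed
            self._current["text"] = self._current["text"].strip()
            self.items.append(self._current)
            self._current = None


parser = HotItemParser()
parser.feed(SAMPLE_HTML)
for item in parser.items:
    print(item["text"], item["url"])
```

If either class name in the fragment is renamed, the parser finds nothing, which is the same failure the real scraper would hit after a page redesign.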
3. Toutiao Hot Search
3.1 Modules used
The Toutiao hot-search function imports requests and time.
3.2 Endpoint
Toutiao hot-search URL:
https://www.toutiao.com/hot-event/hot-board/?origin=toutiao_pc
3.3 Implementation
The implementation:
import requests
import time


def get_toutiao_hot():
    while True:
        url = "https://www.toutiao.com/hot-event/hot-board/?origin=toutiao_pc"
        resp = requests.get(url)
        resp.encoding = 'utf-8'
        resp = resp.json()
        news_ls = []
        i = 0
        news = resp.get('data')
        news.reverse()
        for new in news:
            i += 1
            print(('\033[1;37m' + str(i) + '\033[0m').center(50, '*'))
            news_ls.append({'title': new.get('Title'), 'url': new.get('Url')})
            print('\033[1;36m' + new.get('Title') + '\033[0m')
        # The pinned item comes in a separate field and is appended after the regular list
        fixed_top_data = resp.get('fixed_top_data')
        fixed_top_data = fixed_top_data[0]
        news_ls.append({'title': fixed_top_data.get('Title'), 'url': fixed_top_data.get('Url')})
        print(('\033[1;37m' + str(i + 1) + '\033[0m').center(50, '*'))
        print('\033[1;36m' + news_ls[-1].get('title') + '\033[0m')
        user_input = input("Enter an item number to get its link, q/Q to quit, r/R to refresh: ")
        if user_input == 'q' or user_input == 'Q':
            break
        elif user_input == 'r' or user_input == 'R':
            continue
        elif user_input in [str(i) for i in range(1, len(news_ls) + 1)]:
            # The input is already known to be a number in range, so int() replaces eval()
            news_index = int(user_input) - 1
            print(news_ls[news_index].get('url'))
            print("\033[1;33m" + "Hold Ctrl and click the link to open it" + "\033[0m")
            print('\033[5;31m' + 'The list refreshes automatically in 10s' + '\033[0m')
            time.sleep(10)
            continue
        else:
            print("Invalid User Input.")
            print('\033[5;31m' + "The list refreshes automatically in 3s" + '\033[0m')
            time.sleep(3)
            continue
    print("Over, leaving Toutiao hot search!")
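Unlike Baidu hot search, this endpoint returns JSON rather than HTML source, so there is no need for BeautifulSoup or re to parse and match page elements. Handling the response is comparatively simple.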
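The JSON handling can be tried offline. The sketch below uses a hand-written payload that only mirrors the keys the code above reads (data, fixed_top_data, Title, Url); the headlines, example.com links, and the extract_items helper are made up, and the real response may contain other fields. It also tolerates a missing or empty fixed_top_data, which the original code does not.

```python
import json

# Hand-written payload shaped like the fields the Toutiao code reads.
SAMPLE = json.loads("""
{
  "data": [
    {"Title": "Headline A", "Url": "https://example.com/a"},
    {"Title": "Headline B", "Url": "https://example.com/b"}
  ],
  "fixed_top_data": [
    {"Title": "Pinned headline", "Url": "https://example.com/pinned"}
  ]
}
""")


def extract_items(payload):
    """Return [{'title', 'url'}, ...]: the regular list reversed, then the pinned item."""
    regular = payload.get("data") or []
    items = [{"title": n.get("Title"), "url": n.get("Url")} for n in reversed(regular)]
    pinned = payload.get("fixed_top_data") or []
    if pinned:  # skip quietly when the field is absent or empty
        items.append({"title": pinned[0].get("Title"), "url": pinned[0].get("Url")})
    return items


items = extract_items(SAMPLE)
for number, item in enumerate(items, start=1):
    print(number, item["title"])
```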
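4. Weibo Hot Search
4.1 Modules used
The Weibo hot-search function imports requests, time, and BeautifulSoup.
4.2 Endpoint
Weibo hot-search URL:
https://s.weibo.com/top/summary?cate=realtimehot
Note that requests to this endpoint must include a request header carrying the appropriate cookie; otherwise the request does not return the expected page.
The cookie in this section's code is a made-up placeholder. To get a real one, open https://s.weibo.com/top/summary?cate=realtimehot in a browser, press F12, find that request in the developer tools, and copy the Cookie value from its request headers.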
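Rather than pasting the cookie into the source file, it could be read from the environment. This is just a sketch of one way to do that: build_weibo_headers and the WEIBO_COOKIE variable name are invented here and are not part of the project.

```python
import os


def build_weibo_headers(cookie=None):
    """Return request headers carrying the Weibo cookie.

    The cookie comes from the argument or, failing that, from the
    WEIBO_COOKIE environment variable (a hypothetical name).
    """
    cookie = cookie or os.environ.get("WEIBO_COOKIE", "")
    cookie = cookie.strip()
    if not cookie:
        raise ValueError("No cookie given; copy one from the browser's request headers")
    return {"Cookie": cookie}


# Placeholder value, not a working cookie.
print(build_weibo_headers("SUB=placeholder;"))
```

The returned dictionary can be passed as the headers argument of requests.get, exactly where the hard-coded headers dictionary is used in the code.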
4.3 Implementation
The implementation:
import requests
import time
from bs4 import BeautifulSoup


def get_weibo_hot():
    while True:
        url = "https://s.weibo.com/top/summary?cate=realtimehot"
        # Placeholder cookie; replace it with the value copied from your own browser
        headers = {"Cookie": "SUB=_2AxxxxxxxxxNxqwJxxx3dtWXlM5SjftExkMQK6NASTHqZWXWFEB;"}
        resp = requests.get(url=url, headers=headers)
        resp.encoding = 'utf-8'
        html = resp.text
        soup = BeautifulSoup(html, 'html.parser')
        news = soup.find_all(class_='td-02')
        news.reverse()
        base_url = "https://s.weibo.com"
        news_ls = []
        i = 0
        for new in news:
            i = i + 1
            # The href on the page is relative, so prefix it with the site root
            url = base_url + new.find('a').attrs['href']
            # print(url)
            title = new.find('a').text
            print(('\033[1;37m' + str(i) + '\033[0m').center(50, '*'))
            print('\033[1;36m' + title + '\033[0m')
            news_ls.append({"title": title, "url": url})
        news_length = len(news_ls)
        # news_ls.reverse()
        user_input = input("Enter an item number to get its link, q/Q to quit, r/R to refresh: ")
        if user_input == 'q' or user_input == 'Q':
            break
        elif user_input == 'r' or user_input == 'R':
            continue
        elif user_input in [str(i) for i in range(1, news_length + 1)]:
            # The input is already known to be a number in range, so int() replaces eval()
            news_index = int(user_input) - 1
            print(news_ls[news_index].get('url'))
            print("\033[1;33m" + "Hold Ctrl and click the link to open it" + "\033[0m")
            print('\033[5;31m' + 'The list refreshes automatically in 10s' + '\033[0m')
            time.sleep(10)
            continue
        else:
            print("Invalid User Input.")
            print('\033[5;31m' + "The list refreshes automatically in 3s" + '\033[0m')
            time.sleep(3)
            continue
    print("Over, leaving Weibo hot search!")
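As with Baidu hot search, the response is HTML, so BeautifulSoup is needed to locate the hot-search entries in it. BeautifulSoup does a lot of the work in web-scraping data processing and is worth getting to know well.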
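One detail in the Weibo code is how each link is built: base_url is concatenated with the href taken from the page. A possible alternative, sketched here with the standard library's urljoin, also copes with an href that is already absolute. The absolute_link helper and both sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://s.weibo.com"


def absolute_link(href):
    """Resolve an href from the page against the site root."""
    return urljoin(BASE_URL, href)


# A relative href (illustrative value) gets the site root prepended.
print(absolute_link("/weibo?q=example"))
# An href that is already absolute is returned unchanged.
print(absolute_link("https://example.com/page"))
```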
5. Zhihu Hot Search
5.1 Modules used
The Zhihu hot-search function imports requests, time, BeautifulSoup, and json.
5.2 Endpoint
Zhihu hot-search URL:
https://www.zhihu.com/billboard
5.3 Implementation
The implementation:
import requests
import time
from bs4 import BeautifulSoup
import json


def get_zhihu_hot():
    while True:
        url = "https://www.zhihu.com/billboard"
        resp = requests.get(url)
        resp.encoding = 'utf-8'
        html = resp.text
        soup = BeautifulSoup(html, 'html.parser')
        # Titles come from the rendered HTML
        news = soup.find_all(class_='HotList-itemTitle')
        # print(len(news))
        news_ls = []
        title_ls = []
        for new in news:
            title = new.text
            # print(title)
            title_ls.append(title)
        # Links come from the JSON embedded in the page's js-initialData script tag
        js_text_dict = json.loads(soup.find('script', {'id': 'js-initialData'}).get_text())
        # print(js_text_dict['initialState']['topstory']['hotList'])
        js_text_dict = js_text_dict['initialState']['topstory']['hotList']
        url_ls = []
        for new in js_text_dict:
            url = new['target']['link']['url']
            url_ls.append(url)
        # Pair titles and links by position
        news_ls = [{'title': title_ls[i], 'url': url_ls[i]} for i in range(len(title_ls))]
        news_ls.reverse()
        # print(news_ls)
        i = 0
        for new in news_ls:
            i += 1
            print(('\033[1;37m' + str(i) + '\033[0m').center(50, "*"))
            print('\033[1;36m' + new.get('title') + '\033[0m')
        news_length = len(news_ls)
        # news_ls.reverse()
        user_input = input("Enter an item number to get its link, q/Q to quit, r/R to refresh: ")
        if user_input == 'q' or user_input == 'Q':
            break
        elif user_input == 'r' or user_input == 'R':
            continue
        elif user_input in [str(i) for i in range(1, news_length + 1)]:
            # The input is already known to be a number in range, so int() replaces eval()
            news_index = int(user_input) - 1
            print(news_ls[news_index].get('url'))
            print("\033[1;33m" + "Hold Ctrl and click the link to open it" + "\033[0m")
            print('\033[5;31m' + 'The list refreshes automatically in 10s' + '\033[0m')
            time.sleep(10)
            continue
        else:
            print("Invalid User Input.")
            print('\033[5;31m' + "The list refreshes automatically in 3s" + '\033[0m')
            time.sleep(3)
            continue
    print("Over, leaving Zhihu hot search!")
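The Zhihu function takes titles from the HTML and links from JSON embedded in a script tag, then pairs them by position. The sketch below replays that idea on a hand-written page using only the standard library. A regular expression stands in for BeautifulSoup's lookup of the script tag, the sample mirrors only the keys the code above reads, and extract_hot_urls, pair_titles_with_urls, and the example.com links are made up. Pairing with zip is a suggestion of mine: it stops at the shorter list, so a mismatch in counts cannot raise IndexError.

```python
import json
import re

# Hand-written page that imitates the parts of the Zhihu page the code relies on.
SAMPLE_HTML = """
<html><body>
<div class="HotList-itemTitle">First question</div>
<div class="HotList-itemTitle">Second question</div>
<script id="js-initialData" type="text/json">
{"initialState": {"topstory": {"hotList": [
  {"target": {"link": {"url": "https://example.com/q/1"}}},
  {"target": {"link": {"url": "https://example.com/q/2"}}}
]}}}
</script>
</body></html>
"""


def extract_hot_urls(html):
    """Pull the link of every hot-list entry out of the embedded JSON."""
    match = re.search(r'<script id="js-initialData"[^>]*>(.*?)</script>', html, re.DOTALL)
    if match is None:
        return []
    state = json.loads(match.group(1))
    hot_list = state["initialState"]["topstory"]["hotList"]
    return [entry["target"]["link"]["url"] for entry in hot_list]


def pair_titles_with_urls(titles, urls):
    """Pair by position; zip drops any unmatched tail instead of raising."""
    return [{"title": t, "url": u} for t, u in zip(titles, urls)]


urls = extract_hot_urls(SAMPLE_HTML)
pairs = pair_titles_with_urls(["First question", "Second question"], urls)
for pair in pairs:
    print(pair["title"], pair["url"])
```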
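6. CSDN Hot Search
6.1 Modules used
The CSDN hot-search function imports requests and time.
6.2 Endpoint
CSDN hot-search URLs:
https://blog.csdn.net/phoenix/web/blog/hot-rank?page=0&pageSize=50
https://blog.csdn.net/phoenix/web/blog/hot-rank?page=1&pageSize=50
Note that this endpoint returns a lot of data and is paginated through the page and pageSize parameters; set page to the page number you want, for example 0 or 1. Requests to this endpoint also need a request header (User-Agent); without it the endpoint does not return the correct data.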
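The paginated URLs can be generated instead of written out by hand. A small sketch, assuming the two-page, 50-per-page setup shown above; page_urls is a helper name invented for this example.

```python
from urllib.parse import urlencode

BASE = "https://blog.csdn.net/phoenix/web/blog/hot-rank"


def page_urls(pages=2, page_size=50):
    """Build one URL per page, starting at page 0."""
    return [
        f"{BASE}?{urlencode({'page': page, 'pageSize': page_size})}"
        for page in range(pages)
    ]


for url in page_urls():
    print(url)
```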
6.3 Implementation
The implementation:
import requests
import time


def get_csdn_hot():
    while True:
        news_ls = []
        # Fetch pages 0 and 1, 50 items each
        for i in range(2):
            url = "https://blog.csdn.net/phoenix/web/blog/hot-rank?page=" + str(i) + "&pageSize=50"
            # print(url)
            # CSDN validates requests; the User-Agent header must be set for content to be returned
            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
            resp = requests.get(url, headers=headers)
            resp = resp.json()
            news = resp['data']
            for new in news:
                news_ls.append({"title": new.get('articleTitle'), "url": new.get('articleDetailUrl')})
        i = 0
        news_ls.reverse()
        for new in news_ls:
            i += 1
            print(("\033[1;37m" + str(i) + "\033[0m").center(50, "*"))
            print("\033[1;36m" + new.get('title') + "\033[0m")
        news_length = len(news_ls)
        # news_ls.reverse()
        user_input = input("Enter an item number to get its link, q/Q to quit, r/R to refresh: ")
        if user_input == 'q' or user_input == 'Q':
            break
        elif user_input == 'r' or user_input == 'R':
            continue
        elif user_input in [str(i) for i in range(1, news_length + 1)]:
            # The input is already known to be a number in range, so int() replaces eval()
            news_index = int(user_input) - 1
            print(news_ls[news_index].get('url'))
            print("\033[1;33m" + "Hold Ctrl and click the link to open it" + "\033[0m")
            print('\033[5;31m' + 'The list refreshes automatically in 10s' + '\033[0m')
            time.sleep(10)
            continue
        else:
            print("Invalid User Input.")
            print('\033[5;31m' + "The list refreshes automatically in 3s" + '\033[0m')
            time.sleep(3)
            continue
    print("Over, leaving CSDN hot search!")
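All five functions end with the same handling of the user's selection. As a sketch of how that step might be factored out, the helper below classifies the raw input in one place. parse_selection and its return values are made up for this example; it differs slightly from the original in that it strips surrounding whitespace and accepts leading zeros.

```python
def parse_selection(user_input, item_count):
    """Classify raw input as ('quit', None), ('refresh', None),
    ('item', zero_based_index) or ('invalid', None)."""
    text = user_input.strip()
    if text in ("q", "Q"):
        return ("quit", None)
    if text in ("r", "R"):
        return ("refresh", None)
    # Digits only, and within 1..item_count
    if text.isdecimal() and 1 <= int(text) <= item_count:
        return ("item", int(text) - 1)
    return ("invalid", None)


print(parse_selection("3", 5))
print(parse_selection("R", 5))
print(parse_selection("9", 5))
```

Each get_*_hot function could then branch on the first element of the result instead of repeating the comparison chain.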