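Implementation plan, combining multi-platform scraping techniques with anti-scraping countermeasures:
I. General scraping framework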
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from fake_useragent import UserAgent
ua = UserAgent()
headers = {'User-Agent': ua.random}
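The setup above only prepares a randomized User-Agent header. As a minimal sketch of how it could be used in a polite fetch routine, the helper below retries failed requests with an increasing delay. The names fetch_html and build_headers, the small USER_AGENTS pool (standing in for fake_useragent), and the retry and delay values are all assumptions made for illustration, not part of the original plan.

```python
import random
import time

# Hypothetical fallback pool; the original setup uses fake_useragent's ua.random instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def build_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def fetch_html(url, get=None, retries=3, base_delay=1.0, sleep=time.sleep):
    """Fetch a page, retrying with exponential backoff on errors or non-200 responses.

    `get` and `sleep` are injectable so the logic can be exercised without network access.
    """
    if get is None:
        import requests  # imported lazily; only needed for real network calls
        get = requests.get

    last_error = None
    for attempt in range(retries):
        try:
            resp = get(url, headers=build_headers(), timeout=10)
            if resp.status_code == 200:
                return resp.text
            last_error = RuntimeError(f"HTTP {resp.status_code}")
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
        # Wait before the next attempt: base_delay, then 2x, 4x, ...
        if attempt < retries - 1:
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"failed to fetch {url}") from last_error
```

A call such as fetch_html("https://s.weibo.com/top/summary?cate=realtimehot") would return the page HTML as a string, which can then be passed to BeautifulSoup. Backing off between retries keeps the request rate low, which is usually the first line of defense against being blocked.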
II. Implementations for major platforms
1. Weibo hot search (real-time + trends)
Endpoint: https://s.weibo.com/top/summary?cate=realtimehot
Key parameters:
cate: list type (realtimehot = real-time hot list, total = overall list)
key: category identifier (person = celebrity list, films = film and TV list)
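To make the parameter list concrete, here is a small sketch of how the query string for the endpoint could be assembled. The helper name build_hot_url is hypothetical, and sending key alongside cate in the same query string is an assumption; the source only lists the two parameters and their values.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.weibo.com/top/summary"


def build_hot_url(cate="realtimehot", key=None):
    """Build the hot-search URL from the parameters listed above.

    `key` is appended only when given; combining it with `cate` is an assumption.
    """
    params = {"cate": cate}
    if key is not None:
        params["key"] = key
    return f"{BASE_URL}?{urlencode(params)}"


print(build_hot_url())         # real-time hot list
print(build_hot_url("total"))  # overall list
```

Using urlencode rather than string concatenation keeps the URL valid if a parameter value ever contains characters that need escaping.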
Code example:
def get_weibo_hot(cate='realtimehot'):
    url = f"https://s.weibo.com