Scraping list data from a web page, using the Weibo hot search list as an example

Latest recommended article published on 2025-02-25 09:31:45

程序员rabbit

Tags: python, beautifulsoup, web scraping

Article link: https://blog.csdn.net/qq_43920769/article/details/141923624

Target website

Link: https://s.weibo.com/top/summary?cate=realtimehot
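
Step 1: Get the headers, cookies, and response code

Press F12 to open the developer tools, then press Ctrl+R to refresh. In the screenshot (not reproduced here), click 1, then 2, to get 3.
Right-click the request you want to extract and choose Copy > Copy as cURL (bash).
Paste the copied command into https://curlconverter.com/ (the site may require a proxy or VPN to reach). It produces ready-to-use Python code for the headers, cookies, and response.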
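
For orientation, here is a rough sketch of the kind of code this step ends up with. The cookie and header values below are placeholders I made up, not real Weibo values, and the helper names build_request and fetch are hypothetical; the real dictionaries are whatever curlconverter generates from your own browser session. The sketch prepares the request without sending it, so you can check what would go out before hitting the site.

```python
import requests

URL = "https://s.weibo.com/top/summary?cate=realtimehot"

# Placeholder values (assumptions): paste the ones generated from your own session.
cookies = {"example_cookie": "example-value"}
headers = {"user-agent": "Mozilla/5.0 (placeholder)"}


def build_request():
    """Prepare the GET request without sending it, to inspect what would go out."""
    request = requests.Request("GET", URL, headers=headers, cookies=cookies)
    return request.prepare()


def fetch():
    """Send the request; raises requests.HTTPError on a 4xx/5xx response."""
    response = requests.get(URL, headers=headers, cookies=cookies, timeout=10)
    response.raise_for_status()
    return response


if __name__ == "__main__":
    prepared = build_request()
    print(prepared.url)                # the URL that would be requested
    print(prepared.headers["Cookie"])  # cookies folded into a single header
```

Calling fetch() is what actually talks to the site; with placeholder cookies it will most likely not return the real hot search page.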

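Extracting the content

First click 1 and then 2 in the screenshot (not reproduced here); 2 is the element picker.
Move the picker icon (2 in the screenshot above) over a title you want to extract, and do this for several titles. Click 1 and then 2; 2 is the code that corresponds to the title.

Right-click that code, then choose Copy > Copy selector.

This gives the following. I picked three titles: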

#pl_top_realtimehot > table > tbody > tr:nth-child(15) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(8) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(18) > td.td-02 > a

The three selectors differ only in the :nth-child(n) row index, so the common part is:

content="#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"
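
The same generalization can be done mechanically. This is a small sketch, assuming the copied selectors differ only in their :nth-child(n) part; the function name generalize is my own, not something from a library.

```python
import re

# The three selectors copied from the developer tools.
copied = [
    "#pl_top_realtimehot > table > tbody > tr:nth-child(15) > td.td-02 > a",
    "#pl_top_realtimehot > table > tbody > tr:nth-child(8) > td.td-02 > a",
    "#pl_top_realtimehot > table > tbody > tr:nth-child(18) > td.td-02 > a",
]


def generalize(selector):
    """Drop :nth-child(n) so the selector matches every row instead of one."""
    return re.sub(r":nth-child\(\d+\)", "", selector)


# All three collapse to a single selector once the row index is removed.
common = {generalize(s) for s in copied}
content = common.pop() if len(common) == 1 else None
print(content)
```

If the set ends up with more than one entry, the selectors differ in more than the row index and need a closer look by hand.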

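BeautifulSoup's select and the text attribute filter out everything we do not need, such as JavaScript, so that it does not get in the reader's way.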

# Append the text of each matched title to a file, one title per line
with open("./热搜.txt", 'a', encoding="utf-8") as fo:
    for tag in soup.select(content):
        fo.write(tag.text + '\n')

To make the result easier to see, I also print the titles to the console:

# Print the text of each matched title
for tag in soup.select(content):
    print(tag.text)
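
To try the selector without requesting Weibo at all, it can be run against a small hand-written HTML fragment. The fragment below is made up to imitate the structure the selector expects; it is an assumption, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the table structure the selector expects (an assumption).
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr><td class="td-01">1</td><td class="td-02"><a href="/weibo?q=a">Topic A</a><span>123</span></td></tr>
      <tr><td class="td-01">2</td><td class="td-02"><a href="/weibo?q=b">Topic B</a><span>45</span></td></tr>
    </tbody>
  </table>
  <script>var ignored = "not a title";</script>
</div>
"""

content = "#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# select() returns the matching <a> tags; .text keeps only their visible text.
titles = [tag.text for tag in soup.select(content)]
print(titles)
```

The rank cells, the counts in the span elements, and the script are all skipped because the selector only matches the links inside td.td-02. One thing to keep in mind: browsers insert a tbody element into tables even when the raw HTML has none, while html.parser does not, so if a selector copied from the developer tools returns nothing, it may be worth trying it without the "tbody >" step.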

Code

The full code is as follows:

import requests
from bs4 import BeautifulSoup

cookies = {
    # replace with your own
}

headers = {
    # replace with your own
}
response = requests.get('https://s.weibo.com/top/summary', cookies=cookies, headers=headers)

# Parse the page
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# Selector for the content to scrape
content = "#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"
# Clean the data: keep only the text of each matched link
titles = [tag.text for tag in soup.select(content)]
# Store the data (append mode, one title per line)
with open("./热搜数据.txt", 'a', encoding="utf-8") as fo:
    for title in titles:
        fo.write(title + '\n')
# Print the results to the console (my own addition)
for title in titles:
    print(title)
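
Because the file is opened in append mode, running the script several times writes the same titles again. If that is not what you want, one option is to skip titles that are already in the file. This is only a sketch of that idea; the function name append_new_titles and the demo file name are my own inventions.

```python
import os
import tempfile


def append_new_titles(path, titles):
    """Append only titles not already in the file; return the ones written."""
    seen = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            seen = {line.rstrip("\n") for line in f}
    written = []
    with open(path, "a", encoding="utf-8") as f:
        for title in titles:
            title = title.strip()
            # Skip empty strings and anything stored by an earlier run.
            if title and title not in seen:
                f.write(title + "\n")
                seen.add(title)
                written.append(title)
    return written


if __name__ == "__main__":
    # Demo on a throwaway file so nothing real is overwritten.
    demo_path = os.path.join(tempfile.mkdtemp(), "hot_search_demo.txt")
    print(append_new_titles(demo_path, ["Topic A", "Topic B"]))
    print(append_new_titles(demo_path, ["Topic B", "Topic C"]))
```

In the full script, the loop that writes to the file could be replaced by a single call that passes the output path and the list of scraped titles.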

I learned this from another blogger's post myself. Many thanks to the author of that article for helping me get started.