Scraping Xiaohongshu Note Information and Images with Python (for learning reference only)

才华是浅浅的耐心

Last modified: 2025-04-24 10:54:29

Tags: notes, python, web scraping

First published: 2024-11-28 09:47:50

Article link: https://blog.csdn.net/weixin_74305707/article/details/144102768

Disclaimer: this article is for learning and reference only. Do not use it for commercial purposes or for any unlawful purpose.

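In this article, we walk through how to use Python to collect public data from Xiaohongshu. The code below fetches note information from Xiaohongshu and also parses, cleans, and stores the data. We hope it helps anyone working on data analysis and web scraping (full code attached).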

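1. Project Background

As a content-sharing platform, Xiaohongshu hosts a large volume of high-quality user-generated content. To analyze the characteristics of that content in depth, we need an efficient way to collect its public data. This project has the following goals:

Goal 1: Fetch note information for a specified keyword.
Goal 2: Parse each note's title, content, like count, comment count, and related fields.
Goal 3: Save the scraped results to a CSV file and download the associated images.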

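2. Tools and Dependencies

During development we used the following tools and libraries:

Python standard library: json, time, random, os, datetime
Third-party libraries:
- execjs: runs JavaScript code to generate the signature parameters.
- requests: sends HTTP requests to fetch data from Xiaohongshu's API.
- loguru: writes logs, which makes debugging easier.
- pandas: processes the data and saves it in CSV format.
Installation: pip install <package name> (note that execjs is published on PyPI as PyExecJS)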

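3. Core Implementation

3.1 Initialization

At startup, the program needs to:

Initialize the CSV file so the scraped data has somewhere to be stored.
Create a function that generates a search ID, giving each request a unique identifier.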

import os
import random
import time

img_path = 'result'  # root directory for downloaded images
output_file_path = "result.csv"

# Initialize the CSV file and write the header row
if not os.path.exists(output_file_path):
    with open(output_file_path, mode="w", encoding="utf-8-sig", newline="") as f:
        f.write("note_url,last_update_time,note_id,xsec_token,type,title,text,topics,likes,comments,collects,shares\n")

def base36encode(number, digits='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    # Convert a non-negative integer to a lowercase base-36 string
    base36 = ""
    while number:
        number, i = divmod(number, 36)
        base36 = digits[i] + base36
    return base36.lower() or "0"  # the loop yields "" for 0, so fall back to "0"

def generate_search_id():
    # Millisecond timestamp in the high bits, random value in the low 64 bits
    timestamp = int(time.time() * 1000) << 64
    random_value = random.randint(0, 2147483646)
    return base36encode(timestamp + random_value)
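
As a sanity check on the search ID helper, the sketch below encodes an ID the same way (timestamp shifted left by 64 bits, plus a random value) and then decodes it again with Python's built-in `int(s, 36)`. The `split_search_id` function is a hypothetical helper added for illustration only. It is not part of the original code, and nothing here confirms how the server interprets the ID.

```python
import random
import time

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36encode(number):
    """Encode a non-negative integer as a lowercase base-36 string."""
    if number == 0:
        return "0"
    encoded = ""
    while number:
        number, remainder = divmod(number, 36)
        encoded = DIGITS[remainder] + encoded
    return encoded


def split_search_id(search_id):
    """Hypothetical helper: recover (timestamp_ms, random_value) from an ID."""
    value = int(search_id, 36)  # int() accepts base-36 digits 0-9 and a-z
    return value >> 64, value & ((1 << 64) - 1)


timestamp_ms = int(time.time() * 1000)
random_value = random.randint(0, 2147483646)
search_id = base36encode((timestamp_ms << 64) + random_value)

# The round trip should give back exactly the two inputs.
assert split_search_id(search_id) == (timestamp_ms, random_value)
print(search_id)
```

Because the random part is far smaller than 2^64, it never spills into the timestamp bits, which is why the two halves can be separated cleanly.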

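3.2 Building the Request and Generating the Signature

We rely on execjs to run a JavaScript script that generates the signature parameters for the request, and we also set up our request parameters.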

# Search parameters, sent as the JSON request body
search_data = {
    "keyword": "旅游",  # search keyword ("travel")
    "page": 1,
    "page_size": 20,
    "search_id": generate_search_id(),
    "sort": "general",
    "note_type": 0
}

url = 'https://edith.xiaohongshu.com/api/sns/web/v1/search/notes'
api_endpoint = '/api/sns/web/v1/search/notes'

# `sign` must already hold the result of the JavaScript signing script run through
# execjs (that call is not shown in this excerpt); it needs the keys
# 'X-s', 'X-t' and 'X-s-common'.
headers = {
    'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    'Content-Type': 'application/json;charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'x-s': sign['X-s'],
    'x-t': str(sign['X-t']),
    'X-s-common': sign['X-s-common']
}
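
The header construction can be wrapped in a small function that fails early when the signing step did not return everything it should. The sketch below is illustrative only: `build_headers` and `encode_body` are hypothetical names, and the `sign` values are placeholders rather than real signatures. Serializing the body as compact JSON is also an assumption. If the signing script hashes the request body, sending exactly the bytes that were signed is the cautious choice, but the article does not confirm what this endpoint requires.

```python
import json

REQUIRED_SIGN_KEYS = ("X-s", "X-t", "X-s-common")


def build_headers(sign):
    """Build request headers from a signing result, checking required keys."""
    missing = [key for key in REQUIRED_SIGN_KEYS if key not in sign]
    if missing:
        raise KeyError(f"signing result is missing keys: {missing}")
    return {
        "Content-Type": "application/json;charset=UTF-8",
        "x-s": sign["X-s"],
        "x-t": str(sign["X-t"]),  # header values must be strings
        "X-s-common": sign["X-s-common"],
    }


def encode_body(search_data):
    """Serialize the payload once (compact JSON, UTF-8) so it can be reused."""
    text = json.dumps(search_data, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


# Placeholder values for illustration; real ones come from the signing script.
fake_sign = {"X-s": "placeholder", "X-t": 1700000000000, "X-s-common": "placeholder"}
headers = build_headers(fake_sign)
body = encode_body({"keyword": "旅游", "page": 1})
print(headers["x-t"], body.decode("utf-8"))
```

With requests, the encoded bytes could then be passed as `data=body` alongside `headers=headers`, so the transmitted body matches the serialized one byte for byte.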

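3.3 Parsing and Storing the Data

We need to parse the returned JSON data and extract each note's title, like count, comment count, and other fields. We also store the data in CSV format.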

def parse_data(data):
    """Extract title, text, topics and interaction counts from a search response."""
    items = data.get('data', {}).get('items', [])
    parsed_info = []
    for item in items:
        note = item.get('note_card', {})
        title = note.get('title', '')
        desc = note.get('desc', '')
        # Words containing the '[话题]' marker are topics; strip the '#' and the marker
        topics = [word.strip('#').replace('[话题]', '').strip() for word in desc.split() if '[话题]' in word]
        # Everything else is the note text
        desc_cleaned = ' '.join([word for word in desc.split() if '[话题]' not in word]).strip()

        interact_info = note.get('interact_info', {})
        liked_count = interact_info.get('liked_count', 0)
        comment_count = interact_info.get('comment_count', 0)
        collected_count = interact_info.get('collected_count', 0)
        share_count = interact_info.get('share_count', 0)

        parsed_info.append({
            '标题': title,  # title
            '内容': desc_cleaned,  # text
            '点赞数': liked_count,  # likes
            '评论数': comment_count,  # comments
            '收藏数': collected_count,  # collects
            '转发数': share_count,  # shares
            '话题': topics  # topics
        })

    return parsed_info
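
To make the topic extraction and the CSV step concrete, here is a minimal sketch run on a made-up description string. It uses the standard library's `csv` module in place of pandas so that it runs without extra dependencies. The function names `split_topics` and `rows_to_csv`, the sample text, and the four-column layout are assumptions chosen for illustration, not the article's actual output schema.

```python
import csv
import io

TOPIC_MARKER = "[话题]"  # marker that tags a word as a topic in the description


def split_topics(desc):
    """Split a description into (topics, remaining text)."""
    words = desc.split()
    topics = [
        word.strip("#").replace(TOPIC_MARKER, "").strip()
        for word in words
        if TOPIC_MARKER in word
    ]
    text = " ".join(word for word in words if TOPIC_MARKER not in word).strip()
    return topics, text


def rows_to_csv(rows, fieldnames):
    """Write rows to CSV text; the writer quotes fields that contain commas."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        flat["topics"] = ";".join(row["topics"])  # flatten the list into one cell
        writer.writerow(flat)
    return buffer.getvalue()


sample_desc = "周末去哪玩, 求推荐 #旅游[话题]# #美食[话题]#"
topics, text = split_topics(sample_desc)
rows = [{"title": "Weekend trip", "text": text, "topics": topics, "likes": 12}]
csv_text = rows_to_csv(rows, ["title", "text", "topics", "likes"])
print(csv_text)
```

Going through a CSV writer rather than joining fields by hand matters here because note text often contains commas and line breaks, which would otherwise shift columns.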

3.4 Downloading Images

For each note, we also download its images and save them locally.

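import os
import requests

def download_images(note_id, image_urls, img_path='result'):
    # Each note gets its own folder under img_path
    output_dir = f"./{img_path}/{note_id}"
    os.makedirs(output_dir, exist_ok=True)

    # Download and save each image
    for i, url in enumerate(image_urls):
        image_path = os.path.join(output_dir, f"image_{i + 1}.jpg")
        try:
            response = requests.get(url, timeout=10)  # avoid hanging on a stalled connection
            response.raise_for_status()
            with open(image_path, 'wb') as f:
                f.write(response.content)
            print(f"Image saved: {image_path}")
        except requests.exceptions.RequestException as e:
            print(f"Image download failed {url}: {e}")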
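
Image downloads can fail transiently, so one optional refinement is to retry a few times with an increasing delay before giving up. The sketch below is an assumption layered on top of the article's code, not something the original does. `download_with_retry` is a hypothetical helper that takes any callable returning bytes, which lets it be demonstrated here with a fake fetcher instead of a real network call.

```python
import time


def download_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # in real code, narrow this to the HTTP client's errors
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Demonstration with a fake fetcher that fails twice and then succeeds.
calls = []
delays = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return b"image-bytes"


content = download_with_retry(
    flaky_fetch, "https://example.com/a.jpg", sleep=delays.append
)
print(content, len(calls), delays)
```

With requests, `fetch` could be a small function that calls `requests.get(url, timeout=10)`, runs `raise_for_status()`, and returns `response.content`.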

3.5 Results: