Project Training 1 -- Data Acquisition (from CSDN or Zhihu)

The data found in public datasets alone is not enough, so we also need to collect data from large Q&A sites such as CSDN and Zhihu to make the dataset sufficiently complete.

Here I use requests together with BeautifulSoup to send the requests and parse the returned pages.

My approach is simple: first adapt the code above to collect each answer_id, then use those answer_ids to crawl the corresponding full answers.
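
Before running the two full scripts below, it helps to confirm that requests plus BeautifulSoup can actually reach an answer page and that the answer text sits in the 'RichContent-inner' container used in step 2. This is only a minimal sketch: the answer_id 1234567890 in the URL is a placeholder, and the headers and cookies are the same ones used in the scripts.

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
cookies = {}  # fill in your own z_c0 cookie here

# 1234567890 is a placeholder answer_id; replace it with a real one from the question
url = 'https://www.zhihu.com/question/30644408/answer/1234567890'
resp = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(resp.text, 'html.parser')

# the answer body is expected in a div with class 'RichContent-inner'
elem = soup.find('div', class_='RichContent-inner')
print(elem.text.strip()[:200] if elem else 'answer container not found')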

Step 1: collect the answer_ids

import requests
import pandas as pd
import time

# API template for the answer feed of question 30644408; the {offset} placeholder is filled per page
template = 'https://www.zhihu.com/api/v4/questions/30644408/feeds?cursor=1c4cacd45e70f24bd620bad51c605d59&include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,attachment,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,is_labeled,paid_info,paid_info_content,reaction_instruction,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,vip_info,badge[*].topics;data[*].settings.table_of_content.enabled&limit=5&offset={offset}&order=default&platform=desktop&session_id=1698132896804376037'

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

cookies = {
    # fill in your own z_c0 cookie here
}

answer_ids = []
next_url = None  # holds the URL of the next page, if any

# request the first page
url = template.format(offset=0)
try:
    resp = requests.get(url, headers=headers, cookies=cookies)
    resp.raise_for_status()  # raise if the request failed
    data = resp.json()
    for item in data['data']:
        answer_ids.append(item['target']['id'])
    next_url = data['paging']['next']
except requests.exceptions.RequestException as e:
    print(f"Error fetching data: {e}")

# request the remaining pages by following paging['next']
page = 1
while next_url:
    try:
        resp = requests.get(next_url, headers=headers, cookies=cookies)
        resp.raise_for_status()  # raise if the request failed
        data = resp.json()
        for item in data['data']:
            answer_ids.append(item['target']['id'])
        if data['paging'].get('is_end'):  # stop once the API reports the last page
            break
        next_url = data['paging']['next']
        page += 1
        print(f'Crawling page {page}')
        time.sleep(3)  # adjust the delay as needed
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")
        break

# store the answer IDs in a DataFrame and save them as a CSV file
df = pd.DataFrame({'answer_id': answer_ids})
df.to_csv('answer_id.csv', index=False)
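
Before moving on to step 2, a quick sanity check on answer_id.csv is useful. The sketch below is just one possible way to do it (overwriting the same file is an arbitrary choice): it counts the collected IDs and drops any duplicates so step 2 does not crawl the same answer twice.

import pandas as pd

# quick sanity check on the IDs collected in step 1
df_ids = pd.read_csv('answer_id.csv')
print(f"{len(df_ids)} ids collected, {df_ids['answer_id'].nunique()} unique")

# in case the feed returned the same answer on more than one page, keep only unique IDs
df_ids.drop_duplicates(subset='answer_id').to_csv('answer_id.csv', index=False)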

Step 2: fetch the full content for each answer_id

from bs4 import BeautifulSoup
import pandas as pd
import random
import requests
import time

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

cookies = {
    # fill in your own z_c0 cookie here
}

# read the list of answer IDs collected in step 1
df_answer_ids = pd.read_csv('answer_id.csv')
answer_ids = df_answer_ids['answer_id'].tolist()

rows = []   # each element is {'answer_id': ..., 'content': ...}, so IDs and texts stay aligned even if some requests fail
batch = 0

for index, answer_id in enumerate(answer_ids):
    print(f'Crawling answer_id {answer_id}')
    url = f'https://www.zhihu.com/question/30644408/answer/{answer_id}'
    try:
        resp = requests.get(url, headers=headers, cookies=cookies)
        resp.raise_for_status()  # raise if the request failed

        soup = BeautifulSoup(resp.text, 'html.parser')
        content_elem = soup.find('div', class_='RichContent-inner')

        if content_elem:
            content = content_elem.text.strip()
            rows.append({'answer_id': answer_id, 'content': content})
            print(content)
        else:
            print(f'No answer content found for answer_id {answer_id}')
    except requests.exceptions.RequestException as e:
        print(f'Exception while crawling answer_id {answer_id}: {e}')
        continue

    time.sleep(random.randint(1, 4))

    # checkpoint: save everything collected so far after every 100 answers
    if (index + 1) % 100 == 0:
        pd.DataFrame(rows).to_csv(f'text_{batch}.csv', index=False)  # do not save the index
        batch += 1

# finally, save whatever has been collected
if rows:
    pd.DataFrame(rows).to_csv(f'text_{batch}.csv', index=False)  # do not save the index
    batch += 1
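
Because each checkpoint file contains everything collected up to that point, the batches text_0.csv, text_1.csv, ... overlap. A minimal sketch for merging them into one deduplicated dataset (the output name zhihu_answers.csv is just an example):

import glob
import pandas as pd

# collect all checkpoint files produced by the crawler (text_0.csv, text_1.csv, ...)
files = sorted(glob.glob('text_*.csv'))
frames = [pd.read_csv(f) for f in files]

# later checkpoints repeat earlier rows, so keep one row per answer_id
merged = pd.concat(frames, ignore_index=True).drop_duplicates(subset='answer_id')
merged.to_csv('zhihu_answers.csv', index=False)
print(f'{len(merged)} unique answers saved to zhihu_answers.csv')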
