Project Training 1 -- Data Acquisition (from CSDN or Zhihu)

The data found in public datasets alone is not enough, so we also need to collect data from large Q&A sites such as CSDN and Zhihu to make the dataset sufficiently complete.

Here I use requests together with BeautifulSoup to send the requests and parse the returned pages.

My approach is simple: first adapt the code above to collect each answer_id, then use those answer_ids to crawl the corresponding full answers.
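
Before running the two full scripts below, it helps to confirm that requests plus BeautifulSoup can actually reach an answer page and that the answer text sits in the 'RichContent-inner' container used in step 2. This is only a minimal sketch: the answer_id 1234567890 in the URL is a placeholder, and the headers and cookies are the same ones used in the scripts.

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
cookies = {}  # fill in your own z_c0 cookie here

# 1234567890 is a placeholder answer_id; replace it with a real one from the question
url = 'https://www.zhihu.com/question/30644408/answer/1234567890'
resp = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(resp.text, 'html.parser')

# the answer body is expected in a div with class 'RichContent-inner'
elem = soup.find('div', class_='RichContent-inner')
print(elem.text.strip()[:200] if elem else 'answer container not found')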

Step 1: collect the answer_ids

import requests
import pandas as pd
import time

# API template for the answer feed of question 30644408; the {offset} placeholder is filled per page
template = 'https://www.zhihu.com/api/v4/questions/30644408/feeds?cursor=1c4cacd45e70f24bd620bad51c605d59&include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,attachment,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,is_labeled,paid_info,paid_info_content,reaction_instruction,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,vip_info,badge[*].topics;data[*].settings.table_of_content.enabled&limit=5&offset={offset}&order=default&platform=desktop&session_id=1698132896804376037'

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

cookies = {
    # fill in your own z_c0 cookie here
}

answer_ids = []
next_url = None  # holds the URL of the next page, if any

# request the first page
url = template.format(offset=0)
try:
    resp = requests.get(url, headers=headers, cookies=cookies)
    resp.raise_for_status()  # raise if the request failed
    data = resp.json()
    for item in data['data']:
        answer_ids.append(item['target']['id'])
    next_url = data['paging']['next']
except requests.exceptions.RequestException as e:
    print(f"Error fetching data: {e}")

# request the remaining pages by following paging['next']
page = 1
while next_url:
    try:
        resp = requests.get(next_url, headers=headers, cookies=cookies)
        resp.raise_for_status()  # raise if the request failed
        data = resp.json()
        for item in data['data']:
            answer_ids.append(item['target']['id'])
        if data['paging'].get('is_end'):  # stop once the API reports the last page
            break
        next_url = data['paging']['next']
        page += 1
        print(f'Crawling page {page}')
        time.sleep(3)  # adjust the delay as needed
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")
        break

# store the answer IDs in a DataFrame and save them as a CSV file
df = pd.DataFrame({'answer_id': answer_ids})
df.to_csv('answer_id.csv', index=False)
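
Before moving on to step 2, a quick sanity check on answer_id.csv is useful. The sketch below is just one possible way to do it (overwriting the same file is an arbitrary choice): it counts the collected IDs and drops any duplicates so step 2 does not crawl the same answer twice.

import pandas as pd

# quick sanity check on the IDs collected in step 1
df_ids = pd.read_csv('answer_id.csv')
print(f"{len(df_ids)} ids collected, {df_ids['answer_id'].nunique()} unique")

# in case the feed returned the same answer on more than one page, keep only unique IDs
df_ids.drop_duplicates(subset='answer_id').to_csv('answer_id.csv', index=False)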

Step 2: fetch the full content for each answer_id

from bs4 import BeautifulSoup
import pandas as pd
import random
import requests
import time

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

cookies = {
    # fill in your own z_c0 cookie here
}

# read the list of answer IDs collected in step 1
df_answer_ids = pd.read_csv('answer_id.csv')
answer_ids = df_answer_ids['answer_id'].tolist()

rows = []   # each element is {'answer_id': ..., 'content': ...}, so IDs and texts stay aligned even if some requests fail
batch = 0

for index, answer_id in enumerate(answer_ids):
    print(f'Crawling answer_id {answer_id}')
    url = f'https://www.zhihu.com/question/30644408/answer/{answer_id}'
    try:
        resp = requests.get(url, headers=headers, cookies=cookies)
        resp.raise_for_status()  # raise if the request failed

        soup = BeautifulSoup(resp.text, 'html.parser')
        content_elem = soup.find('div', class_='RichContent-inner')

        if content_elem:
            content = content_elem.text.strip()
            rows.append({'answer_id': answer_id, 'content': content})
            print(content)
        else:
            print(f'No answer content found for answer_id {answer_id}')
    except requests.exceptions.RequestException as e:
        print(f'Exception while crawling answer_id {answer_id}: {e}')
        continue

    time.sleep(random.randint(1, 4))

    # checkpoint: save everything collected so far after every 100 answers
    if (index + 1) % 100 == 0:
        pd.DataFrame(rows).to_csv(f'text_{batch}.csv', index=False)  # do not save the index
        batch += 1

# finally, save whatever has been collected
if rows:
    pd.DataFrame(rows).to_csv(f'text_{batch}.csv', index=False)  # do not save the index
    batch += 1
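
Because each checkpoint file contains everything collected up to that point, the batches text_0.csv, text_1.csv, ... overlap. A minimal sketch for merging them into one deduplicated dataset (the output name zhihu_answers.csv is just an example):

import glob
import pandas as pd

# collect all checkpoint files produced by the crawler (text_0.csv, text_1.csv, ...)
files = sorted(glob.glob('text_*.csv'))
frames = [pd.read_csv(f) for f in files]

# later checkpoints repeat earlier rows, so keep one row per answer_id
merged = pd.concat(frames, ignore_index=True).drop_duplicates(subset='answer_id')
merged.to_csv('zhihu_answers.csv', index=False)
print(f'{len(merged)} unique answers saved to zhihu_answers.csv')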
