Scraping comments from the COVID-19 forum on PatientsLikeMe

Latest recommended article published on 2024-09-27 10:11:28

Sakura_❀_

Category: Self-study notes. Tags: python, web scraping

Article link: https://blog.csdn.net/weixin_45781143/article/details/118784544

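A few days ago my teacher assigned a small web-scraping task.
As someone who had never written a scraper, I expected it to be a bit tricky but not too time-consuming.
Little did I know that PatientsLikeMe would keep me away from games for two whole days.

Enough talk, let's get to work.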

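Working out the approach

First, open the forum link:
COVID-19
Observation: the site forces a redirect to the login page.
So I registered an account and logged in to take a look.

The forum is a list of threads about COVID-19.
Opening a thread shows the comments posted on it, and those comments are what we want to scrape.
That gives the plan:
- Solve the login problem
- Scrape the links of the forum threads
- Scrape the comments inside each link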

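Solving the login problem

A quick search online showed two commonly used approaches:

- Simulate the login
- Log in for real once, extract your own cookies, and send them along with the request headers when scraping

The second one looked simpler, so I decided to try it first.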

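How to find the cookies

I searched online for a long time and tried many times. Some people say to look in the Network tab, others say to look in the Application tab.

In the end, the cookies from the Application tab were the ones that let me log in to PatientsLikeMe successfully.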
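Whichever DevTools tab the cookies come from, they end up as name/value pairs. If you copy the raw `Cookie` request header as one string rather than reading the table row by row, a small helper can turn it into the dict that requests expects. This is only a sketch: the helper name `cookie_header_to_dict` and the sample cookie names are made up for illustration and are not the site's real cookies.

```python
def cookie_header_to_dict(header):
    """Turn a raw header such as 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # the value itself may contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Hypothetical sample string, not real PatientsLikeMe cookies
sample = "session_id=abc123; remember=yes=1"
print(cookie_header_to_dict(sample))
```

The resulting dict can be passed to `requests.utils.add_dict_to_cookiejar` in place of a hand-typed one.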

Partial code

import requests

# Elided: the name/value pairs copied from the browser's Application tab
cookies = {.......}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    # Note: Referer must be the URL of the login page
    "Referer": "https://www.patientslikeme.com/users/sign_in"
}
url = 'https://www.patientslikeme.com/api/internal/v1/topics?'
# Use a session so the cookies are sent with every request
session = requests.Session()
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)
response = session.get(url=url, headers=headers)
print(response.status_code)
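A 200 status alone does not prove that the cookies worked, because the site redirects anonymous visitors to the login page and requests follows redirects by default. One hedged way to check is to look at the final URL, which requests exposes as `response.url`. The helper name `looks_logged_in` is hypothetical, and the assumption that a failed login always lands on `/users/sign_in` is based only on the login URL used as the Referer.

```python
def looks_logged_in(final_url, status_code):
    """Heuristic: HTTP 200 and the final URL is not the sign-in page."""
    return status_code == 200 and "/users/sign_in" not in final_url


# With a real response this would be:
#   looks_logged_in(response.url, response.status_code)
print(looks_logged_in("https://www.patientslikeme.com/forum/covid19/topics/160469", 200))
print(looks_logged_in("https://www.patientslikeme.com/users/sign_in", 200))
```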

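Getting the link of every thread

First, open a few threads and look at their links:

https://www.patientslikeme.com/forum/covid19/topics/160469
https://www.patientslikeme.com/forum/covid19/topics/162792
https://www.patientslikeme.com/forum/covid19/topics/162800

The pattern is clear: the URLs differ only in the number at the end.
Once we can collect that series of numbers, the problem is solved.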
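As a small worked example of that pattern, the numeric part can be pulled out of the three URLs with a regular expression. The function name `topic_id_from_url` is just an illustrative choice.

```python
import re

# The thread id is the run of digits after /topics/
TOPIC_URL = re.compile(r"/forum/covid19/topics/(\d+)")


def topic_id_from_url(url):
    """Return the numeric thread id, or None if the URL is not a thread URL."""
    match = TOPIC_URL.search(url)
    return int(match.group(1)) if match else None


urls = [
    "https://www.patientslikeme.com/forum/covid19/topics/160469",
    "https://www.patientslikeme.com/forum/covid19/topics/162792",
    "https://www.patientslikeme.com/forum/covid19/topics/162800",
]
print([topic_id_from_url(u) for u in urls])
```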

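Since this was my first scraper, I took a lot of detours.
I assumed I could grab the page source with .text and then pick out the numbers with a suitable regular expression.
After spending most of a day writing regular expressions, I discovered that the source returned by .text did not contain these numbers at all.

After more reading I learned that the threads are generated dynamically: .text only contains the HTML the server sends back, not the content that JavaScript adds afterwards.
Following tutorials online, I started looking for the API endpoint under XHR in the Network tab.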
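The difference can be illustrated offline. In the sketch below both strings are hypothetical stand-ins: `static_html` imitates a page whose thread list is filled in later by JavaScript, and `api_payload` imitates a JSON response with a `topics` list of objects that carry an `id`. The regular expression finds nothing in the static HTML, while the JSON gives up the ids directly.

```python
import json
import re

# Hypothetical static HTML: only an empty mount point and a script tag
static_html = (
    '<html><body><div id="app"></div>'
    '<script src="/bundle.js"></script></body></html>'
)

# Hypothetical JSON payload of the kind an XHR endpoint might return
api_payload = '{"topics": [{"id": 160469}, {"id": 162792}]}'

pattern = re.compile(r"/forum/covid19/topics/(\d+)")

# Regex over the static source: the thread links simply are not there
ids_from_html = pattern.findall(static_html)

# Parsing the JSON: the ids are plain fields
ids_from_api = [topic["id"] for topic in json.loads(api_payload)["topics"]]

print(ids_from_html)
print(ids_from_api)
```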

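After some digging I finally found the URL of the endpoint.

Open the request's Headers tab and update the headers, url, and cookies in the script to match.
A simple extraction step is then enough to get the links.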

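# Parse the JSON body returned by the topics endpoint
# (response is the result of session.get in the code above)
data = response.json()

# Collect the numeric id of every topic
id_list = []
for item in data['topics']:
    topic_id = item['id']
    id_list.append(topic_id)
    print(topic_id)
print(id_list)

# Build each thread URL from its id
for topic_id in id_list:
    url = 'https://www.patientslikeme.com/forum/covid19/topics/' + str(topic_id)
    print(url)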

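Scraping the comments inside each link

I wanted to repeat the trick I used for the links and find a JSON response containing the comments, but with my limited skills I searched all over Network and could not find one.
Luckily there was a plan B.

Simulating the login with selenium

Import the required libraries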

from selenium import webdriver
import time
import json
from bs4 import BeautifulSoup

Load the Chrome driver (if it is not on your computer, download the version that matches your Chrome)

driver_path = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'
# Selenium 3 style; Selenium 4 expects webdriver.Chrome(service=Service(driver_path))
driver = webdriver.Chrome(executable_path=driver_path)

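Simulating the login
- First load the page we want: driver.get("https://www.patientslikeme.com/forum/covid19/topics/160469"). As expected, it redirects to the login page.
- Use the find_element_by_XXX functions to locate the input fields, type in the account and password, and log in.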

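# find_element_by_* is the Selenium 3 API; Selenium 4 uses find_element(By.ID, ...)
# Replace the two placeholders with your own account details
inputTag = driver.find_element_by_id('user_email_or_login')
inputTag.send_keys('YOUR_USERNAME')
inputTag = driver.find_element_by_id('user_password')
inputTag.send_keys('YOUR_PASSWORD')
# Click the element at this XPath to submit the login form
rememberTag = driver.find_element_by_xpath('//*[@id="pg-new"]/div[2]/div/div/div/div[2]/div/div/div/div[3]')
rememberTag.click()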

Wait 10 s so the page can finish loading
time.sleep(10)
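A fixed 10 s sleep is either too long or too short, depending on the network. A common alternative is to poll for a condition until a timeout expires; selenium ships its own version of this idea as `WebDriverWait`. The plain-Python sketch below only shows the principle: `wait_until` and the `ready` stand-in are hypothetical names, and with a real driver the condition could be a lookup of the elements you are waiting for.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for "is the page ready?": becomes true on the third check
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(ready, timeout=2.0, interval=0.01))
```

The loop returns as soon as the condition holds, so a fast page costs almost no waiting while a slow one still gets the full timeout.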

Scraping the comment content

First, inspect where the comments sit in the Elements panel (F12)

<div class="rich-content" id="rich-content-for-Post-2543264">
<div data-react-class="RichContentWrapper" data-react-props="{"body":"<p>Being diagnosed with COVID-19 can be overwhelming and the quarantine and social isolation add to the stress of diagnosis. Please feel free to connect with others in this thread.  Share your diagnosis story, your symptoms and treatment experience, and how you've been coping with the isolation.  </p>"}" data-react-cache-id="RichContentWrapper-0"><div class="text js-no-observer"><p>Being diagnosed with COVID-19 can be overwhelming and the quarantine and social isolation add to the stress of diagnosis. Please feel free to connect with others in this thread.  Share your diagnosis story, your symptoms and treatment experience, and how you've been coping with the isolation.  </p></div></div>

Every comment has its own id, such as id="rich-content-for-Post-2543264", and only the numeric part differs between comments.
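Because only the number changes, both the id and an XPath to the comment's paragraph can be built from a post number. This is a sketch with made-up helper names; the `/div/div/p` tail simply mirrors the nesting in the snippet above (wrapper div, text div, paragraph) and may not hold for every post.

```python
def comment_container_id(post_id):
    """Build the id of the div that wraps one comment."""
    return "rich-content-for-Post-" + str(post_id)


def comment_xpath(post_id):
    """Build an XPath to the paragraph(s) inside that comment."""
    return '//*[@id="%s"]/div/div/p' % comment_container_id(post_id)


print(comment_container_id(2543264))
print(comment_xpath(2543264))
```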

Locating all the numbers
After some more inspection, I found a tag whose name attribute holds exactly the numeric part

<a class="secondary-link" name="2543869" title="Link directly to this post" id="post-2543869" href="/forum/covid19/topics/160469?post_id=2543869#post-2543869"><small class="helptext" id="timestamp-2543869"><time datetime="2020-03-22T14:22:09-04:00">Mar 22, 2020 02:22PM</time></small>
</a>
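To check that observation without a browser, the name attribute can be read from the snippet with Python's standard `html.parser` module. This is an offline sketch, not the selenium lookup itself; the class name `PostIdCollector` is hypothetical.

```python
from html.parser import HTMLParser


class PostIdCollector(HTMLParser):
    """Collect the name attribute of every <a class="secondary-link">."""

    def __init__(self):
        super().__init__()
        self.post_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "secondary-link" in classes and attrs.get("name"):
            self.post_ids.append(attrs["name"])


# The anchor tag copied from the Elements panel
snippet = '''<a class="secondary-link" name="2543869" title="Link directly to this post" id="post-2543869" href="/forum/covid19/topics/160469?post_id=2543869#post-2543869"><small class="helptext" id="timestamp-2543869"><time datetime="2020-03-22T14:22:09-04:00">Mar 22, 2020 02:22PM</time></small>
</a>'''

collector = PostIdCollector()
collector.feed(snippet)
print(collector.post_ids)
```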

Because this tag is nested quite deep, testing showed that it takes two levels of lookup to reach it

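# First level: every block that holds a post timestamp
divs = driver.find_elements_by_class_name("thread-date")
for div in divs:
    # Second level: the link whose name attribute is the post number
    ids = div.find_element_by_class_name("secondary-link").get_attribute('name')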

Use .text and .get_attribute() to scrape the comments

    # Continuing inside the for loop above
    comment = driver.find_elements_by_xpath('//*[@id="rich-content-for-Post-' + str(ids) + '"]/div/div/p')
    if comment:
        # Keep the text of the first matching paragraph
        comment = comment[0].text
    else:
        comment = None
    data_dict = {}
    data_dict['comment'] = comment
    print(data_dict)
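The loop above only prints each record. Since json is already imported, one possible next step is to append every data_dict to a list and write the list to a file. The sketch below assumes that workflow; the helper names, the file name `comments.json`, and the sample records are all made up for illustration.

```python
import json
import os
import tempfile


def save_comments(records, path):
    """Write the list of comment dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_comments(path):
    """Read the list of comment dicts back from the file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical records shaped like data_dict; None marks a post with no text
records = [
    {"comment": "Example comment text"},
    {"comment": None},
]

path = os.path.join(tempfile.gettempdir(), "comments.json")
save_comments(records, path)
print(load_comments(path) == records)
```

Passing `ensure_ascii=False` keeps any non-ASCII characters in the comments readable in the file.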