How to scrape QQ Zone posts (shuoshuo) with a Python crawler - CSDN Blog

Article link: https://blog.csdn.net/qq_40435621/article/details/89363241

After learning a bit about web crawlers, I had always wanted to try scraping QQ Zone (Qzone).

Preparation before scraping

Install Python 3
Required libraries:

re (regular expressions, part of the standard library)
selenium

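You also need to install
Chrome or Firefox
and the matching browser driver:
Chrome driver: ChromeDriver
Firefox driver: geckodriver

I used Firefox, because Chrome raised an error when I tried it earlier.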

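How to scrape

First, use selenium to simulate a login and obtain the cookies (usually no CAPTCHA appears, as long as the account has not had too many abnormal logins). Before writing that code, we need to analyze the page structure, so open the QQ Zone login page.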
Inspect the page to find the ids of the elements that the simulated login has to touch, then use those ids in the code.

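import time

from selenium import webdriver
from selenium.webdriver.common.by import By

def login():
    # executable_path is the Selenium 3 style; newer Selenium versions take the driver path through a Service object instead
    driver = webdriver.Firefox(executable_path=r'D:/Anaconda3/geckodriver.exe')
    # Open the QQ Zone page
    driver.get("https://qzone.qq.com/")
    # The login form lives inside this frame, so switch into it first
    driver.switch_to.frame('login_frame')
    # Click the switch that shows the account/password form
    driver.find_element(By.ID, 'switcher_plogin').click()
    driver.find_element(By.ID, 'u').clear()
    driver.find_element(By.ID, 'u').send_keys('your_account')
    driver.find_element(By.ID, 'p').clear()
    driver.find_element(By.ID, 'p').send_keys('your_password')
    driver.find_element(By.ID, 'login_button').click()
    # Give the page a moment to finish logging in
    time.sleep(3)
    # Switch back out of the frame afterwards, otherwise later lookups raise errors
    driver.switch_to.default_content()
    return driver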

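That completes the simulated login.
But the driver by itself is still not enough: Tencent adds extra request parameters (g_tk and qzonetoken, computed below), probably for security and anti-scraping reasons.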

# g_tk algorithm: hash the p_skey cookie value, starting from 5381
def get_g_tk(cookie):
    hashes = 5381
    for letter in cookie['p_skey']:
        hashes += (hashes << 5) + ord(letter)
    # Keep only the low 31 bits
    return hashes & 0x7fffffff
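
get_g_tk expects a dict-like cookie with a p_skey entry, whereas Selenium's driver.get_cookies() returns a list of dicts with name and value keys. Below is a minimal sketch of bridging the two. The helper name cookies_to_dict and the sample cookie values are hypothetical, made up for illustration, and get_g_tk is repeated only so that the snippet runs on its own.

```python
def cookies_to_dict(cookie_list):
    """Flatten Selenium-style cookie entries into a {name: value} dict."""
    return {c["name"]: c["value"] for c in cookie_list}


def get_g_tk(cookie):
    """Same hash as in the article: start at 5381 and fold in each character of p_skey."""
    hashes = 5381
    for letter in cookie["p_skey"]:
        hashes += (hashes << 5) + ord(letter)
    return hashes & 0x7FFFFFFF


# Hypothetical sample in the shape returned by driver.get_cookies()
sample = [
    {"name": "p_skey", "value": "a", "domain": ".qzone.qq.com"},
    {"name": "uin", "value": "o0123456789", "domain": ".qq.com"},
]

cookie = cookies_to_dict(sample)
g_tk = get_g_tk(cookie)
print(g_tk)  # 177670 for the one-character sample value "a"
```

With a real login, cookies_to_dict(driver.get_cookies()) would take the place of the sample list, and the same dict could then be handed to whatever HTTP session makes the later requests.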

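import re

# Get the qzonetoken parameter
def get_title():
    driver = login()
    html = driver.page_source
    # The token is embedded in an inline script of the logged-in page
    xpat = r'window\.g_qzonetoken = \(function\(\)\{ try\{return (.*?);\} catch\(e\)'
    qzonetoken = re.compile(xpat).findall(html)[0]
    # Return the driver too, so the caller can still read its cookies
    return driver, qzonetoken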
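
To see what the pattern above captures, here is a small sketch that runs the same regular expression against a made-up script fragment. The sample HTML is hypothetical: it is only shaped to match the pattern, and the real page may differ. Stripping the surrounding quotes assumes the token is embedded as a quoted string literal, which the article does not show. The helper name extract_qzonetoken is also mine.

```python
import re

# Same pattern as in the article
TOKEN_RE = re.compile(
    r'window\.g_qzonetoken = \(function\(\)\{ try\{return (.*?);\} catch\(e\)'
)


def extract_qzonetoken(html):
    """Return the token, or None when the pattern is not found."""
    match = TOKEN_RE.search(html)
    if match is None:
        return None
    # Assumption: the captured value is a quoted literal, so drop the quotes
    return match.group(1).strip('"\'')


# Hypothetical page fragment, shaped to match the pattern
sample_html = (
    '<script>window.g_qzonetoken = (function(){ try{return "abc123";} '
    'catch(e) {return "";} })();</script>'
)

print(extract_qzonetoken(sample_html))
print(extract_qzonetoken("<html></html>"))
```

Returning None instead of indexing [0] directly avoids an IndexError when the login failed and the page does not contain the token.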

Continue analyzing the page: open the developer tools (F12), then (after logging in manually in the browser) click the Friends tab.
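Among the many responses the server has just sent back, find the one that contains the friends' information.

You will see that uin is a friend's QQ number and name is their display name. Ha, "get friends' QQ list" skill unlocked.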

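import re
import time

def get_allQQ(mysession, g_tk, qzonetoken):
    # Get the friends' QQ numbers.
    # qq is your own QQ number and must already be defined (e.g. at module level);
    # mysession is expected to be an HTTP session that carries the login cookies.
    url_friend = 'https://user.qzone.qq.com/proxy/domain/r.qzone.qq.com/cgi-bin/tfriend/friend_ship_manager.cgi?uin=' + str(qq) + '&do=1&rd=0.7020990069082498&fupdate=1&clean=1&g_tk=' + str(g_tk) + '&qzonetoken=' + qzonetoken + '&g_tk=' + str(g_tk)
    # Every "uin": value in the response is a friend's QQ number
    friendpat = '"uin":(.*?),'
    resp = mysession.get(url_friend)
    friendlist = re.compile(friendpat).findall(resp.text)
    time.sleep(3)
    return friendlist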
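
get_allQQ returns only the uin values. If the names are wanted as well, a variation like the following sketch could pair each uin with its name. The sample response text is hypothetical, and the pattern assumes that "name" directly follows "uin" in each entry, which should be checked against the real response in the developer tools. The helper name parse_friends is mine.

```python
import re

# Assumption: each friend entry looks like {"uin":<digits>,"name":"<text>", ...}
PAIR_RE = re.compile(r'"uin":\s*(\d+),\s*"name":\s*"(.*?)"')


def parse_friends(text):
    """Return a list of (uin, name) tuples, skipping repeated uins."""
    seen = set()
    friends = []
    for uin, name in PAIR_RE.findall(text):
        if uin not in seen:
            seen.add(uin)
            friends.append((uin, name))
    return friends


# Hypothetical response body, made up for illustration
sample = (
    '_Callback({"data":{"items_list":['
    '{"uin":10001,"name":"Alice"},'
    '{"uin":10002,"name":"Bob"},'
    '{"uin":10001,"name":"Alice"}]}});'
)

for uin, name in parse_friends(sample):
    print(uin, name)
```

The uin values come back as strings, which is convenient here because they are only ever concatenated into URLs.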

After that, the remaining step is to scrape the friends' posts.

# This is the main URL to scrape from
url = 'https://h5.qzone.qq.com/proxy/domain/taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6?uin='+str(qq)+'&inCharset=utf-8&outCharset=utf-8&hostUin='+str(qq)+'&notice=0&sort=0&pos=0&num=20&cgi_host=http://taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6&code_version=1&format=jsonp&need_private_comment=1&g_tk='+str(g_tk)+'&qzonetoken='+str(qzonetoken)
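
The long concatenated URL is easier to maintain when its query string is built from a dict. The sketch below does that with the standard library, using the same parameters as the URL above, and also shows one way to unwrap a JSONP response, since the request asks for format=jsonp. Several things here are assumptions rather than facts from the article: that pos is the paging offset, that the body is wrapped as callback(...), and the field names in the sample body. The helper names build_msglist_url and strip_jsonp are mine.

```python
import json
from urllib.parse import urlencode

BASE = "https://h5.qzone.qq.com/proxy/domain/taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6"


def build_msglist_url(qq, g_tk, qzonetoken, pos=0, num=20):
    """Build the post-list URL; pos is assumed to be the offset of the first post."""
    params = {
        "uin": qq,
        "inCharset": "utf-8",
        "outCharset": "utf-8",
        "hostUin": qq,
        "notice": 0,
        "sort": 0,
        "pos": pos,
        "num": num,
        "cgi_host": "http://taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6",
        "code_version": 1,
        "format": "jsonp",
        "need_private_comment": 1,
        "g_tk": g_tk,
        "qzonetoken": qzonetoken,
    }
    # urlencode percent-encodes the values, including the nested cgi_host URL
    return BASE + "?" + urlencode(params)


def strip_jsonp(text):
    """Parse the JSON between the first '(' and the last ')' of a JSONP body."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


# Hypothetical values, for illustration only
url = build_msglist_url(10001, 177670, "abc123", pos=20)
print(url)

# Hypothetical response body; the real field names may differ
sample_body = '_Callback({"code":0,"msglist":[{"content":"hello"}]});'
data = strip_jsonp(sample_body)
print(data["msglist"][0]["content"])
```

If pos really is an offset, calling build_msglist_url with pos=0, 20, 40 and so on would walk through a friend's posts one page of 20 at a time.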