Scraping QQ Group Name Cards

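I recently needed to scrape the name cards (group nicknames) of a QQ group's members and run some statistics on them. QQ provides an official web interface for this.
https://qun.qq.com/member.html#gid=546786372 The number after gid= is the QQ group number to scrape. Two simple approaches are shown here.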

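  1. Using the requests library
    Capture the request with the browser's developer tools (F12).
    [Screenshot: the request captured with F12]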

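Only the cookie and the form data matter; everything else can be ignored. The interface limits how many members one request can return, so the list has to be fetched in several requests.
gc is the group number. bkn is computed from the cookie; copy it from the captured request, together with the cookie itself. sort=0 means no sorting, and st and end give the index range of the members to fetch.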
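The paging described above can be sketched as a small generator. This is only an illustration: page_ranges is a hypothetical helper name, and the page size of 20 is taken from the st/end values used in this post, not from any documented limit of the interface.

```python
def page_ranges(total, page_size=20):
    """Yield inclusive (st, end) index pairs covering members 0..total-1.

    Assumption: 20 members per request, as in the st / st+19 ranges
    used in this post. The last window is not clamped to total.
    """
    for st in range(0, total, page_size):
        yield st, st + page_size - 1


ranges = list(page_ranges(60))
print(ranges)
```

Each pair maps directly onto the st and end fields of one request.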

import json
import requests

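ans = []  # collected name cards

def spider(i):
    # Request the members with index i..i+19 (20 per call)
    data = {'gc': 546786372,   # group number
            'st': i,
            'end': i + 19,
            'sort': 0,         # 0 = unsorted
            'bkn': 711855034   # copied from the captured request
            }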
    headers = {'Cookie':'pgv_info=ssid=s750098733; pgv_pvid=625209338; RK=HXRwWT7cF; ptcz=d984bcdebe497afb91dcce5b7c39aaa61f9df551a8041949356ef5a5fed70d; tvfe_boss_uuid=827a0cf9909bf9c; ptui_loginuin=297490; pac_uid=1_24779490; iip=0; o_cookie=2477490; _qpsvr_localtk=161433760809; uin=o02186265; p_uin=o0298665; traceid=40647b914f; skey=@1cY7LKcNi; pt4_token=ye36khVQbqEM6bZRk2Nn*uCI1A1iDEqZjTJlRqvh38_; p_skey=z2wlQ9O3UgkKqrNRdCie2T2lM0eWJAfPXn365Pw1Qw_'}
    r = requests.post(url='https://qun.qq.com/cgi-bin/qun_mgr/search_group_members',
                      data=data, headers=headers, timeout=10)
    json_data = r.json()
    # 'mems' may be missing from the response, so fall back to an empty list
    for mem in json_data.get('mems', []):
        ans.append(mem['card'])
        
if __name__ == "__main__":
    for i in range(0,500,20):
        spider(i)
    print('\n'.join(ans))
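The post mentions running statistics on the collected cards but does not show that step. A minimal sketch, assuming the goal is simply to count how often each card text occurs; card_stats and the sample list are hypothetical, and in practice the ans list built above would be passed in.

```python
from collections import Counter


def card_stats(cards):
    """Count occurrences of each non-empty name card (whitespace-trimmed)."""
    cleaned = [c.strip() for c in cards if c and c.strip()]
    return Counter(cleaned)


# Hypothetical sample data standing in for the scraped `ans` list
sample = ["Alice-Math", "Bob-CS", "", "Alice-Math", "  "]
stats = card_stats(sample)
for card, count in stats.most_common():
    print(card, count)
```

Members who never set a card come back as empty strings, which is why blank entries are dropped before counting.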
    
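  2. Using a raw socket request
    There are a few pitfalls here. Start by capturing the complete request with Burp Suite.
    [Screenshot: the full request captured in Burp Suite]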
import socket
import ssl
data = """POST /cgi-bin/qun_mgr/search_group_members HTTP/1.1
Host: qun.qq.com
Connection: close
Content-Length: {}
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Origin: https://qun.qq.com
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Sec-Fetch-Dest: empty
Accept-Encoding: gzip, deflate
Referer: https://qun.qq.com/member.html
Accept-Language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6
Cookie: pgv_info=ssid=s7500978733; pgv_pvid=6255209338; RK=HXRwWTX7cF; ptcz=d984bcdebe497afb091dcce5b7c39aaa61f9df551a80419249356ef5a5fed70d; tvfe_boss_uuid=827a0ccf9909bf9c; ptui_loginuin=247490; pac_uid=1_2497490; iip=0; o_cookie=2477490; p_uin=o0296265; traceid=4064914f; uid=3698429; uin=o0291865; skey=@KAxFH3T0x; pt4_token=m1oMuANPcifwVFb-BpJCNAbHlBU8Zw9aIWwjrTPZwlQ_; p_skey=lZDWSFkqVoWPB5PIRpH*DmnKgUtjDL8Rso2vk4Tm10I_

{}"""
data = data.replace('\n','\r\n')

def spider(st):
    try:
        # ssl.wrap_socket() was removed in Python 3.12; use an SSLContext instead
        context = ssl.create_default_context()
        raw_sock = socket.create_connection(('qun.qq.com', 443))
        sock = context.wrap_socket(raw_sock, server_hostname='qun.qq.com')  # for https
        payload = "gc=546786372&st={}&end={}&sort=0&bkn=323497478".format(st, st + 19)
        new_data = data.format(len(payload), payload)
        sock.sendall(new_data.encode('utf-8'))
        ans = []
        while True:
            packet = sock.recv(4096)  # read at most 4096 bytes per call
            if not packet:
                break
            ans.append(packet)
        sock.close()
        print(b''.join(ans))
    except Exception as e:
        print(e)
        return e
        
if __name__ == "__main__":
    spider(20)

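Pitfall 1: the response is too long to arrive in one read. A single recv call returns at most one buffer's worth of data (4096 bytes in the loop below), so read in a loop and join the pieces:

ans = []
while True:
    packet = sock.recv(4096)
    if not packet:
        break
    ans.append(packet)
print(b''.join(ans))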

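Pitfall 2: because the request sends Accept-Encoding: gzip, deflate, the response body comes back compressed. Either delete that header or decompress the body.
The gzip library makes it easy to decompress these garbled-looking bytes:
[Screenshot: the compressed response body]

gzip.decompress(data)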
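A small sketch of the decompression step. It checks for the gzip magic bytes (0x1f 0x8b) first, so the same code also works when the Accept-Encoding header has been removed and the body arrives uncompressed. maybe_gunzip is a hypothetical helper name, and the sample body is made up for the demonstration.

```python
import gzip


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress body if it starts with the gzip magic bytes, else return it as-is."""
    if body[:2] == b"\x1f\x8b":
        return gzip.decompress(body)
    return body


# Made-up body, compressed locally to stand in for a server response
raw = '{"mems": []}'.encode("utf-8")
packed = gzip.compress(raw)
print(maybe_gunzip(packed))
print(maybe_gunzip(raw))
```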

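That leads to the next problem: the body has CRLF-delimited lines mixed into it (highlighted in yellow in the screenshot), so decompression fails. The cause is Transfer-Encoding: chunked. One fix is to send the request as HTTP/1.0; otherwise the chunks have to be reassembled according to the chunked format, which alternates a length line with a block of data.
[Screenshot: chunked response body, alternating length and data]
In practice it is enough to join every other piece, that is, the data lines. Note that this shortcut assumes the compressed data itself never contains a CRLF sequence.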
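For cases where the data may itself contain CRLF bytes, a size-based decoder is safer than splitting on CRLF. The sketch below follows the chunked format described above (a hexadecimal length line, then that many bytes of data, then CRLF, ending with a zero-length chunk). decode_chunked is a hypothetical helper, not part of the original code, and it ignores trailers after the final chunk.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble a Transfer-Encoding: chunked body by reading chunk sizes."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size is hexadecimal; anything after ';' is a chunk extension
        size = int(body[pos:line_end].split(b";", 1)[0], 16)
        if size == 0:
            break  # a zero-length chunk terminates the body
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the data and its trailing CRLF
    return bytes(out)


# The second chunk contains a CRLF inside its data on purpose
sample = b"4\r\nWiki\r\n7\r\npe\r\ndia\r\n0\r\n\r\n"
decoded = decode_chunked(sample)
print(decoded)
```

Splitting the same sample on CRLF and taking every other piece would misread the second chunk, because its data contains a CRLF.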

import gzip
import json
import socket
import ssl
data = """POST /cgi-bin/qun_mgr/search_group_members HTTP/1.1
Host: qun.qq.com
Connection: close
Content-Length: {}
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Origin: https://qun.qq.com
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Sec-Fetch-Dest: empty
Accept-Encoding: gzip, deflate
Referer: https://qun.qq.com/member.html
Accept-Language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6
Cookie: pgv_info=ssid=s7500978733; pgv_pvid=6255209338; RK=HXRwWTX7cF; ptcz=d984bcdebe497afb091dcc5b7c39aaa61f9df551a80419249356ef5afed70d; tvfe_boss_uuid=827a0ccf9909f9c; ptui_loginuin=24490; pac_uid=1_2497490; iip=0; o_cookie=247490; p_uin=o029265; traceid=40647b914f; uid=36998429; uin=o0988665; skey=@KAxFH3T0x; pt4_token=m1oMuANPciwVFb-BpJCNAbHlBU8Zw9aIWwjrTPZwlQ_; p_skey=lZDWSFkqVoWPB5PIRpH*DmnKgUtDL8Rso2vk4Tm10I_

{}"""
data = data.replace('\n','\r\n')

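def spider(st):
    try:
        # ssl.wrap_socket() was removed in Python 3.12; use an SSLContext instead
        context = ssl.create_default_context()
        raw_sock = socket.create_connection(('qun.qq.com', 443))
        sock = context.wrap_socket(raw_sock, server_hostname='qun.qq.com')  # for https
        payload = "gc=546786372&st={}&end={}&sort=0&bkn=323497478".format(st, st + 19)
        new_data = data.format(len(payload), payload)
        sock.sendall(new_data.encode('utf-8'))
        ans = []
        while True:
            packet = sock.recv(4096)  # read at most 4096 bytes per call
            if not packet:
                break
            ans.append(packet)
        sock.close()

        # Split the headers from the body once, then split the chunked body on CRLF
        chunked = b''.join(ans).split(b'\r\n\r\n', 1)[1].split(b'\r\n')
        # Take items 1, 3, 5, 7, ... (the data lines; even items are chunk lengths)
        res = b''.join(chunked[1::2])
        res = gzip.decompress(res)

        res = json.loads(res)
        # 'mems' may be missing from the response, so fall back to an empty list
        for mem in res.get('mems', []):
            print(mem['card'])
    except Exception as e:
        print(e)
        return e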
        
if __name__ == "__main__":
    for i in range(0,500,20):
        spider(i)

PS: Replace the cookies in the code with your own. For safety, all the cookies shown above have had characters added and removed.
