Baijiahao Crawler (Fetching Creator appids by Field)

This post is a learning exercise in crawling and data analysis; the page-parsing approach is rather clumsy, and the post is kept mainly as a record.


Because of Baidu's restrictions, at most 760 ids can be retrieved per field (i.e. about 76 pages of 10 results each, given that the pn offset below advances in steps of 10).

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


from urllib.parse import quote
from urllib import request
from bs4 import BeautifulSoup
from urllib import error
from openpyxl import Workbook
import time

# A small pool of User-Agent headers, rotated across requests
hds = [{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'},
       {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'},
       {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'}]


# Once every account has been paged through, Baidu's search results wrap around
# and start over, so fetch the first account name up front and use it as the
# stopping criterion.
def name_first(field):
    url = 'https://www.baidu.com/sf?word=%E7%99%BE%E5%AE%B6%E5%8F%B7%2B'\
        + quote(field) + '&pd=cambrian_list&atn=index&title=%E7%99%BE%E5%AE%B6%E5%8F%B7%2B'\
        + quote(field) + '&lid=9080249029523443283&ms=1&frsrcid=206&frorder=1&pn=0&data_type=json%20---------------------%20'
    response_1 = request.urlopen(url).read().decode('utf-8')
    soup_1 = BeautifulSoup(response_1, 'lxml')
    name_1 = soup_1.find('div', class_=
        'c-color-link c-font-big sfc-cambrian-list-subscribe-title c-line-clamp1').string.strip()
    print(name_1)
    return name_1
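# Aside (a hedged sketch, not part of the original script): the hand-assembled
# query string above could also be built with urllib.parse.urlencode; the
# literal '%E7%99%BE%E5%AE%B6%E5%8F%B7%2B' is just the URL-encoded form of
# '百家号+'. Core parameters only, values copied from the URL above:
#   from urllib.parse import urlencode
#   params = {'word': '百家号+' + field, 'pd': 'cambrian_list', 'atn': 'index',
#             'title': '百家号+' + field, 'frsrcid': 206, 'frorder': 1, 'pn': 0}
#   url = 'https://www.baidu.com/sf?' + urlencode(params)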
    
# Write the collected account rows out to <field>.xlsx.
def appid_list_excel(appid_list, field):
    wb = Workbook()
    ws = wb.active
    ws.append(['name', 'field', 'appid', 'smallfont', 'vip_info'])
    for row in appid_list:
        ws.append(row)
    wb.save(field + '.xlsx')


# Scrape Baijiahao account info for one field from Baidu's search results.
def get_appid(field, name_1):
    # In the URL, pn=number is the result offset: each XHR request returns the
    # 10 accounts starting at pn, so page through in steps of 10.
    number = 0
    appid_list = []
    name = 'name'

    while number <= 10000 and name != name_1:

        url = 'https://www.baidu.com/sf?word=%E7%99%BE%E5%AE%B6%E5%8F%B7%2B'\
            + quote(field) + '&pd=cambrian_list&atn=index&title=%E7%99%BE%E5%AE%B6%E5%8F%B7%2B'\
            + quote(field) + '&lid=9080249029523443283&ms=1&frsrcid=206&frorder=1&pn='\
            + str(number) + '&data_type=json%20---------------------%20'

        try:
            req = request.Request(url, headers=hds[number % len(hds)])
            response = request.urlopen(req).read().decode('utf-8')
            soup = BeautifulSoup(response, 'lxml')
            subscribes = soup.find_all('div', class_="sfc-cambrian-list-subscribe")
        except error.HTTPError as e:
            print("HTTPError", e.code)
            subscribes = []  # skip this page rather than crash in the loop below
        except error.URLError as e:
            print("URLError", e.reason)
            subscribes = []
           
        for subscribe in subscribes:
            smallfont = subscribe.find('div', class_='c-font-small c-gray c-line-clamp1').string.strip()
            name = subscribe.find('div', class_=
                'c-color-link c-font-big sfc-cambrian-list-subscribe-title c-line-clamp1').string.strip()
            img_info = subscribe.find_all('img')  # the appid is embedded in the avatar image URL
            try:
                # avatar src ends in '_<appid>.jpeg', so slice between '_' and '.jpeg'
                appid_info = str(img_info[0])
                appid = appid_info[appid_info.find('_') + 1:appid_info.find('.jpeg')]
            except IndexError:
                appid = 'missing'
            try:
                # the second <img>, when present, is the VIP badge; grab 'vipN' from its URL
                vip_info = str(img_info[1])[str(img_info[1]).find('vip'):str(img_info[1]).find('vip') + 5]
            except IndexError:
                vip_info = 'none'
            if number >= 10 and name == name_1:
                break  # results have wrapped around to the first account
            appid_list.append([name, field, appid, smallfont, vip_info])
            
        number += 10
        print('%s==%d' % (field, number))
        time.sleep(1)  # throttle requests
        
    return appid_list

if __name__ == '__main__':
#    field_list = ['娱乐','体育','财经']
#    field_list = ['人文','科技','互联网','数码','社会']
#    field_list = ['汽车','房产','旅游','女人','情感','时尚','星座','美食','生活']
#    field_list = ['育儿','影视','音乐','动漫','搞笑','教育','文化','宠物','游戏','家居']
#    field_list = ['悦读','艺术','摄影','健康','养生','科学','三农','职场','综合','百科','学术']
    field_list = ['其它']
    for field in field_list:
        name_1 = name_first(field)
        appid_list = get_appid(field, name_1)
        appid_list_excel(appid_list, field)
    print('ok')
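The appid parsing above is, as admitted, crude string slicing over the serialized <img> tag. A regex anchored on the same assumed src shape ('..._<appid>.jpeg') is a little sturdier; a minimal sketch, where extract_appid is a hypothetical helper name, not part of the original script:

import re

def extract_appid(img_tag_html):
    # Hypothetical helper: pull the appid out of a serialized avatar <img> tag,
    # assuming the src ends in '_<appid>.jpeg' as the slicing code relies on.
    m = re.search(r'_([^_./]+)\.jpeg', img_tag_html)
    return m.group(1) if m else 'missing'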

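To sanity-check a finished run, the workbook written by appid_list_excel can be read back with openpyxl; a minimal sketch, assuming the 其它.xlsx file produced by the field list above:

from openpyxl import load_workbook

wb = load_workbook('其它.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):  # skip the header row
    name, field, appid, smallfont, vip_info = row
    print(appid, name)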