Scraping WeChat Official Account Articles


张峰π_π

Categories: web scraping, python, learning

Article link: https://blog.csdn.net/qq_42370313/article/details/88391610


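This scraper collects WeChat Official Account articles through Sogou's WeChat article search: it builds the URLs of the search result pages and picks out the articles that match each keyword.
https://weixin.sogou.com/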

Overview of the code

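Frequent requests easily get your IP banned, so I route the traffic through Abuyun's dynamic proxy IPs. Take a look if you need the same:
https://www.abuyun.com/http-proxy/dyn-manual.html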
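
As a rough illustration of the proxy setup, the sketch below builds the `proxies` dictionary that `requests` expects from a tunnel host, port, and credentials. The helper name `build_proxies` and the environment variable names `ABUYUN_USER` and `ABUYUN_PASS` are assumptions made for this example; they are not part of Abuyun's documentation.

```python
import os
from urllib.parse import quote


def build_proxies(user, password, host="http-dyn.abuyun.com", port=9020):
    """Return a requests-style proxies dict for an authenticated HTTP tunnel.

    The credentials are percent-encoded so that characters such as '@' or ':'
    inside them cannot break the proxy URL.
    """
    proxy_url = "http://{}:{}@{}:{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port
    )
    # The same tunnel is used for both plain and TLS traffic.
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical variable names: keep real credentials out of the source file.
proxies = build_proxies(
    os.environ.get("ABUYUN_USER", "user"),
    os.environ.get("ABUYUN_PASS", "pass"),
)
```

The resulting dictionary can be passed as `requests.get(url, proxies=proxies)`.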
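My code reads the keywords and their corresponding ids from a database, and the scraper builds the search URLs from them itself. This uses the third-party package pymysql, which I won't cover in detail here.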
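I only keep articles that were published within the last day and that contain the keyword you need; dedicated functions perform these checks.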
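
The "within the last day" check can be sketched with plain `date` objects, which compare correctly across month and year boundaries without any hand-rolled day arithmetic. The function name `is_recent` and its parameters are assumptions for this example.

```python
from datetime import date, datetime, timedelta


def is_recent(date_str, days_back=1, today=None):
    """Return True if date_str ('YYYY-MM-DD') is no older than days_back days.

    `today` can be passed in explicitly, which keeps the function easy to test.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=days_back)
    published = datetime.strptime(date_str, "%Y-%m-%d").date()
    return published >= cutoff
```

For example, with `today=date(2019, 1, 1)` and `days_back=1`, an article dated `2018-12-31` is kept and one dated `2018-12-30` is dropped.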
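On Sogou WeChat, the Referer field in the request headers changes along with the results page, so whenever you move to a different page the headers have to be updated to match that page's URL.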
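
A minimal sketch of that idea: build the search URL for a given keyword and page, then set the Referer to that same URL. The query parameters mirror the URL used in this post; the helper name `build_search_request` is an assumption for this example.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://weixin.sogou.com/weixin"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36"
)


def build_search_request(keyword, page):
    """Return (url, headers) for one Sogou WeChat search results page."""
    params = {
        "usip": "",
        "query": keyword,
        "ft": "",
        "tsn": 1,
        "et": "",
        "interation": "",
        "type": 2,
        "wxid": "",
        "page": page,
        "ie": "utf8",
    }
    # quote_via=quote encodes spaces as %20, like urllib.parse.quote does.
    url = SEARCH_URL + "?" + urlencode(params, quote_via=quote)
    # The Referer is rebuilt together with the URL, so it follows the page.
    headers = {"User-Agent": USER_AGENT, "Referer": url}
    return url, headers
```

Calling `build_search_request(keyword, page)` again for each new page keeps the URL and the Referer in step.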

Now for the code.

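from lxml import etree  # XPath parsing
import datetime
import time
import urllib.parse

import numpy as np
import pandas as pd
import pymysql
import requests
from bs4 import BeautifulSoup

# Lower bound (yesterday) for the history read back from the database for de-duplication.
# The window is short because only recent articles can be scraped twice.
cuttime1 = (datetime.datetime.now() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
cut = 1  # how many days back to scrape; only used when types is False
types = False  # False: scrape by day (the last `cut` days); True: "hourly" mode, which keeps only today's articles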
def timea(time_1):  # date check: is time_1 ('YYYY-MM-DD') inside the scraping window?
    if types:
        cut_date = datetime.date.today()
    else:
        cut_date = datetime.date.today() - datetime.timedelta(days=cut)
    # Compare real dates so the check stays correct across month and year boundaries.
    # Only the date is compared; there is no hour-level check even when types is True.
    article_date = datetime.datetime.strptime(time_1, '%Y-%m-%d').date()
    return article_date >= cut_date
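def get_data(url):
    # headers and proxies are module-level globals set in the main loop below
    r = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    r.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response, does nothing on success
    r.encoding = 'utf-8'
    return r.text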
# Database connection settings; connect to the database
dbconn = pymysql.connect(
    host="127.0.0.1",
    database="yuqingyuebao",
    password="123456",
    user="root",
    port=3306,
    charset='utf8mb4'
)
cur = dbconn.cursor()
key_word = []
sch_id = []
sql_use = 'SELECT sch_id,key_word FROM key_word'  # read the keywords and their school ids
sql_stu_data = pd.read_sql(sql_use, dbconn)
main_list = np.array(sql_stu_data).tolist()  # DataFrame -> ndarray -> list of rows
for row in main_list:
    sch_id.append(row[0])
    key_word.append(row[1])
# Read the recently scraped rows back from the database for de-duplication
sql_use = 'SELECT sch_id,title FROM sougou_weixin WHERE time>=%s'
sql_stu_data = pd.read_sql(sql_use, dbconn, params=[cuttime1])
main_list = np.array(sql_stu_data).tolist()
old_new = []  # sch_id + title keys of rows already stored, used for de-duplication
for row in main_list:
    old_new.append(row[0] + row[1])
for j in range(len(key_word)):
    a = 1  # loop flag: keep paging while it is truthy
    web_page = 1
    while a:
        all_nb = 0    # results seen on this page
        right_nb = 0  # results saved from this page
        url_all = 'https://weixin.sogou.com/weixin?usip=&query=' + urllib.parse.quote(key_word[j]) + '&ft=&tsn=1&et=&interation=&type=2&wxid=&page=' + str(web_page) + '&ie=utf8'
        # The Referer is rebuilt for every page so that it follows the page URL
        headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36',
                   "Referer": url_all}

        # Proxy server
        proxyHost = "http-dyn.abuyun.com"
        proxyPort = "9020"

        # Proxy tunnel credentials. The account has to be purchased, so it is masked here;
        # buy your own, or use another service such as Mogu proxy.
        proxyUser = "**888***8*888"
        proxyPass = "************8*"

        proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
            "host": proxyHost,
            "port": proxyPort,
            "user": proxyUser,
            "pass": proxyPass,
        }

        proxies = {
            "http": proxyMeta,
            "https": proxyMeta,
        }
        try:
            data = get_data(url_all)
            selector = etree.HTML(data)
            url_list = selector.xpath('//div[@class="txt-box"]/h3/a/@href')  # link to each article
            content_source = selector.xpath('//div[@class="s-p"]/a/text()')  # official account the article comes from
            content_time = selector.xpath('//div[@class="s-p"]//script/text()')  # raw script text holding the publish time
            for i in range(len(content_time)):
                all_nb = all_nb + 1
                content_time1 = content_time[i].split('(')[2]
                content_time2 = int(content_time1[1:11])  # 10-digit Unix timestamp
                time_local = time.localtime(content_time2)  # convert to local time
                content_time_1 = time.strftime("%Y-%m-%d", time_local)
                if timea(content_time_1):
                    title_1 = selector.xpath('//div[@class="txt-box"]/h3')[i]
                    title2 = title_1.xpath('string(.)').replace('\n', '')
                    if not (sch_id[j] + title2 in old_new):  # de-duplication check
                        matched = key_word[j] in title2  # first try the title
                        if not matched:
                            # Articles are hosted on different pages with different layouts,
                            # so the whole page text is parsed for the keyword check.
                            data1 = get_data(url_list[i])
                            soup = BeautifulSoup(data1, 'lxml')
                            content = soup.get_text()
                            matched = key_word[j] in content
                        if matched:  # save to the database
                            print(title2)
                            right_nb = right_nb + 1
                            cur.execute("INSERT INTO sougou_weixin(`weixin_id`,`sch_id`,`url`,`time`,`title`) VALUES (%s,%s,%s,%s,%s)", (content_source[i], sch_id[j], url_list[i], content_time_1, title2))
                            dbconn.commit()
                            old_new.append(sch_id[j] + title2)  # newly saved rows also join the de-duplication list

            # Decide whether to turn the page: go on only if the page had results and
            # at least two thirds of them were saved (presumed intent of the 2/3 rule).
            if all_nb > 0 and right_nb >= all_nb * 2 / 3:
                web_page = web_page + 1  # move on to the next page
            else:
                a = 0
        except Exception as e:  # report the error and stop paging for this keyword
            print('failed on', url_all, e)
            a = 0
cur.close()  # close the cursor and the database connection
dbconn.close()
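
The publish-time step can also be sketched on its own. The snippet below assumes the script text on the results page looks like `document.write(timeConvert('1551960000'))`, which is what the `split('(')` and slice in the code above imply. It pulls the 10-digit timestamp out with a regular expression and converts it to a date in UTC+8. The function name `parse_publish_date` and the fixed UTC+8 offset are assumptions for this example; the code above uses the machine's local time zone instead.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: article dates should be read in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

# A run of exactly ten digits, i.e. a Unix timestamp in seconds.
TIMESTAMP_RE = re.compile(r"(?<!\d)(\d{10})(?!\d)")


def parse_publish_date(script_text, tz=CST):
    """Return the publish date as 'YYYY-MM-DD', or None if no timestamp is found."""
    match = TIMESTAMP_RE.search(script_text)
    if match is None:
        return None
    published = datetime.fromtimestamp(int(match.group(1)), tz=tz)
    return published.strftime("%Y-%m-%d")
```

Returning `None` rather than raising lets the caller skip a result whose script text has an unexpected shape.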

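The time-check function in this post is still not very precise, and I will improve it later. Because every WeChat article is structured differently, extracting the body text is also difficult. Suggestions are welcome, especially for better ways to extract the body text.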
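
One possible direction for the body-text problem, offered as a sketch rather than a tested solution: restrict the keyword check to a single container element instead of the whole page. The example assumes the article body sits in an element with `id="js_content"`. That id is an assumption about the markup of WeChat article pages and should be verified against real pages. The class and function names are also made up for this example, and only the standard library is used.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerTextExtractor(HTMLParser):
    """Collect the text inside the first element that has the given id."""

    def __init__(self, container_id):
        super().__init__(convert_charrefs=True)
        self.container_id = container_id
        self.depth = 0       # > 0 while inside the container
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth == 0:
            if dict(attrs).get("id") == self.container_id:
                self.depth = 1
            return
        self.depth += 1
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_body_text(html, container_id="js_content"):
    """Return the container's text, one chunk per line ('' if it is missing)."""
    parser = ContainerTextExtractor(container_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def article_mentions(html, keyword, container_id="js_content"):
    """Check the keyword against the body text, or the raw page if no body is found."""
    body = extract_body_text(html, container_id)
    return keyword in (body or html)
```

The depth counter relies on tags being properly closed, so badly formed pages may need a more tolerant parser. When the container is missing, the check falls back to the raw page, which matches the whole-page behaviour of the code above only roughly.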
