Scraping WeChat Official Account Articles from Sogou's WeChat Search

Reposted from: http://blog.csdn.net/mr_guo_lei/article/details/78570744

1. Log in through a simulated browser (Selenium) to obtain cookies.

2. Call requests.get() with those cookies attached.

3. Counter the anti-scraping measures (currently proxy IPs plus sleeps; Sogou's playbook is IP bans, cookie bans, and extra scrutiny for repeat offenders).

(Choose the proxy IPs and sleep intervals according to your actual situation; see the sketch after this list.)
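
Point 3 boils down to routing each request through a randomly chosen proxy and pausing between requests. Below is a minimal sketch of that pattern in isolation, assuming (as the full script further down does) that main_function() from get_ip_pools returns a list of requests-style proxy dicts; polite_get() is just an illustrative name, not part of the original script.

import random
import time

import requests

from get_ip_pools import main_function

alright_proxys = main_function()  #e.g. [{"http": "http://1.2.3.4:8080"}, ...]

def polite_get(url, **kwargs):
    #pick a random proxy for this request, then sleep so Sogou is less
    #likely to ban the IP or the cookie
    proxy = alright_proxys[random.randint(0, len(alright_proxys) - 1)]
    response = requests.get(url, proxies=proxy, **kwargs)
    time.sleep(random.uniform(3, 8))
    return response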


from selenium import webdriver
import requests
import time
from bs4 import BeautifulSoup
import re
import os
from mysql_py import *
import threading
from urllib import request
from get_ip_pools import *
import random
  
#get cookie: open Sogou WeChat search in a real browser, let the user log in,
#then copy the browser cookies into a plain dict usable by requests
def get_cookies():
    driver = webdriver.Chrome()
    driver.get("http://weixin.sogou.com/")

    driver.find_element_by_xpath('//*[@id="loginBtn"]').click()
    time.sleep(10)  #give the user time to finish logging in (e.g. scan the QR code)

    cookies = driver.get_cookies()
    cookie = {}
    for item in cookies:
        cookie[item.get('name')] = item.get('value')
    return cookie
  
#url = "http://weixin.sougou.com"  
#response = requests.get(url,cookies = cookie)  
#search = input("输入你想搜索的关键词")  
  
#get total url: normalize a scheme-relative or site-relative URL into an absolute one
def get_total_url(url):
    if url.startswith("//"):
        #scheme-relative, e.g. //mmbiz.qpic.cn/...
        url = "http:" + url
    elif url.startswith("/"):
        #site-relative path; assume it belongs to the Sogou host
        url = "http://weixin.sogou.com" + url
    return url
  
#init header (no explicit Host entry: requests fills Host in per URL, and the
#article pages live on a different host than weixin.sogou.com)
header = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate',
    'Accept-Language':'zh-CN,zh;q=0.9',
    'Connection':'keep-alive',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
    }
  
#init proxies: main_function() comes from get_ip_pools and is expected to
#return a list of requests-style proxy dicts
alright_proxys = main_function()
  
#get total page num: Sogou shows the total hit count in a div with class "mun";
#at 10 results per page that gives the page count
def get_page_count(search,cookie):
    global header
    page_source = requests.get(
        "http://weixin.sogou.com/weixin?query=%s&type=2&page=1" % search,
        cookies = cookie,
        headers = header,
        proxies = alright_proxys[random.randint(0,len(alright_proxys)-1)]
        ).content
    bs_obj = BeautifulSoup(str(page_source,encoding = "utf-8"),"html.parser")
    item_count_str = bs_obj.find("div",{"class":"mun"}).text
    pattern = re.compile(r'\d+')
    total_item_count = pattern.findall(item_count_str.replace(",",""))[0]
    page_count = int(int(total_item_count)/10)
    return page_count
  
#check path  
def check_mkdir(path):  
    if not os.path.exists(path):  
        try:  
            os.makedirs(path)  
        except Exception:  
            pass  
          
#download img: fetch one article page, pull out every <img>, and record each
#image URL in MySQL (the actual file download is left commented out)
def get_img(url,num,connect,cursor):
    global alright_proxys
    response = requests.get(url,headers = header).content  #article page, not Sogou itself
    content = str(response,encoding = "utf-8")
    bs_obj = BeautifulSoup(content,"html.parser")
    img_list = bs_obj.findAll("img")
    count = 0
    for img in img_list:
        try:
            imgurl = get_total_url(img.attrs["data-src"])
            store_name = "%s"%num + "%s"%count  #article index + image index
            path = r"C:\Users\Mr.Guo\Pictures\weixin"
            check_mkdir(path)
            #urllib.request.urlretrieve(imgurl,r"C:\Users\Mr.Guo\Pictures\weixin\%s.jpeg" %store_name)
            insert_into_table(connect,cursor,store_name,imgurl)  #insert_into_table comes from mysql_py
            count += 1
            time.sleep(5)
        except Exception:
            pass
  
#main function: walk every result page, collect the article links, and hand
#each article to get_img on its own thread
def main_fun(page_count,search,cookie,connect,cursor):
    global header
    for i in range(page_count):
        page_source = requests.get(
            "http://weixin.sogou.com/weixin?query=%s&type=2&page=%s"%(search,i + 1),
            cookies = cookie,
            headers = header,
            proxies = alright_proxys[random.randint(0,len(alright_proxys)-1)]
            ).content
        bs_obj = BeautifulSoup(str(page_source,encoding = "utf-8"),"html.parser")
        url_list = bs_obj.findAll("div",{"class":"txt-box"})
        final_url_list = []
        for url in url_list:
            final_url_list.append(url.h3.a.attrs['href'])
        for url_num in range(len(final_url_list)):
            t = threading.Thread(target = get_img,args = (final_url_list[url_num],url_num,connect,cursor))
            #time.sleep(3)
            t.start()
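
The original post stops at main_fun and never shows the driver code that ties the pieces together. A minimal sketch of how they could be wired up (connect_to_db() is a hypothetical stand-in for whatever connection helper mysql_py actually exposes, so adjust it to your own module):

if __name__ == "__main__":
    cookie = get_cookies()                              #step 1: log in via the browser and grab the cookies
    search = input("Enter the keyword to search for: ")
    connect, cursor = connect_to_db()                   #hypothetical helper from mysql_py
    page_count = get_page_count(search, cookie)         #step 2: search with the cookies attached
    main_fun(page_count, search, cookie, connect, cursor)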