[Python Crawler] Scraping 权律二 Stickers and Wallpapers from a WeChat Official Account

Latest recommended article published 2024-05-31 14:21:08

Thorrrrrrrrrr

Categories: Python crawlers, Python. Tags: crawler, xpath, requests-HTML, regular expressions

Article link: https://blog.csdn.net/sinat_33487968/article/details/80926654

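Sogou's search engine can search WeChat official accounts. I had not written a crawler in a long time, but I recently bought Cui's book 《python网络爬虫开发实战》, and it took me back to the excitement of a year ago, when I was first learning to scrape. As a small warm-up, this post uses a few basic tools (requests-html, XPath, urllib.request, and regular expressions) to grab some stickers and wallpapers.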

First, a look at the result.

Here is the source code. With a few small changes it can scrape other content as well.

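import os
import re
import ssl
import time
import urllib.request

from requests_html import HTMLSession

# Disable HTTPS certificate verification for urllib. This is insecure and
# relies on private ssl functions; acceptable only for a throwaway script.
ssl._create_default_https_context = ssl._create_unverified_context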


def getData(url):
    # Pose as a browser by sending a browser User-Agent header
    headers = ("User-Agent",
               "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally so that later urlopen() and urlretrieve()
    # calls send the same header
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return data

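def getcontent(url):
    data = getData(url)
    # Regular expression that extracts the sticker image URLs (data-src attributes)
    stickerpat = r'<img.*?data-src="(.*?)"'
    stickerlist = re.compile(stickerpat, re.S).findall(data)
    # Regular expression that extracts the article title (first <h2> element)
    titlepat = r'<h2.*?>(.*?)</h2>'
    title = re.compile(titlepat, re.S).findall(data)
    # Strip newlines and '|' so the title can be used as a folder name
    title = title[0].replace('\n', '').replace('|', '').strip()
    return stickerlist, title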


def download(stickerlist, title):
    path = title
    number = 1
    for sticker in stickerlist:
        # Only URLs ending in 'gif' or 'jpeg' are downloaded; others are skipped
        for ext in ('gif', 'jpeg'):
            if sticker.endswith(ext):
                filename = os.path.join(path, str(number) + '.' + ext)
                print("Downloading:", filename)
                urllib.request.urlretrieve(sticker, filename=filename)
                time.sleep(1)  # pause one second between downloads
                number += 1
                break

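def creatDir(title):
    # Returns True only when the folder is newly created, so articles that
    # already have a folder are not downloaded again
    isExists = os.path.exists(title)
    if not isExists:
        os.makedirs(title)
        print(title + ' created')
        return True
    return False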

def getUrlList():
    session = HTMLSession()
    # Walk through result pages 1 to 10
    for page in range(1, 11):
        url = 'http://weixin.sogou.com/weixin?query=%E6%9D%83%E5%BE%8B%E4%BA%8C&_sug_type_=&sut=4989&lkt=1%2C1530759390068%2C1530759390068&s_from=input&_sug_=y&type=2&sst0=1530759390170&page='+str(page)+'&ie=utf8&w=01019900&dr=1'
        time.sleep(5)  # pause between result pages
        r = session.get(url)
        dom = r.html
        print(dom)
        # The title links on a result page carry ids ending in _0 to _9
        for i in range(10):
            try:
                result = dom.xpath('//*[@id="sogou_vr_11002601_title_'+str(i)+'"]//@href')
                time.sleep(5)
                print(i, result)
                stickerlist, title = getcontent(result[0])
                if creatDir(title):
                    download(stickerlist, title)
            except Exception as e:
                # Report the failure instead of hiding it, then try the next result
                print('Skipping result', i, 'on page', page, ':', e)
                continue





if __name__ == '__main__':
    getUrlList()
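
To point the crawler at a different keyword, the search URL in getUrlList could be built instead of hard-coded. The following is only a sketch: `build_search_url` is a hypothetical helper that is not part of the original script, and it assumes that `query`, `type`, `page` and `ie` are the only parameters that matter. The original URL also sends `sut`, `lkt`, `sst0` and a few others, and whether Sogou accepts the shorter form has not been tested here.

```python
from urllib.parse import urlencode, unquote

SEARCH_BASE = 'http://weixin.sogou.com/weixin'  # same endpoint as in getUrlList


def build_search_url(query, page):
    """Hypothetical helper: build a search URL for `query` at result `page`.

    Assumption: these four parameters are sufficient. The original URL
    also sends sut, lkt, sst0 and others, which are omitted here.
    """
    params = {
        'query': query,  # urlencode percent-encodes this as UTF-8
        'type': 2,       # value copied from the original URL
        'page': page,
        'ie': 'utf8',
    }
    return SEARCH_BASE + '?' + urlencode(params)


# The keyword hard-coded in the original URL decodes to the search term.
decoded = unquote('%E6%9D%83%E5%BE%8B%E4%BA%8C')
print(decoded)
print(build_search_url(decoded, 1))
```

Swapping the first argument for another keyword is then the only change needed to search for something else, provided the assumption about the parameters holds.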

Code

While I am at it, here is a quick review of the basics. I will polish this properly once the summer break arrives.

Regular Expression Basics

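Basics 1:
Global matching function, usage:	re.compile(pattern).findall(source_string)

Ordinary characters	match themselves
\n	matches a newline
\t	matches a tab
\w	matches a letter, digit, or underscore
\W	matches any character other than a letter, digit, or underscore
\d	matches a decimal digit
\D	matches any character other than a decimal digit
\s	matches a whitespace character
\S	matches any character other than whitespace
[ab89x]	character set, matches any one of a, b, 8, 9, x
[^ab89x]	negated character set, matches any one character other than a, b, 8, 9, x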
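
The entries above can be tried directly with the `re.compile(...).findall(...)` form. A small sketch follows; the sample string is made up for illustration.

```python
import re

# Made-up sample containing letters, an underscore, digits, a space and a tab
text = "ab_9 x8\tQ"

digits = re.compile(r"\d").findall(text)         # decimal digits
spaces = re.compile(r"\s").findall(text)         # whitespace characters
non_word = re.compile(r"\W").findall(text)       # not letter, digit or underscore
in_set = re.compile(r"[ab89x]").findall(text)    # any one of a, b, 8, 9, x
not_in_set = re.compile(r"[^ab89x]").findall(text)  # anything except those five

for name, found in [("\\d", digits), ("\\s", spaces), ("\\W", non_word),
                    ("[ab89x]", in_set), ("[^ab89x]", not_in_set)]:
    print(name, found)
```

Note that the underscore counts as a word character, so it shows up under `[^ab89x]` but not under `\W`.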

Example 1:
Source string: "aliyunedu"
Pattern: "yu"
What matches?	yu


Source string: '''aliyun
edu'''
Pattern: "yun\n"
What matches?	yun\n

Source string: "aliyu89787nedu"
Pattern: "\w\d\w\d\d\w"
What matches?	u89787


Source string: "aliyu89787nedu"
Pattern: "\w\d[nedu]\w"
What matches?	87ne
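
The four examples can be checked in one go. This is a short sketch that simply feeds each source string and pattern from above through `findall`:

```python
import re

# (source string, pattern) pairs taken from the examples above
examples = [
    ("aliyunedu", r"yu"),
    ("aliyun\nedu", r"yun\n"),
    ("aliyu89787nedu", r"\w\d\w\d\d\w"),
    ("aliyu89787nedu", r"\w\d[nedu]\w"),
]

results = [re.compile(pattern).findall(source) for source, pattern in examples]

for (source, pattern), found in zip(examples, results):
    # repr() makes the newline in the second example visible as \n
    print(repr(pattern), "->", repr(found))
```

In the last example the match has to end up at "87ne" because `[nedu]` needs a digit immediately before it, and the only digit followed by one of those letters is the final 7.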


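Basics 2:
.	matches any single character except a newline
^	matches the start position
$	matches the end position
*	the preceding character occurs 0, 1, or more times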
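
A brief sketch of these four metacharacters, using made-up strings. The last two lines also show why the crawler's patterns are compiled with `re.S`: without that flag `.` stops at a newline, and HTML tags often span several lines.

```python
import re

line = "aaab"
star = re.compile(r"a*b").findall(line)       # zero or more 'a', then 'b'
star_zero = re.compile(r"a*b").findall("b")   # '*' also allows zero 'a'
at_start = re.compile(r"^a").findall(line)    # 'a' only at the start position
at_end = re.compile(r"b$").findall(line)      # 'b' only at the end position

two_lines = "b\nc"
dot_plain = re.compile(r"b.c").findall(two_lines)         # '.' does not match '\n'
dot_all = re.compile(r"b.c", re.S).findall(two_lines)     # with re.S it does

print(star, star_zero, at_start, at_end)
print(dot_plain, dot_all)
```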
