Parsing woff fonts with selenium + browsermobproxy

我终于有blog了

Categories: Python, web scraping

Article link: https://blog.csdn.net/qq_29493353/article/details/82019372

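The problem this time: the page uses a woff font file to obfuscate its HTML elements, so the data a crawler scrapes is wrong. To decode it, the woff file itself has to be fetched.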

selenium does not currently seem able to capture network traffic, so it needs a proxy working alongside it to intercept the requests.

Once started, this Python-driven proxy server has to run locally alongside the code; it cannot be accessed remotely.

Here is the code:

1. Start the proxy service

The browsermobproxy source is on GitHub. Run python setup.py install directly, then move the generated folder of .py files into site-packages.

Then go to https://github.com/lightbody/browsermob-proxy/releases and download a release to get the browsermob-proxy.bat launcher.

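# proxyServer.py (the module that step 2 imports proxy from)
from browsermobproxy import Server

# Path to the launcher script from the downloaded browsermob-proxy release
server = Server(r'D:\Downloads\browsermob-proxy-2.0-beta-6\bin\browsermob-proxy.bat')
server.start()
print(server.url)
# Proxy client object; other modules import and share it
proxy = server.create_proxy()
print('Proxy server started')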

2. Configure selenium to use the proxy

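from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from proxyServer import proxy

chrome_options = Options()
chrome_options.add_argument('--ignore-certificate-errors')
# Route all browser traffic through the browsermob proxy (proxy.proxy is "host:port")
chrome_options.add_argument('--proxy-server={0}'.format(proxy.proxy))
chrome_options.add_argument('--disable-gpu')
# Selenium 3 style arguments; Selenium 4 takes options= and a Service object instead
browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe",
                           chrome_options=chrome_options)
browser.maximize_window()

# url is the page to scrape and must be defined by the caller
proxy.new_har(url)  # start recording a HAR before the page loads
browser.get(url=url)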

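import os

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

from fontTools.ttLib import TTFont  # used when parsing the font
import xml.etree.ElementTree as et  # used when reading the dumped XML

import manageWoff  # the author's own module (not shown); parse_woff() builds the mapping


def get_woff_number(woff_dict):
    """Find the obfuscating font in the recorded HAR, download it, and return its mapping."""
    woff_url = None
    for entry in proxy.har['log']['entries']:
        url = entry['request']['url']
        if 'font-awesome-qxb' in url and 'woff2' in url:
            woff_url = url
    if woff_url is None:
        return woff_dict

    print(woff_url)
    path = "./woff/" + str(woff_url).split('/')[-1]
    # If the file already exists, it was downloaded and parsed on an earlier call
    if os.path.exists(path):
        if len(woff_dict) == 0:
            woff_dict = manageWoff.parse_woff(path)
        return woff_dict
    # First time this font is seen: save it (the ./woff directory must exist), then parse it
    with open(path, "wb") as font_file:
        font_file.write(urlopen(woff_url).read())
    return manageWoff.parse_woff(path)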
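
The loop above matches one hard-coded font name. As a rough sketch of a more general filter (Python 3), the function below walks a HAR-shaped dict and collects anything that looks like a web font. The name find_font_urls and the sample data are made up for illustration, and it assumes the proxy fills in response.content.mimeType, with a fallback to the URL suffix when it does not.

```python
def find_font_urls(har):
    """Return the URLs of all entries in a HAR dict that look like web fonts."""
    urls = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry["request"]["url"]
        # mimeType may be missing or generic, so the URL suffix is checked as well
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        path = url.split("?", 1)[0].lower()
        if mime.startswith("font/") or path.endswith((".woff", ".woff2")):
            urls.append(url)
    return urls


# Hand-written sample in the same shape as proxy.har (illustrative data only)
sample_har = {"log": {"entries": [
    {"request": {"url": "http://example.com/index.html"},
     "response": {"content": {"mimeType": "text/html"}}},
    {"request": {"url": "http://example.com/fonts/a.woff2?v=3"},
     "response": {"content": {"mimeType": "font/woff2"}}},
    {"request": {"url": "http://example.com/fonts/b.woff"},
     "response": {"content": {"mimeType": "application/octet-stream"}}},
]}}

font_urls = find_font_urls(sample_har)
print(font_urls)
```

With the real proxy this would be called as find_font_urls(proxy.har) after the page has loaded.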

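This proxy is a global variable, so all of the code shares the same one. The woff file is parsed with Python's fontTools package: dump it to XML, then read the elements out of it. You have to work out which glyph is which digit or letter yourself and save that in a dictionary, so the values can be substituted later.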
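
The parsing code itself is not shown here, so what follows is only a sketch (Python 3) of how the XML step might look, not the author's manageWoff. It assumes the font has already been dumped to XML (fontTools can do this with TTFont(path).saveXML(xml_path)) and that the dump contains a cmap table with map elements. The sample XML, the glyph names, and the GLYPH_TO_CHAR dictionary are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a fontTools XML dump (illustrative only)
SAMPLE_TTX = """<ttFont>
  <cmap>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0xe001" name="glyph_a"/>
      <map code="0xe002" name="glyph_b"/>
    </cmap_format_4>
  </cmap>
</ttFont>"""

# Built by hand after looking at what each glyph actually draws (hypothetical values)
GLYPH_TO_CHAR = {"glyph_a": "7", "glyph_b": "3"}


def build_translation(ttx_xml, glyph_to_char):
    """Map each obfuscated character in the page to the real character it displays."""
    table = {}
    for node in ET.fromstring(ttx_xml).iter("map"):
        code_point = int(node.get("code"), 16)  # e.g. "0xe001" -> 0xE001
        name = node.get("name")
        if name in glyph_to_char:
            table[chr(code_point)] = glyph_to_char[name]
    return table


def decode(text, table):
    """Replace obfuscated characters; leave everything else untouched."""
    return "".join(table.get(ch, ch) for ch in text)


table = build_translation(SAMPLE_TTX, GLYPH_TO_CHAR)
decoded = decode("\ue001\ue002", table)
print(decoded)
```

Keeping the hand-built dictionary keyed by glyph name rather than by code point means it only survives a font change if the site keeps the glyph names stable, which is worth checking per site.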

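Note: driven from Python like this, the proxy could not intercept HTTPS requests for me. If you run into an HTTPS site, Python is out of luck and the project has to be rewritten in Java.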

You can also try mitmproxy, which can capture HTTPS traffic as long as its certificate is trusted locally.
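
As a rough idea of what the mitmproxy route could look like, here is a minimal addon script sketch (Python 3). mitmproxy calls a module-level response(flow) hook for each completed response; the file name save_fonts.py, the FONT_DIR location, and the suffix check are assumptions for illustration, not something taken from the original setup.

```python
import os

FONT_DIR = "./woff"  # hypothetical output directory
FONT_SUFFIXES = (".woff", ".woff2")


def is_font_url(url):
    """True if the URL path (ignoring any query string) ends in a web font suffix."""
    return url.split("?", 1)[0].lower().endswith(FONT_SUFFIXES)


def response(flow):
    """mitmproxy hook: save the body of every font response to FONT_DIR."""
    url = flow.request.pretty_url
    if not is_font_url(url):
        return
    os.makedirs(FONT_DIR, exist_ok=True)
    file_name = url.split("?", 1)[0].rsplit("/", 1)[-1]
    with open(os.path.join(FONT_DIR, file_name), "wb") as font_file:
        font_file.write(flow.response.content)
```

Saved as save_fonts.py, it would be loaded with mitmdump -s save_fonts.py, with the browser pointed at the mitmproxy port instead of the browsermob one.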

Examples: http://www.site-digger.com/html/articles/20180821/653.html

https://cuiqingcai.com/5391.html

https://github.com/ring04h/wyproxy