typographicposters.com is a foreign poster-gallery site with a rather interesting (hair-pulling) way of delivering its JSON data. Its categories are interesting too: you click an RGB color parameter to filter, so posters are categorized by their colors. There is quite a lot of data, and fetching it directly with requests doesn't work.
Target URL
https://www.typographicposters.com/?filter=recent
The next page of data only loads after manually clicking a "view more" button.
It looks like an Ajax (XHR) query behind the scenes.
Captured request data
Even with the full set of request headers added, the red-underlined error still shows up. A server error?
Fine, fine. Fall back to the dumbest method: Python + Selenium!
Step 1: grab the page source.
1. I'm not comfortable with the while-loop logic, so I used a for loop instead!
2. Errors are inevitable!!
3. Locate the next-page button with XPath:
find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]')
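As an aside on point 1 above: the while-loop version I dodged can actually be written cleanly. The trick is find_elements_by_xpath (plural), which returns an empty list instead of raising NoSuchElementException when the button is gone, so the exit condition writes itself. This is only a sketch under my own assumptions (function name and parameters are mine); it assumes any object exposing the old Selenium 3 find_elements_by_xpath API used in this post.

```python
import time

# Same "view more" button XPath as in the post
NEXT_BTN = '//div[@class="pagination"]/button[@class="button-highlight"]'

def click_through_pages(browser, max_pages=50, pause=3):
    """Click the 'view more' button until it disappears or max_pages is hit.

    find_elements_by_xpath (plural) returns an empty list instead of
    raising NoSuchElementException, so the while condition stays simple.
    """
    clicks = 0
    while clicks < max_pages:
        buttons = browser.find_elements_by_xpath(NEXT_BTN)
        if not buttons:  # button gone: no more pages to load
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)
    return clicks
```

Returning the click count also tells you how many pages were actually loaded.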
Source code:
# -*- coding: UTF-8 -*-
# https://www.typographicposters.com
# Poster image scraper
# 2020-05-13 by WeChat: huguo00289
import time
from selenium import webdriver

def xl(browser):
    # Click the "view more" button; on failure, wait and retry once
    try:
        browser.find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]').click()
    except Exception:
        print("Network problem!")
        time.sleep(5)
        browser.find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]').click()
    time.sleep(3)
    js = "var q=document.documentElement.scrollTop=100000"  # scroll to the bottom of the page
    browser.execute_script(js)
    html = browser.page_source
    # i is the loop counter from the module-level for loop below
    with open('sj{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(html)
chromedriver_path = r"C:\Users\Administrator\AppData\Local\Programs\Python\Python37\chromedriver.exe"  # full path
url = 'https://www.typographicposters.com/?filter=recent'
options = webdriver.ChromeOptions()  # configure Chrome launch options
# options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # skip loading images to speed things up
options.add_experimental_option("excludeSwitches", ['enable-automation'])  # important: hides the automation flag so sites are less likely to detect Selenium
browser = webdriver.Chrome(executable_path=chromedriver_path, options=options)
browser.get(url)
time.sleep(5)
js = "var q=document.documentElement.scrollTop=100000"  # scroll to the bottom of the page
browser.execute_script(js)
time.sleep(2)
for i in range(1, 20):
    print(i)
    # find_elements (plural) returns an empty list instead of raising
    # NoSuchElementException when the button is gone, so the else branch can fire
    if browser.find_elements_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]'):
        xl(browser)
    else:
        print("Paging finished")
        time.sleep(10)
        # dump the final page source
        html = browser.page_source
        with open('sjj.html', 'w', encoding='utf-8') as f:
            f.write(html)
        break
Step 2: collect the detail-page links, then get each full-size image link and download the images.
1. Used XPath to collect the detail-page link addresses.
2. Used a regex to grab the image link. It turns out the full-size image URL sits right in the page head, so requests fetches the page source and one regex pulls the image address in a single step.
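Both extraction steps can be sanity-checked offline against a minimal hand-written HTML snippet. The snippet below is my own mock-up, not the site's real markup, but it mirrors the structure the selectors target:

```python
import re
from lxml import etree

# Hypothetical sample page mirroring the markup the selectors expect
sample = '''
<html><head>
<meta property="og:image:url" content="https://www.typographicposters.com/u/example-poster.jpg" />
</head><body>
<div class="col-6 posters-item"><a href="/posters/example-poster"></a></div>
</body></html>
'''

# Step 1 selector: detail-page links
hrefs = etree.HTML(sample).xpath('//div[@class="col-6 posters-item"]/a/@href')
print(hrefs)    # ['/posters/example-poster']

# Step 2 regex: full-size image URL from the og:image:url meta tag
img_url = re.findall(r'<meta property="og:image:url" content="(.+?)" />', sample, re.S)[0]
print(img_url)  # https://www.typographicposters.com/u/example-poster.jpg
```

Testing selectors against a local snippet like this is much faster than re-running the whole Selenium crawl every time the XPath or regex needs a tweak.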
Source code:
# -*- coding: UTF-8 -*-
# https://www.typographicposters.com
# Poster image scraper
# 2020-05-13 by WeChat: huguo00289
import requests, re, time
from lxml import etree
from fake_useragent import UserAgent

def ua():
    # Build a random User-Agent header
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    return headers
def tp(img_url):
    # Download a single image, named after the last path segment of its URL
    ua = UserAgent()
    headers = {
        'referer': 'https://www.typographicposters.com',
        'User-Agent': ua.random,
    }
    img_name = img_url.split('/')[-1]
    r = requests.get(img_url, headers=headers, timeout=10)
    time.sleep(1)
    with open(img_name, 'wb') as f:
        f.write(r.content)
    print(f"{img_name} downloaded successfully")
def getimg(url):
    # Fetch a detail page and pull the full-size image URL from the og:image:url meta tag
    html = requests.get(url, headers=ua(), timeout=10).content.decode('utf-8')
    img_url = re.findall(r'<meta property="og:image:url" content="(.+?)" />', html, re.S)[0]
    print(img_url)
    tp(img_url)

with open("sj17.html", encoding='utf-8') as f:
    html = f.read()
req = etree.HTML(html)
hrefs = req.xpath('//div[@class="col-6 posters-item"]/a/@href')
print(len(hrefs))
for href in hrefs:
    url = f"https://www.typographicposters.com{href}"
    print(url)
    try:
        getimg(url)
    except Exception:
        pass  # skip pages that fail to fetch or parse
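One small refinement worth noting: because the loop swallows errors and there is no bookkeeping, re-running the script re-downloads every image. A guard like the hypothetical should_download below (the name and parameters are mine, not from the original script) reuses the script's own naming rule to skip files already on disk:

```python
import os

def should_download(img_url, folder='.'):
    """Return True if the image hasn't been saved yet.

    Reuses the script's naming rule: the filename is the
    last path segment of the image URL.
    """
    img_name = img_url.split('/')[-1]
    return not os.path.exists(os.path.join(folder, img_name))

# In the loop, wrap the download call:
#     if should_download(img_url):
#         tp(img_url)
```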
Too lazy, I know. Cracking the POST parameters would be the proper way!