typographicposters.com is a foreign poster-gallery site with a rather interesting (hair-pulling) way of delivering its JSON data. Its categories are interesting too: you click an RGB color parameter to filter, so posters are categorized by their colors. There is quite a lot of data, and fetching it directly with requests doesn't work.
Target URL
https://www.typographicposters.com/?filter=recent
The next page of data only loads after manually clicking a "view more" button.
It looks like an Ajax (XHR) query behind the scenes.
Captured request data
Even with the full set of request headers added, the red-underlined error still shows up. A server error?
Fine, fine. Fall back to the dumbest method: Python + Selenium!
Step 1: grab the page source.
1. I'm not comfortable with the while-loop logic, so I used a for loop instead!
2. Errors are inevitable!!
3. Locate the next-page button with XPath:
find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]')
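As an aside on point 1 above: the while-loop version I dodged can actually be written cleanly. The trick is find_elements_by_xpath (plural), which returns an empty list instead of raising NoSuchElementException when the button is gone, so the exit condition writes itself. This is only a sketch under my own assumptions (function name and parameters are mine); it assumes any object exposing the old Selenium 3 find_elements_by_xpath API used in this post.

```python
import time

# Same "view more" button XPath as in the post
NEXT_BTN = '//div[@class="pagination"]/button[@class="button-highlight"]'

def click_through_pages(browser, max_pages=50, pause=3):
    """Click the 'view more' button until it disappears or max_pages is hit.

    find_elements_by_xpath (plural) returns an empty list instead of
    raising NoSuchElementException, so the while condition stays simple.
    """
    clicks = 0
    while clicks < max_pages:
        buttons = browser.find_elements_by_xpath(NEXT_BTN)
        if not buttons:  # button gone: no more pages to load
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)
    return clicks
```

Returning the click count also tells you how many pages were actually loaded.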
Source code:
# -*- coding: UTF-8 -*-
# https://www.typographicposters.com
# Poster image scraper
# 2020-05-13 by WeChat: huguo00289
import time
from selenium import webdriver

def xl(browser):
    # Click the "view more" button; on failure, wait and retry once
    try:
        browser.find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]').click()
    except Exception:
        print("Network problem!")
        time.sleep(5)
        browser.find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]').click()
    time.sleep(3)
    js = "var q=document.documentElement.scrollTop=100000"  # scroll to the bottom of the page
    browser.execute_script(js)
    html = browser.page_source
    # i is the loop counter from the module-level for loop below
    with open('sj{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(html)
chromedriver_path = r"C:\Users\Administrator\AppData\Local\Programs\Python\Python37\chromedriver.exe"  # full path
url = 'https://www.typographicposters.com/?filter=recent'
options = webdriver.ChromeOptions()  # configure Chrome launch options
# options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # skip loading images to speed things up
options.add_experimental_option("excludeSwitches", ['enable-automation'])  # important: hides the automation flag so sites are less likely to detect Selenium
browser = webdriver.Chrome(executable_path=chromedriver_path, options=options)
browser.get(url)
time.sleep(5)
js = "var q=document.documentElement.scrollTop=100000"  # scroll to the bottom of the page
browser.execute_script(js)
time.sleep(2)
for i in range(1, 20):
    print(i)
    # find_elements (plural) returns an empty list instead of raising
    # NoSuchElementException when the button is gone, so the else branch can fire
    if browser.find_elements_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]'):
        xl(browser)
    else:
        print("Paging finished")
        time.sleep(10)
        # dump the final page source
        html = browser.page_source
        with open('sjj.html', 'w', encoding='utf-8') as f:
            f.write(html)
        break
Step 2: collect the detail-page links, then get each full-size image link and download the images.
1. Used XPath to collect the detail-page link addresses.
2. Used a regex to grab the image link. It turns out the full-size image URL sits right in the page head, so requests fetches the page source and one regex pulls the image address in a single step.
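Both extraction steps can be sanity-checked offline against a minimal hand-written HTML snippet. The snippet below is my own mock-up, not the site's real markup, but it mirrors the structure the selectors target:

```python
import re
from lxml import etree

# Hypothetical sample page mirroring the markup the selectors expect
sample = '''
<html><head>
<meta property="og:image:url" content="https://www.typographicposters.com/u/example-poster.jpg" />
</head><body>
<div class="col-6 posters-item"><a href="/posters/example-poster"></a></div>
</body></html>
'''

# Step 1 selector: detail-page links
hrefs = etree.HTML(sample).xpath('//div[@class="col-6 posters-item"]/a/@href')
print(hrefs)    # ['/posters/example-poster']

# Step 2 regex: full-size image URL from the og:image:url meta tag
img_url = re.findall(r'<meta property="og:image:url" content="(.+?)" />', sample, re.S)[0]
print(img_url)  # https://www.typographicposters.com/u/example-poster.jpg
```

Testing selectors against a local snippet like this is much faster than re-running the whole Selenium crawl every time the XPath or regex needs a tweak.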
Source code:
# -*- coding: UTF-8 -*-
# https://www.typographicposters.com
# Poster image scraper
# 2020-05-13 by WeChat: huguo00289
import requests, re, time
from lxml import etree
from fake_useragent import UserAgent

def ua():
    # Build a random User-Agent header
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    return headers
def tp(img_url):
    # Download a single image, named after the last path segment of its URL
    ua = UserAgent()
    headers = {
        'referer': 'https://www.typographicposters.com',
        'User-Agent': ua.random,
    }
    img_name = img_url.split('/')[-1]
    r = requests.get(img_url, headers=headers, timeout=10)
    time.sleep(1)
    with open(img_name, 'wb') as f:
        f.write(r.content)
    print(f"{img_name} downloaded successfully")
def getimg(url):
    # Fetch a detail page and pull the full-size image URL from the og:image:url meta tag
    html = requests.get(url, headers=ua(), timeout=10).content.decode('utf-8')
    img_url = re.findall(r'<meta property="og:image:url" content="(.+?)" />', html, re.S)[0]
    print(img_url)
    tp(img_url)

with open("sj17.html", encoding='utf-8') as f:
    html = f.read()
req = etree.HTML(html)
hrefs = req.xpath('//div[@class="col-6 posters-item"]/a/@href')
print(len(hrefs))
for href in hrefs:
    url = f"https://www.typographicposters.com{href}"
    print(url)
    try:
        getimg(url)
    except Exception:
        pass  # skip pages that fail to fetch or parse
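One small refinement worth noting: because the loop swallows errors and there is no bookkeeping, re-running the script re-downloads every image. A guard like the hypothetical should_download below (the name and parameters are mine, not from the original script) reuses the script's own naming rule to skip files already on disk:

```python
import os

def should_download(img_url, folder='.'):
    """Return True if the image hasn't been saved yet.

    Reuses the script's naming rule: the filename is the
    last path segment of the image URL.
    """
    img_name = img_url.split('/')[-1]
    return not os.path.exists(os.path.join(folder, img_name))

# In the loop, wrap the download call:
#     if should_download(img_url):
#         tp(img_url)
```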
Too lazy, I know. Cracking the POST parameters would be the proper way!