Font-Based Anti-Scraping: Case Analysis and Hands-On Scraping

Article link: https://blog.csdn.net/qq_39217312/article/details/140773076

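In this case the real data is hidden in a font-related resource, so even after we obtain the page source we cannot directly extract the true values.

Case Introduction

Case site: https://antispider4.scrape.center/

Once opened, the page shows nothing out of the ordinary.

We first scrape it following the usual approach: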

from selenium import webdriver
from pyquery import PyQuery as pq
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://antispider4.scrape.center/')
# Wait up to 10 seconds until the movie cards (.item) are present in the DOM
(WebDriverWait(browser, 10)
 .until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item'))))
html = browser.page_source
doc = pq(html)
items = doc('.item')
for item in items.items():
    name = item('.name').text()
    categories = [o.text() for o in item('.categories button').items()]
    score = item('.score').text()
    print(f'name: {name} categories: {categories} score: {score}')
browser.close()

Partial output

name: 霸王别姬 - Farewell My Concubine categories: ['剧情', '爱情'] score:
name: 这个杀手不太冷 - Léon categories: ['剧情', '动作', '犯罪'] score:
name: 肖申克的救赎 - The Shawshank Redemption categories: ['剧情', '犯罪'] score:
name: 泰坦尼克号 - Titanic categories: ['剧情', '爱情', '灾难'] score:
name: 罗马假日 - Roman Holiday categories: ['剧情', '喜剧', '爱情'] score:

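Here we open the case site with Selenium, wait for all the movies to load, grab the page source, and then use pyquery to parse each movie and extract its name, categories and score before printing them. The score field in the output turns out to be empty. On closer inspection, the markup for the score contains no digits at all.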
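
To make the empty score concrete, the sketch below parses a small hand-written fragment with the standard library's html.parser (used here instead of pyquery only so that it runs without extra packages). The markup in SAMPLE is an assumption modelled on what the page appears to contain, not a copy of it, and ScoreProbe is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the score node: the <i> nodes carry
# class names but no text of their own.
SAMPLE = (
    '<p class="score">'
    '<span><i class="icon icon-789"></i></span>'
    '<span><i class="icon icon-981"></i></span>'
    '<span><i class="icon icon-504"></i></span>'
    '</p>'
)


class ScoreProbe(HTMLParser):
    """Collects text nodes and the class attribute of every <i> tag."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.icon_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == 'i':
            self.icon_classes.append(dict(attrs).get('class', ''))

    def handle_data(self, data):
        self.text.append(data)


probe = ScoreProbe()
probe.feed(SAMPLE)
probe.close()

visible_text = ''.join(probe.text).strip()
print(repr(visible_text))    # no text nodes, so this is an empty string
print(probe.icon_classes)    # the only information left is in the class names
```

This is why .text() returns nothing for the score: whatever the browser draws there does not come from text in the HTML.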

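Case Analysis

Looking at the source, we find that the span nodes differ only in the class of the i node inside them. As the inspected markup shows, there are three span nodes in total, and their i nodes carry the classes icon-789, icon-981 and icon-504. How do these relate to the displayed 9.5?

Inside each i node there is a ::before entry. In CSS, ::before creates a pseudo-element, that is, a node which, unlike the i or span nodes, does not exist in the HTML itself. It inserts content into the selected node, and that content is defined by the CSS content property. In the first i node we see the digit 9; in the other two we see . (a dot) and 5. Put together, the three values give 9.5.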
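
The idea can be sketched in a few lines before touching the real stylesheet. The three entries in icon_map below are just the values read off the three nodes described above, so the mapping is assumed complete only for this one score; decode is a hypothetical helper name.

```python
import re

# Values observed for the three nodes above: 9, a dot, and 5.
icon_map = {'789': '9', '981': '.', '504': '5'}
class_names = ['icon icon-789', 'icon icon-981', 'icon icon-504']


def decode(class_names, icon_map):
    """Turn a sequence of icon class attributes into the displayed string."""
    chars = []
    for class_name in class_names:
        match = re.search(r'icon-(\d+)', class_name)
        if match:
            chars.append(icon_map[match.group(1)])
    return ''.join(chars)


score = decode(class_names, icon_map)
print(score)  # 9.5
```

The order of the nodes matters: the characters are joined in document order, exactly as the browser lays them out.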

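Hands-On Practice

So how is the mapping between the class values and the content values defined? We can trace it back to the CSS source.

After opening the file, if all the code sits on a single line, click the { } button below it to format the code.

In it we can find content like this: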

.icon-789:before {
content: "9"
}

.icon-281:before {
content: "8"
}

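So the values defined in the CSS are the individual characters of the score. All we need to do is parse and convert them, which means reading the CSS file and extracting the mapping. The CSS file is https://antispider4.scrape.center/css/app.654ba59e.css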

#header[data-v-74e8b908]{background-color:#fff}.container[data-v-74e8b908]{height:60px;padding-top:8px}.logo[data-v-74e8b908]{height:40px;width:200px;position:relative}.logo .logo-image[data-v-74e8b908]{height:40px}.logo .logo-title[data-v-74e8b908]{position:absolute;left:55px;top:5px;height:40px;font-size:23px;font-weight:700;color:#444}#app{font-family:Avenir,Helvetica,Arial,sans-serif}@font-face{font-family:element-icons;src:url(../fonts/element-icons.535877f5.woff) format("woff"),url(../fonts/element-

This is only part of the file.
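
The file name contains what looks like a build hash (654ba59e), so it may change when the site is redeployed. As a hedged alternative to hard-coding the URL, the sketch below pulls the stylesheet path out of the page HTML. SAMPLE_HTML is a hypothetical head fragment (the chunk-vendors entry and the attribute order are assumptions), and find_app_css is an illustrative helper, not something the site or any library provides.

```python
import re
from urllib.parse import urljoin

PAGE_URL = 'https://antispider4.scrape.center/'

# Hypothetical <head> fragment; the real page may differ in detail.
SAMPLE_HTML = (
    '<head>'
    '<link href="/css/chunk-vendors.00000000.css" rel="stylesheet">'
    '<link href="/css/app.654ba59e.css" rel="stylesheet">'
    '</head>'
)


def find_app_css(html, base_url):
    """Return the absolute URL of the app.<hash>.css stylesheet, or None."""
    match = re.search(r'href="([^"]*/css/app\.[0-9a-f]+\.css)"', html)
    return urljoin(base_url, match.group(1)) if match else None


css_url = find_app_css(SAMPLE_HTML, PAGE_URL)
print(css_url)
```

In the real script, the html argument could be browser.page_source, which we already have.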

We can read it with the requests library and extract the mapping with a regular expression:

import re
import requests

url = 'https://antispider4.scrape.center/css/app.654ba59e.css'

response = requests.get(url)
# Raw string so that \{ reaches the regex engine without an invalid-escape warning
pattern = re.compile(r'.icon-(.*?):before\{content:"(.*?)"}')
results = re.findall(pattern, response.text)
icon_map = {item[0]: item[1] for item in results}

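Here we use the requests library to fetch the contents of the CSS file and then match the text with a regular expression, written as .icon-(.*?):before\{content:"(.*?)"} . The expression does not account for whitespace, because the CSS source is served on a single line with the whitespace already stripped.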
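
If the stylesheet were ever served in its formatted form, that expression would stop matching. A minimal sketch of a more tolerant pattern follows; it also escapes the leading dot and restricts the key to digits. The two sample strings are hand-written from the rules quoted earlier, and build_icon_map is a hypothetical helper name.

```python
import re

# Minified sample, including one unrelated el-icon rule as noise.
MINIFIED = (
    '.el-icon-ice-cream-square:before{content:"\\E6A3"}'
    '.icon-789:before{content:"9"}.icon-281:before{content:"8"}'
)

# The same two score rules as they look after pretty-printing.
FORMATTED = '''
.icon-789:before {
content: "9"
}

.icon-281:before {
content: "8"
}
'''

# \s* allows optional whitespace; ;? allows an optional trailing semicolon.
pattern = re.compile(
    r'\.icon-(\d+):before\s*\{\s*content:\s*"(.*?)"\s*;?\s*\}')


def build_icon_map(css_text):
    """Map the numeric suffix of each .icon-N class to its content value."""
    return dict(pattern.findall(css_text))


print(build_icon_map(MINIFIED))
print(build_icon_map(FORMATTED))
```

Both calls produce the same two-entry mapping, and the el-icon rule is ignored because its selector does not contain a literal ".icon-".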

Matching with the findall method gives the following result:

print(icon_map)

Partial output

{'"],[class^=el-icon-]{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}[class*=" el-icon-"],[class^=el-icon-]{font-family:element-icons!important;speak:none;font-style:normal;font-weight:400;font-variant:normal;text-transform:none;line-height:1;vertical-align:baseline;display:inline-block}.el-icon-ice-cream-round': '\\E6A0', 'ice-cream-square': '\\E6A3',

If we look up the key 789:

print(icon_map['789'])

9

This matches what we saw in the page source.
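
The printed dictionary also contains keys that swallowed long stretches of unrelated CSS. A small reproduction, on a hand-written string rather than the real file, suggests why: the leading . in the pattern is unescaped and matches any character, so the match can begin inside el-icon-, and the lazy (.*?) then runs on to the next :before rule. The numeric keys we need are still captured correctly.

```python
import re

# Hand-written stand-in for a slice of the minified stylesheet.
css = (
    '[class^=el-icon-]{display:inline-block}'
    '.el-icon-ice-cream-square:before{content:"\\E6A3"}'
    '.icon-789:before{content:"9"}'
)

# Same expression as in the article.
loose = re.compile(r'.icon-(.*?):before\{content:"(.*?)"}')
loose_map = dict(loose.findall(css))

for key, value in loose_map.items():
    print(repr(key), '->', repr(value))

# Keeping only the purely numeric keys discards the noise.
numeric_map = {k: v for k, v in loose_map.items() if k.isdigit()}
print(numeric_map)
```

Since only numeric keys are ever looked up, the extra entries are harmless here, but filtering them out keeps the mapping easier to inspect.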

So we only need to adjust the extraction logic:

from selenium import webdriver
from pyquery import PyQuery as pq
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import re
import requests

url = 'https://antispider4.scrape.center/css/app.654ba59e.css'

response = requests.get(url)
# Raw string so that \{ reaches the regex engine without an invalid-escape warning
pattern = re.compile(r'.icon-(.*?):before\{content:"(.*?)"}')
results = re.findall(pattern, response.text)
icon_map = {item[0]: item[1] for item in results}
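

def parse_score(item):
    """Decode the score of one movie card from its icon class names."""
    icon_values = []
    for element in item('.icon').items():
        class_name = element.attr('class') or ''
        match = re.search(r'icon-(\d+)', class_name)
        if not match:
            continue  # no numeric suffix, so not a score glyph
        # Use '?' for an unmapped class so the gap is visible instead of
        # ''.join() failing on None
        icon_values.append(icon_map.get(match.group(1), '?'))
    return ''.join(icon_values)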


browser = webdriver.Chrome()
browser.get('https://antispider4.scrape.center/')
(WebDriverWait(browser, 10)
 .until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item'))))
html = browser.page_source
doc = pq(html)
items = doc('.item')
for item in items.items():
    name = item('.name').text()
    categories = [o.text() for o in item('.categories button').items()]
    score = parse_score(item)
    print(f'name: {name} categories: {categories} score: {score}')
browser.close()

Output

name: 霸王别姬 - Farewell My Concubine categories: ['剧情', '爱情'] score: 9.5
name: 这个杀手不太冷 - Léon categories: ['剧情', '动作', '犯罪'] score: 9.5
name: 肖申克的救赎 - The Shawshank Redemption categories: ['剧情', '犯罪'] score: 9.5
name: 泰坦尼克号 - Titanic categories: ['剧情', '爱情', '灾难'] score: 9.5
name: 罗马假日 - Roman Holiday categories: ['剧情', '喜剧', '爱情'] score: 9.5
name: 唐伯虎点秋香 - Flirting Scholar categories: ['喜剧', '爱情', '古装'] score: 9.5
name: 乱世佳人 - Gone with the Wind categories: ['剧情', '爱情', '历史', '战争'] score: 9.5
name: 喜剧之王 - The King of Comedy categories: ['剧情', '喜剧', '爱情'] score: 9.5
name: 楚门的世界 - The Truman Show categories: ['剧情', '科幻'] score: 9.0
name: 狮子王 - The Lion King categories: ['动画', '歌舞', '冒险'] score: 9.0
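
The decoded score is still a string. If the stylesheet changes and a class is missing from the mapping, the result could be incomplete, so a light sanity check can help. The sketch below assumes scores lie between 0 and 10, which is only inferred from the values printed above; to_score is a hypothetical helper name.

```python
def to_score(text):
    """Convert a decoded score string to float, or None if it looks wrong."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    # Assumed valid range, inferred from outputs such as 9.5 and 9.0
    return value if 0.0 <= value <= 10.0 else None


for raw in ['9.5', '9.0', '', '9.?', None]:
    print(repr(raw), '->', to_score(raw))
```

A None result is a hint to re-fetch the CSS and rebuild icon_map before trusting the scraped data.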