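As in the previous posts, the approach is to analyze the page structure first, work out where the target element lives, and then find a way to extract it. This time the target is the listing price on the Ziroom (自如) website, which is obfuscated with CSS. Looking for the pattern shows how it works: the digits are stored in shuffled order in an image, and each displayed digit is positioned over that image. So we need to: (1) get the image that holds the shuffled digits; (2) get the positioning information for each displayed digit; (3) recognize the digits in the image with OCR, then pick them out in order and concatenate them to get the correct price.
Ziroom rental site: Tianjin rental listings and prices, Tianjin Ziroom (ziroom.com)
The full exploration process follows.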
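Step 1: Locate the element we want and check whether it is present in the page source and can be read directly.
Open any listing, for example 自如友家·天悦风华·4居室 (https://tj.ziroom.com/x/807872053.html).
The price is not stored as plain text in the source. Each digit is held in an i tag of this form:
class="num" style="background-position:-31.24px;background-image:url(//static8.ziroom.com/phoenix/pc/images/2019/price/6f8787069ac0a69b36c8cf13aacb016b.png);"
background-position and background-image are the background offset and the background image. Opening the image link in the browser shows that all four digits use the same image link; only the offset differs.
Try changing the value of background-position: if the first digit's background-position is set to -0, the price shown on the page becomes 6030, i.e. that slot now shows the first digit in the image. From this we can conclude that the digits in the image are laid out at intervals of 31.24px, so the index of a displayed digit is its background-position divided by 31.24, taken as an integer.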
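The offset-to-index rule above can be sketched in a few lines. This is only a minimal sketch, assuming every offset is a multiple of 31.24px as observed in the browser; `DIGIT_WIDTH` and `offset_to_index` are illustrative names, not anything taken from the site. It uses `round` rather than floor division so that a tiny floating-point error cannot shift an index down by one.

```python
DIGIT_WIDTH = 31.24  # px per digit, as observed in Step 1 (assumed constant)


def offset_to_index(offset_px: str) -> int:
    """Convert a background-position value such as '-31.24' to a digit index."""
    return round(abs(float(offset_px)) / DIGIT_WIDTH)


# Offsets as they would appear in the style attribute, without the "px" suffix.
indices = [offset_to_index(v) for v in ["-0", "-31.24", "-62.48"]]
print(indices)  # prints [0, 1, 2]
```

If the site ever changes the digit width, only `DIGIT_WIDTH` would need to be updated.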
Step 2: Try using OCR to extract the text in the image.
This needs the third-party Python library ddddocr; see the blog posts below for installation and usage.
Python general-purpose captcha OCR library ddddocr: installation and usage tutorial (qq_50058672, CSDN blog): https://blog.csdn.net/qq_50058672/article/details/126123778
Python3: using the image text recognition library ddddocr for captcha recognition (liranke, CSDN blog): https://blog.csdn.net/liranke/article/details/126405660
If you have installed OpenCV-related packages before, you may hit an error. The following two posts cover it, and the fix worked for me.
Solved: AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeli (袁袁袁袁满, CSDN blog): https://blog.csdn.net/yuan2019035055/article/details/128076070
AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeline' (「 25' h 」, CSDN blog): https://blog.csdn.net/weixin_54884881/article/details/126062844
Code (first save the image to be recognized as zr.png in the same directory as the script):
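import ddddocr


def img2text(img_file):
    """Run OCR on an image file and print the recognized text."""
    ocr = ddddocr.DdddOcr()  # method 1
    with open(img_file, 'rb') as f:  # open the file in binary mode
        img_bytes = f.read()
    # recognize the text
    res = ocr.classification(img_bytes)
    print(res)


img2text('zr.png')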
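Output (there appears to be an ad banner; the last line is the recognition result). The recognition result is correct.
Step 3: Try to read background-position and background-image, then use the two to build the image link, download the image, recognize it, and assemble the price.
Code: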
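import re

import requests
from lxml import etree

# `headers` is the request-header dict defined in the full code in Step 4.
source = requests.get('https://tj.ziroom.com/x/807872053.html', headers=headers).text
# One style attribute per displayed digit.
result = etree.HTML(source).xpath('//div[@class="Z_price"]/i/@style')
print(result)

# Raw string avoids invalid-escape warnings; \s* tolerates an optional space before url(.
background_pi_compile = re.compile(r'background-position:-([\d.]+)px;background-image:\s*url\((.*?)\);')
list_index = []
for s in result:
    background_pi = background_pi_compile.findall(s)
    # round() rather than // so float error cannot shift the index down by one.
    position = round(float(background_pi[0][0]) / 31.24)
    image = background_pi[0][1]
    image_link = 'https:' + image
    list_index.append(position)
    print(position, image)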
Output:
With this in place, all that remains is to add image recognition and the digit lookup.
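The style parsing in Step 3 can also be checked offline, without hitting the site. The following is a minimal sketch, assuming the attribute always has the shape shown in Step 1; `STYLE_RE` and `parse_style` are hypothetical helper names, and the tolerance for extra whitespace and a missing minus sign is an assumption about possible markup variants, not something observed on the page.

```python
import re

# Optional whitespace and an optional minus sign are tolerated (assumption).
STYLE_RE = re.compile(
    r"background-position:\s*-?([\d.]+)px;\s*background-image:\s*url\((.*?)\)"
)


def parse_style(style: str) -> tuple:
    """Return (offset in px, absolute image URL) from one style attribute."""
    match = STYLE_RE.search(style)
    if match is None:
        raise ValueError(f"unexpected style attribute: {style!r}")
    offset, url = match.groups()
    if url.startswith("//"):
        url = "https:" + url  # protocol-relative URL, as in the page source
    return float(offset), url


# The style value quoted in Step 1.
sample = ("background-position:-31.24px;background-image:url(//static8.ziroom.com/"
          "phoenix/pc/images/2019/price/6f8787069ac0a69b36c8cf13aacb016b.png);")
print(parse_style(sample))
```

Raising on an unexpected attribute, instead of indexing into an empty `findall` result, makes a markup change on the site show up as a clear error.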
Step 4: Full code:
import re

import ddddocr
import requests
from lxml import etree


def img2text(img_bytes):
    """Return the text that ddddocr recognizes in the given image bytes."""
    ocr = ddddocr.DdddOcr()
    res = ocr.classification(img_bytes)
    return res


headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate",  # "br" dropped: requests only decodes Brotli when the brotli package is installed
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
"Cache-Control": "no-cache",
"Connection": "keep-alive",
# "Cookie": "SECKEY_ABVK=+AhKxXoC2vCGHgVkogpQ8V8JLGc15shV9MTUOSfoTxQ%3D; BMAP_SECKEY=aWFX1vabkUuKU-Fb_-GwzlazsBAS_cWqFA_ON_8KzUKbP4RGbbn0JTYR-3AfpPw3pYwnCbczRdFqxK6XacrRISdubrB2IAgIZVRr88e2pEEdAqg3MhA9B9WEdpRwwMzMEyMtGXQBistN9DXdCNQZyQZSNSyL-brfbwLZRV0d8djkoVj-D1gH1nExxzfz4-oR; CURRENT_CITY_CODE=120000; _csrf=RPlmIBoCfvT7Wm_6IgJkBFx2aS4c-e_B; __jsluid_s=7a5efd41b82dccf4e3d30d24c795aef1; PHPSESSID=526isovo290grgs0i3b3cu0ju3; Hm_lvt_4f083817a81bcb8eed537963fc1bbf10=1685689065; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221887ae5e0c5113-0b1f5d3d88ca1e-7b515473-1327104-1887ae5e0c62b4%22%2C%22%24device_id%22%3A%221887ae5e0c5113-0b1f5d3d88ca1e-7b515473-1327104-1887ae5e0c62b4%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E5%BC%95%E8%8D%90%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fmp.csdn.net%2Fmp_blog%2Fcreation%2Feditor%3Fnot_checkout%3D1%22%2C%22%24latest_referrer_host%22%3A%22mp.csdn.net%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%7D; visitHistory=%5B%22807872053%22%2C%22761496060%22%2C%22809002819%22%5D; Hm_lpvt_4f083817a81bcb8eed537963fc1bbf10=1685690736",
"Host": "tj.ziroom.com",
"Pragma": "no-cache",
"sec-ch-ua": "\"Microsoft Edge\";v=\"113\", \"Chromium\";v=\"113\", \"Not-A.Brand\";v=\"24\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57"
}
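source = requests.get('https://tj.ziroom.com/x/807872053.html', headers=headers).text
# One style attribute per displayed digit.
result = etree.HTML(source).xpath('//div[@class="Z_price"]/i/@style')

# Raw string avoids invalid-escape warnings; \s* tolerates an optional space before url(.
background_pi_compile = re.compile(r'background-position:-([\d.]+)px;background-image:\s*url\((.*?)\);')
list_index = []
for s in result:
    background_pi = background_pi_compile.findall(s)
    # round() rather than // so float error cannot shift the index down by one.
    position = round(float(background_pi[0][0]) / 31.24)
    image = background_pi[0][1]
    image_link = 'https:' + image
    list_index.append(position)

# All digits share one image, so download and OCR it once, after the loop.
link_source = requests.get(image_link).content
image_str = img2text(link_source)
print(image_str)
# Pick the recognized digits in display order and join them into the price.
image_num = ''.join(map(lambda x: image_str[x], list_index))
print(image_num)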
Output:
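The final lookup trusts the OCR result completely, so a misread character or a short string would either raise an IndexError or silently produce a wrong price. A minimal sketch of a stricter version of that last step is given here; `decode_price` is a hypothetical helper, and the sprite string and indices in the usage line are made-up values for illustration, not real OCR output or real offsets from the site.

```python
def decode_price(sprite_text: str, indices: list) -> str:
    """Pick characters from the OCR result by index and join them into a price."""
    if not sprite_text.isdigit():
        raise ValueError(f"OCR returned non-digit characters: {sprite_text!r}")
    if any(i < 0 or i >= len(sprite_text) for i in indices):
        raise ValueError(f"index out of range for {sprite_text!r}: {indices}")
    return "".join(sprite_text[i] for i in indices)


# Made-up values for illustration only.
print(decode_price("8652039147", [3, 5, 0, 4]))  # prints 2380
```

In the full script this would stand in for the `''.join(map(...))` line, with `image_str` and `list_index` as the arguments.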