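Scraping and Parsing 58.com (58同城) with Python

Python web scraping: collecting rental housing listings from 58.com. For reference only.

Decoding 58.com's font-based anti-scraping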

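58.com rental listings:
[Screenshot: the rental listings page]
This is the data as scraped:
[Screenshot: the scraped data]
In Chrome, right-click the page, choose "View page source", and search for font-face. You will find a long base64-encoded string (base64 is an encoding, not encryption).
[Screenshot: the font-face rule in the page source]
Copy that string, decode it, and save the result as a .ttf file:

import base64
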
font_face='AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8B/ZwAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQTmDvfAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOu1IchfDzz1AAsIAAAAAADYCHhnAAAAANgIeGcAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAABgAIAAEABQAKAAIABwADAAQACQAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAGAACVjwAAlY8AAAAIAACZPAAAmTwAAAABAACaSwAAmksAAAAFAACeOgAAnjoAAAAKAACeowAAnqMAAAACAACfZAAAn2QAAAAHAACfkgAAn5IAAAADAACfpAAAn6QAAAAEAACfpQAAn6UAAAAJAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9T
wABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA'
b = base64.b64decode(font_face)
with open('58.ttf','wb') as f:
    f.write(b)
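
Before opening 58.ttf in an editor, it can be worth checking that the decoded bytes really are a font. The sketch below is only a sanity check, not part of the original workflow: it reads the first six bytes of the file, which in a TrueType font hold the version tag 00 01 00 00 followed by the number of tables. The helper name sfnt_header is made up for this example, and only the first 8 characters of the font_face string above are decoded here.

```python
import base64
import struct


def sfnt_header(font_bytes):
    """Return (version tag, number of tables) from the first 6 bytes of a font."""
    # Big-endian: 4 raw bytes for the version tag, then an unsigned 16-bit count.
    version, num_tables = struct.unpack(">4sH", font_bytes[:6])
    return version, num_tables


# "AAEAAAAL" is the first 8 characters of the font_face string above.
prefix = base64.b64decode("AAEAAAAL")
version, num_tables = sfnt_header(prefix)
print(version == b"\x00\x01\x00\x00", num_tables)  # True 11
```

If the version tag is something else, the string was probably copied incompletely or is not a TrueType font.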

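Search online for a font tool such as FontCreator and install it, or use Baidu's online font editor, then open 58.ttf in it.
[Screenshots: 58.ttf opened in the font editor]

Comparing the page with the screenshots above shows that &#x9fa4; corresponds to 3. We can therefore build a mapping from the codes used in the HTML to the real digits, and replace the codes in the page with those digits.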

from fontTools.ttLib import TTFont

font = TTFont('58.ttf')  # open the local ttf file
font.saveXML('58.xml')   # dump it as XML

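Open the XML file and you will see a tag structure similar to HTML.

Expand the glyf tag and you will see glyph names and lists of coordinate points. The points describe the shape of each glyph, and we do not need them here. (If the font is dynamic, that is, it changes from request to request, the coordinates are what we would use to tell the glyphs apart.)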
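
For the dynamic case mentioned above, one possible approach is to record the outline of each digit once from a reference font and then match glyphs in a new font by shape instead of by name. The sketch below is not from the original post and works on plain lists of (x, y) points so that it runs without a font file. The function names are made up, and it assumes the new font reuses exactly the same outlines (possibly shifted); if the site jitters the points, a tolerance-based comparison would be needed. With fontTools, the points can probably be read from font['glyf'][name].coordinates, but treat that as an assumption to verify.

```python
def outline_key(points):
    """Shift an outline so its bounding box starts at (0, 0), making it comparable."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    return tuple((x - min_x, y - min_y) for x, y in points)


def match_by_outline(reference, target):
    """Map glyph names in `target` to digits by comparing outlines.

    reference: {digit: [(x, y), ...]} recorded once from a known font
    target:    {glyph_name: [(x, y), ...]} read from the newly downloaded font
    """
    known = {outline_key(points): digit for digit, points in reference.items()}
    result = {}
    for name, points in target.items():
        key = outline_key(points)
        if key in known:  # glyphs with unknown shapes are skipped
            result[name] = known[key]
    return result


# Made-up outlines, only to show the matching step.
reference = {0: [(0, 0), (10, 0), (10, 20)], 1: [(5, 5), (5, 25)]}
target = {"glyphA": [(5, 5), (5, 25)], "glyphB": [(100, 50), (110, 50), (110, 70)]}
print(match_by_outline(reference, target))  # {'glyphA': 1, 'glyphB': 0}
```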

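[Screenshot: the glyf tag in 58.xml]

Expand the cmap tag: it holds the mapping from each code to a glyph name.
[Screenshot: the cmap tag in 58.xml]
From this screenshot we can see that glyph00001 corresponds to the digit 0, glyph00002 to 1, and so on.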
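
The same rule can also be applied to the XML dump directly. The sketch below is an illustration only: SAMPLE is a hand-written fragment imitating the cmap section of 58.xml (the element and attribute names are what the dump usually looks like, and the three entries are chosen to agree with the mapping printed later in this post), and digits_from_cmap_xml is a made-up helper name.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the cmap section of 58.xml (an assumption).
SAMPLE = """<ttFont>
  <cmap>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x9476" name="glyph00006"/>
      <map code="0x993c" name="glyph00001"/>
      <map code="0x9fa4" name="glyph00004"/>
    </cmap_format_4>
  </cmap>
</ttFont>"""


def digits_from_cmap_xml(xml_text):
    """Return {code: digit}, using the rule digit = glyph number - 1."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for node in root.iter("map"):
        glyph_number = int(re.search(r"\d+", node.get("name")).group())
        mapping[node.get("code")] = glyph_number - 1
    return mapping


print(digits_from_cmap_xml(SAMPLE))  # {'0x9476': 5, '0x993c': 0, '0x9fa4': 3}
```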

import re
from fontTools.ttLib import TTFont

font = TTFont('58.ttf')  # open the local ttf file
bestcmap = font['cmap'].getBestCmap()  # {code (int): glyph name}
print(bestcmap)
newmap = dict()
for key in bestcmap.keys():
    print(key)
    # glyph00001 draws 0, glyph00002 draws 1, ... so digit = glyph number - 1
    value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1
    key = hex(key)  # the key is an int; hex form matches the codes used in the HTML
    newmap[key] = value
print(newmap)

The result is:

{'0x9476': 5, '0x958f': 7, '0x993c': 0, '0x9a4b': 4, '0x9e3a': 9, '0x9ea3': 1, '0x9f64': 6, '0x9f92': 2, '0x9fa4': 3, '0x9fa5': 8}

Now that we know which digit each code stands for, we can replace the codes in the page data.
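
The replacement step on its own can be sketched as follows. This is a minimal example under stated assumptions, not the exact code used below: DIGIT_MAP is copied from the result printed above, restore_digits is a made-up helper name, and the HTML string is invented for the demonstration. Entities that are not in the map are left untouched.

```python
import re

# Copied from the mapping printed above.
DIGIT_MAP = {'0x9476': 5, '0x958f': 7, '0x993c': 0, '0x9a4b': 4, '0x9e3a': 9,
             '0x9ea3': 1, '0x9f64': 6, '0x9f92': 2, '0x9fa4': 3, '0x9fa5': 8}

ENTITY = re.compile(r"&#x([0-9a-fA-F]+);")


def restore_digits(html, digit_map):
    """Replace every &#x....; entity found in digit_map with its real digit."""
    def swap(match):
        key = "0x" + match.group(1).lower()
        # Leave entities that are not part of the custom font as they are.
        return str(digit_map[key]) if key in digit_map else match.group(0)
    return ENTITY.sub(swap, html)


sample = "<b>&#x9ea3;&#x9fa4;&#x993c;&#x993c;</b> &#x4e2d;"
print(restore_digits(sample, DIGIT_MAP))  # <b>1300</b> &#x4e2d;
```

Doing the substitution in a single regex pass avoids scanning the whole page once per code, which is what a loop of str.replace calls does.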

import requests
import re
import base64
import io
from lxml import etree
from fontTools.ttLib import TTFont

url = 'https://sz.58.com/chuzu/?PGTID=0d3090a7-0000-49be-8dc8-e2e4af034ea9&ClickID=1'
headers = {
    'User-Agent':'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
    'cookie': 'f=n; commontopbar_new_city_info=4%7C%E6%B7%B1%E5%9C%B3%7Csz; commontopbar_ipcity=shuyang%7C%E6%B2%AD%E9%98%B3%7C0; id58=c5/nn18XlQCym5PPT8JdAg==; 58tj_uuid=547cf37c-9135-4168-9fdc-341836d1afdb; als=0; wmda_uuid=796eab3d6b3576eee6c534b6d3379544; wmda_new_uuid=1; wmda_visited_projects=%3B11187958619315; f=n; xxzl_deviceid=Be9TJhOgYAPP7GSUEkgk4RZBMabLmi34GBBt%2FTegkrNuIMH1nBc4pe1OJLgbV0nN; _house_detail_show_time_=2; ppStore_fingerprint=E6C2579E17E2E551AEDBDBA82D9C3927E985B7664F4F8B35%EF%BC%BF1596156971920; xzfzqtoken=Rmgrih5Drls4t8VZnmZuNUzpsVR6%2F2yOsmMXk%2BkaMs%2BEOTDi1FQwTvzz29JaZseJin35brBb%2F%2FeSODvMgkQULA%3D%3D; xxzl_cid=104cd4f8332c4312b5711ffd15134b1a; xzuid=8121746b-4950-4710-a1a0-40b2b0245a0d; new_session=1; new_uv=4; utm_source=; spm=; init_refer=; wmda_session_id_11187958619315=1596162757844-8ecc150a-be9e-98a4',

}
response = requests.get(url=url, headers=headers)
print(response.text)
# Extract the base64-encoded font string embedded in the HTML page
base64_str = re.search(r"base64,(.*?)'\)", response.text).group(1)
b = base64.b64decode(base64_str)  # the decoded result is raw bytes
font = TTFont(io.BytesIO(b))  # load the bytes as a font, without writing a file
# print(b)
# print(font)
# Read the code -> glyph name table
bestcmap = font['cmap'].getBestCmap()
newmap = dict()
print(bestcmap)
for key in bestcmap.keys():
    value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1
    key = hex(key)  # hex form, to match the codes in the page
    newmap[key] = value
# Replace the custom-font codes in the page with ordinary digits
font_response = response.text
for key, value in newmap.items():
    house_content = key.replace('0x', '&#x') + ';'
    if house_content in font_response:
        font_response = font_response.replace(house_content, str(value))
    # print(font_response)
# Parse the cleaned HTML and extract the listing fields
tc_html = etree.HTML(font_response)
print(tc_html)
tc_ul = tc_html.xpath('.//div[@class="list-box"]/ul[@class="house-list"]//li[@class="house-cell"]')
i = 0
for li in tc_ul:
    detail_url = li.xpath('./div[@class="img-list"]/a/@href')  # URL of the detail page
    title = li.xpath('normalize-space(.//h2/a/text())')  # title
    house = li.xpath('normalize-space(.//div[@class="des"]/p[1]/text())')  # description of the property
    area1 = li.xpath('normalize-space(.//div[@class="des"]/p[@class="infor"]/a[1]/text())')  # area / address
    area2 = li.xpath('normalize-space(.//div[@class="des"]/p[@class="infor"]/a[2]/text())')  # area / address 2
    area3 = li.xpath('normalize-space(.//div[@class="des"]/p[2]/text()[4])')  # area / address 3
    money = li.xpath('normalize-space(.//div[@class="money"]/b/text())')  # rent
    source = li.xpath('normalize-space(.//div[@class="jjr"]//span[@class="jjr_par_dp"]/text())')  # listing source
    source_peo = li.xpath('normalize-space(.//div[@class="jjr"]//span[@class="listjjr"]/text())')  # contact person for the listing
    print(title, house, money, detail_url)
    i += 1
print(i)  # number of listings found

Output:

[Screenshot: program output]
