Scraping Ziroom Rental Listings

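The hard part: Ziroom obfuscates its prices with CSS offsets. Each digit of a price is a `<span class="num">` whose `background-image` is a sprite of ten digits, and its `background-position` selects which digit is shown, so the price never appears as text in the HTML.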
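To make the idea concrete, here is a minimal sketch of the decoding step. The digit order `'8652039147'` and the HTML fragment are made-up sample data (in the real scraper the digit order comes from running OCR on the sprite image), and `build_offset_map` and `decode_price` are illustrative helper names. The 21.4px step per digit is the one used in the scraper's offset table.

```python
import re

# Offsets of the ten sprite slots, left to right (21.4px per digit).
OFFSETS = ['-0px', '-21.4px', '-42.8px', '-64.2px', '-85.6px',
           '-107px', '-128.4px', '-149.8px', '-171.2px', '-192.6px']


def build_offset_map(sprite_digits):
    """Map each background-position offset to the digit drawn at that slot."""
    if len(sprite_digits) != len(OFFSETS):
        raise ValueError('expected exactly 10 digits')
    return dict(zip(OFFSETS, sprite_digits))


def decode_price(html_fragment, offset_map):
    """Read the offsets of the digit spans in order and join the digits."""
    offsets = re.findall(r'background-position: (.*?)"></span>', html_fragment)
    return ''.join(offset_map[o] for o in offsets)


# Hypothetical sample: the sprite shows "8652039147" from left to right.
offset_map = build_offset_map('8652039147')
fragment = (
    '<span class="num" style="background-image: url(//example.com/s.png);'
    'background-position: -42.8px"></span>'
    '<span class="num" style="background-image: url(//example.com/s.png);'
    'background-position: -64.2px"></span>'
    '<span class="num" style="background-image: url(//example.com/s.png);'
    'background-position: -107px"></span>'
    '<span class="num" style="background-image: url(//example.com/s.png);'
    'background-position: -0px"></span>'
)
print(decode_price(fragment, offset_map))  # prints 5238
```

The lookup table has to be rebuilt whenever the sprite image changes, which is why the scraper downloads and reads the sprite for each page instead of hard-coding the digits.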

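Libraries used: requests, re, lxml, pymysql, and pytesseract (re is part of the standard library; the others are third-party, and the code also uses `Image` from Pillow). If you have no database, delete the `save` method. If you do, set one up yourself and fill in your own user name and password. The table fields are set up as shown below:
[Image: database field settings]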
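Since the field settings are only given as an image, the sketch below is a guess at a matching layout: six columns in the order the `save` method inserts them (name, style, size, price, floor, location). The column names and types are assumptions, and `sqlite3` from the standard library stands in for MySQL so the snippet runs without a server. With pymysql the same `CREATE TABLE` statement would be executed through a MySQL cursor, using `%s` placeholders instead of `?`.

```python
import sqlite3

# Assumed layout: column names and types are guesses, ordered to match
# the six values that save() inserts.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS ziru (
    name     VARCHAR(64),
    style    VARCHAR(16),
    size     FLOAT,
    price    FLOAT,
    floor    VARCHAR(16),
    location VARCHAR(255)
)
"""

conn = sqlite3.connect(':memory:')  # stand-in for the MySQL connection
conn.execute(CREATE_SQL)
# Made-up sample row, only to show the column order.
conn.execute(
    'INSERT INTO ziru VALUES (?, ?, ?, ?, ?, ?)',
    ('Sample Court 3BR-South', 'Shared', 12.5, 2360.0, '5/18', 'Sample district'),
)
conn.commit()
rows = conn.execute('SELECT name, size, price FROM ziru').fetchall()
print(rows)
```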

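pytesseract is fiddly to install, because it is only a wrapper and the Tesseract OCR engine has to be installed as well. See this guide for the steps:
https://blog.csdn.net/luanyongli/article/details/81385284
The guide is long, but the steps go quickly. If a test on an image still fails after installation, delete the files and install again (don't ask how I know).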
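One thing to watch once pytesseract works: `image_to_string` returns raw text, which can contain spaces and newlines around the digits. A minimal sketch of a cleanup step follows. `clean_ocr_digits` is an illustrative helper name, and the sample string is made up rather than real OCR output.

```python
def clean_ocr_digits(text, expected=10):
    """Keep only the digit characters and check that all ten slots were read."""
    digits = [ch for ch in text if ch in '0123456789']
    if len(digits) != expected:
        raise ValueError(f'expected {expected} digits, got {len(digits)}: {text!r}')
    return digits


# Made-up sample that imitates OCR output with a space and a trailing newline.
print(clean_ocr_digits('86520 39147\n'))
```

Filtering into a new list avoids calling `remove` on a list while looping over it, which silently skips elements. The length check makes a bad OCR read fail loudly instead of producing wrong prices.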

If you run into problems, leave a comment and we can work through them together!

Without further ado, here is the code:

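```python
# The two screenshots that held the imports and the class header are missing,
# so this header is a reconstruction: the class name and the User-Agent value
# are placeholders, not the author's originals.
import re

import pymysql
import pytesseract
import requests
from lxml import etree
from PIL import Image


class ZiroomSpider:
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder headers
        self.content = ''    # HTML of the last fetched page
        self.real_num = {}   # background-position offset -> digit

    def get_html(self, url):
        """Fetch a listing page, cache its HTML, and return the digit-sprite URL."""
        resp = requests.get(url=url, headers=self.headers, timeout=10)
        if resp.status_code != 200:
            print('Request failed!!', resp.status_code)
            return None
        self.content = resp.text
        # Keep a local copy of the page for debugging.
        with open('data.html', 'w', encoding='utf-8') as f:
            f.write(resp.text)
        # Every price digit is a <span class="num"> that shares one sprite image.
        match = re.search(
            r'<span class="num" style="background-image: url\((.*?)\);background-position:.*?"></span>',
            resp.text)
        if match is None:
            print('Price sprite not found')
            return None
        # The sprite URL has no scheme in the page source, so prepend one.
        return 'http:' + match.group(1)
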
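    def download_img(self, image_url):
        """Download the digit sprite and return the local file name."""
        resp = requests.get(url=image_url, timeout=10)
        file_name = image_url.split('/')[-1]
        # The with block closes the file on exit, so no explicit close() is needed.
        with open(file_name, 'wb') as f:
            f.write(resp.content)
        return file_name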

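    def parse_img(self, file_name):
        """OCR the digit sprite and map each CSS offset to the digit it shows."""
        image = Image.open(file_name)
        text = pytesseract.image_to_string(image)
        # Keep digits only: OCR output may contain spaces and newlines, and
        # removing items from a list while iterating over it skips elements.
        nums = [ch for ch in text if ch.isdigit()]
        # Offsets of the ten sprite slots, left to right (21.4px per digit).
        self.offset = ['-0px', '-21.4px', '-42.8px', '-64.2px', '-85.6px',
                       '-107px', '-128.4px', '-149.8px', '-171.2px', '-192.6px']
        self.real_num = dict(zip(self.offset, nums))
        # print(self.real_num)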

    def get_data(self):
        """Parse the cached page and save one record per listing."""
        contents = etree.HTML(self.content)
        names = contents.xpath('.//div[2]/h5/a/text()')
        floor_sizes = contents.xpath(
            '/html/body/section/div[3]/div[2]/div/div[2]/div[1]/div[1]/text()')
        locations = contents.xpath(
            '//section/div[3]/div[2]/div/div[2]/div/div[2]/text()')
        # One HTML fragment per listing, holding that price's digit spans.
        price_blocks = re.findall(
            r'<span class="rmb">¥</span>(.*?)</div>', self.content, re.S)
        real_prices = []
        for block in price_blocks:
            offsets = re.findall('background-position: (.*?)"></span>', block)
            # Look up each offset in the table built by parse_img.
            real_prices.append(''.join(self.real_num[o] for o in offsets))
        # Distinct loop names keep the result lists from being shadowed.
        for name, price, floor_size, location in zip(
                names, real_prices, floor_sizes, locations):
            try:
                infos = name.split('·')        # "<style>·<name>"
                items = floor_size.split('|')  # "<size>㎡|<floor>"
                size = re.findall(r'(.*?)㎡', str(items[0]))[0]
                data = {
                    'style': str(infos[0]),
                    'name': str(infos[1]),
                    'floor': str(items[1]),
                    'size': float(size),
                    'price': float(price),
                    'location': location.replace('\n', '').replace('\t', '').strip(),
                }
            except (IndexError, ValueError):
                # Skip a malformed listing instead of aborting the whole page.
                print('Bad record:', name)
                continue
            self.save(data)

    def save(self, data):
        """Insert one record into the ziru table (use your own credentials)."""
        db = pymysql.connect(host='localhost', port=3306,
                             db='pengwei', user='pengwei', password='pengwei')
        try:
            with db.cursor() as cursor:
                cursor.execute(
                    "INSERT INTO ziru VALUES (%s, %s, %s, %s, %s, %s)",
                    (data['name'], data['style'], data['size'],
                     data['price'], data['floor'], data['location'])
                )
            db.commit()
        finally:
            # Close the connection even if the insert fails.
            db.close()
```
![image.png](https://imgconvert.csdnimg.cn/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy8yMzczNjY2Mi0yMTNkYzFjMjU1OGY2MzBjLnBuZw?x-oss-process=image/format,png)
