A Simple Crawler for Rentals, Second-Hand Goods, and Home Sales on SMTH (newsmth)

maray

Category: Programming and Applications. Tags: crawler

Article link: https://blog.csdn.net/maray/article/details/53163024

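I was getting ready to rent an apartment and was unhappy with most of what the agents offered, so I figured I might as well look on SMTH (newsmth.net) myself. I wrote a small crawler that automatically picks out rental posts mentioning the residential communities listed below.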

keywords.txt contains every keyword I care about, one per line:

北京北
雍和宫
明天第一城
清水园
奥北
润枫欣尚
合立方
顶秀青溪
嘉城花园
东辰
旭辉奥都
佳运
林奥嘉园
拂林园
天月园
天溪园
天畅园
天居园
美立方
望春园
华发硕园
金泉家园
安慧北里逸园
安慧北里
北苑
团结湖
呼家楼
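
The matching itself is plain substring search: a post is kept when its text contains any line from keywords.txt. Here is a minimal Python 3 sketch of that idea. The helper names `load_keywords` and `first_match` and the two sample titles are made up for illustration and are not part of the original script.

```python
def load_keywords(text):
    """Return the non-empty, stripped lines of a keywords file's contents."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def first_match(title, keywords):
    """Return the first keyword contained in title, or None if none match."""
    for kw in keywords:
        if kw in title:
            return kw
    return None


# Hypothetical keywords file contents, including a stray blank line.
sample = "北苑\n团结湖\n\n呼家楼\n"
keywords = load_keywords(sample)

# Hypothetical post titles.
print(first_match("北苑家园两居出租", keywords))
print(first_match("望京一居室", keywords))
```

Because this is substring matching, a short keyword such as 安慧北里 already covers longer names that contain it, such as 安慧北里逸园.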

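The complete code follows. Remember to change the directory paths first, and make sure the BeautifulSoup package is installed (the script also asks for the html5lib parser). Look up the installation steps yourself.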

# coding:utf-8
# Filename: my_crawl.py
# Function: small rental crawler
# Author: hustos@qq.com
# Weibo: OceanBase晓楚
# WeChat: hustos

from bs4 import BeautifulSoup
import re
import sys
import urllib
import time
import random

reload(sys)
sys.setdefaultencoding("utf-8")

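# Other boards can be crawled too; just uncomment the one you want below

# Second-hand homes for sale
# board = 'OurHouse'

# Main second-hand market board
# board = 'SecondMarket'

# Rentals
board = 'HouseRent'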


keywords = []
matched = []
final = []

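# Load the keywords, one per line. Blank lines are skipped because an
# empty keyword would match every post.
for kw in open('/home/wwwroot/rent/keywords.txt').readlines():
    if kw.strip():
        keywords.append(kw.strip())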

# print keywords[0]


#soup = BeautifulSoup(open('pg2.html'), "html5lib")

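# Crawl pages 1 through 9 of the board on the mobile site.
for page in range(1, 10):
    url = 'http://m.newsmth.net/board/%s?p=%s' % (board, page)
    data = urllib.urlopen(url).read()
    # print data
    soup = BeautifulSoup(data, "html5lib")
    # Keep every article link whose markup contains one of the keywords.
    for a in soup.find_all(href=re.compile("/article/" + board)):
        item = a.encode('utf-8')
        for kw in keywords:
            if item.find(kw) >= 0:
                matched.append(item)
                break  # one matching keyword is enough
    # Pause 5 to 15 seconds between pages to go easy on the server.
    time.sleep(5 + 10 * random.random())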

# Drop duplicate links while preserving their order.
for item in matched:
    if item not in final:
        final.append(item)

html = "<html><head><meta charset='UTF-8' /><title>Rentals</title><base href='http://m.newsmth.net/' /></head><body>"
html += "<br/>".join(final)
html += "<p>last update at %s </p><p><a href='http://m.newsmth.net/board/%s'>newsmth (水木社区)</a></p>" % (time.strftime('%Y-%m-%d %X', time.localtime()), board)
html += "</body></html>"

output = open('/home/wwwroot/rent/index.html', 'w')
output.write(html)
output.close()

# notify: ping the user once the crawl has finished
# notifyUrl = "http://m.xxx.cn/rent"
# data = urllib.urlopen(notifyUrl).read()
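
The listing above is Python 2 code: `reload(sys)`, `sys.setdefaultencoding` and `urllib.urlopen` are not available in Python 3. As a rough illustration of the same link-filtering step on Python 3, here is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser`, so it runs without extra packages. The class name `ArticleLinkParser`, the sample HTML and the article ids are made up for illustration. Unlike the original, which searches the whole serialized tag, this sketch matches keywords against the link text only.

```python
import re
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags whose href contains /article/<board>."""

    def __init__(self, board):
        super().__init__()
        self.pattern = re.compile("/article/" + re.escape(board))
        self.links = []
        self._href = None   # href of the article link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.pattern.search(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page fragment; the real site's markup may differ.
sample_html = (
    '<ul>'
    '<li><a href="/article/HouseRent/123">北苑两居出租</a></li>'
    '<li><a href="/article/HouseRent/124">望京一居</a></li>'
    '<li><a href="/board/HouseRent?p=2">next</a></li>'
    '</ul>'
)

parser = ArticleLinkParser("HouseRent")
parser.feed(sample_html)

keywords = ["北苑", "团结湖"]
matched = [(href, text) for href, text in parser.links
           if any(kw in text for kw in keywords)]
print(matched)
```

To fetch a real page on Python 3, `urllib.request.urlopen(url).read()` takes the place of `urllib.urlopen(url).read()`, and the returned bytes need to be decoded before being fed to the parser.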

Final directory layout:

[raywill@rent]# ls
index.html  keywords.txt  my_crawl.py

Run (the script targets Python 2):

python my_crawl.py

git

https://github.com/raywill/crawl_smth

Extensions

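To run it automatically, add it to cron.
To get automatic notifications, hook it up to WeChat, Slack, Jianliao (简聊), or a similar tool.
To be alerted only when something has changed, add a comparison step.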
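
The comparison step could look roughly like the following Python 3 sketch: remember the items from the previous run in a small JSON file and report only the ones that were not there before. The function name `new_items`, the state file name `seen.json` and the `post-N` placeholders are assumptions for illustration. In the crawler, the list passed in would be the deduplicated matches.

```python
import json
import os
import tempfile


def new_items(current, state_path):
    """Return the items absent from the previous run, then save the current list."""
    previous = []
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            previous = json.load(f)
    seen = set(previous)
    fresh = [item for item in current if item not in seen]
    with open(state_path, "w", encoding="utf-8") as f:
        json.dump(current, f, ensure_ascii=False)
    return fresh


# Demo with a throwaway state file and placeholder items.
state_path = os.path.join(tempfile.mkdtemp(), "seen.json")
first = new_items(["post-1", "post-2"], state_path)
second = new_items(["post-1", "post-2", "post-3"], state_path)
print(first)
print(second)
```

A notification would then be sent only when the returned list is non-empty.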
