It's the dreaded moving season again. After suffering at the hands of shady rental agents, I heard that Douban groups carry rental listings, so I went to take a look. The group left me dazed: thread after thread of listings, and I'd have to scan them one by one for the neighborhood I wanted. So I decided to write a small script to search them quickly.
Here's the final code:
#coding=utf8
__author__ = ''
__date__ = '2018/5/5'

import requests
from bs4 import BeautifulSoup

def getPicture():
    # Open the output file that collects matching post links.
    result = open("东坝.txt", "w")
    links = []
    # Proxy to keep the local IP from getting banned; free proxies like
    # this one expire quickly, so swap in a live address.
    proxie = {
        'https': 'http://101.81.141.175:9999'
    }
    # Douban pages the discussion list 15 posts at a time via ?start=N.
    for pageindex in range(0, 1500, 15):
        url = "http://www.douban.com/group/beijingzufang/discussion"
        Page = {'start': pageindex}
        wbdata = requests.get(url, params=Page, proxies=proxie).text
        soup = BeautifulSoup(wbdata, 'html.parser')
        # Each post title is an <a> directly inside <td class="title">.
        subject_titles = soup.select("td.title > a")
        tag1 = u"东坝"
        tag2 = u"独卫"
        for n in subject_titles:
            title = n.get("title")
            link = n.get("href")
            # Keep posts whose title mentions both keywords; skip duplicates.
            if tag1 in title and tag2 in title and link not in links:
                result.write(link + "\n")
                links.append(link)
    result.close()

getPicture()
Screenshot:
A quick walkthrough of the code:
1. Open a file in write mode to save the scrape results:
    result = open("东坝.txt", "w")
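As an aside, a `with` block closes the file automatically even if a request raises partway through the loop, which is safer than pairing `open()` with a manual `close()`. A minimal sketch (the temp path and the sample link are illustrative, not from the script):

```python
import os
import tempfile

# Write one result link the same way the script does, but inside a
# `with` block so the file is closed even on an exception.
path = os.path.join(tempfile.gettempdir(), "dongba_links.txt")
with open(path, "w", encoding="utf-8") as result:
    result.write("https://www.douban.com/group/topic/12345/\n")

# Read it back to confirm what was saved.
with open(path, encoding="utf-8") as f:
    saved = f.read().splitlines()
```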
2. Set a proxy address. Many sites have anti-scraping measures, and without a proxy your IP can easily get banned. You can find free proxy addresses online (they die quickly, so pick a live one):
    proxie = {
        'https': 'http://101.81.141.175:9999'
    }
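Since free proxies fail often, it can help to fall back to a direct request instead of crashing the whole scrape. A sketch under that assumption; `build_proxies` and `fetch` are helper names I'm introducing here, not part of the original script:

```python
import requests

def build_proxies(proxy):
    # requests expects a scheme -> proxy-URL mapping; reuse the same
    # address for both schemes. None means a direct connection.
    return {"http": proxy, "https": proxy} if proxy else None

def fetch(url, params=None, proxy=None, timeout=10):
    # Try through the proxy first; if it fails, retry directly.
    proxies = build_proxies(proxy)
    try:
        return requests.get(url, params=params, proxies=proxies, timeout=timeout)
    except requests.exceptions.RequestException:
        if proxies is None:
            raise
        return requests.get(url, params=params, timeout=timeout)
```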
3. Build the URL and its query parameters, request the target page, and parse it with BeautifulSoup. Douban's discussion list pages through posts via the `start` parameter:
    url = "http://www.douban.com/group/beijingzufang/discussion"
    Page = {'start': pageindex}
    wbdata = requests.get(url, params=Page, proxies=proxie).text
    soup = BeautifulSoup(wbdata, 'html.parser')
    subject_titles = soup.select("td.title > a")
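The `td.title > a` selector can be checked offline against a tiny HTML fragment. The markup below is a simplification of the real discussion-table structure, not copied from Douban:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for Douban's discussion table.
html = """
<table>
  <tr><td class="title">
    <a href="https://www.douban.com/group/topic/1/" title="东坝 主卧 独卫">东坝 主卧 独卫...</a>
  </td></tr>
  <tr><td class="title">
    <a href="https://www.douban.com/group/topic/2/" title="望京 合租">望京 合租...</a>
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# Same selector as the script: <a> tags directly under <td class="title">.
titles = [a.get("title") for a in soup.select("td.title > a")]
hrefs = [a.get("href") for a in soup.select("td.title > a")]
```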
4. Set the filter conditions, then filter and deduplicate the results and write the matches to the output file:
    tag1 = u"东坝"
    tag2 = u"独卫"
    for n in subject_titles:
        title = n.get("title")
        link = n.get("href")
        if tag1 in title and tag2 in title and link not in links:
            result.write(link + "\n")
            links.append(link)
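One easy refinement: a `set` makes the duplicate check O(1) instead of the O(n) `link not in links` scan over a list, and a keyword list generalizes the hard-coded tag1/tag2 pair. The `filter_links` helper below is my own sketch of that idea, not code from the script:

```python
def filter_links(entries, keywords):
    # `entries` is a list of (title, link) pairs; keep links whose title
    # contains every keyword, skipping links already seen.
    seen = set()
    kept = []
    for title, link in entries:
        # `title` can be None if the attribute is missing, so guard it.
        if title and all(k in title for k in keywords) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept
```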
This little script is still fairly rough; I'll keep polishing it when I have time.