Lianjia Listing Data Collection and Proxy Optimization
1. Project Background and Key Challenges
When doing real-estate data analysis, we often need to collect listing data at scale. This post shares practical experience from a Lianjia listing-scraping project, focusing on the following challenges:
- IP access-frequency limits
- Stability of the collection process
- Large-scale data processing
2. Basic Crawler Implementation
Let's start with the basic version of the code:
import requests
from lxml import etree
import pandas as pd

def main(page):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
        # ... other request headers
    }
    response = requests.get(f'https://sh.lianjia.com/ershoufang/pg{page}/', headers=headers)
    tree = etree.HTML(response.text)
    items = tree.xpath('//div[@class="info clear"]')
    # ... data-parsing code
3. Main Problems Encountered
In actual operation we ran into the following problems:
- Our IP got banned, interrupting collection (a simple retry helper is sketched after this list)
- Responses slowed down under heavy request volume
- Data from multiple cities had to be collected
- The data had to stay fresh
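A first, proxy-free mitigation is simply to retry a failed page with an increasing delay. This is only an illustrative sketch: the fetch_page helper and the crude "any non-200 status means we are blocked" heuristic are assumptions, not part of the original project.

import time
import requests

def fetch_page(url, headers, max_retries=3):
    """Retry with exponential backoff; treat any non-200 status as a possible block."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.exceptions.RequestException:
            pass
        time.sleep(2 ** attempt * 5)  # wait 5s, 10s, 20s between attempts
    return None  # still blocked after all retries

Backoff alone buys some headroom, but once the rate limit is tied to the IP it only delays the ban, which is why the next section introduces a proxy service.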
4. Optimization: Introducing a Proxy Service
After comparing several options, we chose LunaProxy to route the crawler's traffic. The improved code:
import requests
from lxml import etree
import pandas as pd

def get_proxy():
    # Proxy configuration
    proxies = {
        'http': 'http://username:password@proxy.lunaproxy.com:port',
        'https': 'http://username:password@proxy.lunaproxy.com:port'
    }
    return proxies

def main(page):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
    }
    # Send the request through the proxy
    proxies = get_proxy()
    response = requests.get(
        f'https://sh.lianjia.com/ershoufang/pg{page}/',
        headers=headers,
        proxies=proxies,
        timeout=10
    )
    # The data-processing logic stays the same
    tree = etree.HTML(response.text)
    items = tree.xpath('//div[@class="info clear"]')
    data = []
    for item in items:
        # ... data-parsing code
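The version above points at a single gateway address. If your plan exposes several ports or session entries, get_proxy can be extended to rotate among them so consecutive requests exit from different IPs. The endpoint list below is a placeholder, not real account data:

import random

# Placeholder endpoints; fill in the ports/credentials from your own proxy dashboard.
PROXY_ENDPOINTS = [
    'http://username:password@proxy.lunaproxy.com:10000',
    'http://username:password@proxy.lunaproxy.com:10001',
]

def get_proxy():
    """Pick a random endpoint so consecutive requests can come from different exit IPs."""
    endpoint = random.choice(PROXY_ENDPOINTS)
    return {'http': endpoint, 'https': endpoint}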
5. Performance Gains
Notable improvements after the change:
- Collection success rate improved by more than 90%
- Multiple cities can be collected concurrently (see the thread-pool sketch after this list)
- Data update latency dropped to the minute level
- IP bans are no longer a concern
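To make the multi-city point concrete, here is a minimal sketch that fetches the first few listing pages for several cities in parallel with a small thread pool. It reuses get_proxy() from above; the city list and pool size are illustrative, parsing and saving are omitted, and Lianjia's per-city subdomains (sh, bj, sz, ...) are the only assumption about the URLs.

from concurrent.futures import ThreadPoolExecutor
import requests

CITIES = ['sh', 'bj', 'sz']  # illustrative: Shanghai, Beijing, Shenzhen subdomains

def crawl_city(city, pages=5):
    """Fetch the first few listing pages for one city; parsing/saving would reuse main()'s logic."""
    for page in range(1, pages + 1):
        url = f'https://{city}.lianjia.com/ershoufang/pg{page}/'
        resp = requests.get(url, proxies=get_proxy(), timeout=10)
        print(city, page, resp.status_code)

# One worker per city; the rotating proxy keeps exit IPs distinct across threads.
with ThreadPoolExecutor(max_workers=len(CITIES)) as pool:
    pool.map(crawl_city, CITIES)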
6. Factors in Choosing a Proxy Solution
The main considerations when picking LunaProxy as the proxy provider:
- A large pool of residential IPs that blend in with normal user traffic
- City-level IP targeting (the exit-IP check sketched after this list helps verify what the target site actually sees)
- Fast response times, suitable for crawling
- Good value for money, with pay-as-you-go billing
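Whichever provider you choose, it is worth confirming which exit IP the target site will see before pointing the crawler at it. A small check against the public httpbin.org/ip echo service (any similar IP-echo endpoint works):

import requests

def check_exit_ip(proxies=None):
    """Return the IP address a remote server sees for this client."""
    resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    return resp.json()['origin']

# Compare the direct IP with the proxied one:
# print(check_exit_ip(), check_exit_ip(get_proxy()))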

7. Reference Code
import requests
from lxml import etree
import pandas as pd
import random
import time

def get_proxy():
    """Build the LunaProxy proxy configuration."""
    username = "lu8389246"
    password = "eRedqj"
    port = "10000"  # replace with your actual port number
    proxies = {
        'http': f'http://{username}:{password}@proxy.lunaproxy.com:{port}',
        'https': f'http://{username}:{password}@proxy.lunaproxy.com:{port}'
    }
    return proxies

def main(page):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
        'Connection': 'keep-alive',
        'Referer': 'https://sh.lianjia.com/ershoufang/pg2/',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0',
        'sec-ch-ua': '"Chromium";v="134", "Not:A-Brand";v="24", "Microsoft Edge";v="134"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }
    page = str(page)
    proxies = get_proxy()
    try:
        # Send the request through the proxy
        response = requests.get(
            f'https://sh.lianjia.com/ershoufang/pg{page}/',
            headers=headers,
            proxies=proxies,
            timeout=10  # set a request timeout
        )
        # Check the response status
        if response.status_code != 200:
            print(f"Request failed, status code: {response.status_code}")
            return
        tree = etree.HTML(response.text)
        items = tree.xpath('//div[@class="info clear"]')
        data = []
        for item in items:
            try:
                title = item.xpath('.//div[@class="title"]/a/text()')[0].strip()
                position1 = item.xpath('./div[@class="flood"]/div[@class="positionInfo"]/a[1]/text()')[0].strip()
                position2 = item.xpath('./div[@class="flood"]/div[@class="positionInfo"]/a[2]/text()')[0].strip()
                houseInfo = item.xpath('./div[@class="address"]/div[@class="houseInfo"]/text()')[0].strip()
                followInfo = item.xpath('./div[@class="followInfo"]/text()')[0].strip()
                spans = item.xpath('./div[@class="tag"]/span/text()')
                totalPrice = item.xpath('./div[@class="priceInfo"]/div[@class="totalPrice totalPrice2"]/span/text()')[0].strip()
                unitPrice = item.xpath('./div[@class="priceInfo"]/div[@class="unitPrice"]/span/text()')[0].strip()
                print(f"Scraped: {title}")
                data.append({
                    'Title': title,
                    'Position1': position1,
                    'Position2': position2,
                    'HouseInfo': houseInfo,
                    'FollowInfo': followInfo,
                    'Tags': ', '.join(spans),
                    'TotalPrice': totalPrice,
                    'UnitPrice': unitPrice
                })
            except IndexError as e:
                print(f"Failed to parse an item: {e}")
                continue
        # Build a DataFrame and save it
        if data:
            df = pd.DataFrame(data)
            df.to_excel(f'lianjia_page_{page}.xlsx', index=False)
            print(f"Page {page} saved to Excel")
        else:
            print("No data retrieved")
        # Random delay to avoid sending requests too quickly
        time.sleep(random.uniform(2, 5))
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

def batch_crawl(start_page, end_page):
    """Scrape multiple pages of data in sequence."""
    for page in range(start_page, end_page + 1):
        print(f"\nStarting page {page}...")
        main(page)

if __name__ == '__main__':
    # Single-page scrape
    # main(1)
    # Batch scrape
    batch_crawl(1, 5)  # scrape pages 1-5
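Note: df.to_excel() writes .xlsx files through the openpyxl engine, so openpyxl needs to be installed alongside requests, lxml, and pandas.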