gitee: https://gitee.com/livingbody/district-information-crawling
URL analysis
https://xiangxi.loupan.com/info/7075218.html
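1. Workflow analysis
Search for the development. If it exists, open its detail page and scrape the development information.
If it does not exist, stop.
Fields to collect: development name (楼盘名称), building type (建筑类型), site area (占地面积), reference starting price (参考起价), plot ratio (容积率)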
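The flow above can be sketched as a small function. This is only an illustration: `crawl_one`, `search`, and `scrape` are hypothetical names, and the two callables stand in for the real search and detail-page requests so the control flow can be read (and tried) without touching the network.

```python
from typing import Callable, Optional


def crawl_one(
    name: str,
    search: Callable[[str], Optional[str]],
    scrape: Callable[[str], dict],
) -> Optional[dict]:
    """Search for a development; scrape its detail page only if it was found."""
    detail_url = search(name)
    if detail_url is None:
        # Not found: stop here, as in the workflow
        return None
    return scrape(detail_url)
```

Passing the two steps in as arguments keeps the "found / not found" decision in one place and makes it easy to swap in stubs while testing.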
2. Submit the search (a GET request with the query in the URL)
https://xiangxi.loupan.com/xinfang/?q=吉首碧桂园
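The development name has to be percent-encoded before it goes into the query string. A minimal sketch, assuming the `q` parameter shown in the URL above is all the search needs; `build_search_url` and `SEARCH_BASE` are names made up for this example.

```python
import urllib.parse

SEARCH_BASE = "https://xiangxi.loupan.com/xinfang/"


def build_search_url(name: str) -> str:
    """Return the search URL with the development name percent-encoded."""
    return SEARCH_BASE + "?q=" + urllib.parse.quote(name)


url = build_search_url("吉首碧桂园")
print(url)
```

`urllib.parse.quote` encodes the name as UTF-8 by default, so the resulting URL contains only ASCII characters.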
3. Get the search results
- Building type (建筑类型): high-rise (高层), low-rise (低层), slab building (板楼)
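A building-type entry like the one above carries several values after one label. The sketch below splits such a string into a list; it assumes the values are separated by whitespace, and `parse_building_types` is a hypothetical helper, not part of the original crawler.

```python
from typing import List


def parse_building_types(text: str) -> List[str]:
    """Split '建筑类型:高层 低层 板楼' into ['高层', '低层', '板楼']."""
    # Accept either the full-width or the ASCII colon after the label
    normalized = text.replace(":", ":")
    label, _, value = normalized.partition(":")
    if label.strip() != "建筑类型":
        return []
    return value.split()
```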
4. Visit the development's info page
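The info page follows the pattern https://xiangxi.loupan.com/info/<id>.html, so its URL can be derived from the ID at the end of the search-result link. A minimal sketch; `info_url_from_link` is a hypothetical name, and the `/loupan/` path in the usage line is an assumed link shape used only for illustration.

```python
def info_url_from_link(link: str) -> str:
    """Build the /info/<id>.html URL from a search-result link."""
    # The ID is the file name of the link without its extension
    page_id = link.rstrip("/").split("/")[-1].split(".")[0]
    return "https://xiangxi.loupan.com/info/" + page_id + ".html"


# Assumed link shape; only the trailing "7075218.html" matters here
print(info_url_from_link("https://xiangxi.loupan.com/loupan/7075218.html"))
```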
5. Extract the development data
import re
import pandas as pd
import requests
import collections
from bs4 import BeautifulSoup
import urllib.parse as urp
import json
import csv


def read_name(filename):
    # Read the "name" column of the CSV file as a NumPy array
    data = pd.read_csv(filename)
    data = data["name"]
    return data.to_numpy()


# Search for the development and return the link to its page
def get_district_info(district_name):
    url = "https://xiangxi.loupan.com/xinfang/?q=" + urp.quote(district_name)
    response = requests.get(url)
    if response.status_code == 200:
        html = response.text
        text = BeautifulSoup(html, 'lxml')
        for link in text.find_all(name='a'):
            # Return the first link whose text matches the name exactly
            if link.get_text() == district_name:
                return link['href']
    # No matching link, or the request failed
    return None


def main(district_name):
    tmp = []
    url = get_district_info(district_name)
    if url is None:
        # Development not found: stop, as described in the workflow
        return
    info_id = url.split("/")[-1].split(".")[0]
    url = "https://xiangxi.loupan.com/info/" + str(info_id) + ".html"
    print(url)
    response = requests.get(url)
    if response.status_code == 200:
        html = response.text
        text = BeautifulSoup(html, 'html.parser')
        tmp.append(district_name)
        for li in text.find_all('li'):
            # Split on the label: str.strip(label) removes a set of
            # characters from both ends, not the label as a prefix
            if "建筑类型:" in li.get_text():
                jzlx = li.get_text().split("建筑类型:", 1)[1].strip()
                tmp.append(jzlx)
            elif "占地面积:" in li.get_text():
                zdmj = li.get_text().split("占地面积:", 1)[1].strip()
                tmp.append(zdmj)
            elif "参考起价:" in li.get_text():
                ckjg = li.get_text().split("参考起价:", 1)[1].strip()
                tmp.append(ckjg)
            elif "容积率:"
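The extraction step can also be tried on plain strings, without fetching a page. The sketch below is one possible way to pull the value that follows each label; `LABELS`, `extract_fields`, and the sample strings are made up for illustration, and in the crawler the strings would come from `li.get_text()`.

```python
from typing import Dict, Iterable

# Labels as they appear in the page text (full-width colon included)
LABELS = ("建筑类型:", "占地面积:", "参考起价:", "容积率:")


def extract_fields(li_texts: Iterable[str]) -> Dict[str, str]:
    """Map each known label to the text that follows it."""
    fields: Dict[str, str] = {}
    for text in li_texts:
        for label in LABELS:
            if label in text and label not in fields:
                # Take what follows the label; str.strip(label) would
                # instead remove any of the label's characters from both ends
                fields[label] = text.split(label, 1)[1].strip()
                break
    return fields


# Made-up sample data, not taken from the site
sample = ["建筑类型:高层建筑", "容积率:2.5", "其他"]
result = extract_fields(sample)
print(result)
```

Splitting on the label matters for values such as 高层建筑: `"建筑类型:高层建筑".strip("建筑类型:")` also eats the trailing 建筑, because `strip` treats its argument as a set of characters.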
