Lianjia community listings URL: https://m.lianjia.com/bj/xiaoqu/
GitHub: https://github.com/why19970628/Python_Crawler/tree/master/LianJia
Goal: collect data on the residential communities (小区) in every district of Beijing
1. Crawl the link for each district
2. Crawl the links of the communities within each district
3. Crawl each community's detail page
4. Run the crawl
Crawling Lianjia is fairly slow, roughly one page per second, so we can try multithreading or multiprocessing to improve throughput.
- Multithreading
5. Save the results to a file
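Step 1 can be sketched in isolation. The miniature HTML below is made up for illustration (the real crawl fetches https://m.lianjia.com/bj/xiaoqu/ first), but it shows how the XPath expressions pull the district names and links out of the district menu:

```python
# Minimal sketch of step 1: extract district names and links with lxml XPath.
# The sample HTML is a made-up miniature of the real page's district menu.
from lxml import etree

sample = """
<ul class="level2 active">
  <li><a href="/bj/xiaoqu/">不限</a></li>
  <li><a href="/bj/xiaoqu/chaoyang/">朝阳</a></li>
  <li><a href="/bj/xiaoqu/haidian/">海淀</a></li>
</ul>
"""

tree = etree.HTML(sample)
# [1:] drops the leading "不限" (no filter) entry, mirroring the
# [1:-1] slice used against the full page in the script below
links = tree.xpath('//ul[@class="level2 active"]/li/a/@href')[1:]
areas = tree.xpath('//ul[@class="level2 active"]/li/a/text()')[1:]
print(list(zip(areas, links)))
# [('朝阳', '/bj/xiaoqu/chaoyang/'), ('海淀', '/bj/xiaoqu/haidian/')]
```

Pairing each district name with its link this way is what drives the per-district page loop in step 2.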
from urllib import request, parse

import time

import pandas
from lxml import etree

BASE = "https://m.lianjia.com"
housedetail = []

def run():
    url = BASE + "/bj/xiaoqu/"
    headers = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    }
    req = request.Request(url, headers=headers)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    # district links and names; [1:-1] drops the "no filter" entries
    links = html.xpath('//ul[@class="level2 active"]/li/a/@href')[1:-1]
    areas = html.xpath('//ul[@class="level2 active"]/li/a/text()')[1:-1]
    for area, link in zip(areas, links):
        for n in range(1, 3):
            start1 = time.time()
            print('Thread 1: crawling ' + area + ', page ' + str(n) + ' ...')
            # build the page URL from the district's own link; the original
            # rebuilt the citywide URL here, so every district re-crawled
            # the same pages (urljoin handles relative and absolute hrefs)
            url = parse.urljoin(BASE, link) + 'pg' + str(n) + '/'
            headers['Referer'] = url
            req = request.Request(url, headers=headers)
            html = request.urlopen(req).read().decode('utf-8')
            html = etree.HTML(html)
            detail_links = html.xpath('//li[@class="pictext"]/a/@href')
            for i in detail_links:
                result = {'area': area}
                detail_url = parse.urljoin(BASE, i)
                req = request.Request(detail_url, headers=headers)
                html = request.urlopen(req).read().decode('utf-8')
                html = etree.HTML(html)
                result['name'] = html.xpath('//div[@class="xiaoqu_head_title lazyload_ulog"]/h1/text()')[0]
                result['address'] = html.xpath('//div[@class="xiaoqu_head_title lazyload_ulog"]/p[@class="xiaoqu_basic"]/span/text()')[0]
                result['price'] = html.xpath('//div[@class="xiaoqu_price"]/p/span/text()')[0]
                a = html.xpath('//p[@class="text_cut"]/span[@class="sub_title"]/text()')[0]
                b = html.xpath('//p[@class="text_cut"]/em/text()')[0]
                result['jianzhu'] = str(a) + str(b)  # building type / year built
                result['num'] = html.xpath('//div[@class="worth_card"]/div[@class="worth_guide"]/ul/li/text()')[3]
                result['link'] = detail_url
                housedetail.append(result)
                time.sleep(0.5)  # throttle requests to be polite
            print('Finished ' + area + ' page ' + str(n) + ', took:', time.time() - start1)
    df = pandas.DataFrame(housedetail)
    df.to_csv('housedata5.csv', index=False)

run()
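The single-threaded loop above can be parallelized roughly as follows. This is only a sketch: `fetch` stands in for the per-community request-and-parse logic, and the function names and the two-thread split are illustrative, not part of the original script:

```python
# Multithreading sketch: split the (area, link) pairs across worker
# threads so several districts are crawled concurrently. `fetch` is a
# hypothetical stand-in for the real request + XPath parsing.
import threading

def crawl_concurrently(pairs, fetch, n_threads=2):
    results = []
    lock = threading.Lock()

    def worker(chunk):
        for area, link in chunk:
            detail = fetch(area, link)  # network + parsing happens here
            with lock:                  # guard the shared results list
                results.append(detail)

    # round-robin split of the work across n_threads workers
    chunks = [pairs[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# usage sketch: pass the real fetch-and-parse function instead of a dummy
# crawl_concurrently(list(zip(areas, links)), crawl_one_district)
```

Because the crawl is I/O-bound (waiting on HTTP responses), threads help despite the GIL; keep the `time.sleep` throttle per thread so the total request rate stays reasonable.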
Threading reference article: https://blog.csdn.net/yexudengzhidao/article/details/86750810