Lianjia Housing Price Scraper

Ray_Ding

Article link: https://blog.csdn.net/weixin_38983802/article/details/81396040

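A friend is planning to buy a house, so I helped out by scraping listing information from Lianjia. The Python version is 3.6, and openpyxl is used to write the results to Excel. In clears = lj.find_all('li', attrs={'class': 'clear'}), 'li' and 'class': 'clear' identify the element that wraps a single listing (hovering over that element in the browser's inspector highlights all of the listing's information).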
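To make the selector described above concrete, here is a minimal sketch that runs `find_all` against a small, hypothetical HTML fragment. The markup and the values in `SAMPLE_HTML` are made up for illustration; the real Lianjia page is more complex and may have changed since this was written.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not copied from the real site.
SAMPLE_HTML = """
<ul>
  <li class="clear">
    <div class="houseInfo">Listing A</div>
    <div class="totalPrice">500</div>
  </li>
  <li class="clear">
    <div class="houseInfo">Listing B</div>
    <div class="totalPrice">620</div>
  </li>
  <li class="other">Not a listing</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Each matching <li> wraps all the fields of one listing.
listings = soup.find_all('li', attrs={'class': 'clear'})

# A string passed as the second argument to find() filters by CSS class.
titles = [li.find('div', 'houseInfo').get_text() for li in listings]
prices = [li.find('div', 'totalPrice').get_text() for li in listings]
```

The third `<li>` is skipped because its class is not `clear`, which is why the per-listing fields can be read from each element in `listings` without further filtering.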

Scraper

# Scrape second-hand housing listings from Lianjia
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from openpyxl import Workbook

url = 'https://bj.lianjia.com/ershoufang/'
page = 'pg'
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip',
    'Connection': 'close',
    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00146bd600000003582bfd1f'
    }

# Fetch pages 1 to 100 and concatenate their raw HTML
html = b''
for i in range(1, 101):
    a = url + page + str(i) + '/'
    r = requests.get(url=a, headers=headers)
    html += r.content
    # Wait 0.5 seconds between requests
    time.sleep(0.5)

# Parse the fetched page content
lj = BeautifulSoup(html, 'html.parser')

# Each <li class="clear"> element wraps one listing
clears = lj.find_all('li', attrs={'class': 'clear'})
houseInfo = []
guanzhuInfo = []
daikanInfo = []
timeInfo = []
subwayInfo = []
positionInfo = []
totalpriceInfo = []

# Append each listing's fields to the matching list
for clear in clears:
    houseInfo.append(clear.find('div', 'houseInfo').get_text())

    # Text nodes containing the follower count and the viewing count
    guanzhuInfo.append(clear.find(string=re.compile('人关注')))
    daikanInfo.append(clear.find(string=re.compile('次带看')))

    timeInfo.append(clear.find('div', 'timeInfo').get_text())

    # Not every listing has a subway tag, so fall back to an empty string
    temp = clear.find('span', 'subway')
    if temp is None:
        subwayInfo.append('')
    else:
        subwayInfo.append(temp.get_text())

    positionInfo.append(clear.find('div', 'positionInfo').get_text())
    totalpriceInfo.append(clear.find('div', 'totalPrice').get_text())


pd.set_option('display.max_colwidth',5000)

data ={
    'houseInfo': houseInfo,
    'guanzhuInfo': guanzhuInfo, 'daikanInfo': daikanInfo,
    'timeInfo': timeInfo, 'subwayInfo': subwayInfo, 'positionInfo': positionInfo, 'totalpriceInfo': totalpriceInfo
}
# Create the Excel workbook
wb = Workbook()
sheet = wb.active
# Rename the active sheet
sheet.title = "New Sheet"
# Write each list as one row of the sheet
sheet.append(houseInfo)
sheet.append(guanzhuInfo)
sheet.append(daikanInfo)
sheet.append(timeInfo)
sheet.append(subwayInfo)
sheet.append(positionInfo)
sheet.append(totalpriceInfo)

# Save the workbook
wb.save('链家房子信息.xlsx')

print('finish!')
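
The script writes each field list as a single row, so the sheet ends up with one row per field and one column per listing. If one row per listing is preferred, the lists can be transposed first. The following is a minimal sketch of that idea, not part of the original script: `columns_to_rows` is a hypothetical helper, and the values in `sample` are made up to stand in for the scraped lists.

```python
from itertools import zip_longest


def columns_to_rows(columns):
    """Turn {header: [values...]} into a header row plus one row per record."""
    headers = list(columns)
    # zip_longest pads shorter columns with '' so a missing value
    # does not silently drop the remaining listings.
    records = zip_longest(*columns.values(), fillvalue='')
    return [headers] + [list(record) for record in records]


# Hypothetical values standing in for the scraped lists.
sample = {
    'houseInfo': ['Listing A', 'Listing B'],
    'subwayInfo': ['', 'near subway'],
    'totalpriceInfo': ['500', '620'],
}
rows = columns_to_rows(sample)
```

Each element of `rows` could then be passed to `sheet.append(row)` in a loop, giving a header line followed by one line per listing. The `data` dict built in the script has the same shape as `sample`, so it could be used as the input.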
