Python Web Scraping: From Beginner to Practice, Chapter 5 (Part 2)

Target page: https://beijing.anjuke.com/sale/

\xa0 is the non-breaking space character.
The space we normally type is \x20, which lies in the printable ASCII range 0x20~0x7e.
\xa0 belongs to the extended part of the Latin-1 (ISO/IEC 8859-1) character set and represents the non-breaking space (nbsp).
Latin-1 is backward compatible with ASCII (0x20~0x7e). Many of the characters we encounter are Latin-1, for example in MySQL databases.

A simple Latin-1 character set reference table is also available.

\u3000 is the full-width (ideographic) space.
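
A quick way to tell these characters apart in Python is to ask for their official Unicode names; a minimal illustration (not from the book):

import unicodedata

for ch in ['\x20', '\xa0', '\u3000']:
    # unicodedata.name() returns the official Unicode name of a character
    print(repr(ch), unicodedata.name(ch))
# ' ' SPACE
# '\xa0' NO-BREAK SPACE
# '\u3000' IDEOGRAPHIC SPACE

# When cleaning scraped text, both special spaces can be normalized to \x20:
text = 'Chaoyang\xa0District\u3000Beijing'
print(text.replace('\xa0', ' ').replace('\u3000', ' '))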

Code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'https://beijing.anjuke.com/sale/'
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
house_list = soup.find_all('li', class_='list-item')

for house in house_list:
    name = house.find('div', class_='house-title').a.text.strip()
    price = house.find('span', class_='price-det').text.strip()
    price_area = house.find('span', class_='unit-price').text.strip()

    # The details-item div interleaves whitespace text nodes with <span> tags,
    # so the spans sit at odd indices of .contents (see the note below)
    no_room = house.find('div', class_='details-item').span.text
    area = house.find('div', class_='details-item').contents[3].text
    floor = house.find('div', class_='details-item').contents[5].text
    year = house.find('div', class_='details-item').contents[7].text
    # The broker name starts with a \xa0 non-breaking space, so drop it
    broker = house.find('span', class_='broker-name').text
    broker = broker[1:]
    address = house.find('span', class_='comm-address').text.strip()
    # Collapse the \xa0 padding and newline inside the address into two spaces
    address = address.replace('\xa0\xa0\n                    ', '  ')
    tag_list = house.find_all('span', class_='item-tags')
    tags = [tag.text for tag in tag_list]
    print(name, price, price_area, no_room, area, floor, year, broker, address, tags)
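
Why indices 3, 5, and 7? BeautifulSoup's .contents returns every child node, including the whitespace-only text nodes between tags, so the actual <span> elements land at the odd positions. A minimal illustration, with made-up markup standing in for Anjuke's details-item:

from bs4 import BeautifulSoup

html = '''<div class="details-item">
    <span>3 rooms</span>
    <span>88 sqm</span>
    <span>mid floor</span>
    <span>built 2005</span>
</div>'''

div = BeautifulSoup(html, 'lxml').find('div')
for i, node in enumerate(div.contents):
    # Even indices are '\n    ' whitespace strings; odd indices are the spans
    print(i, repr(node))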
Crawling multiple pages:
import requests
from bs4 import BeautifulSoup
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
for i in range(1, 11):
    # Anjuke paginates listing pages as /sale/p1, /sale/p2, ...
    link = 'https://beijing.anjuke.com/sale/p' + str(i)
    r = requests.get(link, headers=headers)
    print('Now scraping page', i)

    soup = BeautifulSoup(r.text, 'lxml')
    house_list = soup.find_all('li', class_='list-item')

    for house in house_list:
        name = house.find('div', class_='house-title').a.text.strip()
        price = house.find('span', class_='price-det').text.strip()
        price_area = house.find('span', class_='unit-price').text.strip()

        no_room = house.find('div', class_='details-item').span.text
        area = house.find('div', class_='details-item').contents[3].text
        floor = house.find('div', class_='details-item').contents[5].text
        year = house.find('div', class_='details-item').contents[7].text
        broker = house.find('span', class_='broker-name').text
        broker = broker[1:]
        address = house.find('span', class_='comm-address').text.strip()
        address = address.replace('\xa0\xa0\n                    ', '  ')
        tag_list = house.find_all('span', class_='item-tags')
        tags = [tag.text for tag in tag_list]
        print(name, price, price_area, no_room, area, floor, year, broker, address, tags)
    # Pause between pages to avoid triggering the site's anti-scraping measures
    time.sleep(10)

Note, however, that this site has anti-scraping measures in place, so the delay between pages may need to be set longer than 10 seconds.
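
One way to soften the fixed rhythm of time.sleep(10) is to randomize the delay. A minimal sketch; the 15-30 second range and the polite_sleep name are assumptions for illustration, not from the book:

import random
import time

def polite_sleep(low=15, high=30):
    # Sleep a random interval so requests don't arrive at a fixed rhythm;
    # the 15-30 second bounds are an assumption to tune against the site
    delay = random.uniform(low, high)
    print('sleeping %.1f seconds' % delay)
    time.sleep(delay)

# Drop-in replacement for the time.sleep(10) at the end of each page loop:
# polite_sleep()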

Self-practice:
# First, extract the link for each listing on the first page
import requests
from bs4 import BeautifulSoup
import pprint

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'https://beijing.anjuke.com/sale/'
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
house_list = soup.find_all('li', class_='list-item')

output_list = []
for house in house_list:
    name = house.find('div', class_='house-title').a.text.strip()
    link = house.find('div', class_='house-title').a['href']
    output_list.append([name, link])

print(output_list)

# width=1 forces pprint to break the nested lists onto separate lines
pprint.pprint(output_list, width=1)

print("------------------------------------------------------\n")

# Next, visit each listing's page and extract its data
import time

for each in output_list:
    title = each[0]
    link = each[1]

    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')

    # The basic-info div holds total price, layout, and size, again at
    # odd .contents indices because of the interleaved whitespace nodes
    total_price = soup.find('div', class_='basic-info clearfix').contents[1].text.strip()
    room = soup.find('div', class_='basic-info clearfix').contents[3].text.strip()
    size = soup.find('div', class_='basic-info clearfix').contents[5].text.strip()

    house_info = soup.find_all('div', class_='houseInfo-content')
    neighbor = house_info[0].text.strip()
    price_per_squarem = house_info[2].text.strip()

    print([title, total_price, room, size, neighbor, price_per_squarem])
    time.sleep(10)  # pause between detail pages; the site blocks rapid requests

print("++++++++++++++++++++++++++++++++++++++\n")