Python Web Scraping from Beginner to Practice, Chapter 5 (Part 2)
https://beijing.anjuke.com/sale/
\xa0 is the non-breaking space.
The space we normally type is \x20, which falls inside the printable ASCII range 0x20~0x7e.
\xa0, by contrast, belongs to the extended part of the latin1 (ISO/IEC 8859-1) character set, where it represents the non-breaking space (nbsp).
latin1 is backward compatible with ASCII (0x20~0x7e) and is very common in practice, for example as the traditional default character set in MySQL databases.
\u3000 is the full-width (CJK) space.
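To see the difference in practice, here is a minimal sketch (mine, not from the book) showing why a plain replace(' ', ...) misses these characters, and two common ways to clean them out of a scraped string:

import unicodedata

# Typical scraped fragment: district, non-breaking spaces, full-width space.
raw = '海淀\xa0\xa0西二旗\u3000板楼'

print('\xa0' == '\x20', '\u3000' == '\x20')   # False False: neither equals the ASCII space

# So replace(' ', ...) misses them; target each character explicitly ...
print(raw.replace('\xa0', ' ').replace('\u3000', ' '))

# ... or fold both into ordinary spaces in one step with NFKC normalization.
print(unicodedata.normalize('NFKC', raw))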
Code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'https://beijing.anjuke.com/sale/'
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
# Each listing on the results page is an <li class="list-item">.
house_list = soup.find_all('li', class_='list-item')
for house in house_list:
    name = house.find('div', class_='house-title').a.text.strip()
    price = house.find('span', class_='price-det').text.strip()
    price_area = house.find('span', class_='unit-price').text.strip()
    no_room = house.find('div', class_='details-item').span.text
    # The element children of details-item sit at odd indices of .contents
    # (see the demonstration after this block).
    area = house.find('div', class_='details-item').contents[3].text
    floor = house.find('div', class_='details-item').contents[5].text
    year = house.find('div', class_='details-item').contents[7].text
    broker = house.find('span', class_='broker-name').text
    broker = broker[1:]  # drop the leading whitespace character
    address = house.find('span', class_='comm-address').text.strip()
    # Collapse the nbsp + newline run separating district from street.
    address = address.replace('\xa0\xa0\n ', ' ')
    tag_list = house.find_all('span', class_='item-tags')
    tags = [i.text for i in tag_list]
    print(name, price, price_area, no_room, area, floor, year, broker, address, tags)
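Why indices 3, 5 and 7? BeautifulSoup keeps the whitespace between child tags as text nodes in .contents, so the actual <span> elements land at the odd positions. A small standalone demonstration (the HTML fragment is made up for illustration):

from bs4 import BeautifulSoup

html = '''<div class="details-item">
    <span>3室1厅1卫</span>
    <span>91m²</span>
    <span>中层(共6层)</span>
    <span>1994年建造</span>
</div>'''

div = BeautifulSoup(html, 'lxml').find('div')
for i, child in enumerate(div.contents):
    print(i, repr(child))
# Indices 0, 2, 4, ... are the '\n    ' text nodes; the tags sit at 1, 3, 5, 7.
# div.find_all('span') would skip the text nodes entirely and is less brittle.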
Scraping multiple pages
import requests
from bs4 import BeautifulSoup
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
# Result pages are addressed as /sale/p1, /sale/p2, ...; scrape the first 10.
for i in range(1, 11):
    link = 'https://beijing.anjuke.com/sale/p' + str(i)
    r = requests.get(link, headers=headers)
    print('Now scraping page', i)
    soup = BeautifulSoup(r.text, 'lxml')
    house_list = soup.find_all('li', class_='list-item')
    for house in house_list:
        name = house.find('div', class_='house-title').a.text.strip()
        price = house.find('span', class_='price-det').text.strip()
        price_area = house.find('span', class_='unit-price').text.strip()
        no_room = house.find('div', class_='details-item').span.text
        area = house.find('div', class_='details-item').contents[3].text
        floor = house.find('div', class_='details-item').contents[5].text
        year = house.find('div', class_='details-item').contents[7].text
        broker = house.find('span', class_='broker-name').text
        broker = broker[1:]
        address = house.find('span', class_='comm-address').text.strip()
        address = address.replace('\xa0\xa0\n ', ' ')
        tag_list = house.find_all('span', class_='item-tags')
        tags = [tag.text for tag in tag_list]
        print(name, price, price_area, no_room, area, floor, year, broker, address, tags)
    # Pause between pages so the requests do not arrive as a burst.
    time.sleep(10)
Note, however, that this site has anti-scraping measures in place, so the delay between requests needs to be set fairly long.
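A fixed 10-second pause is itself a regular, machine-like pattern. A common refinement (my own suggestion, not from the book) is to randomize the delay a little:

import random
import time

def polite_sleep(base=10, jitter=5):
    # Sleep base ± jitter seconds so request timing looks less mechanical.
    time.sleep(base + random.uniform(-jitter, jitter))

# Drop-in replacement for time.sleep(10) at the end of each page iteration.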
Self-practice:
# First, extract the link for every house on the first results page.
import requests
from bs4 import BeautifulSoup
import pprint

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
link = 'https://beijing.anjuke.com/sale/'
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
house_list = soup.find_all('li', class_='list-item')
output_list = []
for house in house_list:
    name = house.find('div', class_='house-title').a.text.strip()
    link = house.find('div', class_='house-title').a['href']  # detail-page URL
    output_list.append([name, link])

print(output_list)
pprint.pprint(output_list, width=1)  # width=1 forces one element per line
print("------------------------------------------------------\n")
# Next, visit each listing's own page and pull data from it.
import time

for each in output_list:
    title = each[0]
    link = each[1]
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # basic-info holds price / layout / size; the element children again sit
    # at odd indices of .contents because of the whitespace text nodes.
    total_price = soup.find('div', class_='basic-info clearfix').contents[1].text.strip()
    room = soup.find('div', class_='basic-info clearfix').contents[3].text.strip()
    size = soup.find('div', class_='basic-info clearfix').contents[5].text.strip()
    house_info = soup.find_all('div', class_='houseInfo-content')
    neighbor = house_info[0].text.strip()
    price_per_squarem = house_info[2].text.strip()
    print([title, total_price, room, size, neighbor, price_per_squarem])
    print("++++++++++++++++++++++++++++++++++++++\n")
    time.sleep(10)  # as noted above, the site blocks rapid-fire requests