Scraping Airbnb

Many websites and online services provide APIs that allow developers to access data from their platform; however, these APIs often expose only a limited amount of information, and many websites or services do not offer an API at all. If you want to extract information from webpages that do not offer an API with the data you need, you can make use of web scraping. This blog post walks you through the basics of scraping data with Python using Selenium and Beautifulsoup4, and focuses on the process that can be used to scrape data from the web effectively. Basic knowledge of Python and HTML is assumed.

About Airbnb

Airbnb allows people to rent out their properties on its platform, and travellers can then book these properties for shorter or longer stays. The company was founded in August 2008 in San Francisco, California and currently has annual revenue of over 2.5 billion US dollars [1]. In the US alone the platform has 660,000 listings [2]. Every individual listing contains a lot of information, such as the facilities offered, the location, information about the host, and reviews.

(Image by author)

Setting up your environment

In order to use the beautifulsoup4 and selenium libraries to extract information from a webpage, you first have to install them on your machine. Once installed, you can load them into your Python environment with the following commands:

from bs4 import BeautifulSoup 
from selenium import webdriver
# The following packages will also be used in this tutorial
import pandas as pd # DataFrame operations
import numpy as np  # Numerical operations
import time         # Tracking time
import requests     # HTTP requests
import re           # String manipulation
from sklearn.feature_extraction.text import CountVectorizer # Bag-of-words (cleaning)
from joblib import Parallel, delayed # Parallelization of tasks

Besides installing and loading these packages, you also have to install the webdriver for your preferred browser and put the .exe (or .jar) file in the same folder as your Python script. I used the Chrome driver when writing the code for this project.

Getting started

Once you have your environment set up, it is time to actually start scraping some data. A great first step is to try to load the static HTML of a single page into Python, which can be done with the BeautifulSoup library alone. Generally, you only need Selenium when the content you want to scrape is added to the website via JavaScript. This will become clear later!

For now, I will try to scrape the static HTML content from this webpage. If that webpage is no longer available by the time you are reading this blog post, feel free to use any Airbnb page that contains listings.

(Image by author)

The following chunk of code takes the content of the webpage and puts it in a soup object (a BeautifulSoup object from the beautifulsoup4 library).

def getPage(url):
	''' returns a soup object that contains all the information 
	of a certain webpage'''
	result = requests.get(url)
	content = result.content
	return BeautifulSoup(content, features = "lxml")


url_page = "YOUR URL"
page = getPage(url_page)

The biggest advantage of using the beautifulsoup library is that it comes with a lot of prebuilt functions that make it easier to extract data from a webpage. An overview of all these functions can be found in the documentation. Here I focus on the process of extracting information from a webpage rather than on a discussion of the individual functions that are available.
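
For example, the two helpers used most in this post are find, which returns the first matching element, and findAll, which returns all of them (the class name below is just a placeholder, not one taken from Airbnb):

# Applied to the soup object created above
first_link = page.find("a")                               # first <a> element on the page
all_divs = page.findAll("div", {"class": "some-class"})   # every <div> with a given class (placeholder name)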

The first step in extracting information from a webpage is always to check how the website is constructed. A brief look at the website shows that the listings are displayed one below the other, and that every listing contains both an image and an area with some standard information. A great first step is therefore to extract the container of every listing. Using Chrome's inspection tool (Ctrl+Shift+I), it can be seen that these two elements are part of an object with class “_8ssblpx”. With this information, we can use built-in beautifulsoup functions to extract the listings from the webpage.

(Image by author)

def getRoomClasses(soupPage):
	''' This function returns all the listings that can
	be found on the page in a list.'''
	# findAll already returns a list-like collection of matching elements
	return list(soupPage.findAll("div", {"class": "_8ssblpx"}))

One major advantage of extracting the object that contains all the data we want to scrape is that we only need to request the webpage once, which is a lot faster than requesting it multiple times. Saving this object in the final dataset also allows additional features to be created later without scraping the website again. Now that we have the object containing all the information about a listing, it is time to decide what information we really want to extract. Since in this case all the information shown on the webpage is relevant (except for the total price, which can be calculated from the price per night), we will extract all of it for every individual listing. Besides the visible information, I will also scrape the link to the detailed page of the listing (see figure 5) because it will be useful later. To extract this information I use the find function from beautifulsoup with exactly the same approach as before:

(Image by author)

def getListingLink(listing):
	''' This function returns the link of the listing'''
	return "http://airbnb.com" + listing.find("a")["href"]


def getListingTitle(listing):
	''' This function returns the title of the listing'''
	return listing.find("meta")["content"]


def getTopRow(listing):
	''' Returns the top row of listing information'''
	return listing.find("div", {"class": "_167qordg"}).text


def getRoomInfo(listing):
	''' Returns the guest information'''
	return listing.find("div", {"class":"_kqh46o"}).text


...
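
The post does not show the helper that combines these getters into a table on this page, so here is a minimal sketch of what it could look like; the column names are my own, and only the getters shown above are included (the elided ones would be added in the same way). A function like this is what extractPages below relies on:

def extractInformation(soupPage):
	''' Sketch: builds a dataframe with one row per listing on a results page.
	The column names are assumptions, not taken from the original script. '''
	rows = []
	for listing in getRoomClasses(soupPage):
		rows.append({
			"link": getListingLink(listing),
			"title": getListingTitle(listing),
			"top_row": getTopRow(listing),
			"room_info": getRoomInfo(listing),
		})
	return pd.DataFrame(rows)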

Using these techniques, the following dataset can be generated:

(Image by author)

Technically speaking, this is the end of the web scraping part (at least for this page). All the information that I wanted to extract from the webpage (and even more) can be found in the dataset; however, in order to use this dataset for analyses, further data cleaning is required. The result of the data cleaning will be shown at the end of this blog post.
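
As a small illustration of that cleaning (not the full pipeline from the post), the re module imported earlier can pull numbers out of the scraped strings. The "room_info" column name comes from the sketch above, and the "X guests" pattern is an assumption about how that string looks:

def extractGuests(room_info):
	''' Sketch: returns the number of guests mentioned in a listing's info string,
	or NaN when no "X guests" pattern is found. '''
	match = re.search(r"(\d+)\s*guests?", str(room_info))
	return int(match.group(1)) if match else np.nan

# df["guests"] = df["room_info"].apply(extractGuests)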

Scraping all listings for a given city

The next step is scraping all the listings for a given city instead of one single page. At the bottom of the Airbnb page you can click a button to move on to the next page, and every city has fifteen of these pages. Our goal is to automate the task of going to the next page. We start by identifying the arrow object (see picture). Airbnb encodes this button as an “li” element with class “_i66xk8d”. Knowing that links can be identified by the tag ‘a’, we can find the link that takes us to the next page with the following code:

def findNextPage(soupPage):
	''' Finds the next page with listings if it exists '''
	try:
		nextpage = "https://airbnb.com" + soupPage.find("li", {"class": "_i66xk8d"}).find("a")["href"]
	except (AttributeError, TypeError): # the button is missing, so assume this is the last page
		nextpage = "no next page"
	return nextpage

(Image by author)

Once this has been done, scraping all the pages is fairly simple: we follow the link to the next page, scrape its content, and repeat this until no next page is found:

def getPages(url):
	''' This function returns all the links to the pages containing 
	listings for one particular city '''
	result = []
	while url != "no next page": 
		page = getPage(url)
		result = result + [page]
		url = findNextPage(page)
	return result


def extractPages(url):
	''' This function outputs a dataframe that contains all information of a particular
	city. It thus contains information of multiple listings coming from multiple pages.'''
	pages = getPages(url)
	# Extract the listings of every page and stack them into one dataframe
	# (pd.concat is used because DataFrame.append was removed in recent pandas versions)
	dfs = [extractInformation(page) for page in pages]
	return pd.concat(dfs, ignore_index = True)

Now it is just a matter of cleaning the dataset, and we have scraped all the listings available on Airbnb for a given city! If you want to scrape multiple cities, you can define a list containing their URLs and loop over that list, for example as shown below.
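
A minimal sketch of that loop (the URLs are placeholders):

# Sketch: scrape several cities and stack the results into one dataframe
city_urls = [
	"YOUR CITY URL 1",
	"YOUR CITY URL 2",
]
all_cities = pd.concat([extractPages(url) for url in city_urls], ignore_index = True)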

Getting detailed information of the listings

As mentioned earlier, the link to the detailed page of every listing was also saved while scraping the data. This is necessary if we want to scrape more detailed information, such as the full description of a listing or its reviews. Unfortunately, our previous method of scraping a static website cannot be used here, because Airbnb loads the content of these detailed pages onto the webpage using JavaScript, which takes a couple of seconds. Our previous method does not wait for the page to load, which means essential information would not be saved.

(Image by author)

To solve this problem we make use of Selenium. Selenium uses a webdriver, which also loads the JavaScript elements of the website; afterwards we can use the previous methods to scrape information from the webpage. The following chunk of code can be used to set up the webdriver:

def setupDriver(url, waiting_time = 2.5):
	''' Initializes the driver of selenium'''
	driver = webdriver.Chrome()
	driver.get(url)
	time.sleep(waiting_time) 
	return driver

Note that I also defined a waiting time, which gives the page time to load all its content. Generally speaking, websites try to make their pages load in less than 2 seconds for SEO reasons.
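
The code above uses a fixed sleep. As an alternative sketch (not part of the original code), Selenium's explicit waits can poll for a specific element instead of always waiting the full time; here it waits for the “read more” button class that getJSpage below looks for:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def setupDriverExplicitWait(url, timeout = 10):
	''' Sketch: like setupDriver, but returns as soon as the "read more" buttons
	are present instead of sleeping a fixed amount of time. Raises a
	TimeoutException if nothing shows up within `timeout` seconds. '''
	driver = webdriver.Chrome()
	driver.get(url)
	WebDriverWait(driver, timeout).until(
		EC.presence_of_element_located((By.CLASS_NAME, "_1d079j1e"))
	)
	return driver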

One additional challenge on the Airbnb page is the presence of “read more” buttons. These buttons need to be clicked, otherwise not all of the content is displayed. Luckily, the Selenium driver also allows us to find an element and click on it. Extracting the page is then done as follows:

def getJSpage(url):
	''' Extracts the html of the webpage including the JS elements,
	output should be used as the input for all functions extracting specific information
	from the detailed pages of the listings '''
	driver = setupDriver(url)
	# In Selenium 4 this call would be driver.find_elements(By.CLASS_NAME, "_1d079j1e")
	read_more_buttons = driver.find_elements_by_class_name("_1d079j1e")
	try: # not all pages have "read more" buttons
		for i in range(2, len(read_more_buttons)):
			read_more_buttons[i].click()
	except Exception: # a button can be hidden or already expanded; just skip it
		pass
	html = driver.page_source
	driver.close()
	return BeautifulSoup(html, features="lxml")

From here on, everything works the same as before: we loop over all the URLs we want to scrape (which were saved in our earlier dataframe), save the content of each page, extract the information and clean the data. Unfortunately, this can take a very long time to run if you want to scrape lots of data (a sketch of speeding it up follows below). In the script linked below you can find more details about the information that was scraped using the methods described above and how it was cleaned.

The goal of this post was to introduce the basic concepts behind beautifulsoup and selenium, and that goal has been achieved. In total, 114 features were extracted after cleaning, for 5204 observations. The scraping and cleaning were not perfect, but that's okay: remember that we started with absolutely no data, and all of it was extracted automatically. Airbnb is a large company, and although its webpages generally follow the same patterns, it often changes their structure. That actually happened during this project, which means the code breaks! On top of that, Airbnb runs A/B tests, so you might suddenly be served a different version of a page. The fact that we still managed to create a high-quality structured dataset is already great.
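
As mentioned above, fetching every detailed page one after the other is slow. Here is a sketch of parallelizing that step with joblib (imported at the start); the "link" column name comes from the extractInformation sketch earlier, and prefer="threads" avoids having to pickle the soup objects:

# Sketch: load the detailed pages in parallel; every call opens and closes
# its own browser window, so keep n_jobs modest.
# df is the dataframe returned by extractPages earlier.
detailed_pages = Parallel(n_jobs = 4, prefer = "threads")(
	delayed(getJSpage)(url) for url in df["link"]
)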

(Image by author)

The entire script and a sample of the data can be found on GitHub.

Translated from: https://medium.com/analytics-vidhya/scraping-airbnb-fe1e895bd925
