Learning Python Web Scraping

Author: 工具人

Tags: python, web scraping

###2021/01/17###

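###Introduction to Web Scraping###

I. Background research before crawling

Before crawling a site, first get a sense of its scale and structure. The site's own robots.txt and Sitemap files help with this, as do external tools such as Google Search and WHOIS.

1. Checking robots.txt

Reference: http://www.robotstxt.org

The sample file below comes from http://example.python-scraping.com/robots.txt

# section 1
User-agent: BadCrawler
Disallow: /

This section forbids any crawler whose user agent is BadCrawler from crawling the site.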

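# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

Whatever the user agent, a crawler should wait 5 seconds between any two download requests.

The /trap link exists to ban malicious crawlers that follow links they are not allowed to visit.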

# section 3
Sitemap: http://example.python-scraping.com/sitemap.xml

The site provides a Sitemap file, which helps crawlers locate the site's latest content without having to crawl every page.

Reference: http://www.sitemaps.org/protocol.html

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
	<loc>http://example.python-scraping.com/places/default/view/Afghanistan-1</loc>
	</url>
	<url>
	<loc>http://example.python-scraping.com/places/default/view/Aland-Islands-2</loc>
	</url>
	<url>
	<loc>http://example.python-scraping.com/places/default/view/Albania-3</loc>
	</url>
	...
</urlset>

The sitemap lists links to all of the site's pages, but the file itself may be missing, out of date, or incomplete.
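To make the file's structure concrete, here is a minimal sketch that reads the <loc> entries with the standard library's xml.etree.ElementTree. The helper name extract_sitemap_links and the shortened inline sample are assumptions made for illustration; they are not part of the original notes.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the xmlns attribute of <urlset> in the sitemap above.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# A shortened copy of the sample sitemap (illustrative data).
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.python-scraping.com/places/default/view/Afghanistan-1</loc></url>
  <url><loc>http://example.python-scraping.com/places/default/view/Aland-Islands-2</loc></url>
</urlset>"""

def extract_sitemap_links(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    locs = root.findall('sm:url/sm:loc', SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]

if __name__ == '__main__':
    for link in extract_sitemap_links(SAMPLE_SITEMAP):
        print(link)
```

Because the elements live in the sitemap namespace, the lookup has to use a prefixed path; a plain 'url/loc' would find nothing.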

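2. Estimating the size of a website

A large site may have hundreds of pages or far more, and downloading them one after another is slow. Distributed downloading addresses this (Chapter 4; I will add a hyperlink here later).

A quick way to estimate a site's size is to check what Google's crawler has indexed. Use the site keyword in a Google search to restrict the results to one domain, for example site:example.python-scraping.com, and read off the result count. The page below documents this and the other advanced search parameters.

Reference: http://www.google.com/advanced_search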
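As a small illustration of the site keyword, the sketch below builds the search URL for a whole domain and for one path under it. The helper name site_query_url is an assumption made for this example; the URL is meant to be opened in a browser, where the estimated number of results can be read from the results page.

```python
from urllib.parse import urlencode

def site_query_url(domain_or_path):
    """Build a Google search URL restricted to one domain or URL prefix."""
    query = 'site:' + domain_or_path
    # urlencode percent-encodes ':' and '/' in the query value
    return 'http://www.google.com/search?' + urlencode({'q': query})

print(site_query_url('example.python-scraping.com'))
print(site_query_url('example.python-scraping.com/view'))
```

Adding a path after the domain narrows the estimate to one part of the site, which is useful when only some pages are of interest.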

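3. Identifying the technology used by a website

Use the detectem module (it requires Python 3.5+ and Docker).

Docker failed to install on my machine, so I will come back to this after switching to Linux.

4. Finding the owner of a website

The WHOIS protocol can be used to look up who registered a domain.

import whois  # provided by the python-whois package
print(whois.whois('appspot.com'))

Here appspot.com is the domain being looked up.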

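5. A first web crawler

5.1 Scraping versus crawling

Scraping: usually targets specific websites and extracts specified information from them.

Crawling: built in a generic way, targeting a range of top-level domains or the entire web. A crawl gathers small, generic pieces of information from many sites or pages and follows links on to other pages.

Web crawler: crawls a specified set of websites, or performs a broader crawl across multiple sites or even the whole internet.

5.2 Downloading a web page

Use Python's urllib module to download a URL. By default it sends Python-urllib/3.x as the user agent.

What the function does: given a url, it downloads the page and returns the HTML. On a 5XX error, which indicates a problem on the server side, it retries automatically. It also changes the default user agent to 'wswp' (short for Web Scraping With Python).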

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

NumRetry = 2        # number of retries; total attempts = NumRetry + 1
UserAgent = 'wswp'  # replaces the default user agent (Web Scraping With Python)

def download(url, user_agent=UserAgent, num_retries=NumRetry):
    print('Downloading:', url, 'attempt number', NumRetry - num_retries + 1)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        # open the Request object (not the bare url) so the custom header is sent
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors; keyword arguments keep
                # num_retries from being passed as user_agent by mistake
                return download(url, user_agent=user_agent, num_retries=num_retries - 1)
    return html

if __name__ == "__main__":
    # url = 'http://pic.netbian.com/'
    url = input("Please enter the URL: ")
    download(url)
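The retry rule can be checked without touching the network. The sketch below is an iterative variant of the same idea, with the actual fetch passed in as a function. The names download_with_retries, fetch, flaky_fetch and not_found_fetch are hypothetical and exist only for this illustration.

```python
from urllib.error import HTTPError, URLError

def download_with_retries(url, fetch, num_retries=2):
    """Call fetch(url); retry only on 5xx HTTP errors, up to num_retries extra times."""
    for attempt in range(num_retries + 1):
        try:
            return fetch(url)
        except URLError as e:  # HTTPError is a subclass of URLError
            code = getattr(e, 'code', None)
            retryable = code is not None and 500 <= code < 600
            if not retryable or attempt == num_retries:
                return None
    return None

# A stand-in fetcher that fails twice with a 503 and then succeeds.
calls = {'count': 0}

def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] <= 2:
        raise HTTPError(url, 503, 'Service Unavailable', None, None)
    return b'<html>ok</html>'

# A stand-in fetcher that always fails with a 404, which is not retried.
def not_found_fetch(url):
    raise HTTPError(url, 404, 'Not Found', None, None)

print(download_with_retries('http://example.python-scraping.com', flaky_fetch))
print(download_with_retries('http://example.python-scraping.com', not_found_fetch))
```

A 4XX error means the request itself is at fault, so repeating it would not help; only 5XX responses are worth a second try.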

###2021/01/18###

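5.3 Crawling a site in bulk

5.3.1 Sitemap crawler

Parse the sitemap with a regular expression that captures the URL between each <loc> and </loc> tag. The download code is also updated to handle character-encoding conversion. CSS selectors are introduced later (hyperlink to be added).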

import re
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

NumRetry = 2        # number of retries; total attempts = NumRetry + 1
UserAgent = 'wswp'  # replaces the default user agent (Web Scraping With Python)

def download(url, user_agent=UserAgent, num_retries=NumRetry, charset='utf-8'):
    print('Downloading:', url, 'attempt number', NumRetry - num_retries + 1)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        # use the charset declared in the response headers, else the fallback
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent=user_agent,
                                num_retries=num_retries - 1, charset=charset)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # nothing to parse if the sitemap could not be downloaded
        return
    # extract the sitemap links (non-greedy, so two <loc> tags on one line stay separate)
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)

if __name__ == "__main__":
    # url = 'http://example.python-scraping.com/sitemap.xml'
    url = input("Please enter the URL: ")
    crawl_sitemap(url)

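5.3.2 ID iteration crawler

Test site: http://example.python-scraping.com/view/1

A site's URL structure can be exploited to reach its pages. For example, when a group of URLs differs only in one path segment (such as a trailing country/region name plus ID number), the web server usually ignores the name string and uses only the ID to look up the matching database record. Some records may have been deleted, so the IDs are not always consecutive, and a crawler that stops at the first missing ID would quit too early. To handle this, set a tolerance value: the iteration stops only once the number of consecutive errors reaches that value.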

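import itertools
from ws0117 import download  # ws0117.py: my local file holding the download() function from earlier

def crawl_site(url, max_errors=5, num_errors=0):
    for page in itertools.count(1):
        # append the ID to the base URL
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            # count consecutive failures and stop once the tolerance is reached
            num_errors += 1
            if num_errors == max_errors:
                break
        else:
            # a successful download resets the counter
            num_errors = 0

crawl_site('http://example.python-scraping.com/view/-')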
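To see the tolerance value at work without a network connection, the sketch below runs the same loop against a fake download function. EXISTING_IDS, fake_download and the returned list are assumptions made for illustration; the stopping rule is the one described above.

```python
import itertools

# Pretend that records 4, 5 and 8 were deleted (illustrative data).
EXISTING_IDS = {1, 2, 3, 6, 7, 9}

def fake_download(url):
    """Stand-in for download(): return page text, or None if the ID is missing."""
    page_id = int(url.rsplit('-', 1)[1])
    return 'page {}'.format(page_id) if page_id in EXISTING_IDS else None

def crawl_site(url, download, max_errors=5):
    """Visit url1, url2, ... and stop after max_errors consecutive failures."""
    found = []
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        if download(pg_url) is None:
            num_errors += 1
            if num_errors == max_errors:
                break  # too many misses in a row: assume the last ID was passed
        else:
            num_errors = 0  # a hit resets the consecutive-error counter
            found.append(pg_url)
    return found

base = 'http://example.python-scraping.com/view/-'
print(len(crawl_site(base, fake_download, max_errors=5)))
print(len(crawl_site(base, fake_download, max_errors=1)))
```

With a tolerance of 5 the crawl gets past both gaps and finds all six pages; with a tolerance of 1 it stops at the first missing ID and finds only three.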

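5.3.3 Link crawler

A link crawler follows the links it finds on each page, and a regular expression keeps it from downloading links we do not need. A browser knows which page the user is currently viewing, so pages can use relative links such as /index/1 and leave out the rest of the URL. urllib has no such context, so relative links have to be converted to absolute links; the parse module of urllib does this. Pages may also link to each other in cycles, so to avoid downloading the same page twice we need to record the pages already seen.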
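Here is a short sketch of how urljoin resolves links against the page they were found on, together with the bookkeeping that skips pages already recorded. The sample link list is made up for illustration and follows the example site's URL layout.

```python
from urllib.parse import urljoin

base = 'http://example.python-scraping.com/index/1'

# Links as they might appear on the page at `base` (illustrative values).
links = [
    '/index/2',
    '/view/Afghanistan-1',
    '/index/2',                                    # duplicate
    'http://example.python-scraping.com/index/1',  # already absolute, links back to base
]

seen = {base}
queue = []
for link in links:
    abs_link = urljoin(base, link)  # absolute URLs pass through unchanged
    if abs_link not in seen:        # skip pages that were already recorded
        seen.add(abs_link)
        queue.append(abs_link)

print(queue)
```

Only two URLs end up in the queue: the duplicate and the link back to the starting page are filtered out by the seen set.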

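For the example site, we want the index (list) pages and the individual pages, whose URLs look like this:

# index pages
http://example.python-scraping.com/index/1
http://example.python-scraping.com/index/2
# individual pages
http://example.python-scraping.com/view/Afghanistan-1
http://example.python-scraping.com/view/Aland-Islands-2

So the regular expression '/(index|view)/' can be used to match these pages.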

import re
from urllib.parse import urljoin
from ws0117 import download  # download() must return decoded text (str) for the regex below

def link_crawler(start_url, link_regex):
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # convert the relative link to an absolute one
                abs_link = urljoin(start_url, link)
                # check if already seen this link
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)

def get_links(html):
    # return a list of links from html
    # a regular expression that extracts the href of every <a> tag
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # return the matches, not the compiled pattern itself
    return webpage_regex.findall(html)

link_crawler('http://example.python-scraping.com', '/(index|view)/')
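To check the link extraction and filtering steps offline, the sketch below runs them on a small hand-written HTML snippet. SAMPLE_HTML is made up for illustration; the two regular expressions are the ones used above.

```python
import re

# Hand-written HTML with three links (illustrative data).
SAMPLE_HTML = """
<a href="/index/1">Next</a>
<A HREF='/view/Afghanistan-1'>Afghanistan</A>
<a class="nav" href="/user/login">Log in</a>
"""

def get_links(html):
    """Return every href value found in <a> tags."""
    webpage_regex = re.compile(r"""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    return webpage_regex.findall(html)

links = get_links(SAMPLE_HTML)
# keep only the links that start with /index/ or /view/
wanted = [link for link in links if re.match('/(index|view)/', link)]

print(links)
print(wanted)
```

All three hrefs are extracted, including the upper-case tag thanks to re.IGNORECASE, and the filter then drops /user/login. Note that re.match anchors at the start of the string, so the pattern only accepts links that begin with /index/ or /view/.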

5.4 Advanced features

5.4.1 Parsing robots.txt

Use the robotparser module from the urllib library to parse the robots.txt file.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()
url = 'http://example.python-scraping.com'
user_agent = 'BadCrawler'
print(rp.can_fetch(user_agent, url))
user_agent = 'GoodCrawler'
print(rp.can_fetch(user_agent, url))
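The same checks can be run offline by handing the rules to the parser directly with parse(), which is handy for experimenting. The rules below are copied from sections 1 and 2 of the sample robots.txt earlier in these notes; crawl_delay() is available from Python 3.6 onward.

```python
from urllib import robotparser

# Sections 1 and 2 of the sample robots.txt shown earlier.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

url = 'http://example.python-scraping.com'
print(rp.can_fetch('BadCrawler', url))
print(rp.can_fetch('GoodCrawler', url))
print(rp.can_fetch('GoodCrawler', url + '/trap'))
print(rp.crawl_delay('GoodCrawler'))
```

With these rules BadCrawler is refused, GoodCrawler is allowed everywhere except /trap, and the crawl delay reported for GoodCrawler is 5 seconds.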
