Chapter 1: A First Look at Web Scrapers

Witness_236

Category: Web Scraping with Python (study notes)

Article link: https://blog.csdn.net/qq_26079279/article/details/79572248

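How to format and process data without the help of a browser

Goal of this chapter: first send a GET request to a web server to fetch a specific page, then read the HTML content of that page, and finally do some simple information extraction to isolate the content we are looking for.

I. Network Connection

1. How the Internet works (to be filled in)

1.2. Web browsers: create packets of information, send them, and then interpret the data that comes back as pretty pictures, sounds, text, and video.

1.3. How Python does it: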

from urllib.request import urlopen
html = urlopen("https://movie.douban.com/celebrity/1044973/")
print(html.read())

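urllib: part of Python's standard library. It contains functions for requesting data over the network, handling cookies, and even changing metadata such as request headers and the user agent.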
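As a small illustration of the "changing request headers and the user agent" part, the sketch below builds a request object with a custom User-Agent without sending anything. The user-agent string is a made-up value for this example, not something the site requires.

```python
from urllib.request import Request

# Hypothetical values chosen for illustration only.
url = "https://movie.douban.com/celebrity/1044973/"
headers = {"User-Agent": "my-study-crawler/0.1"}

# Build the request; nothing is sent over the network yet.
req = Request(url, headers=headers)

# No data payload is attached, so urllib will use the GET method.
print(req.get_method())
# urllib stores header names in capitalized form ("User-agent").
print(req.get_header("User-agent"))
```

Passing req to urlopen(req) instead of the bare URL string would send the request with these headers.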

urlopen: opens a remote object over the network and lets you read it.
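One detail worth knowing: read() returns raw bytes, not a string, which is why the printed output starts with b'...'. The sketch below shows one way to decode it. To keep it runnable offline it uses a data: URL as a stand-in for a real page; that URL and the sample HTML are assumptions made for this example.

```python
from urllib.parse import quote
from urllib.request import urlopen

# A data: URL stands in for a real page so the example needs no network access.
page = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(page)

with urlopen(url) as response:
    raw = response.read()  # bytes, exactly as received
    # Use the charset declared in the Content-Type header, falling back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

text = raw.decode(charset)  # str, ready for printing or parsing
print(type(raw).__name__, type(text).__name__)
print(text)
```

For a real page the declared charset may be missing or wrong, so the UTF-8 fallback here is only a reasonable default, not a guarantee.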

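II. BeautifulSoup

1. What it does: it formats and organizes messy web content by locating HTML tags, and presents the XML structure to us as easy-to-use Python objects.

2. Installation:

sudo easy_install pip             // Mac: install pip first

pip install beautifulsoup4        // Mac: once pip is available

pip install beautifulsoup4        // Windows: run in cmd from the folder that contains pip.exe

sudo apt-get install python3-bs4  // Linux (Debian/Ubuntu), Python 3 package

pip install lxml                  // needed for the "lxml" parser used in the code below

3. Running it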

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://movie.douban.com/celebrity/1044973/")
bsObj = BeautifulSoup(html.read(), "lxml")
# print(bsObj.h1)
print(bsObj)

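4. Connecting reliably

html = urlopen("https://movie.douban.com/celebrity/1044973/")

Exceptions that can occur on this line:

(1) The page does not exist on the server (or there was an error while retrieving it)

(2) The server does not exist

In the first case an HTTP error is returned, such as "404 Page Not Found" or "500 Internal Server Error".

Handling the exception: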

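from urllib.request import urlopen
from urllib.error import HTTPError
try:
	html = urlopen("https://movie.douban.com/celebrity/1044973/")
except HTTPError as e:
	print(e)
	# Return None, stop the program, or fall back to another plan here
else:
	# The program continues. If the except block above already returned
	# or exited, this else branch is unnecessary and will never run.
	pass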
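Printing the exception is enough for a first attempt, but the status code and reason are also available as attributes. The sketch below shows how they might be read. The describe helper is a hypothetical name introduced here, and the two error objects are built by hand purely for illustration; normally urlopen raises them.

```python
import io
from urllib.error import HTTPError, URLError

def describe(error):
    """Return a short, human-readable summary of a urllib error."""
    if isinstance(error, HTTPError):
        # The server answered, but with an error status code.
        return f"HTTP {error.code}: {error.reason}"
    # No HTTP response at all (DNS failure, refused connection, bad scheme).
    return f"Connection problem: {error.reason}"

# Hand-built errors for illustration only.
not_found = HTTPError("https://example.com/missing", 404, "Not Found", None, io.BytesIO(b""))
no_server = URLError("name resolution failed")

print(describe(not_found))
print(describe(no_server))
# HTTPError is a subclass of URLError, so the order of except clauses matters.
print(issubclass(HTTPError, URLError))
```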

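In the second case the server cannot be reached or the URL was mistyped. In Python 3, urlopen does not return None here; it raises URLError. HTTPError is a subclass of URLError, so when handling both, catch HTTPError first.

from urllib.request import urlopen
from urllib.error import URLError
try:
	html = urlopen("https://movie.douban.com/celebrity/1044973/")
except URLError as e:
	print("The server could not be found")
else:
	pass  # the program continues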
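Both failure cases can be folded into one small helper. This is a minimal sketch, not the only way to do it: fetch is a hypothetical name, the 10-second timeout is an arbitrary choice, and the data: and file: URLs are offline stand-ins for a page that loads and a resource that cannot be reached.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        print("Server returned an error:", e.code)
    except URLError as e:
        print("Could not reach the resource:", e.reason)
    return None

# Offline stand-ins: a data: URL that always loads, and a file: URL that cannot exist.
print(fetch("data:text/plain,hello"))
print(fetch("file:///no/such/dir/page.html"))
```

Returning None on failure lets the caller decide what to do next, which is the pattern the reorganized code at the end of this chapter follows.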

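AttributeError:

Accessing a tag that does not exist on a BeautifulSoup object returns None. Trying to access a child tag on that None then raises AttributeError.

For example:

print(bsObj.nonExistingTag.someTag)    # raises AttributeError

To avoid this, check for both situations: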

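try:
	badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
	# nonExistingTag was None, so it has no attribute anotherTag
	print("Tag was not found")
else:
	if badContent is None:
		# nonExistingTag exists, but anotherTag does not
		print("Tag was not found")
	else:
		print(badContent)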
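An alternative to try/except is to check for None at each step. The sketch below is one possible way to do that. It assumes beautifulsoup4 is installed; first_tag is a hypothetical helper name, and the inline HTML and the built-in "html.parser" are used only so the example runs offline.

```python
from bs4 import BeautifulSoup

# Inline HTML keeps the sketch offline; "html.parser" ships with Python.
soup = BeautifulSoup("<html><body><h1>Title</h1></body></html>", "html.parser")

def first_tag(soup, *names):
    """Walk down nested tag names; return None as soon as one is missing."""
    node = soup
    for name in names:
        # find() returns None when no matching tag exists below this node.
        node = node.find(name) if node is not None else None
    return node

print(first_tag(soup, "body", "h1"))            # the h1 tag
print(first_tag(soup, "nonExistingTag", "h1"))  # None, and no exception
```

This trades the exception handling for an explicit check, which can be easier to read when the chain of tags is long.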

The same code, reorganized (it returns the page title):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
def getTitle(url):
	# Step 1: fetch the page; give up if the server or the page is unavailable
	try:
		html = urlopen(url)
	except (HTTPError, URLError) as e:
		return None
	# Step 2: parse the page; body is None if the tag is missing,
	# in which case body.h1 raises AttributeError
	try:
		bsObj = BeautifulSoup(html.read(), "lxml")
		title = bsObj.body.h1
	except AttributeError as e:
		return None
	return title
title = getTitle("https://movie.douban.com/celebrity/1044973/")
if title is None:
	print("Title could not be found")
else:
	print(title)
