Python Web Crawling

Latest recommended article published on 2022-08-16 16:23:35

酸奶可乐

Article link: https://blog.csdn.net/weixin_44229819/article/details/104373537

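Basic crawling workflow

Before fetching data from the web, you need to know:

	What resource I want to fetch, and what requirements the fetched resource must meet
	
	Whether I have actually fetched the resource I wanted
	
	How to fetch the resource
	
	Which method to use to fetch or store the resource
	
	How to handle the response object
	
Before parsing the fetched data, you need to know:

	What type the fetched resource is
	
	The basic structure of an HTML document
	
	What the tag tree is
	
	What a tag is
	
	What a tag is made up of
	
	How to traverse the fetched resource
	
	Which exceptions commonly occur during traversal, and how to avoid them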

The requests library

Install: pip install requests
If the installation fails, try again.

How to get a page?
	first: send a request
		then: receive a response
			next: check the response's status_code
				finally: set the response's encoding

Library functions

requests.get(url, params=None, **kwargs)
	url 	:'http://www.baidu.com'
	params	:{'SunFeng':123} or 'str'

comment: The url must include the scheme ('http://' or 'https://'); a bare 'www.baidu.com' or 'baidu.com' raises an error.
		requests.get() returns a Response object; what we need is in its attributes.

response = requests.get('http://www.baidu.com')
#response is an object
'''
A requests Response object has these attributes:
	response.status_code
	response.text
	response.encoding
	response.apparent_encoding
	response.content
Handling errors:
	response.raise_for_status()
	If 'response.status_code' is a 4xx or 5xx error code, 'response.raise_for_status()' raises a 'requests.HTTPError'.
'''
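Example:
import requests

try:
	# The request itself can fail (bad URL, timeout), so it goes inside the try block
	response = requests.get("http://www.baidu.com", timeout=10)
	response.raise_for_status()
	# Use the encoding guessed from the page content
	response.encoding = response.apparent_encoding
	print(response.text)
except requests.RequestException:
	print("HTTP ERROR!!!!")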
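
The four steps above can be wrapped in one reusable function. This is only a minimal sketch: the name get_html_text and the 10-second timeout are my own assumptions, not part of requests. The call at the end deliberately passes a URL without a scheme to show that the error is caught before anything is sent.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails.

    get_html_text is a hypothetical helper name used for illustration.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()                      # raise on 4xx / 5xx
        response.encoding = response.apparent_encoding   # guess encoding from content
        return response.text
    except requests.RequestException as exc:
        print("request failed:", exc)
        return None


# No scheme in the URL: requests rejects it before sending anything
missing_scheme = get_html_text("www.baidu.com")
print(missing_scheme)
```

Returning None keeps the calling code simple, but you may prefer to let the exception propagate if the caller needs to know why the request failed.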

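Fetching resources:

Function	Purpose	Parameters
requests.get(url,params,**kwargs)	Fetches the resource at url, applying the given parameters	url: "http://www.baba.com"; params: a dict or string
requests.head(url)	Fetches only the headers for url, giving a summary of the resource with very little bandwidth	url: same as above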
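
To see what params actually does, you can build a request without sending it and look at the final URL. A small sketch; the path "/s" and the keys "wd" and "pn" are just illustrative values I picked, not something requests requires.

```python
import requests

# Build the request but do not send it, so no network access is needed
request = requests.Request(
    "GET",
    "http://www.baidu.com/s",             # illustrative URL
    params={"wd": "python", "pn": 10},    # illustrative query parameters
)
prepared = request.prepare()

# The dict has been encoded and appended to the URL as a query string
print(prepared.url)
```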

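Submitting resources:

Function	Purpose	Parameters
requests.post(url,data=None,json=None,**kwargs)	Sends a request to url and submits new data along with it [a dict is sent as form fields; a string is sent as the raw body (data)]	url,data=None,json=None,**kwargs

requests.put()	Uploads a new resource to url, replacing the existing one [used like post, but overwrites the resource already submitted]
requests.patch()	Submits a partial update to the resource at url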
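
The difference between data and json is easiest to see by preparing two POST requests without sending them. This is a sketch under assumptions: the URL is a placeholder and the payload is made up.

```python
import json

import requests

url = "http://www.example.com/post"   # placeholder URL, never contacted here
payload = {"key": "value"}            # made-up payload

# data=dict is form-encoded; json=dict is serialized to a JSON body
form_request = requests.Request("POST", url, data=payload).prepare()
json_request = requests.Request("POST", url, json=payload).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json_request.body)

# Decode the JSON body back into a dict to confirm nothing was lost
decoded = json.loads(json_request.body)
```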

Deleting resources:

Function	Purpose	Parameters
requests.delete()	Sends a request to delete the resource at url

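All parameters

Name	Purpose
params	A dict or bytes, appended to the url as query parameters
data	A dict, bytes, or file object [the data to store at url]
json	JSON-serializable data [sent as the request body]; used like data
headers	Works like an ID card carried with the request [e.g. headers = {'user-agent': 'chrome/10'}]
cookies	A dict or CookieJar
auth	Supports HTTP authentication
files	The field used to upload files to the server [a dict, e.g. files = {'file': open("")}]
timeout	Sets the timeout [unit: s]
proxies	Proxy servers to use while crawling; can hide the crawler's original IP address [prevents tracing back]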

'''
Note: when downloading a resource, use 'response.content' to get it as binary data (bytes).
'''
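
Saving that binary data comes down to opening the file in "wb" mode. A minimal sketch: save_binary is a hypothetical helper name, and the bytes below are a stand-in for a real response.content so the example runs without a network connection.

```python
import os
import tempfile


def save_binary(content, path):
    """Write raw bytes (for example response.content) to path."""
    with open(path, "wb") as f:   # "wb": binary mode, no text encoding applied
        f.write(content)
    return path


# With a real download this would be:
#   save_binary(requests.get(url).content, path)
fake_content = b"\x89PNG\r\n\x1a\n"   # stand-in bytes, not a complete image
target = os.path.join(tempfile.mkdtemp(), "demo.bin")
save_binary(fake_content, target)

with open(target, "rb") as f:
    saved = f.read()
print(len(saved), "bytes written")
```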

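The BeautifulSoup library

'''
The beautifulsoup library mainly parses .html and .xml pages so that we can work with the fetched document directly. It is like opening a file with fo = open(): operating on the fo object amounts to operating on the file.
You can think of it as:	
	HTML document = tag tree = BeautifulSoup object
	Purpose: parse, traverse, and maintain the [tag tree]
A fetched page is raw markup that is hard to read as is, so we use beautifulsoup to parse the page data.
'''
>>>response = requests.get(url)#fetch the page
>>>demo = response.text#get the text of the returned page
>>>soup = BeautifulSoup(demo, 'html.parser')#parse the page; returns the parsed object
>>>print(soup.prettify())#print the parsed document, nicely indented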
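
The same parsing step can be tried without any network access by feeding BeautifulSoup a string. A small sketch; the HTML below is made up for the demonstration.

```python
from bs4 import BeautifulSoup

# A made-up page, standing in for response.text
demo = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello, <b>crawler</b></p></body></html>"
)

soup = BeautifulSoup(demo, "html.parser")   # parser shipped with Python
print(soup.prettify())                      # one tag per line, indented

title_text = soup.title.string
intro_classes = soup.p["class"]             # class is returned as a list
```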

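Basic elements of each tag in a BeautifulSoup object

Name	Purpose	Usage
tag	A tag	<> and </>
name	The tag's name	<tag>.name
attribute	The tag's attributes [returned as a dict]	<tag>.attrs
navigable string	The tag's non-attribute string [can span multiple tags]	<tag>.string
comment	A comment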
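
Here is how those elements look on a concrete tag. A minimal sketch with made-up HTML; only name, attrs and string from the table are used.

```python
from bs4 import BeautifulSoup

# Made-up HTML for the demonstration
demo = '<p class="title" id="top"><a href="http://www.example.com">link text</a></p>'
soup = BeautifulSoup(demo, "html.parser")

tag = soup.a
print(tag.name)           # the tag's name
print(tag.attrs)          # the attributes as a dict
print(tag.string)         # the non-attribute string inside the tag
print(type(tag.string))   # a NavigableString, not a plain str

paragraph_attrs = soup.p.attrs
```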

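'''
Getting a tag's information [demo is the text of a page already fetched with requests.get()]:
Use soup.<tag name> to get a tag
	Note: this only returns the first tag with that name
'''
>>>from bs4 import BeautifulSoup
>>>soup = BeautifulSoup(demo,'html.parser')
>>>soup.tag#returns the first <tag> tag

'''Getting the parent tag
>>>soup.tag.parent
'''
>>>soup.a.parent.name#the parent of the a tag is p
p
Note: when reading a tag's non-attribute string, the value returned may be a comment.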
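
Since a comment prints just like ordinary text, it helps to check the type before using the string. A sketch with made-up HTML, assuming you want to treat comments differently from visible text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Made-up HTML: <a> holds real text, <b> holds only a comment
demo = "<p><a href='#'>text</a><b><!--a comment--></b></p>"
soup = BeautifulSoup(demo, "html.parser")

print(soup.a.parent.name)   # name of the tag one level up from <a>
print(soup.b.string)        # prints the comment body without the <!-- --> markers

# isinstance tells the two cases apart
b_is_comment = isinstance(soup.b.string, Comment)
a_is_comment = isinstance(soup.a.string, Comment)
```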

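Traversing the tag tree
	Definition of the tag tree: an .html file is tree-structured data, with many branches growing from a single root [like a tree data structure, except that a node can have any number of children]
	
	Three ways to traverse:
		From the root towards the leaves [downward traversal]:
				During downward traversal the child nodes include '\n' and NavigableString nodes; not every child node is a tag
		From a leaf towards the root [upward traversal]:
		Between nodes on the same level [sibling traversal]:
		
Note: in sibling and downward traversal, any node you get may be a NavigableString.

Memory aid: if an attribute can give you several nodes at once, it returns an iterator. The one exception is <>.contents, which returns a list; everything else returns a single node.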

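Downward traversal

Attribute	Purpose
<>.contents	Returns all child nodes of the current node [a list]
<>.children	An iterator over the child nodes; like <>.contents, used to traverse [all children]
<>.descendants	An iterator over the descendant nodes, used to traverse [all descendants]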
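
A short sketch of downward traversal on made-up HTML. It filters with isinstance so that the '\n' and string nodes mentioned earlier are skipped.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML; note the newline between the two <p> tags
demo = "<div><p>one</p>\n<p>two <b>bold</b></p></div>"
soup = BeautifulSoup(demo, "html.parser")
div = soup.div

print(div.contents)   # a list: two <p> tags with a '\n' string between them

# Keep only real tags among the direct children
child_tags = [child.name for child in div.children if isinstance(child, Tag)]

# descendants also reaches the nested <b>
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
print(child_tags, descendant_tags)
```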

Upward traversal

Attribute	Purpose
<>.parent	Returns the parent tag
<>.parents	An iterator over all ancestors, used to traverse the ancestor nodes
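
Upward traversal can be sketched like this on made-up HTML. Note that the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Made-up HTML with a few nesting levels
demo = "<html><body><div><p><a href='#'>x</a></p></div></body></html>"
soup = BeautifulSoup(demo, "html.parser")

# Walk from <a> up to the root, collecting each ancestor's name
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)
```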

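Sibling traversal

Attribute	Purpose
<>.next_sibling	Returns the next node under the same parent
<>.next_siblings	An iterator over all following nodes under the same parent, in HTML document order
<>.previous_sibling	Returns the previous node under the same parent
<>.previous_siblings	An iterator over all preceding nodes under the same parent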
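
A sketch of sibling traversal on a made-up list. The newline between the <li> tags shows why the next sibling is often a string rather than a tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML; the newlines between <li> tags are nodes too
demo = "<ul><li>a</li>\n<li>b</li>\n<li>c</li></ul>"
soup = BeautifulSoup(demo, "html.parser")

first = soup.li
print(repr(first.next_sibling))   # a '\n' string, not the next <li>

# Filter out the strings to get only the following tags
following = [s.string for s in first.next_siblings if isinstance(s, Tag)]

# previous_siblings starts from the nearest node and moves backwards
last = soup.find_all("li")[-1]
preceding = [s.string for s in last.previous_siblings if isinstance(s, Tag)]
print(following, preceding)
```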

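Notes on BeautifulSoup
	Whatever the encoding of the html file it reads, the parsed document is output as 'utf-8' [which handles Chinese and other non-English text well]
	[python3 uses 'utf-8' by default]
	For a readable view of the html document, use print(soup.prettify())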

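Data markup formats:
	XML:
		HTML uses the same tag-based style as XML: <></> pairs express ownership, and the format is highly extensible
	JSON:
		Uses {"key": "value"} pairs to represent data, much like a composite data type
	YAML:
		Uses indentation without brackets to express ownership. Very readable; typically used for system configuration files
Ways to extract data:
	Parse and traverse
	Search
	Combine the two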

Searching for resources

<tag>.find_all()
		 |  find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
		 |name: the tag name, attrs: attribute filters, recursive: whether to search all descendants
		 name can also be a list of names ['a','b']
		 |      Look in the children of this PageElement and find all
		 |      PageElements that match the given criteria.
		 |      
		 |      All find_* methods take a common set of arguments. See the online
		 |      documentation for detailed explanations.
		 |      
		 |      :param name: A filter on tag name.
		 |      :param attrs: A dictionary of filters on attribute values.
		 |      :param recursive: If this is True, find_all() will perform a
		 |          recursive search of this PageElement's children. Otherwise,
		 |          only the direct children will be considered.
		 |      :param limit: Stop looking after finding this many results.
		 |      :kwargs: A dictionary of filters on attribute values.
		 |      :return: A ResultSet of PageElements.
		 |      :rtype: bs4.element.ResultSet
Returns a list-like ResultSet holding the tags that matched the search
<tag>.get()
		 |  get(self, key, default=None)
		 |      Returns the value of the 'key' attribute for the tag, or
		 |      the value given for 'default' if it doesn't have that
		 |      attribute.
Returns the value for the given key from the tag's attribute dictionary
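
Both methods in action on made-up HTML. A minimal sketch: the class name "ext" and the URLs are placeholders I chose for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML for the demonstration
demo = (
    '<p><a class="ext" href="http://a.example">A</a>'
    '<a href="http://b.example">B</a><b>bold</b></p>'
)
soup = BeautifulSoup(demo, "html.parser")

# All <a> tags, then read one attribute from each with get()
links = [a.get("href") for a in soup.find_all("a")]

# A string as the second argument is matched against the CSS class
ext_links = soup.find_all("a", "ext")

# A list of names matches any of them
names = [tag.name for tag in soup.find_all(["a", "b"])]

# limit stops the search early; get() falls back to the default
first_only = soup.find_all("a", limit=1)
title = soup.a.get("title", "no title")
print(links, names, title)
```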

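# Fetch the data source
import requests
from bs4 import BeautifulSoup

try:
    # Keep the request inside the try block so connection errors are caught too
    response = requests.get("http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html", timeout=10)
    response.raise_for_status()
    response.encoding = response.apparent_encoding
except requests.RequestException:
    # Without a page there is nothing to parse, so stop here
    raise SystemExit("Internet Error!!!")
demo = response.text
soup = BeautifulSoup(demo,'html.parser')


# Store the wanted data in a list; the first row holds the column headers
x=[["Rank","University","Location","Total score","Student quality"]]
for tag in soup.find_all('tr','alt'):
    #print(tag.prettify())
    element = []
    # Only the first five cells of each table row are needed
    for args in tag.contents[:5]:
        element.append(args.string)
    x.append(element)

# Enter the university to look up (as its name appears on the page)
data= input("Enter the university to look up: ")
for i in x[1:]:
    if i[1] == data:
        # Print the header row, then the matching row, with the same layout
        for line in (x[0], i):
            print("{:^20}{:<30}{: ^20}{: ^20}{: ^20}".format(line[0],line[1],line[2],line[3],line[4]))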

Output: [screenshot omitted]

One obvious problem: the columns are not aligned.
For how to align them, see my earlier post.
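
One common way to line up columns that contain Chinese text is to pad with the full-width space chr(12288) instead of the normal space. This is a sketch of that technique and may differ from what the earlier post describes; the rows below are made-up sample data, not real ranking results.

```python
# Made-up sample rows: (rank, university name)
rows = [("1", "示例大学"), ("2", "另一所示例大学")]

FULL_WIDTH_SPACE = chr(12288)   # as wide as one Chinese character

# {1:{2}^12} centers the name in 12 columns, filling with argument 2
template = "{0:^6}{1:{2}^12}"

lines = [template.format(rank, name, FULL_WIDTH_SPACE) for rank, name in rows]
for line in lines:
    print(line)
```

Padding with a character of the same width as the text keeps every row the same visual width, provided the terminal font is monospaced.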
