Getting Started with Web Scraping (Python)

Latest recommended article published on 2024-03-24 15:53:38

Pinned · weixin_45021446

Article link: https://blog.csdn.net/weixin_45021446/article/details/103316317

1. Accessing a web page and modifying the request headers

	**1.1** Import the requests package (import requests). Its get() function opens and fetches a web page, and the headers argument makes the request look like it comes from a browser, so the site is less likely to block it as a script.
	import requests
	url = 'https://www.baidu.com'
	headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
	respone_html = requests.get(url, headers=headers)
	print(respone_html.status_code)  # 200 means the request succeeded
	**1.2** Use import urllib.request and pass the headers to Request()
	req = urllib.request.Request(url=url, headers=headers)
	data = urllib.request.urlopen(req)
	print(data.getcode())
	**1.3** Use import urllib.request and add the header afterwards with add_header(key, value)
	url = 'https://www.baidu.com'
	req = urllib.request.Request(url=url)
	req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36')
	data = urllib.request.urlopen(req)
	print(data.getcode())

Setting a timeout

Pass a timeout=n argument (n is in seconds) to urlopen or get:
data = urllib.request.urlopen(req, timeout=n)
respone_html = requests.get(url, headers=headers, timeout=n)
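
To see what the timeout actually does, here is a minimal sketch that runs entirely on the local machine. SlowHandler, fetch and the local test server are names made up for this illustration, and the 1-second delay is an arbitrary choice. With urllib, a timeout while waiting for the reply shows up as socket.timeout, and a timeout while connecting is wrapped in URLError, so the sketch catches both.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers only after a delay."""

    def do_GET(self):
        time.sleep(1.0)  # simulate a slow site
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late")
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url, timeout):
    """Return the body, or None if the server did not answer in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (socket.timeout, urllib.error.URLError) as e:
        print("request failed:", e)
        return None


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

too_short = fetch(url, timeout=0.2)  # gives up before the reply arrives
long_enough = fetch(url, timeout=5)  # waits long enough for the reply
server.shutdown()
```

In a real scraper the same try/except would wrap the request to the target site, usually with a timeout of a few seconds.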

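Using a proxy

from urllib import request
url = 'https://www.baidu.com'
# Build a handler by passing the proxy to ProxyHandler.
# The dict key is the URL scheme: add an "https" entry to proxy https:// URLs as well.
handler = request.ProxyHandler({"http": "223.241.78.43:8010"})
# Create an opener from the handler
opener = request.build_opener(handler)
# Send the request through the opener
respone = opener.open(url)
#--------------------------------------
from urllib import request
url = 'https://www.baidu.com'
# Build a handler by passing the proxy to ProxyHandler
handler = request.ProxyHandler({"http": "223.241.78.43:8010"})
# Create an opener from the handler
opener = request.build_opener(handler, request.HTTPHandler)
# Install it as the global default opener
request.install_opener(opener)
# Fetch the page; urlopen now goes through the installed opener
data = request.urlopen(url).read().decode("utf-8")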
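
The two variants above can be wrapped in a small helper. This is only a sketch: make_proxy_opener is a name invented here, and the address 203.0.113.10:8080 is a placeholder from a documentation-only range, not a working proxy. The helper registers the proxy for both schemes and sends no request, so it can be inspected without a network connection.

```python
from urllib import request


def make_proxy_opener(proxy_addr):
    """Build an opener that routes http and https URLs through proxy_addr."""
    handler = request.ProxyHandler({
        "http": proxy_addr,
        "https": proxy_addr,
    })
    return request.build_opener(handler)


# Placeholder address, not a working proxy.
opener = make_proxy_opener("203.0.113.10:8080")

# Check which proxy settings the opener will actually use.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# A real request would then be sent with:
#     opener.open("https://www.baidu.com", timeout=10)
```

Returning the opener instead of installing it globally keeps the proxy limited to the calls that need it.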

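2. Decoding with a specific encoding

	Check which encoding the page uses. Different pages use different encodings, and decoding with the wrong one produces garbled text. Either set the encoding by hand, or use apparent_encoding, which is the encoding requests guesses from the response body.
	respone_html.encoding = "(encoding name)" or respone_html.encoding = respone_html.apparent_encoding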
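
The effect of a wrong encoding is easy to reproduce without any network access. The sketch below builds the bytes a GBK-encoded page might send (the sample text is just an assumption for the demo) and decodes them twice.

```python
page_text = "初入爬虫"
raw = page_text.encode("gbk")  # bytes as a GBK-encoded site would send them

# Wrong codec: invalid byte sequences are replaced, the text is garbled.
wrong = raw.decode("utf-8", errors="replace")
# Matching codec: the original text comes back.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

Setting respone_html.encoding before reading respone_html.text plays the same role as choosing the codec in decode() here.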

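3. Getting the page source

The source is available through two attributes, text and content. text is the decoded string; content is the raw bytes, which have to be decoded with decode("encoding name") before they can be read as text.
html = respone_html.text or html = respone_html.content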

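4. Three ways to extract the content you need from the page source

1. Extracting with BeautifulSoup
First import the package: from bs4 import BeautifulSoup
The main parsers for BeautifulSoup4 and their trade-offs: Python's built-in html.parser needs no extra install and has moderate speed; lxml is very fast but is an external package; html5lib is the most lenient and parses pages the way a browser does, but it is slow and also an external package.
After parsing, use the find or find_all method (findAll in older code) to pull out the content or URLs you need.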
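
A short sketch of find and find_all, assuming the bs4 package is installed. The HTML fragment is made up for the illustration and stands in for respone_html.text.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, standing in for respone_html.text
html = """
<html><body>
  <h1 class="title">Example list</h1>
  <ul>
    <li><a href="/post/1">First post</a></li>
    <li><a href="/post/2">Second post</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install

# find returns the first matching tag only
title = soup.find("h1", class_="title").get_text()
# find_all returns every matching tag
links = [a["href"] for a in soup.find_all("a")]
texts = [a.get_text() for a in soup.find_all("a")]

print(title, links, texts)
```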
2. Extracting with XPath

A typical use is pulling the links, the topic titles, and the next-page link out of a page.
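
Here is a minimal sketch of that kind of extraction, assuming the lxml package is installed. The HTML fragment, the class names post and next, and the variable names are all invented for the example; a real page needs XPath expressions that match its own structure.

```python
from lxml import etree

# Hypothetical page fragment, standing in for respone_html.text
html = """
<html><body>
  <div class="post"><a href="/post/1">First topic</a></div>
  <div class="post"><a href="/post/2">Second topic</a></div>
  <a class="next" href="/page/2">Next page</a>
</body></html>
"""

tree = etree.HTML(html)  # parse the HTML string into an element tree

# @href selects the attribute value, text() selects the text inside the tag
links = tree.xpath('//div[@class="post"]/a/@href')
topics = tree.xpath('//div[@class="post"]/a/text()')

# xpath() always returns a list, so guard against a missing next-page link
next_link = tree.xpath('//a[@class="next"]/@href')
next_page = next_link[0] if next_link else None

print(links, topics, next_page)
```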

3. Extracting with regular expressions
Not covered here.

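5. Ways to save the content

5.1 Saving to a text file
Use the open function, preferably inside a with statement so that the file is closed automatically:
with open("name.ext", "mode", encoding="encoding name") as fp:
    fp.write(text)  # text is the string to save
For more on open, see the Runoob tutorial: https://www.runoob.com/python3/python3-file-methods.html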

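5.2 Saving to a CSV file
import csv
# newline='' stops the csv module from writing blank lines between rows on Windows
with open(path, 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp, dialect='excel')
    for data in datas:
        writer.writerow(data)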
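
A self-contained version of the CSV step, as a sketch: the rows and the temporary file path are sample values chosen for the demo. It writes the rows and reads them back, which also shows that a field containing a comma survives the round trip.

```python
import csv
import os
import tempfile

# Sample rows; each inner list becomes one line of the CSV file
datas = [
    ["title", "url"],
    ["First post", "https://example.com/post/1"],
    ["Second, with comma", "https://example.com/post/2"],
]

path = os.path.join(tempfile.mkdtemp(), "result.csv")

with open(path, "w", encoding="utf-8", newline="") as fp:
    writer = csv.writer(fp, dialect="excel")
    writer.writerows(datas)  # same as calling writerow() in a loop

with open(path, "r", encoding="utf-8", newline="") as fp:
    rows = list(csv.reader(fp))

print(rows)
```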
5.3 Saving to a database
See the Runoob tutorial:
https://www.runoob.com/python3/python-mysql-connector.html
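
The tutorial linked above uses MySQL. As a stand-in that runs without a database server, the sketch below uses sqlite3 from the standard library instead; the table name pages and the sample rows are assumptions made for the example. The connect, execute, commit pattern is the same with mysql-connector, although its placeholders are written %s rather than ?.

```python
import sqlite3

# Sample scraped rows: (title, url)
rows = [
    ("First post", "https://example.com/post/1"),
    ("Second post", "https://example.com/post/2"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is left on disk
conn.execute("CREATE TABLE pages (title TEXT, url TEXT)")

# Placeholders keep scraped text from being interpreted as SQL
conn.executemany("INSERT INTO pages (title, url) VALUES (?, ?)", rows)
conn.commit()

saved = conn.execute("SELECT title, url FROM pages ORDER BY rowid").fetchall()
conn.close()

print(saved)
```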

Cookie

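HTTP is a stateless protocol: it cannot carry session state over from one request to the next. With HTTP alone, a login made on one page is lost as soon as you move on to another page of the same site. The session information therefore has to be kept somewhere, either in a cookie or in a session.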

import urllib.request
import urllib.parse
import http.cookiejar

url = 'xxxxxxxxxxxxxxxxxxxxxx'  # fill in the page you need
postdata = urllib.parse.urlencode({"username": "xxxx", "password": "xxxx"}).encode("utf-8")
req = urllib.request.Request(url, postdata)
req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36')
# Create a CookieJar object with http.cookiejar.CookieJar()
cjar = http.cookiejar.CookieJar()
# Create a cookie processor with HTTPCookieProcessor and build an opener from it
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
# Install the opener globally, so later urlopen() calls reuse the cookies
urllib.request.install_opener(opener)
# Fetch the page and read the data
data = opener.open(req).read()
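
To watch the cookie jar at work without a real login page, the sketch below starts a tiny local server. LoginHandler, the paths /login and /profile, and the cookie sessionid=abc123 are all made up for the demo. The server sets a cookie on /login and echoes back whatever Cookie header it receives on any other path.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /login sets a cookie, other pages echo it back."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "sessionid=abc123; Path=/")
            body = b"logged in"
        else:
            body = (self.headers.get("Cookie") or "no cookie").encode("utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))

opener.open(base + "/login").read()  # the jar stores the Set-Cookie value
with_jar = opener.open(base + "/profile").read().decode("utf-8")

# A plain opener has no jar, so the cookie is not sent
plain = urllib.request.build_opener()
without_jar = plain.open(base + "/profile").read().decode("utf-8")
server.shutdown()

print(with_jar, "|", without_jar)
```

The second request through the cookie-aware opener carries the session cookie automatically, which is what keeps a login alive across pages.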

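Handling exceptions

Fetching a page can raise URLError or HTTPError. URLError is the parent class of HTTPError, so catching URLError covers both cases, and hasattr() shows which attributes the exception carries: an HTTPError has both code and reason, while a plain URLError has only reason.
import urllib.error
import urllib.request
try:
    urllib.request.urlopen(url)  # fetch the page
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)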
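
An alternative to the hasattr() checks is to catch the subclass first. The sketch below triggers both kinds of error on the local machine; NotFoundHandler, describe_error and the local server are names invented for the demo, and the unused port is found by briefly binding a socket to it.

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical site that answers every request with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def describe_error(url):
    """Return (code, reason) for a failed request, (None, None) on success."""
    try:
        urllib.request.urlopen(url, timeout=5).close()
    except urllib.error.HTTPError as e:  # subclass first: code and reason
        return e.code, e.reason
    except urllib.error.URLError as e:   # no HTTP reply at all: reason only
        return None, e.reason
    return None, None


# Case 1: the server replies, but with an error status -> HTTPError
server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
http_failure = describe_error("http://127.0.0.1:%d/missing" % server.server_port)
server.shutdown()
server.server_close()

# Case 2: nothing is listening on the port -> plain URLError
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
free_port = probe.getsockname()[1]
probe.close()
url_failure = describe_error("http://127.0.0.1:%d/" % free_port)

print(http_failure, url_failure)
```

The order of the except clauses matters: if URLError came first, it would also catch every HTTPError and the second clause would never run.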