Web Scraping Notes

被子怪

Category: Web scraping. Tags: web scraping, getting started, beginner, notes

Article link: https://blog.csdn.net/weixin_43316934/article/details/86487227


Scraping Tips

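1. urllib.request.urlopen("http://baidu.com"): opens a URL and returns the response.
2. Run pip install from the system command line (from pip's own directory if pip is not on the PATH), not from inside the Python interpreter.
3. selenium drives a real browser and renders pages, e.g. from selenium import webdriver
4. driver = webdriver.Chrome()  # opens a Chrome browser
5. driver = webdriver.Chrome()
driver.get("http://baidu.com")  # opens Baidu
print(driver.page_source)  # prints the page source
6. phantomjs: similar to chromedriver, but no browser window appears. It executes JavaScript much like a browser console and is operated entirely from the command line.
7. lxml: provides XPath-based page parsing.
8. pypi.python.org (now pypi.org): the Python Package Index, where third-party packages are published.
9. beautifulsoup: an HTML parsing library, from bs4 import BeautifulSoup
10. response = requests.get("http://baidu.com")
print(response.text)  # response body as text
print(response.status_code)  # status code; 200 means OK
print(response.headers)  # response headers
print(response.content)  # response body as raw bytes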
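
To make the role of a parsing library (items 7 and 9) more concrete, here is a minimal sketch that pulls the links out of an HTML string. It deliberately uses the standard library's html.parser as a stand-in for lxml or BeautifulSoup so that it runs without extra installs. The LinkCollector class, the extract_links function, and the sample HTML are all invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper for this example)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample page invented for this example
SAMPLE_HTML = """
<html><body>
  <a href="http://baidu.com">Baidu</a>
  <a href="/about">About</a>
  <p>No link here</p>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
```

With BeautifulSoup or lxml the same task takes fewer lines, but the idea is the same: turn the raw page source into something that can be queried for the pieces you want.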

How Scraping Works

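1. A scraper imitates a browser: it sends requests and extracts data from the pages that come back.
2. Basic workflow:
send a request --> receive the response --> parse the content --> save the data (in structured form)
3. Network: the exchange between server and browser, which consists of requests and responses.
request:
Request method: GET: all request parameters are carried in the URL, so the resource can be reached directly by URL.
POST: used for logins, for example. Compared with GET it adds form data, which carries the parameters in the request body. Such a request is normally issued by submitting a form and cannot be reproduced by just entering the URL.
Request URL: the uniform resource locator. A web page, an image, or a video can each be uniquely identified by a URL.
Request headers: important configuration information, such as User-Agent.
Request body: extra information sent with the request, such as the form data of a POST.
response:
Response status: a status code. For example, codes in the 3xx range signal redirects, and 404 signals that the page was not found.
Response headers: stored as key-value pairs.
Response body: the page source or binary data.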
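
As a small illustration of how status codes group into classes, the sketch below sorts a code into a rough category and looks up its standard reason phrase with http.HTTPStatus. The describe_status function and its category names are made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a rough category for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

A scraper would typically save the body on success, follow or log redirects, and skip or retry on the error classes.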

The urllib Library

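1. urllib is Python's built-in HTTP request library, and the most basic one.
2. It contains these modules: request (sending requests), error (exception handling), parse (URL parsing), and robotparser (parsing robots.txt to check whether a site may be crawled).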
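
The following sketch touches the three helper modules without sending any request. The URL and the robots.txt rules are invented for the example; a real scraper would load the rules from the target site instead.

```python
from urllib import error, parse, robotparser

# parse: split a URL into components and decode its query string
parts = parse.urlparse("http://httpbin.org/get?word=hello&page=2")
query = parse.parse_qs(parts.query)
print(parts.netloc, parts.path, query)

# robotparser: check a URL against robots.txt rules (rules invented here)
rules = robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
blocked = rules.can_fetch("*", "http://example.com/private/data")
allowed = rules.can_fetch("*", "http://example.com/index.html")
print(blocked, allowed)

# error: HTTPError is a subclass of URLError, so catch HTTPError first
print(issubclass(error.HTTPError, error.URLError))
```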

urlopen:

Requests with urllib

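# Signature of urlopen (cafile, capath, and cadefault are deprecated since Python 3.6 in favor of context)
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)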

# First argument of urlopen, url: a plain GET request
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))
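
Decoding with a hard-coded "utf-8" can garble pages served in another encoding such as GBK. As a hedged sketch, the helper below picks the charset from a Content-Type header value and falls back to UTF-8 when none is given. decode_body is a hypothetical name, and the byte strings are built locally so the example runs offline.

```python
from email.message import Message


def decode_body(body, content_type, default="utf-8"):
    """Decode response bytes using the charset named in a Content-Type value."""
    header = Message()
    header["Content-Type"] = content_type or "text/html"
    # get_content_charset() returns the charset parameter in lower case, or None
    charset = header.get_content_charset() or default
    return body.decode(charset, errors="replace")


gbk_text = decode_body("百度一下".encode("gbk"), "text/html; charset=GBK")
plain_text = decode_body(b"hello", "text/html")
print(gbk_text == "百度一下", plain_text)
```

With a real response you would pass response.read() and response.getheader("Content-Type"). The response's headers object should also offer the same lookup directly via response.headers.get_content_charset().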

# Second argument of urlopen, data: passing it turns the request into a POST that carries these parameters
import urllib.request
import urllib.parse
data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())
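
To see what the data argument actually changes, this sketch builds two Request objects and inspects them without sending anything. The form fields are hypothetical; the point is that the same encoded parameters go into the URL for a GET and into the body for a POST.

```python
from urllib import parse, request

form = {"word": "hello", "lang": "zh"}  # hypothetical form fields

# urlencode produces a str; the data argument must be bytes
encoded = parse.urlencode(form)
data = encoded.encode("utf-8")
print(encoded)

# Nothing is sent here: the Request objects are only constructed and inspected
get_req = request.Request("http://httpbin.org/get?" + encoded)
post_req = request.Request("http://httpbin.org/post", data=data)
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```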

# Third argument of urlopen, timeout (in seconds): if no response arrives within that time, a timeout exception is raised
import urllib.request
response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
# read() returns the response body as bytes, so decode it to a string before printing
print(response.read().decode("utf-8"))
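
In a real scraper the timeout exception usually needs to be caught so one slow page does not stop the whole run. Below is a minimal sketch, assuming you want None back on a timeout and every other error re-raised. fetch_or_none is a hypothetical helper name.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, seconds):
    """Return the decoded body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=seconds) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as err:
        # A timeout while connecting can arrive wrapped in URLError;
        # any other URLError (including HTTPError) is re-raised
        if isinstance(err.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # A timeout while waiting for or reading the response can be raised directly
        return None
```

A call such as fetch_or_none("http://httpbin.org/get", 1) then either returns the page text or None, and the caller can decide whether to retry.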

Responses with urllib

# Print the type of the response object
import urllib.request
response = urllib.request.urlopen("http://www.python.org")
print(type(response))

# Status code and response headers
import urllib.request
response = urllib.request.urlopen("http://www.python.org")
print(response.status)
print(response.getheaders())
print(response.getheader("Server"))

Request: for sending more complex requests

# Wrapping the URL in a Request object and passing that to urlopen also works
import urllib.request
request = urllib.request.Request("http://www.python.org")
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))

# Build a Request object and pass it in place of the URL, since urlopen itself has no parameter for headers
from urllib import request, parse
url = "http://httpbin.org/post"
headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)", "Host": "httpbin.org"}
form = {"name": "Germey"}
data = bytes(parse.urlencode(form), encoding="utf-8")
req = request.Request(url=url, data=data, headers=headers, method="POST")
response = request.urlopen(req)
print(response.read().decode("utf-8"))

# Use req.add_header() instead of passing headers to Request; the result is the same as above
from urllib import request, parse
url = "http://httpbin.org/post"
form = {"name": "Germey"}
data = bytes(parse.urlencode(form), encoding="utf-8")
req = request.Request(url=url, data=data, method="POST")
req.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")
response = request.urlopen(req)
print(response.read().decode("utf-8"))
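
That the two ways of setting headers are equivalent can be checked offline by comparing the Request objects before anything is sent. This is a sketch under the assumption that only the User-Agent header is set; the variable names are chosen for the example.

```python
from urllib import parse, request

url = "http://httpbin.org/post"
user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
data = parse.urlencode({"name": "Germey"}).encode("utf-8")

# Variant 1: headers passed to the constructor
req_a = request.Request(url, data=data, headers={"User-Agent": user_agent}, method="POST")

# Variant 2: headers added afterwards
req_b = request.Request(url, data=data, method="POST")
req_b.add_header("User-Agent", user_agent)

# Request stores header names via str.capitalize(), hence "User-agent"
print(req_a.header_items())
print(dict(req_a.header_items()) == dict(req_b.header_items()))
print(req_b.get_header("User-agent"))
```

add_header is handy when headers are decided step by step, for example when a header is only added under some condition.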
