Python Web Scraping Basics

Latest recommended article published on 2023-06-09 09:58:38

qq_41911569

Article link: https://blog.csdn.net/qq_41911569/article/details/83141524

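What is a web crawler

A crawler is a program that imitates a browser's requests to a site, downloads whatever the site returns (HTML, JSON, or binary data such as images and video), and then extracts and stores the parts you need.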

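Two common ways to send requests

urlopen: the page content it returns is raw, undecoded bytes. After calling read(), pass the page's encoding to decode() to turn the bytes into text.
requests.get(): requests the given URL and returns a response object. The example below prints the response object, its status code, and its cookies.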

import requests

url = 'http://www.baidu.com'
response = requests.get(url)
print(response)  # the response object, e.g. <Response [200]>
print(response.status_code)  # the HTTP status code
print(response.cookies)  # the cookies set by the server

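The urlopen half of the comparison can be sketched as follows. To keep the sketch runnable without network access, it opens a data: URL (an assumption made for illustration; a real crawler would pass an http:// address), and the sample HTML string is made up. It shows that read() returns bytes and that decode() needs the page's encoding.

```python
import base64
from urllib import request

# Hypothetical page content, packed into a data: URL so no network is needed.
html = '<html><body>你好, crawler</body></html>'
payload = base64.b64encode(html.encode('utf-8')).decode('ascii')
url = 'data:text/html;charset=utf-8;base64,' + payload

with request.urlopen(url) as response:
    raw = response.read()  # undecoded bytes
    # Use the charset declared in the Content-Type header, falling back to UTF-8.
    charset = response.headers.get_content_charset() or 'utf-8'

text = raw.decode(charset)  # bytes -> str
print(type(raw), type(text))
print(text)
```

Decoding with the wrong encoding either raises UnicodeDecodeError or produces garbled text, so reading the charset from the response headers is usually safer than hard-coding it.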

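Common requests methods

Uploading and deleting data

http://httpbin.org echoes back whatever a request sends, so the responses from its /post and /delete endpoints show the data you uploaded or deleted.

response = requests.post('http://httpbin.org/post', data={'name': 'linux'})
print(response.text)
response = requests.delete('http://httpbin.org/delete', data={'name': 'linux'})
print(response.text)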

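Adding content to a basic request

# GET request with query parameters
# The full URL, written out by hand, for comparison:
url1 = 'https://movie.douban.com/subject/4864908/comments?start=20&limit=20&sort=new_score&status=P'
data = {
    'start': 20,
    'limit': 20,
    'sort': 'new_score',
    'status': 'P'
}
# requests encodes the params dict and appends it after the '?'
url = 'https://movie.douban.com/subject/4864908/comments?'
response = requests.get(url, params=data)
# Print the URL that was actually requested, query string included
print(response.url)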


Adding headers

# Add header information
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
# Proxy mapping (defined here but not passed to the request below)
proxy = {'http':'115.223.127.47:8010','https':'60.20.221.239:9999'}
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
# Note: headers must be passed in as a dict
headers = {
    "User-Agent": user_agent
}
response = requests.get(url, headers=headers)
print(response.text)

Parsing the response as JSON

# Parse a JSON response
ip = '8.8.8.8'
url = 'http://ip.taobao.com/service/getIpInfo.php?ip=%s' % ip
response = requests.get(url)
# json() decodes the JSON body into Python objects
content = response.json()
print(content)
print(type(content))

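Decoding the body with the standard json module is roughly what response.json() does. A minimal sketch of that step, using a made-up JSON string in place of a real response body (the field names are hypothetical, not the actual format returned by the service above):

```python
import json

# Hypothetical response body; a real one would come from response.text.
body = '{"code": 0, "data": {"ip": "8.8.8.8", "country": "US"}}'

content = json.loads(body)  # JSON object -> dict
print(type(content))
print(content['data']['ip'])

# Malformed JSON raises json.JSONDecodeError, so guard the call when the
# server might return an HTML error page instead.
try:
    json.loads('<html>error</html>')
except json.JSONDecodeError as err:
    print('not JSON:', err.msg)
```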

Fetching binary data

Downloading an image as an example:

# Fetch binary data
url = 'http://img003.hc360.cn/k2/M06/0D/DE/wKhQxVhckL2EB_eTAAAAAMUN00A976.jpg'
response = requests.get(url)
with open('./haha.jpg', 'wb') as f:
    # the content attribute holds the response body as raw bytes
    f.write(response.content)

Checking the status code

A conditional (ternary) expression prints 'success' when the request succeeds and exits the program otherwise.

# Check the status code
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
headers = {
    'User-Agent': user_agent
}
response = requests.get(url, headers=headers)
exit() if response.status_code != 200 else print('success')
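The same check can be written as an ordinary function, which makes the two branches easier to read and lets the program exit with a non-zero code on failure. This is only a sketch: check_status is a hypothetical helper, and the status code passed to it below stands in for response.status_code.

```python
import sys

def check_status(status_code):
    """Return 'success' for HTTP 200, otherwise exit the program."""
    if status_code != 200:
        # Passing a message to sys.exit prints it to stderr and
        # exits with a non-zero code, which signals failure to the shell.
        sys.exit(f'request failed with status {status_code}')
    return 'success'

print(check_status(200))
```

requests also provides response.raise_for_status(), which raises an exception for 4xx and 5xx responses instead of exiting.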

URL parsing

Splitting a URL

from urllib import parse
from urllib.parse import urlencode

url = 'http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search'
parsed_tuple = parse.urlparse(url)
# urlparse splits the URL into six parts and returns them as a named tuple
print(parsed_tuple)
print(parsed_tuple.scheme)
print(parsed_tuple.netloc)
# urlunparse joins the six parts back into a URL
print(parse.urlunparse(parsed_tuple))

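For reference, the six parts of the example URL can be read by name, and the query string can be split further with parse_qs. A short sketch using only the standard library:

```python
from urllib import parse

url = 'http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search'
parts = parse.urlparse(url)

# The six fields, in order: scheme, netloc, path, params, query, fragment.
for name in parts._fields:
    print(name, '=', repr(getattr(parts, name)))

# parse_qs turns the query string into a dict of lists and decodes '+' as a space.
query = parse.parse_qs(parts.query)
print(query)

# For this URL, urlunparse reverses urlparse exactly.
assert parse.urlunparse(parts) == url
```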

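Building a URL

# urlencode
# Build a URL yourself:
# encode a dict of parameters into a query string and append it to the base URL
params = {
    'name': 'westos',
    'age': 20
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=westos&age=20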

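urlencode also percent-encodes characters that are not safe in a URL, which matters once parameter values contain spaces or non-ASCII text. A small sketch; the parameter names and values here are made up for illustration:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical parameters containing a space and non-ASCII characters.
params = {'wd': 'python 爬虫', 'page': 2}
url = 'http://www.baidu.com?' + urlencode(params)
print(url)

# Decoding the query string recovers the original values (as strings).
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```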

Cookies

Cookies can be obtained in two ways, with urlopen or with requests. The requests approach is more concise and easier to follow.

The urlopen approach

Getting cookies

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# How to keep cookies in a variable (or, later, in a file)
# 1. Create a CookieJar
cookie = cookiejar.CookieJar()
# 2. Create a cookie processor from urllib.request
handler = HTTPCookieProcessor(cookie)
# 3. Build an opener that uses the cookie processor
opener = request.build_opener(handler)
# 4. Open the URL
response = opener.open('http://www.baidu.com')
# 5. Print the cookies the page set
print(cookie)
for i in cookie:
    print(i)


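Saving cookies to a file

# 2. Save the cookies to a file in a specific format
# Name of the file that will hold the cookies
cookieFilename = 'cookie.txt'
# Create a MozillaCookieJar, which holds cookies and can write them to a file
cookie = cookiejar.MozillaCookieJar(cookieFilename)
# Create a cookie processor
handler = HTTPCookieProcessor(cookie)
# Create an opener
opener = request.build_opener(handler)
# Open the page
response = opener.open('http://www.baidu.com')
# ignore_discard / ignore_expires: also save session cookies and expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)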

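Loading cookies from a file

cookiename = 'cookie.txt'
cookie = cookiejar.MozillaCookieJar(cookiename)
# Pass the same flags used when saving; otherwise session and expired cookies are skipped on load
cookie.load(filename=cookiename, ignore_discard=True, ignore_expires=True)
handle = HTTPCookieProcessor(cookie)
opener = request.build_opener(handle)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))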
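The effect of the ignore_discard flag passed to cookie.save() can be seen without any network access. The sketch below builds a session cookie by hand (the name, value, and domain are made up), saves it with MozillaCookieJar into a temporary file, and loads that file twice: once with default flags and once with ignore_discard=True.

```python
import os
import tempfile
from http import cookiejar

# A hand-built session cookie: expires=None and discard=True mean
# "forget me when the session ends". All values are hypothetical.
session_cookie = cookiejar.Cookie(
    version=0, name='sessionid', value='abc123',
    port=None, port_specified=False,
    domain='.example.com', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookie.txt')

    jar = cookiejar.MozillaCookieJar(filename)
    jar.set_cookie(session_cookie)
    jar.save(ignore_discard=True, ignore_expires=True)

    strict = cookiejar.MozillaCookieJar(filename)
    strict.load()  # default flags: session cookies are skipped
    lenient = cookiejar.MozillaCookieJar(filename)
    lenient.load(ignore_discard=True, ignore_expires=True)

print(len(strict), len(lenient))
print([(c.name, c.value) for c in lenient])
```

With default flags the session cookie is dropped on load, so a login saved this way would appear to be lost unless the same flags are passed to load().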

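Two ways to get around anti-scraping measures

Simulating a browser
Add a browser's User-Agent string to the request headers so the request looks like it came from a browser.
Using IP proxies
Collect proxy IPs from other sites and send requests through them. This suits repeated access to a single site: store the proxy IPs in a dict and send each request through a different IP.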

from urllib import request
from urllib.error import URLError


url = 'http://httpbin.org/get'
proxy = {'http':'115.223.127.47:8010','https':'60.20.221.239:9999'}
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

proxy_support = request.ProxyHandler(proxy)
# build_opener returns a customized opener that works like urlopen
opener = request.build_opener(proxy_support)
# Pretend to be a browser
opener.addheaders = [('User-Agent', user_agent)]
# Install the opener so that request.urlopen uses it
request.install_opener(opener)
# The request now goes out through the proxy that matches the URL's scheme
try:
    response = request.urlopen(url)
    content = response.read().decode('utf-8')
    print(content)
except URLError as e:
    # Raised when the proxy or the site cannot be reached
    print(e.reason)
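The code above always uses the same proxy mapping. Rotating proxies between requests, as the description suggests, could look like the sketch below. The proxy addresses are placeholders from a reserved documentation range, not working proxies, and opener_for_next_proxy is a hypothetical helper; the openers are only built here and nothing is sent.

```python
import itertools
from urllib import request

# Placeholder proxy addresses (TEST-NET range); replace with proxies you may use.
PROXIES = [
    {'http': '192.0.2.10:8010'},
    {'http': '192.0.2.11:8010'},
    {'http': '192.0.2.12:8010'},
]
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

# cycle() hands out the proxies in order and starts over at the end.
_proxy_cycle = itertools.cycle(PROXIES)

def opener_for_next_proxy():
    """Build an opener that routes through the next proxy in the rotation."""
    proxy = next(_proxy_cycle)
    opener = request.build_opener(request.ProxyHandler(proxy))
    opener.addheaders = [('User-Agent', USER_AGENT)]
    return proxy, opener

for _ in range(4):
    proxy, opener = opener_for_next_proxy()
    print(proxy['http'])
# Each opener would then be used as opener.open(url, timeout=10).
```

Returning a fresh opener per request avoids install_opener, which changes the process-wide default and would make every later urlopen call share one proxy.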
