Python Web Scraping Notes: The requests Library

Latest recommended article published 2024-08-08 18:42:28

Mr_Stutter

Category: Python Web Scraping. Tags: request, python, multithreading

Article link: https://blog.csdn.net/qq_53715621/article/details/113887807

Contents

Preface
1. Installing the requests library
2. The get method
3. The Response object
4. A general-purpose code template
5. Multithreading
Summary

Preface

Notes on the most commonly used parts of the requests library.

1. Installing the requests library

pip install requests

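2. The get method

r=requests.get(url,**kwargs) builds a Request object that asks the server for a resource and returns a Response object containing the server's response.
Commonly used kwargs:

headers: dict; can be used to imitate a browser. Copy the values from the Network tab of the browser's developer tools (F12). r.request.headers shows the headers that were actually sent.
cookies: dict mapping cookie names to values; can be used to skip a login page. Also found in the F12 Network tab.
timeout: number of seconds (int or float) to wait before giving up; a (connect, read) tuple is also accepted.
proxies: dict mapping a protocol to a proxy URL; routing requests through a proxy helps avoid having your IP banned. An IP-lookup page can be used to check the current outbound IP.
Usage: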

import requests

url = "http://www.baidu.com"  # example URL, the same one used in section 4
headers = {"user-agent": "Mozilla/5.0"}
# requests expects {cookie name: cookie value}; browser-export fields such as
# domain, path, httpOnly and secure are not part of this dict
cookies = {"...": "..."}
proxies = {"http": "http://...:...",
           "https": "https://...:..."}
r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies, timeout=30)
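
To see what the headers and cookies arguments turn into, a request can be prepared without being sent. This is only a minimal sketch: the URL `http://example.com/` and the cookie name `sid` are placeholder assumptions for illustration, not values from these notes.

```python
import requests

# Placeholder values for illustration only (assumptions, not from the notes).
url = "http://example.com/"
headers = {"user-agent": "Mozilla/5.0"}
cookies = {"sid": "abc123"}  # {cookie name: cookie value}

# Build the request and prepare it; nothing is sent over the network.
prepared = requests.Request("GET", url, headers=headers, cookies=cookies).prepare()

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive
print(prepared.headers["Cookie"])      # the cookies dict becomes a Cookie header
```

Inspecting the prepared request this way is a quick check that the browser-like header and the cookie are really attached before any traffic reaches the target site.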

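3. The Response object

r.status_code: HTTP status code; 200 means the request succeeded.
r.headers: response headers.
r.text: response body as a string.
r.content: response body as raw bytes.
r.encoding: encoding guessed from the response headers (the charset in Content-Type).
r.apparent_encoding: encoding guessed from the response body.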
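
The attributes above can be tried out offline against a throwaway local server. The sketch below is an assumption-laden illustration: the `Handler` class, the `BODY` payload, and the use of Python's built-in `http.server` are all made up for the demo. The server deliberately omits the charset, in which case requests falls back to ISO-8859-1 for text responses, so the encoding has to be corrected before reading r.text.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

BODY = "你好, requests".encode("utf-8")  # demo payload (assumption)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # No charset in Content-Type, so requests has to guess the encoding.
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):  # keep the demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment
r = session.get(f"http://127.0.0.1:{server.server_port}/", timeout=5)
server.shutdown()
server.server_close()

header_guess = r.encoding         # encoding derived from the headers only
print(r.status_code)
print(r.headers["Content-Type"])
print(header_guess)
print(r.content == BODY)          # the raw bytes are untouched
r.encoding = "utf-8"              # override the guess before reading r.text
print(r.text)
```

In real crawling the correct encoding is usually not known in advance, which is why the notes assign r.apparent_encoding to r.encoding instead of a hard-coded value.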

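4. A general-purpose code template

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # decode with the encoding guessed from the body
        return r.text
    except requests.RequestException:     # any error raised by requests
        return "An exception occurred"

if __name__ == '__main__':
    url = "http://www.baidu.com"
    print(getHTMLText(url))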
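
One possible variant of this template returns None on failure, so the caller can tell a failed request apart from real page text. This is a sketch, not the author's code: the name `fetch_text` is hypothetical, and the malformed URL is chosen only because requests rejects it before anything is sent, which makes the failure path easy to try offline.

```python
import requests

def fetch_text(url, timeout=30):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # turn 4xx/5xx responses into HTTPError
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException as e:  # base class of requests' own errors
        print("request failed:", type(e).__name__)
        return None

# A URL without a scheme is rejected before any network traffic happens.
result = fetch_text("not-a-valid-url")
print(result)
```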

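5. Multithreading

I/O-bound programs, such as those that read and write files frequently, can use multiple threads to cut the running time.

1) Starting a thread

import threading

def work(n):  # placeholder function to run in the thread
    print("working on", n)

t = threading.Thread(target=work, args=(1,))  # target=<function>, args=<tuple of arguments>
# t.daemon = True  # daemon (background) thread: killed when the main program exits
t.start()
# t.join()  # wait for the thread to finish; blocks the calling thread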
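
A small sketch of start() and join() with several threads at once. The `work` function and the short sleep are stand-ins for a slow I/O operation, assumed here purely for illustration.

```python
import threading
import time

results = []

def work(n):
    time.sleep(0.1)        # stand-in for a slow I/O operation
    results.append(n * n)  # list.append is thread-safe in CPython

threads = [threading.Thread(target=work, args=(n,)) for n in range(4)]
for t in threads:
    t.start()              # all four threads now wait concurrently
for t in threads:
    t.join()               # block until this thread has finished

print(sorted(results))
```

Because the threads sleep at the same time, the whole run takes roughly as long as one sleep rather than four, which is the gain the notes describe for I/O-bound work.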

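2) Thread locks

A lock coordinates the threads and keeps them from using a shared resource at the same time.

import threading

lock = threading.RLock()
if lock.acquire(timeout=2):  # try to acquire the lock; returns False after 2 s
    try:
        ...  # work with the shared resource
    finally:
        lock.release()  # release the lock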
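
When no timeout is needed, a lock can also be used as a context manager, which releases it even if the protected code raises. A minimal sketch, with a shared counter (`counter`, `add_many`) assumed as the shared resource:

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(times):
    global counter
    for _ in range(times):
        with lock:        # acquired here, released when the block exits
            counter += 1  # read-modify-write on shared state

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```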

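3) Multithreaded crawler with pre-assigned tasks

The tasks are divided up in advance and each thread works through its own share.

Define the thread class

import random
import threading
import time

import requests

class myThread(threading.Thread):
    def __init__(self, name, worklist):
        super().__init__()
        self.name = name          # thread name
        self.worklist = worklist  # [first index, last index] of this thread's URLs
    def run(self):
        print("on:" + self.name)
        GetHtml(self.name, self.worklist)  # do this thread's share of the work
        print("off:" + self.name)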

Define the worker function

def GetHtml(name, worklist):
    for i in range(worklist[0], worklist[1] + 1):
        try:
            r = requests.get(urls[i], timeout=30)
            time.sleep(random.random() * 5)  # throttle the request rate
            with lock:                       # guard the shared resource (stdout) with the lock
                print(r.text)
        except Exception as e:
            print(name, "Error:", e)

Create the threads

def startThread(num):
    for i in range(num):
        thread = myThread('#' + str(i + 1), worklists[i])
        thread.start()
        threads.append(thread)  # keep track of the threads
    for thread in threads:
        thread.join()  # wait for every thread to finish

Main program

urls = []
threads = []
worklists = [[0, 2], [3, 5], [6, 8]]  # index range assigned to each thread
lock = threading.Lock()
for i in range(9):
    urls.append('http://127.0.0.1:5000/' + str(i + 1))
startThread(len(worklists))
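
The worklists above are written out by hand. If the number of URLs or threads changes, the index ranges could be computed instead. The helper below is a hypothetical addition (the name `split_ranges` is not from the original notes); it produces inclusive [start, end] pairs in the same format as worklists.

```python
def split_ranges(total, parts):
    """Split indices 0..total-1 into up to `parts` inclusive [start, end] ranges."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        if size == 0:
            break                               # more threads than tasks
        ranges.append([start, start + size - 1])
        start += size
    return ranges

print(split_ranges(9, 3))
print(split_ranges(10, 3))
```

With 9 URLs and 3 threads this reproduces the hand-written split; with 10 URLs the first thread simply takes one extra task.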

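4) Multithreaded crawler with a task queue

Tasks are handed to whichever thread is idle; the threads exit only after all tasks are done.

Define the thread class

import queue
import random
import threading
import time

import requests

class myThread(threading.Thread):
    def __init__(self, name, q):
        super().__init__()
        self.name = name  # thread name
        self.q = q        # shared task queue
    def run(self):
        print("on:" + self.name)
        GetHtml(self.name, self.q)
        print("off:" + self.name)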

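Define the worker function

def GetHtml(name, q):
    while True:
        try:
            url = q.get(timeout=2)  # take a task from the queue
        except queue.Empty:         # no task arrived within 2 s: stop
            break
        try:
            r = requests.get(url, timeout=30)
            time.sleep(random.random() * 5)  # throttle the request rate
            with lock:
                print(r.text)
        except Exception as e:
            print(name, "Error:", e)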

Create the threads

def startThread(urls):
    workqueue = queue.Queue(len(urls))
    for url in urls:
        workqueue.put(url)  # put each task into the queue
    for i in range(3):
        thread = myThread('#' + str(i + 1), workqueue)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

Main program

urls=[]
threads=[]
lock=threading.Lock()
for i in range(9):
    urls.append('http://127.0.0.1:5000/'+str(i+1))
startThread(urls)
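
The queue pattern can be tried without a server on port 5000 by swapping the HTTP call for a stand-in. In this sketch `fake_fetch` and `worker` are hypothetical names, and `fake_fetch` only imitates requests.get(url).text. Because the queue is fully loaded before the workers start, it uses get_nowait() rather than a timed get.

```python
import queue
import threading

def fake_fetch(url):
    """Stand-in for requests.get(url).text so the sketch runs offline."""
    return "page for " + url

def worker(name, q, results, lock):
    while True:
        try:
            url = q.get_nowait()  # the queue is fully loaded before workers start
        except queue.Empty:
            break                 # nothing left: let the thread finish
        text = fake_fetch(url)
        with lock:                # protect the shared results list
            results.append((name, text))

urls = ["http://127.0.0.1:5000/" + str(i + 1) for i in range(9)]
q = queue.Queue()
for url in urls:
    q.put(url)

results = []
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=("#" + str(i + 1), q, results, lock))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results), "pages fetched")
```

Which thread handles which URL varies from run to run; only the total is fixed, which is the point of letting idle threads pull work instead of assigning it in advance.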

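Summary

requests is easy to debug and simple to use, and it suits small-scale crawlers; when many pages have to be fetched, multithreading saves time. To avoid being blocked by the target site, you can send browser-like headers, set cookies, route requests through proxy IPs, and throttle the request rate.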
