Python多线程爬取网站image的src属性实例

最新推荐文章于 2022-04-06 20:23:23 发布

pergoods

最新推荐文章于 2022-04-06 20:23:23 发布

阅读量1.6k

点赞数

分类专栏： python编程以及安装和配置文章标签：多线程 python

本文链接：https://blog.csdn.net/pergoods/article/details/84895518

版权

python编程以及安装和配置专栏收录该内容

27 篇文章 0 订阅

订阅专栏

# coding=utf-8
'''
Created on 2017年5月16日
@author: chenkai
Python多线程爬取某单无聊图图片地址(requests+BeautifulSoup+threading+Queue模块)
'''

import requests
from bs4 import BeautifulSoup
import threading
import Queue
import time

class Spider_Test(threading.Thread):
def __init__(self,queue):
threading.Thread.__init__(self)
self.__queue = queue
def run(self):
while not self.__queue.empty():
page_url=self.__queue.get() [color=red]#从队列中取出url[/color]
print page_url
self.spider(page_url)
def spider(self,url):
r=requests.get(url) [color=red]#请求url[/color]
soup=BeautifulSoup(r.content,'lxml') [color=red]#r.content就是响应内容，转换为lxml的bs对象[/color]
imgs = soup.find_all(name='img',attrs={}) #查找所有的img标签，并获取标签属性值（为列表类型）
for img in imgs:
if 'onload' in str(img): [color=red]#img属性集合中包含onload属性的为动态图.gif,[/color]
print 'http:'+img['org_src']
else:
print 'http:'+img['src']

def main():
queue=Queue.Queue()
url_start = 'http://jandan.net/pic/page-'
for i in range(293,295):
url = url_start+str(i)+'#comment'
queue.put(url) [color=red]#将循环拼接的url放入队列中[/color]

threads=[]
thread_count=2 [color=red]#默认线程数（可自动修改）[/color]
for i in range(thread_count):
threads.append(Spider_Test(queue))
for i in threads:
i.start()
for i in threads:
i.join()

if __name__ == '__main__':[color=red] #在.py文件中使用这个条件语句，可以使这个条件语句块中的命令只在它独立运行时才执行[/color]
time_start = time.time()
main() [color=red]#调用main方法[/color]
print time.time()-time_start

[color=red]#背景知识[/color]
'''
q = Queue.Queue(maxsize = 10)
Queue.Queue类即是一个队列的同步实现。队列长度可为无限或者有限。可通过Queue的构造函数的可选参数maxsize来设定队列长度。如果maxsize小于1就表示队列长度无限。
将一个值放入队列中
q.put(10)
调用队列对象的put()方法在队尾插入一个项目。put()有两个参数，第一个item为必需的，为插入项目的值；第二个block为可选参数，默认为
1。如果队列当前为空且block为1，put()方法就使调用线程暂停,直到空出一个数据单元。如果block为0，put方法将引发Full异常。
将一个值从队列中取出
q.get()
调用队列对象的get()方法从队头删除并返回一个项目。可选参数为block，默认为True。如果队列为空且block为True，get()就使调用线程暂停，直至有项目可用。如果队列为空且block为False，队列将引发Empty异常。

'''

[color=red]如果想要下载图片需要
import urllib

再替换spider方法即可[/color]

def spider(self,url):
r=requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
imgs = soup.find_all(name='img',attrs={})
urls=[]
for img in imgs:
if 'onload' in str(img):
print 'http:'+img['org_src']
urls.append('http:'+img['org_src'])
else:
print 'http:'+img['src']
url = urls.append('http:'+img['src'])
#下载图片
k=0
for urlitem in urls:
k+=1
if '.jpg' in urlitem:
urllib.urlretrieve(url=urlitem,filename='F:\image\\'+str(k)+'.jpg')

[color=red]-----------多线程访问百度实例[/color]
#coding:utf-8
import requests
import threading
import time
import sys

url = 'https://www.baidu.com'

def get_baidu():
global url
time_start = time.time()
r = requests.get(url=url)
times = time.time()-time_start
sys.stdout.write('status:%s time:%s current_time:%s\n'%(r.status_code,times,time.strftime('%H:%M:%S')))

def main():
threads = []
thread_count = 10
for i in range(thread_count):
t = threading.Thread(target=get_baidu,args=())
threads.append(t)
for i in range(thread_count):
threads[i].start()

for i in range(thread_count):
threads[i].join()

if __name__=='__main__':
main()

pergoods

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Python多线程爬取网站image的src属性实例

# coding=utf-8'''Created on 2017年5月16日 @author: chenkaiPython多线程爬取某单无聊图图片地址(requests+BeautifulSoup+threading+Queue模块)'''import requestsfrom bs4 import BeautifulSoupimport threading...
复制链接

扫一扫

专栏目录