Python3 Multithreaded Web Crawler in Practice

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
"""
pip install beautifulsoup4
pip install lxml
pip install requests
pip install threadpool
"""

import requests
from urllib.request import urlretrieve   # only used by the commented-out fallback in download()
from bs4 import BeautifulSoup
from contextlib import suppress
import os
import time
import threadpool
from concurrent.futures import ThreadPoolExecutor
import ssl
from requests.models import Response
from requests.adapters import HTTPAdapter

# Allow HTTPS sites with invalid or self-signed certificates.
ssl._create_default_https_context = ssl._create_unverified_context

# Session whose mounted adapters retry failed requests automatically.
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=3))
session.mount('https://', HTTPAdapter(max_retries=3))

# Crawl target
domain = 'https://www.baidu.com/'
main_url = domain + 'thread.php?fid-24.html'
gif_url = domain + 'thread.php?fid-29-page-1.html'

cur_path = os.getcwd() + '/images'
gif_path = os.getcwd() + '/gifs'

# HTTP request headers
header = {
    "Accept": "image/webp,image/apng,image/*,*/*;q=0.8",
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36',
    "Connection": "keep-alive",
}

def update_header(referer):
    header['Referer'] = referer

def downloadContent(url):
    # Placeholder response returned if every retry fails.
    the_response = Response()
    the_response.status_code = 400
    the_response.encoding = 'UTF-8'
    the_response._content = b'{ "key" : "a" }'
    i = 0
    while i < 3:
        # RequestException also covers SSLError, ReadTimeout and ConnectionError.
        with suppress(requests.exceptions.RequestException):
            # Go through the retry-mounted session defined above.
            the_response = session.get(url, headers=header, timeout=5, stream=True)
            return the_response
        time.sleep(2)
        i += 1
        if i >= 3:
            print(url, ' times:', i)
    return the_response
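# update_header() above is never called in the main flow below. If the image
# host checks the Referer header, a hypothetical call could be added before
# each image download (a sketch, not part of the original post):
#
#     update_header(domain + sub_url['url'])            # thread page as Referer
#     downloadInThreadPool(pool, image_url, save_name)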

   

def replace(tag, content):
    tag.replace_with(content)

def mkdir(dir):
    if not os.path.exists(dir):
        os.makedirs(dir, exist_ok=True)

def getDirname(pathN):
    return os.path.dirname(pathN)

def getName(pathN):
    # Everything after the last '/': the file name part of a path or URL.
    return pathN[len(getDirname(pathN)) + 1:]

def saveContent(content, to_file):
    mkdir(getDirname(to_file))
    with open(to_file, 'wb') as file:
        file.write(content)

   

def download(url: str, save_file: str):
    content = downloadContent(url).content
    saveContent(content, save_file)
    # urlretrieve(url, save_file)

def downloadInThreadPool(tdpool, url, save_file):
    argsList = [([url, save_file], None)]
    reqs = threadpool.makeRequests(download, argsList)  # build the work requests
    for req in reqs:  # submit them to the pool
        tdpool.putRequest(req)
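# A quick recap of the third-party threadpool API used here: makeRequests()
# wraps the callable plus one ([args], kwargs) pair per task into WorkRequest
# objects, putRequest() queues them on the pool's worker threads, and
# ThreadPool.wait() (called at the end of each page in the main loop) blocks
# until every queued request has finished.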

   

# Fetch a page and parse it into a BeautifulSoup tree.
def getUrlAsSoup(url, aHeader):
    resp = downloadContent(url)
    print(resp.encoding)
    # requests already decodes the body with the detected encoding, so .text is a str.
    html = resp.text
    # parser = 'html.parser'   # standard-library parser, no extra dependency
    parser = 'lxml'            # faster parser from the lxml package
    soup = BeautifulSoup(html, parser)
    # soup = soup.prettify()
    return soup


 

def getPage(sp):
    # Print the pagination links of a board page.
    pages = sp.find('div', class_='pages').find_all('a')
    for href in pages:
        # href = href.prettify()
        print(href.attrs['href'])

def getSubPage(sp):
    # Collect the title and relative URL of every thread listed on a board page.
    pages = sp.find_all('a', class_='subject')
    # print(sp)
    hrefs = []
    for page in pages:
        tmp = {"name": page.string,
               "url": page.attrs['href']}
        hrefs.append(tmp)
        # print(tmp)
    return hrefs

def getImages(sp):
    # Collect the attachment image URLs on a thread page.
    images = sp.find_all('ignore_js_op', class_='att_img')
    # print(sp)
    image_hrefs = []
    for img in images:
        pic = img.find('img')
        image_hrefs.append(pic.attrs['src'])
        # print("image:", pic.attrs['src'])
    return image_hrefs
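# For reference, getSubPage() yields entries shaped like
#     {"name": <thread title>, "url": <relative link from the <a class="subject"> tag>}
# and getImages() returns the 'src' URLs of the <img> tags nested in the
# forum's 'ignore_js_op' attachment wrappers.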

if __name__ == "__main__":
    pool = threadpool.ThreadPool(16)  # create the thread pool
    # thread_pool_size = 32  # pool size
    # executor = ThreadPoolExecutor(max_workers=thread_pool_size)
    # futures = []
    all_images = []
    for i in range(13, 1616):
        page_url = domain + 'thread.php?fid-29-page-' + str(i) + '.html'
        print(page_url)
        page = getUrlAsSoup(page_url, header)
        getPage(page)
        sub_pages = getSubPage(page)
        for sub_url in sub_pages:
            img_url_sp = getUrlAsSoup(domain + sub_url['url'], header)
            tmp = {"name": sub_url['name'],
                   "image_urls": getImages(img_url_sp)}
            all_images.append(tmp)
            # print(tmp)
            if tmp["image_urls"] != []:
                print(i, sub_url['name'])
                for image_url in tmp["image_urls"]:
                    save_name = gif_path + '/' + str(i) + '_' + tmp["name"] + '/' + getName(image_url)
                    # print(image_url)
                    # print(save_name)
                    downloadInThreadPool(pool, image_url, save_name)
        pool.wait()
        print('finish..', i)

    print('Download complete')
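
The script imports concurrent.futures.ThreadPoolExecutor and carries a commented-out executor setup in the main block, but only the legacy threadpool package is actually exercised. As a rough sketch under those assumptions (reusing the download() helper above; not part of the original code), the same fan-out could be driven by the standard-library pool instead:

executor = ThreadPoolExecutor(max_workers=16)
futures = []

def downloadInExecutor(executor, url, save_file):
    # Same role as downloadInThreadPool(), but with concurrent.futures.
    return executor.submit(download, url, save_file)

# In the main loop, replace downloadInThreadPool(pool, image_url, save_name) with:
#     futures.append(downloadInExecutor(executor, image_url, save_name))
# and replace pool.wait() with:
#     for f in futures:
#         f.result()      # blocks until that download finishes
#     futures.clear()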
