Scraping Images with Python and Anti-Crawling Measures: Multiprocess Crawling of an Image Site (Python Crawler)

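While playing around with crawlers a while ago, I came across an image site that is remarkably easy to scrape: it does not seem to have any anti-crawling mechanism. At a teammate's suggestion I went straight to multiprocess crawling and, heh, pulled down several GB of images. It felt like nothing could stop me and that I could empty the whole site. Beginners can give it a try; keep it as simple as possible.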

The code is as follows:

import os
import re
import time
import uuid
from multiprocessing import Pool

import requests
from lxml import etree

# Base URL of the first listing page
rooturl = 'http://www.win4000.com/zt/huyan_'
# http://www.win4000.com/zt/fengjing.html

# Request headers that mimic a browser
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
                  " AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/70.0.3538.110 Safari/537.36"
}


# Collect the titles and URLs of the image sets on one listing page
def graph_set(rooturl):
    urls = []
    titles = []
    results = requests.get(rooturl, headers=header)
    text = results.text
    res = re.findall('.*href="(.*)" alt="', text)
    selector = etree.HTML(text)
    tt = selector.xpath('//div[contains(@class,"tab_tj")]//li//p')
    for url in res:
        urls.append(url)
    for tit in tt[:24]:
        titles.append(tit.text)
    return titles, urls


# Parse an image-set page and find the link to its original-image page
def parser(tup):
    response = requests.get(tup[0], headers=header)
    text = response.text
    # "查看原图" is the site's "view original image" link text
    originset = re.findall('href="(.*)" class=.*查看原图', text)
    time.sleep(1)
    if not originset:
        # No "view original image" link found, so there is nothing to download
        return
    oringin(originset.pop(), tup[1])


# Download every original image in one image set
def oringin(page, name):
    print(f'{name}: crawling')
    # Raw string so the backslashes in the Windows path stay literal
    save_dir = r'G:\python 资源\python project\美桌网壁纸爬取\护眼图片'
    os.makedirs(save_dir, exist_ok=True)
    origin_urls = []
    response = requests.get(page, headers=header)
    # NOTE: the pattern on the next line is cut off in the source post and is
    # kept as-is; it must be completed before the script will run.
    res = re.findall('li.*href="(.*)".*>
    for url in res:
        result = re.findall('(.*)" target', url)
        if result:
            origin_urls.append(result)
    for url in origin_urls:
        # A UUID gives every downloaded file a unique name
        count = uuid.uuid1()
        res = requests.get(url.pop(), headers=header)
        with open(os.path.join(save_dir, str(count) + '.jpg'), 'wb') as file:
            file.write(res.content)
        # time.sleep(1)


def main(rooturl):
    pagename, pageset = graph_set(rooturl)
    # Serial equivalent, kept for reference:
    # for url, name in zip(pageset, pagename):
    #     parser((url, name))
    # The with-block shuts the worker processes down once map() returns
    with Pool() as p:
        p.map(parser, zip(pageset, pagename))


if __name__ == '__main__':
    for i in range(1, 6):
        pageurl = rooturl + str(i) + '.html'
        print(f'Page {i}: starting crawl......')
        main(pageurl)
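The multiprocess step in main() boils down to handing each (url, title) tuple to a worker process through Pool.map. Here is a minimal sketch of just that pattern, so it can be tried without touching the network. The worker describe_task, the helper run_all, and the example.com URLs are all made up for illustration and are not part of the script above.

```python
from multiprocessing import Pool


def describe_task(tup):
    """Stand-in for parser(): receives one (url, name) tuple, returns a label."""
    url, name = tup  # same layout that zip(pageset, pagename) produces
    return f"{name} <- {url}"


def run_all(pageset, pagename, processes=2):
    """Fan the (url, name) pairs out to a small pool of worker processes."""
    tasks = list(zip(pageset, pagename))
    # The with-block terminates the workers once map() has returned.
    with Pool(processes=processes) as pool:
        # map() returns the results in the same order as the input tasks.
        return pool.map(describe_task, tasks)


if __name__ == "__main__":
    # Hypothetical URLs and titles, for illustration only.
    urls = ["http://example.com/a.html", "http://example.com/b.html"]
    titles = ["set A", "set B"]
    for line in run_all(urls, titles):
        print(line)
```

The worker has to be a top-level function so that it can be pickled and sent to the child processes, and the `if __name__ == "__main__":` guard keeps the children from re-running the pool setup on platforms that start processes by spawning.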

Results:

[Image: 20181231191054773.png]

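There are too many images to name by hand, so each file name is an automatically generated UUID (uuid.uuid1() in the code).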
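For anyone who wants to reuse the naming trick on its own, here is a small sketch. It assumes uuid.uuid4() (fully random) rather than the uuid.uuid1() used in the script, which is derived from the host ID and a timestamp. The helper unique_image_path and the "images" directory are hypothetical names chosen for this example.

```python
import os
import uuid


def unique_image_path(save_dir, ext=".jpg"):
    """Return a path inside save_dir whose file name is a random UUID."""
    # uuid4().hex is 32 hexadecimal characters with no dashes.
    return os.path.join(save_dir, uuid.uuid4().hex + ext)


if __name__ == "__main__":
    # "images" is a hypothetical directory name used only for this demo.
    print(unique_image_path("images"))
```

Because every call produces a fresh name, several worker processes can write into the same directory without agreeing on a shared counter.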
