Image crawler: downloads the images from an entire website
Tools and environment:
- Google Chrome
- Python
- PyCharm
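Dependencies:
- requests: sends HTTP requests and downloads the images
- lxml: parses the HTML files
- grequests: a gevent-based asynchronous HTTP request library, used to speed up crawling
Source files:
- get_image.py: sends one request at a time
- get_image_gevent.py: sends five requests at a time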
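The difference between the two scripts can be sketched roughly as follows. This is only an illustration, not the code from get_image_gevent.py: the helper names chunk and fetch_in_batches, the HEADERS constant, and the 10-second timeout are assumptions made for the example. It splits the URL list into groups of five and hands each group to grequests.map, which sends the requests in a group concurrently.

```python
# Hypothetical sketch: fetch URLs five at a time with grequests.
# chunk(), fetch_in_batches() and HEADERS are illustrative names,
# not taken from the original scripts.

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
}


def chunk(items, size=5):
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_in_batches(urls, size=5):
    """Return one response (or None on failure) per URL, in input order."""
    # Imported here so the helper above can be used without grequests installed.
    import grequests

    responses = []
    for group in chunk(urls, size):
        # Build unsent requests, then send the whole group concurrently.
        pending = [grequests.get(u, headers=HEADERS, timeout=10) for u in group]
        # grequests.map returns None in place of any request that failed.
        responses.extend(grequests.map(pending))
    return responses
```

A failed request shows up as None in the result, so callers should skip those entries before reading response.content.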
Note: the directory where images are stored can be changed in the get_images function.
Full code:
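One way to make the storage directory easy to change is to pass it in as a parameter instead of hard-coding it. The sketch below assumes Python 3; the names image_path and save_image and the default directory "images" are hypothetical and do not come from get_images itself.

```python
# Hypothetical sketch: keep the image directory configurable.
# image_path(), save_image() and the "images" default are assumptions.
import os
from urllib.parse import urlparse


def image_path(image_url, image_dir="images"):
    """Build the local path for image_url inside image_dir."""
    # Use the last path segment of the URL as the file name.
    filename = os.path.basename(urlparse(image_url).path) or "unnamed.jpg"
    return os.path.join(image_dir, filename)


def save_image(image_url, content, image_dir="images"):
    """Write the downloaded bytes to image_dir, creating it if needed."""
    os.makedirs(image_dir, exist_ok=True)
    path = image_path(image_url, image_dir)
    with open(path, "wb") as f:
        f.write(content)
    return path
```

With this layout, changing where images go only requires passing a different image_dir.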
# -*- coding: utf-8 -*-
# Rewritten with grequests to speed up image crawling
import os
import requests
import grequests
import time
from lxml import html


def get_response(url):
    # Identify as a desktop browser; the header name must be "User-Agent"
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    return response


# Collect the URL of every page
def get_page_urls():
    start_url = 'http://girl-atlas.com/'
    response = get_response(start_url)
    page_urls = []
    page_urls.append(start_url)
    while True:
        parsed_body = html.fromstring(response.text)
        next_url = parsed_body