Web Scraper GoGoGo
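Web scraping is really fun. Today I learned how to replace the request headers, here the User-Agent, so that the request looks like it comes from a browser.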
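To see what replacing the header actually changes, here is a small sketch that builds two requests without sending them and compares their User-Agent values. It assumes the `requests` library is installed; the variable names `default_req` and `custom_req` are mine, and `Session.prepare_request` is used only so that nothing goes over the network.

```python
import requests

# Same browser-like User-Agent string as in the script below.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

url = 'https://www.vmgirls.com/12985.html'
session = requests.Session()

# prepare_request() merges the session defaults with the per-request
# headers and returns the request that would be sent, without sending it.
default_req = session.prepare_request(requests.Request('GET', url))
custom_req = session.prepare_request(requests.Request('GET', url, headers=headers))

default_ua = default_req.headers['User-Agent']
custom_ua = custom_req.headers['User-Agent']

print(default_ua)  # the library default, which starts with "python-requests/"
print(custom_ua)   # the browser string from the headers dict
```

After a real request, `response.request.headers` shows the same information, which is what the commented-out print in the script below checks.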
import requests
import re
import time
import os
headers = {
    # 'User-Agent': 'asdsadsase'
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
response = requests.get('https://www.vmgirls.com/12985.html', headers=headers)
# print(response.request.headers)
# print(response.text)
html = response.text
# Parse the page
dir_name = re.findall('<h1 class="post-title h3">(.*?)</h1>', html)[-1]
# print(dir_name)
# exit()
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
# exit()
urls = re.findall('<a href="(.*?)" alt=".*?" title=".*?"', html)
print(urls)
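# Save the images
for url in urls:
    # Pause between requests so the server is not hit too quickly
    time.sleep(1)
    # Image file name: the last segment of the URL
    file_name = url.split('/')[-1]
    response = requests.get(url, headers=headers)
    with open(dir_name + '/' + file_name, 'wb') as f:
        f.write(response.content)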
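The parsing step is easy to try offline. Below is a rough sketch that runs the same two regular expressions against a made-up HTML fragment; `sample_html`, the `example.com` URLs, and the `'untitled'` fallback are my own assumptions for illustration, and the real page markup may differ.

```python
import re

# Hypothetical HTML fragment, not copied from the real page.
sample_html = '''
<h1 class="post-title h3">Sample Album</h1>
<a href="https://example.com/images/001.jpeg" alt="a" title="a"><img></a>
<a href="https://example.com/images/002.jpeg" alt="b" title="b"><img></a>
'''

# Same title pattern as the script; fall back to a default name
# instead of raising IndexError when nothing matches.
titles = re.findall('<h1 class="post-title h3">(.*?)</h1>', sample_html)
dir_name = titles[-1] if titles else 'untitled'

# Same link pattern as the script: capture the href of each image link.
urls = re.findall('<a href="(.*?)" alt=".*?" title=".*?"', sample_html)

# The file name is the last path segment of each URL.
file_names = [url.split('/')[-1] for url in urls]

print(dir_name)
print(file_names)
```

If the site changes its markup, both lists simply come back empty, so printing them before downloading (as the script does with `print(urls)`) is a cheap sanity check.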
Scraping images from the site above gets noticeably faster when going through a VPN...