The requests module is used to send HTTP requests to a website and retrieve the page's HTML.
The BeautifulSoup module is used to extract the data we want from that HTML text.
Beautiful Soup parses a complex HTML document into a tree structure in which every node is a Python object. All objects fall into one of four types: Tag, NavigableString, BeautifulSoup, and Comment.
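A tiny document is enough to see all four object types at once (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# A minimal document containing text, a nested tag, and a comment.
html = "<p class='intro'>Hello <b>world</b><!-- a comment --></p>"
soup = BeautifulSoup(html, 'html.parser')  # the whole tree: a BeautifulSoup object

p = soup.p                # Tag
text = p.contents[0]      # NavigableString: "Hello "
comment = p.contents[2]   # Comment (a subclass of NavigableString)

print(type(soup).__name__)     # BeautifulSoup
print(type(p).__name__)        # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```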
requests issues requests to the server; the two common kinds are GET and POST. You can pass a headers parameter to requests.get().
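As a sketch of how a headers dict attaches to a GET request, the snippet below builds a request with requests.Request and prepares it instead of sending it, so it can be inspected without any network access (the URL and User-Agent value are just placeholders):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a browser

# requests.get(url, headers=headers) would send this immediately;
# preparing it instead lets us inspect the outgoing request offline.
req = requests.Request('GET', 'https://unsplash.com', headers=headers)
prepared = req.prepare()
print(prepared.method)                 # GET
print(prepared.headers['User-Agent'])  # Mozilla/5.0
```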
BeautifulSoup essentially parses the response the server returns after a request. Parsing with lxml is reportedly faster; html.parser is the parser that ships with Python.
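The parser is chosen by the second argument to the BeautifulSoup constructor. A minimal sketch (the HTML string is made up):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>hi</p></body></html>"

# html.parser is in the standard library; no extra install needed.
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.text)  # hi

# lxml is faster and more lenient, but requires `pip install lxml`:
# soup = BeautifulSoup(html, 'lxml')
```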
The workflow: use requests to issue a request to the URL and get a response, feed the response to BeautifulSoup, then use find() or find_all() to extract the tags you want, such as p, a, or img. You can narrow the match to an HTML class with class_ — note the trailing underscore, because class is a Python keyword.
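The class_ filter can be tried on a small hand-written snippet before pointing it at a real page (the class names and src values below are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical response text standing in for a real page.
html = """
<div>
  <img class="thumb" src="a.jpg">
  <img class="thumb" src="b.jpg">
  <img class="banner" src="c.jpg">
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ (trailing underscore) filters by HTML class,
# since `class` itself is reserved in Python.
thumbs = soup.find_all('img', class_='thumb')
print([img['src'] for img in thumbs])  # ['a.jpg', 'b.jpg']
```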
from bs4 import BeautifulSoup
import os
import requests

class GetPicture():
    def __init__(self):
        self.dir = 'D:/python-pic'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
        self.web_url = 'https://unsplash.com'

    def mkdir(self, path):
        path = path.strip()
        if os.path.exists(path):
            print("The folder already exists")
        else:
            os.makedirs(path)
            print("Creating the folder: " + path)

    def save_pic(self, count, url):
        print("Saving the picture, this may take some time...")
        img_name = str(count) + '.jpg'
        # headers must be passed as a keyword argument; a bare second
        # positional argument is treated as query params, not headers.
        img = requests.get(url, headers=self.headers)
        with open(img_name, 'wb') as f:  # 'wb' overwrites any stale file
            f.write(img.content)
        print("'" + img_name + "' saved successfully")

    def get_pic(self):
        r = requests.get(self.web_url, headers=self.headers)
        all_img = BeautifulSoup(r.text, 'html.parser').find_all('img', class_="_2zEKz")
        # Create the target folder once, then download into it.
        self.mkdir(self.dir)
        os.chdir(self.dir)
        count = 1
        for img in all_img:
            pic_url = img['src']
            self.save_pic(count, pic_url)
            count += 1

get1 = GetPicture()
get1.get_pic()
For an introduction to HTTP, see:
https://www.liaoxuefeng.com/wiki/1016959663602400/1017804782304672