The requests module is used to send HTTP requests to a website and retrieve the page's HTML.
The BeautifulSoup module is used to extract the data we want from that HTML text.
Beautiful Soup parses a complex HTML document into a tree structure in which every node is a Python object. All objects fall into one of four types: Tag, NavigableString, BeautifulSoup, and Comment.
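A tiny document is enough to see all four object types at once (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# A minimal document containing text, a nested tag, and a comment.
html = "<p class='intro'>Hello <b>world</b><!-- a comment --></p>"
soup = BeautifulSoup(html, 'html.parser')  # the whole tree: a BeautifulSoup object

p = soup.p                # Tag
text = p.contents[0]      # NavigableString: "Hello "
comment = p.contents[2]   # Comment (a subclass of NavigableString)

print(type(soup).__name__)     # BeautifulSoup
print(type(p).__name__)        # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```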
requests issues requests to the server; the two common kinds are GET and POST. You can pass a headers parameter to requests.get().
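As a sketch of how a headers dict attaches to a GET request, the snippet below builds a request with requests.Request and prepares it instead of sending it, so it can be inspected without any network access (the URL and User-Agent value are just placeholders):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a browser

# requests.get(url, headers=headers) would send this immediately;
# preparing it instead lets us inspect the outgoing request offline.
req = requests.Request('GET', 'https://unsplash.com', headers=headers)
prepared = req.prepare()
print(prepared.method)                 # GET
print(prepared.headers['User-Agent'])  # Mozilla/5.0
```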
BeautifulSoup essentially parses the response the server returns after a request. Parsing with lxml is reportedly faster; html.parser is the parser that ships with Python.
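The parser is chosen by the second argument to the BeautifulSoup constructor. A minimal sketch (the HTML string is made up):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>hi</p></body></html>"

# html.parser is in the standard library; no extra install needed.
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.text)  # hi

# lxml is faster and more lenient, but requires `pip install lxml`:
# soup = BeautifulSoup(html, 'lxml')
```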
The workflow: use requests to issue a request to the URL and get a response, feed the response to BeautifulSoup, then use find() or find_all() to extract the tags you want, such as p, a, or img. You can narrow the match to an HTML class with class_ — note the trailing underscore, because class is a Python keyword.
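The class_ filter can be tried on a small hand-written snippet before pointing it at a real page (the class names and src values below are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical response text standing in for a real page.
html = """
<div>
  <img class="thumb" src="a.jpg">
  <img class="thumb" src="b.jpg">
  <img class="banner" src="c.jpg">
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ (trailing underscore) filters by HTML class,
# since `class` itself is reserved in Python.
thumbs = soup.find_all('img', class_='thumb')
print([img['src'] for img in thumbs])  # ['a.jpg', 'b.jpg']
```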
from bs4 import BeautifulSoup
import os
import requests

class GetPicture():
    def __init__(self):
        self.dir = 'D:/python-pic'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
        self.web_url = 'https://unsplash.com'

    def mkdir(self, path):
        path = path.strip()
        if os.path.exists(path):
            print("The folder already exists")
        else:
            os.makedirs(path)
            print("Creating the folder: " + path)

    def save_pic(self, count, url):
        print("Saving the picture, this may take some time...")
        img_name = str(count) + '.jpg'
        # headers must be passed as a keyword argument; a bare second
        # positional argument is treated as query params, not headers.
        img = requests.get(url, headers=self.headers)
        with open(img_name, 'wb') as f:  # 'wb' overwrites any stale file
            f.write(img.content)
        print("'" + img_name + "' saved successfully")

    def get_pic(self):
        r = requests.get(self.web_url, headers=self.headers)
        all_img = BeautifulSoup(r.text, 'html.parser').find_all('img', class_="_2zEKz")
        # Create the target folder once, then download into it.
        self.mkdir(self.dir)
        os.chdir(self.dir)
        count = 1
        for img in all_img:
            pic_url = img['src']
            self.save_pic(count, pic_url)
            count += 1

get1 = GetPicture()
get1.get_pic()
For an introduction to HTTP, see:
https://www.liaoxuefeng.com/wiki/1016959663602400/1017804782304672