[Python] Web Scraping: Batch Downloading Images

VegB

Category: Python. Tags: python, web scraping, file download

Article link: https://blog.csdn.net/daphne566/article/details/54767235

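This article walks through a hands-on Python web-scraping project, focusing on batch-downloading images. It covers the BeautifulSoup module, file I/O, the requests module, and the urlretrieve() function from urllib.request. Working through the project shows how to extract image links from an HTML file and save the images locally.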


Description

Yixiaohan/show-me-the-code, problems 0008 && 0009 && 0013
0008: Given an HTML file, extract its body text.
0009: Given an HTML file, extract the links in it.
0013: Write an image-scraping program in Python.

Notes

This small project involves using the BeautifulSoup module, file I/O, and downloading files from the web. Key points:

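Using the requests module && the Response class
1. The requests module handles HTTP requests: GET, POST, DELETE, PUT, and so on.
  response = requests.get(url)
  This call returns a Response object.
2. Different sites may use different encodings, so this project explicitly sets the response's encoding to utf-8.
  response.encoding = "utf-8"
3. The text attribute of the Response object gives the HTML source.
  html_code = response.text
  Another way to get the HTML source is shown below; note that read() returns bytes rather than str.
  html_code = urllib.request.urlopen(url).read()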
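
To see why step 2 matters, here is a minimal sketch that uses plain bytes rather than a live request; the sample string is an illustrative assumption, not content from the scraped page. In requests, response.content holds the raw bytes and response.text is those bytes decoded with response.encoding, so a wrong encoding leads to the same kind of garbling shown here.

```python
# Bytes as a server might send them for a UTF-8 page (sample text is made up).
raw = "百度贴吧".encode("utf-8")

# Decoding with the matching codec recovers the original text.
as_utf8 = raw.decode("utf-8")

# Decoding with a different codec produces garbled text;
# errors="replace" keeps invalid byte sequences from raising an exception.
as_gbk = raw.decode("gbk", errors="replace")

print(as_utf8)
print(as_gbk)
```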
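The BeautifulSoup module
1. Create a BeautifulSoup object from the HTML source:
  soup = BeautifulSoup(html_code, "html.parser")
  To parse a local HTML file, pass an open file object instead:
  soup = BeautifulSoup(open('index.html'), "html.parser")
2. Find the content you want to scrape, for example all links:
  links = soup.find_all('a')
  (findAll is the older name of the same method.) To get just the URL from each 'a' tag, use get() to pick the attribute you want:
  print(link.get('href'))
  To get the link's text:
  print(link.string)
3. Add further conditions to find_all(), for example selecting only img tags of a given class:
  imgs = soup.find_all('img', {'class' : "BDE_Image"})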
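
The calls above can be tried without any network access. The sketch below parses a small inline page; the HTML, the example.com URLs, and the "ad" class are made up for illustration. It also shows one caveat: .string is None when a tag contains more than one child, whereas get_text() always returns a string.

```python
from bs4 import BeautifulSoup

# A small inline page standing in for a downloaded one (contents are made up).
html_code = """
<html><body>
  <a href="https://example.com/a">First <b>link</b></a>
  <a href="https://example.com/b">Second link</a>
  <img class="BDE_Image" src="https://example.com/1.jpg">
  <img class="ad" src="https://example.com/ad.jpg">
</body></html>
"""

soup = BeautifulSoup(html_code, "html.parser")
links = soup.find_all("a")

# The href attribute of every <a> tag.
hrefs = [link.get("href") for link in links]

# .string is None for the first link because it has two children (text and <b>).
strings = [link.string for link in links]

# get_text() concatenates all text inside the tag, so it works for both links.
texts = [link.get_text() for link in links]

# Only <img> tags whose class is BDE_Image; the "ad" image is skipped.
srcs = [img.get("src") for img in soup.find_all("img", {"class": "BDE_Image"})]

print(hrefs, strings, texts, srcs)
```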
Downloading files from the web with Python
Use the urlretrieve() function from the urllib.request module.
urllib.request.urlretrieve(src, fileName)
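
As a hedged sketch of batch downloading with urlretrieve(), the helper below numbers the files and keeps each URL's own extension. The names download_all and dest_dir and the .jpg fallback are hypothetical choices for illustration, not part of the original program. A failed download is reported and skipped so that one bad URL does not stop the batch.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_all(urls, dest_dir):
    """Save each URL into dest_dir as <index><ext> and return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, src in enumerate(urls):
        # Reuse the extension from the URL path; fall back to .jpg (an assumption).
        ext = os.path.splitext(urlparse(src).path)[1] or ".jpg"
        file_name = os.path.join(dest_dir, str(index) + ext)
        try:
            urllib.request.urlretrieve(src, file_name)
        except OSError as err:  # URLError and HTTPError are subclasses of OSError
            print("skipped %s: %s" % (src, err))
            continue
        saved.append(file_name)
    return saved
```

Called as download_all(srcs, "images") with the src values collected by BeautifulSoup, it returns the list of paths that were actually written.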

My Code

"""
* 0008 && 0009 && 0013
  by VegB
  2017/1/26
"""

from bs4 import BeautifulSoup
import requests
import urllib.request

# The requests module handles HTTP requests (GET, POST, DELETE, PUT, and so on).
# requests.get() returns a Response object.

raw_url = "http://tieba.baidu.com/p/4945979003?see_lz=1&pn="
cnt = 0

for pageNum in range(1, 2):  # range(1, 2) covers page 1 only; raise the upper bound for more pages
    url = raw_url + str(pageNum)

    response = requests.get(url)
    response.encoding = "utf-8"
    # Baidu's pages may be detected as gb2312 or similar, and the Windows default encoding is gbk;
    # decoding as gbk garbles the text, so set utf-8 explicitly.
    # print(response.text)

    html_code = response.text  # the text attribute of the Response object holds the HTML source
    soup = BeautifulSoup(html_code, "html.parser")

    # websiteCode = urllib.request.urlopen(url).read()
    # soup = BeautifulSoup(websiteCode, "html.parser")  # build a BeautifulSoup object from the HTML, or from open('index.html')

    # Scrape links
    # (write them to a file?)
    """
    links = soup.find_all('a')
    for link_cnt, link in enumerate(links):
        print("LINK %d:" % link_cnt)
        print(link.get('href'))
        print(link.string)
    """

    # Download images
    imgs = soup.find_all('img', {'class': "BDE_Image"})  # filter by class name to skip ad images
    for img in imgs:
        src = img.get('src')
        print("IMAGE %d:" % cnt)
        print(src)
        fileName = str(cnt) + ".jpg"
        urllib.request.urlretrieve(src, fileName)
        cnt += 1


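# Extract the body text and write it to a file
url = "http://162.105.146.180:8130/"  # scraping a site I wrote myself, haha
response = requests.get(url)
response.encoding = 'utf-8'
html_code = response.text

soup = BeautifulSoup(html_code, "html.parser")
html_body = soup.find_all('body')
print(html_body)

# The original loop is cut off after "for li"; writing each body's text is an assumed completion.
with open('html_body.txt', 'w', encoding='utf-8') as fp:
    for body in html_body:
        fp.write(body.get_text())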
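
The body-text step (problem 0008) can also be sketched offline. The example below is one possible approach under stated assumptions, not the author's original loop: the inline HTML is made up, and using soup.body with get_text(separator, strip) is a choice made here for readable output.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page (contents are made up).
html_code = (
    "<html><head><title>t</title></head>"
    "<body><h1>Hello</h1><p>Body text.</p></body></html>"
)

soup = BeautifulSoup(html_code, "html.parser")

# soup.body is the first <body> tag, or None if the page has none.
body = soup.body
if body is not None:
    # Strip whitespace around each text fragment and join fragments with newlines.
    body_text = body.get_text(separator="\n", strip=True)
else:
    body_text = ""

# An explicit encoding keeps the output independent of the platform default.
with open("html_body.txt", "w", encoding="utf-8") as fp:
    fp.write(body_text)
```

The title is not included in the output because it sits in the head element rather than the body.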