Scraping images from Zhihu with Python 3 and saving them to a local folder and a database (tested and working)

Pinned post by Blogfish

Category: Python3. Tags: Python crawler, crawler, database

Article link: https://blog.csdn.net/wangjianhuahua/article/details/84991402

Language: Python 3.7

Database: MySQL

The following packages need to be imported:

from urllib.request import urlopen    # Note: a bare `import urllib` does not load the request submodule; import it explicitly
from bs4 import BeautifulSoup
import re
import time
import pymysql.cursors
import urllib.request

Writing to the database requires a table, created as follows:

CREATE TABLE imgtest(
    id INT PRIMARY KEY AUTO_INCREMENT,
    img LONGBLOB,     -- image field
    content LONGTEXT  -- text field
);

#============== Scrape images into a local folder ====================

url = "https://www.zhihu.com/question/22918070"
html = urllib.request.urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html,'html.parser')
print(soup.prettify())

# Use Beautiful Soup with a regular expression to collect every image tag (img tags with the given class whose src ends in .jpg)
links = soup.find_all('img',"origin_image zh-lightbox-thumb",src=re.compile(r'\.jpg$'))
print(links)

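# Set the folder to save images in (it must already exist); otherwise they are saved to the program's current directory
path = r'D:\pcongfile\images'            # the r prefix makes this a raw string, so backslashes are not treated as escapes
for link in links:
    print(link.attrs['src'])
    # Name each file with time.time(), the current timestamp, to avoid name clashes
    urllib.request.urlretrieve(link.attrs['src'],path+'\\%s.jpg' % time.time())  # urlretrieve downloads the remote resource straight to a local file
print('========== Images written to the local folder ==========')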
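
The download loop above can be made a little more robust. The following is a minimal sketch rather than the original author's code: `JPG_URL`, `build_request`, `target_path`, and `download_all` are hypothetical helper names, the User-Agent string is a placeholder, and whether the site rejects urllib's default User-Agent is an assumption. The pattern escapes the dot so that only a literal .jpg suffix matches (an unescaped `.` matches any character), and the file name combines the timestamp with a counter so two downloads in the same instant cannot collide.

```python
import os
import re
import time
import urllib.request

# Escaped dot: matches only a literal ".jpg" suffix (case-insensitive).
JPG_URL = re.compile(r'\.jpg$', re.IGNORECASE)


def build_request(url, user_agent="Mozilla/5.0 (placeholder)"):
    """Wrap a URL in a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def target_path(folder, index, timestamp=None):
    """Return folder/<timestamp>_<index>.jpg; the index keeps names unique."""
    if timestamp is None:
        timestamp = time.time()
    return os.path.join(folder, "%d_%03d.jpg" % (int(timestamp), index))


def download_all(urls, folder):
    """Download every .jpg URL in urls into folder and return the saved paths."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    saved = []
    jpg_urls = [u for u in urls if JPG_URL.search(u)]
    for index, url in enumerate(jpg_urls):
        path = target_path(folder, index)
        with urllib.request.urlopen(build_request(url)) as resp, open(path, "wb") as out:
            out.write(resp.read())
        saved.append(path)
    return saved
```

With the variables from the script above, this could be called as `download_all([link.attrs['src'] for link in links], path)`.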

#============== Scrape images into the local database ====================

# Request the URL and decode the response as UTF-8
resp = urlopen("https://www.zhihu.com/question/22918070").read().decode("utf-8")
# Specify a parser
soup = BeautifulSoup(resp,"html.parser")
for ListUrl in soup.find_all('img',"origin_image zh-lightbox-thumb",src=re.compile(r'\.jpg$')):
    if not re.search(r"\.(jpg|JPG)$",ListUrl["src"]): # print only when src does not end in jpg or JPG (never true given the filter above)
        print(ListUrl.string,"<-------->","https://www.zhihu.com/question/22918070"+ListUrl["src"])

    # Get a database connection (add charset='utf8mb4' if needed)
    conn = pymysql.connect(host='localhost', user='root', password='123', database='pytest',port=3306)
    try:
        with conn.cursor() as cursor:
            sql="insert into `imgtest`(`img`,`content`) values(%s,%s)"
            # Execute the SQL: store the downloaded image bytes in img and the source URL in content
            cursor.execute(sql,(urlopen(ListUrl["src"]).read(),ListUrl["src"]))
            conn.commit()  # commit
    finally:
        conn.close()
print('========== Images written to the database ==========')
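
The loop above opens and closes a new connection for every image. As a hedged alternative, the sketch below keeps one connection open and hands all rows to the driver in a single `executemany` call. `INSERT_SQL` and `save_rows` are hypothetical names, not part of the original post; the helper assumes a PyMySQL-style connection whose cursors work as context managers, and it targets the `imgtest` table created earlier.

```python
INSERT_SQL = "INSERT INTO `imgtest` (`img`, `content`) VALUES (%s, %s)"


def save_rows(conn, rows):
    """Insert (image_bytes, source_url) pairs through one open connection.

    conn is assumed to be a PyMySQL-style connection (hypothetical helper).
    Returns the number of rows handed to the driver.
    """
    rows = list(rows)
    if not rows:
        return 0  # nothing to insert, so no cursor and no commit
    with conn.cursor() as cursor:
        cursor.executemany(INSERT_SQL, rows)
    conn.commit()  # one commit for the whole batch
    return len(rows)
```

With PyMySQL, `conn` would come from `pymysql.connect(...)` with the same arguments as above, and the caller would close it in a `finally` block after `save_rows` returns.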

Screenshot of the resulting local folder: