Python Crawler: Scraping Courses from imooc (慕课网)

Latest recommended article published on 2021-09-08 20:18:49

allenxguo


Category: Python. Tags: python, regular expressions, image crawler, html

Article link: https://blog.csdn.net/gx864102252/article/details/72848359


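This post downloads images from the web with Python, using regular expressions to parse the HTML source (other, better methods will be added in later updates).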

Getting course images from imooc

Site link
http://www.imooc.com/search/?words=python
Figure 1. The site page

Getting the course images from the site
First, look at the page's HTML source.
Figure 2. HTML source

Figure 3. HTML source

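From the source we can see that the image link inside a tag looks like
http://szimg.mukewang.com/5859ed790001b9da05400300-360-202.jpg
so it is enough to find every link of this form and download the image behind it.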
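
To see how a regular expression can pick links of this form out of page source, here is a minimal sketch run against a hand-written HTML fragment, so it needs no network access. The fragment and the names `sample_html`, `JPG_LINK`, and `find_jpg_links` are assumptions made for illustration; only the first image URL comes from the page above. The pattern is non-greedy and excludes quotes and whitespace, so two links on the same line are not merged into one match.

```python
import re

# Hypothetical fragment modeled on the markup in the figures; only the first
# image URL is taken from the article, the rest is made up for the demo.
sample_html = (
    '<div class="course"><img src="http://szimg.mukewang.com/'
    '5859ed790001b9da05400300-360-202.jpg"><img src="http://example.com/b.jpg"></div>'
)

# http: followed by the shortest run of non-quote, non-space characters ending in .jpg
JPG_LINK = re.compile(r'http:[^"\s]+?\.jpg')

def find_jpg_links(html):
    """Return every http:...jpg link found in the given HTML text."""
    return JPG_LINK.findall(html)

links = find_jpg_links(sample_html)
print(links)
```

With a greedy pattern such as `http:.+\.jpg`, the same fragment would yield a single match running from the first `http:` to the last `.jpg`, which is not a usable URL.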

Python code

Environment: Python 3.6; IDE: PyCharm

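import re                      # regular-expression module
from urllib import request     # urllib.request fetches URL contents: it sends a GET
                               # request to the given page and returns the HTTP response

with request.urlopen('http://www.imooc.com/search/?words=python') as resp:
    buf = resp.read().decode('utf-8')

# Find links that start with http: and end with .jpg. The non-greedy [^"\s]+?
# keeps each match inside one attribute value; a greedy .+ could swallow
# everything between the first http: and the last .jpg on a line.
listurl = re.findall(r'http:[^"\s]+?\.jpg', buf)

for i, url in enumerate(listurl):                        # i is the file counter
    with request.urlopen(url) as img_resp:               # open this URL (the image link)
        data = img_resp.read()                           # read the image bytes
    # Save location; the E:/Temp/ directory must already exist
    with open('E:/Temp/' + str(i) + '.jpg', 'wb') as f:
        f.write(data)                                    # write the bytes to the file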
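
The introduction notes that better methods than regular expressions exist. As one possible alternative, not necessarily the one the author has in mind, the sketch below collects `<img>` sources with the standard library's `html.parser` and resolves them with `urljoin`. The class `ImgSrcCollector`, the function `collect_jpg_sources`, and the `demo` markup are hypothetical names and data introduced for this example; it runs on an in-memory string and does not contact the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag ending in .jpg, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        src = dict(attrs).get('src')
        if src and src.lower().endswith('.jpg'):
            # urljoin also resolves protocol-relative (//host/x.jpg) and relative paths
            self.sources.append(urljoin(self.base_url, src))

def collect_jpg_sources(html, base_url):
    """Parse html and return the absolute .jpg image URLs it references."""
    parser = ImgSrcCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.sources

# Made-up markup for the demo: one .jpg, one .png, and one <img> without src
demo = ('<img src="//szimg.mukewang.com/a-360-202.jpg">'
        '<img src="/static/logo.png"><img alt="no src">')
print(collect_jpg_sources(demo, 'http://www.imooc.com/search/?words=python'))
```

The returned list could be passed to the same download loop in place of `listurl`. Parsing tags instead of matching raw text avoids picking up `.jpg` strings that appear in scripts or comments.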