Web Scraping for Beginners 1

Latest recommended article published on 2024-07-27 12:20:46

mr_xinL

Category: Web Scraping. Tags: python

Original link: https://blog.csdn.net/sunon_/article/details/90634253?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task

Following an existing example to scrape images from Sina Photo.

import urllib.request
import re
import chardet
# Open the page and read its raw bytes
page = urllib.request.urlopen('http://photo.sina.com.cn/')  # open the page
htmlCode = page.read()  # fetch the page source as bytes
# print(chardet.detect(htmlCode))  # chardet.detect() returns a dict: 'encoding' is the guessed encoding, 'confidence' is how sure the guess is
# print(htmlCode.decode('utf-8'))  # bytes.decode() decodes with the given encoding (default 'utf-8') and returns a str

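# Save the page source to disk
pageFile = open(r'D:\MEITU\pageCode.txt', 'wb')  # open pageCode.txt for binary writing; the raw string keeps the backslashes literal
pageFile.write(htmlCode)  # write the bytes
pageFile.close()  # always close what you open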

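# Use a regular expression to find the images
data = htmlCode.decode('utf-8')
reg = r'src="(.+?\.jpg)"'  # capture any src value that ends in .jpg
reg_img = re.compile(reg)  # compile the pattern once so it can be reused
imglist = reg_img.findall(data)  # list of every captured URL
# for img in imglist:
#     print(img)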

# Download the images to a local folder
x = 0
for img in imglist:
    print(img)
    urllib.request.urlretrieve(img, r'D:\MEITU\PIG\%s.jpg' % x)  # save into the target folder
    x += 1    # this raised HTTP Error 502: Bad Gateway; a request header needs to be added

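# ----------------
# Copyright notice: this is an original article by CSDN blogger "sunon_", released under the CC 4.0 BY-SA license. When reposting, include the original source link and this notice.
# Original link: https://blog.csdn.net/sunon_/article/details/90634253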
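
The regex step in the listing hands every captured `src` value straight to the downloader. Pages often contain protocol-relative (`//host/a.jpg`) or site-relative (`/img/a.jpg`) addresses, which `urlretrieve` cannot fetch as written; whether Sina's page does so is an assumption here, not something the post confirms. The sketch below normalizes each match against the page URL with `urljoin` and drops duplicates. The helper name `extract_image_urls`, the tightened pattern (`[^"]` instead of `.` so a match cannot run across a closing quote), and the sample HTML are all hypothetical.

```python
import re
from urllib.parse import urljoin

# Tightened variant of the post's pattern: [^"] stops a match at the closing quote.
IMG_SRC = re.compile(r'src="([^"]+?\.jpg)"')


def extract_image_urls(html, base_url):
    """Return absolute, de-duplicated .jpg URLs in the order they appear."""
    seen = set()
    urls = []
    for src in IMG_SRC.findall(html):
        absolute = urljoin(base_url, src)  # resolves //host/a.jpg and /a.jpg forms
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Hypothetical snippet for illustration, not taken from the real page.
sample = (
    '<img src="//n.example.com/a.jpg">'
    '<img src="/img/b.jpg">'
    '<img src="http://n.example.com/c.jpg">'
    '<img src="//n.example.com/a.jpg">'
)
for url in extract_image_urls(sample, "http://photo.sina.com.cn/"):
    print(url)
```

With this in place, the download loop can iterate over the returned list instead of the raw `imglist`.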

Blogger: sunon_
Post: 第一个Python爬虫 (My First Python Crawler)
Original link: https://blog.csdn.net/sunon_/article/details/90634253?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task
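
The download loop's comment says the HTTP 502 error calls for a request header. `urlretrieve` takes no per-call headers argument, so one common approach is to build a `urllib.request.Request` carrying a browser-like `User-Agent` and read the response yourself. The following is a minimal sketch under that assumption; the header value and the helper names `build_request` and `download` are hypothetical, and it has not been verified that this header alone clears the 502 on this site.

```python
import urllib.request

# Hypothetical header value; any realistic browser User-Agent string would do.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}


def build_request(url, headers=None):
    """Return a urllib Request for `url` that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def download(url, dest_path, timeout=10):
    """Fetch `url` with the headers above and write the body to `dest_path`."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        body = resp.read()
    with open(dest_path, "wb") as f:  # the with-block closes the file for us
        f.write(body)
    return len(body)  # number of bytes written
```

In the loop from the listing, `download(img, r'D:\MEITU\PIG\%s.jpg' % x)` would take the place of the `urlretrieve` call.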
