Python网页爬虫简介:
有时候我们需要把一个网页的图片copy 下来。通常手工的方式是鼠标右键 save picture as ...
python 网页爬虫可以一次性把所有图片copy 下来。
步骤如下:
1. 读取要爬虫的html
2. 对爬下来的html 进行存储并处理:存储原始html
过滤生成list
正则匹配出picture的链接
3. 根据链接保存图片到本地
主要的难点:熟悉urllib ,
正则匹配查找图片链接
代码如下:import urllib.request
import os
import redef getHtml(url): #get html
page = urllib.request.urlopen(url)
html = page.read()
return html
def write(html, htmlfile): #write html into a file name html.txt
try:
f = open(htmlfile, mode='w')
f.writelines(str(html))
f.close()
except TypeError:
print ("write html file failed")def getImg2(html, initialFile, finalFile):
reg = '"*' #split string html with " and write in file name re.txt
imgre1 = re.compile(reg)
imglist = re.split(imgre1, str(html))
f1 = open(initialFile, mode='w')
for index in imglist:
f1.write("\n")
f1.write(index)
f1.close
reg2 = "^https.*jpg" # match items start with "https" and ends with "jpg"
imgre2 = re.compile(reg2)
f2 = open(initialFile, mode='r')
f3 = open(finalFile, mode='w')
tempre = f2.readlines()
for index in tempre:
temp = re.match(imgre2,index)
if temp != None:
f3.write(index)
#f3.write("\n")
f2.close()
f3.close()def saveImg2(p_w_picpathfile): #save p_w_picpath
f_imglist2 = open(p_w_picpathfile, mode='r')
templist = f_imglist2.readlines()
x = 0
for index in templist:
urllib.request.urlretrieve(index,'%s.jpg' %x)
x = x + 1html = "https://p_w_picpath.baidu.com/search/index?tn=baidup_w_picpath&ct=201326592&lm=-1&cl=2&ie=gbk&word=%BA%FB%B5%FB&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111"
htmlfile = "D:\\New\\html.txt"
SplitFile = "D:\\New\\re.txt"
imgefile = "D:\\New\\imglist.txt"html = getHtml(html)
print("get html complete!")
getImg2(html, SplitFile, imgefile)
print("get Image link list complete! ")
saveImg2(imgefile)
print("Save Image complete!")