A Small Python Crawler (Applying Regular Expressions)

fs906642458

Tags: python

Original article: http://pmghong.blog.51cto.com/3221425/1334086

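Goal: use regular expressions to write a crawler that downloads every image on a web page.

Approach

1. Fetch the page source

2. Extract the image URLs

3. Download the images

Step 1: open the URL and fetch the source (the listings in this article are Python 2)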

[root@node1 python]# mkdir image
[root@node1 python]# cd image
[root@node1 python]# vim getHtml.py
#!/usr/bin/python
import re
import urllib

def getHtml(url):
    # Open the URL and return the page source as a string
    html = urllib.urlopen(url)
    scode = html.read()
    return scode

print getHtml('http://tieba.baidu.com/p/1762577651')
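
The listing above targets Python 2, where `urllib.urlopen` exists. On Python 3 the equivalent function lives in `urllib.request`. What follows is a minimal sketch of the same step, assuming Python 3; the name `get_html`, the `timeout` parameter, and the UTF-8 fallback are illustrative choices and not part of the original article.

```python
from urllib.request import urlopen


def get_html(url, timeout=10):
    """Return the body found at `url` as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the response headers when there is one.
        # Falling back to UTF-8 is an assumption; the page may use another encoding.
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps the sketch from failing on stray bytes.
    return raw.decode(charset, errors="replace")
```

Calling `get_html('http://tieba.baidu.com/p/1762577651')` would then play the role of `getHtml` in the listing above, returning a `str` rather than raw bytes.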

Step 2: extract the image URLs (regex matching)

Study how the image URLs are built in the fetched source, then pull them out with a regular expression.

In the page source, an image tag looks like this:

<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=2e8f3ca53af33a879e6d0012f65d1018/4ece3bc79f3df8dc2ab63004cd11728b46102899.jpg" width="560" height="400" changedsize="true">

What we want to capture is the http://imgsrc.baidu.com/xxxxxxx.jpg part.

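#!/usr/bin/python
import re
import urllib

def getHtml(url):
    # Open the URL and return the page source as a string
    html = urllib.urlopen(url)
    scode = html.read()
    return scode

def getImage(source):
    # The pattern must not be named "re", or it would shadow the re module
    pattern = r'src="(.*?\.jpg)" width='
    imgre = re.compile(pattern)
    # findall returns only the captured group, i.e. the image URL
    images = re.findall(imgre, source)
    return images

source = getHtml('http://tieba.baidu.com/p/1762577651')
print getImage(source)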
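
To see what the pattern captures without touching the network, it can be run against a tag shaped like the sample quoted above. This is a minimal sketch in Python 3 syntax; `IMG_PATTERN`, `get_image_urls`, and `sample` are hypothetical names introduced here for illustration.

```python
import re

# Non-greedy capture: everything between src=" and the first .jpg" that is
# directly followed by a width attribute (the same pattern as in the article).
IMG_PATTERN = re.compile(r'src="(.*?\.jpg)" width=')

# A tag shaped like the article's sample, split only for readability.
sample = (
    '<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/'
    'sign=2e8f3ca53af33a879e6d0012f65d1018/'
    '4ece3bc79f3df8dc2ab63004cd11728b46102899.jpg" '
    'width="560" height="400" changedsize="true">'
)


def get_image_urls(source):
    """Return every captured image URL found in `source`."""
    return IMG_PATTERN.findall(source)


urls = get_image_urls(sample)
print(urls)
```

The pattern is deliberately narrow: it only matches `.jpg` sources whose `src` attribute is immediately followed by ` width=`, so images in other formats, or tags with a different attribute order, are skipped.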

Step 3: download the images

The previous step already collected the image URLs in a list, so all that remains is to loop over that list.

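#!/usr/bin/python
import re
import urllib

def getHtml(url):
    # Open the URL and return the page source as a string
    html = urllib.urlopen(url)
    scode = html.read()
    return scode

def getImage(source):
    # The pattern must not be named "re", or it would shadow the re module
    pattern = r'src="(.*?\.jpg)" width='
    imgre = re.compile(pattern)
    images = re.findall(imgre, source)
    for i in images:
        # Every image is saved under the same file name
        urllib.urlretrieve(i, '1.jpg')

source = getHtml('http://tieba.baidu.com/p/1762577651')
getImage(source)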

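This has a problem, though: every image is saved as 1.jpg, so each download overwrites the previous one and only a single image survives. One more step is needed: give each image its own name.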

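#!/usr/bin/python
import re
import urllib

def getHtml(url):
    # Open the URL and return the page source as a string
    html = urllib.urlopen(url)
    scode = html.read()
    return scode

def getImage(source):
    # The pattern must not be named "re", or it would shadow the re module
    pattern = r'src="(.*?\.jpg)" width='
    imgre = re.compile(pattern)
    images = re.findall(imgre, source)
    # Number the files 1.jpg, 2.jpg, ... so that no download overwrites another
    x = 1
    for i in images:
        urllib.urlretrieve(i, '%s.jpg' % x)
        x += 1

source = getHtml('http://tieba.baidu.com/p/1762577651')
getImage(source)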

Result:

[root@node1 image]# python getHtml.py
[root@node1 image]# ls
11.jpg  13.jpg  15.jpg  17.jpg  19.jpg  20.jpg  3.jpg  5.jpg  7.jpg  9.jpg  10.jpg
12.jpg  14.jpg  16.jpg  18.jpg  1.jpg   2.jpg   4.jpg  6.jpg  8.jpg  getHtml.py
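
In the listing above the numbered files do not appear in numeric order, because the names are compared as text. One possible refinement is to zero-pad the numbers. The following is a minimal sketch of the download loop, assuming Python 3; `numbered_names` and `download_images` are hypothetical helpers, and the `dest_dir` parameter and the 10-second timeout are assumptions added for illustration.

```python
import os
from urllib.request import urlopen


def numbered_names(count, ext=".jpg"):
    """Return `count` names such as 01.jpg ... 20.jpg, padded to equal width."""
    width = len(str(count))
    return [str(i).zfill(width) + ext for i in range(1, count + 1)]


def download_images(urls, dest_dir="."):
    """Save each URL under its own numbered name and return the paths written."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url, name in zip(urls, numbered_names(len(urls))):
        path = os.path.join(dest_dir, name)
        # Read the whole response and write it out as binary data.
        with urlopen(url, timeout=10) as response, open(path, "wb") as out:
            out.write(response.read())
        saved.append(path)
    return saved
```

With 20 images this would produce 01.jpg through 20.jpg, which sort the same way as text and as numbers. The sketch reads each image fully into memory, which is fine for small pictures but would need streaming for large files.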
