Writing a Simple Web Crawler in Python on Linux

blog_liuliang

Article link: https://blog.csdn.net/blog_liuliang/article/details/51508668

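The basic principle of this crawler, in brief:

HTTP request: fetch the page content from a starting URL
Regular expression: extract the desired information from the page with a regular expression
Save locally: download the extracted resources to the local machine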

HTTP request

geturl.py

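#coding=utf-8
# Python 2 script: urllib.urlopen() and the print statement are Python 2 only.
import urllib

def getHtml(url):
    # Open the URL and return the raw page source as a string.
    page = urllib.urlopen(url)
    html = page.read()
    return html

html = getHtml("http://tieba.baidu.com/p/2738151262")

print html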

Create a new file named geturl.py and define a getHtml() function in it to fetch the page content.
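The listing above is written for Python 2. As a minimal sketch, assuming Python 3 (where urlopen lives in urllib.request and read() returns bytes rather than a string), the same function might look like the following. The name get_html and the UTF-8 fallback are illustrative assumptions, not part of the original script.

```python
from urllib.request import urlopen


def get_html(url):
    """Fetch a URL and return its body decoded to text (Python 3)."""
    with urlopen(url) as page:
        raw = page.read()  # bytes in Python 3
        # Use the charset declared in the response headers, if any;
        # falling back to UTF-8 is an assumption made for this sketch.
        charset = page.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling get_html("http://tieba.baidu.com/p/2738151262") would then return the page source as a str, which can be passed to print().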

Regular expressions

Use a regular expression to extract the content you want:

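import re
import urllib

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Capture the src value of every .jpg image that is followed by a pic_ext attribute.
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    # findall() returns the captured group of every match as a list of strings.
    imglist = re.findall(imgre,html)
    return imglist

html = getHtml("http://tieba.baidu.com/p/2460150866")
print getImg(html)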

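Regular expressions:

Optional items

Adding a question mark after a subpattern makes it optional: it may appear in the matched string, but it is not required.

r'(http://)?(www\.)?python\.org'

matches only the following strings:

'http://www.python.org'

'http://python.org'

'www.python.org'

'python.org'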
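The optional-group behaviour can be checked directly. The following is a small sketch, assuming Python 3; the candidates list and the extra 'https://python.org' entry are made up for illustration. It uses re.fullmatch() so that the whole string has to match the pattern.

```python
import re

# Both groups are optional; the dots are escaped so they match a literal ".".
pattern = re.compile(r'(http://)?(www\.)?python\.org')

candidates = [
    'http://www.python.org',
    'http://python.org',
    'www.python.org',
    'python.org',
    'https://python.org',  # illustrative non-match: "https://" is not in the pattern
]

# fullmatch() succeeds only if the entire string matches the pattern.
results = {s: pattern.fullmatch(s) is not None for s in candidates}
for s, ok in results.items():
    print(s, ok)
```

The first four strings report True and the last one False.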

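Repeating subpatterns

(pattern)* : the pattern may repeat zero or more times

(pattern)+ : the pattern may repeat one or more times

(pattern){m,n} : the pattern may repeat between m and n times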
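As a quick illustration of the three repetition operators, here is a sketch assuming Python 3. The helper name matches and the subpattern (ab) are arbitrary choices for the example.

```python
import re


def matches(regex, text):
    """Return True if the whole of text matches regex."""
    return re.fullmatch(regex, text) is not None


# (ab)* accepts zero or more repetitions, so the empty string matches.
star_empty = matches(r'(ab)*', '')
# (ab)+ needs at least one repetition, so the empty string does not match.
plus_empty = matches(r'(ab)+', '')
plus_twice = matches(r'(ab)+', 'abab')
# (ab){2,3} accepts two or three repetitions only.
range_one = matches(r'(ab){2,3}', 'ab')
range_two = matches(r'(ab){2,3}', 'abab')
range_four = matches(r'(ab){2,3}', 'abababab')

print(star_empty, plus_empty, plus_twice, range_one, range_two, range_four)
```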

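We also created a getImg() function, which filters the image links we want out of the whole fetched page. The re module provides Python's regular expression support:

re.compile() compiles a regular expression into a regular expression object.

re.findall() returns every piece of data in html that matches imgre (the compiled regular expression).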

Resulting image URLs:
(screenshot of the printed list of image URLs)
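To see what the pattern in getImg() captures without fetching a page, it can be run against a small string. This is a sketch assuming Python 3; sample_html and its example.com URLs are made up for the example and are not taken from the real page.

```python
import re

# Hypothetical page fragment, one tag per line, invented for this sketch.
sample_html = '\n'.join([
    '<img src="http://example.com/a.jpg" pic_ext="jpeg">',
    '<img src="http://example.com/logo.png">',
    '<img src="http://example.com/b.jpg" pic_ext="jpeg">',
])

# Same pattern as in getImg(): the lazy .+? stops at the first ".jpg"
# that is immediately followed by '" pic_ext'.
imgre = re.compile(r'src="(.+?\.jpg)" pic_ext')

# With one capturing group, findall() returns just the captured URLs.
img_urls = imgre.findall(sample_html)
print(img_urls)
```

Here only the two .jpg URLs are returned. Note that . does not match a newline by default, which is why the .png tag on its own line is skipped; if several tags shared one line, .+? could run past a closing quote, so a stricter character class such as [^"]+? may be a safer choice.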

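Saving the images locally

This step mainly uses the urllib.urlretrieve() method, which downloads remote data to the local machine.

A for loop iterates over the image URLs and saves each one under a sequential file name (0.jpg, 1.jpg, and so on).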

#coding=utf-8
import urllib
import re

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    x = 0
    for imgurl in imglist:
        # Download each image into the current directory as 0.jpg, 1.jpg, ...
        urllib.urlretrieve(imgurl,'%s.jpg' % x)
        x+=1
    # Return the URL list so the print below shows it instead of None.
    return imglist


html = getHtml("http://tieba.baidu.com/p/2460150866")

print getImg(html)

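The downloaded images are saved in the directory the program is run from (the current working directory).
(screenshot of the saved image files)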
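In Python 3 the download helper moved to urllib.request.urlretrieve. The following is a minimal sketch of the same loop under that assumption; the function name save_images and the dest_dir parameter are hypothetical additions for illustration.

```python
import os
from urllib.request import urlretrieve


def save_images(img_urls, dest_dir="."):
    """Download each URL into dest_dir as 0.jpg, 1.jpg, ... and return the paths."""
    saved = []
    for index, img_url in enumerate(img_urls):
        target = os.path.join(dest_dir, "%s.jpg" % index)
        urlretrieve(img_url, target)  # copies the remote data to the local file
        saved.append(target)
    return saved
```

Passing the list returned by getImg() to save_images() would reproduce the original loop, with enumerate() replacing the manual x counter.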

Haha, that's it for this simple Python crawler.
