Python Web Scraping: Fetching a Web Page

Latest recommended article published 2023-10-13 13:58:37

红金龙-时光


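The code in this post comes from the article "Python爬虫抓取网页图片" (a Python crawler that downloads images from a web page).
Readers are encouraged to follow the link and read the original. I ran the code without problems using Python 2.7 in PyCharm.
This post is written in Markdown.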

# -*- coding: utf-8 -*-
import urllib
import re
import time
import os

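# Show download progress (used as the reporthook for urllib.urlretrieve)
def schedule(a,b,c):
  '''
  a: number of data blocks downloaded so far
  b: size of each data block
  c: size of the remote file
  '''
  if c <= 0:   # size unknown (urlretrieve may pass -1), so no percentage
    return
  per = 100.0 * a * b / c
  if per > 100 :
    per = 100
  print '%.2f%%' % per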

def getHtml(url):
  page = urllib.urlopen(url)
  html = page.read()
  return html

def downloadImg(html):
  # Match .jpg URLs whose src attribute is followed by Tieba's pic_ext attribute
  reg = r'src="(.+?\.jpg)" pic_ext'
  imgre = re.compile(reg)
  imglist = re.findall(imgre, html)
  # Name the folder after today's date, e.g. 2014-5-13
  t = time.localtime(time.time())
  foldername = '%d-%d-%d' % (t.tm_year, t.tm_mon, t.tm_mday)
  picpath = 'D:\\ImageDownload\\%s' % (foldername) # local download directory

  if not os.path.exists(picpath):   # create the directory if it does not exist
    os.makedirs(picpath)
  x = 0
  image = None   # stays None when the page has no matching images
  for imgurl in imglist:
    target = picpath+'\\%s.jpg' % x
    print 'Downloading image to location: ' + target + '\nurl=' + imgurl
    image = urllib.urlretrieve(imgurl, target, schedule)
    x += 1
  return image



if __name__ == '__main__':
  print '''         *************************************
      **      Welcome to use Spider   **
      **     Created on  2014-05-13   **
      **       @author: cruise         **
      *************************************'''

  html = getHtml("http://tieba.baidu.com/p/2460150866")

  downloadImg(html)
  print "Download has finished."

This is a useful piece of code that I found by searching Baidu. I tested it and it works as expected.
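The listing targets Python 2.7. In Python 3, `urllib.urlopen` and `urllib.urlretrieve` are found in `urllib.request`, and `print` is a function. The following is a minimal Python 3 sketch of the fetch and progress pieces, not the original author's code. The names `progress_percent`, `report`, `fetch_html`, and `download` are hypothetical, and decoding the page as UTF-8 is an assumption about the target site.

```python
import urllib.request


def progress_percent(blocks_done, block_size, total_size):
    """Return download progress as a percentage capped at 100.

    urlretrieve can report a total size of -1 when the size is unknown;
    in that case no percentage can be computed and None is returned.
    """
    if total_size <= 0:
        return None
    return min(100.0, 100.0 * blocks_done * block_size / total_size)


def report(blocks_done, block_size, total_size):
    """Reporthook for urlretrieve: print the current percentage, if known."""
    percent = progress_percent(blocks_done, block_size, total_size)
    if percent is not None:
        print("%.2f%%" % percent)


def fetch_html(url):
    """Download a page and return it as text.

    Assumption: the page is UTF-8; undecodable bytes are replaced.
    """
    with urllib.request.urlopen(url) as page:
        return page.read().decode("utf-8", errors="replace")


def download(url, target):
    """Save url to the local path target, printing progress as it goes."""
    return urllib.request.urlretrieve(url, target, report)
```

The percentage is capped because the last block is usually only partly filled, so `blocks_done * block_size` can overshoot the real file size.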
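Obviously, I am not particularly interested in the images on Baidu Tieba. The Sina Blog home page often features a blog with photos of girls, so I have set my sights on that one. I will modify the code step by step and learn as I go.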
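Pointing the script at a different site mostly means changing the regular expression, because the `pic_ext` attribute in the pattern is specific to Tieba's markup. The sketch below (Python 3) makes the pattern a parameter. `GENERIC_PATTERN`, `extract_image_urls`, and `SAMPLE_HTML` are hypothetical names and data for illustration. The actual Sina Blog markup is not known here, so the pattern would need to be checked against the real page source.

```python
import re

# The pattern from the listing: relies on Tieba's pic_ext attribute.
TIEBA_PATTERN = r'src="(.+?\.jpg)" pic_ext'

# A looser pattern (an assumption, not taken from any real site):
# any <img> tag whose double-quoted src ends in .jpg.
GENERIC_PATTERN = r'<img[^>]+src="([^"]+?\.jpg)"'


def extract_image_urls(html, pattern=GENERIC_PATTERN):
    """Return the .jpg URLs matched by pattern, in page order, without repeats."""
    seen = set()
    urls = []
    for url in re.findall(pattern, html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up markup used only to exercise the function.
SAMPLE_HTML = (
    '<div><img class="BDE_Image" src="http://example.com/a.jpg" pic_ext="jpeg">'
    '<img alt="x" src="http://example.com/b.jpg">'
    '<img src="http://example.com/a.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/logo.png"></div>'
)

for url in extract_image_urls(SAMPLE_HTML):
    print(url)
```

Restricting the capture group to `[^"]+?` keeps a match inside one attribute value. The original `.+?` can run across several tags when an image without `pic_ext` comes before one that has it.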
