Scraping vczh's Blog

My network connection is sometimes unreliable, so I decided to just scrape vczh's blog and read it offline. I previously read his articles on regular-expression engines and wrote a simple regex engine of my own, which taught me a lot. Once you have a regex tool, building a configurable lexer becomes easy; then you start thinking you might as well write a compiler, which means building a parser, then code generation, then designing a virtual-machine instruction set so your code can actually run. Step by step, it all builds your skills. I honestly think the articles on vczh's blog are of very high quality. One of my goals is to work through as many of the projects he has written about as I can. I don't need to reach his level; as long as I understand how each thing is built and absorb the ideas behind it, that's enough for me.

I am learning Python as I go. This little program isn't hard, just tedious.

# -*- coding: UTF-8 -*-
import shutil
import urllib2
import re
import os

def getCss():
    "download the css files referenced by an article page"
    response = urllib2.urlopen("http://www.cppblog.com/vczh/archive/2008/04/21/47719.html")
    res = response.read()
    pattern = re.compile(r'\"\S*\.css\"')
    s = re.findall(pattern, res)
    for css in s:
        css = css[1:-1]                       # strip the surrounding quotes
        if css.find("http://") == -1:         # relative path: prefix the site root
            cssfullpath = "http://www.cppblog.com" + css
        else:
            cssfullpath = css
        response = urllib2.urlopen(cssfullpath)
        res = response.read()
        pattern = re.compile(r'\w*\.css')
        filename = re.findall(pattern, css)
        print "Saving css file:", filename[0]
        fo = open(filename[0], "wb")
        fo.write(res)
        fo.close()
    return

def signleFileHandle(url,dirname):
    htmpattern = re.compile(r'\w*\.html')
    result = re.findall(htmpattern,url)
    htmlname = result[0]
    if(os.path.isfile(dirname+'/'+htmlname)):
        print "File exist:",dirname+'/'+htmlname
        return
    response = urllib2.urlopen(url)
    res = response.read()
    pattern = re.compile(r'(<div class = \"post\">[\d\D]*)<div class = \"postDesc\">')
    a = re.findall(pattern,res)
    if a == []:
        print "Failed to extract body:", url
        return
    #download imgsrc
    pattern = re.compile(r'src=\"http://(.+?\.jpg|.+?\.png)\"')
    pic = re.findall(pattern,a[0])
    if pic == []:
        print "No pictures"
    maddsrc = a[0]
    for item in pic:
        item = "http://" + item
        print item
        imgpattern = re.compile(r'\w*\.jpg|\w*\.png')
        imgname = re.findall(imgpattern,item)
        imgpath=dirname+'/'+imgname[0]
        if(os.path.isfile(imgpath)==False):
            f = urllib2.urlopen(item)
            data = f.read()
            with open(imgpath, "wb") as code:
                code.write(data)
            maddsrc = maddsrc.replace(item,imgname[0])
            print "Download picture:", imgpath
        else:
            print "Img exist:"
    #gen html
    fo = open("head.html", "r")
    head = fo.read()
    tail = "</div></body></html>"
    maddsrc = head + maddsrc + tail
    fo.close()
    filename = dirname+'/'+htmlname
    if(os.path.isfile(filename)==False):
        fo = open(filename,"wb")
        fo.write(maddsrc)
        fo.close()
        print "Generate article :",filename
    else:
        print "File exist:"
    return

def getCategory(url):
    "handle one category page"
    htmpattern = re.compile(r'\w*\.html')
    result = re.findall(htmpattern,url)
    htmlname = result[0]

    url=url+"?Show=All"
    response = urllib2.urlopen(url)
    res = response.read()
    pattern = re.compile(r'(<div class=\"entrylist\">[\d\D]*)<div class=\"entrylistItemPostDesc\">')
    a = re.findall(pattern,res)
    if(a==[]):
        print "Failed:",url
        return
    fo = open("head.html", "r")
    head = fo.read()
    tail = "</div></body></html>"
    maddsrc = head + a[0] + tail
    fo.close()


    filename = htmlname
    print "\n\n======= Generate Category html:",filename
    #get the essay url
    pattern = re.compile(r'href=\"(http://www.cppblog.com/vczh/archive/\w*/\w*/\w*/\w*\.html)\"')
    res = re.findall(pattern,maddsrc)
    for item in res:
        pattern = re.compile(r'http://www.cppblog.com/vczh/archive/\w*/\w*/\w*/(.+?\.html)')
        crec = re.findall(pattern,item)
        dirname = filename[:-5]              # drop the ".html" suffix
        if(os.path.isdir(dirname)==False):
            os.mkdir(dirname)
            print "Make directory:",dirname
        maddsrc = maddsrc.replace(item,dirname+'/'+crec[0])
        fo = open(filename,"wb")
        fo.write(maddsrc)
        fo.close()
        signleFileHandle(item, dirname)
    return

## main logic
def getHtml():
    "walk every category of the blog"
    response = urllib2.urlopen("http://www.cppblog.com/vczh/default.html?page=1&OnlyTitle=1")
    res = response.read()
    pattern = re.compile(r'(http://www.cppblog.com/vczh/category/\d*\.html)\"')
    category = re.findall(pattern,res)
    for item in category:
        getCategory(item)
    return


def genIndexfile():
    "generate the index page"
    fo = urllib2.urlopen("http://www.cppblog.com/vczh/default.html?page=1&OnlyTitle=1")
    src = fo.read()
    pattern = re.compile(r'href="http://www.cppblog.com/vczh/category/(.*?\.html">)')
    clist = re.findall(pattern,src)
    print clist
    pattern = re.compile(r'href="http://www.cppblog.com/vczh/category/\d*\.html">(.+?)</a>')
    nlist = re.findall(pattern,src)
    fo.close()
    fo = open("head.html", "r")
    head = fo.read()
    fo.close()
    fo = open("index.html", "wb")
    tail = "</div></body></html>"
    for i in range(0, len(nlist)):
        # clist[i] already ends with '.html">', so appending the name closes the tag
        head = head + "<li><a href=\"" + clist[i] + nlist[i] + "</a></li>"
    head = head + tail
    fo.write(head)
    fo.close()
    return


getCss()
getHtml()
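
As a quick sanity check of the regex approach, here is the css-link handling from getCss run against a hypothetical HTML snippet rather than the live site (the snippet is made up, and it is written with Python 3's print for convenience):

```python
import re

# Self-contained check of the css-link regexes used in getCss,
# against a hypothetical HTML snippet.
html = ('<link rel="stylesheet" href="/css/common.css">'
        '<link rel="stylesheet" href="http://www.cppblog.com/skins/style.css">')

# Step 1: every quoted path ending in .css, with the quotes stripped
paths = [m[1:-1] for m in re.findall(r'\"\S*\.css\"', html)]

# Step 2: prefix relative paths with the site root
urls = [p if p.startswith("http://") else "http://www.cppblog.com" + p
        for p in paths]

# Step 3: the bare file name used when saving the local copy
names = [re.findall(r'\w*\.css', p)[0] for p in paths]

print(urls)
print(names)
```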

Running this code generates, in the current directory, one html file per category together with a folder of the same name; inside each folder it downloads every article in that category, strips the surrounding markup, and keeps only the body. For the front page, just adjust the contents of genIndexfile() yourself.
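
The "keep only the body" rewrite works because each image URL in the article is swapped for its bare file name, which is what signleFileHandle saves to disk. A minimal sketch of that step, against a made-up article fragment (Python 3 syntax):

```python
import re

# Hypothetical article fragment; mirrors the image handling in signleFileHandle.
body = '<p><img src="http://images.cppblog.com/vczh/demo.png" /></p>'

# Capture the URL tail of every jpg/png image source
pics = re.findall(r'src=\"http://(.+?\.jpg|.+?\.png)\"', body)
local = body
for p in pics:
    url = "http://" + p
    name = re.findall(r'\w*\.jpg|\w*\.png', url)[0]   # bare file name on disk
    local = local.replace(url, name)                  # point the tag at the local copy

print(local)
```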

The head.html it uses I wrote like this, just to keep things simple.

<html>
<head id="Head"><title>
</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><link type="text/css" rel="stylesheet" href="common.css" /><link id="MainCss" type="text/css" rel="stylesheet" href="style.css" /></head>
<body>

Then the css files in the current directory have to be copied into every generated folder, because I only noticed after finishing that I hadn't fixed the css paths in head.html, so this is my after-the-fact patch. Save the code below as mvcss.sh, then run "chmod 777 mvcss.sh && ./mvcss.sh".

#!/bin/bash
for dirname in `ls -F|grep "/"`
do
    echo $dirname
    cp *.css $dirname
    ls -a $dirname|grep css
done
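
Incidentally, the script imports shutil without ever using it; the same patch mvcss.sh applies can be written in Python with that import. A sketch, assuming the .css files and the generated folders sit in the same directory (the function name is mine):

```python
import os
import shutil

def copy_css_to_subdirs(root="."):
    """Copy every .css file in root into each immediate subdirectory."""
    cssfiles = [f for f in os.listdir(root) if f.endswith(".css")]
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            for css in cssfiles:
                shutil.copy(os.path.join(root, css), path)
```

Run it from the directory holding the generated folders: copy_css_to_subdirs().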

Directory screenshot: (image not preserved)
