Scraping vczh's (轮子哥) Blog
Sometimes my network connection isn't great, so I decided to just crawl 轮子哥's (vczh's) blog and read it offline. A while back I read his articles on regular-expression engines and wrote a simple regex engine myself, and I learned a lot from it. Once you have a regex tool, building a configurable lexer becomes very convenient; then you think you might as well build a compiler, so you have to write a parser, then do code generation, then design a virtual-machine instruction set so your code can actually run. Every step pushes your skills a bit further. I honestly think the quality of his blog posts is very high. One wish of mine is to work my way through as many of the projects he has written about as I can. I don't need to reach his level; I just want to understand how each thing is built, and learning all the ideas that go into building it is enough for me.
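As a rough illustration of the regex-to-lexer step mentioned above, here is a minimal sketch of a table-driven lexer built on Python's re module (my own toy example, not vczh's code; the token table and names are made up): each token kind is just a named regex, so editing the table reconfigures the lexer.

# -*- coding: UTF-8 -*-
import re

# the token table is plain data: (token name, regex); editing it reconfigures the lexer
TOKENS = [
    ("NUMBER", r'\d+'),
    ("IDENT",  r'[A-Za-z_]\w*'),
    ("OP",     r'[+\-*/=]'),
    ("SPACE",  r'\s+'),
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % (name, rx) for name, rx in TOKENS))

def tokenize(text):
    "yield (token name, lexeme) pairs, skipping whitespace"
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at %d" % pos)
        pos = m.end()
        if m.lastgroup != "SPACE":
            yield m.lastgroup, m.group()

for tok in tokenize("width = oldWidth + 42"):
    print tok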
As for the scraper itself, I picked up Python as I went; this little program isn't hard, just tedious.
# -*- coding: UTF-8 -*-
import shutil
import urllib2
import re
import os

def getCss():
    "download the css files referenced by a sample article page"
    response = urllib2.urlopen("http://www.cppblog.com/vczh/archive/2008/04/21/47719.html")
    res = response.read()
    pattern = re.compile(r'\"\S*\.css\"')
    s = re.findall(pattern, res)
    for css in s:
        # strip the surrounding quotes from the matched path
        n = len(css) - 1
        css = css[1:n]
        if css.find("http://") == -1:
            cssfullpath = "http://www.cppblog.com" + css
        else:
            cssfullpath = css
        response = urllib2.urlopen(cssfullpath)
        res = response.read()
        pattern = re.compile(r'\w*\.css')
        filename = re.findall(pattern, css)
        print "Download css file:", filename[0]
        fo = open(filename[0], "wb")
        fo.write(res)
        fo.close()
    return

def signleFileHandle(url, dirname):
    "download one article and keep only the post body"
    htmpattern = re.compile(r'\w*\.html')
    result = re.findall(htmpattern, url)
    htmlname = result[0]
    if os.path.isfile(dirname + '/' + htmlname):
        print "File exist:", dirname + '/' + htmlname
        return
    response = urllib2.urlopen(url)
    res = response.read()
    # keep everything from <div class = "post"> up to <div class = "postDesc">
    pattern = re.compile(r'(<div class = \"post\">[\d\D]*)<div class = \"postDesc\">')
    a = re.findall(pattern, res)
    # download the images referenced by the post
    pattern = re.compile(r'src=\"http://(.+?\.jpg|.+?\.png)\"')
    pic = re.findall(pattern, a[0])
    if pic == []:
        print "No pictures:"
    maddsrc = a[0]
    for item in pic:
        item = "http://" + item
        print item
        imgpattern = re.compile(r'\w*\.jpg|\w*\.png')
        imgname = re.findall(imgpattern, item)
        imgpath = dirname + '/' + imgname[0]
        if os.path.isfile(imgpath) == False:
            f = urllib2.urlopen(item)
            data = f.read()
            with open(imgpath, "wb") as code:
                code.write(data)
            print "Download picture:", imgpath
        else:
            print "Img exist:"
        # point the article at the local copy whether it was just downloaded or already on disk
        maddsrc = maddsrc.replace(item, imgname[0])
    # gen html: wrap the post body with the local head and a closing tail
    fo = open("head.html", "r")
    head = fo.read()
    fo.close()
    tail = "</div></body></html>"
    maddsrc = head + maddsrc + tail
    filename = dirname + '/' + htmlname
    if os.path.isfile(filename) == False:
        fo = open(filename, "wb")
        fo.write(maddsrc)
        fo.close()
        print "Generate article :", filename
    else:
        print "File exist:"
    return

def getCategory(url):
    "generate a local page for one category and fetch its articles"
    htmpattern = re.compile(r'\w*\.html')
    result = re.findall(htmpattern, url)
    htmlname = result[0]
    url = url + "?Show=All"
    response = urllib2.urlopen(url)
    res = response.read()
    pattern = re.compile(r'(<div class=\"entrylist\">[\d\D]*)<div class=\"entrylistItemPostDesc\">')
    a = re.findall(pattern, res)
    if a == []:
        print "Failed:", url
        return
    fo = open("head.html", "r")
    head = fo.read()
    fo.close()
    tail = "</div></body></html>"
    maddsrc = head + a[0] + tail
    filename = htmlname
    print "\n\n======= Generate Category html:", filename
    # get the essay urls and rewrite them to local paths
    pattern = re.compile(r'href=\"(http://www.cppblog.com/vczh/archive/\w*/\w*/\w*/\w*\.html)\"')
    res = re.findall(pattern, maddsrc)
    for item in res:
        pattern = re.compile(r'http://www.cppblog.com/vczh/archive/\w*/\w*/\w*/(.+?\.html)')
        crec = re.findall(pattern, item)
        dirname = filename[0:len(filename) - 5]
        if os.path.isdir(dirname) == False:
            os.mkdir(dirname)
            print "Make directory:", dirname
        maddsrc = maddsrc.replace(item, dirname + '/' + crec[0])
        signleFileHandle(item, dirname)
    fo = open(filename, "wb")
    fo.write(maddsrc)
    fo.close()
    return

## main logic
def getHtml():
    "find every category on the blog front page and crawl it"
    response = urllib2.urlopen("http://www.cppblog.com/vczh/default.html?page=1&OnlyTitle=1")
    res = response.read()
    pattern = re.compile(r'(http://www.cppblog.com/vczh/category/\d*\.html)\"')
    category = re.findall(pattern, res)
    for item in category:
        getCategory(item)
    return

def genIndexfile():
    "generate the local index page"
    fo = urllib2.urlopen("http://www.cppblog.com/vczh/default.html?page=1&OnlyTitle=1")
    src = fo.read()
    # the capture deliberately keeps the trailing '">' so it can be pasted straight into the <a> tag below
    pattern = re.compile(r'href="http://www.cppblog.com/vczh/category/(.*?\.html">)')
    clist = re.findall(pattern, src)
    print clist
    pattern = re.compile(r'href="http://www.cppblog.com/vczh/category/\d*.html">(.+?)</a>')
    nlist = re.findall(pattern, src)
    fo.close()
    fo = open("head.html", "r")
    head = fo.read()
    fo.close()
    fo = open("index.html", "wb")
    tail = "</div></body></html>"
    for i in range(0, len(nlist)):
        head = head + "<li><a href=\"" + clist[i] + nlist[i] + "</a></li>"
    head = head + tail
    fo.write(head)
    fo.close()
    return

getCss()
getHtml()
Running this code creates, in the current directory, a folder for each category together with that category's html page, and inside each folder it downloads every article belonging to the category, stripping the extra markup so only the post body is kept. For the front page, just adapt the contents of genIndexfile() yourself.
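The "keep only the post body" step is just the greedy capture used in signleFileHandle above. On a toy snippet shaped like cppblog's markup (the snippet below is made up for illustration) it behaves like this:

import re

# same pattern as in signleFileHandle: capture from the post div up to the postDesc div
pattern = re.compile(r'(<div class = \"post\">[\d\D]*)<div class = \"postDesc\">')

sample = ('<html><body><div class = "post"><h1>Title</h1>'
          '<p>Body text</p></div><div class = "postDesc">posted at ...</div></body></html>')

print re.findall(pattern, sample)[0]
# -> <div class = "post"><h1>Title</h1><p>Body text</p></div>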
The head.html it uses is written like this; I kept it as simple as possible.
<html>
<head id="Head"><title>
</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><link type="text/css" rel="stylesheet" href="common.css" /><link id="MainCss" type="text/css" rel="stylesheet" href="style.css" /></head>
<body>
Then the css files in the current directory need to be copied into every generated folder, because only after finishing did I realize I had never fixed the css paths in head.html, so this is my after-the-fact patch. Save the code below as mvcss.sh, then run $ chmod 777 mvcss.sh && ./mvcss.sh (a Python version of the same copy step is sketched after the script).
#!/bin/bash
for dirname in `ls -F|grep "/"`
do
    echo $dirname
    cp *.css $dirname
    ls -a $dirname|grep css
done
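Incidentally, the same copy step could also be done from Python with the shutil module the script already imports; the snippet below is just an alternative sketch under that assumption, not part of the original workflow.

# -*- coding: UTF-8 -*-
import os
import shutil

# copy every top-level css file into each generated category folder
cssfiles = [f for f in os.listdir('.') if f.endswith('.css')]
for d in os.listdir('.'):
    if os.path.isdir(d):
        for css in cssfiles:
            shutil.copy(css, d)
        print "Copied css into:", d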
Directory screenshot: