Scraping vczh's (轮子哥) Blog
Sometimes my network connection isn't great, so I decided to just crawl 轮子哥's (vczh's) blog and read it offline. A while back I read his articles on regular-expression engines and wrote a simple regex engine myself, and I learned a lot from it. Once you have a regex tool, building a configurable lexer becomes very convenient; then you think you might as well build a compiler, so you have to write a parser, then do code generation, then design a virtual-machine instruction set so your code can actually run. Every step pushes your skills a bit further. I honestly think the quality of his blog posts is very high. One wish of mine is to work my way through as many of the projects he has written about as I can. I don't need to reach his level; I just want to understand how each thing is built, and learning all the ideas that go into building it is enough for me.
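As a rough illustration of the regex-to-lexer step mentioned above, here is a minimal sketch of a table-driven lexer built on Python's re module (my own toy example, not vczh's code; the token table and names are made up): each token kind is just a named regex, so editing the table reconfigures the lexer.

# -*- coding: UTF-8 -*-
import re

# the token table is plain data: (token name, regex); editing it reconfigures the lexer
TOKENS = [
    ("NUMBER", r'\d+'),
    ("IDENT",  r'[A-Za-z_]\w*'),
    ("OP",     r'[+\-*/=]'),
    ("SPACE",  r'\s+'),
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % (name, rx) for name, rx in TOKENS))

def tokenize(text):
    "yield (token name, lexeme) pairs, skipping whitespace"
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at %d" % pos)
        pos = m.end()
        if m.lastgroup != "SPACE":
            yield m.lastgroup, m.group()

for tok in tokenize("width = oldWidth + 42"):
    print tok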
As for the scraper itself, I picked up Python as I went; this little program isn't hard, just tedious.
# -*- coding: UTF-8 -*-
import shutil
import urllib2
import re
import os

def getCss():
    "download the css files referenced by a sample article page"
    response = urllib2.urlopen("http://www.cppblog.com/vczh/archive/2008/04/21/47719.html")
    res = response.read()
    pattern = re.compile(r'\"\S*\.css\"')
    s = re.findall(pattern, res)
    for css in s:
        # strip the surrounding quotes from the matched path
        n = len(css) - 1
        css = css[1:n]
        if css.find("http://") == -1:
            cssfullpath = "http://www.cppblog.com" + css
        else:
            cssfullpath = css
        response = urllib2.urlopen(cssfullpath)
        res = response.read()
        pattern = re.compile(r'\w*\.css')
        filename = re.findall(pattern, css)
        print "Download css file:", filename[0]
        fo = open(filename[0], "wb")
        fo.write(res)
        fo.close()
    return

def signleFileHandle(url, dirname):
    "download one article and keep only the post body"
    htmpattern = re.compile(r'\w*\.html')
    result = re.findall(htmpattern, url)
    htmlname = result[0]
    if os.path.isfile(dirname + '/' + htmlname):
        print "File exist:", dirname + '/' + htmlname
        return
    response = urllib2.urlopen(url)
    res = response.read()
    # keep everything from <div class = "post"> up to <div class = "postDesc">
    pattern = re.compile(r'(<div class = \"post\">[\d\D]*)<div class = \"postDesc\">')
    a = re.findall(pattern, res)
    # download the images referenced by the post
    pattern = re.compile(r'src=\"http://(.+?\.jpg|.+?\.png)\"')
    pic = re.findall(pattern, a[0])
    if pic == []:
        print "No pictures:"
    maddsrc = a[0]
    for item in pic:
        item = "http://" + item
        print item
        imgpattern = re.compile(r'\w*\.jpg|\w*\.png')
        imgname = re.findall(imgpattern, item)
        imgpath = dirname + '/' + imgname[0]
        if os.path.isfile(imgpath) == False:
            f = urllib2.urlopen(item)
            data = f.read()
            with open(imgpath, "wb") as code:
                code.write(data)
            print "Download picture:", imgpath
        else:
            print "Img exist:"
        # point the article at the local copy whether it was just downloaded or already on disk
        maddsrc = maddsrc.replace(item, imgname[0])
    # gen html: wrap the post body with the local head and a closing tail
    fo = open("head.html", "r")
    head = fo.read()
    fo.close()
    tail = "</div></body></html>"
    maddsrc = head + maddsrc + tail
    filename = dirname + '/' + htmlname
    if os.path.isfile(filename) == False:
        fo = open(filename, "wb")
        fo.write(maddsrc)
        fo.close()
        print "Generate article :", filename
    else:
        print "File exist:"
    return

def getCategory(url):
    "generate a local page for one category and fetch its articles"
    htmpattern = re.compile(r'\w*\.html')
    result = re.findall(htmpattern, url)
    htmlname = result[0]
    url = url + "?Show=All"
    response = urllib2.urlopen(url)
    res = response.read()
    pattern = re.compile(r'(<div class=\"entrylist\">[\d\D]*)<div class=\"entrylistItemPostDesc\">')
    a = re.findall(pattern, res)
    if a == []:
        print "Failed:", url
        return
    fo = open("head.html", "r")
    head = fo.read()
    fo.close()
    tail = "</div></body></html>"
    maddsrc = head + a[0] + tail
    filename = htmlname
    print "\n\n======= Generate Category html:", filename
    # get the essay urls and rewrite them to local paths
    pattern = re.compile(r'href=\"(http://www.cppblog.com/vczh/archive/\w*/\w*/\w*/\w*\.html)\"')
    res = re.findall(pattern, maddsrc)
    for item in res:
        pattern = re.compile(r'http://www.cppblog.com/vczh/archive/\w*/\w*/\w*/(.+?\.html)')
        crec = re.findall(pattern, item)
        dirname = filename[0:len(filename) - 5]
        if os.path.isdir(dirname) == False:
            os.mkdir(dirname)
            print "Make directory:", dirname
        maddsrc = maddsrc.replace(item, dirname + '/' + crec[0])
        signleFileHandle(item, dirname)
    fo = open(filename, "wb")
    fo.write(maddsrc)
    fo.close()
    return

## main logic
def getHtml():
    "find every category on the blog front page and crawl it"
    response = urllib2.urlopen("http://www.cppblog.com/vczh/default.html?page=1&OnlyTitle=1")
    res = response.read()
    pattern = re.compile(r'(http://www.cppblog.com/vczh/category/\d*\.html)\"')
    category = re.findall(pattern, res)
    for item in category:
        getCategory(item)
    return

def genIndexfile():
    "generate the local index page"
    fo = urllib2.urlopen("http://www.cppblog.com/vczh/default.html?page=1&OnlyTitle=1")
    src = fo.read()
    # the capture deliberately keeps the trailing '">' so it can be pasted straight into the <a> tag below
    pattern = re.compile(r'href="http://www.cppblog.com/vczh/category/(.*?\.html">)')
    clist = re.findall(pattern, src)
    print clist
    pattern = re.compile(r'href="http://www.cppblog.com/vczh/category/\d*.html">(.+?)</a>')
    nlist = re.findall(pattern, src)
    fo.close()
    fo = open("head.html", "r")
    head = fo.read()
    fo.close()
    fo = open("index.html", "wb")
    tail = "</div></body></html>"
    for i in range(0, len(nlist)):
        head = head + "<li><a href=\"" + clist[i] + nlist[i] + "</a></li>"
    head = head + tail
    fo.write(head)
    fo.close()
    return

getCss()
getHtml()
Running this code creates, in the current directory, a folder for each category together with that category's html page, and inside each folder it downloads every article belonging to the category, stripping the extra markup so only the post body is kept. For the front page, just adapt the contents of genIndexfile() yourself.
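The "keep only the post body" step is just the greedy capture used in signleFileHandle above. On a toy snippet shaped like cppblog's markup (the snippet below is made up for illustration) it behaves like this:

import re

# same pattern as in signleFileHandle: capture from the post div up to the postDesc div
pattern = re.compile(r'(<div class = \"post\">[\d\D]*)<div class = \"postDesc\">')

sample = ('<html><body><div class = "post"><h1>Title</h1>'
          '<p>Body text</p></div><div class = "postDesc">posted at ...</div></body></html>')

print re.findall(pattern, sample)[0]
# -> <div class = "post"><h1>Title</h1><p>Body text</p></div>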
The head.html it uses is written like this; I kept it as simple as possible.
<html>
<head id="Head"><title>
</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><link type="text/css" rel="stylesheet" href="common.css" /><link id="MainCss" type="text/css" rel="stylesheet" href="style.css" /></head>
<body>
Then the css files in the current directory need to be copied into every generated folder, because only after finishing did I realize I had never fixed the css paths in head.html, so this is my after-the-fact patch. Save the code below as mvcss.sh, then run $ chmod 777 mvcss.sh && ./mvcss.sh (a Python version of the same copy step is sketched after the script).
#!/bin/bash
for dirname in `ls -F|grep "/"`
do
    echo $dirname
    cp *.css $dirname
    ls -a $dirname|grep css
done
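Incidentally, the same copy step could also be done from Python with the shutil module the script already imports; the snippet below is just an alternative sketch under that assumption, not part of the original workflow.

# -*- coding: UTF-8 -*-
import os
import shutil

# copy every top-level css file into each generated category folder
cssfiles = [f for f in os.listdir('.') if f.endswith('.css')]
for d in os.listdir('.'):
    if os.path.isdir(d):
        for css in cssfiles:
            shutil.copy(css, d)
        print "Copied css into:", d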
Directory screenshot: