The first script is getlink.py; its job is to collect the links to all the chapters. Open 当时明月's blog and click "我的所有文章" ("all my posts"): every article the author has published is listed page by page, and the chapters of the novel are exactly the entries whose titles match the pattern `.?长篇.?明朝的那些事儿-历史应该可以写得好看\[(\d*)-(\d*)\]`. All getlink.py has to do is save these chapters and their corresponding links into links.dat for later use.
getlink.py

```python
# -*- coding: utf-8 -*-
import urllib, re, os, sys, pickle
from xml.dom.minidom import parse
import xml.dom.minidom

uid = '1233526741'  # 当时明月's user ID

# Collect the chapters and their corresponding links
chapters = {}  # e.g. chapters['0001-0010'] = '/u/49861fd5010003ii'
for i in range(1, 100):
    filehandle = urllib.urlopen('http://blog.sina.com.cn/sns/service.php?m=aList&uid=%s&sort_id=0&page=%d' % (uid, i))
    myDoc = parse(filehandle)
    myRss = myDoc.getElementsByTagName("rss")[0]
    items = myRss.getElementsByTagName("item")
    for item in items:
        title = item.getElementsByTagName("title")[0].childNodes[0].data
        link = item.getElementsByTagName("link")[0].childNodes[0].data
        match = re.search(ur'.?长篇.?明朝的那些事儿-历史应该可以写得好看\[(\d*)-(\d*)\]', title)
        if match:
            # print match.group(1), ":", match.group(2)
            chapters['%04d-%04d' % (int(match.group(1)), int(match.group(2)))] = link

# Save chapters to links.dat for later use
output = open('links.dat', 'wb+')
pickle.dump(chapters, output)
output.close()
```
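The title matching and the zero-padded dictionary keys can be checked offline. The sketch below (Python 3, so without the Python 2-only `ur` prefix and `L` locale flag) runs the same pattern over made-up sample titles; the titles and the link value are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as in getlink.py, minus the Python 2-only ur'' prefix
pattern = re.compile(r'.?长篇.?明朝的那些事儿-历史应该可以写得好看\[(\d*)-(\d*)\]')

titles = [
    u'(长篇)明朝的那些事儿-历史应该可以写得好看[1-10]',  # a chapter post (made up)
    u'一篇无关的随笔',                                    # a non-chapter post: no match
]

chapters = {}
for title in titles:
    m = pattern.search(title)
    if m:
        # Zero-padded keys so that sorting the keys sorts the chapters numerically
        chapters['%04d-%04d' % (int(m.group(1)), int(m.group(2)))] = '/u/hypothetical_link'
```

The zero padding is what makes `sorted(chapters)` in the second script emit chapters in reading order.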
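The hand-off between the two scripts is a plain pickle round-trip: getlink.py dumps the chapter dictionary, bookdownload.py loads it back. A minimal sketch, using a temporary file in place of the real links.dat:

```python
import os
import pickle
import tempfile

# The chapter dict, as built by getlink.py (value taken from the example above)
chapters = {'0001-0010': '/u/49861fd5010003ii'}

path = os.path.join(tempfile.mkdtemp(), 'links.dat')
with open(path, 'wb') as output:   # writer side (getlink.py)
    pickle.dump(chapters, output)
with open(path, 'rb') as links:    # reader side (bookdownload.py)
    loaded = pickle.load(links)
```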
Once links.dat is in place, bookdownload.py downloads every chapter and assembles them into the final full-text file, mingthings.txt. The key step of this script is the regular expression that extracts each post's actual content from the downloaded page (discarding ads, scripts, and so on). Inspecting the HTML source of the articles shows that everything we want sits inside `<div id="articleTextxxxxxxxxxxxxxxxx">...</div>`, where xxxxxxxxxxxxxxxx is the article's link, so a regular expression is enough to pull out the body of each post.
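The extraction idea can be seen on a toy page (the HTML below is made up; a real Sina blog page is far larger). The script builds the div id from the chapter link via `chapters[chapter][3:]`, which drops the leading `/u/` from a link like `/u/49861fd5010003ii`:

```python
import re

link = '/u/49861fd5010003ii'
# A made-up page: an ad div we want to skip, then the article body div
webpage = ('<html><body><div id="ad">advert</div>'
           '<div id="articleText49861fd5010003ii" class="article">正文内容</div>'
           '</body></html>')

# (?s) lets . span line breaks; (?i) makes the match case-insensitive.
# link[3:] strips '/u/' so the remainder completes the div id.
match = re.search(r'(?si).*<div id="articleText' + link[3:] + r'".*?>(.*?)</div>.*', webpage)
body = match.group(1)
```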
bookdownload.py

```python
# -*- coding: utf-8 -*-
import urllib, re, os, sys, pickle
from xml.dom.minidom import parse
import xml.dom.minidom

uid = '1233526741'  # 当时明月's user ID

# Load the chapters and their corresponding links
chapters = {}  # e.g. chapters['0001-0010'] = '/u/49861fd5010003ii'
links = open('links.dat', 'rb+')  # links.dat is produced by getlink.py
chapters = pickle.load(links)     # restore the chapter-to-link mapping

book = open('mingthings.txt', 'w+')  # mingthings.txt is the final full text
for chapter in sorted(chapters):
    print chapter  # show which chapter is being processed
    webpage = urllib.urlopen('http://blog.sina.com.cn' + chapters[chapter]).read().decode('utf-8')

    # s: dot matches newline; i: case-insensitive; m: ^$ match at line breaks
    match = re.search(ur'(?siLu).*<div id="articleText' + chapters[chapter][3:] + '".*?>(.*?)</div>.*', webpage)
    if match:
        text = match.group(1)  # the chapter's raw content

        # Clean up the chapter content
        text = re.sub(ur'(?sLu)<(.*?)>', '', text)   # strip HTML tags
        text = re.sub(ur'(?sLu)( )+', ' ', text)     # collapse runs of spaces
        text = re.sub(ur"(?Lum)^( +)", "", text)     # strip leading spaces on each line
        text = re.sub(ur'(?Lum)^(\s+)', '', text)    # strip remaining leading whitespace
        text = re.sub(ur'(?siLu)(.?长篇.?明朝的那些事儿-历史应该可以写得好看\[\d*])', r'\r\n\1\r\n', text)  # put chapter headings on their own lines
        text = re.sub(ur'(?Lum)^(.*)$', ur'\1', text)

        book.write(text.encode('gbk', 'ignore') + "\r\n\r\n")
        book.flush()  # flush after each chapter so a crash loses at most one

book.close()
```
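The cleanup chain is a pipeline of substitutions: drop tags, collapse spaces, trim leading whitespace. The sketch below (Python 3, without the Python 2-only `ur` prefix and `L` flag) runs a made-up fragment through the same steps:

```python
import re

# A toy chapter fragment with tags and stray spaces (made up for illustration)
text = '<p>  第一章</p><br/>   正文   开始'

text = re.sub(r'(?s)<(.*?)>', '', text)  # strip HTML tags
text = re.sub(r'(?s)( )+', ' ', text)    # collapse runs of spaces
text = re.sub(r'(?m)^( +)', '', text)    # strip leading spaces on each line
text = re.sub(r'(?m)^(\s+)', '', text)   # strip remaining leading whitespace
```

After the four substitutions the fragment is plain text with single spaces and no leading indentation.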
Run these two scripts once a day, in order, and the latest full text of 《明朝的那些事儿-历史应该可以写得好看》 will be sitting on your hard disk.
In short, regular expressions are a powerful tool for downloading serialized web novels chapter by chapter.