2016-02-24: Added handling of invalid characters in the title, fixing an abnormal exit when saving files.
1. Introduction
The earlier article "Saving web pages with a Python script" showed how to use a Python script to save all of a user's CSDN blog pages. In hindsight the code was not very readable, so it has been rewritten.
2. Overall structure
2.1 File list
The project currently consists of the following Python files:
- export_blog.py: the top-level entry point for exporting the blog data;
- web_utils.py: helper functions that fetch a page's content by URL and save it to disk;
- page_count_parser.py: parses the blog's main page and builds the list of URLs of the blog-list pages;
- blog_item_parser.py: extracts each blog's URL and title from every blog-list page.
2.2 Main workflow
The workflow is as follows (a condensed sketch follows the list):
- 1. Determine the blog's main page (called the main page).
- 2. From the main page, obtain the URLs of all the blog-list pages (called page lists).
- 3. Walk through each page-list page and collect every blog's URL, title, and other details (each blog is called a blog item).
- 4. With all the blog details in hand, fetch the content from those URLs and save it locally.
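A condensed sketch of this flow, using the functions defined in the scripts of section 4 (error handling, filename sanitizing, and the polite sleeps between requests are omitted here):
content = web_utils.get_page_content('http://blog.csdn.net/a_flying_bird/')  # 1. main page
page_lists = page_count_parser.get_page_lists(content, 'u013344915')         # 2. page lists
articles = []
for page_list in page_lists:                                                 # 3. blog items
    page_content = web_utils.get_page_content(page_list)
    articles.extend(blog_item_parser.get_article_items(page_content, 'a_flying_bird'))
for article in articles:                                                     # 4. save locally
    web_utils.save_page(article.url, article.title + '.htm')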
2.3 Glossary
The previous section introduced several terms; this section is meant to illustrate them with screenshots and the corresponding HTML snippets. TODO
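Until the screenshots are added, the HTML fragment below (taken from the parser example in section 3.3) shows what a page list looks like on the main page; each <a> element points to one blog-list page:
<div id="papelist" class="pagelist">
<span> 137条数据 共10页</span>
<strong>1</strong>
<a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
...
<a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
</div>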
3. Class and function help for the scripts
3.1 export_blog.py
NAME
export_blog - #encoding: utf-8
FILE
d:\examples\python\export_blog\export_blog.py
FUNCTIONS
export_csdn_blogs(user_name, blog_saved_path, sleep_len)
Read the main page, parse all the blog information, then save to blog_saved_path. The user id (e.g. 'u013344915') is parsed from the main page's content.
e.g.:
user_name = 'a_flying_bird'
blog_saved_path = "D:\examples\python\export_blog\2015-07-25"
sleep_len = 5
export_csdn_blogs(user_name, blog_saved_path, sleep_len)
3.2 web_utils.py
NAME
web_utils
FILE
d:\examples\python\export_blog\web_utils.py
FUNCTIONS
fix_content(content)
<script type="text/javascript">
var protocol = window.location.protocol;
document.write('<script type="text/javascript" src="' + protocol + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random() + '"></' + 'script>');
</script>
Parsing the 'document.write...' line raises an error, so we delete that line.
get_page_content(url)
Get the web page's content.
save_page(url, filename)
Save the web page specified by url.
3.3 page_count_parser.py
NAME
page_count_parser - #encoding: utf-8
FILE
d:\examples\python\export_blog\page_count_parser.py
CLASSES
HTMLParser.HTMLParser(markupbase.ParserBase)
PageCountParser
class PageCountParser(HTMLParser.HTMLParser)
| Get the page count from this 'div'.
|
| example:
| <div id="papelist" class="pagelist">
| <span> 137条数据 共10页</span>
| <strong>1</strong>
| <a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
| <a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
| <a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
| <a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
| <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
| <a href="http://blog.csdn.net/u013344915/article/list/2">涓嬩竴椤?/a>
| <a href="http://blog.csdn.net/u013344915/article/list/10">灏鹃〉</a>
| </div>
|
| Method resolution order:
| PageCountParser
| HTMLParser.HTMLParser
| markupbase.ParserBase
|
| Methods defined here:
|
| __init__(self, user_id)
|
| get_page_count(self)
|
| get_page_lists(self)
|
| handle_data(self, text)
|
| handle_endtag(self, tag)
|
| handle_starttag(self, tag, attrs)
|
| save_page_count(self, attrs)
| Save the pagecount.
|
| example:
| <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
| <a href="http://blog.csdn.net/u013344915/article/list/2">涓嬩竴椤?/a>
| <a href="http://blog.csdn.net/u013344915/article/list/10">灏鹃〉</a>
|
| Windows 8, Firefox 38.0.6
| <a href="/u013344915/article/list/2">
|
| ----------------------------------------------------------------------
| Methods inherited from HTMLParser.HTMLParser:
|
| .......
FUNCTIONS
get_page_lists(content, user_id)
Get the page lists' url.
3.4 blog_item_parser.py
NAME
blog_item_parser - #encoding: utf-8
FILE
d:\examples\python\export_blog\blog_item_parser.py
CLASSES
HTMLParser.HTMLParser(markupbase.ParserBase)
BlogItemsParser
__builtin__.object
BlogItem
class BlogItem(__builtin__.object)
| Methods defined here:
|
| __init__(self, id)
|
| dump(self)
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
class BlogItemsParser(HTMLParser.HTMLParser)
| Get all the article's url and title.
|
| Method resolution order:
| BlogItemsParser
| HTMLParser.HTMLParser
| markupbase.ParserBase
|
| Methods defined here:
|
| __init__(self, user_name)
|
| get_blog_items(self)
|
| handle_data(self, text)
|
| handle_endtag(self, tag)
|
| handle_starttag(self, tag, attrs)
|
| save_article_title(self, attrs)
| Save the article_title.
|
| example:
| <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
| Linux环境下列出指定目录下的所有文件
| </a>
|
| ----------------------------------------------------------------------
| Methods inherited from HTMLParser.HTMLParser:
|
| ........
FUNCTIONS
get_article_items(content, user_name)
4. Python scripts
To simplify this step, the Python scripts were packaged and uploaded to the download page http://download.csdn.net/detail/u013344915/8935181. That link has since inexplicably been deleted, so copy the code below directly, or fetch it from the network drive: http://pan.baidu.com/s/1pJYo2ZD
If the download fails for some reason (for example, you are not logged in), you can also copy the code directly from here.
4.1 export_blog.py
<pre name="code" class="python">#!/usr/bin/env python
#encoding: utf-8
'''
Export csdn's blog.
e.g.:
1. Linux:
./export_blog.py a_flying_bird ./2015-07-25 5
2. Windows
python export_blog.py a_flying_bird 2015-07-25 5
'''
import time
import os
import re
import sys
import web_utils
import page_count_parser
import blog_item_parser
def get_user_id(content):
    '''Get user id from the content of the main page.
e.g.:
<script type="text/javascript">
var username = "u013344915";
var _blogger = username;
var blog_address = "http://blog.csdn.net/a_flying_bird";
var static_host = "http://static.blog.csdn.net";
var currentUserName = "u013344915";
</script>
'''
username_pattern = '^var\s+username\s+=\s+\"(u[\d]+)\";$'
lines = content.split('\n')
for line in lines:
#print line
line = line.strip()
matched = re.match(username_pattern, line)
if matched:
return matched.group(1)
return None
# Create a valid file name.
# In fact, we replace the invalid characters in the blog's title.
# e.g. C/C++ -> C_C++
def replace_invalid_filename_char(title, replaced_char='_'):
    '''Replace the invalid characters in the filename with the specified character.
    The default replacement character is '_'.
    e.g.
    C/C++ -> C_C++
    '''
    valid_filename = title
    invalid_characters = '\\/:*?"<>|'
    for c in invalid_characters:
        #print 'c:', c
        valid_filename = valid_filename.replace(c, replaced_char)
    return valid_filename
def export_csdn_blogs(user_name, blog_saved_path, sleep_len):
    '''
    Read the main page, parse all the blog information, then save to blog_saved_path.
    The user id (e.g. 'u013344915') is parsed from the main page's content.
    e.g.:
    user_name = 'a_flying_bird'
    blog_saved_path = "D:\\examples\\python\\export_blog\\2015-07-25"
    sleep_len = 5
    export_csdn_blogs(user_name, blog_saved_path, sleep_len)
    '''
step = 1
print "Step %d: mkdir the destination directory: %s" % (step, blog_saved_path)
step = step + 1
if not os.path.exists(blog_saved_path):
os.makedirs(blog_saved_path)
print "Step %d: Retrieve the main page's content." % (step,)
step = step + 1
main_page_url = 'http://blog.csdn.net/%s/' % (user_name,)
content = web_utils.get_page_content(main_page_url)
print "Step %d: Get user id from the main page." % (step,)
step = step + 1
user_id = get_user_id(content)
if user_id is None:
print "Can not get user id from the main page. Correct it first."
return
else:
print "user id: ", user_id
print "Step %d: Get the pagelist's URLs." % (step,)
step = step + 1
page_lists = page_count_parser.get_page_lists(content, user_id)
print "Step %d: Read all of the article information, includes: url, title." % (step,)
step = step + 1
articles = []
for page_list in page_lists:
print "current pagelist: ", page_list
page_list_content = web_utils.get_page_content(page_list)
the_articles = blog_item_parser.get_article_items(page_list_content, user_name)
articles.extend(the_articles)
time.sleep(sleep_len)
print "Step %d: Save the articles." % (step,)
step = step + 1
total_article_count = len(articles)
print "Total count:", total_article_count
index = 1
for article in articles:
print "%d/%d: %s, %s ..." % (index, total_article_count, article.url, article.title)
index = index + 1
web_utils.save_page(article.url, os.path.join(blog_saved_path, replace_invalid_filename_char(article.title) + ".htm"))
time.sleep(sleep_len)
def usage(process_name):
    print "Usage: %s user_name saved_path sleep_len" % (process_name,)
    print "For example:"
    print "    user_name: a_flying_bird"
    print "    saved_path: /home/csdn/"
    print "    sleep_len: 5"
if __name__ == "__main__":
argc = len(sys.argv)
if argc != 4:
usage(sys.argv[0])
sys.exit(-1)
user_name = sys.argv[1]
blog_saved_path = sys.argv[2]
sleep_len = int(sys.argv[3])
export_csdn_blogs(user_name, blog_saved_path, sleep_len)
print "DONE!!!"
4.2 web_utils.py
import urllib2
def fix_content(content):
'''
<script type="text/javascript">
var protocol = window.location.protocol;
document.write('<script type="text/javascript" src="' + protocol + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random() + '"></' + 'script>');
</script>
    Parsing the 'document.write...' line raises an error, so we delete that line.
'''
    #error_string = '''document.write('<script type="text/javascript" src="' + protocol + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random() + '"></' + 'script>');'''
    #content.replace(error_string, "")
    lines = content.split('\n')
    for index in range(0, len(lines)):
        if lines[index].find('window.location.protocol') >= 0:
            # Delete the 'document.write...' line that follows it.
            del lines[index + 1]
            break
    return '\n'.join(lines) + '\n'
def save_page(url, filename):
'''
Save the web page specified by url.
'''
content = get_page_content(url)
f = open(filename, "wt")
f.write(content)
f.close()
def get_page_content(url):
    '''
    Get the web page's content.
    '''
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(url, headers = headers)
content = urllib2.urlopen(req).read() # 'UTF-8'
content = fix_content(content)
return content
def _test():
url = 'http://blog.csdn.net/a_flying_bird'
filename = "main_page.htm"
save_page(url, filename)
def _test_error_string():
filename = "a.htm"
content = open(filename, "r").read()
content = fix_content(content)
f = open("fix.htm", "wt")
f.write(content)
f.close()
if __name__ == '__main__':
#_test()
_test_error_string()
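A minimal check of fix_content with a hypothetical three-line input; the 'document.write...' line that follows the 'window.location.protocol' line is removed:
>>> from web_utils import fix_content
>>> sample = ' var protocol = window.location.protocol;\n document.write(...);\n</script>'
>>> fix_content(sample)
' var protocol = window.location.protocol;\n</script>\n'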
4.3 page_count_parser.py
#!/usr/bin/env python
#encoding: utf-8
from HTMLParser import HTMLParser
import re
class PageCountParser(HTMLParser):
'''
Get the page count from this 'div'.
example:
<div id="papelist" class="pagelist">
<span> 137条数据 共10页</span>
<strong>1</strong>
<a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
<a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
<a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
<a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
<a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
<a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
<a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
</div>
'''
def __init__(self, user_id):
HTMLParser.__init__(self)
self.is_page_list = False
self.page_count = 1
        self.page_list_url_header = "http://blog.csdn.net/%s/article/list/" % (user_id,)
# Windows 7
#self.prefix = ""
#self.pattern = "^http://blog.csdn.net/u013344915/article/list/([\\d]+)$"
# Windows 8, Firefox 38.0.6
self.prefix = "http://blog.csdn.net"
self.pattern = "^/%s/article/list/([\d]+)$" % (user_id,)
def _is_page_list(self, tag, attrs):
'''
Whether the tag is responding to article_title.
e.g.:
<div id="papelist" class="pagelist">
'''
if tag != 'div': return False
for attr in attrs:
name, value = attr
if name == 'id' and value == 'papelist': # Oooh, it is papelist, not the pagelist!
print "enter pagelist"
return True
return False
def save_page_count(self, attrs):
'''
Save the pagecount.
example:
<a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
<a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
<a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
    Windows 8, Firefox 38.0.6
<a href="/u013344915/article/list/2">
'''
for attr in attrs:
name, value = attr
if name == 'href':
matched = re.match(self.pattern, value)
#print "matched:", matched
if matched:
count = int(matched.group(1))
#print "count:", count
if count > self.page_count: self.page_count = count
return
def handle_starttag(self, tag, attrs):
#print "start tag(), tag:", tag
#print "attrs:", attrs
if self._is_page_list(tag, attrs):
self.is_page_list = True
return
if self.is_page_list:
if tag == 'a':
self.save_page_count(attrs)
def handle_endtag(self, tag):
#print "end tag(), tag:", tag
if self.is_page_list and tag == 'div':
self.is_page_list = False
def handle_data(self, text):
#print "handle data(), text:", text
pass
def get_page_count(self):
return self.page_count
def get_page_lists(self):
page_lists = []
for index in range(1, self.page_count + 1):
page_lists.append(self.page_list_url_header + str(index))
return page_lists
def get_page_lists(content, user_id):
'''
Get the page lists' url.
'''
parser = PageCountParser(user_id)
parser.feed(content)
parser.close()
page_count = parser.get_page_count()
print "page count: ", page_count
page_lists = parser.get_page_lists()
for page_list in page_lists:
print page_list
return page_lists
def _test():
content = open('main_page.htm', 'r').read()
get_page_lists(content, 'u013344915')
if __name__ == "__main__":
_test()
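And a minimal, hand-written check of get_page_lists (a hypothetical two-link fragment; real pages contain more links):
>>> from page_count_parser import get_page_lists
>>> fragment = ('<div id="papelist" class="pagelist">'
...             '<a href="/u013344915/article/list/2">2</a>'
...             '<a href="/u013344915/article/list/10">10</a>'
...             '</div>')
>>> urls = get_page_lists(fragment, 'u013344915')
This reports a page count of 10 and returns the ten URLs http://blog.csdn.net/u013344915/article/list/1 through .../list/10.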
4.4 blog_item_parser.py
<pre name="code" class="python">#!/usr/bin/env python
#encoding: utf-8
from HTMLParser import HTMLParser
import re
import platform
'''
article_list={"list_item article_item"}+
"list_item article_item"={article_title} + {article_description} + {article_manage} + {clear}
<div id="article_list" class="list">
<div class="list_item article_item">
<div class="article_title">
<span class="ico ico_type_Original"></span>
<h1>
<span class="link_title">
<a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
Linux环境下列出指定目录下的所有文件
</a>
</span>
</h1>
</div>
<div class="article_description">
递归方式列出指定目录下的所有子目录和文件。...
</div>
<div class="article_manage">
<span class="link_postdate">2015-07-23 21:27</span>
<span class="link_view" title="阅读次数">
<a href="http://blog.csdn.net/a_flying_bird/article/details/47028939" title="阅读次数">
阅读
</a>
(4)
</span>
<span class="link_comments" title="评论次数">
<a
href="http://blog.csdn.net/a_flying_bird/article/details/47028939#comments"
title="评论次数"
οnclick="_gaq.push(['_trackEvent','function', 'onclick', 'blog_articles_pinglun'])"
>
评论
</a>
(0)
</span>
<span class="link_edit">
<a href="http://write.blog.csdn.net/postedit/47028939" title="编辑">编辑</a>
</span>
<span class="link_delete">
<a
href="javascript:void(0);"
οnclick="javascript:deleteArticle(47028939);return false;"
title="删除">
删除
</a>
</span>
</div>
<div class="clear"></div>
</div>
<div class="list_item article_item">
<div class="article_title">
..........
The key hierarchy of blog's title:
<div id="article_list" class="list">
<div class="list_item article_item">
<div class="article_title">
<span class="ico ico_type_Original"></span>
<h1>
<span class="link_title">
<a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
Linux环境下列出指定目录下的所有文件
</a>
</span>
In fact, the 'article_title' div alone is enough!
'''
class BlogItem(object):
def __init__(self, id):
self.id = id
self.url = None
self.title = None
def dump(self):
print "(%s, %s, %s)" % (self.id, self.url, self.title)
class BlogItemsParser(HTMLParser):
'''
Get all the article's url and title.
'''
def __init__(self, user_name):
HTMLParser.__init__(self)
self.is_article_title = False
        self.ready_for_article_title = False  # The start tag 'a' has been read; ready to handle the 'data'.
self.current_article_id = None
self.blogItems = {}
self.is_windows_platform = False
if platform.system() == 'Windows':
self.is_windows_platform = True
#self.prefix = ""
#self.pattern = "^http://blog.csdn.net/a_flying_bird/article/details/([\\d]+)$" # windows 7
# windows 8, Firefox 38.0.6
self.prefix = "http://blog.csdn.net"
self.pattern = "^/%s/article/details/([\d]+)$" % (user_name,)
def _is_start_tag_of_article_title(self, tag, attrs):
'''
Whether the tag is responding to article_title.
e.g.:
<div class="article_title">
'''
if tag != 'div': return False
for attr in attrs:
name, value = attr
if name == 'class' and value == 'article_title': return True
return False
def save_article_title(self, attrs):
'''
Save the article_title.
example:
<a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
Linux环境下列出指定目录下的所有文件
</a>
'''
for attr in attrs:
name, value = attr
if name == 'href':
matched = re.match(self.pattern, value)
if matched:
id = matched.group(1)
self.current_article_id = id
blogItem = BlogItem(id)
blogItem.url = self.prefix + value
blogItem.title = None
self.blogItems[id] = blogItem
self.ready_for_article_title = True
return
def handle_starttag(self, tag, attrs):
#print "start tag(), tag:", tag
#print "attrs:", attrs
if self._is_start_tag_of_article_title(tag, attrs):
self.is_article_title = True
return
if self.is_article_title:
if tag == 'a':
self.save_article_title(attrs)
def handle_endtag(self, tag):
#print "end tag(), tag:", tag
if self.is_article_title and tag == 'div':
self.is_article_title = False
def handle_data(self, text):
#print "handle data(), text:", text
if self.ready_for_article_title:
self.ready_for_article_title = False
title = text.strip()
if self.is_windows_platform:
title = title.decode('UTF-8').encode('MBCS')
self.blogItems[self.current_article_id].title = title
assert(self.blogItems[self.current_article_id].id
== self.current_article_id)
return
def get_blog_items(self):
return self.blogItems
def get_article_items(content, user_name):
parser = BlogItemsParser(user_name)
parser.feed(content)
parser.close()
blogItems = parser.get_blog_items()
print "article's count:", len(blogItems)
for blogItem in blogItems.values():
blogItem.dump()
return blogItems.values()
def _test():
    content = open('main_page.htm', 'r').read()
    get_article_items(content, 'a_flying_bird')
if __name__ == "__main__":
_test()
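Likewise, a minimal check of get_article_items with a hand-written article_title fragment (as the docstring above notes, the 'article_title' div alone is enough):
>>> from blog_item_parser import get_article_items
>>> fragment = ('<div class="article_title">'
...             '<a href="/a_flying_bird/article/details/47028939">A title</a>'
...             '</div>')
>>> items = get_article_items(fragment, 'a_flying_bird')
article's count: 1
(47028939, http://blog.csdn.net/a_flying_bird/article/details/47028939, A title)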
5. TODO
The scripts above have been verified on Windows 8; they still need to be verified on Linux.
The images embedded in the blog posts also need to be saved.
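A possible starting point for the image TODO, as a minimal sketch (assumptions: absolute http image URLs, the same urllib2 approach as web_utils.py, and a hypothetical save_images helper; not yet part of the verified scripts):
import os
import re
import urllib2

def save_images(content, saved_path):
    '''Download every <img src="http..."> found in content into saved_path.'''
    for url in re.findall(r'<img[^>]+src="(http[^"]+)"', content):
        filename = os.path.join(saved_path, url.split('/')[-1])
        data = urllib2.urlopen(url).read()
        open(filename, 'wb').write(data)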