Automatically saving blog pages with a Python script

2016.2.24: Added handling of illegal characters in the title, which fixes the abnormal exit that occurred when saving a file.
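
The fix comes down to replacing every character that is not allowed in a file name before the blog title is used as one. A minimal Python 3 sketch of the idea follows (the scripts in this post are Python 2, and the name sanitize_title is hypothetical, not part of the scripts):

```python
# Python 3 sketch; sanitize_title is a hypothetical helper name.
INVALID_FILENAME_CHARS = '\\/:*?"<>|'  # characters Windows rejects in file names

def sanitize_title(title, replacement="_"):
    """Return title with every invalid file-name character replaced."""
    table = {ord(c): replacement for c in INVALID_FILENAME_CHARS}
    return title.translate(table)

print(sanitize_title("C/C++"))
```

Replacing rather than deleting the characters keeps titles such as "C/C++" readable as file names.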


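1. Preface

The article "Saving web pages with a Python script" described how to use a Python script to save all of a user's CSDN blog pages. Looking back, that code was not very readable, so it has been rewritten here.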


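2. Main structure

2.1 File list

The project currently consists of the following Python files:

  • export_blog.py: the main entry point for exporting the blog data;
  • web_utils.py: functions that save a page or fetch its content given a URL;
  • page_count_parser.py: derives the URLs of the blog list pages from the user's blog main page;
  • blog_item_parser.py: extracts the URL and title of each blog post from a blog list page.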

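2.2 Main flow

The flow is as follows:

  • 1. Determine the blog's main page (called the main page).
  • 2. From the main page, get the URLs of all blog list pages (called page lists).
  • 3. Walk through each page list and collect the URL, title and other details of every blog post (each post is called a blog item).
  • 4. Once the details of all blog items are known, read the data from those URLs and save it locally.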
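
The four steps above can be sketched as a single function. This is a Python 3 illustration only: fetch, list_pages and list_items are hypothetical callables standing in for web_utils and the two parsers, and a tiny in-memory "site" replaces the network so the sketch runs on its own.

```python
# Python 3 sketch of the four steps; all names here are hypothetical.
def export_blog(main_page_url, fetch, list_pages, list_items):
    main_page = fetch(main_page_url)               # step 1: main page
    page_lists = list_pages(main_page)             # step 2: page-list URLs
    items = []
    for page_url in page_lists:                    # step 3: blog items
        items.extend(list_items(fetch(page_url)))
    # step 4: fetch every blog item; a real script would write these to disk
    return {title: fetch(url) for url, title in items}

# In-memory stand-in for the blog, so no network access is needed.
site = {
    "main": "list/1 list/2",
    "list/1": "post/a=First",
    "list/2": "post/b=Second",
    "post/a": "<html>A</html>",
    "post/b": "<html>B</html>",
}
pages = export_blog(
    "main",
    fetch=site.__getitem__,
    list_pages=lambda content: content.split(),
    list_items=lambda content: [tuple(content.split("="))],
)
print(sorted(pages))
```

Passing the fetch and parse steps in as parameters is a choice made for this sketch so that each step can be tried in isolation; the actual scripts call their modules directly.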

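2.3 Domain glossary

The previous section introduced several terms. This section will illustrate them with screenshots and the corresponding HTML fragments. TODO

3. Help information for the scripts' classes and functions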

3.1 export_blog.py

NAME
    export_blog - #encoding: utf-8

FILE
    d:\examples\python\export_blog\export_blog.py

FUNCTIONS
    export_csdn_blogs(user_name, blog_saved_path, sleep_len)
        Read the main_page_url, and parse all the blog information, then save to blog_saved_path.

        e.g.:
        user_name = 'a_flying_bird'
        blog_saved_path = "D:\examples\python\export_blog\2015-07-25"
        sleep_len = 5
        export_csdn_blogs(user_name, blog_saved_path, sleep_len)
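
The user id (for example 'u013344915') that the later steps need is read from the main page by the script in section 4.1, which looks for the line `var username = "...";`. A minimal Python 3 sketch of that lookup is shown here; the name find_user_id and the shortened sample HTML are assumptions made for illustration.

```python
import re

# Python 3 sketch; find_user_id is a hypothetical name.
USERNAME_PATTERN = re.compile(r'^\s*var\s+username\s*=\s*"(u\d+)";\s*$', re.MULTILINE)

def find_user_id(content):
    """Return the id from 'var username = "uNNN";', or None if it is absent."""
    match = USERNAME_PATTERN.search(content)
    return match.group(1) if match else None

sample = '''<script type="text/javascript">
    var username = "u013344915";
    var _blogger = username;
</script>'''
print(find_user_id(sample))
```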


3.2 web_utils.py

NAME
    web_utils

FILE
    d:\examples\python\export_blog\web_utils.py

FUNCTIONS
    fix_content(content)
        <script type="text/javascript">
            var protocol = window.location.protocol;
            document.write('<script type="text/javascript" src="' + protocol + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random() + '"></' + 'script>');
        </script>

        While parsing the line of  'document.write...', there is some error. So we will delete this line.

    get_page_content(url)
        Get the web page's content.

    save_page(url, filename)
        Save the web page specified by url.
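
The two ideas behind these helpers, sending a browser-like User-Agent and dropping the `document.write` line that trips up the parser, can be sketched in Python 3 as follows. The original module uses Python 2's urllib2; build_request and drop_line_after are hypothetical names, and nothing here touches the network.

```python
import urllib.request

# Python 3 sketch; function names are illustrative, not from web_utils.py.
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")

def build_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def drop_line_after(content, marker="window.location.protocol"):
    """Remove the line that follows the first line containing marker."""
    lines = content.split("\n")
    for index, line in enumerate(lines):
        if marker in line and index + 1 < len(lines):
            del lines[index + 1]
            break
    return "\n".join(lines)

page = ("var protocol = window.location.protocol;\n"
        "document.write('...');\n"
        "</script>")
print(drop_line_after(page))
```

Fetching the page would then be `urllib.request.urlopen(build_request(url)).read()`, which returns bytes in Python 3.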

3.3 page_count_parser.py

NAME
    page_count_parser - #encoding: utf-8

FILE
    d:\examples\python\export_blog\page_count_parser.py

CLASSES
    HTMLParser.HTMLParser(markupbase.ParserBase)
        PageCountParser

    class PageCountParser(HTMLParser.HTMLParser)
     |  Get the page count from this 'div'.
     |
     |  example:
     |  <div id="papelist" class="pagelist">
     |      <span> 137条数据  共10页</span>
     |      <strong>1</strong>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
     |  </div>
     |
     |  Method resolution order:
     |      PageCountParser
     |      HTMLParser.HTMLParser
     |      markupbase.ParserBase
     |
     |  Methods defined here:
     |
     |  __init__(self, user_id)
     |
     |  get_page_count(self)
     |
     |  get_page_lists(self)
     |
     |  handle_data(self, text)
     |
     |  handle_endtag(self, tag)
     |
     |  handle_starttag(self, tag, attrs)
     |
     |  save_page_count(self, attrs)
     |      Save the pagecount.
     |
     |      example:
     |      <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
     |
     |      Windows 8, Firefox 38.0.6
     |      <a href="/u013344915/article/list/2">
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from HTMLParser.HTMLParser:
     |
     |  .......

FUNCTIONS
    get_page_lists(content, user_id)
        Get the page lists' url.
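
The approach of this parser, remembering the largest list number seen inside `<div id="papelist">` and then generating one URL per page, can be sketched in Python 3 with html.parser. The class name PageCounter and the shortened sample HTML are assumptions for illustration; unlike the script, the sketch accepts both absolute and site-relative hrefs in one pattern.

```python
from html.parser import HTMLParser
import re

class PageCounter(HTMLParser):
    """Track the largest /article/list/N number inside <div id="papelist">."""
    def __init__(self, user_id):
        super().__init__()
        self.in_page_list = False
        self.page_count = 1
        # Accept both absolute and site-relative hrefs.
        self.pattern = re.compile(
            r"^(?:http://blog\.csdn\.net)?/%s/article/list/(\d+)$" % re.escape(user_id))

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("id", "papelist") in attrs:
            self.in_page_list = True
        elif self.in_page_list and tag == "a":
            match = self.pattern.match(dict(attrs).get("href") or "")
            if match:
                self.page_count = max(self.page_count, int(match.group(1)))

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_page_list = False

sample = '''<div id="papelist" class="pagelist">
<strong>1</strong>
<a href="/u013344915/article/list/2">2</a>
<a href="http://blog.csdn.net/u013344915/article/list/10">Last</a>
</div>
<a href="/u013344915/article/list/99">outside the div</a>'''
counter = PageCounter("u013344915")
counter.feed(sample)
counter.close()
page_urls = ["http://blog.csdn.net/u013344915/article/list/%d" % n
             for n in range(1, counter.page_count + 1)]
print(counter.page_count, page_urls[-1])
```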

3.4 blog_item_parser.py

NAME
    blog_item_parser - #encoding: utf-8

FILE
    d:\examples\python\export_blog\blog_item_parser.py

CLASSES
    HTMLParser.HTMLParser(markupbase.ParserBase)
        BlogItemsParser
    __builtin__.object
        BlogItem

    class BlogItem(__builtin__.object)
     |  Methods defined here:
     |
     |  __init__(self, id)
     |
     |  dump(self)
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

    class BlogItemsParser(HTMLParser.HTMLParser)
     |  Get all the article's url and title.
     |
     |  Method resolution order:
     |      BlogItemsParser
     |      HTMLParser.HTMLParser
     |      markupbase.ParserBase
     |
     |  Methods defined here:
     |
     |  __init__(self, user_name)
     |
     |  get_blog_items(self)
     |
     |  handle_data(self, text)
     |
     |  handle_endtag(self, tag)
     |
     |  handle_starttag(self, tag, attrs)
     |
     |  save_article_title(self, attrs)
     |      Save the article_title.
     |
     |      example:
     |      <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
     |          Linux环境下列出指定目录下的所有文件
     |      </a>
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from HTMLParser.HTMLParser:
     |
     |  ........

FUNCTIONS
    get_article_items(content, user_name)
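
The parsing idea here is a small state machine: enter `<div class="article_title">`, remember the href of the matching `<a>`, and take the next non-empty text as the title. A Python 3 sketch under those assumptions follows; TitleCollector and the sample HTML (with an English stand-in title) are hypothetical and simplified compared with the real page.

```python
from html.parser import HTMLParser
import re

class TitleCollector(HTMLParser):
    """Collect (url, title) pairs from <a> tags inside <div class="article_title">."""
    def __init__(self, user_name):
        super().__init__()
        self.in_title_div = False
        self.pending_url = None   # set between <a ...> and its text
        self.items = []
        self.pattern = re.compile(
            r"^(?:http://blog\.csdn\.net)?/%s/article/details/\d+$" % re.escape(user_name))

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "article_title") in attrs:
            self.in_title_div = True
        elif self.in_title_div and tag == "a":
            href = dict(attrs).get("href") or ""
            if self.pattern.match(href):
                self.pending_url = href

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_title_div = False

    def handle_data(self, data):
        if self.pending_url and data.strip():
            self.items.append((self.pending_url, data.strip()))
            self.pending_url = None

sample = '''<div class="article_title"><h1><span class="link_title">
<a href="/a_flying_bird/article/details/47028939">
    List all files under a directory on Linux
</a></span></h1></div>
<div class="article_manage">
<a href="/a_flying_bird/article/details/47028939">Read</a></div>'''
collector = TitleCollector("a_flying_bird")
collector.feed(sample)
collector.close()
print(collector.items)
```

The second link to the same article sits in the article_manage div, so it is ignored, which is why looking only at article_title is sufficient.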

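4. The Python scripts

To simplify this step, the Python scripts were packaged and uploaded to the download page http://download.csdn.net/detail/u013344915/8935181. That link was later removed for no apparent reason, so copy the code from this page directly, or get it from the network drive: http://pan.baidu.com/s/1pJYo2ZD

If the download does not work for you (for example because you are not logged in), you can also copy the code directly from here.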

4.1 export_blog.py

#!/usr/bin/env python
#encoding: utf-8

'''
Export csdn's blog.

e.g.:
1. Linux:
./export_blog.py a_flying_bird ./2015-07-25 5
2. Windows:
python export_blog.py a_flying_bird 2015-07-25 5
'''
  
import time   
import os   
import re   
import sys  
  
import web_utils  
import page_count_parser  
import blog_item_parser  
  
def get_user_id(content):  
    '''Get user id from the content of main page.

    e.g.:
    <script type="text/javascript">
        var username = "u013344915";
        var _blogger = username;
        var blog_address = "http://blog.csdn.net/a_flying_bird";
        var static_host = "http://static.blog.csdn.net";
        var currentUserName = "u013344915";
    </script>
    '''
    # Raw string, so the regex escapes are not treated as string escapes.
    username_pattern = r'^var\s+username\s+=\s+"(u\d+)";$'
    lines = content.split('\n')  
    for line in lines:  
        #print line  
        line = line.strip()  
        matched = re.match(username_pattern, line)  
        if matched:  
            return matched.group(1)  
      
    return None  

# Create a valid file name from the blog's title by replacing
# the characters that are not allowed in file names.
# e.g. C/C++ -> C_C++
def replace_invalid_filename_char(title, replaced_char='_'):
    '''Replace the invalid characters in the filename with the specified character.
    The default replacement character is '_'.
    e.g.
    C/C++ -> C_C++
    '''
    valid_filename = title
    invalid_characters = '\\/:*?"<>|'
    for c in invalid_characters:
        valid_filename = valid_filename.replace(c, replaced_char)

    return valid_filename

def export_csdn_blogs(user_name, blog_saved_path, sleep_len):
    '''
    Read the main_page_url, and parse all the blog information, then save to blog_saved_path.

    The user id (e.g. 'u013344915') is parsed from the main page.

    e.g.:
    user_name = 'a_flying_bird'
    blog_saved_path = "D:\\examples\\python\\export_blog\\2015-07-25"
    sleep_len = 5
    export_csdn_blogs(user_name, blog_saved_path, sleep_len)
    '''
    step = 1  
      
    print "Step %d: mkdir the destination directory: %s" % (step, blog_saved_path)  
    step = step + 1   
    if not os.path.exists(blog_saved_path):  
        os.makedirs(blog_saved_path)  
          
    print "Step %d: Retrieve the main page's content." % (step,)  
    step = step + 1   
    main_page_url = 'http://blog.csdn.net/%s/' % (user_name,)  
    content = web_utils.get_page_content(main_page_url)  
      
    print "Step %d: Get user id from the main page." % (step,)  
    step = step + 1  
    user_id = get_user_id(content)  
    if user_id is None:  
        print "Can not get user id from the main page. Correct it first."  
        return  
    else:  
        print "user id: ", user_id  
      
    print "Step %d: Get the pagelist's URLs." % (step,)  
    step = step + 1   
    page_lists = page_count_parser.get_page_lists(content, user_id)  
      
    print "Step %d: Read all of the article information, includes: url, title." % (step,)  
    step = step + 1   
      
    articles = []  
    for page_list in page_lists:  
        print "current pagelist: ", page_list  
        page_list_content = web_utils.get_page_content(page_list)  
        the_articles = blog_item_parser.get_article_items(page_list_content, user_name)  
        articles.extend(the_articles)  
        time.sleep(sleep_len)  
      
    print "Step %d: Save the articles." % (step,)  
    step = step + 1   
      
    total_article_count = len(articles)  
    print "Total count:", total_article_count   
    index = 1  
    for article in articles:  
        print "%d/%d: %s, %s ..." % (index, total_article_count, article.url, article.title)  
        index = index + 1  
        web_utils.save_page(article.url, os.path.join(blog_saved_path, replace_invalid_filename_char(article.title) + ".htm"))  
        time.sleep(sleep_len)  
      
def usage(process_name):
    print "Usage: %s user_name saved_path sleep_len" % (process_name,)
    print "For example:"
    print "    user_name: a_flying_bird"
    print "    saved_path: /home/csdn/"
    print "    sleep_len: 5"
    
if __name__ == "__main__":  
    argc = len(sys.argv)    
    if argc != 4:  
        usage(sys.argv[0])  
        sys.exit(-1)  
      
    user_name = sys.argv[1]    
    blog_saved_path = sys.argv[2]    
    sleep_len = int(sys.argv[3])  
      
    export_csdn_blogs(user_name, blog_saved_path, sleep_len)  
      
    print "DONE!!!"     


 
 

4.2 web_utils.py

import urllib2 

def fix_content(content):
    '''
    <script type="text/javascript">
        var protocol = window.location.protocol;
        document.write('<script type="text/javascript" src="' + protocol + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random() + '"></' + 'script>');
    </script>
    
    While parsing the line of  'document.write...', there is some error. So we will delete this line.
    '''
    lines = content.split('\n')

    for index in range(0, len(lines)):
        # The offending document.write(...) line directly follows this one.
        if 'window.location.protocol' in lines[index]:
            if index + 1 < len(lines):
                # Delete by position; list.remove() would delete the first
                # line with equal text, which may be a different line.
                del lines[index + 1]
            break

    return '\n'.join(lines)
    
def save_page(url, filename):
    '''
    Save the web page specified by url.
    '''
    content = get_page_content(url)
    
    # The with statement closes the file even if write() raises.
    with open(filename, "wt") as f:
        f.write(content)

def get_page_content(url):
    '''
    Get the web page's content.
    '''
    headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}  
    req = urllib2.Request(url, headers = headers)  
    content = urllib2.urlopen(req).read() # 'UTF-8'
    content = fix_content(content)

    return content 
    
def _test():
    url = 'http://blog.csdn.net/a_flying_bird'
    filename = "main_page.htm"
    save_page(url, filename)
    
def _test_error_string():
    filename = "a.htm"
    content = open(filename, "r").read()
    content = fix_content(content)
    
    f = open("fix.htm", "wt")
    f.write(content)
    f.close()
    
if __name__ == '__main__':
    #_test()
    _test_error_string()
    


4.3 page_count_parser.py

#!/usr/bin/env python
#encoding: utf-8

from HTMLParser import HTMLParser
import re

class PageCountParser(HTMLParser):
    '''
    Get the page count from this 'div'.

    example:
    <div id="papelist" class="pagelist">
        <span> 137条数据  共10页</span>
        <strong>1</strong>
        <a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
        <a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
        <a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
        <a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
        <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
        <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
        <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
    </div>
    '''
    def __init__(self, user_id):
        HTMLParser.__init__(self)

        self.is_page_list = False
        self.page_count = 1
        # Build the URL header from user_id instead of hard-coding one user.
        self.page_list_url_header = "http://blog.csdn.net/%s/article/list/" % (user_id,)
        
        # Windows 7
        #self.prefix = ""
        #self.pattern = r"^http://blog\.csdn\.net/%s/article/list/(\d+)$" % (user_id,)
        
        # Windows 8, Firefox 38.0.6
        self.prefix = "http://blog.csdn.net"
        self.pattern = r"^/%s/article/list/(\d+)$" % (user_id,)

    def _is_page_list(self, tag, attrs):
        '''
        Whether the tag is the start tag of the page list div.

        e.g.:
        <div id="papelist" class="pagelist">
        '''
        if tag != 'div': return False

        for attr in attrs:
            name, value = attr
            if name == 'id' and value == 'papelist':  # Oooh, it is papelist, not the pagelist!
                print "enter pagelist"
                return True

        return False

    def save_page_count(self, attrs):
        '''
        Save the pagecount.

        example:
        <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
        <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
        <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
        
        Windows 8, Firefox 38.0.6
        <a href="/u013344915/article/list/2">
        '''
        for attr in attrs:
            name, value = attr
            if name == 'href':
                matched = re.match(self.pattern, value)
                #print "matched:", matched
                if matched:
                    count = int(matched.group(1))
                    #print "count:", count
                    if count > self.page_count: self.page_count = count
                    return

    def handle_starttag(self, tag, attrs):
        #print "start tag(), tag:", tag
        #print "attrs:", attrs

        if self._is_page_list(tag, attrs):
            self.is_page_list = True
            return

        if self.is_page_list:
            if tag == 'a':
                self.save_page_count(attrs)

    def handle_endtag(self, tag):
        #print "end tag(), tag:", tag
        if self.is_page_list and tag == 'div':
            self.is_page_list = False

    def handle_data(self, text):
        #print "handle data(), text:", text
        pass 

    def get_page_count(self):
        return self.page_count
        
    def get_page_lists(self):
        page_lists = []
        for index in range(1, self.page_count + 1):
            page_lists.append(self.page_list_url_header + str(index))
            
        return page_lists 

def get_page_lists(content, user_id):
    '''
    Get the page lists' url.
    '''
    parser = PageCountParser(user_id)
    parser.feed(content)
    parser.close()
    page_count = parser.get_page_count()
    print "page count: ", page_count
    
    page_lists = parser.get_page_lists()
    for page_list in page_lists:
        print page_list 
    
    return page_lists 

def _test():
    content = open('main_page.htm', 'r').read()
    get_page_lists(content, 'u013344915')
    
if __name__ == "__main__":
    _test()
    


4.4 blog_item_parser.py

#!/usr/bin/env python
#encoding: utf-8

from HTMLParser import HTMLParser
import re
import platform

'''
article_list={"list_item article_item"}+
"list_item article_item"={article_title} + {article_description} + {article_manage} + {clear}

<div id="article_list" class="list">
    <div class="list_item article_item">
        <div class="article_title">   
            <span class="ico ico_type_Original"></span>
            <h1>
                <span class="link_title">
                    <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
                        Linux环境下列出指定目录下的所有文件            
                    </a>
                </span>
            </h1>
        </div>

        <div class="article_description">
            递归方式列出指定目录下的所有子目录和文件。...        
        </div>

        <div class="article_manage">
            <span class="link_postdate">2015-07-23 21:27</span>
            <span class="link_view" title="阅读次数">
                <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939" title="阅读次数">
                    阅读
                </a>
                (4)
            </span>

            <span class="link_comments" title="评论次数">
                <a
                    href="http://blog.csdn.net/a_flying_bird/article/details/47028939#comments"
                    title="评论次数"
                    onclick="_gaq.push(['_trackEvent','function', 'onclick', 'blog_articles_pinglun'])"
                    >
                    评论
                </a>
                (0)
            </span>

            <span class="link_edit">
                <a href="http://write.blog.csdn.net/postedit/47028939" title="编辑">编辑</a>
            </span>

            <span class="link_delete">
                <a
                    href="javascript:void(0);"
                    onclick="javascript:deleteArticle(47028939);return false;"
                    title="删除">
                    删除
                </a>
            </span>

        </div>

        <div class="clear"></div>

    </div>

    <div class="list_item article_item">
        <div class="article_title">
        ..........

The key hierarchy of blog's title:
<div id="article_list" class="list">
    <div class="list_item article_item">
        <div class="article_title">   
            <span class="ico ico_type_Original"></span>
            <h1>
                <span class="link_title">
                    <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
                        Linux环境下列出指定目录下的所有文件            
                    </a>
                </span>

Furthermore, only the div of 'article_title' is enough!
'''

class BlogItem(object):
    def __init__(self, id):
        self.id = id
        self.url = None
        self.title = None

    def dump(self):
        print "(%s, %s, %s)" % (self.id, self.url, self.title)

class BlogItemsParser(HTMLParser):
    '''
    Get all the article's url and title.
    '''
    
    def __init__(self, user_name):
        HTMLParser.__init__(self)

        self.is_article_title = False
        self.ready_for_article_title = False # True once the 'a' tag has been read and the next data is the title.
        self.current_article_id = None
        self.blogItems = {}
        
        self.is_windows_platform = False 
        if platform.system() == 'Windows':
            self.is_windows_platform = True 
        
        #self.prefix = ""
        #self.pattern = r"^http://blog\.csdn\.net/%s/article/details/(\d+)$" % (user_name,) # windows 7
        
        # windows 8, Firefox 38.0.6
        self.prefix = "http://blog.csdn.net"
        self.pattern = r"^/%s/article/details/(\d+)$" % (user_name,)

    def _is_start_tag_of_article_title(self, tag, attrs):
        '''
        Whether the tag corresponds to article_title.

        e.g.:
        <div class="article_title">
        '''
        if tag != 'div': return False

        for attr in attrs:
            name, value = attr
            if name == 'class' and value == 'article_title': return True

        return False

    def save_article_title(self, attrs):
        '''
        Save the article_title.

        example:
        <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
            Linux环境下列出指定目录下的所有文件            
        </a>
        '''
        for attr in attrs:
            name, value = attr
            if name == 'href':
                matched = re.match(self.pattern, value)

                if matched:
                    id = matched.group(1)
                    self.current_article_id = id
                    blogItem = BlogItem(id)
                    blogItem.url = self.prefix + value
                    blogItem.title = None
                    self.blogItems[id] = blogItem

                    self.ready_for_article_title = True
                    return

    def handle_starttag(self, tag, attrs):
        #print "start tag(), tag:", tag
        #print "attrs:", attrs

        if self._is_start_tag_of_article_title(tag, attrs):
            self.is_article_title = True
            return

        if self.is_article_title:
            if tag == 'a':
                self.save_article_title(attrs)

    def handle_endtag(self, tag):
        #print "end tag(), tag:", tag

        if self.is_article_title and tag == 'div':
            self.is_article_title = False

    def handle_data(self, text):
        #print "handle data(), text:", text

        if self.ready_for_article_title:
            self.ready_for_article_title = False
            
            title = text.strip()
            if self.is_windows_platform:
                title = title.decode('UTF-8').encode('MBCS')
                
            self.blogItems[self.current_article_id].title = title
            assert(self.blogItems[self.current_article_id].id
                   == self.current_article_id)
            return

    def get_blog_items(self):
        return self.blogItems 
        
def get_article_items(content, user_name):
    parser = BlogItemsParser(user_name)
    parser.feed(content)
    parser.close()
    
    blogItems = parser.get_blog_items()

    print "article's count:", len(blogItems)
    for blogItem in blogItems.values():
        blogItem.dump()
    
    return blogItems.values()

def _test():
    content = open('main_page.htm', 'r').read()
    # get_article_items() requires the user name as its second argument.
    get_article_items(content, 'a_flying_bird')
    
if __name__ == "__main__":
    _test()
    

 

5. TODO

The scripts above were verified on Windows 8; they still need to be verified on Linux.

The images in the blog content also need to be saved.
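
One possible starting point for the image TODO, offered only as a Python 3 sketch and not as part of the scripts above, is to collect the `src` of every `<img>` tag and resolve it against the page URL; each resulting URL could then be downloaded in the same way as a page. The class name ImageCollector and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageCollector(HTMLParser):
    """Collect absolute URLs of all <img> tags in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.image_urls.append(urljoin(self.base_url, src))

page_url = "http://blog.csdn.net/a_flying_bird/article/details/47028939"
sample = '<p><img src="/img/a.png"><img src="http://img.blog.csdn.net/b.jpg" /></p>'
collector = ImageCollector(page_url)
collector.feed(sample)
collector.close()
print(collector.image_urls)
```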
