Migrating from MovableType to .Text with Ruby

JamesWu111

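Introduction

This year, to keep up with new technology, I started learning Ruby and Rails. My first significant Ruby project was a script that imports the posts of a MovableType blog into .Text. This article describes the process in detail, along with the results.

The Old Blog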

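The best way to export the old posts is as XML. Part of the blog's content was already available through RSS, so all I needed was a MovableType template that exports every post to RSS rather than only the most recent ones. I used RSS 2.0 for this, and the resulting XML looks like this:

<rss>
  <channel>
    ...irrelevant stuff describing the blog...
    <item>
      <title>Welcome!</title>
    </item>
    ...many more items...
  </channel>
</rss>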

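The troublesome part is that the posts contain a lot of redundant and incorrect material that has to be cleaned up. For example:

- Some links point to other posts on the existing blog

- Some links point to an earlier site and are now dead

- The HTML needs cleaning up because it carries too much redundant markup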

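The New Blog

The new blog runs on .Text, whose main remote management API is metaWeblog. The hardest part was working out, by reading the protocol and guessing, exactly how to publish posts to .Text. In the end I wrapped the calls in the following module.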

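# On Ruby 2.4 and later, xmlrpc is no longer in the standard library
# and has to be installed as a gem.
require 'xmlrpc/client'

module MetaWebLogAPI
  # Thin wrapper around the metaWeblog XML-RPC calls the importer needs.
  class Client
    def initialize(server, urlPath, blogid, username, password)
      @client = XMLRPC::Client.new(server, urlPath)
      @blogid = blogid
      @username = username
      @password = password
    end

    # Creates a post from a content Hash and returns the new post ID.
    def newPost(content, publish)
      @client.call('metaWeblog.newPost', @blogid, @username,
                   @password, content, publish)
    end

    # Returns the stored post as a Hash.
    def getPost(postid)
      @client.call('metaWeblog.getPost', postid, @username,
                   @password)
    end

    # Replaces the content of an existing post.
    def editPost(postid, content, publish)
      @client.call('metaWeblog.editPost', postid, @username,
                   @password, content, publish)
    end
  end
end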

As you can see, the XML-RPC library makes writing a wrapper like this very simple. XML-RPC structs are represented as Hash objects, so creating a post is just as simple.

client = MetaWebLogAPI::Client.new('1.2.3.4', '/path/to/weblogapi',
                                   'blogid', 'username', 'password')

blogpost = {
  'title' => 'New post!',
  'description' => 'This is the body of my new post',
  # Same key the importer uses further down for the post date.
  'dateCreated' => Time.gm(2005, 5, 31, 15, 0, 0, 0) # May 31, 2005 at 3:00 PM (UTC)
}

client.newPost(blogpost, true) # true = publish immediately
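To make the Hash-to-struct mapping concrete, here is a minimal sketch of how such a Hash could be written out in the XML-RPC wire format. The helper to_xmlrpc_value is hypothetical and exists only for this illustration; it is not part of the xmlrpc library, which does this work (and a good deal more) for you. Converting times to UTC is a choice made in this sketch, not something taken from the original importer.

```ruby
# Hypothetical helper for illustration only: turns a Ruby value into the
# <value> element that XML-RPC uses on the wire.
def to_xmlrpc_value(obj)
  inner =
    case obj
    when String
      escaped = obj.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
      "<string>#{escaped}</string>"
    when true, false
      "<boolean>#{obj ? 1 : 0}</boolean>"
    when Integer
      "<int>#{obj}</int>"
    when Time
      # XML-RPC dates use a compact ISO 8601 form such as 20050531T15:00:00.
      # This sketch converts to UTC first (an assumption of the example).
      stamp = obj.getutc.strftime('%Y%m%dT%H:%M:%S')
      "<dateTime.iso8601>#{stamp}</dateTime.iso8601>"
    when Array
      "<array><data>#{obj.map { |v| to_xmlrpc_value(v) }.join}</data></array>"
    when Hash
      # A Hash becomes a <struct> with one <member> per key.
      members = obj.map do |name, value|
        "<member><name>#{name}</name>#{to_xmlrpc_value(value)}</member>"
      end
      "<struct>#{members.join}</struct>"
    else
      raise ArgumentError, "unsupported type: #{obj.class}"
    end
  "<value>#{inner}</value>"
end

post = { 'title' => 'New post!', 'dateCreated' => Time.gm(2005, 5, 31, 15) }
puts to_xmlrpc_value(post)
```

Seeing the struct spelled out this way can help when guessing which member names a server such as .Text expects.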

With posting to .Text this easy, the skeleton of the importer looks like this:

# Loads the MetaWebLogAPI module above, saved as metaweblogapi.rb next to this script.
require_relative 'metaweblogapi'

class MetaWebLogImport
  def initialize
    # MetaWebLogAPI configuration items
    metaBlogServer = 'www.agileprogrammer.com'
    metaBlogApi = '/dotnetguy/services/metablogapi.aspx'
    metaBlogId = 'dotnetguy'
    metaBlogUser = 'myuser'
    metaBlogPassword = 'mypassword'

    @metaBlogClient = MetaWebLogAPI::Client.new(metaBlogServer,
      metaBlogApi, metaBlogId, metaBlogUser, metaBlogPassword)
  end

  def run
  end
end

MetaWebLogImport.new.run

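What remains is obvious: fill in the run method.

Parsing

The content of the RSS file has to be parsed before it can be used. Fortunately, there is a ready-made RSS library that makes reading the file easy.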

require 'rss/2.0'

class MetaWebLogImport
  # Parses the exported feed; the second argument (false) skips validation.
  def read_original_rss
    File.open('export.xml') do |file|
      @originalRss = RSS::Parser.parse(file.read, false)
    end
  end

  def run
    read_original_rss
  end
end

The RSS::Parser class returns an object that has attributes for all the members of an RSS feed (notably, we'll take advantage of link, title, description, category, and pubDate).
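As a quick illustration of those attributes, the sketch below parses a tiny inline feed instead of export.xml. The feed content and the example.com URLs are made up for the example. On Ruby 3.0 and later the rss library ships as a bundled gem rather than as part of the standard library, so it may need to be installed separately.

```ruby
require 'rss'

# A tiny hypothetical feed standing in for export.xml.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Sample blog</title>
      <link>http://old.example.com/</link>
      <description>Sample feed for illustration</description>
      <item>
        <title>Welcome!</title>
        <link>http://old.example.com/archives/000001.html</link>
        <description>&lt;p&gt;First post&lt;/p&gt;</description>
        <category>General</category>
        <pubDate>Tue, 31 May 2005 15:00:00 GMT</pubDate>
      </item>
    </channel>
  </rss>
XML

feed = RSS::Parser.parse(SAMPLE_FEED, false) # false: skip validation

feed.items.each do |item|
  puts item.title             # post title
  puts item.link              # old permalink, used later as a lookup key
  puts item.pubDate           # publication time
  puts item.category.content  # text of the first category
  puts item.description       # HTML body of the post
end
```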

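Publishing

All the old posts are now accessible. However, we know that some of them link to other posts on the same blog, and we have no way of knowing what the new URLs will be until the posts exist on the new server. What every old post does have is its existing permalink, and we can put that to work.

We need a couple of hash tables to keep track of this. The first, @newPostIdsByOldLink, maps each old permalink to the ID of the post created for it on the new blog. The second, @linksRedirects, maps each old link to its new link. We first create all the posts so that both tables are filled, and then walk through the old content and fix every link. The benefit is that most links that would otherwise need manual editing get fixed automatically, and the ones that cannot be fixed automatically (images, binary files, and so on) can be printed out for later handling.

Here is the code that does this: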

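class MetaWebLogImport
  # Builds the metaWeblog struct for one RSS item.
  def make_post_content(item, description = "[place-holder content]")
    {
      "title" => item.title,
      "description" => description,
      "dateCreated" => item.pubDate,
      "categories" => [ item.category.content ]
    }
  end

  # getPost returns the post as a Hash, so the permalink is read by key.
  def post_id_to_permalink(postId)
    @metaBlogClient.getPost(postId)['link']
  end

  # First pass: create every post unpublished with placeholder content and
  # remember its new ID and permalink, keyed by the old link.
  def post_items
    @newPostIdsByOldLink ||= {}
    @linksRedirects ||= {}
    @originalRss.items.each do |item|
      newPostId = @metaBlogClient.newPost(make_post_content(item), false)
      @newPostIdsByOldLink[item.link] = newPostId
      @linksRedirects[item.link] = post_id_to_permalink(newPostId)
    end
  end

  def run
    read_original_rss
    post_items
  end
end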
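The bookkeeping in this first pass can be tried without a server. The sketch below swaps the real client for FakeBlogClient, a hypothetical in-memory stand-in that invents new permalinks; OldItem, create_placeholder_posts, and the example.com URLs are likewise assumptions made for the illustration.

```ruby
# Hypothetical in-memory stand-in for MetaWebLogAPI::Client, so the
# bookkeeping can be run without a blog server.
class FakeBlogClient
  def initialize
    @posts = {}
  end

  # Stores the post and returns its new ID, like metaWeblog.newPost.
  def newPost(content, _publish)
    id = (@posts.size + 1).to_s
    @posts[id] = content.merge('link' => "http://new.example.com/archive/#{id}.aspx")
    id
  end

  # Returns the stored post as a Hash, like metaWeblog.getPost.
  def getPost(postid)
    @posts.fetch(postid)
  end
end

# Minimal stand-in for an RSS item.
OldItem = Struct.new(:title, :link)

# Pass 1: create placeholder posts and record, per old link, the new post ID
# and the new permalink. Returns both tables.
def create_placeholder_posts(client, items)
  ids_by_old_link = {}
  redirects = {}
  items.each do |item|
    content = { 'title' => item.title, 'description' => '[place-holder content]' }
    new_id = client.newPost(content, false)
    ids_by_old_link[item.link] = new_id
    redirects[item.link] = client.getPost(new_id)['link']
  end
  [ids_by_old_link, redirects]
end

items = [
  OldItem.new('Welcome!', 'http://old.example.com/archives/000001.html'),
  OldItem.new('Second post', 'http://old.example.com/archives/000002.html')
]

_ids, redirects = create_placeholder_posts(FakeBlogClient.new, items)
redirects.each { |old_link, new_link| puts "#{old_link} -> #{new_link}" }
```

Because every post exists before any content is rewritten, a link from an early post to a later one can still be resolved in the second pass.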

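Cleanup and Re-posting

As I mentioned earlier, the content of the posts needs cleaning up. When I wrote this importer I was not aware that Ruby has a Tidy library, so this older implementation shells out to the Tidy command-line tool instead. Tidy converts the HTML into XHTML, which I can then parse easily with REXML.

Several techniques are at work here; I'll explain each of them after the code.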

require 'rexml/document'

class MetaWebLogImport
  # Pipes HTML through the tidy command-line tool and returns the XHTML
  # that tidy writes to standard output.
  def tidy_to_xhtml(input)
    IO.popen("tidy -q -b -c -asxml -f /dev/null", "w+") do |cmd|
      cmd.puts input
      cmd.close_write
      cmd.read
    end
  end

  # Concatenates the child elements of the given element (the <body>).
  def cleansed_content(element)
    content = ""
    element.elements.each { |e| content += e.to_s }
    content
  end

  # @oldSiteUrlRegex (set elsewhere, not shown) matches URLs on the old site.
  def replace_links(xml, xpath, attributeName, oldLink)
    @unmatchedLinks ||= {}
    xml.elements.each(xpath) do |e|
      href = e.attributes[attributeName]
      newlink = @linksRedirects[href]
      if newlink
        e.attributes[attributeName] = newlink
      elsif @oldSiteUrlRegex.match(href)
        # Points at the old site but has no known replacement: record it.
        (@unmatchedLinks[href] ||= []) << oldLink
      end
    end
  end

  # Second pass: clean each post, fix its links, and publish it.
  def scan_content_for_links
    @originalRss.items.each do |item|
      xml = REXML::Document.new(tidy_to_xhtml(item.description))
      replace_links(xml, '//a', 'href', item.link)
      replace_links(xml, '//img', 'src', item.link)
      post = make_post_content(item, cleansed_content(xml.elements["//body"]))
      @metaBlogClient.editPost(@newPostIdsByOldLink[item.link], post, true)
    end
  end

  def run
    read_original_rss
    post_items
    scan_content_for_links
  end
end

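Starting with scan_content_for_links, you can see that we actually do very little. For each post, we run Tidy to convert it to XML, use XPath queries to find all the links that need fixing, and then re-post the cleaned content to the server.

tidy_to_xhtml runs tidy through the shell and captures everything it writes to standard output. Ruby makes this kind of thing very easy.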
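The same pipe-through-a-filter pattern can also be written with Open3 from the standard library, which handles writing to the child's input and reading its output in one call. In this sketch filter_through and UPCASE_FILTER are hypothetical names, and a child Ruby process that upcases its input stands in for tidy so that the example runs without tidy installed.

```ruby
require 'open3'
require 'rbconfig'

# Sends input to the standard input of an external command and returns
# whatever the command wrote to standard output. The exit status is
# ignored in this sketch.
def filter_through(command, input)
  output, _status = Open3.capture2(*command, stdin_data: input)
  output
end

# Stand-in filter for illustration: a child Ruby process that upcases its
# input. In the importer the command would be the tidy invocation instead.
UPCASE_FILTER = [RbConfig.ruby, '-e', 'print STDIN.read.upcase'].freeze

puts filter_through(UPCASE_FILTER, '<p>hello</p>')
```

Passing the command as an array of arguments avoids going through the shell, which sidesteps quoting problems when an argument contains spaces.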

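cleansed_content extracts the content of the <body> tag. It feels as though something is wrong with this function, but since I am new to Ruby, I don't know whether there is a more concise way to do it.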
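One thing worth knowing about this approach: iterating over elements visits child elements only, so text that sits directly inside <body> is dropped. A possible alternative, sketched here with the hypothetical helpers inner_elements and inner_nodes, is to serialize every child node instead.

```ruby
require 'rexml/document'

# Serializes only the child elements (what cleansed_content does).
def inner_elements(element)
  element.elements.to_a.map(&:to_s).join
end

# Serializes every child node, including bare text between elements.
def inner_nodes(element)
  element.children.map(&:to_s).join
end

doc = REXML::Document.new('<html><body>Hello <b>world</b>!</body></html>')
body = doc.elements['//body']

puts inner_elements(body) # elements only
puts inner_nodes(body)    # elements and text
```

Tidy usually wraps loose text in block elements, which may be why the element-only version worked well enough for the importer.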

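Finally, replace_links. It is long enough to give the impression that it does a lot, but I don't plan to refactor it. It finds every link in the XML and uses @linksRedirects to replace old links with new ones. Links to the old site that have no replacement are recorded in @unmatchedLinks, and links that point anywhere else are skipped because they don't need replacing.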
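The three outcomes (rewritten, recorded as unmatched, left alone) are easier to see on a small document. The following is a standalone sketch with the state passed in as arguments; rewrite_links, the sample markup, and the example.com URLs are assumptions made for the illustration.

```ruby
require 'rexml/document'

# Rewrites the given attribute on every element matched by xpath, using the
# old-to-new redirects Hash. Links on the old site without a known
# replacement are collected in unmatched (link => list of referring posts).
def rewrite_links(doc, xpath, attribute, redirects, old_site, unmatched, referrer)
  doc.elements.each(xpath) do |element|
    target = element.attributes[attribute]
    next if target.nil?

    if redirects.key?(target)
      element.attributes[attribute] = redirects[target]
    elsif old_site.match?(target)
      (unmatched[target] ||= []) << referrer
    end
  end
end

SAMPLE_XHTML = <<~XHTML
  <html><body>
    <p><a href="http://old.example.com/archives/000001.html">first post</a></p>
    <p><img src="http://old.example.com/images/photo.jpg" /></p>
    <p><a href="http://www.ruby-lang.org/">Ruby</a></p>
  </body></html>
XHTML

SAMPLE_REDIRECTS = {
  'http://old.example.com/archives/000001.html' => 'http://new.example.com/archive/1.aspx'
}.freeze
OLD_SITE = %r{\Ahttp://old\.example\.com/}

doc = REXML::Document.new(SAMPLE_XHTML)
unmatched = {}
rewrite_links(doc, '//a', 'href', SAMPLE_REDIRECTS, OLD_SITE, unmatched, 'post-1')
rewrite_links(doc, '//img', 'src', SAMPLE_REDIRECTS, OLD_SITE, unmatched, 'post-1')

puts doc.elements['//a'].attributes['href'] # the rewritten link
puts unmatched.keys                         # old-site links still to fix by hand
```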

Apache Redirects

My old site ran on a Unix server and was served by Apache, so I write all the redirect information out to a file.

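class MetaWebLogImport
  # Writes one RedirectPermanent line per migrated post. match.post_match is
  # the part of the old URL that follows whatever @oldBlogUrlRegex matched.
  def write_htaccess
    File.open('.htaccess', 'w') do |file|
      @linksRedirects.each do |key, value|
        match = @oldBlogUrlRegex.match(key)
        file.puts 'RedirectPermanent ' + match.post_match + ' ' + value if match
      end
    end
  end

  def run
    read_original_rss
    post_items
    scan_content_for_links
    write_htaccess
  end
end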
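To see what ends up in the file, here is a small standalone sketch of the same line-building logic. The names redirect_lines and OLD_BLOG_PREFIX and the example.com URLs are hypothetical; the regex is assumed to match the scheme and host of the old blog, so that post_match yields the server-relative path that Apache's RedirectPermanent directive expects as its first argument.

```ruby
# Builds the RedirectPermanent lines for a redirects Hash (old URL => new URL).
# URLs that do not match old_blog_prefix are skipped.
def redirect_lines(redirects, old_blog_prefix)
  lines = redirects.map do |old_url, new_url|
    match = old_blog_prefix.match(old_url)
    "RedirectPermanent #{match.post_match} #{new_url}" if match
  end
  lines.compact
end

# Hypothetical prefix of the old blog; everything after it is the URL path.
OLD_BLOG_PREFIX = %r{\Ahttp://old\.example\.com}

sample_redirects = {
  'http://old.example.com/archives/000001.html' => 'http://new.example.com/archive/1.aspx',
  'http://elsewhere.example.org/page.html' => 'http://new.example.com/archive/2.aspx'
}

puts redirect_lines(sample_redirects, OLD_BLOG_PREFIX)
```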

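Seeing What Was Missed

The first time you run the script, you will find that quite a few links are still broken. A blog usually contains uploaded images and other binary files. Having a record of these broken links is a great help for maintenance. Remember the hash called @unmatchedLinks? This is what it is for.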

class MetaWebLogImport
  # Lists each unresolved link followed by the old posts that reference it.
  def write_unresolved_links_list
    File.open('unresolved_links.txt', 'w') do |file|
      @unmatchedLinks.each do |link, references|
        file.puts link + ":"
        references.each { |ref| file.puts " " + ref }
        file.puts
      end
    end
  end

  def run
    read_original_rss
    post_items
    scan_content_for_links
    write_htaccess
    write_unresolved_links_list
  end
end

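Conclusion

As a Ruby newcomer, I had to learn a lot to get this done. Writing the program took me about 8 hours (today I could rewrite it in under 2). A good deal of additional time went into learning how XML-RPC works and how to add content to .Text.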
