I saw someone on Twitter recommend a gentleman's [url=http://www.douban.com/people/1272884/notes]diary on Douban[/url]. I've been reading it for quite a while now, and it really is good. I happened to be playing with Hpricot recently, so I wrote a small program to crawl down all of this gentleman's diary entries.
require 'rubygems'
require 'hpricot'
require 'open-uri'

# Save one diary entry as its own txt file: the title on the first
# line, the body after it. Text is converted to GBK before writing.
def write_file(file_content, title)
  file_name = "E:\\" + title + ".txt"
  File.open(file_name, "w+") do |file|
    file.puts title
    file.puts file_content
  end
end

# Fetch a single note page, pull out the body (pre.note) and the
# title (the h3 inside div.note-header), and save them.
def get_content_and_title(target_url)
  doc = Hpricot(open(target_url))
  content = doc.search("pre.note")
  title = doc.search("div.note-header")
  write_file(content.inner_html.encode("GBK"),
             title.at("h3").inner_html.encode("GBK"))
end

# Walk one listing page: each entry carries an id like "note-12345",
# from which the note URL is rebuilt and crawled.
def get_article_url(articles_url)
  doc = Hpricot(open(articles_url))
  doc.search("div.article").each do |ab|
    ab.children.each do |cd|
      next unless cd.respond_to?(:attributes)  # skip plain text nodes
      attribute = cd.attributes['id']
      if !attribute.nil? && attribute.include?("note-")
        id = attribute.split("-")[1]
        get_content_and_title("http://www.douban.com/note/" + id)
      end
    end
  end
end

# Crawl the current listing page, then recurse into the "next page"
# link (the a inside span.next) until there is none.
def get_pages(articles_url)
  puts articles_url
  get_article_url(articles_url)
  doc = Hpricot(open(articles_url))
  next_page = doc.search("span.next").at("a")
  get_pages(next_page.attributes["href"]) if next_page
end

get_pages("http://www.douban.com/people/1272884/notes")
Running the program above crawls down all of Mr. 风行水上's diary entries, one txt file per entry, so you can savor them at your leisure.
I've packaged the files up and uploaded them, so readers who don't care about the code but do care about the articles can have a look :D
PS: I'm a Java guy, so please forgive me if the code reads rather like Java.
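Two small details of the script are worth spelling out: the note id is parsed out of the `id="note-…"` attribute on each listing entry, and the text is converted to GBK before saving (on a modern Ruby that conversion is just `String#encode`). A standalone sketch of both steps, using only the Ruby standard library — the attribute value and title below are made up for illustration:

```ruby
# A hypothetical id attribute, as it appears on a douban listing page.
attribute = "note-123456789"

# The note id is everything after the "note-" prefix; the crawler
# rebuilds the note URL from it.
id = attribute.split("-")[1]
url = "http://www.douban.com/note/" + id
puts url  # => http://www.douban.com/note/123456789

# Titles and bodies arrive as UTF-8; String#encode produces the GBK
# bytes that get written into the txt files.
title = "日记"
gbk_title = title.encode("GBK")
puts gbk_title.encoding  # => GBK
```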
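The pagination in `get_pages` is meant to follow the "next page" link recursively until the last listing page. The same recursive shape can be sketched against a stubbed page graph, with no network involved — the URLs below are made up:

```ruby
# Stub pages: each URL maps to its "next page" URL (nil = last page).
PAGES = {
  "/notes"          => "/notes?start=10",
  "/notes?start=10" => "/notes?start=20",
  "/notes?start=20" => nil,
}

visited = []

# Same recursive shape as get_pages: process the current page first,
# then recurse into the next link only if one exists.
crawl = lambda do |url|
  visited << url
  next_url = PAGES[url]
  crawl.call(next_url) if next_url
end

crawl.call("/notes")
puts visited.length  # => 3
```

Recursing only when a next link exists is what terminates the crawl on the final page.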