I saw someone on Twitter recommend a gentleman's [url=http://www.douban.com/people/1272884/notes]diary on Douban[/url]. I've been reading it for quite a while now, and it really is good. I happened to be playing with Hpricot recently, so I wrote a small program to crawl down all of this gentleman's diary entries.
require 'rubygems'
require 'hpricot'
require 'open-uri'

# Save one diary entry as its own txt file: the title on the first
# line, the body after it. Text is converted to GBK before writing.
def write_file(file_content, title)
  file_name = "E:\\" + title + ".txt"
  File.open(file_name, "w+") do |file|
    file.puts title
    file.puts file_content
  end
end

# Fetch a single note page, pull out the body (pre.note) and the
# title (the h3 inside div.note-header), and save them.
def get_content_and_title(target_url)
  doc = Hpricot(open(target_url))
  content = doc.search("pre.note")
  title = doc.search("div.note-header")
  write_file(content.inner_html.encode("GBK"),
             title.at("h3").inner_html.encode("GBK"))
end

# Walk one listing page: each entry carries an id like "note-12345",
# from which the note URL is rebuilt and crawled.
def get_article_url(articles_url)
  doc = Hpricot(open(articles_url))
  doc.search("div.article").each do |ab|
    ab.children.each do |cd|
      next unless cd.respond_to?(:attributes)  # skip plain text nodes
      attribute = cd.attributes['id']
      if !attribute.nil? && attribute.include?("note-")
        id = attribute.split("-")[1]
        get_content_and_title("http://www.douban.com/note/" + id)
      end
    end
  end
end

# Crawl the current listing page, then recurse into the "next page"
# link (the a inside span.next) until there is none.
def get_pages(articles_url)
  puts articles_url
  get_article_url(articles_url)
  doc = Hpricot(open(articles_url))
  next_page = doc.search("span.next").at("a")
  get_pages(next_page.attributes["href"]) if next_page
end

get_pages("http://www.douban.com/people/1272884/notes")
Running the program above crawls down all of Mr. 风行水上's diary entries, one txt file per entry, so you can savor them at your leisure.
I've packaged the files up and uploaded them, so readers who don't care about the code but do care about the articles can have a look :D
PS: I'm a Java guy, so please forgive me if the code reads rather like Java.
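Two small details of the script are worth spelling out: the note id is parsed out of the `id="note-…"` attribute on each listing entry, and the text is converted to GBK before saving (on a modern Ruby that conversion is just `String#encode`). A standalone sketch of both steps, using only the Ruby standard library — the attribute value and title below are made up for illustration:

```ruby
# A hypothetical id attribute, as it appears on a douban listing page.
attribute = "note-123456789"

# The note id is everything after the "note-" prefix; the crawler
# rebuilds the note URL from it.
id = attribute.split("-")[1]
url = "http://www.douban.com/note/" + id
puts url  # => http://www.douban.com/note/123456789

# Titles and bodies arrive as UTF-8; String#encode produces the GBK
# bytes that get written into the txt files.
title = "日记"
gbk_title = title.encode("GBK")
puts gbk_title.encoding  # => GBK
```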
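The pagination in `get_pages` is meant to follow the "next page" link recursively until the last listing page. The same recursive shape can be sketched against a stubbed page graph, with no network involved — the URLs below are made up:

```ruby
# Stub pages: each URL maps to its "next page" URL (nil = last page).
PAGES = {
  "/notes"          => "/notes?start=10",
  "/notes?start=10" => "/notes?start=20",
  "/notes?start=20" => nil,
}

visited = []

# Same recursive shape as get_pages: process the current page first,
# then recurse into the next link only if one exists.
crawl = lambda do |url|
  visited << url
  next_url = PAGES[url]
  crawl.call(next_url) if next_url
end

crawl.call("/notes")
puts visited.length  # => 3
```

Recursing only when a next link exists is what terminates the crawl on the final page.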