larbin 2.6.2 Source Code Walkthrough (1)

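 As the previous article explained, the point of reading larbin's source code is to process the information it downloads. That requires reading part of the code.
  First, and most importantly, here is how the larbin documentation describes customization:
  In order to customize larbin according to your needs, you have to create a useroutput file (see src/interf/useroutput.cc). This file must define the 4 following functions :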
  • void loaded (html *page) : This function is called when the fetch ended with success. From the page object, you can
    • get the url of the page by calling the method getUrl()
    • get the content of the page by calling the method getPage()
    • get the list of the sons by calling the method getLinks() (if options.h includes "#define LINKS_INFO")
    • get the http headers by calling the method getHeaders()
    • get the tag with getUrl()->tag (if options.h includes "#define URL_TAGS")
    For more details, see src/fetcher/file.h (for html), src/utils/url.h, src/utils/Vector.h.
  • void failure (url *u, FetchError reason) : This function is called when the fetch ended with an error. u describes the url of the page. A description of its class can be found in src/utils/url.h. reason explains why the fetch failed. enum FetchError is defined in src/types.h.
  • void initUserOutput () : Function for initialising all your data, called after all other initialisations
  • void outputStats(int fds) : This function is called from the webserver if you want to track some data. fds is the file descriptor on which you must write to exchange with the net. This function is called in another thread than the main one with no lock at all, so be careful!
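The four hooks described above can be sketched as follows. This is only a minimal illustration, not larbin's own code: the html, url and FetchError definitions are simplified stand-ins so that the example compiles on its own, the enum values are made up, and formatStats is a hypothetical helper. Only the four function signatures and the method names getUrl(), getPage() and getHeaders() come from the documentation quoted above. Because outputStats is called from another thread without any lock, the sketch keeps its counters in std::atomic variables.

```cpp
// Sketch of a useroutput.cc-style module. html, url and FetchError below are
// simplified stand-ins (assumptions), not larbin's real definitions.
#include <atomic>
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#include <unistd.h>  // write() (POSIX)

struct url {   // stand-in for the class described in src/utils/url.h
  std::string text;
};

struct html {  // stand-in for the class described in src/fetcher/file.h
  url here;
  std::string headers;
  std::string body;
  url *getUrl() { return &here; }
  const char *getHeaders() { return headers.c_str(); }
  const char *getPage() { return body.c_str(); }
};

enum FetchError { exampleTimeout, exampleNoConnection };  // made-up values

// outputStats runs in another thread with no lock, so the counters are atomic.
std::atomic<unsigned long> pagesOk{0};
std::atomic<unsigned long> pagesFailed{0};
std::atomic<unsigned long> bytesSeen{0};

// Called once, after all other initialisations.
void initUserOutput() {
  pagesOk = 0;
  pagesFailed = 0;
  bytesSeen = 0;
}

// Called when a fetch ended with success.
void loaded(html *page) {
  assert(page != nullptr);
  ++pagesOk;
  bytesSeen += std::strlen(page->getPage());
}

// Called when a fetch ended with an error.
void failure(url *u, FetchError reason) {
  assert(u != nullptr);
  ++pagesFailed;
  std::fprintf(stderr, "fetch failed: %s (reason %d)\n",
               u->text.c_str(), static_cast<int>(reason));
}

// Hypothetical helper: renders the counters as one line of text.
std::string formatStats() {
  return "pages ok: " + std::to_string(pagesOk.load()) +
         ", failed: " + std::to_string(pagesFailed.load()) +
         ", bytes: " + std::to_string(bytesSeen.load()) + "\n";
}

// Called from the webserver thread; fds is the descriptor to write to.
void outputStats(int fds) {
  const std::string line = formatStats();
  // A short or failed write is simply ignored in this sketch.
  ssize_t written = write(fds, line.data(), line.size());
  (void)written;
}
```

In a real build these functions would live in src/interf/useroutput.cc and include larbin's own headers instead of the stand-in types.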
   Let us start with the data structure of the html class, which inherits from the file class.
   The relevant code is shown below:
class file {
 protected:
  // link to the buffer of our connexion
  char *buffer;
  // parsing position
  char *posParse;
 public:
  // Constructor
  file (Connexion *conn);
  // Destructor
  virtual ~file ();
  // Is it a robots.txt
  bool isRobots;
  // current position in the buffer
  uint pos;
  // a string arrives from the server
  virtual int inputHeaders (int size) = 0; // just parse headers
  virtual int endInput () = 0;
};

class html : public file {
 private:
  // Where are we
  url *here;
  // beginning of the current interesting area
  char *area;
  // beginning of the real content (end of the headers + 1)
  char *contentStart;
  // base of the URL
  url *base;
  /* manage a new url : verify and send it */
  void manageUrl (url *nouv);

  /* All the following functions are used for parsing
   * they return 0 if OK, 1 if problem occurs (errno is set) */
  // parse the answer code line
  int parseCmdline ();
  // parse a line of header (ans 30X) => just look for location
  int parseHeader30X ();
  // parse a line of header
  int parseHeader ();
  // functions for parsing headers called by parseHeader
  int verifType ();
  int verifLength ();
  /* The following functions are called by endInput
   * for parsing the content of the file */
  // enter a html section
  void parseHtml ();
  // enter a comment
  void parseComment ();
  // enter a tag
  void parseTag ();
  // enter a tag content
  void parseContent (int action);
};
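To make the relationship between the two classes more concrete, here is a small self-contained sketch of the same pattern: an abstract base with the pure virtual inputHeaders and endInput, and a derived class that records where the content starts once the blank line ending the HTTP headers has arrived. MiniFile, MiniHtml and content() are hypothetical names invented for this illustration, and the parsing logic is an assumption about how such a class could work, not a copy of larbin's implementation. It follows the convention stated in the comments above of returning 0 if OK and 1 if a problem occurs.

```cpp
// Hypothetical miniature of the file/html hierarchy; not larbin's code.
#include <algorithm>
#include <cassert>
#include <cstring>

class MiniFile {
 protected:
  char *buffer;  // link to the receive buffer (owned by the caller)
 public:
  explicit MiniFile(char *buf) : buffer(buf), pos(0) {}
  virtual ~MiniFile() {}
  unsigned pos;  // number of bytes received so far
  virtual int inputHeaders(int size) = 0;  // `size` new bytes have arrived
  virtual int endInput() = 0;              // the transfer is finished
};

class MiniHtml : public MiniFile {
  char *contentStart = nullptr;  // first byte after the headers, if known
 public:
  explicit MiniHtml(char *buf) : MiniFile(buf) {}

  // Returns 0 if OK. Looks for the blank line that ends the headers,
  // searching only the bytes received so far.
  int inputHeaders(int size) override {
    assert(size >= 0);
    pos += static_cast<unsigned>(size);
    if (contentStart == nullptr) {
      static const char sep[] = "\r\n\r\n";
      char *last = buffer + pos;
      char *hit = std::search(buffer, last, sep, sep + 4);
      if (hit != last) contentStart = hit + 4;
    }
    return 0;
  }

  // Returns 0 if the headers were complete, 1 otherwise.
  int endInput() override { return contentStart != nullptr ? 0 : 1; }

  // Start of the body, or nullptr while the headers are still incomplete.
  const char *content() const { return contentStart; }
};
```

The derived class only has to override the two virtual functions; the code that feeds data in can work through a MiniFile pointer without knowing which kind of document it is handling.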

Then, inside the loaded(html *page) function, you can process the downloaded file.
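As one example of such processing, the sketch below pulls the text of the title element out of a page. extractTitle and findNoCase are hypothetical helpers written for this illustration; they are not part of larbin. The function takes a plain C string, which is assumed to be the kind of value getPage() returns, so it could be called from loaded() as extractTitle(page->getPage()). It is a naive string scan, not a real HTML parser.

```cpp
// Hypothetical helpers for post-processing a fetched page; not part of larbin.
#include <cassert>
#include <cctype>
#include <string>

// Case-insensitive search; `needle` must be non-empty and already lowercase.
static std::size_t findNoCase(const std::string &hay,
                              const std::string &needle, std::size_t from) {
  assert(!needle.empty());
  for (std::size_t i = from; i + needle.size() <= hay.size(); ++i) {
    std::size_t j = 0;
    while (j < needle.size() &&
           std::tolower(static_cast<unsigned char>(hay[i + j])) == needle[j]) {
      ++j;
    }
    if (j == needle.size()) return i;
  }
  return std::string::npos;
}

// Returns the text between <title> and </title>, or "" if there is none.
std::string extractTitle(const char *page) {
  const std::string text(page != nullptr ? page : "");
  const std::string openTag = "<title>";
  std::size_t open = findNoCase(text, openTag, 0);
  if (open == std::string::npos) return "";
  open += openTag.size();
  const std::size_t close = findNoCase(text, "</title>", open);
  if (close == std::string::npos) return "";
  return text.substr(open, close - open);
}
```

Keeping the processing in a function that takes only a C string makes it easy to test outside the crawler before wiring it into loaded().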
   