larbin 2.6.2 Source Code Walkthrough (1)

caohao2008

Tags: parsing function file url destructor descriptor

Article link: https://blog.csdn.net/caohao2008/article/details/1663281

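As the previous article explained, the point of reading larbin's source code is to be able to process the information it downloads. To do that, we need to read part of the code.
First, and most important, is the description of customization in the larbin documentation:
In order to customize larbin according to your needs, you have to create a useroutput file (see src/interf/useroutput.cc). This file must define the 4 following functions :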

void loaded (html *page) : This function is called when the fetch ended with success. From the page object, you can
- get the url of the page by calling the method getUrl()
- get the content of the page by calling the method getPage()
- get the list of the sons by calling the method getLinks() (if options.h includes "#define LINKS_INFO")
- get the http headers by calling the method getHeaders()
- get the tag with getUrl()->tag (if options.h includes "#define URL_TAGS")
For more details, see src/fetcher/file.h (for html), src/utils/url.h, src/utils/Vector.h.
void failure (url *u, FetchError reason) : This function is called when the fetch ended by an error. u describes the url of the page. A description of its class can be found in src/utils/url.h. reason explains why the fetch failed. enum FetchError is defined in src/types.h.
void initUserOutput () : Function for initialising all your data, called after all other initialisations
void outputStats(int fds) : This function is called from the webserver if you want to track some data. fds is the file descriptor on which you must write to exchange with the net. This function is called in another thread than the main one with no lock at all, so be careful!
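The four functions described above can be sketched as a small statistics collector. This is only a minimal sketch under stated assumptions: the url, html and FetchError types below are simplified stand-ins written for this example, not larbin's real declarations (those live in src/utils/url.h, src/fetcher/file.h and src/types.h), and the counters and the enum value name are invented. Because outputStats runs in another thread without any lock, the sketch keeps its counters in std::atomic variables.

```cpp
// Sketch of a useroutput file. All types here are simplified STAND-INS
// (assumptions made for this example), not larbin's real classes.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <unistd.h>  // write(), ssize_t

// Hypothetical stand-in for larbin's url class (src/utils/url.h).
struct url {
  std::string text;
};

// Hypothetical stand-in for larbin's html class (src/fetcher/file.h).
// Only two of the accessors named in the documentation are modelled.
struct html {
  url here;
  std::string content;
  url *getUrl() { return &here; }
  const char *getPage() { return content.c_str(); }
};

// Stand-in for the enum from src/types.h; the value name is invented.
enum FetchError { hypotheticalError };

// Counters shared between the fetch callbacks and outputStats().
// They are atomic because outputStats() is called from another thread.
static std::atomic<unsigned long> pagesLoaded{0};
static std::atomic<unsigned long> pagesFailed{0};
static std::atomic<unsigned long> bytesLoaded{0};

// Called once, after all other initialisations.
void initUserOutput() {
  pagesLoaded = 0;
  pagesFailed = 0;
  bytesLoaded = 0;
}

// Called when a fetch ended with success.
void loaded(html *page) {
  ++pagesLoaded;
  bytesLoaded += std::strlen(page->getPage());
}

// Called when a fetch ended with an error.
void failure(url *u, FetchError reason) {
  (void)u;       // a real implementation could log the url
  (void)reason;  // and the reason of the failure
  ++pagesFailed;
}

// Called from the webserver thread; fds is the descriptor to write to.
void outputStats(int fds) {
  char line[128];
  int n = std::snprintf(line, sizeof line,
                        "loaded: %lu, failed: %lu, bytes: %lu\n",
                        pagesLoaded.load(), pagesFailed.load(),
                        bytesLoaded.load());
  if (n > 0) {
    ssize_t written = write(fds, line, static_cast<std::size_t>(n));
    (void)written;  // the sketch ignores short writes and errors
  }
}
```

In a real build these functions would be compiled against larbin's own headers instead of the stand-ins, and larbin itself would call them; calling outputStats(1) by hand simply prints the counters to standard output.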

First, look at the data structure of the html class, which inherits from the file class.
The relevant code is shown below:
class file {
 protected:
  // link to the buffer of our connexion
  char *buffer;
  // parsing position
  char *posParse;
 public:
  // Constructor
  file (Connexion *conn);
  // Destructor
  virtual ~file ();
  // Is it a robots.txt
  bool isRobots;
  // current position in the buffer
  uint pos;
  // a string arrives from the server
  virtual int inputHeaders (int size) = 0; // just parse headers
  virtual int endInput () = 0;
};

class html : public file {
 private:
  // Where are we
  url *here;
  // beginning of the current interesting area
  char *area;
  // beginning of the real content (end of the headers + 1)
  char *contentStart;
  // base of the URL
  url *base;
  /* manage a new url : verify and send it */
  void manageUrl (url *nouv);

  /* All the following functions are used for parsing
   * they return 0 if OK, 1 if problem occurs (errno is set) */
  // parse the answer code line
  int parseCmdline ();
  // parse a line of header (answer 30X) => just look for location
  int parseHeader30X ();
  // parse a line of header
  int parseHeader ();
  // functions for parsing headers called by parseHeader
  int verifType ();
  int verifLength ();
  /* The following functions are called by endInput
   * for parsing the content of the file */
  // enter a html section
  void parseHtml ();
  // enter a comment
  void parseComment ();
  // enter a tag
  void parseTag ();
  // enter a tag content
  void parseContent (int action);
};
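The inheritance relationship in the listing can be tried out in isolation with a much smaller pair of classes. The sketch below is an illustration only: FileSketch and HtmlSketch are hypothetical names, the constructor takes a plain character buffer instead of a Connexion, and the header handling is reduced to locating the blank line that ends the headers. The return convention (0 if OK, 1 if a problem occurs) is borrowed from the comment on the parsing functions; whether larbin's inputHeaders follows the same convention is an assumption.

```cpp
// Simplified sketch of the file/html relationship. FileSketch and HtmlSketch
// are HYPOTHETICAL names; larbin's real classes carry much more state.
#include <cstring>
#include <string>

class FileSketch {
 protected:
  // stand-in for the link to the connection buffer
  const char *buffer;
 public:
  explicit FileSketch(const char *buf) : buffer(buf) {}
  virtual ~FileSketch() {}
  // Assumed convention: return 0 if OK, 1 if a problem occurs.
  virtual int inputHeaders(int size) = 0;
  virtual int endInput() = 0;
};

class HtmlSketch : public FileSketch {
 private:
  // beginning of the real content (end of the headers + 1)
  const char *contentStart = nullptr;
 public:
  explicit HtmlSketch(const char *buf) : FileSketch(buf) {}

  // Look for the empty line ("\r\n\r\n") that terminates the headers
  // within the first `size` bytes of the buffer.
  int inputHeaders(int size) override {
    for (int i = 0; i + 3 < size; ++i) {
      if (std::memcmp(buffer + i, "\r\n\r\n", 4) == 0) {
        contentStart = buffer + i + 4;
        return 0;
      }
    }
    return 1;  // headers are not complete
  }

  // Succeeds only if the headers were found earlier.
  int endInput() override { return contentStart != nullptr ? 0 : 1; }

  // Content after the headers, or an empty string if none was found.
  std::string body() const {
    return contentStart != nullptr ? std::string(contentStart) : std::string();
  }
};
```

Code that holds only a FileSketch pointer can call inputHeaders and endInput without knowing the concrete type, which is the reason the base class declares them as pure virtual functions.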

Then, inside the loaded(html *page) function, you can process the downloaded file.
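As a rough illustration of such processing, the sketch below turns each downloaded page into a simple text record. It rests on several assumptions: url and html are simplified stand-ins written for this example rather than larbin's real classes, formatRecord is a hypothetical helper, and the record layout is invented.

```cpp
// Sketch of page processing inside loaded(). The url and html types are
// simplified STAND-INS (assumptions), not larbin's real classes.
#include <cstring>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for larbin's url class.
struct url {
  std::string text;
};

// Hypothetical stand-in for larbin's html class.
struct html {
  url here;
  std::string content;
  url *getUrl() { return &here; }
  const char *getPage() { return content.c_str(); }
};

// Hypothetical helper: build one record (url, length, content) for a page.
std::string formatRecord(html *page) {
  const char *content = page->getPage();
  std::ostringstream out;
  out << "URL: " << page->getUrl()->text << '\n'
      << "Length: " << std::strlen(content) << '\n'
      << content << "\n\n";
  return out.str();
}

// Called when a fetch ended with success; here it just prints the record.
void loaded(html *page) {
  std::cout << formatRecord(page);
}
```

Keeping the formatting in a separate function makes it possible to check the record without running a crawl; loaded() then only decides where the record goes (standard output here, a file or a database in a real setup).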
