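As the previous article explained, the point of reading the larbin source code is to be able to process the information it downloads. Doing that requires reading part of the code.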
First, and most important, is the description of customization in the larbin documentation:
In order to customize larbin according to your needs, you have to create a useroutput file (see src/interf/useroutput.cc). This file must define the 4 following functions :
- void loaded (html *page) : This function is called when the fetch ended with success. From the page object, you can
  - get the url of the page by calling the method getUrl()
  - get the content of the page by calling the method getPage()
  - get the list of the sons by calling the method getLinks() (if options.h includes "#define LINKS_INFO")
  - get the http headers by calling the method getHeaders()
  - get the tag with getUrl()->tag (if options.h includes "#define URL_TAGS")
- void failure (url *u, FetchError reason) : This function is called when the fetch ended by an error. u describes the url of the page. A description of its class can be found in src/utils/url.h. reason explains why the fetch failed. enum FetchError is defined in src/types.h.
- void initUserOutput () : Function for initialising all your data, called after all other initialisations
- void outputStats(int fds) : This function is called from the webserver if you want to track some data. fds is the file descriptor on which you must write to exchange with the net. This function is called in another thread than the main one with no lock at all, so be careful!
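The four functions described above can be sketched as follows. This is only a minimal illustration, not larbin's actual useroutput.cc: the `url`, `html` and `FetchError` definitions here are hypothetical stand-ins so that the example compiles on its own (the real ones live in larbin's headers), the counters and the output format are invented for this sketch, and `getPage()` is assumed to return a NUL-terminated string. Because the documentation warns that outputStats runs in another thread with no lock, the counters are kept in std::atomic variables.

```cpp
#include <atomic>
#include <cstdio>
#include <cstring>
#include <string>
#include <unistd.h>  // POSIX read()/write()/pipe()

// ---- Hypothetical stand-ins, NOT larbin's real declarations ----
enum FetchError { placeholderError };  // the real enum is in src/types.h

struct url {
  std::string text;  // assumed storage, for this sketch only
};

struct html {
  url *here;
  std::string headers;
  std::string body;
  url *getUrl() { return here; }
  const char *getHeaders() const { return headers.c_str(); }
  const char *getPage() const { return body.c_str(); }
};

// ---- The four user-defined functions ----
// Counters are atomic because outputStats may run in another thread.
static std::atomic<long> pagesLoaded(0);
static std::atomic<long> pagesFailed(0);
static std::atomic<long> bytesLoaded(0);

// Called once, after all other initialisations.
void initUserOutput() {
  pagesLoaded = 0;
  pagesFailed = 0;
  bytesLoaded = 0;
}

// Called when a fetch ended with success.
void loaded(html *page) {
  pagesLoaded++;
  bytesLoaded += static_cast<long>(std::strlen(page->getPage()));
}

// Called when a fetch ended with an error.
void failure(url *u, FetchError reason) {
  (void)u;       // unused in this sketch
  (void)reason;  // unused in this sketch
  pagesFailed++;
}

// Called from the webserver thread; writes one line of statistics to fds.
void outputStats(int fds) {
  char line[128];
  int n = std::snprintf(line, sizeof line, "loaded=%ld failed=%ld bytes=%ld\n",
                        pagesLoaded.load(), pagesFailed.load(),
                        bytesLoaded.load());
  if (n > 0) {
    // A failed write is ignored here; a real implementation might log it.
    if (write(fds, line, static_cast<std::size_t>(n)) < 0) {
    }
  }
}
```

In larbin itself these functions are called by the crawler, so no main() is needed; only the bodies of the four functions would be adapted.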
The relevant code is shown below:
class file {
 protected:
  // link to the buffer of our connexion
  char *buffer;
  // parsing position
  char *posParse;
 public:
  // Constructor
  file (Connexion *conn);
  // Destructor
  virtual ~file ();
  // Is it a robots.txt
  bool isRobots;
  // current position in the buffer
  uint pos;
  // a string arrives from the server
  virtual int inputHeaders (int size) = 0; // just parse headers
  virtual int endInput () = 0;
};
class html : public file {
 private:
  // Where are we
  url *here;
  // beginning of the current interesting area
  char *area;
  // beginning of the real content (end of the headers + 1)
  char *contentStart;
  // base of the URL
  url *base;
  /* manage a new url : verify and send it */
  void manageUrl (url *nouv);
  /* All the following functions are used for parsing
   * they return 0 if OK, 1 if problem occurs (errno is set) */
  // parse the answer code line
  int parseCmdline ();
  // parse a line of header (ans 30X) => just look for location
  int parseHeader30X ();
  // parse a line of header
  int parseHeader ();
  // functions for parsing headers called by parseHeader
  int verifType ();
  int verifLength ();
  /* The following functions are called by endInput
   * for parsing the content of the file */
  // enter a html section
  void parseHtml ();
  // enter a comment
  void parseComment ();
  // enter a tag
  void parseTag ();
  // enter a tag content
  void parseContent (int action);
};
With that in place, the downloaded file can be processed inside the loaded(html *page) function.
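As an example of such processing, here is a small sketch that pulls the page title out of the downloaded content. Both `extractTitle` and `findNoCase` are hypothetical helpers written for this illustration, not part of larbin. The sketch assumes the page content is available as a NUL-terminated string (for instance from page->getPage()) and it only handles a plain <title> tag without attributes.

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper: case-insensitive substring search.
// Returns a pointer to the first match in hay, or nullptr if there is none.
static const char *findNoCase(const char *hay, const char *needle) {
  std::size_t n = std::strlen(needle);
  for (; *hay; ++hay) {
    std::size_t i = 0;
    while (i < n && hay[i] &&
           std::tolower(static_cast<unsigned char>(hay[i])) ==
               std::tolower(static_cast<unsigned char>(needle[i]))) {
      ++i;
    }
    if (i == n) return hay;
  }
  return nullptr;
}

// Hypothetical helper: returns the text between <title> and </title>,
// or an empty string when either tag is missing.
std::string extractTitle(const char *page) {
  const char *open = findNoCase(page, "<title>");
  if (!open) return "";
  open += std::strlen("<title>");
  const char *close = findNoCase(open, "</title>");
  if (!close) return "";
  return std::string(open, close);
}
```

Inside loaded(html *page), a call such as extractTitle(page->getPage()) would then give the title of each fetched page, which could be stored together with the URL.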