Crawling Douban -- v1.0

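     As the Book of Documents has it: what you will not give me, I shall fetch for myself. Everything in the world today belongs to us all, so today I take up my pen to share with everyone!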

     In this article, I want to illustrate how to crawl the Douban website. If any detail is unclear, just leave a message.

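     Today I want to talk about how to crawl Douban's resources. A disclaimer first: I did not originally want to crawl Douban, since that is not good for the site, but the API it provides has so many restrictions that it seriously cramped my imagination, so I decided to write my own program to extract its data. Readers are welcome to study this article and the program, but please do not use them to challenge Douban.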

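     Recently I needed to build a video website. A site of that kind needs a large amount of video resource data, and nobody is going to hand such data to you, so you have to prepare all of it yourself. In this article I will only discuss preparing the data, not how to build the website itself. Violators will be banished to guard the frontier!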

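       Douban does in fact provide an API for developers, through which they can retrieve some metadata about movies and videos. For an ordinary third-party developer, however, the data this API returns is quite limited and cannot meet typical needs. Is there another route? Yes: a crawler. A crawler is simply a program you write yourself that traverses Douban's video pages and extracts the fields you need from them. An important part of the well-known Google search engine is its Google-Spider, which continually traverses the Internet; on every page it reaches it extracts some key information, and the search engine provides its retrieval features on the basis of that information.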
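To make the idea of extracting fields from a page concrete, here is a minimal sketch of the extraction step on its own, without any network access. The HTML fragment is invented for illustration and real pages may differ; extract_genres is a hypothetical helper name and is not part of the class listed later. The pattern is the same kind of expression the class uses in getGenresa.

```php
<?php
// Hypothetical helper: pull every genre out of an HTML fragment.
// The markup passed in below is made up for illustration only.
function extract_genres(string $html): array
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so the range \x{4e00}-\x{9fa5} matches common Chinese characters.
    $pattern = '/<span property="v:genre">([\x{4e00}-\x{9fa5}]+)<\/span>/u';
    if (preg_match_all($pattern, $html, $matches)) {
        // $matches[1] holds capture group 1 of every match
        return $matches[1];
    }
    // no genre found
    return array();
}

$sample = '<div id="info">'
        . '<span property="v:genre">剧情</span> / '
        . '<span property="v:genre">科幻</span>'
        . '</div>';

print_r(extract_genres($sample));
```

A real crawler would apply the same step to the HTML it downloads instead of to a fixed string.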

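     Crawler design: the crawler is written in PHP and was implemented in a LAMP environment on Linux. PHP is cross-platform, so readers can run the code unchanged on Windows. The crawler is implemented mainly by the class Videometadata, which extracts the information using the HTML DOM (Document Object Model) and regular expressions. The code is as follows: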

<?php
/*
 * Author: Archer
 * Date: 20/Jun/2013
 * File: videometadata.php
 * Des: this file is responsible for getting the video meta data from the video
 *	info page in the douban website.
 *
 * Note: at the beginning, I used the html dom to solve that problem, but that is
 *	not robust, so this time I am going to use regular expressions to solve the
 *	problem. -- 20/Jun/2013
 *
 * Produced By CSRG.
 */

require_once ('simple_html_dom.php');
require_once ('netlib.php');

class Videometadata {
	/*
	 * Des: some useful properties.
	 */
	public $html = FALSE;
	public $htmldom = '';
	public $doubanid = 0;
	
	/*
	 * Des: fetch the page, build the html dom and record the douban id.
	 */
	public function __construct ($url) {
		// TO-DO
		// use my own function get the html contents
		$this->html = get_data($url);
		$this->htmldom = str_get_html($this->html);
		// following two line for doubanid
		$tmp = explode('/', $url);
		$this->doubanid = $tmp[4];
		//$this->htmldom = file_get_html($url);
		//echo "Running construct method ...\n";
	}
	
	/**
	 * DES: method getPoster is used to get the main poster from the html dom
	 * @parm: none
	 * @return: url of the poster link
	 * Note: at the beginning, I thought I didn't need to write this method to
	 *	get the poster, because there is a programme dedicated to getting all the
	 *	related photos, but sometimes it is not that reliable.
	 */
	public function getPoster () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmpdom = $this->htmldom->getElementById('mainpic');
		if ( !is_object($tmpdom)) {
			return FALSE;
		}
		
		// here use regular expression, for html dom is not reliable sometimes
		$partern = '/<img src="(.*)" .*\/>/';
		$partern2 = '/<img src="(.*)"[ ]*title=".*"[ ]*alt=".*"[ ]*rel=".*"[ ]*\/>/';
		if (preg_match($partern2, $tmpdom, $matches)) {
			// yes, matched
			//print_r($matches);
			return $matches[1];
		} else {
			// no, not matched
			return FALSE;
		}
	}
	/*
	 * Des: method getTitle is used to get the title from the html dom.
	 * @parm: none
	 * @return: title string
	 */
	public function getTitle () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$titledom = $this->htmldom->find('h1', 0);
		if ( !is_object($titledom)) {
			// no h1 on the page
			return FALSE;
		}
		return $titledom->plaintext;
	}
	
	/*
	 * Des: method getDirectors is used to get the director from the html
	 *	dom.
	 * @parm: none
	 * @return: array of the directors and their links
	 */
	public function getDirectors () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the directors block of code
		$directorsdom = $this->htmldom->getElementById('info');
		if (is_object($directorsdom)) {
			$directorsdom = $directorsdom->childNodes(0);
		} else {
			return FALSE;
		}
		
		// sometimes old movies don't have much metadata
		if ( !is_object($directorsdom)) {
			return FALSE;
		}
		$directors = array();
		$index = 0;
		foreach ($directorsdom->find('a') as $element) {
			//echo $element->plaintext . "\n";
			$directors[$index]['director'] = $element->plaintext;
			//echo $element->href . "\n";
			$directors[$index]['link'] = $element->href;
			$index++;
		}
		return $directors;
	}
	
	/*
	 * Des: method getWriters is used to get the writers of the video from the
	 *	html dom.
	 * @parm: none
	 * @return: array of the writers and its links
	 */
	public function getWriters () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the writers block of code
		$writersdom = $this->htmldom->getElementById('info');
		if (is_object($writersdom)) {
			$writersdom = $writersdom->childNodes(2);
		} else {
			return FALSE;
		}
		// sometimes old movies don't have much metadata, so there is no childNodes(2) in writersdom.
		if ( !is_object($writersdom)) {
			return FALSE;
		}
		$writers = array();
		$index = 0;
		foreach ($writersdom->find('a') as $element) {
			//echo $element->plaintext . "\n";
			$writers[$index]['writer'] = $element->plaintext;
			//echo $element->href . "\n";
			$writers[$index]['link'] = $element->href;
			$index++;
		}
		return $writers;
	}
	
	/*
	 * Des: method getActors is used to get the actors from the video html
	 *	dom.
	 * @parm: none
	 * @return: array of the actors and links
	 */
	public function getActors () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the actors block of code
		$actorsdom = $this->htmldom->getElementById('info');
		if (is_object($actorsdom)) {
			$actorsdom = $actorsdom->childNodes(4);
		} else {
			return FALSE;
		}
		if ( !is_object($actorsdom)) {
			return FALSE;
		}
		$actors = array();
		$index = 0;
		foreach ($actorsdom->find('a') as $element) {
			//echo $element->plaintext . "\n";
			$actors[$index]['actor'] = $element->plaintext;
			//echo $element->href . "\n";
			$actors[$index]['link'] = $element->href;
			$index++;
		}
		return $actors;
	}
	
	/*
	 * Des: method getGenres is used to get the genres from the video html
	 *	dom.
	 * @parm: none
	 * @return: array of the genres
	 */
	public function getGenres () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the genres block of code
		$genresdom = $this->htmldom->getElementById('info');
		if (is_object($genresdom)) {
			$genresdom = $genresdom->childNodes(6);
		} else {
			return FALSE;
		}
		// test again
		if ( !is_object($genresdom)) {
			return FALSE;
		}
		$genres = array();
		$index = 0;
		// generally speaking, there are at most 10 genres
		for ($i = 0; $i < 10; $i++) {
			if ( !is_object($genresdom)) {
				// ran out of siblings, stop here
				break;
			}
			if ($genresdom->getAttribute('property') == "v:genre") {
				//echo $genresdom->plaintext . "\n";
				$genres[$index]['genre'] = $genresdom->plaintext;
				$index++;
			}
			$genresdom = $genresdom->next_sibling();
		}
		//$genres[$index]['genre'] = $genresdom->plaintext;
		return $genres;
	}
	
	/*
	 * Des: method getGenresa is used to get the genres from the video html
	 *	dom.
	 * @parm: none
	 * @return: array of the genres
	 * Note: method getGenres is not robust by using the html dom, so this
	 *	one I use regular expression do the dirty work.
	 */
	public function getGenresa () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('info');
		if ( !is_object($tmp)) {
			// no object
			return FALSE;
		}
		// the /u modifier lets the following pattern match utf-8 chinese characters
		$partern2 = '/<span property="v:genre">([\x{4e00}-\x{9fa5}]+)<\/span>/u';
		if (preg_match_all($partern2, $tmp, $matches)) {
			// matches
			//print_r($matches);
			//echo $matches[0];
			//echo "wow";
			return $matches[1];
		} else {
			// no match
			//echo "No\n";
			return FALSE;
		}
	}
	
	/*
	 * Des: method getOfficialLink is used to get the official link from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the link
	 */
	/*public function getOfficialLink () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the official link block of code
		$tmpdom = $this->htmldom->getElementById('info');
		if (is_object($tmpdom)) {
			echo "yes";
		} else {
			echo "no";
		}
		// get the child span relatednumbers
		$nodenum = count($this->htmldom->getElementById('info')->getElementsByTagName('span'));
		
		// locate to the exact location and get the content
		for ($i = 0; $i < $nodenum; $i++) {
			// jump over tags not belong to span.
			$tmp = $this->htmldom->find("#info", 0)->getElementByTagName('br');
			if (!$tmp) {
				$tmp = $tmp->next_sibling();
				//echo "skipping a tag that is not a span\n";
				$i--;
				continue;
			}
			// the dom method will append an blank space sometimes
			$str1 = $tmpdom->plaintext;
			//echo $str1;
			$str2 = "官方网站: ";
			if (strncmp ($str1, $str2, 12) == 0) {
				//echo $str1 . "\t" . strlen('官方网站:') . "\t";
				return $tmpdom->next_sibling()->plaintext;
			}
			//echo $i . "\t" . $tmpdom->plaintext . "\n";
			$tmpdom = $tmpdom->next_sibling();
		}
		
	}*/
	/*
	 * Des: method getOfficialLink is used to get the official link from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the link
	 * Note: use regular expression get the data you want.
	 */
	public function getOfficialLink () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('info');
		if (is_object($tmp)) {
			//echo "Yes\n";
			$tmp = $tmp->plaintext;
		} else {
			//echo "No\n";
			return FALSE;
		}
		// use reg get the thing you want
		$partern = '/官方网站:[ ]*(.*)/';
		if (preg_match($partern, $tmp, $matches)) {
			// matches
			return $matches[1];
		} else {
			// no match
			return FALSE;
		}
	}
	
	/*
	 * Des: method getCountries is used to get the countries from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the countries
	 * Note: countries are not included in any tag, so html dom can not solve this problem.
	 *	use regular expression to solve this problem.
	 *
	 */
	public function getCountries () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// put the block of code into the string
		$countries = $this->htmldom->getElementById('info');
		$partern = '/<span class="pl">制片国家\/地区:<\/span> (.*)<br\/>/';
		if (preg_match($partern, $countries, $matches)) {
			// Ah! finally function explode solved my problem.
			$countries = explode("<br/>", $matches[1], 2);
			return $countries[0];
		} else {
			return FALSE;
		}
	}
	
	/*
	 * Des: method getLanguages is used to get the languages from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the languages
	 * Note: languages are not included in any tag, so html dom can not solve this problem.
	 *	use regular expression to solve this problem.
	 *
	 */
	public function getLanguages () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// put the block of code into the string
		$countries = $this->htmldom->getElementById('info');
		$partern = '/<span class="pl">语言:<\/span> (.*)<br\/>/';
		if (preg_match($partern, $countries, $matches)) {
			// Ah! finally function explode solved my problem.
			$countries = explode("<br/>", $matches[1], 2);
			return $countries[0];
		} else {
			return FALSE;
		}
	}
	
	/*
	 * Des: method getPubDate is used to get the publish date from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the pub date
	 * Note: weird! not working on some tv series
	 */
	public function getPubDate () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the first span inside the info block
		$tmpdom = $this->htmldom->getElementById('info');
		if (is_object($tmpdom)) {
			$tmpdom = $tmpdom->getElementsByTagName('span', 0);
		} else {
			return FALSE;
		}
		if ( !is_object($tmpdom)) {
			return FALSE;
		}
		// get the child span numbers
		$nodenum = count($this->htmldom->getElementById('info')->getElementsByTagName('span'));
		// locate to the exact location and get the content
		for ($i = 0; $i < $nodenum; $i++) {
			// jump over tags not belong to span.
			if ($tmpdom->getElementByTagName('br')) {
				$tmpdom = $tmpdom->next_sibling();
				$i--;
				continue;
			}
			// the dom method will append an blank space sometimes
			$str1 = $tmpdom->plaintext;
			$str2 = "上映日期: ";
			if (strncmp ($str1, $str2, 12) == 0) {
				//echo $str1 . "\t" . strlen('官方网站:') . "\t";
				return $tmpdom->next_sibling()->plaintext;
			}
			//echo $i . "\t" . $tmpdom->plaintext . "\n";
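			$tmpdom = $tmpdom->next_sibling();
			if ( !is_object($tmpdom)) {
				// ran out of siblings before finding the label
				return FALSE;
			}
		}
		// label not found
		return FALSE;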
		
	}
	
	/*
	 * Des: method getPubDatea is used to get the pubdates from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the pubdate
	 * Note: method getPubdate is not robust by using the html dom, so this
	 *	one I use regular expression do the dirty work.
	 */
	public function getPubDatea () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('info');
		if ( !is_object($tmp)) {
			// no object
			//echo "No Object\n";
			return FALSE;
		}
		//echo $tmp;
		// match a release date of the form YYYY-MM-DD(...)
		//$partern = '/[ ]*<span property="v:initialReleaseDate" content=".*">(.*)<\/span><br\/>/';
		$partern2 = '/[ ]*<span property="v:initialReleaseDate" content=".*">([0-9]{4}-[0-9]{2}-[0-9]+\(.*\))<\/span><br\/>/';
		if (preg_match($partern2, $tmp, $matches)) {
			// matches
			//echo "Yes\n";
			//print_r($matches);
			//echo $matches[1];
			return $matches[1];
		} else {
			// no match
			//echo "No\n";
			return FALSE;
		}
	}
	
	/*
	 * Des: method getLength is used to get the video length from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the video length
	 * Note: weird! not working on some tv series
	 */
	public function getLength () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the first span inside the info block
		$tmpdom = $this->htmldom->getElementById('info');
		if (is_object($tmpdom)) {
			$tmpdom = $tmpdom->getElementsByTagName('span', 0);
		} else {
			return FALSE;
		}
		// get the child span numbers
		$nodenum = count($this->htmldom->getElementById('info')->getElementsByTagName('span'));
		if ( !is_object($tmpdom)) {
			return FALSE;
		}
		// locate to the exact location and get the content
		for ($i = 0; $i < $nodenum; $i++) {
			// jump over tags not belong to span.
			if ($tmpdom->getElementByTagName('br')) {
				$tmpdom = $tmpdom->next_sibling();
				$i--;
				continue;
			}
			// the dom method will append an blank space sometimes
			$str1 = $tmpdom->plaintext;
			$str2 = "片长: ";
			if (strncmp ($str1, $str2, 6) == 0) {
				return $tmpdom->next_sibling()->plaintext;
			}
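			$tmpdom = $tmpdom->next_sibling();
			if ( !is_object($tmpdom)) {
				// ran out of siblings before finding the label
				return FALSE;
			}
		}
		// label not found
		return FALSE;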
		
	}
	
	/*
	 * Des: method getLengtha is used to get the length from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the length
	 * Note: method getLength is not robust by using the html dom, so this
	 *	one I use regular expression do the dirty work.
	 */
	public function getLengtha () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('info');
		if ( !is_object($tmp)) {
			// no object
			//echo "No Object\n";
			return FALSE;
		}
		//echo $tmp;
		// capture the text of the v:runtime span
		$partern = '/[ ]*<span property="v:runtime" content=".*">(.*)<\/span><br\/>/';
		if (preg_match($partern, $tmp, $matches)) {
			// matches
			//echo "Yes\n";
			//print_r($matches);
			//echo $matches[0];
			return $matches[1];
		} else {
			// no match
			//echo "No\n";
			return FALSE;
		}
	}
	
	/*
	 * Des: method getAka is used to get the aka from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the aka
	 * Note: aka are not included in any tag, so html dom can not solve this problem.
	 *	use regular expression to solve this problem.
	 *
	 */
	public function getAka () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// put the block of code into the string
		$tmp = $this->htmldom->getElementById('info');
		$partern = '/<span class="pl">又名:<\/span> (.*)<br\/>/';
		if (preg_match($partern, $tmp, $matches)) {
			// Ah! finally function explode solved my problem.
			$akas = explode("<br/>", $matches[1], 2);
			return $akas[0];
		} else {
			return FALSE;
		}
	}
	
	/*
	 * Des: method getImdb is used to get the imdb id from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the imdb id
	 */
	/*public function getImdb () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		// locate to the official link block of code
		$tmpdom = $this->htmldom->getElementById('info');
		if (is_object($tmpdom)) {
			$tmpdom = $tmpdom->getElementsByTagName('span', 0);
		} else {
			return FALSE;
		}
		// get the child span numbers
		$nodenum = count($this->htmldom->getElementById('info')->getElementsByTagName('span'));
		if ( !is_object($tmpdom)) {
			return FALSE;
		}
		// locate to the exact location and get the content
		for ($i = 0; $i < $nodenum; $i++) {
			// jump over tags not belong to span.
			if ($tmpdom->getElementByTagName('br')) {
				$tmpdom = $tmpdom->next_sibling();
				$i--;
				continue;
			}
			// the dom method will append an blank space sometimes
			$str1 = $tmpdom->plaintext;
			$str2 = "IMDb链接: ";
			if (strncmp ($str1, $str2, 4) == 0) {
				return $tmpdom->next_sibling()->plaintext;
			}
			$tmpdom = $tmpdom->next_sibling();
		}
		
	}*/
	/*
	 * Des: method getImdb is used to get the imdb id from the video html
	 *	dom.
	 * @parm: none
	 * @return: string of the imdb id
	 */
	public function getImdb () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('info');
		if (is_object($tmp)) {
			//echo "Yes\n";
			$tmp = $tmp->plaintext;
		} else {
			//echo "No\n";
			return FALSE;
		}
		// use reg get the thing you want
		$partern = '/IMDb链接:[ ]*(.*)/';
		if (preg_match($partern, $tmp, $matches)) {
			// matches
			return $matches[1];
		} else {
			// no match
			return FALSE;
		}
	}
	
	/*
	 * Des: getDoubanId is used to get the douban video id from the url. this
	 *	task is already done in the constructor. so just return the value.
	 * @parm: none
	 * @return: the string of the douban id
	 */
	public function getDoubanId () {
		return $this->doubanid;
	}
	
	/*
	 * DES: method getRating is used to get the douban rating data from the 
	 *	original html file by using the html dom and regular expression.
	 * @parm: none
	 * @return: string of the rating
	 */
	public function getRating () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('interest_sectl');
		//echo $tmp->plaintext;
		if (is_object($tmp)) {
			//echo "Yes\n";
			$tmp = $tmp->plaintext;
		} else {
			//echo "No\n";
			return FALSE;
		}
		// use reg get the thing you want
		$partern = '/[ ]*([0-9]\.[0-9])[ ]*/';
		if (preg_match($partern, $tmp, $matches)) {
			// matches
			//echo $matches[1] . "\n";
			return $matches[1];
		} else {
			// no match
			return FALSE;
		}
	}
	
	/*
	 * DES: method getViewNum is used to get the view number data from the 
	 *	original html file by using the html dom and regular expression.
	 * @parm: none
	 * @return: string of the view number
	 */
	public function getViewNum () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('interest_sectl');
		//echo $tmp->plaintext;
		if (is_object($tmp)) {
			//echo "Yes\n";
			$tmp = $tmp->plaintext;
		} else {
			//echo "No\n";
			return FALSE;
		}
		// use reg get the thing you want
		//$partern = '/[ ]*([0-9]\.[0-9])[ ]*/';
		$partern = '/[ ]*[0-9]\.[0-9][ ]*\(([0-9]*)/';
		if (preg_match($partern, $tmp, $matches)) {
			// matches
			//echo $matches[1] . "\n";
			return $matches[1];
		} else {
			// no match
			return FALSE;
		}
	}
	
	/*
	 * DES: method getDescription is used to get the description data from the 
	 *	original html file by using the html dom and regular expression.
	 * @parm: none
	 * @return: string of the description
	 */
	public function getDescription () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('link-report');
		//echo $tmp->plaintext;
		if (is_object($tmp)) {
			//echo "Yes\n";
			$tmp = $tmp->plaintext;
			return $tmp;
		} else {
			//echo "No\n";
			return FALSE;
		}
	}
	
	/*
	 * Des: method getTraillerLink is used to get the link of the trailer page.
	 * @parm: none
	 * @return: string of the link
	 */
	public function getTraillerLink () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->find('h2', 1);
		if (is_object($tmp)) {
			//echo "Yes\n";
			$partern = '/<a href=(.*)>预告片[0-9]*<\/a>/';
			if (preg_match($partern, $tmp, $matches)) {
				// matched
				//print_r($matches);
				$res = str_replace("\"", "", $matches[1]);
				return $res;
			} else {
				return FALSE;
			}
			//return $tmp;
		} else {
			//echo "No\n";
			return FALSE;
		}
	}
	
	/*
	 * Des: method getPhotosLink is used to get the links of the photos.
	 * @parm: none
	 * @return: string of the link
	 */
	public function getPhotosLink () {
		if ($this->html == FALSE) {
			// 404
			echo "Videometadata::getPhotosLink 404\n";
			return FALSE;
		}
		$tmp = $this->htmldom->find('h2', 1);
		if (is_object($tmp)) {
			//echo "Yes\n";
			$partern = '/<a href=.*>预告片[0-9]*<\/a>.*<a href=(.*)>图片[0-9]*<\/a>/';
			$partern2 = '/<a href="(.*)">全部[0-9]*<\/a>/';
			if (preg_match($partern, $tmp, $matches) || preg_match($partern2, $tmp, $matches)) {
				// matched
				//print_r($matches);
				$res = str_replace("\"", "", $matches[1]);
				return $res;
			} else {
				echo "Videometadata::getPhotosLink NO Match\n";
				return FALSE;
			}
		} else {
			//echo "No\n";
			echo "Videometadata::getPhotosLink Not Object\n";
			return FALSE;
		}
	}
	
	/*
	 * DES: method getPhotoLinks is used to get the image url links. this method is different
	 *	from getPhotosLink.
	 * @parm: string photourl
	 * @return: array of the url links
	 */
	public function getPhotoLinks ($photourl) {
		if ( !$photourl) {
			return FALSE;
		}
		// first get the html file
		$html = get_data($photourl);
		if ($html == FALSE) {
			// can not get the all_photos page
			return FALSE;
		}
		// [^"]+ stops at the closing quote, so several img tags on one line match separately
		$partern = '/<img src="([^"]+)">/';
		if (preg_match_all($partern, $html, $matches, PREG_PATTERN_ORDER)) {
			// matched
			return $matches[1];
		} else {
			// not matched
			return FALSE;
		}
	
	}
	
	/*
	 * DES: method getRecommendId is used to get the recommendation videos from
	 *	douban.
	 * @parm: none
	 * @return: array contains the recommendation video id
	 */
	public function getRecommendId () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('recommendations');
		if ( !is_object($tmp)) {
			return FALSE;
		}
		$tmp = $tmp->find('dd');
		/*if ( !is_object($tmp)) {
			return FALSE;
		}*/
		//echo count($tmp);
		$res = array();
		$index = 0;
		for ($i = 0; $i < count($tmp); $i++) {
			$tmpa = $this->htmldom->getElementById('recommendations')->find('dd', $i);
			//echo $tmp;
			$partern = '/<dd>[ ]*<a href="(.*)\?from=subject-page" class="">/';
			if (preg_match($partern, $tmpa, $matches)) {
				// matched
				$tmpidarray = explode('/', $matches[1]);
				$res[$index] = $tmpidarray[4];
				$index++;
			}	
		}
		if ($index == 0) {
			// index can be a flag, if not match at all, return false
			return FALSE;
		} else {
			// return the id array
			return $res;
		}
	}
	
	/*
	 * DES: method getRecommendIda is used to get the recommendation videos from
	 *	douban.
	 * @parm: none
	 * @return: array contains the recommendation video id
	 * Note: method getRecommendId is facing some object problem, so I add an if test
	 *	to solve that problem.
	 */
	public function getRecommendIda () {
		if ($this->html == FALSE) {
			// 404
			echo "Videometadata::getRecommendIda 404\n";
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('recommendations');
		if ( !is_object($tmp)) {
			echo "Videometadata::getRecommendIda Not Object\n";
			return FALSE;
		}
		$tmp = $tmp->find('dd');
		/*if ( !is_object($tmp)) {
			return FALSE;
		}*/
		//echo count($tmp);
		$res = array();
		$index = 0;
		for ($i = 0; $i < count($tmp); $i++) {
			$tmpa = $this->htmldom->getElementById('recommendations')->find('dd', $i);
			//echo $tmp;
			$partern = '/<dd>[ ]*<a href="(.*)\?from=subject-page" class="">/';
			if (preg_match($partern, $tmpa, $matches)) {
				// matched
				$tmpidarray = explode('/', $matches[1]);
				$res[$index] = $tmpidarray[4];
				$index++;
			}
		}
		if ($index == 0) {
			// index can be a flag, if not match at all, return false
			return FALSE;
		} else {
			// return the id array
			return $res;
		}
	}
	
	/*
	 * DES: method getReviewTitle is used to get the review title of the video from
	 *	douban.
	 * @parm: index of the review, starting from 0
	 * @return: string of the title
	 * Note: the html dom solves this problem very well.
	 */
	public function getReviewTitle ($index = 0) {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$section = $this->htmldom->getElementById('review_section');
		if ( !is_object($section)) {
			// no review section on the page
			return FALSE;
		}
		$tmp = $section->find('h3', $index);
		if ( !is_object($tmp)) {
			// no title at that index
			return FALSE;
		}
		return $tmp->lastChild()->innertext;
	}
	
	/*
	 * DES: method getReviewsBody is used to get the review body of the video from
	 *	douban.
	 * @parm: none
	 * @return: array of the body
	 * Note: the html dom solves this problem very well.
	 */
	public function getReviewsBody () {
		if ($this->html == FALSE) {
			// 404
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('review_section');
		if (is_object($tmp)) {
			// html dom return right, continuing
			$reviewdoms = $tmp->find('div');
			$index = 0;
			$res = array();
			foreach ($reviewdoms as $item) {
				if ($item->hasAttribute('class') && $item->getAttribute('class') == 'review-short') {
					//echo $item->plaintext . "\n";
					$res[$index]['title'] = $this->getReviewTitle($index);
					$res[$index]['body'] = $item->firstChild()->plaintext;
					$index++;
					//echo $item->firstChild()->plaintext . "\n\n";
				}
			}
			// the original listing built $res but never returned it
			return $res;
		} else {
			// not dom
			return FALSE;
		}
	}
	
	/*
	 * DES: method getReviews is used to get the review info of the video from
	 *	douban.
	 * @parm: none
	 * @return: array contains the title and the body, i.e. $res[0]['title'] and $res[0]['body']
	 * Note: the html dom solves this problem very well.
	 */
	public function getReviews () {
		if ($this->html == FALSE) {
			// 404
			echo "Videometadata::getReviews 404\n";
			return FALSE;
		}
		$tmp = $this->htmldom->getElementById('review_section');
		if (is_object($tmp)) {
			// html dom return right, continuing
			$reviewdoms = $tmp->find('div');
			$index = 0;
			$res = array();
			foreach ($reviewdoms as $item) {
				if ($item->hasAttribute('class') && $item->getAttribute('class') == 'review-short') {
					//echo $this->getReviewTitle($index) . "\n";
					$res[$index]['title'] = $this->getReviewTitle($index);
					$res[$index]['body'] = $item->firstChild()->plaintext;
					$index++;
					//echo $item->firstChild()->plaintext . "\n\n";
				}
			}
			if ($index == 0) {
				// no result in array
				return FALSE;
			} else {
				// some result in array
				return $res;
			}
		} else {
			// not dom
			echo "Videometadata::getReviews Not Object\n";
			return FALSE;
		}
	}
	
	/*
	 * Des: nothing needs to be done by the destructor for now.
	 */
	public function __destruct () {
		// TO-DO
		//echo $this->htmldom;
		//echo "Running destruct method ...\n";
	}
}
?>
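The class depends on get_data() from netlib.php, which is not listed in this article. For readers who want to try the class, here is a minimal sketch of what such a helper could look like. The name get_data_sketch, the timeout and the user-agent string are all assumptions for illustration, not the original netlib.php code; the only behaviour taken from the class is that a failed fetch yields FALSE, which is what the "404" checks in every method rely on.

```php
<?php
// Hypothetical stand-in for get_data() from netlib.php (not shown in the article).
// Returns the response body as a string, or FALSE on any failure.
function get_data_sketch(string $url, int $timeout = 10)
{
    // Options for the http:// wrapper; they are ignored for local files.
    $context = stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeout,                   // seconds
            'user_agent' => 'videometadata-sketch/0.1', // assumed value
        ),
    ));
    // @ suppresses the PHP warning so the caller only has to test for FALSE.
    // For http:// URLs this needs allow_url_fopen to be enabled.
    $body = @file_get_contents($url, false, $context);
    if ($body === false || $body === '') {
        return false;
    }
    return $body;
}
```

With the default wrapper settings an HTTP error status such as 404 makes file_get_contents fail, so the sketch returns FALSE in that case as well. If you crawl more than a handful of pages, pausing between requests keeps the load on the site low.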

    The Videometadata class is used as follows:

<?php
/**
 * DES: an example to illustrate how to use the videometadata class.
 *
 * Produced By CSRG.
 **/
require_once ('videometadata.php');

// the video page url in douban
$url = "http://movie.douban.com/subject/123393/";
// to create a Videometadata object, you need to provide the url
$videoobj = new Videometadata($url);
echo $videoobj->getTitle();

?>
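The constructor obtains the Douban ID by splitting the URL on "/" and taking element 4, which only works when the URL has exactly the shape used in the example above. As a sketch of a more tolerant alternative, the ID can be matched directly. The helper name douban_id_from_url is hypothetical and is not part of the class.

```php
<?php
// Hypothetical helper: extract the numeric subject id from a Douban URL.
// Returns the id as a string, or FALSE when the URL has no /subject/<digits> part.
function douban_id_from_url(string $url)
{
    // '#' is used as the delimiter so the slashes need no escaping
    if (preg_match('#/subject/([0-9]+)#', $url, $m)) {
        return $m[1];
    }
    return false;
}

var_dump(douban_id_from_url('http://movie.douban.com/subject/123393/'));
var_dump(douban_id_from_url('movie.douban.com/subject/123393'));
var_dump(douban_id_from_url('http://movie.douban.com/'));
```

The second call shows the difference: without the leading scheme, splitting on "/" would put the ID at index 2 rather than 4, while the pattern still finds it.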

   

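    I know that pasting the main code into the article is really not ideal: first, nobody will go back and study it closely; second, it gets in the way of reading the article. I had intended to put all of the project code in an attachment, but as far as I can tell the blog system does not offer that feature, so this will have to do. I hope you will understand, and I am open to suggestions. I am not yet very familiar with this blog system, so pointers are welcome!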
