php抓取imdb海报
It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the International Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.
自从我写了一个PHP采集器以来,已经有一段时间了,但最终我感到很痒。 这次受害者是国际电影数据库,也称为IMDB。 IMDB包含有史以来制作的每部电影的信息(看起来如此)。 他们HTML源代码很容易解析,因此这只是小菜一碟。
PHP (The PHP)
//url
$url = 'http://www.imdb.com/title/tt0367882/';
//get the page content
$imdb_content = get_data($url);
//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);
//build content
$content.= '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';
echo $content;
//gets the match content
function get_match($regex,$content)
{
preg_match($regex,$content,$matches);
return $matches[1];
}
//gets the data from a URL
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.
与我的其他抓取器一样,技巧始终在正则表达式中。 请注意,URL的最重要部分是“ / title /”之后的字符串。 该字符串唯一标识电影。
php抓取imdb海报