Code for Collecting Image Sets: SPI Version 1.0


Without further ado, here is the code:

  spi.php 

<?php
#SPI Version 1.0
#Author: mrn6 from csdn.net--https://me.csdn.net/qq_21264377

if ($_SERVER["REQUEST_METHOD"] == "POST") {
    require_once "Document.php"; // matches the Document.php listing below; the case matters on case-sensitive file systems

    # Matches jpg, jpeg, png, gif, webp and bmp image URLs (the dot before each extension is escaped)
    $imgSrcPattern = '/[^>\"\']+\.jpg|[^>\"\']+\.jpeg|[^>\"\']+\.png|[^>\"\']+\.gif|[^>\"\']+\.webp|[^>\"\']+\.bmp/';
    # Basic pattern for http/https addresses
    $linkSrcPattern = '/http[s]?:\/\/[^>\"\']+/';
    # Matches href attributes together with their quoted value
    $linkTagPattern = '/href=[\'"]{1}[^<>"\']+[\'"]{1}/';
    # Matches JavaScript file addresses
    $jsSrcPattern = '/http[s]?:\/\/[^>\"\']+\.js[^>\"\']*/';
    # Matches img tags
    $imgPattern = '/<img[^>]*?>/';
    # Matches img tags that carry an alt attribute
    $altPattern = '/<img[^>]*alt=[^>]+>/';
    # Matches meta tags that declare a charset
    $metaCharsetPattern = '/<meta[^>]*charset=[^>]+>/';
    # Extracts the charset attribute from a match of the previous pattern
    $charsetPattern = '/charset=[a-zA-Z0-9]+/';
    # Matches the title tag
    $titlePattern = '/<title>[^>]*<\/title>/';
    # Extracts the text of the title tag from a match of the previous pattern
    $titleSrcPattern = '/[^<>]+/';
    # Matches the ETag response header, whose value is used as the file name
    $tagFilePattern = '/ETag:[ ]*"[^<\"\']+"/';
    # Extracts the file name from a match of the previous pattern
    $filePattern = '//';
    # Matches Content-Type in the response headers
    $contentTypePattern = '/Content-Type:[ ]*[a-zA-Z0-9]+[\/][a-zA-Z0-9]+/';
    # Returns the scheme and host of a URL, e.g. https://www.baidu.com for https://www.baidu.com/index.html
    function getHost($source)
    {
        $schema = 'http://';
        $host = $source;
        if (strpos($host, 'http://') === 0) {
            $schema = 'http://';
            $host = preg_replace('/http:[\/]{2}/', '', $host);
        } elseif (strpos($host, 'https://') === 0) {
            $schema = 'https://';
            $host = preg_replace('/https:[\/]{2}/', '', $host);
        } else {
            // pass
        }
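        $pos = strpos($host, '/');
        // A URL without a path contains no '/', so keep the whole host in that case
        if ($pos !== false) {
            $host = substr($host, 0, $pos);
        }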
        $host = $schema . $host;
        return $host;
    }
    # Fetches the response for $url with cURL
    function getUrlContent($url, $https = 0)
    { // fetch the HTML content for the URL
        $output = 'unknown';
        try {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            $headers = array(
                'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
                'Referer: ' . $url
            );
            if ($https) {
                // curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skip verification of the peer certificate
                // curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false); // skip the host name check against the certificate
                curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true); // verify the peer certificate
                curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); // check that the certificate matches the host name (2 is the verifying value)
            }
            // curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0");
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_HEADER, 1);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
            curl_setopt($ch, CURLOPT_AUTOREFERER, true); // boolean option: set Referer automatically when following redirects
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            $output = curl_exec($ch);
            $output = mb_convert_encoding($output, 'UTF-8', 'UTF-8, GBK, GB2312, BIG5');
            $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            if ($code === 200) {
                echo ":=<font color=green>200</font>";
            } else {
                echo ":=<font color=red>" . $code . "</font>";
            }
            flush();
            ob_flush();
            curl_close($ch);
        } catch (Exception $e) {
            echo $e->getMessage();
            flush();
            ob_flush();
        }
        return $output;
    }
    # Downloads a file and saves it
    function wget($source, $header, $tmpfile, $https = 0)
    {
        try {
            global $origin;
            $source = urlpadding($source, getCurrentDirectory($origin));
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $source);
            $headers = array(
                'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
                'Referer: ' . $header
            );
            if ($https) {
                // curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // skip verification of the peer certificate
                // curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE); // skip the host name check against the certificate
                curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true); // verify the peer certificate
                curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); // check that the certificate matches the host name (2 is the verifying value)
            }
            // curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0");
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_HEADER, 1);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
            curl_setopt($ch, CURLOPT_AUTOREFERER, true); // boolean option: set Referer automatically when following redirects
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            curl_setopt($ch, CURLOPT_NOBODY, FALSE); // the response body is needed
            $response = curl_exec($ch);
            // split the headers from the body
            $header = '';
            $body = '';
            if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == '200') {
                $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE); // size of the header block
                $header = substr($response, 0, $headerSize);
                $body = substr($response, $headerSize);
            }
            curl_close($ch);
            // file name
            $arr = array();
            // echo $header;
            $ctype = '';
            $ctypebool = preg_match('/Content-Type:[ ]*[a-zA-Z0-9]+[\/]{1}[a-zA-Z0-9]+/', $header, $arr);
            if ($ctypebool) {
                $ctype = $arr[0];
                $ctype = preg_replace('/Content-Type:[ ]*[a-zA-Z0-9]+[\/]{1}/', '', $ctype);
            }
            if ($ctype == 'jpeg') {
                $ctype = 'jpg';
            }
            // $find=preg_match('/filename="[^<\"\']+"/', $header, $arr);
            // if (! $find) {
            $find = preg_match('/ETag:[ ]*"[^<\"\']+"/', $header, $arr);
            // }
            $file = '';
            if ($find) {
                $file = $arr[0];
                $file = preg_replace('/ETag:[ ]*"/', '', $file);
                $file = preg_replace('/"/', '', $file);
                if (strpos($file, ":0") !== false) { // strpos() returns false when not found, so ">= 0" was always true
                    $file = preg_replace('/[:]{1}[0]{1}/', '', $file);
                }
            } else {
                $p = strrpos($source, '/');
                // http:// or https://
                // echo $source . "::" . $p;
                if ($p > 6) {
                    $file = substr($source, $p + 1);
                } else {
                    $file = md5($source);
                }
                $file = urlencode($file);
                // echo "::" . $source . " header::" . $header ."<br/>";
            }
            if (strpos($file, ':') !== false) {
                $file = preg_replace('/:/', '_', $file);
            }
            if (strpos($file, '!') !== false) {
                $file = preg_replace('/!/', '_', $file);
            }
            if (strpos($file, ';') !== false) {
                $file = preg_replace('/;/', '_', $file);
            }
            if (strpos($file, '-') !== false) {
                $file = preg_replace('/-/', '_', $file);
            }
            if (strpos($file, '~') !== false) {
                $file = preg_replace('/~/', '_', $file);
            }
            if (strpos($file, '%2F') !== false) {
                $file = preg_replace('/%2F/', '_', $file);
            }
            if (strlen($ctype) > 1) {
                $file = $file . '.' . $ctype;
            }
            $file = date('Ym') . '_' . $file;
            $tmpfile = $tmpfile . "_" . $file;
            if (file_exists($tmpfile)) {
                // echo ': cached';
            } else {
                if (strlen($body) >= 1024 * 10) {
                    file_put_contents($tmpfile, $body);
                    // echo "content name::" . $file . "<br>";
                    /*
                     * $fp=fopen($tmpfile, "w+");
                     * $fp.write($body);
                     * fclose($fp);
                     */
                } else {
                    // echo "content length::too small<br>";
                    // echo ': not loaded -- too small';
                }
            }
            // echo " ................................OK<br/>";
            echo ".";
            flush();
            ob_flush();
        } catch (Exception $e) {
            echo $e->getMessage();
        }
    }

    function isArrayObjectSet($sources)
    {
        return isset($sources) && $sources->count() > 0;
    }

    function isArraySet($sources)
    {
        return isset($sources) && count($sources) > 0;
    }
    # Checks whether $source is contained in $targets, i.e. whether it is already in the visit history.
    function inArray($source, $targets)
    {
        if (! isArrayObjectSet($targets)) {
            return false;
        } else {
            $size = $targets->count();
            for ($rindex = 0; $rindex < $size; $rindex ++) {
                // echo "<font color=red>set::" . $targets[$rindex] . "</font><br/>";
                if ($source == $targets[$rindex]) {
                    return true;
                } // =
            } // for targets
        } // else target is not null
        return false;
    }
    # Converts a mobile address to its desktop equivalent: a leading http(s)://m. becomes http(s)://www.
    function mobile2pc($source)
    {
        if (strpos($source, 'http://m.') === 0) {
            $source = preg_replace('/http:\/\/m\./', 'http://www.', $source);
        } elseif (strpos($source, 'https://m.') === 0) {
            $source = preg_replace('/https:\/\/m\./', 'https://www.', $source);
        } elseif (strpos($source, 'm.') === 0) {
            $source = preg_replace('/^m\./', 'www.', $source); // anchored so only the leading "m." is replaced
        }
        // echo "mp::".$source."<br/>";
        return $source;
    }
    # Pads a URL: turns a relative path into the full URL on its site.
    function urlpadding($source, $currentDirectory)
    {
        // only for source(s) with http(s) protocol
        if (strpos($source, "http://") === 0 || strpos($source, "https://") === 0) {
            return $source;
        } else {
            $target = "";
            global $host;
            if (strpos($source, "/") === 0) {
                $target = $host . $source;
            } elseif (strpos($source, "./") === 0) {
                $target = $currentDirectory . "/" . preg_replace('/\.\//', '', $source);
            } elseif (strpos($source, "../") === 0) {
                $target = $host . "/" . preg_replace('/\.\.\//', '', $source);
            } else {
                $target = $currentDirectory . "/" . $source;
            }
            // echo "source::" . $source . "::host::" . $host . " target::" . $target . "<br/>";
            return $target;
        }
    }
    # Returns the directory that contains the URL $source
    function getCurrentDirectory($source)
    {
        $firstindex = strpos($source, '/');
        $lastindex = strrpos($source, '/');
        if ($lastindex === 0) {
            return NULL;
        } elseif ($lastindex == $firstindex || $lastindex == $firstindex + 1) {
            return $source;
        } else {
            return substr($source, 0, $lastindex);
        }
    }

    function arrayPop()
    {
        global $unloadedLinks, $loadedLinks;
        // check if unloaded link set equal to null;
        if (! isArrayObjectSet($unloadedLinks)) {
            return NULL;
        } else {
            $size = $unloadedLinks->count();
            for ($index = 0; $index < $size; $index ++) {
                $currentUnloadedLink = $unloadedLinks[$index];
                // check if current unloaded link in loaded list;
                if (! inArray($currentUnloadedLink, $loadedLinks)) {
                    // if not, then load current link;
                    $loadedLinks->append($currentUnloadedLink);
                    return $currentUnloadedLink;
                }
            }
        } // unloaded link set not null.
        return NULL;
    }

    function arrayPush($source, $sources)
    {
        if (isset($source) && isArrayObjectSet($sources)) {
            $sources->append($source);
        }
    }
    # Collects the links found in $source: bare http(s) URLs and href attribute values
    function getLinks($source, $currentDirectory)
    {
        global $linkSrcPattern;
        $pattern = $linkSrcPattern;
        $links = new ArrayObject();
        preg_match_all($pattern, $source, $match);
        // echo "--".count($match[0]);
        $currentDirectory = mobile2pc($currentDirectory);
        if (count($match[0]) > 0) {
            for ($i = 0; $i < count($match[0]); $i ++) {
                $link = $match[0][$i];
                $link = mobile2pc($link);
                $link = urlpadding($link, $currentDirectory);
                $links->append($link);
            }
        }
        global $linkTagPattern;
        $pattern = $linkTagPattern;
        preg_match_all($pattern, $source, $match);
        if (count($match[0]) > 0) {
            for ($i = 0; $i < count($match[0]); $i ++) {
                $link = $match[0][$i];
                // remove tag attribute;
                $link = preg_replace('/href=/', '', $link);
                // remove tag quote;
                $link = preg_replace('/\'/', '', $link);
                $link = preg_replace('/"/', '', $link);
                if (strpos($link, 'javascript:') !== 0) {
                    $link = urlpadding($link, $currentDirectory);
                    $link = mobile2pc($link);
                    // echo "get tag< a >::" . $link . "<br/>";
                    $links->append($link);
                }
            }
        }
        return $links;
    }
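    # Checks whether $target shares the page-URL prefix of $source, i.e. whether it is the "next page" of a consecutive page set
    # Only links whose similarity is above 80% and below 100% go on to the next check
    # Returns 99 when $target contains the file-name stem of $source, otherwise 81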
    function distinct($source, $target)
    {
        $percent = 0;
        if ($source == $target) {
            $percent = 100;
        } else {
            similar_text($source, $target, $percent);
            if ($percent > 80 && $percent < 100) {
                $last = strrpos($source, '.');
                if ($last > 6) {
                    $pageprefx = substr($source, 0, $last);
                    $dirlast = strrpos($pageprefx, '/');
                    $file = substr($pageprefx, $dirlast);
                    $startswith = strpos($target, $file);
                    // bug:
                    // $startswith = strstr($target, $pageprefx);
                    // echo '<br>'.$target.'-'.$pageprefx.'-'.$startswith;
                    if ($startswith === false) {
                        $percent = 81;
                    } else {
                        $percent = 99;
                    }
                } else {
                    $percent = 81;
                }
            }
        }
        return $percent;
    }
    # Compares each link in $targets with the origin address to decide whether it is a "next page"
    function compare($source, $targets)
    {
        global $unloadedLinks;
        global $origin;
        $size = $targets->count();
        for ($rindex = 0; $rindex < $size; $rindex ++) {
            $target = $targets[$rindex];
            $l = strlen($origin);
            $l2 = strlen($target);
            if ($l != $l2 && $l < $l2 && $l2 < $l * 2) {
                $currentDirectory = getCurrentDirectory($source);
                $target = urlpadding($target, $currentDirectory);
                $target = mobile2pc($target);
                $percent = distinct($origin, $target);
                // echo "<font color='gray'>" . $source . "</font><><font color='green'>" . $target . "</font>::<font color='green'>" . $percent . "%</font><br/>";
                if ($percent >= 98 && $percent < 100) {
                    if (inArray($target, $unloadedLinks)) {
                        // echo "saved;<font color=gray>" . $source . "</font><><font color=green>" . $target . "</font>::<font color=green>" . $percent . "%</font><br/>";
                    } else {
                        $unloadedLinks->append($target);
                        // echo "<font color=gray>" . $source . "</font><><font color=green>" . $target . "</font>::<font color=green>" . $percent . "%</font>";
                    }
                } else {
                    // echo "<font color=gray>" . $source . "</font><><font color=gray>" . $target . "</font>::<font color=gray>" . $percent . "%</font><br/>";
                }
            } // strlen
        } // targets;
        echo "<br/> --compare with " . $origin . " --mixed unloaded " . $unloadedLinks->count() . "<br>";
    }

    function isHtmlFile($source)
    {
        if (isHost($source)) {
            return false;
        } else {
            if (isDirectory($source)) {
                return false;
            } else {
                return true;
            }
        }
    }

    function isDirectory($source)
    {
        $pos = strrpos($source, '/');
        if ($pos === 0) {
            return true;
        } elseif ($pos === false) { // strrpos() returns false, never a negative number, when '/' is absent
            return true;
        } else {
            $urllen = mb_strlen($source, 'UTF-8');
            if ($pos == $urllen - 1) {
                return true;
            } else {
                $dotpos = strrpos($source, '.');
                if ($dotpos === 0) {
                    return true;
                } elseif ($dotpos > $pos) {
                    return false;
                } else {
                    return true;
                }
            }
        }
    }

    function isRelative($source)
    {
        if (strpos($source, '/') === 0 || strpos($source, './') === 0 || strpos($source, '../') === 0) {
            return true;
        } elseif (strpos($source, "http://") === 0 || strpos($source, "https://") === 0) {
            return false;
        } else {
            return true;
        }
    }

    function getAbsolutePath($source, $current)
    {
        $abspath = '';
        if (isRelative($source)) {
            // PHP concatenates strings with ".", not "+"; the joins mirror urlpadding()
            if (strpos($source, "./") === 0) {
                $abspath = getCurrentDirectory($current) . "/" . preg_replace('/\.\//', '', $source);
            } elseif (strpos($source, "../") === 0) {
                $abspath = getHost($current) . "/" . preg_replace('/\.\.\//', '', $source);
            } elseif (strpos($source, "/") === 0) {
                $abspath = getHost($current) . $source;
            } else {
                $abspath = getCurrentDirectory($current) . "/" . $source;
            }
        } else {
            $abspath = $source;
        }
        return $abspath;
    }

    function isHost($source)
    {
        if (strpos($source, "http://") === 0) // if010
        {
            $host = preg_replace('/http:\/\//', '', $source);
            if (strpos($host, "/") === false) // if011
            {
                return true;
            } // if011
            else {
                return false;
            } // if011
        } // if010
        elseif (strpos($source, "https://") === 0) // if 010
        {
            $host = preg_replace('/https:\/\//', '', $source);
            if (strpos($host, "/") === false) // if 021
            {
                return true;
            } else {
                return false;
            }
        } // if 010
        else {
            return false;
        } // if 010
    }

    function escapeJs($source)
    {
        echo 'Before escape js::' . strlen($source) . '<br/>';
        // Non-greedy match of each script element; the s flag lets "." span line breaks
        $jspattern = '/<script\b[^>]*>.*?<\/script>/is';
        // preg_replace() already removes every match, so no loop is needed
        $source = preg_replace($jspattern, '', $source);
        echo 'After escape js::' . strlen($source) . '<br/>';
        flush();
        ob_flush();
        return $source;
    }

    function getHtmlText($source)
    {
        $htmltagstartpattern = '/<[^>]+>/';
        $htmltagendpattern = '/<[\/]{1}[^>]+>/';
        // escape javascript content first;
        $source = escapeJs($source);
        $pattern = $htmltagstartpattern;
        preg_match($pattern, $source, $match);
        while (count($match) > 0) {
            $source = preg_replace($pattern, '', $source);
            preg_match($pattern, $source, $match);
        }
        $pattern = $htmltagendpattern;
        preg_match($pattern, $source, $match);
        while (count($match) > 0) {
            $source = preg_replace($pattern, '', $source);
            preg_match($pattern, $source, $match);
        }
        return $source;
    }

    // main function entry:: loadResources();
    // main entry point
    function loadResources($source)
    {
        try {
            // $host = getHost($source);
            // Test source:
            $https = 0;
            if (strpos($source, "https://") === 0) {
                $https = 1;
            }
            echo "<br>-->content URL::" . $source;
            $html = getUrlContent($source, $https);
            // get text escaping html tags
            // $htmltext = getHtmlText($html);
            // echo "::<br/>::".$htmltext.'::';
            // compare links
            $links = getLinks($html, getCurrentDirectory($source));
            // echo ";currrent page links:=" . $links->count();
            compare($source, $links);
            // match title tag
            global $titlePattern;
            $pattern = $titlePattern;
            preg_match_all($pattern, $html, $match);
            // echo count($match).PHP_EOL;
            $titletag = isset($match[0][0]) ? $match[0][0] : ''; // pages without a title tag yield an empty title
            // echo $titletag.PHP_EOL;
            global $titleSrcPattern;
            $pattern = $titleSrcPattern;
            // $title=preg_replace('/<[\/]*title[^<>]*>/', '', $titletag);
            $title = strip_tags($titletag);
            // echo 'Title is '.$title.PHP_EOL;
            echo '<title>' . $title . '</title>';
            // $charsettag = '';
            global $imgPattern;
            $pattern = $imgPattern;
            preg_match_all($pattern, $html, $match);
            // print_r($match);
            $imgtags = $match[0];
            global $imgSrcPattern;
            $pattern = $imgSrcPattern;
            $count = count($imgtags);
            $docs = new ArrayObject();
            // echo "<ul>";
            $tmpdir = "./tmp/" . date('Y_m_d');
            if (! file_exists($tmpdir)) {
                mkdir($tmpdir, 0755, true); // recursive, so ./tmp is created as well
            }
            for ($index = 0; $index < $count; $index ++) {
                // echo "::".$imgtags[$index]."::<br/>";
                preg_match_all($pattern, $imgtags[$index], $match);
                $imgcount = count($match[0]);
                if ($imgcount > 0) {
                    for ($imgindex = 0; $imgindex < $imgcount; $imgindex ++) {
                        $img = $match[0][$imgindex];
                        if (strpos($img, '//') === 0) {
                            $img = 'http:' . $img;
                        }
                        // echo "<br/>" . $index . ']' . $img;
                        $alt = 'unknown';
                        // echo "<li><a href=\"".$img."\" target=\"_blank\"><img src=\"".$img."\"/>".$title.'-'.$alt."</a>";
                        $doc = new Document();
                        $doc->setSource($img);
                        $doc->setTitle($title);
                        $doc->setContent($alt);
                        $doc->setAuthor($source);
                        $docs->append($doc);
                        $https = 0;
                        if (strpos($doc->getSource(), "https://") === 0) {
                            $https = 1;
                        }
                        // echo 'wget>' . $doc->getSource() . ' ';
                        flush();
                        ob_flush();
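                        global $loadedResources;
                        // The original called an undefined $loadChk helper and keyed the check on the page URL;
                        // tracking image URLs in a plain array is an assumed replacement.
                        if (! is_array($loadedResources)) {
                            $loadedResources = array();
                        }
                        if (in_array($doc->getSource(), $loadedResources, true)) {
                            echo ' loaded';
                        } else {
                            $loadedResources[] = $doc->getSource();
                            wget($doc->getSource(), $source, $tmpdir . "/tmp_", $https);
                        }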
                    }
                }
            }
            // echo "</ul>";
            // echo "<br/>::finished<br/>Redirect in 5s...";
            // header("Refresh:5;url=spi.php");
            return ':=ok';
        } catch (Exception $e) {
            echo $e->getMessage();
            return ':=error';
        }
    } // main function end:: loadResources()

    // MAIN :entry
    // :major data sets
    $unloadedLinks = new ArrayObject();
    $loadedLinks = new ArrayObject();
    $host = '';

    $source = "https://www.***.com/gq/m*/hy1693.html";
    if (isset($_POST["source"])) {
        $source = $_POST["source"];
        // echo $source;
    }
    // start fetching resources
    set_time_limit(500);
    // get host
    $host = getHost($source);
    $origin = $source;
    // load resources of first page in the following
    loadResources($source);
    // check if resource(s) got:
    // Pages already fetched. The original relied on an undefined $loadChk helper,
    // so a plain array with in_array() is used here as an assumed replacement.
    $loadedSources = array($source);
    if (isArrayObjectSet($unloadedLinks)) {
        $unloadedlink = arrayPop();
        while ($unloadedlink !== NULL) {
            if (in_array($unloadedlink, $loadedSources, true)) {
                echo 'Source' . $unloadedlink . ' loaded';
                flush();
                ob_flush();
            } else {
                $loadedSources[] = $unloadedlink;
                loadResources($unloadedlink);
            }
            sleep(1);
            $unloadedlink = arrayPop();
        }
    } // unloaded link(s) not equal to NULL;
} // POST method
elseif ($_SERVER["REQUEST_METHOD"] == "GET") {
    echo "<!DOCTYPE html>
<html><head><title></title>
<style type='text/css'>
html, body, div{
    padding:0;
    margin:0 auto;
    overflow:hidden;
}
.post-container{
    display:table;
    position:absolute;
    width:480px;
    height:240px;
    top:50%;
    left:50%;
    margin:-136px 0 0 -240px;
    background-color:#fefefe;
}
.post-form{
    display:table-cell;
    vertical-align:middle;
    text-align:center;
}
.lbl-title{
    margin-bottom:1em;
    display:block;
    font-size:2em;
}
.input-source{
    height:2em;
    line-height:2em;
    width:16em;
    margin-left:5px;
    margin-right:5px;
    display:inline;
    font-size:1em;
    border:1px solid gray;
    border-radius:4px;
}
.input-btn{
    width:6em;
    height:2.2em;
    line-height:2.2em;
    display:inline;
    font-weight:bold;
    font-size:1em;
}
</style>
</head>
<body>
<div class='post-container'>
<form class='post-form' method=\"post\" action=\"spi.php\">
<label class='lbl-title'>SPI Search</label>
<input class='input-source' type=\"password\" name=\"source\">
<input class='input-btn' type=\"submit\" value=\"OK\">
</form></div>
</body></html>";
}
#SPI Version 1.0
#Author: mrn6 from csdn.net--https://me.csdn.net/qq_21264377
?>

  Document.php

<?php

class Document
{

    var $title;

    var $content;

    var $created;

    var $author;

    var $editor;

    var $source;

    var $updated;

    var $comment;

    var $doctype;

    function __construct()
    {}

    function __destruct()
    {
        $this->title = null;
        $this->source = null;
        $this->content = null;
    }

    function setTitle($title)
    {
        $this->title = $title;
    }

    function getTitle()
    {
        return $this->title;
    }

    function setSource($source)
    {
        $this->source = $source;
    }

    function getSource()
    {
        return $this->source;
    }

    function setContent($content)
    {
        $this->content = $content;
    }

    function getContent()
    {
        return $this->content;
    }

    function setAuthor($author)
    {
        $this->author = $author;
    }

    function getAuthor()
    {
        return $this->author;
    }

    function setEditor($editor)
    {
        $this->editor = $editor;
    }

    function getEditor()
    {
        return $this->editor;
    }

    function getCreated()
    {
        return $this->created;
    }

    function setUpdated($updated)
    {
        $this->updated = $updated;
    }

    function getUpdated()
    {
        return $this->updated;
    }

    function setComment($comment)
    {
        $this->comment = $comment;
    }

    function getComment()
    {
        return $this->comment;
    }

    function setDoctype($type)
    {
        $this->doctype = $type;
    }

    function getDoctype()
    {
        return $this->doctype;
    }
}

?>

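This code starts from a given URL, automatically gathers the consecutive pages that follow it, and collects the images on them. There were originally two ways to match consecutive pages: 1) judging URL similarity; 2) judging whether the URLs share the same string prefix. The first approach has a large error: in some cases highly similar URLs merely belong to the same section and are not members of the same page set, so it has been set aside for now. The second currently recognizes only page sets that follow the pattern http(s)://***.com/pic/2323.html--http(s)://***.com/pic/2323_2.html. Recognizing rules of this kind is covered briefly in another article of mine, https://blog.csdn.net/qq_21264377/article/details/104934580, and is not discussed here. Discussing overly specific rule recognition in depth is of limited value unless you are collecting relatively stable resources at large scale; it is the equivalent of "adding bricks and tiles" to a building.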
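As a rough illustration of the prefix approach, here is a minimal sketch. The helper name `isNextPageByPrefix` and the `www.abc.com` URLs are assumptions made for this example and do not appear in spi.php; the sketch covers only the `2323.html` / `2323_2.html` pattern described above.

```php
<?php
// Hypothetical helper, not part of spi.php.
// Returns true when $candidate looks like a later page of the set that starts
// at $first, e.g. .../pic/2323.html -> .../pic/2323_2.html.
function isNextPageByPrefix(string $first, string $candidate): bool
{
    if ($first === $candidate) {
        return false; // the same page is not a "next" page
    }
    $dot = strrpos($first, '.');
    $slash = strrpos($first, '/');
    if ($dot === false || $slash === false || $dot < $slash) {
        return false; // no file extension after the last '/'
    }
    // Everything before the extension, e.g. "https://www.abc.com/pic/2323"
    $prefix = substr($first, 0, $dot);
    // Require "_" right after the prefix so 2323.html does not match 23235.html
    return strpos($candidate, $prefix . '_') === 0;
}

var_dump(isNextPageByPrefix('https://www.abc.com/pic/2323.html', 'https://www.abc.com/pic/2323_2.html')); // bool(true)
var_dump(isNextPageByPrefix('https://www.abc.com/pic/2323.html', 'https://www.abc.com/pic/2400.html'));   // bool(false)
```

Unlike a similarity score, this check cannot be fooled by two unrelated pages in the same section, but it only works for sites that number their pages this way.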

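Note that many sites, to simplify deployment and maintenance, write local addresses in their pages as relative paths or as "shortened absolute paths" such as "/pic/2342.html". In that case the collected addresses need padding. For a relative path, take the current directory of the source address (the input address) and join the two into a full URL. For a "shortened absolute path", take the scheme and domain from the source address and join them. Written as expressions: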

1) Assume: $sourceUrl="https://www.abc.com/pic/202003/1235.html", $currentUrl="1235_2.html"

Target: $targetUrl="https://www.abc.com/pic/202003/1235_2.html"

Process: getCurrentDirectory()->$currentDirectoryUrl="https://www.abc.com/pic/202003",

$targetUrl=$currentDirectoryUrl+"/"+$currentUrl

2) Assume: $sourceUrl="https://www.abc.com/pic/202003/1235.html", $currentUrl="/pic/202003/1235_2.html"

Target: $targetUrl="https://www.abc.com/pic/202003/1235_2.html"

Process: getHost()->$host="https://www.abc.com",

$targetUrl=$host+$currentUrl.
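The two padding cases can be sketched in a few lines. This is a simplified stand-in for `urlpadding()`, not the function from spi.php: the name `resolveUrl` is hypothetical, it assumes `$sourceUrl` is an absolute http(s) URL, and it ignores ports as well as "./" and "../" segments.

```php
<?php
// Hypothetical helper: takes the page URL directly instead of a global $host.
function resolveUrl(string $sourceUrl, string $currentUrl): string
{
    // Already absolute: nothing to pad
    if (preg_match('#^https?://#i', $currentUrl)) {
        return $currentUrl;
    }
    $parts = parse_url($sourceUrl);
    $host = $parts['scheme'] . '://' . $parts['host'];
    // Case 2: "shortened absolute path" starting with "/" is joined to scheme + host
    if (strpos($currentUrl, '/') === 0) {
        return $host . $currentUrl;
    }
    // Case 1: relative path is joined to the directory of the source page
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/')); // e.g. "/pic/202003"
    return $host . $dir . '/' . $currentUrl;
}

echo resolveUrl('https://www.abc.com/pic/202003/1235.html', '1235_2.html'), "\n";
// https://www.abc.com/pic/202003/1235_2.html
echo resolveUrl('https://www.abc.com/pic/202003/1235.html', '/pic/202003/1235_2.html'), "\n";
// https://www.abc.com/pic/202003/1235_2.html
```

Using `parse_url()` avoids the manual string slicing, at the cost of depending on the source URL being well formed.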

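Because URLs always use "/" as the separator and their directory levels follow a fixed format set by the protocol, independent of platform, splitting them is fairly simple. This also shows the importance of standard protocols: transparent rules mean unobstructed passage. That is one of the basic reasons the Internet took off.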

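Choosing the file name is another headache, especially for anyone with little in-depth exposure to Web protocols. One option is to take the file name plus extension directly from the URL; the other is to extract the file name from the response headers. Each has advantages and drawbacks. Where and how the file name behind a URL is defined is up to the site's author, who may follow the Web protocols or may do things entirely their own way. Given that, we can take a "stop at the first success" approach: if obtaining the file name succeeds, that is enough; only if it fails, or the name is invalid or not what "I" want, is the second option considered. In practice, the download feature of many browsers seems to rarely run into this kind of problem, which is worth thinking about.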
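A minimal sketch of this "stop at the first success" idea follows. The helper name `pickFileName` is hypothetical, and reading the `filename` parameter of a `Content-Disposition` header is an assumption about which header to consult; `wget()` in spi.php uses the `ETag` header instead.

```php
<?php
// Hypothetical helper, not part of spi.php.
function pickFileName(string $url, string $rawHeaders): string
{
    // First attempt: base name of the URL path, accepted only if it has an extension
    $path = parse_url($url, PHP_URL_PATH);
    $name = is_string($path) ? basename($path) : '';
    if ($name !== '' && strpos($name, '.') !== false) {
        return $name;
    }
    // Second attempt: filename parameter of a Content-Disposition header
    if (preg_match('/^Content-Disposition:.*?filename="?([^";\r\n]+)"?/mi', $rawHeaders, $m)) {
        return basename(trim($m[1]));
    }
    // Last resort: a hash of the URL, as wget() does when the URL has no usable name
    return md5($url);
}

$headers = "HTTP/1.1 200 OK\r\nContent-Disposition: attachment; filename=\"cat.jpg\"\r\n\r\n";
echo pickFileName('https://www.abc.com/pic/202003/1235.jpg', ''), "\n";   // 1235.jpg
echo pickFileName('https://www.abc.com/download?id=7', $headers), "\n";   // cat.jpg
```

The order of the attempts is a design choice: swapping the first two steps would prefer the server-supplied name over the one in the URL.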

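There is one caveat: the input URL must be the address of the first page. If it is an arbitrary later page of the set, applying the "prefix method" described above easily loses sight of the abstract commonality and gets mired in concrete rules. To keep the algorithm both "greedy" and simple, abstract the commonality of a class of things as far as possible and then distill the algorithm's expression from it.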

Note: the PHP source code above is for reference and discussion only.
