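Method 1: for article lists whose page URLs follow a regular, sequential pattern
1. Create a file named guxi.txt in the same directory as index.php.
2. Put the following code in index.php: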
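<?php
// Send a browser-like User-Agent and remove the execution time limit for the long-running loop.
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; GreenBrowser)');
ini_set('max_execution_time', '0');
$xh = 4338; // counter appended to the URL prefix; incremented before each request
$myfile = fopen("guxi.txt", "a+") or die("Unable to open file!");
for ($i = 0; $i <= 30; $i++) { // fetches 31 consecutive pages
    $xh++;
    $url = 'https://www.yqhy.org/read/2/2150/2439' . $xh . '.html';
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that fail to download
    }
    // Capture everything between the content div and the div with id="thumb".
    $pattern = '/<div[^>]*id="content"[^>]*>(.*?)<div[^>]*id="thumb">/si';
    if (!preg_match($pattern, $html, $txt)) {
        continue; // skip pages that do not match the expected layout
    }
    $txt[0] = str_replace('<br>', "\n\n\n", $txt[0]);
    // Assumes the page indents paragraphs with &nbsp; entities.
    $txt[0] = str_replace('&nbsp;', "\n\n", $txt[0]);
    $whtml = strip_tags($txt[0]);
    fwrite($myfile, $whtml);
}
fclose($myfile);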
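The extraction step used in the loop above can be tried without touching the network. The following is a minimal sketch, assuming a page laid out the way the pattern expects. The function name extract_chapter_text and the sample HTML are made up for illustration and are not part of the original script, and treating &nbsp; as the indent marker is an assumption.

```php
<?php
// Hypothetical helper: returns the plain text of the content div,
// or null when the page does not match the expected layout.
function extract_chapter_text(string $html): ?string
{
    $pattern = '/<div[^>]*id="content"[^>]*>(.*?)<div[^>]*id="thumb">/si';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    $text = str_replace('<br>', "\n\n\n", $m[1]);  // each <br> becomes blank lines
    $text = str_replace('&nbsp;', "\n\n", $text);   // assumed indent entities
    return strip_tags($text);                       // drop any remaining markup
}

// Made-up sample page, not taken from the real site.
$sample = '<div id="content">First line<br>Second line<div id="thumb">';
var_dump(extract_chapter_text($sample));
var_dump(extract_chapter_text('<p>no content div</p>'));
```

Returning null for a page that does not match lets the caller skip it instead of writing an empty or partial chapter to guxi.txt.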
Method 2: read the link of each content page from the article list. Use this when the page URLs follow no regular pattern.
<?php
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; GreenBrowser)');
ini_set('max_execution_time', '0');
// First fetch the list page and collect every <dd> entry.
$url = 'https://www.yqhy.org/read/2/2150/';
$html = file_get_contents($url);
if ($html === false) {
    die("Unable to fetch the list page!");
}
$pattern = '/<dd[^>]*>(.*?)<\/dd>/si';
preg_match_all($pattern, $html, $txt);
$myfile = fopen("guxi.txt", "a+") or die("Unable to open file!");
// Loop over the list entries and fetch the content page each one links to.
for ($i = 0; $i < count($txt[0]); $i++) { // "<", not "<=": the last valid index is count - 1
    $pattern1 = '/https:(.*?)html/is';
    if (!preg_match($pattern1, $txt[0][$i], $txt1)) {
        continue; // this entry contains no absolute link
    }
    // Remove line breaks, tabs and spaces from the link.
    $url = preg_replace('/\s+/', '', $txt1[0]);
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that fail to download
    }
    $pattern = '/<div[^>]*id="content"[^>]*>(.*?)<div[^>]*id="thumb">/si';
    if (!preg_match($pattern, $html, $txt2)) {
        continue; // skip pages that do not match the expected layout
    }
    $txt2[0] = str_replace('<br>', "\n\n\n", $txt2[0]);
    // Assumes the page indents paragraphs with &nbsp; entities.
    $txt2[0] = str_replace('&nbsp;', "\n\n", $txt2[0]);
    $whtml = strip_tags($txt2[0]);
    echo $i, "\n"; // progress indicator
    fwrite($myfile, $whtml);
}
fclose($myfile);
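The link-collection step can likewise be checked against a small inline sample. This is a sketch under assumptions: the helper name collect_chapter_links and the sample list markup are hypothetical, example.com stands in for the real site, and it assumes, as the script does, that each usable <dd> contains an absolute https link ending in html.

```php
<?php
// Hypothetical helper: returns the content-page URLs found inside <dd> elements.
function collect_chapter_links(string $listHtml): array
{
    $links = [];
    preg_match_all('/<dd[^>]*>(.*?)<\/dd>/si', $listHtml, $items);
    foreach ($items[0] as $item) {                    // foreach avoids index off-by-one errors
        if (!preg_match('/https:(.*?)html/is', $item, $m)) {
            continue;                                 // <dd> without a usable link
        }
        $links[] = preg_replace('/\s+/', '', $m[0]);  // drop stray whitespace inside the URL
    }
    return $links;
}

// Made-up list markup; the first link contains a line break on purpose.
$sample = "<dl><dd><a href=\"https://example.com/1/a\n.html\">One</a></dd>"
        . "<dd>no link here</dd>"
        . "<dd><a href=\"https://example.com/1/b.html\">Two</a></dd></dl>";
print_r(collect_chapter_links($sample));
```

Collecting the links into an array first keeps the download loop simple: it can iterate over clean URLs and never has to deal with list entries that lack a link.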