Scraping a website's articles and saving them to a local txt file

chengchengbox

Categories: html, php. Tags: php

Article link: https://blog.csdn.net/qq_34297991/article/details/104019869


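Method 1: for article lists whose page URLs follow a regular pattern

1. Create a file named guxi.txt in the same directory as index.php.

2. Put the following code in index.php: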

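<?php
// Send a browser-like User-Agent with file_get_contents() requests.
ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; GreenBrowser)');
// Remove the execution time limit so a long crawl is not cut off.
ini_set('max_execution_time', '0');

$xh = 4338; // counter appended to the URL prefix; incremented before each request
$myfile = fopen("guxi.txt", "a+") or die("Unable to open file!");
for ($i = 0; $i <= 30; $i++) {
	$xh++;
	$url = 'https://www.yqhy.org/read/2/2150/2439'.$xh.'.html';

	$html = file_get_contents($url);
	if ($html === false) {
		continue; // skip pages that could not be fetched
	}
	// Capture everything from the id="content" div up to the id="thumb" div.
	$pattern = '/<div[^>]*id="content"[^>]*>(.*?)<div[^>]*id="thumb">/si';
	if (preg_match($pattern, $html, $txt) !== 1) {
		continue; // skip pages without the expected content block
	}
	// Turn <br> and four-&nbsp; indents into line breaks, then strip the remaining tags.
	$txt[0] = str_replace('<br>', "\n\n\n", $txt[0]);
	$txt[0] = str_replace('&nbsp;&nbsp;&nbsp;&nbsp;', "\n\n", $txt[0]);
	$whtml = strip_tags($txt[0]);

	fwrite($myfile, $whtml);
}
fclose($myfile);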
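
The extraction step in Method 1 (match the `id="content"` block, turn `<br>` and runs of `&nbsp;` into line breaks, strip the remaining tags) can be pulled into a function and tried on a fixed HTML string without touching the network. This is only a minimal sketch: the function name `extractArticleText` and the sample markup are assumptions made for illustration and are not part of the original script.

```php
<?php
// Hypothetical helper (not in the original script): returns the plain text of
// the id="content" block, or null when the page does not match the pattern.
function extractArticleText(string $html): ?string
{
    $pattern = '/<div[^>]*id="content"[^>]*>(.*?)<div[^>]*id="thumb">/si';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    // Same replacements as the script: <br> and four-&nbsp; indents become line breaks.
    $text = str_replace('<br>', "\n\n\n", $m[1]);
    $text = str_replace('&nbsp;&nbsp;&nbsp;&nbsp;', "\n\n", $text);
    return strip_tags($text);
}

// Made-up sample page, used only to exercise the function.
$sample = '<div id="content">&nbsp;&nbsp;&nbsp;&nbsp;First paragraph.<br>Second paragraph.<div id="thumb"></div></div>';
$result = extractArticleText($sample);
echo $result, "\n";
```

Returning null for a page that does not match lets the calling loop skip that page instead of writing an empty or broken entry to guxi.txt.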

Method 2: collect the link of each content page from the article list. Use this when the article URLs follow no regular pattern.

<?php
ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; GreenBrowser)');
ini_set('max_execution_time', '0');


// First fetch the list page and collect its <dd> entries.
$url = 'https://www.yqhy.org/read/2/2150/';
$html = file_get_contents($url);
if ($html === false) {
	die("Unable to fetch list page!");
}
$pattern = '/<dd[^>]*>(.*?)<\/dd>/si';
preg_match_all($pattern, $html, $txt);
$myfile = fopen("guxi.txt", "a+") or die("Unable to open file!");
// Loop over the content-page URLs found in the list.
// Use < rather than <=: the last valid index is count($txt[0]) - 1.
for ($i = 0; $i < count($txt[0]); $i++) {
	$pattern1 = '/https:(.*?)html/is';
	if (preg_match($pattern1, $txt[0][$i], $txt1) !== 1) {
		continue; // this list entry contains no link
	}

	$url = $txt1[0];

	// $url = preg_replace("{\t}","",$url);
	// $url = preg_replace("{\r\n}","",$url);
	// $url = preg_replace("{\r}","",$url);
	$url = preg_replace("{\n}", "", $url);   // remove line breaks from the link
	// $url = preg_replace("{ }"," ",$url);

	$html = file_get_contents($url);
	if ($html === false) {
		continue; // skip pages that could not be fetched
	}
	$pattern = '/<div[^>]*id="content"[^>]*>(.*?)<div[^>]*id="thumb">/si';
	if (preg_match($pattern, $html, $txt2) !== 1) {
		continue; // skip pages without the expected content block
	}

	$txt2[0] = str_replace('<br>', "\n\n\n", $txt2[0]);
	$txt2[0] = str_replace('&nbsp;&nbsp;&nbsp;&nbsp;', "\n\n", $txt2[0]);
	$whtml = strip_tags($txt2[0]);
	echo $i; // progress indicator
	fwrite($myfile, $whtml);
}

fclose($myfile);
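
The link-gathering step of Method 2 can also be checked offline by running it against a small list fragment. The sketch below is an illustration under assumptions: the function name `extractChapterUrls`, the `example.com` URLs, and the sample markup are all made up, and it strips `\r` as well as `\n`, which goes slightly beyond the single `\n` replacement used in the script.

```php
<?php
// Hypothetical helper (not in the original script): collects the URLs found
// inside the <dd> entries of a list page.
function extractChapterUrls(string $listHtml): array
{
    $urls = [];
    preg_match_all('/<dd[^>]*>(.*?)<\/dd>/si', $listHtml, $items);
    foreach ($items[0] as $item) {
        if (preg_match('/https:(.*?)html/is', $item, $m) === 1) {
            // Drop line breaks that may be embedded in the matched link.
            $urls[] = str_replace(["\r", "\n"], '', $m[0]);
        }
    }
    return $urls;
}

// Made-up list markup, used only to exercise the function.
$sample = "<dl>\n"
    . "<dd><a href=\"https://example.com/read/1.html\">Chapter 1</a></dd>\n"
    . "<dd>no link here</dd>\n"
    . "<dd><a href=\"https://example.com/read/2\n.html\">Chapter 2</a></dd>\n"
    . "</dl>";
$urls = extractChapterUrls($sample);
print_r($urls);
```

Iterating with `foreach` over the matches avoids the index arithmetic of a counted loop, and entries without a link are skipped rather than passed on to `file_get_contents`.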
