Fetching Data with a Simple Crawler

weixin_33692284

Tags: crawler, python, mobile development

Original link: https://my.oschina.net/findurl/blog/189793

I. Basic approach

First, the site's root directory contains a robots.txt file that records which paths crawlers are allowed to access.
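As a rough illustration of honoring robots.txt, here is a minimal sketch. The class name RobotsRules is my own, and it makes simplifying assumptions: it only reads Disallow lines in a "User-agent: *" group and treats each rule as a plain path prefix, ignoring Allow lines and wildcards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: a simplified robots.txt check (Disallow prefixes for "User-agent: *" only). */
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Parses robots.txt text that has already been downloaded. */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    /** Returns false when the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsRules rules = RobotsRules.parse("User-agent: *\nDisallow: /admin/\n");
        System.out.println(rules.isAllowed("/admin/login"));
        System.out.println(rules.isAllowed("/Mobile/00283_1.html"));
    }
}
```

The crawler would download /robots.txt once per host and call isAllowed before queuing a path.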

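Here is a simple parsing pipeline:

1. Add logging with Log4j.

2. HttpURLConnection can build the HTTP request headers, send the request, and return the response as an input stream. HttpsURLConnection does the same for HTTPS.

3. Build a Parse class: (1) detect the encoding

(2) decode

(3) extract URLs

(4) extract the content as well

4. Build a Filter class: (1) fake URLs such as JavaScript links

(2) URLs missing a prefix

(3) restrict which URLs are allowed

(4) filter out duplicates

(5) add to the queue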
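The filter steps in item 4 could be sketched roughly as follows. UrlFilter and its method names are hypothetical, and I am assuming that "URLs missing a prefix" means relative links that should be resolved against the page URL.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;

/** Hypothetical filter: rejects fake links, resolves relative ones, limits scope, deduplicates, queues. */
class UrlFilter {
    private final URI base;
    private final String allowedPrefix;
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();

    UrlFilter(String pageUrl, String allowedPrefix) {
        this.base = URI.create(pageUrl);
        this.allowedPrefix = allowedPrefix;
    }

    /** Returns true only if the link passed every check and was added to the queue. */
    boolean offer(String href) {
        if (href == null) {
            return false;
        }
        String link = href.trim();
        String lower = link.toLowerCase(Locale.ROOT);
        // (1) fake URLs: empty links, in-page anchors, javascript: and mailto: pseudo-links
        if (link.isEmpty() || link.startsWith("#")
                || lower.startsWith("javascript:") || lower.startsWith("mailto:")) {
            return false;
        }
        String absolute;
        try {
            absolute = base.resolve(link).toString(); // (2) add the missing prefix
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        if (!absolute.startsWith(allowedPrefix)) {
            return false; // (3) outside the allowed scope
        }
        if (!seen.add(absolute)) {
            return false; // (4) already seen
        }
        queue.add(absolute); // (5) enqueue for crawling
        return true;
    }

    Queue<String> queue() {
        return queue;
    }

    public static void main(String[] args) {
        UrlFilter filter = new UrlFilter("http://product.pcpop.com/Mobile/00283_1.html",
                "http://product.pcpop.com/");
        filter.offer("/000475692/Index.html");
        filter.offer("javascript:void(0)");
        System.out.println(filter.queue());
    }
}
```

Keeping the seen set separate from the queue means a URL stays deduplicated even after it has been taken off the queue.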

--------------------------------------------------------------

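Open questions: (1) depth

(2) parse without saving files, only extract the data

(3) add producer/consumer concurrency

(4) deduplication

(5) scope control by prefix and address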

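II. Working through the questions

The numbers match the list above; the order is the order in which I looked into them.

(2) Extracting content, without judging whether it is meaningful

A handy tool worth adding here for content extraction is Tika.

2.1. Option one: process the page yourself with regular expressions.

2.2. Option two: use an open-source HTML parsing tool. Today I looked at one called HtmlParser.

http://free0007.iteye.com/blog/1131163 explains its usage in detail.

It also offers set-style operations, and a query method similar to Lucene's.

2.3. I decided to go with this tool; it provides many extraction methods.

There is one problem I do not yet know how to handle well:

I have located the div that holds the content, and now need to iterate over its second-level child nodes to get a descriptive structure.

The first time around, I took the input stream and parsed the content with my own regular expressions (extracting URLs only).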
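A minimal sketch of that regex-only approach, limited to pulling href values out of a page. The class name HrefExtractor and the pattern are my own illustration rather than the original code, and a regex like this will miss unquoted attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts quoted href attribute values with a regular expression. */
class HrefExtractor {
    // Group 1 is the opening quote, group 2 the URL, \1 requires the matching closing quote.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a target=\"_blank\" href=\"http://product.pcpop.com/000475692/Price.html\">x</a>";
        System.out.println(extract(html));
    }
}
```

This is enough for collecting links, but it says nothing about where in the page a link sits, which is why a real parser helps for content extraction.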

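2.4. Example: analyzing the mobile phone listing page on PCPOP (泡泡网)

=> First, the content sits under the element with class "product"

=> Next, each item is inside a <dl></dl> tag

=> Finally, the details are under <dd><div>short name</div><ul><li>attribute...</li></ul></dd>

>> Processing: first use the class to get the enclosing block

>> Then build a smaller block from the current one and process each part in turn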

<div class="product">
 <dl>
    <dt>
       <a target="_blank" href="http://product.pcpop.com/000475692/Index.html"><img height="90" src="http://img5.pcpop.com/ProductImages/Standard/5/5923/005923406.jpg"></a>
       <span class="duibiBefore"  sensisId="612323" id="000475692" onclick="javascript:AddCompareProduct('000475692','三星Note3 N9008 16G移动3G手机(白色)TD-SCDMA/GSM非合约机','005923406','005900050','00160');">加入对比</span>
    </dt>
    <dd>
       <div>
	       <span><a target="_blank" href="http://product.pcpop.com/000475692/Price.html"><b>￥3828</b></a></span>
           <i><a target="_blank" title="三星Note3 N9008 16G移动3G手机(白色)TD-SCDMA/GSM非合约机" alt="三星Note3 N9008 16G移动3G手机(白色)TD-SCDMA/GSM非合约机" href="http://product.pcpop.com/000475692/Index.html">三星Note3 N9008 16G移动3G手机(白色)TD-SCDMA/GS</a></i>
       </div>
      <ul>
        <li>所属：<a target="_blank" href="http://product.pcpop.com/series/000010843/Index.html" title="三星Note3系列">三星Note3系列</a></li>
        <li title="网络模式：单卡双模(TD-SCDMA/GSM)">网络模式：单卡双模(TD-SCDMA...&nbsp;</li>
        <li>手机版本：移动定制机&nbsp;</li>
        <li title="支持运营商：移动3G(TD-SCDMA),移动2G/联通2G(GSM)">支持运营商：移动3G(TD-SCDMA...&nbsp;</li>
        <li>操作系统：安卓Android 4.3&nbsp;</li>
        <li>CPU核心数：四核&nbsp;</li>
        <li>主摄像头像素：1300万像素&nbsp;</li>
        <li>电池容量：3200毫安时&nbsp;</li>
        <li style="width:440px;"><span class="fr"><a target="_blank" href="http://product.pcpop.com/000475692/Picture.html">图片</a>  <a target="_blank" href="http://product.pcpop.com/000475692/Comment.html">点评(99)</a><a target="_blank" href="http://product.pcpop.com/000475692/PriceShop.html" class="red">电商报价</a>&nbsp;<a target="_blank" href="http://product.pcpop.com/000475692/Price.html" class="red">实体店报价</a></span> <a target="_blank" href="http://product.pcpop.com/000475692/Detail.html/">参数>></a></li>
      </ul>
    </dd>
 </dl>
</div>

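Code:

First, I have not found a way to get nodes directly through parent-child relationships.

The content parsing feels too deeply nested, for example when all I want is the product name.

When extracting, should I hand the parser the content or the address? I decided to hand it the content, because then I can build the request headers myself.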

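import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TestJd {

	public static void main(String[] args) {

		Parser htmlParse = new Parser();
		Parser p = new Parser();
		try {
			//htmlParse.setEncoding("GB2312");
			htmlParse.setURL("http://product.pcpop.com/Mobile/00283_1.html");

			// 1. The page content lives in the block with class="product"
			NodeFilter filterFirst = new HasAttributeFilter("class", "product");
			NodeList nodes = htmlParse.extractAllNodesThatMatch(filterFirst);

			if (nodes.size() > 0) {
				Node node = nodes.elementAt(0);

				// 2. Build a separate parser over just that block
				p.setInputHTML(node.toHtml());
				//p.setResource(node.toHtml());

				// 3. Extract the <ul> of each product
				NodeFilter nf = new TagNameFilter("ul");
				NodeList ns = p.extractAllNodesThatMatch(nf);

				System.out.println("Products extracted: " + ns.size());
				for (int j = 0; j < ns.size(); j++) {
					NodeList nl = ns.elementAt(j).getChildren();
					if (nl == null) {
						continue; // a <ul> without children has nothing to print
					}
					// 4. Print the text of each child node
					for (Node nn : nl.toNodeArray()) {
						System.out.println(nn.toPlainTextString());
					}
					System.out.println("-------------------");
				}
			}
		} catch (ParserException e) {
			e.printStackTrace();
		}
	}
}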

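Result: running this prints the attribute lines of each product, followed by a separator line.

That settles this part for now.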

---------------------------------------------------------------------------------------------------------

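(3) Producer/consumer concurrency

3.1 Analysis: concurrency matters a great deal for efficiency.

Taking URLs from the queue is the consuming side.

Extracting URLs into the queue is the producing side.

Production is bound to outpace consumption.

3.2 Handle it with semaphores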

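import java.util.concurrent.Semaphore;

public class Signs {

	// Reserved for the producer; not used yet
	public static Semaphore product = new Semaphore(1);
	// There is only one initial URL
	public static Semaphore customer = new Semaphore(1);
	// Mutual exclusion for the shared URL set
	public static Semaphore mutex1 = new Semaphore(1);

}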

# A semaphore with a single permit acts as a mutex

# First, define a shared class that holds the URLs
import java.util.HashSet;
import java.util.Set;

/**
 * A small URL manager for testing.
 */
public class URLManage {

	/** URL storage */
	public static Set<String> urlSet = new HashSet<String>();

	/** Add a URL */
	public static void addUrl(String url) {
		System.out.println("@@@ add: " + url);
		URLManage.urlSet.add(url);
	}

	/** Remove a URL */
	public static void removeUrl(String url) {
		System.out.println("@@@ remove: " + url);
		// Set.remove returns true only if the URL was present
		if (URLManage.urlSet.remove(url)) {
			System.err.println("### URL removed");
		} else {
			System.err.println("*** URL removal failed");
		}
	}

	/** Print the number of URLs */
	public static void getTheSize() {
		System.out.println("*** URL count: " + URLManage.urlSet.size());
	}
}

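# Whatever the producer and consumer handle in common must be mutually exclusive
# Do producers, or consumers, need mutual exclusion among themselves? Only if they access a shared resource
# The consumer's job is to take out one URL -> after that each one works independently
#                         parse the page to get URLs
#                         putting them back in must be mutually exclusive
-----------------------------------------------
# Does the producer play the role of a filter here?
#                after filtering, it has all the content
-----------------------------------------------
Operations on the shared data should be mutually exclusive; nothing else needs to be.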
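To make that rule concrete, here is a minimal sketch that guards only the shared set with a one-permit Semaphore and leaves everything else unsynchronized. GuardedUrlSet and demo are hypothetical names, and acquireUninterruptibly is my choice to keep the example free of checked exceptions.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Semaphore;

/** Hypothetical wrapper: a HashSet whose every access goes through a one-permit semaphore. */
class GuardedUrlSet {
    private final Semaphore mutex = new Semaphore(1);
    private final Set<String> urls = new HashSet<>();

    boolean add(String url) {
        mutex.acquireUninterruptibly();
        try {
            return urls.add(url);
        } finally {
            mutex.release(); // always release, even if add throws
        }
    }

    int size() {
        mutex.acquireUninterruptibly();
        try {
            return urls.size();
        } finally {
            mutex.release();
        }
    }

    /** Starts several threads that each add distinct URLs, then returns the final size. */
    static int demo(int threads, int perThread) {
        GuardedUrlSet set = new GuardedUrlSet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    set.add("url-" + id + "-" + i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 500));
    }
}
```

Without the semaphore, concurrent adds to a plain HashSet can lose entries or corrupt the set, so the final count would not be reliable.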

public class CustomerURL implements Runnable {

	private static int index = 2;

	@Override
	public void run() {
		while (true) {
			try {
				//Signs.customer.acquire();
				System.out.println("---Consumer: " + Thread.currentThread().getName() + "---is running");

				Signs.mutex1.acquire();
				try {
					// index is shared by all consumer threads, so it is updated under the mutex
					URLManage.addUrl(String.valueOf(index++));
					URLManage.getTheSize();
				} finally {
					Signs.mutex1.release(); // release even if the body throws
				}

				//Signs.customer.release();
				Thread.sleep(1000);
			} catch (InterruptedException e) {
				System.out.println("---Consume failed---");
				Thread.currentThread().interrupt(); // restore the flag and stop the loop
				return;
			}
		}
	}
}

public class ProducterURL implements Runnable {

	private static int index = 2;

	@Override
	public void run() {
		while (true) {
			try {
				//Signs.product.acquire();
				System.out.println("===Producer: " + Thread.currentThread().getName() + " produced an item");

				Signs.mutex1.acquire();
				try {
					// index is shared by all producer threads, so it is updated under the mutex
					URLManage.removeUrl(String.valueOf((index++) + 1));
					URLManage.getTheSize();
				} finally {
					Signs.mutex1.release(); // release even if the body throws
				}

				//Signs.product.release();

				Thread.sleep(1000);
			} catch (InterruptedException e) {
				System.out.println("---Produce failed---");
				Thread.currentThread().interrupt(); // restore the flag and stop the loop
				return;
			}
		}
	}

}

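Why is the output somewhat out of order? Because producing and consuming run concurrently; only adding to and removing from the shared data has to be mutually exclusive.

That is as far as I will take this question.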

--------------------------------------------------------------------------------------------------------

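(4) Deduplication and (1) depth

4-1.1 For deduplication I plan to use a global Set (which deduplicates automatically),

or a global Map (deduplicating on the key).

4-1.2 Traversal comes in two flavors, depth-first and breadth-first.

Depth-first follows one path all the way down.

Breadth-first finishes one level before moving on to the next.

4-1.3 I am not going to worry here about which of the two to use.

I decided to use a Map: key = address, value = depth.

The root, that is the seed, has depth 0; every address found on a page gets the depth of that page + 1.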
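A minimal sketch of the Map-based bookkeeping. DepthCrawl is a hypothetical name, linksOf stands in for "fetch the page and extract its links", and the FIFO queue is my assumption (it makes the traversal breadth-first, so the recorded depth is the shortest one).

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

/** Hypothetical sketch: the Map both deduplicates (key) and records the depth (value). */
class DepthCrawl {

    /** linksOf is a stand-in for fetching a page and extracting its links. */
    static Map<String, Integer> crawl(String seed, int maxDepth,
                                      Function<String, List<String>> linksOf) {
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Queue<String> pending = new ArrayDeque<>();
        depthOf.put(seed, 0); // the seed is the root, depth 0
        pending.add(seed);
        while (!pending.isEmpty()) {
            String url = pending.remove();
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // depth limit reached: keep the URL but do not expand it
            }
            for (String link : linksOf.apply(url)) {
                // putIfAbsent returns null only for an address not seen before
                if (depthOf.putIfAbsent(link, depth + 1) == null) {
                    pending.add(link);
                }
            }
        }
        return depthOf;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("a", List.of("b", "c"));
        graph.put("b", List.of("c", "d"));
        System.out.println(crawl("a", 2, url -> graph.getOrDefault(url, List.of())));
    }
}
```

Swapping the queue for a stack would turn this into a depth-first walk, at the cost of the recorded depth no longer being the shortest one.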

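4-1.4 What needs to be handled here:

# One global collection holding every address already visited -> for deduplication

#   One global collection holding newly found addresses -> valid or not
#                                invalid ones
#                                corrected ones
#                                already fetched ones
#                                scope check

#  One global collection holding the addresses currently in use -> the seeds
#                               remove once used
#                               a mixed bag

That will do for this one.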

--------------------------------------------------------------------------------------------------------

(5) Scope

For now, go by prefix.

--------------------------------------------------------------------------------------------------------

Finally, put it all together.

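III. Problems you may run into

3.1 Encoding

I found that the encoding in effect is whatever the project encoding is. For example, my machine defaults to GBK; after switching the project to UTF-8, the encoding became UTF-8.

My original idea was to extract the page's encoding first and then decode with it. For now I set it in configuration instead.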
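The original "detect, then decode" idea could look roughly like this. CharsetSniffer is a hypothetical name, and I am assuming the charset is declared either in the Content-Type response header or in a meta tag near the top of the page, with a configured fallback when neither is usable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for charset=... in the header first, then in the HTML head. */
class CharsetSniffer {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /**
     * contentType: value of the Content-Type header, may be null.
     * htmlHead: the first part of the page decoded as ISO-8859-1, may be null.
     */
    static Charset detect(String contentType, String htmlHead, Charset fallback) {
        Charset found = find(contentType);
        if (found == null) {
            found = find(htmlHead);
        }
        return found != null ? found : fallback;
    }

    private static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher matcher = CHARSET.matcher(text);
        while (matcher.find()) {
            try {
                return Charset.forName(matcher.group(1));
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: keep looking
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect("text/html; charset=utf-8", null, StandardCharsets.ISO_8859_1));
    }
}
```

With HttpURLConnection, the header value would come from getContentType(), and the fallback would be the encoding from the configuration file.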

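3.2 Controlling, in one place, what gets parsed

The extraction scope differs from page to page, so I plan to control it through XML nesting (still experimenting).

3.3 Aiming for simple generality through simple configuration

Since the content to fetch is always specific, the configuration file can spell out the specific hierarchy to parse (the hierarchy forms a tree).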

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <!-- Character encoding in use -->
    <encode>GB2312</encode>
    
    <!-- Prefix to add to URLs that lack one -->
    <prefix>p1</prefix>
    
    <!-- Scope control by prefix -->
    <rang>
        <filter>f1</filter>
        <filter>f2</filter>
    </rang>
    
    <!-- Depth limit -->
    <deep>5</deep>
    
    <!-- Prototype for content parsing -->
    <!-- Level 1 is the outermost block -->
    <!-- Level 2 narrows it down -->
    <!-- Level 3 refines further and is optional -->
    <content>
       <attribute name="class" value="product">
           <attribute name="ul">
               <attribute name="li"/>
           </attribute>    
       </attribute>
    </content>
</root>
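One way to read such a file is with the JDK's DOM parser, sketched below. CrawlerConfig and its field names are hypothetical, and I am assuming each nested attribute element describes one level of the extraction path, with an element that has no value attribute naming a tag.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical holder for the settings in the XML above. */
class CrawlerConfig {
    String encode;
    String prefix;
    int deep;
    final List<String> filters = new ArrayList<>();
    final List<String> contentPath = new ArrayList<>(); // outermost level first

    /** Parses the XML text; not hardened for untrusted input. */
    static CrawlerConfig parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            CrawlerConfig config = new CrawlerConfig();
            config.encode = text(root, "encode");
            config.prefix = text(root, "prefix");
            config.deep = Integer.parseInt(text(root, "deep"));
            NodeList filterNodes = root.getElementsByTagName("filter");
            for (int i = 0; i < filterNodes.getLength(); i++) {
                config.filters.add(filterNodes.item(i).getTextContent().trim());
            }
            // Walk down the nested <attribute> elements, one level per iteration.
            Element content = (Element) root.getElementsByTagName("content").item(0);
            Element level = firstChild(content, "attribute");
            while (level != null) {
                String name = level.getAttribute("name");
                String value = level.getAttribute("value"); // empty string when absent
                config.contentPath.add(value.isEmpty() ? name : name + "=" + value);
                level = firstChild(level, "attribute");
            }
            return config;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid crawler config", e);
        }
    }

    private static String text(Element root, String tag) {
        return root.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    private static Element firstChild(Element parent, String tag) {
        if (parent == null) {
            return null;
        }
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Element && ((Element) kid).getTagName().equals(tag)) {
                return (Element) kid;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CrawlerConfig config = parse("<root><encode>GB2312</encode><prefix>p1</prefix>"
                + "<deep>5</deep><content><attribute name=\"ul\"/></content></root>");
        System.out.println(config.encode + " " + config.deep + " " + config.contentPath);
    }
}
```

Each entry in contentPath could then be turned into one filter, for example an attribute filter for "class=product" and a tag-name filter for "ul".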

IV. Current demo

It is here: https://github.com/Soap13/MySpider

V. Updates

1. Fixed a case where fetched content came back garbled; it was related to this request header:

Accept-Encoding=gzip, deflate
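My reading of this fix is that a server may answer such a request with a compressed body, which looks like garbage if it is decoded as text directly. A minimal sketch of handling the gzip case follows; BodyDecoder is a hypothetical name, the gzip helper exists only for the demonstration, and deflate is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: decompresses a gzip response body before decoding it to text. */
class BodyDecoder {

    /** contentEncoding is the Content-Encoding response header value, may be null. */
    static String decode(InputStream body, String contentEncoding, Charset charset) {
        try {
            InputStream in = body;
            if ("gzip".equalsIgnoreCase(contentEncoding)) {
                in = new GZIPInputStream(body); // unwrap the compression first
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo-only helper that produces a gzip-compressed body. */
    static byte[] gzip(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<html>ok</html>", StandardCharsets.UTF_8);
        System.out.println(decode(new ByteArrayInputStream(compressed), "gzip", StandardCharsets.UTF_8));
    }
}
```

With HttpURLConnection, the header value is available from getContentEncoding(). The other way out is simply not to send Accept-Encoding, so the server returns an uncompressed body.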

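2. The parsing format is now driven by configuration, currently implemented with XML.

It has two parts. One is the coarse content extraction, for example the article list on a page.

The other is the fine-grained extraction, for example the author, title, and URL inside that article list.

The configuration below extracts the tags of articles from the iteye blog list; it is only a simple test.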

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <!-- Character encoding in use -->
    <encode>UTF-8</encode>
    
    <!-- Prefix to add to URLs that lack one -->
    <prefix>http://product.pcpop.com/</prefix>
    
    <!-- Scope control by prefix -->
    <rang>
        <filter>http://product.pcpop.com/</filter>
        <filter>f2</filter>
    </rang>
    
    <!-- Depth limit -->
    <deep>5</deep>
    
    <!-- Prototype for content parsing -->
    <!-- Level 1 is the outermost block -->
    <!-- Level 2 narrows it down -->
    <!-- Level 3 refines further and is optional -->
    <content>
       <attribute name="id" value="main">
           <attribute name="id" value="index_main">
              <attribute name="class" value="blog clearfix"/>
           </attribute>
       </attribute>
    </content>
    
    <!-- Which parts of the parsed content to pick out -->
    <content-name>
       <attr name="class" value="category"/>
    </content-name>
</root>

3. I increasingly feel out of my depth when it comes to organizing a whole small project.
