Build an Information Gatherer Step by Step: Collecting Website Filing Numbers

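Goals for this installment:
1. Learn how to collect website filing numbers.
2. Practice extracting information from HTTP responses.
Tools required:
pip; HTTP request library: requests; matching libraries: re and Beautiful Soup; json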

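Questions to start with:

1. What is a website filing number, and why collect it?

A: The filing number (ICP filing number) indicates whether a website is legally registered to operate; a website's domain name has to be filed. In the previous installment we showed how to collect a site's subdomains with search engines, starting from the main domain and expanding outward to find as many subdomains as possible. But a company's web assets go well beyond a single main domain. There are often other main domains we have not discovered, and looking up the filing number can surface more of them.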

2. Where can filing numbers be collected?
There are many lookup sites. For example, these addresses look up Baidu's filing number:

 
    http://www.beianbeian.com/search/baidu.com
    http://www.sojson.com/api/beian/baidu.com
    http://icp.chinaz.com/info?q=baidu.com
    ......
   

Suppose the lookup returns a filing number such as 京ICP证030173号. We can then run a reverse lookup with it:

 
    http://www.beianbeian.com/search-1/京ICP证030173号
    http://icp.chinaz.com/京ICP证030173号
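The two kinds of address above can be built in code. This is only a rough sketch: the helper names search_url and reverse_url are made up for illustration, and percent-encoding the filing number as UTF-8 with urllib.parse.quote is an assumption about what the site accepts, not something the site documents.

```python
from urllib.parse import quote

BASE = "http://www.beianbeian.com"


def search_url(domain):
    # Forward lookup: domain -> page containing the filing number.
    return BASE + "/search/" + domain


def reverse_url(record):
    # Reverse lookup: filing number -> sites under that filing.
    # quote() percent-encodes the non-ASCII characters as UTF-8 (assumption).
    return BASE + "/search-1/" + quote(record)


print(search_url("baidu.com"))
print(reverse_url("京ICP证030173号"))
```

Keeping the URL construction in small functions makes it easy to swap in one of the other lookup sites later.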

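3. Where is the difficulty in writing the code?
Collecting the information is not hard in itself. The point of this installment is to practice extracting information from responses.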

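Building the gatherer:

A simple way to get the filing number from the response:
http://www.beianbeian.com/search/+domain
The response contains the filing number we want, and it sits inside the reverse-lookup link: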
<a href="/search-1/京ICP证030173号">[反查]</a>
Following the approach from the previous installment, the simplest non-greedy match, (.*?), is enough to extract the filing number here.
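As a minimal sketch of that non-greedy match, assuming the response contains the anchor tag exactly as shown above (the variable names are only for illustration):

```python
import re

# Sample line copied from the response shown above.
html = '<a href="/search-1/京ICP证030173号">[反查]</a>'

# (.*?) is non-greedy: it captures as little as possible before the
# closing '">' that is followed by the literal text "[反查]</a>".
pattern = r'<a href="/search-1/(.*?)">\[反查\]</a>'
matches = re.findall(pattern, html)

# findall returns an empty list when nothing matches, so guard the index.
record = matches[0] if matches else None
print(record)
```

The square brackets are escaped because they would otherwise start a character class in the pattern.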
Next, we can run the reverse lookup on the filing number: http://www.beianbeian.com/search-1/京ICP证030173号

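[Screenshot: reverse-lookup results page]
The results page shows that the fields we want are the site name and the site's home page URL.

[Screenshot: HTML source of the results page]
Looking at the page source, each row's site name and URL sit inside a fairly large <tr> tag. A regular expression that matches both fields would be awkward to write, so how should we handle it?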

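Serving up a bowl of beautiful soup
Beautiful Soup is a Python library that makes it easy to extract the content we want from HTML or XML tags.

For example, suppose the HTML in the response we fetched looks like this: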

 
    Some tags look like this:
    <span class="green">ILoveStudy</span>
    while others look like this:
    <span class="red">StudyMakeMeHappy</span>
   

We first fetch the response body and then create a BeautifulSoup object:

 
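    import requests
    from bs4 import BeautifulSoup

    url = "http://www.beianbeian.com/search/baidu.com"  # any page you want to parse
    html = requests.get(url).content
    bsObj = BeautifulSoup(html, "lxml")  # the "lxml" parser needs the lxml package installed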
   

With the BeautifulSoup object in place, we can use find_all to get, say, only the text inside <span class="green"></span> tags:

 
    getlist = bsObj.find_all("span", {"class": "green"})
    for get in getlist:
        print(get.get_text())

    Output:
    ILoveStudy
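To try this without making a network request, here is a small self-contained sketch that parses the two sample tags from an inline string. It uses Python's built-in "html.parser" backend instead of "lxml" so no extra parser package is needed; that swap is my own choice, not part of the walkthrough above.

```python
from bs4 import BeautifulSoup

# The two sample tags from the example above, as an inline string.
html = """
<span class="green">ILoveStudy</span>
<span class="red">StudyMakeMeHappy</span>
"""

# "html.parser" ships with Python, so only bs4 itself must be installed.
soup = BeautifulSoup(html, "html.parser")

# Filter by attribute dict, or equivalently by the class_ keyword.
green = [tag.get_text() for tag in soup.find_all("span", {"class": "green"})]
red = [tag.get_text() for tag in soup.find_all("span", class_="red")]

print(green)
print(red)
```

Both filter styles select tags by their class attribute, so either can be used for the rules later in this post.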
   

Back to the response we fetched earlier: the information we want is in <td> and <div> tags, and the tag attributes follow a regular pattern.

 
    <td style="word-break:break-all;word-wrap:break-word;">
    鸿媒体
    </td>

    <div id="home_url"><div><a href="/go?url=www.hongmeiti.com" target="_blank">www.hongmeiti.com</a></div>

So we can write our matching rules:

 
    namelist = soup.find_all("td", {"style": "word-break:break-all;word-wrap:break-word;"})
    domainlist = soup.find_all("div", {"id": "home_url"})
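These two rules can be checked offline against the sample markup. The sketch below wraps the snippet in a table row and closes the open <div>; that surrounding structure is my assumption about the page, since only the fragment above is known. It also uses the built-in "html.parser" backend rather than "lxml".

```python
from bs4 import BeautifulSoup

# Sample row: the <td> and <div id="home_url"> fragments come from the
# response above; the <table>/<tr> wrapper is assumed for illustration.
html = """
<table>
<tr>
<td style="word-break:break-all;word-wrap:break-word;">
鸿媒体
</td>
<td><div id="home_url"><div><a href="/go?url=www.hongmeiti.com" target="_blank">www.hongmeiti.com</a></div></div></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on the exact style string and on the id attribute.
namelist = soup.find_all("td", {"style": "word-break:break-all;word-wrap:break-word;"})
domainlist = soup.find_all("div", {"id": "home_url"})

# strip() removes the newlines that surround the site name in the markup.
names = [td.get_text().strip() for td in namelist]
domains = [div.get_text().strip() for div in domainlist]

print(names)
print(domains)
```

Matching on the full style string is brittle: if the site changes its inline CSS even slightly, the rule stops matching, so it is worth re-checking against the live page.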

We now have two lists. How do we iterate over both at once?
We can pair them up with zip. A small example:

 
    list1 = [1, 2, 3, 4]
    list2 = [5, 6, 7, 8]
    for (i1, i2) in zip(list1, list2):
        print(i1, i2)

    Output:
    1 5
    2 6
    3 7
    4 8
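One detail worth knowing when the two lists come from a scraped page: zip stops at the shorter list, so an unmatched name or domain is silently dropped. A small sketch with made-up placeholder values shows the difference between zip and itertools.zip_longest:

```python
from itertools import zip_longest

# Placeholder data: three names but only two domains.
names = ["site A", "site B", "site C"]
domains = ["a.example", "b.example"]

# zip stops at the shorter list, so "site C" is dropped.
paired = list(zip(names, domains))

# zip_longest keeps every item and fills the gaps with fillvalue.
padded = list(zip_longest(names, domains, fillvalue=""))

print(paired)
print(padded)
```

If the two lists differ in length, that usually means one of the matching rules picked up extra or missing cells, which is a useful thing to check before trusting the pairs.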
   

Here is a small demo:

 
    # -*- coding: utf-8 -*-
    import re

    import requests
    from bs4 import BeautifulSoup


    def get_record_1(key):
        # Step 1: look up the filing number for the domain.
        url = "http://www.beianbeian.com/search/" + key
        match = r'<a href="(.*?)">\[反查\]</a>'
        # The page is assumed to be UTF-8 so the Chinese link text can match.
        response = requests.get(url=url).content.decode("utf-8")
        print("Querying " + url + ", results:")
        paths = re.findall(match, response)
        if not paths:
            print("No filing number found.")
            return
        # Step 2: the captured path already starts with "/", so append it
        # to the host and run the reverse lookup.
        url = "http://www.beianbeian.com" + paths[0]
        response_1 = requests.get(url).content

        soup = BeautifulSoup(response_1, "lxml")
        namelist = soup.find_all("td", {"style": "word-break:break-all;word-wrap:break-word;"})
        namelist = namelist[2:]  # skip the first two matching cells, as in the original demo

        domainlist = soup.find_all("div", {"id": "home_url"})
        for (name, domain) in zip(namelist, domainlist):
            print(name.get_text().replace("\n", "").strip() + ":" + domain.get_text())


    if __name__ == '__main__':
        key = input("Enter the main domain to look up: ")
        get_record_1(key)
   

Result of a run:

[Screenshot: output of the demo]

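The JSON approach
If you have access to site APIs, for example http://www.sojson.com/api/beian/baidu.com, which is queried directly as an API endpoint, the returned data is usually in JSON format. We can read the JSON we get as a Python dict.

[Screenshot: JSON response from the API]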

 
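    import requests

    key = "baidu.com"
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; headers was not defined in the original snippet
    url = "http://www.sojson.com/api/beian/" + key
    r = requests.get(url=url, headers=headers).json()  # .json() parses the body into a dict
    print(r["sitename"] + " " + r["nowIcp"])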
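To see the "JSON as a dict" idea without calling the API, here is a small sketch that parses a hand-written response body with the json module. Only the key names sitename and nowIcp come from the snippet above; the values and the summarize helper are made up for illustration, and the real API may return other fields or a different shape.

```python
import json

# Hypothetical response body: the key names match the snippet above,
# the values are placeholders.
body = '{"sitename": "Example Site", "nowIcp": "ICP-EXAMPLE-000"}'

# json.loads turns the JSON text into an ordinary Python dict.
data = json.loads(body)


def summarize(record):
    # dict.get avoids a KeyError if the API omits a field.
    site = record.get("sitename", "")
    icp = record.get("nowIcp", "")
    return (site + " " + icp).strip()


print(summarize(data))
print(repr(summarize({})))
```

Using .get with a default keeps the gatherer running when a lookup fails and the API returns an error object without these keys.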
   

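Summary

Some lookup sites were not demonstrated here, and each one's response page needs its own matching rules, so try them as practice. That wraps up this installment on collecting filing numbers for our step-by-step information gatherer. See you next time~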

Reposted from: https://www.cnblogs.com/reborn-blog/p/8007231.html