The previous section reviewed several network protocols and network architectures, and implemented simple TCP/IP and UDP/IP communication examples on top of the TCP/IP protocol family.
This section moves on to network sniffing, port scanners, web page reading, and web crawlers.
3. Designing a Network Sniffer and a Port Scanner
3.1 Network sniffer
A sniffer program can monitor the network traffic and packet flow on the local network segment, which is valuable for network administration. To sniff traffic, the network interface card must be put into promiscuous mode, and the account running the sniffer must have administrator privileges.
Network sniffing requires raw sockets.
Network sniffer program
- The following code runs for 60 s, prints every packet on the local network segment that was not sent by the local host, and counts the number of packets sent by each host.
Note that this program must be run with administrator privileges.
import socket
import threading
import time

activeDegree = dict()
flag = 1

def main():
    global activeDegree
    global flag
    # public network interface
    HOST = socket.gethostbyname(socket.gethostname())
    # create a raw socket and bind it to the public interface
    s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_IP)
    s.bind((HOST, 0))
    # include IP headers in the captured data
    s.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)
    # receive all packets (SIO_RCVALL is a Windows-specific ioctl)
    s.ioctl(socket.SIO_RCVALL, socket.RCVALL_ON)
    # receive packets
    while flag:
        c = s.recvfrom(8888)
        host = c[1][0]
        activeDegree[host] = activeDegree.get(host, 0) + 1
        if c[1][0] != '192.168.0.104':  # assuming 192.168.0.104 is the local host's IP address
            print(c)
    # disable promiscuous mode
    s.ioctl(socket.SIO_RCVALL, socket.RCVALL_OFF)
    s.close()

t = threading.Thread(target=main)
t.start()
time.sleep(60)
flag = 0
t.join()
for item in activeDegree.items():
    print(item)
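Because IP_HDRINCL is set, the tuples printed by the sniffer begin with the raw IP header. A minimal sketch of decoding the fixed 20-byte IPv4 header with struct; the sample header here is hand-built, so it runs without a raw socket or administrator rights:

```python
import socket
import struct

def parse_ipv4_header(data):
    """Unpack the fixed 20-byte IPv4 header from the front of a raw packet."""
    # B: version/IHL, B: TOS, H: total length, H: id, H: flags/fragment,
    # B: TTL, B: protocol, H: checksum, 4s: source IP, 4s: destination IP
    fields = struct.unpack('!BBHHHBBH4s4s', data[:20])
    return {
        'version': fields[0] >> 4,
        'ttl': fields[5],
        'protocol': fields[6],
        'src': socket.inet_ntoa(fields[8]),
        'dst': socket.inet_ntoa(fields[9]),
    }

# a hand-built sample header (the addresses are arbitrary examples)
sample = struct.pack('!BBHHHBBH4s4s',
                     (4 << 4) | 5,          # version 4, header length 5 words
                     0, 40, 1, 0, 64,       # TOS, total length, id, flags, TTL
                     socket.IPPROTO_TCP, 0, # protocol, checksum
                     socket.inet_aton('192.168.0.104'),
                     socket.inet_aton('192.168.0.1'))
print(parse_ipv4_header(sample))
```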
3.2 A multiprocess port scanner
Port scanning is a common technique in network security and hacking: it probes whether particular ports are open on a given host, from which one can infer which network services the host is running and, ultimately, whether potential vulnerabilities exist.
Port scanner program
- The following code models how a scanner works and uses multiple processes to speed up the scan.
import socket
import multiprocessing
import sys

def ports(ports_service):
    """Map well-known port numbers to service names --> dict{port: service}"""
    for port in list(range(1, 100)) + [143, 145, 113, 443, 445, 3389, 8080]:
        try:
            ports_service[port] = socket.getservbyport(port)
        except socket.error:
            pass

def ports_scan(host, ports_service):
    """
    Scan the ports collected by ports() and return the ones that are open.
    :param host: target host
    :param ports_service: dict of {port: service} to probe
    :return: list of open ports
    """
    ports_open = []
    for port in ports_service:
        try:
            # a socket cannot be reused after close(), so create a fresh
            # one for every port
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            # the timeout value affects the accuracy of the scan
            sock.settimeout(0.01)
        except socket.error:
            print('socket creation error')
            sys.exit()
        try:
            # try to connect to the port
            sock.connect((host, port))
            # record the open port
            ports_open.append(port)
        except socket.error:
            pass
        finally:
            sock.close()
    return ports_open

if __name__ == '__main__':
    ports_service = dict()
    results = dict()
    ports(ports_service)
    # create a process pool allowing at most 8 processes to run at once
    pool = multiprocessing.Pool(processes=8)
    net = '127.0.0.'
    for host_number in map(str, range(8, 10)):
        host = net + host_number
        # submit a scan task to the pool and keep its async result
        results[host] = pool.apply_async(ports_scan, (host, ports_service))
        print('starting ' + host + '...')
    # close the pool; close() must be called before join()
    pool.close()
    # wait for all worker processes in the pool to finish
    pool.join()
    # print the results
    print('*' * 50)
    print('Ports probed and their services:')
    for port in ports_service:
        print(port, ":", ports_service[port])
    print('*' * 50)
    print('Open ports (potential security exposure) and their services:')
    for host in results:
        print('=' * 30)
        print(host, '.' * 10)
        for port in results[host].get():
            print(port, ":", ports_service[port])
4. Reading Web Pages and Web Crawlers
4.1 Reading web page content and URL parsing
Python 3.x provides the urllib package for reading web content. It consists mainly of four modules:
- urllib.request
- urllib.response
- urllib.parse
- urllib.error
- The following code shows how to read and display the contents of a given web page.
import urllib.request
fp = urllib.request.urlopen(r'https://www.baidu.com')
lines = fp.readlines()
for line in lines:
print(line)
fp.close()
b'<html>\r\n'
b'<head>\r\n'
b'\t<script>\r\n'
b'\t\tlocation.replace(location.href.replace("https://","http://"));\r\n'
b'\t</script>\r\n'
b'</head>\r\n'
b'<body>\r\n'
b'\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n'
b'</body>\r\n'
b'</html>'
- The following code shows how to read the content of a given URL using the GET method.
import urllib.request
# '?' separates the URL from the request data; the encoded query fields
# after it are what makes this a typical GET request
params = '?s=socket'
url = "https://www.runoob.com/python3/%s" % params
with urllib.request.urlopen(url) as f:
    print(f.read().decode('utf-8'))
<!Doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta property="qc:admins" content="465267610762567726375" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>socket 的搜索結果</title>
<link rel='dns-prefetch' href='//s.w.org' />
<meta name="keywords" content="Python 3 教程">
<meta name="description" content="菜鸟教程'Python 3 教程'..">
<link rel="shortcut icon" href="//static.runoob.com/images/favicon.ico" mce_href="//static.runoob.com/images/favicon.ico" type="image/x-icon" >
<link rel="stylesheet" href="/wp-content/themes/runoob/style.css?v=1.154" type="text/css" media="all" />
<link rel="stylesheet" href="//static.runoob.com/assets/font-awesome/4.7.0/css/font-awesome.min.css" media="all" />
<!--[if gte IE 9]><!-->
<script src="//static.runoob.com/assets/jquery/2.0.3/jquery.min.js"></script>
<!--<![endif]-->
<!--[if lt IE 9]>
<script src="//cdn.staticfile.org/jquery/1.9.1/jquery.min.js"></script>
<script src="//cdn.staticfile.org/html5shiv/r29/html5.min.js"></script>
<![endif]-->
<link rel="apple-touch-icon" href="//static.runoob.com/images/icon/mobile-icon.png"/>
<meta name="apple-mobile-web-app-title" content="菜鸟教程">
</head>
<body>
<!-- 头部 -->
<div class="container logo-search">
<div class="col search row-search-mobile">
<form action="index.php">
<input class="placeholder" placeholder="搜索……" name="s" autocomplete="off">
</form>
</div>
<div class="row">
<div class="col logo">
<h1><a href="/">菜鸟教程 -- 学的不仅是技术,更是梦想!</a></h1>
</div>
<div class="col search search-desktop last">
<form action="//www.runoob.com/" target="_blank">
<input class="placeholder" id="s" name="s" placeholder="搜索……" autocomplete="off">
</form>
</div>
</div>
</div>
<!-- 导航栏 -->
<!-- 导航栏 -->
<div class="container navigation">
<div class="row">
<div class="col nav">
<ul class="pc-nav">
<li><a href="//www.runoob.com/">首页</a></li>
<li><a href="/html/html-tutorial.html">HTML</a></li>
<li><a href="/css/css-tutorial.html">CSS</a></li>
<li><a href="/js/js-tutorial.html">JavaScript</a></li>
<li><a href="/jquery/jquery-tutorial.html">jQuery</a></li>
<li><a href="/bootstrap/bootstrap-tutorial.html">Bootstrap</a></li>
<li><a href="/python3/python3-tutorial.html">Python3</a></li>
<li><a href="/python/python-tutorial.html">Python2</a></li>
<li><a href="/java/java-tutorial.html">Java</a></li>
<li><a href="/cprogramming/c-tutorial.html">C</a></li>
<li><a href="/cplusplus/cpp-tutorial.html">C++</a></li>
<li><a href="/csharp/csharp-tutorial.html">C#</a></li>
<li><a href="/sql/sql-tutorial.html">SQL</a></li>
<li><a href="/mysql/mysql-tutorial.html">MySQL</a></li>
<li><a href="/php/php-tutorial.html">PHP</a></li>
<li><a href="/browser-history">本地书签</a></li>
<li><a style="font-weight:bold;" href="https://www.runoob.com/linux/linux-cloud-server.html" target="_blank" onclick="_hmt.push(['_trackEvent', 'yun', 'click', 'yun'])" title="云服务器">云服务器</a></li>
<!--
<li><a href="/w3cnote/knowledge-start.html" style="font-weight: bold;" onclick="_hmt.push(['_trackEvent', '星球', 'click', 'start'])" title="我的圈子">我的圈子</a></li>
<li><a href="javascript:;" class="runoob-pop">登录</a></li>
-->
</ul>
<ul class="mobile-nav">
<li><a href="//www.runoob.com/">首页</a></li>
<li><a href="/html/html-tutorial.html">HTML</a></li>
<li><a href="/css/css-tutorial.html">CSS</a></li>
<li><a href="/js/js-tutorial.html">JS</a></li>
<li><a href="/browser-history">本地书签</a></li>
<a href="javascript:void(0)" class="search-reveal">Search</a>
</ul>
.....
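The hand-written query string in the GET example can also be produced with urllib.parse.urlencode, which takes care of percent-encoding; a small offline sketch (no request is sent):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# build the query string instead of hand-writing '?s=socket'
params = urlencode({'s': 'socket'})
url = 'https://www.runoob.com/python3/?' + params
print(url)  # https://www.runoob.com/python3/?s=socket

# parse_qs recovers the fields from the query part of the URL
query = parse_qs(urlsplit(url).query)
print(query)  # {'s': ['socket']}
```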
- The following code shows how to submit parameters with the POST method and read the resulting page.
import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
# put the data to send into a dict (the actual contents here are
# arbitrary and do not affect the demonstration)
value = {
    'name': 'BUPT',
    'age': '60',
    'location': 'Beijing'
}
data = urllib.parse.urlencode(value)  # encode value into a standard query string
data = data.encode('ascii')
# send the request with the form data; note that in a POST request the
# body is passed as the second argument of urlopen()
with urllib.request.urlopen(url, data) as f:
    print(f.read().decode('utf-8'))
<!DOCTYPE html>
<!--STATUS OK-->
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<meta content="always" name="referrer">
<script src="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/nocache/imgdata/seErrorRec.js"></script>
<title>页面不存在_百度搜索</title>
<style data-for="result">
body {color: #333; background: #fff; padding: 0; margin: 0; position: relative; min-width: 700px; font-family: arial; font-size: 12px }
p, form, ol, ul, li, dl, dt, dd, h3 {margin: 0; padding: 0; list-style: none }
input {padding-top: 0; padding-bottom: 0; -moz-box-sizing: border-box; -webkit-box-sizing: border-box; box-sizing: border-box } img {border: none; }
.logo {width: 117px; height: 38px; cursor: pointer }
#wrapper {_zoom: 1 }
#head {padding-left: 35px; margin-bottom: 20px; width: 900px }
.fm {clear: both; position: relative; z-index: 297 }
.btn, #more {font-size: 14px }
.s_btn {width: 95px; height: 32px; padding-top: 2px\9; font-size: 14px; padding: 0; background-color: #ddd; background-position: 0 -48px; border: 0; cursor: pointer }
.s_btn_h {background-position: -240px -48px }
.s_btn_wr {width: 97px; height: 34px; display: inline-block; background-position: -120px -48px; *position: relative; z-index: 0; vertical-align: top }
#foot {}
#foot span {color: #666 }
.s_ipt_wr {height: 32px }
.s_form:after, .s_tab:after {content: "."; display: block; height: 0; clear: both; visibility: hidden }
.s_form {zoom: 1; height: 55px; padding: 0 0 0 10px }
#result_logo {float: left; margin: 7px 0 0 }
#result_logo img {width: 101px }
#head {padding: 0; margin: 0; width: 100%; position: absolute; z-index: 301; min-width: 1000px; background: #fff; border-bottom: 1px solid #ebebeb; position: fixed; _position: absolute; -webkit-transform: translateZ(0) }
.....
if (typeof document.addEventListener != "undefined") {
window.addEventListener('resize', bds.util.setFormWidth, false);
document.getElementById('kw').addEventListener('focus', function(){bds.util.setClass(c,'iptfocus', 'add');}, false);
document.getElementById('kw').addEventListener('blur', function(){bds.util.setClass(c,'iptfocus', 'remove');}, false);
} else {
window.attachEvent('onresize', bds.util.setFormWidth, false);
document.getElementById('kw').attachEvent('onfocus', function(){bds.util.setClass(c,'iptfocus', 'add');}, false);
document.getElementById('kw').attachEvent('onblur', function(){bds.util.setClass(c,'iptfocus', 'remove');}, false);
}
})();
...
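The POST mechanics above can be made explicit with urllib.request.Request: the request object records the body and derives the HTTP method from it. A minimal offline sketch that only constructs the request without sending it:

```python
import urllib.parse
import urllib.request

value = {'name': 'BUPT', 'age': '60', 'location': 'Beijing'}
data = urllib.parse.urlencode(value).encode('ascii')
# supplying a data argument is what turns the request into a POST
req = urllib.request.Request('http://www.baidu.com', data=data)
print(req.get_method())  # POST
print(req.data)
```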
- The following code shows how to access a given page through an HTTP proxy.
When collecting large amounts of data, or when a site rate-limits aggressively, some sites ban offending IP addresses, which makes proxy IPs necessary.
This code is adapted from https://www.jianshu.com/p/ea49f886c756
import io
import sys
import random
import time
import requests
from bs4 import BeautifulSoup

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def open_url(url_str, proxy_ip):
    html = ""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive"
    }
    if bool(proxy_ip):
        html = requests.get(url=url_str, headers=headers, proxies=proxy_ip).content
    else:
        html = requests.get(url=url_str, headers=headers).content
    # return the page content; dynamically loaded content needs extra handling
    return html

'''
Usage notes:
Use free (or purchased) proxies to open the given page, simulating
independent-IP visits to the same page by different users.
This demo uses the 89 free proxy list: http://www.89ip.cn/
http://filefab.com/ can be used to check your IP
http_ip:
    edit this list to add more proxies
url_str:
    edit this to the page you want to open
'''
url_str = 'https://vku.youku.com/live/ilproom?spm=a2hlv.20025885.0.0&id=8009372&scm=20140666.manual.65.live_8014311'
print('Page to visit:', url_str)
http_ip = [
    '139.155.41.15',
    '58.220.95.90',
    '59.62.5.220',
    '58.22.177.194',
    '60.13.42.204'
]

'''
Loop: wait for a fixed interval after each visit before visiting again,
to avoid requesting too frequently.
max_count:
    how many visits to make before stopping automatically
sleep_time:
    how long to wait before the next independent-IP visit
'''
flag = True
max_count = 3
sleep_time = 3
print('Will visit the page', url_str, 'a total of', max_count, 'times')
# a simple sequential demo with a delay between requests;
# for concurrency, asyncio/aiohttp could be used
while flag:
    proxy_ip = {
        'http': random.choice(http_ip),
    }
    print('Proxy IP in use:', proxy_ip)
    html = open_url(url_str, proxy_ip)
    # the returned content could be parsed with BeautifulSoup
    print('Length of returned page content:', len(html))
    time.sleep(sleep_time)
    print('Waiting', sleep_time, 'seconds before the next independent-IP request')
    max_count -= 1
    if max_count == 0:
        flag = False
print('Done')
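The same proxying can be done with the standard library alone, via urllib.request.ProxyHandler; a sketch that only installs the handler (the proxy address here is a placeholder, and no request is made):

```python
import urllib.request

# hypothetical proxy address; substitute a working proxy before use
proxy = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})
opener = urllib.request.build_opener(proxy)
# opener.open(url) would now route http:// requests through the proxy
print(type(opener).__name__)  # OpenerDirector
```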
- The standard library urllib.parse provides URL parsing: it supports splitting and joining URLs as well as converting relative addresses to absolute ones.
from urllib.parse import urlparse
o = urlparse('http://localhost:8888/notebooks/Python.html')
print(o.port)
8888
o.hostname
'localhost'
print(o)
ParseResult(scheme='http', netloc='localhost:8888', path='/notebooks/Python.html', params='', query='', fragment='')
# join URLs
from urllib.parse import urljoin
urljoin('http://www.baidu.com/%7Egudi/第一字段.html','第二字段.html')
'http://www.baidu.com/%7Egudi/第二字段.html'
# split a URL
from urllib.parse import urlsplit
url='http://localhost:8888/notebooks/python_Learning/PythonPart2.ipynb'
r1=urlsplit(url)
r1.hostname
'localhost'
r1.geturl()
'http://localhost:8888/notebooks/python_Learning/PythonPart2.ipynb'
r1.netloc
'localhost:8888'
r1.scheme
'http'
4.2 A version-adaptive web crawler
Web crawlers are commonly used to fetch pages or files of interest from the Internet; combined with data processing and analysis techniques, they can yield deeper insights.
- The following code implements a web crawler that collects all links on a given page, with configurable keywords and crawl depth. For more advanced crawling, see the scrapy framework.
"""
网页爬虫程序
"""
import sys
import multiprocessing
import re
import os
try:
# 版本控制
# python3
import urllib.request as lib
python3 = True
except Exception:
# Python2
import urllib as lib
python3 = False
def craw_links(url, depth, keywords, processed):
"""
实现爬虫的功能
:param url:the url to craw
:param depth: the current depth to craw
:param keywords:the tuple of keywords to focus
:param processed: processed pool
:return:
"""
contents = []
if url.startswith('http://') or url.startswith('https://'):
if url not in processed:
# 标记该url为已处理
processed.append(url)
else:
# 避免重复处理同一个url
return
print('Crawing ' + url + '...')
fp = lib.urlopen(url)
if python3:
# Python3 returns bytes,so need to decode
contents = fp.read()
contents_decoded = contents.decode('UTF-8')
else:
# Python2 return str,does not need this decode
contents_decoded = fp.read()
fp.close()
pattern = '|'.join(keywords)
# 如果正在爬取的页面包含了关键词,就将其保存至文件中
flag = False
searched=None
if pattern:
searched = re.search(pattern, contents_decoded)
else:
# 如果未提供要筛选的关键字,就保存当前页
flag = True
print('是否提供关键字:',not flag,"---匹配结果:", searched)
if flag or searched:
if python3:
with open('craw\\' + url.replace(':', '_').replace('/', '_'), 'wb') as fp:
fp.write(contents)
else:
with open('craw\\' + url.replace(':', '_').replace('/', '_'), 'w') as fp:
fp.write(contents_decoded)
# 查找当前页面的所有链接
links = re.findall('href="(.*?)"', contents_decoded)
# 爬取得到的所有链接
for link in links:
# 考虑相对路径
if not link.startswith(('http://', 'https://')):
try:
index = url.rindex('/')
link = url[0:index + 1] + link
except:
pass
# 如果要求爬取的深度大于0,则递归
if depth > 0 and link.endswith(('.htm', '.html')):
craw_links(link, depth - 1, keywords, processed)
if __name__ == '__main__':
processed = []
keywords = ('text', 'author')
if not os.path.exists('craw') or not os.path.isdir('craw'):
os.mkdir('craw')
craw_links(r'http://lab.scrapyd.cn', 1, keywords, processed)
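The link-extraction and relative-path logic at the core of craw_links can be exercised offline; a small sketch on a hand-written HTML snippet (example.com is a placeholder):

```python
import re

# a hand-written snippet standing in for a fetched page
html = '<a href="page1.html">one</a> <a href="http://example.com/page2.html">two</a>'
links = re.findall('href="(.*?)"', html)
print(links)  # ['page1.html', 'http://example.com/page2.html']

# resolve relative links against the page's own URL, as craw_links does
base = 'http://example.com/index.html'
resolved = [l if l.startswith(('http://', 'https://'))
            else base[:base.rindex('/') + 1] + l
            for l in links]
print(resolved)  # ['http://example.com/page1.html', 'http://example.com/page2.html']
```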