[Cyberspace Security Data Mining Techniques] Phishing URL Detection Based on Machine Learning

Latest recommended article published on 2025-04-10 14:35:47

Wxxkrain



Tags: security, data mining, machine learning

Article link: https://blog.csdn.net/LLlanNN_/article/details/133420423


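Assignment

Select malicious and benign URL data to study (feature selection, algorithm selection), and write code to build a model that meets the following requirements:

Print the model's accuracy and recall;
Automatically judge whether a URL supplied as input is safe.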
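
As a minimal sketch of the two metrics the assignment asks for, the function below computes accuracy and recall from true and predicted labels, assuming the labeling used in this post (benign = 0, phishing = 1). The name `accuracy_and_recall` and the sample labels are illustrative only and are not results from the post's model.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall) for binary labels; `positive` marks phishing."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    accuracy = correct / len(y_true)
    # Recall = TP / (TP + FN); defined as 0.0 here when there are no positives
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Made-up labels, used only to exercise the function
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

Recall is the more important of the two for this task, since it measures how many of the phishing URLs are actually caught.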

Dataset

Benign/phishing data, in two CSV files:

The Label column was added by hand: 0 for benign, 1 for phishing.


Source of the phishing URL data
Attribute descriptions:

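Background on URLs

href = protocol + hostname + port + pathname + parameter + fragment

protocol: http or https (followed by a double slash)
hostname: the host's domain name plus second-level (and further multi-level) domains
port: defaults to 80 for http (443 for https) and is hidden when it is the default
pathname: the subdirectory where the file lives
param: the query parameters (dynamic vs. static URLs)
fragment: identifier of the current fragment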
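
The breakdown above can be reproduced with Python's standard `urllib.parse` module. The following is a small sketch; the helper name `split_url` and the example URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_PORTS = {"http": 80, "https": 443}


def split_url(url):
    """Break a URL into the components listed above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        # parts.port is None when the URL leaves the port implicit
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "pathname": parts.path,
        "parameter": parse_qs(parts.query),
        "fragment": parts.fragment,
    }


info = split_url("https://www.example.com/docs/index.html?id=7&lang=en#intro")
for name, value in info.items():
    print(f"{name}: {value}")
```

Note that `urlsplit` only fills in the hostname when the URL carries a scheme (or at least a leading `//`), so scheme-less entries in a dataset would need a scheme prepended first.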

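How to categorize features

Surface features:

The characters and digits the URL is composed of
The overall length in characters
Whether each component is present (for example, whether an anchor exists)
The semantics of the file extension in the URL (static/dynamic page, executable, video, ...)

Deep features:

Tokenization and word count
Information entropy (character complexity)
Word semantics and the relationship between neighboring words; whether the words are common ones
The meaning of the domain suffix (edu: education, com: commercial, ...)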
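
To make a few of these features concrete, here is a hedged sketch that computes the character-level Shannon entropy of a URL alongside some simple surface counts. The function names and the choice of counts are assumptions for illustration, not the feature set used later in this post.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url):
    """A few surface features plus entropy for one URL."""
    return {
        "length": len(url),
        "digits": sum(ch.isdigit() for ch in url),
        "has_fragment": int("#" in url),
        "entropy": shannon_entropy(url),
    }


print(lexical_features("http://example.com/a1b2#top"))
```

Random-looking strings spread their characters more evenly and therefore score a higher entropy than natural-language words, which is why entropy is used as a proxy for character complexity.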

Common features in machine learning and their code implementations

Reference paper: Pratiwi M E, Lorosae T A, Wibowo F W. Phishing site detection analysis using artificial neural network[C]//Journal of Physics: Conference Series. IOP Publishing, 2018, 1140(1): 012048.
Link: https://iopscience.iop.org/article/10.1088/1742-6596/1140/1/012048/pdf

having_IP_Address{ -1,1 }

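# Is an IP address present as the hostname?
import ipaddress as ip

def isip(uri):
    try:
        # ip_address raises ValueError for anything that is not an IPv4/IPv6 literal
        ip.ip_address(uri)
        return 1
    except ValueError:
        return 0

If the input is not an IPv4/IPv6 address, ip_address raises an error and the function returns 0.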
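
Because `ipaddress.ip_address` only accepts a bare address, passing a full URL to a check like the one above always yields 0. The sketch below, under the assumption that the input is a full URL with a scheme, extracts the host first; the name `url_host_is_ip` is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def url_host_is_ip(url):
    """Return 1 if the URL's host is an IPv4/IPv6 literal, else 0."""
    # .hostname drops the port, any userinfo, and the brackets around IPv6
    host = urlsplit(url).hostname
    if not host:
        return 0  # no scheme or no host, so nothing to test
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


print(url_host_is_ip("http://192.168.0.1:8080/login"))
print(url_host_is_ip("https://example.com/login"))
```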

URL_Length{ 1,0,-1 }

# URL_Length{ 1,0,-1 }
def url_len(url):
    length = len(url)
    if length < 54:
        return -1
    elif length < 75:
        return 0
    else:
        return 1

Shortining_Service{ 1,-1 }
Whether the URL uses a URL-shortening service.

import re

def shorten_service(url):
    # Raw strings keep the escaped dots (\.) from triggering invalid-escape warnings
    match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                      r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                      r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                      r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                      r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                      r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                      r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                      r'tr\.im|link\.zip\.net',
                      url)
    if match:
        return 1
    else:
        return -1
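
One caveat with a substring search over the whole URL is that short patterns also match inside unrelated hostnames: `t\.co`, for instance, matches the `t.co` inside `microsoft.com`. A possible alternative, sketched here with a deliberately small and illustrative subset of the services listed above, compares the parsed hostname against a set instead. `SHORTENER_HOSTS` and `shorten_service_by_host` are hypothetical names.

```python
import re
from urllib.parse import urlsplit

# Small illustrative subset of the services in the pattern above
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}


def shorten_service_by_host(url):
    """Return 1 if the URL's host is a known shortener, else -1."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in SHORTENER_HOSTS else -1


# The substring pattern fires on an ordinary hostname...
substring_hit = re.search(r"t\.co", "https://www.microsoft.com/") is not None
# ...while the hostname comparison does not
host_result = shorten_service_by_host("https://www.microsoft.com/")
print(substring_hit, host_result)
```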

having_At_Symbol{ 1,-1 }
Using the "@" symbol in a URL causes the browser to ignore everything before the "@"; the real address usually follows the "@" symbol.

    # Function to check if URL has '@' symbol
    def having_At_Symbol(url):
        if '@' in url:
            return 1
        else:
            return -1

double_slash_redirecting{ -1,1 }
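A double slash ("//") inside the URL path means the user will be redirected to another site. In a normal URL the "//" sits right after the scheme: at the sixth position for http, as in http://amikom.ac.id, and at the seventh position for https, as in https://amikom.ac.id. If the last "//" appears later than the seventh position, the URL may be suspected of phishing.

    # Function to check for double slash redirecting
    def double_slash_redirecting(url):
        # rfind gives the 0-based index of the LAST '//'. The scheme's own '//'
        # is at index 5 (http) or 6 (https), so a later one signals a redirect.
        # find() would only ever see the scheme's '//' and always return -1.
        if url.rfind('//') > 6:
            return 1
        else:
            return -1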

Prefix_Suffix{ -1,1 }
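Phishing URLs often disguise themselves by adding a prefix or suffix to the domain name, joined with "-", for example http://www.amikom-keren.com.

    # Function to check for Prefix Suffix
    # Requires: from urllib.parse import urlparse
    def Prefix_Suffix(url):
        # Only the domain part is inspected; hyphens in paths are common on benign sites.
        # Fall back to the raw string when the URL has no scheme and netloc is empty.
        host = urlparse(url).netloc or url
        if '-' in host:
            return 1
        else:
            return -1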

Having_Sub_Domains{-1,0,1}
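A domain name may include a country-code top-level domain (ccTLD) such as "id", or "ac" for academic institutions; the combination "ac.id" is known as a second-level domain (SLD). Feature extraction starts by removing "www." from the URL and then removing the ccTLD, if present. The remaining dots are then counted: if there is more than one dot, the URL is classified as "suspicious" because it has one subdomain; if there are more than two dots, it is classified as phishing because it has multiple subdomains; if there is no subdomain, the site is classified as legitimate.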

    # Function to classify URL based on the number of dots in the domain part
    def remove_subdomain_ccDTL(url):
        global domain

        # Work on a local copy so the module-level `domain`, which other
        # feature functions compare against, is left unchanged
        host = domain
        if host.startswith("www."):
            host = host[4:]
        domain_segments = host.split(".")
        ccSLDs = ["ac", "co", "gov", "mil", "edu", "org", "net", "int"]
        ccTLDs = ["uk", "us", "ca", "au", "fr", "br", "de", "jp", "cn", "in", "ru", "za", "ch", "nl", "se"]
        # Drop the ccTLD and, if present, the second-level label in front of it
        if len(domain_segments) >= 2 and domain_segments[-1] in ccTLDs:
            domain_segments.pop()
            if len(domain_segments) >= 2 and domain_segments[-1] in ccSLDs:
                domain_segments.pop()
        dots_in_domain = len(domain_segments) - 1
        if dots_in_domain <= 1:
            return -1
        elif dots_in_domain == 2:
            return 0
        else:
            return 1

HTTPS{-1,0,1}

    # Function to check HTTPS with trusted issuer
    # Requires: import socket, ssl
    def check_https_with_trusted_issuer(url):
        global parsed_url

        trusted_issuers = ["DigiCert", "Let's Encrypt", "Comodo", "Symantec"]
        if parsed_url.scheme != "https":
            return 1
        try:
            # Open a TLS connection and read the validated certificate as a dict
            context = ssl.create_default_context()
            with socket.create_connection((parsed_url.hostname, 443), timeout=5) as sock:
                with context.wrap_socket(sock, server_hostname=parsed_url.hostname) as tls:
                    cert = tls.getpeercert()
            # 'issuer' is a tuple of RDNs, each a tuple of (key, value) pairs
            issuer = dict(pair for rdn in cert["issuer"] for pair in rdn)
            issuer_info = issuer.get("organizationName", "")
            if any(trusted in issuer_info for trusted in trusted_issuers):
                # notBefore/notAfter are strings such as 'Jan  5 09:34:43 2018 GMT'
                cert_start = ssl.cert_time_to_seconds(cert["notBefore"])
                cert_end = ssl.cert_time_to_seconds(cert["notAfter"])
                cert_validity_years = (cert_end - cert_start) / (365.0 * 24 * 3600)
                if cert_validity_years >= 1:
                    return -1
                else:
                    return 0
        except Exception as e:
            pass
        return 1

Domain Registration Length{-1,1}
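Phishing sites tend to be short-lived, so a domain whose registration expires within one year is treated as phishing.

    # Function to check domain registration length
    # Requires: import datetime, whois (assumed to be the python-whois package)
    def check_domain_registration_length(url):
        global domain

        try:
            # Query WHOIS once, inside the try, so lookup failures are handled below
            domain_info = whois.whois(domain)
            expiration_date = domain_info.expiration_date
            today = datetime.datetime.now()
            # Some registries return several dates; use the earliest one
            if isinstance(expiration_date, list):
                expiration_date = min(expiration_date)
            if expiration_date is not None:
                days_until_expiration = (expiration_date - today).days
                if days_until_expiration <= 365:
                    return 1
                else:
                    return -1
            else:
                return 0
        except Exception as e:
            return 0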

Favicon{-1, 1}
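A favicon is the image a website uses as its icon, and it also signals the site's identity. If the favicon is loaded from a domain other than the one shown in the address bar, the site can be suspected of phishing.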

# Function to check favicon
    def check_Favicon(url):
        global parsed_url, soup
        try:
            favicon_tag = soup.find("link", rel="icon")
            if favicon_tag:
                favicon_url = favicon_tag.get("href")
                parsed_favicon_url = urlparse(favicon_url)
                if parsed_favicon_url.netloc != '' and parsed_url.netloc != parsed_favicon_url.netloc:
                    return 1
            return -1
        except Exception as e:
            return 0

Port{-1, 1}
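This feature checks which services (for example HTTP) are reachable on the server's ports. Firewalls, proxies, and network address translation (NAT) can block ports automatically and open them only as needed. If all ports are open, however, a phisher can exploit the opening and run any service they want, for example to steal information.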
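
The post gives no code for this feature, and checking it as described would mean probing the server's ports over the network. As a much cheaper stand-in that looks only at the URL string, the sketch below flags URLs that name a port other than the scheme's default. This is an assumption made for illustration, not the feature defined in the referenced paper, and `explicit_port_feature` is a hypothetical name.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def explicit_port_feature(url):
    """Return 1 if the URL names a non-default port, else -1."""
    parts = urlsplit(url)
    try:
        port = parts.port  # None when the URL does not specify a port
    except ValueError:
        return 1  # non-numeric or out-of-range port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return -1
    return 1


print(explicit_port_feature("http://example.com:8080/login"))
```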

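Prediction results

WHOIS queries are very slow, often running until they time out, and many of the URLs are no longer live, so the lookup returns None.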

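Using BeautifulSoup to analyze several attributes of the HTML is also very slow: extracting features for a few dozen training samples takes about ten minutes.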
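
One way to bound this cost, offered as a sketch rather than what the post does, is to run the slow per-URL extractor in a thread pool and substitute a neutral value when a result is late or fails. The helper `extract_all` and its parameters are hypothetical. It also assumes the extractor takes the URL as its only input; the functions in this post read module-level globals (`parsed_url`, `domain`, `soup`), so they would need refactoring before being run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(extractor, urls, timeout_s=5.0, max_workers=16, default=0):
    """Apply a slow per-URL extractor in parallel threads.

    A call that raises, or whose result is not ready within `timeout_s`
    seconds of being waited on, contributes `default` instead.
    """
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = [pool.submit(extractor, url) for url in urls]
    results = []
    for future in futures:
        try:
            results.append(future.result(timeout=timeout_s))
        except Exception:  # includes the timeout raised by result()
            results.append(default)
    # Do not block on stragglers; queued tasks that never started are cancelled.
    # Threads already running are not killed and finish in the background.
    pool.shutdown(wait=False, cancel_futures=True)  # cancel_futures needs Python 3.9+
    return results
```

The timeout limits how long the caller waits for each result, not how long the worker thread runs, so network calls inside the extractor should still set their own timeouts.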

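    # Function to check request URL
    def request_url(url):
        global parsed_url, domain, soup
        try:
            # <a> and <link> keep their target in href; <img> and <script> use src
            all_urls = [tag.get("src") or tag.get("href")
                        for tag in soup.find_all(["a", "img", "script", "link"])]
            all_urls = [u for u in all_urls if u]
            main_domain = domain
            # Relative URLs have an empty netloc and belong to the page's own domain
            external_count = sum(1 for u in all_urls if urlparse(u).netloc not in ("", main_domain))
            total_urls_count = len(all_urls)
            # The share of objects loaded from OTHER domains drives the rule:
            # a low external share is the legitimate case (-1)
            percentage_external = (external_count / total_urls_count) * 100
            if percentage_external < 22:
                return -1
            elif 22 <= percentage_external <= 61:
                return 0
            else:
                return 1
        except Exception as e:
            return 0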

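    # Function to check anchor URL
    def anchor_url(url):
        global parsed_url, domain, soup
        try:
            # Skip <a> tags that have no href at all
            anchor_urls = [link.get("href") for link in soup.find_all("a") if link.get("href")]
            main_domain = domain

            def is_suspicious(href):
                # Empty anchors ('#...', 'javascript:...') still count, as in the original;
                # relative links (empty netloc) stay on the page's own domain
                if href.startswith("#") or href.lower().startswith("javascript:"):
                    return True
                return urlparse(href).netloc not in ("", main_domain)

            different_domain_count = sum(1 for href in anchor_urls if is_suspicious(href))
            total_anchor_count = len(anchor_urls)
            percentage_different_domain = (different_domain_count / total_anchor_count) * 100
            if percentage_different_domain < 31:
                return -1
            elif 31 <= percentage_different_domain <= 67:
                return 0
            else:
                return 1
        except Exception as e:
            return 0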

    # Function to check links in tags
    def links_in_tags(url):
        global parsed_url, domain, soup
        try:
            tags = soup.find_all(["meta", "script", "link"])
            main_domain = domain
            # <script> keeps its target in src and <link> in href;
            # tags with neither attribute carry no link and are skipped
            links = [tag.get("src") or tag.get("href") for tag in tags]
            links = [link for link in links if link]
            # Count links that leave the page's own domain (relative links have an empty netloc)
            external_count = sum(1 for link in links if urlparse(link).netloc not in ("", main_domain))
            percentage_links_in_tags = (external_count / len(links)) * 100
            if percentage_links_in_tags < 17:
                return -1
            elif 17 <= percentage_links_in_tags <= 81:
                return 0
            else:
                return 1
        except Exception as e:
            return 0

    # Function to check server form handler (SFH)
    def sfh(url):
        global parsed_url, domain, soup
        try:
            form_tag = soup.find("form")
            if form_tag is None:
                # No form on the page means there is no handler to judge
                return -1
            sfh_value = form_tag.get("action", "")
            main_domain = domain
            sfh_parsed_url = urlparse(sfh_value)
            sfh_domain = sfh_parsed_url.netloc if sfh_parsed_url.netloc else main_domain
            if sfh_value == "about:blank" or sfh_value == "":
                return 1
            elif sfh_domain != main_domain:
                return 0
            else:
                return -1
        except Exception as e:
            return 0

    # Function to check submitting to email
    def submitting_to_email(url):
        global soup
        try:
            script_tags = soup.find_all("script")
            for script in script_tags:
                script_text = script.get_text()
                if "mail()" in script_text or "mailto:" in script_text:
                    return 1
            return -1
        except Exception as e:
            return 0