A Simple Python Sensitive-Word Filter: A Data Dog's Python Journey, Sensitive-Word Filtering with the DFA Algorithm

Latest recommended article published 2023-09-02 13:08:36

weixin_39964833


The DFA Algorithm

A DFA (Deterministic Finite Automaton) is a good fit for text filtering, although that is only one of its many uses.

Put simply, a DFA takes the current "state" plus an "action" (the next input) and from those two determines the next "state".
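To make the state-plus-action idea concrete, here is a minimal Python 3 sketch of a hand-written transition table. The state names `q0`, `q1`, `q2` and the target string "ab" are made up for illustration and do not come from the article.

```python
# Hypothetical DFA that accepts any string containing "ab".
# Keys are (current_state, input_char); values are the next state.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "a"): "q1",
    ("q1", "b"): "q2",
}
ACCEPTING = {"q2"}


def step(state, char):
    """Return the next state for the pair (state, char)."""
    if state in ACCEPTING:
        return state  # once "ab" has been seen, stay in the accepting state
    # Any pair not listed in the table falls back to the start state.
    return TRANSITIONS.get((state, char), "q0")


def accepts(text):
    """Feed text through the DFA one character at a time."""
    state = "q0"
    for char in text:
        state = step(state, char)
    return state in ACCEPTING
```

Calling `accepts("xaby")` walks q0, q0, q1, q2, q2 and ends in the accepting state, while `accepts("acb")` falls back to q0 on "c" and is rejected. The sensitive-word tree in this article plays the role of the transition table: each node is a state and each character is an action.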

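Consider first the tree that the original figure depicts (the image itself is not reproduced here).

It is a Python dict (or JSON object) in {key: value} form. For example, the key "山" maps to a value with two fields, "children" and "word", which hold its child nodes and the sensitive word respectively. An empty children means the branch ends at that node, and word gives the sensitive word that the branch spells out (the word itself need not be stored; a flag such as isEnd=1 would also do). The children of "山" contain two characters, "东" and "西". "西" has empty children, so the path from 山 to 西 is a sensitive word ("山西"). Likewise, the paths from 山 to 学 ("山东大学") and from 山 to 省 ("山东省") are sensitive words.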

Note that each object's children have the same type as their parent, since they are all nodes of the same tree.
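As a rough illustration of that nested structure, the following Python 3 sketch builds the dict for the three example words. The function name `build_tree` is hypothetical, and the exact field layout is an assumption based on the description above, since the figure is not available.

```python
def build_tree(words):
    """Build the nested {char: {"children": ..., "word": ...}} structure."""
    tree = {}
    for word in words:
        level = tree  # dict of nodes at the current depth
        node = None
        for char in word:
            # Reuse the node for this character if a previous word created it.
            node = level.setdefault(char, {"children": {}, "word": None})
            level = node["children"]
        if node is not None:
            node["word"] = word  # the last node of the path carries the word


    return tree


tree = build_tree(["山西", "山东大学", "山东省"])
```

Here `tree["山"]["children"]` has the two keys "东" and "西", and `tree["山"]["children"]["西"]` has empty children and the word "山西", matching the description above.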

Python Code

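#!/usr/bin/env python
# coding=utf-8
# This script targets Python 2 (print statements, reload(sys), str.decode).
__author__ = 'cuixuewei'
__date__ = '2015-11-18'

import time
import DBUtils  # appears to be a local MySQL helper module; its source is not shown
import re
import sys

# Python 2 only: make UTF-8 the default for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding('utf-8')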

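class Node(object):
    def __init__(self):
        self.children = None  # dict of character -> Node, created on first use
        self.badword = None   # the full sensitive word if one ends at this node
        #self.isEnd = None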

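def add_word(root,word):
    # Walk down from the root, creating one node per character as needed
    node = root
    for i in range(len(word)):
        if node.children == None:
            node.children = {}
            node.children[word[i]] = Node()
        elif word[i] not in node.children:
            node.children[word[i]] = Node()
        node = node.children[word[i]]
    # The node reached by the last character stores the whole word
    node.badword = word
    #node.isEnd = 1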

def init():
    root = Node()
    #result = u"卧槽\n尼玛\n"
    '''#-------------------------------------------
    # Read the words from the database
    db = DBUtils.DBUtils('localhost','root','4521','test')
    db.set_table("base_badwords")
    result = db.select(['words'])
    for line in result:
        # Keep only Chinese characters, English letters and digits
        #li = ''.join(re.findall(re.compile(u'[a-zA-Z0-9\u4e00-\u9fa5]'),line[0]))
        #if li:
        #    add_word(root,li.lower())
        add_word(root,line[0].lower())
    return root
    '''#-------------------------------------------
    #-------------------------------------------
    # Read the words from a file, one word per line
    with open ('/Users/cuixuewei/DFA/badwords.txt','r') as result:
        for line in result:
            # Keep only Chinese characters, English letters and digits
            #li = ''.join(re.findall(re.compile(u'[a-zA-Z0-9\u4e00-\u9fa5]'),line.strip().decode('utf8')))
            #if li:
            #    print li
            #    add_word(root,li.lower())
            if line.strip():
                add_word(root,line.strip().decode('utf8').lower())
    return root
    #'''#-------------------------------------------

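def is_contain(message, root):
    # Try every start position i and walk the tree as far as the text allows
    for i in range(len(message)):
        p = root
        j = i
        while (j < len(message) and p.children != None and message[j] in p.children):
            p = p.children[message[j]]
            j = j + 1
            # Check at every step, so that a word which is a prefix of a
            # longer word is still reported
            if p.badword == message[i:j]:
                #print '--word--',p.badword,'-->',message
                return p.badword
            #if p.isEnd:
                #return message[i:j]
    return 0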

def dfa():
    print '------------dfa start-----------'
    print 'init ...'
    root = init()
    print 'init done!'
    #message = u'卧槽'
    db = DBUtils.DBUtils('localhost','root','4521','test')
    db.set_table("user_profile")  # user table
    result = db.select(['nickname','user_id'])  # fetch nickname and user ID
    print "user count:",len(result)
    # Start timing
    start_time = time.time()
    data = []
    for line in result:
        # Keep only Chinese characters, English letters and digits
        message = ''.join(re.findall(re.compile(u'[a-zA-Z0-9\u4e00-\u9fa5]'),line[0]))
        #print '***message***',len(message)
        res = is_contain(message.lower(),root)
        if res:
            # line is (nickname, user_id), so line[1] is the user_id
            # expected by `fields` below
            data.append([line[1],res,message])
    end_time = time.time()
    # Store the users whose nicknames contain a sensitive word
    # (alternatively, add a flag column to the user table: 0 = normal, 1 = contains a sensitive word)
    db.set_table('bad_user')
    fields = ['user_id','bad_word','nickname']  # user ID, sensitive word, nickname
    db.insertmany(data,fields)
    # Print the elapsed time in seconds
    print (end_time - start_time)

if __name__ == '__main__':
    dfa()

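Key points:

The traversal resembles that of a simple linked list or a binary tree. A linked list advances with p = p->next, moving to one fixed successor at a time, and a binary tree traversal chooses between the left child and the right child. Here, node = node.children[word[i]] walks what is essentially a multi-way tree, in which the keys of each node's children are all distinct. Both when building and when traversing the tree, on reaching a node the code checks whether node.children contains the next character (a branch). If it does, the walk continues; when children is empty, the code checks whether the path spells one of the words being searched for. If one sensitive word can be a prefix of another, this check has to happen at every node along the path, not only at the end of a branch.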
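The walk described above can be sketched in Python 3 without the database and file dependencies. This is an illustrative port, not the author's code: the function name `find_first` and the choice to return `None` when nothing matches are assumptions.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.badword = None  # set on the node where a sensitive word ends


def add_word(root, word):
    """Insert word into the tree, one node per character."""
    node = root
    for char in word:
        node = node.children.setdefault(char, Node())
    node.badword = word


def find_first(message, root):
    """Return the first sensitive word found in message, or None."""
    for i in range(len(message)):
        node = root
        for j in range(i, len(message)):
            node = node.children.get(message[j])
            if node is None:
                break  # no branch for this character; try the next start
            if node.badword is not None:
                return node.badword
    return None


root = Node()
for w in ["山西", "山东大学", "山东省"]:
    add_word(root, w)
```

With this tree, `find_first("我来自山东省济南", root)` returns "山东省", while `find_first("山东", root)` returns `None` because the path 山, 东 does not end a word.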

Result: with 541 sensitive words in the dictionary and 1,576,684 usernames to check (VARCHAR(60) in MySQL), the run took 20.91 seconds.

As for time complexity, it depends on the depth of the tree and on the length of the text being checked.
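One way to see this dependence is to count child lookups directly. The Python 3 sketch below is only an illustration under simple assumptions: a plain nested-dict tree, a hypothetical `"$end"` marker key, and a scan that never stops early. For a text of length n and a tree of depth d, each start position makes at most d + 1 lookups (d successful steps plus one failed step), so the total is at most n(d + 1), and the number of words in the dictionary does not appear in the bound.

```python
def build(words):
    """Nested-dict tree; "$end" is a hypothetical end-of-word marker."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node["$end"] = True
    return root


def count_lookups(message, root):
    """Scan from every start position and count child lookups."""
    lookups = 0
    for i in range(len(message)):
        node = root
        for j in range(i, len(message)):
            lookups += 1
            node = node.get(message[j])
            if node is None:
                break  # left the tree; move on to the next start position
    return lookups


root = build(["aaa"])          # depth d = 3
worst = count_lookups("aaaaaa", root)  # n = 6, every start walks the tree
best = count_lookups("bbbbbb", root)   # n = 6, every start fails at once
```

Here `worst` is 18, which stays within the bound 6 × (3 + 1) = 24, and `best` is 6, one failed lookup per start position.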
