两个大表的个别字段的模糊匹配查询

最新推荐文章于 2021-12-09 10:02:44 发布

dyh4201

最新推荐文章于 2021-12-09 10:02:44 发布

阅读量7.6k

点赞数

分类专栏：数据库文章标签：大表 like instr full text

本文链接：https://blog.csdn.net/dyh4201/article/details/79759364

版权

数据库专栏收录该内容

7 篇文章 0 订阅

订阅专栏

需求：

有两个表A，和B，分别为300万和400万数据，B中path字段包含目录信息，需要匹配A中的path信息，及B.path in A.path且需要包含信息。

问题：

1 如果直接关联查询，如用like，instr，查询非常慢，执行一天都执行不出来结果。执行计划无法使用索引，直接全表扫描，效率奇差

2 采用 full text索引后，搜索出来的结果都是模糊查询，一大堆结果，所以无法联合查询

解决办法:

1 修改表引擎为MyISAM，设置表A.path字段为full text索引，采 “IN BOOLEAN MODE”查询

2 采用生产者消费者模式，然后逐一查询，

做个记号，不知道有没有其他更好的方法，目前依旧很慢

#encoding:utf-8
import sys
sys.path.append('/home/fastqweb/dyh/script')
reload(sys)
sys.setdefaultencoding('utf-8')
from setting import *
from util.mysqlclient import mysqlClient
from util.log import Log
loghander = Log(log_path, 'matchinfo.log')
import Queue
import time
import threading
mutex = threading.Lock()


def mysql_connect(loghander):
    mysql_db = mysqlClient(mysql_config['IP'], mysql_config['USER'], mysql_config['PASSWORD'], mysql_config['DB'], mysql_config['PORT'])
    try:
        mysql_db.open()
    except Exception, e:
        print e
        loghander.error('mysql can not connect,the ip is {}, the user is {}, the password is {}, '
                        'the db is {}, the port is {}'.format(mysql_config['IP'], mysql_config['USER'],
                          mysql_config['PASSWORD'], mysql_config['DB'], mysql_config['PORT']))
        exit(0)
    return mysql_db
def get_gf_histroy_data(q):
    print "I am hear"
    mysql_db = mysql_connect(loghander)
    sql = 'select * from B'
    path_cursor = mysql_db.select(sql)
    while True:
        text = path_cursor.fetchone()
        if text:
            while q.full():
              print 'full'
              time.sleep(5)
            q.put(text)
        else:
            q.put('FINISH')
            break

def generate_searchword(dir):
    b = dir.split('/')
    data = []
    for item in b:
        if item:
            item = '+' + item
            data.append(item)
    return ' '.join(data)

def get_sample_machine(q):
    print "I am in"
    mysql_db = mysql_connect(loghander)
    while True:
        mutex.acquire()
        status = q.empty()
        if not status:
            text = q.get()
            mutex.release()
            if text and text != 'FINISH':
                backup_dir = text[8]
                if backup_dir:
                    new_dir = generate_searchword(backup_dir)
                    path_sql = "select * from A where match(machine_path) against(\'{}\'IN BOOLEAN MODE)".format(new_dir)
                    path_cursor = mysql_db.select(path_sql)
                    result_text = path_cursor.fetchall()
                    if result_text:
                        for item in result_text:
                            if backup_dir in  item[12]:
                                mutex.acquire()
                                loghander.info('path match result:' + str(item ))
                                mutex.release()
            if text == 'FINISH':
                    break
        else:
            mutex.release()



if __name__ == '__main__':
    loghander.info('begin ')
    q = Queue.Queue(maxsize=1000)
    #g1 = gevent.spawn(get_gf_histroy_data, q)
    tread_list =[threading.Thread(target=get_gf_histroy_data, args=(q,))]
    for i in range(100):
        tread_list.append(threading.Thread(target=get_sample_machine, args=(q,)))
    for item in tread_list:
        item.start()
    for item in tread_list:
        item.join()
    loghander.info('finish')