使用python处理提取sql文件中的表_用python解析sql提取表名

最新推荐文章于 2024-07-21 16:08:46 发布

冯起升

最新推荐文章于 2024-07-21 16:08:46 发布

阅读量4.7k

点赞数 3

文章标签：使用python处理提取sql文件中的表

本文链接：https://blog.csdn.net/weixin_33196646/article/details/113552061

版权

python解析sql提取表名

本文介绍三种用python解析sql提取表名的方法，分别是正则表达式提取，使用sqlparse 库进行提取，使用sql_metadata进行提取。

解析sql提取出表名，在特定场景下是非常有应用价值的操作，假设你提供了一个供用户执行sql的客户端，不论是c/s还是b/s架构，都可能会面临一个强烈的需求，控制不同的人访问表的权限。不同的用户，可以访问不同的表，这种权限控制是十分常见的要求，解决的思路之一便是解析sql，从sql语句中提取出表名，然后根据用户是否拥有该表的操作权限来决定这个sql是否可以执行。

目标很明确，思路很清晰，but，从sql语句中提取出表名却并不是一件容易的事情，调研了几日，总算是找到一种相对靠谱的方法。

1. 正则表达式提取

通过正则表达式来提取sql语句中的表名，听起来是一个不错的选择，来看示例代码

import ply.lex as lex, re

def extract_table_name_from_sql(sql_str):

# remove the /* */ comments

q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)

# remove whole line -- and # comments

lines = [line for line in q.splitlines() if not re.match("^\s*(--|#)", line)]

# remove trailing -- and # comments

q = " ".join([re.split("--|#", line)[0] for line in lines])

# split on blanks, parens and semicolons

tokens = re.split(r"[\s)(;]+", q)

# scan the tokens. if we see a FROM or JOIN, we set the get_next

# flag, and grab the next one (unless it's SELECT).

result = set()

get_next = False

for token in tokens:

if get_next:

if token.lower() not in ["", "select"]:

result.add(token)

get_next = False

get_next = token.lower() in ["from", "join"]

return result

print(extract_table_name_from_sql("select * from user"))

print(extract_table_name_from_sql("select * from user left join teacher on user.teacher_id=teacher.id"))

输出结果

{'user'}

{'teacher', 'user'}

仅仅看着两条测试语句，效果还不错，可是随着测试的深入，就会发现至少这段代码有许多情况无法处理，比如下面两个sql

print(extract_table_name_from_sql("select user.name, teacher.name from user,teacher"))

print(extract_table_name_from_sql("update user set age = 14 where user.id= 100"))

输出结果为

{'user,teacher'}

set()

不能因为这段示例代码不能有效提取sql中的表名就否定使用正则表达式的思路，但几天的搜索下来，确实没有找到让我满意的方法。

2.sqlparse

sqlparse 是一个python的专门用于解析sql的库，它的解析相比于正则就要高级了一些，它可以区分出哪些是关键字，哪些是标识符，不需要像正则那样去处理注释，而且官方给出了一个抽取表名的示例

import sqlparse

from sqlparse.sql import IdentifierList, Identifier

from sqlparse.tokens import Keyword, DML

def is_subselect(parsed):

if not parsed.is_group:

return False

for item in parsed.tokens:

if item.ttype is DML and item.value.upper() == 'SELECT':

return True

return False

def extract_from_part(parsed):

from_seen = False

for item in parsed.tokens:

if from_seen:

if is_subselect(item):

yield from extract_from_part(item)

elif item.ttype is Keyword:

return

else:

yield item

elif item.ttype is Keyword and item.value.upper() == 'FROM':

from_seen = True

def extract_table_identifiers(token_stream):

for item in token_stream:

if isinstance(item, IdentifierList):

for identifier in item.get_identifiers():

yield identifier.get_name()

elif isinstance(item, Identifier):

yield item.get_name()

# It's a bug to check for Keyword here, but in the example

# above some tables names are identified as keywords...

elif item.ttype is Keyword:

yield item.value

def extract_tables(sql):

stream = extract_from_part(sqlparse.parse(sql)[0])

return list(extract_table_identifiers(stream))

if __name__ == '__main__':

sql = """

select K.a,K.b from (select H.b from (select G.c from (select F.d from

(select E.e from A, B, C, D, E), F), G), H), I, J, K order by 1,2;

"""

tables = ', '.join(extract_tables(sql))

print('Tables: {}'.format(tables))

输出结果

Tables: A, B, C, D, E, F, G, H, I, J, K

看起来也还不错哦，可还是有严重缺陷

if __name__ == '__main__':

sql = "select * from user as u where u.id = 99"

tables = ', '.join(extract_tables(sql))

print('Tables: {}'.format(tables))

对于这样的语句，什么都解析不出来

3. sql_metadata

sql_metadata 是一个基于sqlparse的sql解析库，提供了get_query_tables 方法，从sql中提取表名的效果是所有调研的解析方法中最优质的

import sql_metadata

sql = ' select x1,x2 from liepin.a as atable left join b on atable.id = b.id right join c on c.id = atable.id'

sql = ' '.join(sql.split())

print(sql_metadata.get_query_tables("select * from user, user2 left join c on user.id = c.id right join d on d.id = e.id"))

print(sql_metadata.get_query_tables(sql))

print(sql_metadata.get_query_tables("select x1, x2 from (select x1, x2 from (select x1, x2 from apple.a)) left join orange.b as ob on a.id=ob.id where b.id in (select id from f)"))

print(sql_metadata.get_query_tables("select * from user as u where u.id = 99"))

输出结果

['user', 'user2', 'c', 'd']

['liepin.a', 'b', 'c']

['apple.a', 'orange.b', 'f']

['user']

get_query_tables方法的效果是最好的，但也不是没有瑕疵，这个问题出在sqlparse上，在解析sql时，'left join'会被解析成一个token，sql_metadata设置了一个关键词列表

table_syntax_keywords = [

# SELECT queries

'FROM', 'WHERE', 'JOIN', 'INNER JOIN', 'FULL JOIN', 'FULL OUTER JOIN',

'LEFT OUTER JOIN', 'RIGHT OUTER JOIN',

'LEFT JOIN', 'RIGHT JOIN', 'ON',

# INSERT queries

'INTO', 'VALUES',

# UPDATE queries

'UPDATE', 'SET',

# Hive queries

'TABLE', # INSERT TABLE

]

如果left 和 join之间存在多个空格，就无法匹配这里面的关键词，导致解析不正确，因此需要在解析前，对sql进行预处理。虽然sqlparse也提供了sql美化的功能，但并不能处理left join 之间的空格。我在上面的代码中使用了比较讨巧的方法，使用split函数对sql字符串进行切分，再用空格连接，可以得到比较理想的效果。

4. 对sql进行处理

尽管sql_metadata已经表现的很优异，但仍有瑕疵，如何sql语句中存在注释，sqlparsse在解析sql语句时就会出现问题导致解析结果的不准确，因此需要对sql进行清洗，这里就借鉴了正则表达式的方法，去除那些无用的注释

def clean(sql_str):

# remove the /* */ comments

q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)

# remove whole line -- and # comments

lines = [line for line in q.splitlines() if not re.match("^\s*(--|#)", line)]

# remove trailing -- and # comments

q = " ".join([re.split("--|#", line)[0] for line in lines])

q = ' '.join(q.split())

return q

在抽取table之前，先对数据进行清洗，避免注释影响结果

冯起升

关注

3
点赞
踩
27

收藏

觉得还不错? 一键收藏
0
评论
使用python处理提取sql文件中的表_用python解析sql提取表名

python解析sql提取表名本文介绍三种用python解析sql提取表名的方法，分别是正则表达式提取，使用sqlparse 库进行提取，使用sql_metadata进行提取。解析sql提取出表名，在特定场景下是非常有应用价值的操作，假设你提供了一个供用户执行sql的客户端，不论是c/s还是b/s架构，都可能会面临一个强烈的需求，控制不同的人访问表的权限。不同的用户，可以访问不同的表，这种权限控制...
复制链接

扫一扫