I have a string like this:
my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"
and a list like this:
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']
I want to extract every entry of my_list that could be matched by a word in my_string.
This is what I expect:
['PHP', 'Software-Engineering', 'C', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']
This is what I tried:
import re
try:
    user_inps = re.findall(r'\w+', my_string)
    extracted_inputs = set()
    for user_inp in user_inps:
        if user_inp.lower() in set(map(lambda x: x.lower(), my_list)):
            extracted_inputs.add(user_inp)
except Exception:
    extracted_inputs = set()
But I get this:
['php', 'C']
Efficiency is also a concern for me. Any help would be appreciated.
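Solution
If you want to avoid using re, you can do most of this in plain Python. It should be plenty fast for lists on the order of hundreds of thousands of words.
Basic plan: clean up the punctuation, tokenize everything, and use sets for the matching. For a small application, you could adjust the tokens in the keyword table to leave out words such as "and", so that they are not looked up.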
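The "clean up the punctuation" step of this plan can be made less input-specific than stripping only commas and hyphens. Here is a minimal sketch, assuming every ASCII punctuation character should act as a separator. The helper name tokenize is hypothetical and not part of the answer's code:

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
# Assumption: punctuation never carries meaning in the input.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def tokenize(text):
    """Return the set of lowercase tokens in text, ignoring punctuation."""
    return set(text.translate(_PUNCT_TO_SPACE).lower().split())

print(sorted(tokenize("find php, software-engineering. #C")))
# ['c', 'engineering', 'find', 'php', 'software']
```

One trade-off: with this tokenizer 'C#', '#C' and 'C' all reduce to the same token 'c', so the two list entries can no longer be told apart.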
my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

# make table of tokens : phrases
keywords = {}
for word in my_list:
    # split each word into tokens
    tokens = {w.lower() for w in word.replace('-',' ').split()}
    for t in tokens:
        keywords[t] = word

# tokenize the string my_string
# note: this is specifically tailored to your input with commas and hyphens, you may need to
# make this more universal
my_string_tokens = {t.lower() for t in my_string.replace(',','').replace('-',' ').split()}

# now you can just intersect the sets, which is much more efficient than nested looping
matches = my_string_tokens & set(keywords.keys())
for match in matches:  # do what you want here...
    print(f'token: {match:20s}-> {keywords[match]}')
This produces (the order may vary, since sets are unordered):
token: queries             -> Databases and Queries
token: php                 -> PHP
token: oracle              -> Oracle Cload
token: engineering         -> Software-Engineering
token: databases           -> Databases and Queries
token: software            -> Software-Engineering
token: and                 -> Databases and Queries
token: security            -> IT-Security market
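The output above has one line per matching token, and "and" on its own is enough to pull in 'Databases and Queries'. To get closer to the list the question asks for, one could skip a few stop words when building the table and collapse the matches into phrases. This is a sketch only: STOP_WORDS and match_phrases are hypothetical names, and the stop-word set is an assumption to adjust for your data.

```python
STOP_WORDS = {'and', 'or', 'the', 'of', 'in'}  # assumed list; tune as needed

def match_phrases(text, phrases, stop_words=STOP_WORDS):
    """Return the phrases that share at least one non-stop-word token with text."""
    # token -> set of phrases containing it; a set is used so that two
    # phrases sharing a token do not overwrite each other
    keywords = {}
    for phrase in phrases:
        for token in phrase.lower().replace('-', ' ').split():
            if token not in stop_words:
                keywords.setdefault(token, set()).add(phrase)

    # same input-specific cleanup as above: drop commas, split on hyphens
    text_tokens = set(text.lower().replace(',', '').replace('-', ' ').split())

    hit = set()
    for token in text_tokens & keywords.keys():
        hit |= keywords[token]

    # report each phrase once, in the order it appears in phrases
    return [p for p in phrases if p in hit]

my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

print(match_phrases(my_string, my_list))
# ['Software-Engineering', 'PHP', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']
```

'C' is still missing from the result because "#C" stays in the string as the token "#c". Stripping more punctuation would catch it, but 'C#' and 'C' would then become the same token.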