There's a logfile with text in the form of space-separated key=value pairs, and each line was originally serialized from data in a Python dict, something like:
' '.join([f'{k}={v!r}' for k,v in d.items()])
The keys are always just strings. The values could be anything that ast.literal_eval can successfully parse, no more no less.
How to process this logfile and turn the lines back into Python dicts? Example:
>>> to_dict("key='hello world'")
{'key': 'hello world'}
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
Here is some extra context about the data:
Input lines are well-formed (e.g. no dangling brackets)
The data is trusted (unsafe functions such as eval, exec, yaml.load are OK to use)
Order is not important. Performance is not important. Correctness is important.
Edit: As requested in the comments, here is an MCVE and an example code that didn't work correctly
>>> def to_dict(s):
... s = s.replace(' ', ', ')
... return eval(f"dict({s})")
...
...
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'} # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234} # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'} # Incorrect, the value was corrupted
解决方案
Your input can't be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. This makes things a bit easier than they might otherwise be.
The only place = tokens can appear in your input is as key-value separators; at least for now, ast.literal_eval doesn't accept anything with = tokens in it. We can use the = tokens to determine where the key-value pairs start and end, and most of the rest of the work can be handled by ast.literal_eval. Using the tokenize module also avoids problems with = or backslash escapes in string literals.
import ast
import io
import tokenize
def todict(logstring):
# tokenize.tokenize wants an argument that acts like the readline method of a binary
# file-like object, so we have to do some work to give it that.
input_as_file = io.BytesIO(logstring.encode('utf8'))
tokens = list(tokenize.tokenize(input_as_file.readline))
eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']
names = [tokens[i-1][1] for i in eqsign_locations]
# Values are harder than keys.
val_starts = [i+1 for i in eqsign_locations]
val_ends = [i-1 for i in eqsign_locations[1:]] + [len(tokens)]
# tokenize.untokenize likes to add extra whitespace that ast.literal_eval
# doesn't like. Removing the row/column information from the token records
# seems to prevent extra leading whitespace, but the documentation doesn't
# make enough promises for me to be comfortable with that, so we call
# strip() as well.
val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
for start, end in zip(val_starts, val_ends)]
vals = [ast.literal_eval(val_string) for val_string in val_strings]
return dict(zip(names, vals))
This behaves correctly on your example inputs, as well as on an example with backslashes:
>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}
Incidentally, we probably could look for token type NAME instead of = tokens, but that'll break if they ever add set() support to literal_eval. Looking for = could also break in the future, but it doesn't seem as likely to break as looking for NAME tokens.