What I'm trying to do is replace each occurrence of a string "keyword" with a marked-up copy of "keyword" in a larger string.
Example:
myString = "HI there. You should higher that person for the job. Hi hi."
keyword = "hi"
The result I want is:
result = "HI there. You should higher that person for the job. Hi hi."
I won't know the keyword until the user types it,
and I won't know the corpus (myString) until the query is run.
I found a solution that works most of the time but produces some false positives:
it also matches the "hi" inside "higher", which is not what I want. Also note that I
am trying to preserve the case of the original text, and the matching should be
case-insensitive. So if the keyword is "hi", it should match both HI and hi
and leave each in its original case.
The closest I have come is using a slightly modified version of this recipe:
http://code.activestate.com/recipes/576715/
but I still could not figure out how to make a second pass over the string to fix the false positives mentioned above.
Another option is NLTK's WordPunctTokenizer (which simplifies some things, like punctuation),
but I'm not sure how I would put the sentences back together, given that it does not
have a reverse function and I want to keep the original punctuation of myString. Essentially, concatenating all the tokens does not return the original
string. For example, if the original text had "7 - 7", I would not want it to become "7-7" when regrouping the tokens.
I hope that was clear enough. It seems like a simple problem, but it has turned out to be a little more difficult than I thought.
Solution
This ok?
>>> import re
>>> myString = "HI there. You should higher that person for the job. Hi hi."
>>> keyword = "hi"
>>> search = re.compile(r'\b(%s)\b' % re.escape(keyword), re.I)  # escape user input so regex metacharacters stay literal
>>> search.sub('\\1', myString)
'HI there. You should higher that person for the job. Hi hi.'
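The key to the whole thing is using word boundaries (\b, so "hi" does not match inside "higher"), a capturing group (so \1 reinserts the matched text in its original case), and the re.I flag (for case-insensitive matching).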
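To make the wrapping step concrete, here is a minimal sketch of the same idea as a reusable function. The function name `highlight` and the `<b>`/`</b>` tags are assumptions chosen for illustration, since the text does not show which markup should surround the match. It also swaps `\b` for lookarounds, which is a variation on the approach above rather than part of it, so that keywords beginning or ending with punctuation can still match as whole words.

```python
import re


def highlight(text, keyword, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word, case-insensitive matches of keyword in tags.

    The tag defaults are hypothetical; pass whatever markup you need.
    """
    # re.escape keeps characters such as "." or "+" in user input literal.
    # (?<!\w) and (?!\w) require that no word character touches the match,
    # which behaves like \b for ordinary words but also works when the
    # keyword itself starts or ends with punctuation (e.g. "c++").
    pattern = re.compile(r"(?<!\w)(%s)(?!\w)" % re.escape(keyword), re.IGNORECASE)
    # A function replacement reuses the matched text, so the original case
    # is preserved and the tags are never interpreted as escape sequences.
    return pattern.sub(lambda m: open_tag + m.group(1) + close_tag, text)


my_string = "HI there. You should higher that person for the job. Hi hi."
print(highlight(my_string, "hi"))
```

With the example string, "HI", "Hi" and "hi" are wrapped while "higher" is left untouched, and everything outside the matches (spacing, punctuation) is copied through unchanged because the substitution works on the original string rather than on tokens.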