What is the correct way to count English words in a document using a regular expression?
I tried with:
import re
words = re.findall(r'\w+', open('text.txt').read().lower())
len(words)
but it seems I am missing a few words (compared to the word count in gedit).
Am I doing it right?
Thanks a lot!
Solution
Using \w+ won't correctly count words that contain apostrophes or hyphens; e.g. "can't" will be counted as 2 words ("can" and "t"). It will also count numbers (strings of digits): "12,345" and "6.7" will each count as 2 words ("12" and "345", "6" and "7").
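One way to address both points is a pattern that matches runs of letters optionally joined by a single apostrophe or hyphen, and skips digits entirely. This is only a sketch under an assumed definition of "word" (ASCII letters only); the names WORD_RE and count_words and the sample string are made up for illustration, and the result will not necessarily match gedit's count, since gedit may define a word differently.

```python
import re

# Assumed definition of a "word": one or more ASCII letters, optionally
# followed by groups of a single apostrophe or hyphen plus more letters
# (e.g. "can't", "well-known"). Digit strings are not matched at all.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def count_words(text):
    """Return the number of matches of WORD_RE in text."""
    return len(WORD_RE.findall(text))

# Hypothetical sample text, chosen to show the difference from \w+.
sample = "I can't find 12,345 well-known words, or 6.7 of them."
print(count_words(sample))               # count under the pattern above
print(len(re.findall(r"\w+", sample)))   # count with the original \w+
```

To count a file instead of a string, pass count_words(open('text.txt').read()). Note that this pattern ignores accented letters and typographic apostrophes (’), so it would need adjusting for text that contains them.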