I want to extract all the words that are between single quotation marks from a text file. The text file looks like this:
u'MMA': 10,
u'acrylic': 19,
u'acting lessons': 2,
u'aerobic': 141,
u'alto': 24,
u'art therapy': 4,
u'ballet': 939,
u'ballroom': 234,
u'banjo': 38,
And ideally, my output would look like this:
MMA,
acrylic,
acting lessons,
...
From browsing posts, it seems I should use some combination of NLTK and regex in Python to accomplish this. I've tried the following:
import re
file = open('artsplus_categories.txt', 'r').readlines()
for line in file:
    list = re.search('^''$', file)
file.close()
And get the following error:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 142, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
I think the error might be caused by how I'm looking for the pattern. My logic is that I search for everything inside of the '....'.
What's tripping up re.py?
Thanks!
--------------------------------
Following Ashwini's comment:
import re
file = open('artsplus_categories.txt', 'r').readlines()
for line in file:
    list = re.search('^''$', line)
    print list
#file.close()
But the output contains nothing:
Samuel-Finegolds-MacBook-Pro:~ samuelfinegold$ /var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup\ At\ Startup/artsplus_categories_clean-393952531.278.py.command ; exit;
None
logout
@Rasco: here's the error I'm getting:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
logout
I'm using this code:
file2 = open('artsplus_categories.txt', 'r').readlines()
list = re.findall("'[^']*'", file2)
for x in list:
    print (x)
Solution
You passed the whole list (file) to the regex instead of the current line. re.search expects a string, so pass line to it, not file.
for line in file:
    lis = re.search('^''$', line) # line not file
Don't use list or file as variable names. They shadow Python built-ins.
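Even with line passed in, the pattern '^''$' will not find the quoted words, which is why the second attempt printed None. Python joins adjacent string literals, so '^''$' is just '^$', and that only matches an empty line. The sketch below illustrates both problems on an in-memory sample. The names pattern, sample_line and lines are made up for the illustration, and print() is written so it runs on Python 2 and 3 alike.

```python
import re

# Adjacent string literals are concatenated, so this is the pattern '^$'.
pattern = '^''$'

# One line of the sample data, as readlines() would return it.
sample_line = "u'MMA': 10,\n"

# '^$' matches only an empty line, so a line with content gives None.
empty_match = re.search(pattern, sample_line)
print(empty_match)

# Passing the whole list instead of a single line raises TypeError,
# which is the error reported in the question.
lines = [sample_line, "u'acrylic': 19,\n"]
try:
    re.search(pattern, lines)
    raised = False
except TypeError:
    raised = True
print(raised)
```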
Update:
with open('artsplus_categories.txt') as f:
    for line in f:
        print re.search(r"'(.*)'", line).group(1)
...
MMA
acrylic
acting lessons
aerobic
alto
art therapy
ballet
ballroom
banjo
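As a possible variation, re.findall can pull every quoted word out of one string, which also covers lines that hold no quoted text or more than one. This is only a sketch: extract_quoted, QUOTED and sample are hypothetical names, the sample string stands in for the contents of the file, and the pattern assumes the quoted words never contain an apostrophe themselves.

```python
import re

# Capture the text between a pair of single quotes.
# [^']* cannot run past the closing quote, unlike a greedy .*
QUOTED = re.compile(r"'([^']*)'")

def extract_quoted(text):
    """Return every single-quoted string found in text, in order."""
    return QUOTED.findall(text)

# Stand-in for the file contents (assumption: same layout as the question).
sample = "u'MMA': 10,\nu'acrylic': 19,\nu'acting lessons': 2,\n"

for word in extract_quoted(sample):
    print(word)
```

To run it on the real file, pass a single string such as f.read() rather than the list from readlines(); findall, like search, expects a string, which is what the TypeError in the question is complaining about.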