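I found several threads on this topic and came across this solution:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
This is supposed to strip all punctuation except ', but the problem is that it also strips everything else from the sentence.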
Example:
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
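A likely explanation for the output above, offered as a sketch rather than a confirmed diagnosis: the standard-library re module does not support Unicode property escapes such as \P{P}. Under Python 2, an unknown escape like \P was read as a literal "P", so the class [^\P{P}'] would mean "any character other than P, {, } and '". The snippet below reproduces that reading in Python 3 (an assumption, since the original session is Python 2) by spelling the class out literally; the variable name stripped is only illustrative.

```python
import re

sentence = ("warhol's art used many types of media, including hand drawing, "
            "painting, printmaking, photography, silk screening, sculpture, "
            "film, and music.")

# Literal spelling of what Python 2's re made of [^\P{P}']:
# a negated class containing only the characters P, {, } and '.
# Everything that is not one of those four characters is deleted.
stripped = re.sub(r"[^P{}']+", "", sentence)

print(stripped)  # prints a single apostrophe: '
```

Because the sentence contains none of P, { or }, only the apostrophe survives, which matches the result shown in the session.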
Of course, what I want is the sentence without punctuation, but with "warhol's" left as it is.
Desired output:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
Edit:
I also tried:
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)
But this strips every punctuation mark, including the ones I want to keep.
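One way the translation-table attempt might be adapted is to leave the wanted characters out of the table. This is a minimal sketch, assuming Python 3 (hence range and chr instead of xrange and unichr) and assuming that ' and - are the only marks to preserve; KEEP, PUNCT_TABLE and strip_punctuation are hypothetical names introduced here for illustration.

```python
import sys
import unicodedata

# Assumption: these are the only punctuation characters to preserve.
KEEP = {"'", "-"}

# Map every code point whose Unicode category starts with "P" (punctuation)
# to None so that str.translate deletes it, skipping the characters in KEEP.
PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}


def strip_punctuation(text):
    """Return text with all Unicode punctuation removed except KEEP."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation("warhol's art used many types of media, including film, and music."))
print(strip_punctuation("austro-hungarian empire"))
```

Building the table walks the whole Unicode range once, so it is worth constructing it a single time at module level and reusing it for every sentence.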