Python: Getting "TypeError: expected string or bytes-like object" when calling a function

I have a text file which was converted to a dataframe using the command below:

import pandas as pd

df = pd.read_csv("C:\\Users\\Sriram\\Desktop\\New folder (4)\\aclImdb\\test\\result.txt",
                 sep='\t', names=['reviews', 'polarity'])

Here the reviews column contains the movie reviews, and the polarity column indicates whether each review is positive or negative.

I have the feature function below, to which the reviews column (nearly 1000 reviews) from the dataframe needs to be passed.

from nltk.tokenize import word_tokenize

def find_features(document):
    # word_features is a vocabulary list assumed to be defined elsewhere
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

I am creating a training dataset using the line below.

trainsets = [find_features(df.reviews), df.polarity]

By doing this, all the words in my reviews column should be split by the tokenize call in find_features and assigned a polarity (positive or negative).

For example:

reviews                              polarity
This is a poor excuse for a movie    negative

For the above case, after calling the find_features function, if the condition inside the function is satisfied, I expect output like:

poor - negative

excuse - negative

and so on.
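For illustration, here is a minimal sketch of what find_features itself returns for that review, using the definition above and assuming a small word_features vocabulary just for this sketch:

word_features = ['poor', 'excuse', 'awesome']   # assumed vocabulary, for illustration only

find_features("This is a poor excuse for a movie")
# -> {'poor': True, 'excuse': True, 'awesome': False}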

When I try to call this function, I get the error below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
     30     return features
     31
---> 32 featuresets = [find_features(df.reviews), df.polarity]
     33 #featuresets = [(find_features(rev), category) for ((rev, category)) in reviews]
     34 '''

<ipython-input> in find_features(document)
     24
     25 def find_features(document):
---> 26     words = word_tokenize(document)
     27     features = {}
     28     for w in word_features:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language)
    102     :param language: the model name in the Punkt corpus
    103     """
--> 104     return [token for sent in sent_tokenize(text, language)
    105             for token in _treebank_word_tokenize(sent)]
    106

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
     87     """
     88     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 89     return tokenizer.tokenize(text)
     90
     91 # Standard word tokenizer.

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1224         Given a text, returns a list of the sentences in that text.
   1225         """
-> 1226         return list(self.sentences_from_text(text, realign_boundaries))
   1227
   1228     def debug_decisions(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1272         follows the period.
   1273         """
-> 1274         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1275
   1276     def _slices_from_text(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1302         """
   1303         realign = 0
-> 1304         for sl1, sl2 in _pair_iter(slices):
   1305             sl1 = slice(sl1.start + realign, sl1.stop)
   1306             if not sl2:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    308     """
    309     it = iter(it)
--> 310     prev = next(it)
    311     for el in it:
    312         yield (prev, el)

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1276     def _slices_from_text(self, text):
   1277         last_break = 0
-> 1278         for match in self._lang_vars.period_context_re().finditer(text):
   1279             context = match.group() + match.group('after_tok')
   1280             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object

How do I call a function on a dataframe column that has multiple rows of values (in my case, reviews)?

Solution

Going by the expected output you mentioned:

poor - negative

excuse - negative

the root cause of your TypeError is that find_features(df.reviews) passes the whole pandas Series to word_tokenize, which expects a single string. Tokenize one review per row instead; I suggest:

trainsets = df.apply(lambda row: ([(kw, row.polarity) for kw in find_features(row.reviews)]), axis=1)

Adding a sample snippet for reference (Python 2, pandas 0.15.2):

import pandas as pd
from StringIO import StringIO

print 'pandas-version: ', pd.__version__

data_str = """
col1,col2
'leoperd lion tiger','non-veg'
'buffalo antelope elephant','veg'
'dog cat crow','all'
"""
data_str = StringIO(data_str)

# a dataframe with 2 columns
df = pd.read_csv(data_str)

# a dummy function that takes a col1 value from each row,
# splits it into multiple values & returns a list
def my_fn(row_val):
    return row_val.split(' ')

# calling a row-wise apply vector operation on the dataframe
train_set = df.apply(lambda row: [(kw, row.col2) for kw in my_fn(row.col1)], axis=1)
print train_set

output:

pandas-version:  0.15.2
0    [('leoperd, 'non-veg'), (lion, 'non-veg'), (ti...
1    [('buffalo, 'veg'), (antelope, 'veg'), (elepha...
2    [('dog, 'all'), (cat, 'all'), (crow', 'all')]
dtype: object

@SriramChandramouli, hope I understood your requirement correctly.
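If you are on Python 3 (as the Anaconda3 paths in the traceback suggest), here is a minimal self-contained sketch of the same row-wise idea. It reuses the find_features definition from the question, with a small word_features vocabulary and a two-row stand-in dataframe assumed just for this sketch; filtering on the boolean keeps only the words actually found in each review, matching the expected "poor - negative" style of output:

import pandas as pd
from io import StringIO
from nltk.tokenize import word_tokenize

word_features = ['poor', 'excuse', 'awesome']  # assumed vocabulary, for illustration only

def find_features(document):
    words = word_tokenize(document)
    return {w: (w in words) for w in word_features}

# a small stand-in for the result.txt dataframe
data = StringIO("This is a poor excuse for a movie\tnegative\n"
                "An awesome film\tpositive\n")
df = pd.read_csv(data, sep='\t', names=['reviews', 'polarity'])

# each row becomes a list of (word, polarity) pairs,
# keeping only the vocabulary words present in that review
trainsets = df.apply(
    lambda row: [(kw, row.polarity)
                 for kw, present in find_features(row.reviews).items()
                 if present],
    axis=1)
print(trainsets)

# for an NLTK classifier, the usual shape is (feature-dict, label) pairs:
featuresets = [(find_features(rev), pol) for rev, pol in zip(df.reviews, df.polarity)]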
