Splitting lines in Python with sentences and labels
I have a sample of a file with sentences and labels. How can it be split into sentences and labels?
A very, very, very slow-moving, aimless movie about a distressed, drifting young man. 0
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. 0
Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent. 0
Very little music or anything to speak of. 0
Desired output:
list of sentences:
['A very, very, very slow-moving, aimless movie about a distressed, drifting young man','Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out']
corresponding labels:
['0','0']
Source: https://stackoverflow.com/questions/47466917
2020-08-18 19:08
Accepted answer
Assuming that the number after the last "." (dot) on each line is the label:
For the given example, stored in a file 'yourdata.txt', the following code should produce two lists, sentence_list and label_list. You can then write the data in these lists to separate files as needed.
sentence_list = []
label_list = []
# Open the file in a context manager so it is closed automatically
with open('yourdata.txt', 'r') as fmov:
    for f in fmov:
        # Split on every dot; the last piece holds the label
        lineinfo = f.split('.')
        # Rejoin everything before the last dot as the sentence
        sentenceline = ".".join(lineinfo[0:-1])
        sentence_list.append(sentenceline)
        # strip() removes the leading space and the trailing newline
        label_list.append(lineinfo[-1].strip())
print(sentence_list)
print(label_list)
OUT:
['A very, very, very slow-moving, aimless movie about a distressed, drifting young man', 'Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out', 'Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent', 'Very little music or anything to speak of']
['0', '0', '0', '0']
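The answer above relies on every sentence ending in a dot. As a possible alternative, here is a minimal sketch that splits at the last run of whitespace instead, so it should also cope with lines ending in "?" or "!". It assumes every non-blank line ends with a whitespace-separated label; the helper names split_sentence_and_label and split_lines are hypothetical, and the second sample line is made up for illustration.

```python
def split_sentence_and_label(line):
    """Split one line into (sentence, label) at the last run of whitespace."""
    # rsplit with maxsplit=1 splits only once, starting from the right.
    # Assumption: the line has at least two whitespace-separated tokens;
    # otherwise the unpacking below raises ValueError.
    sentence, label = line.rstrip().rsplit(None, 1)
    # Drop a trailing period to match the desired output in the question
    return sentence.rstrip('.'), label


def split_lines(lines):
    """Return (sentence_list, label_list) for an iterable of lines."""
    sentence_list = []
    label_list = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sentence, label = split_sentence_and_label(line)
        sentence_list.append(sentence)
        label_list.append(label)
    return sentence_list, label_list


# Sample input: the first line is from the question, the second is made up
sample = [
    "Very little music or anything to speak of. 0\n",
    "Is this any good? 1\n",
]
sentences, labels = split_lines(sample)
print(sentences)
print(labels)
```

Because an open file is itself an iterable of lines, split_lines could also be given the file object from open('yourdata.txt') directly.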
2017-11-24