# I Wrote an AI That Tells Stories, Without Training an RNN or a Generative Model

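1. The blueprint: an overview of the whole project and its parts.

2. Program demo: a preview of what the system can do once the code is written.

3. Data loading and cleaning: load the data and get it ready for processing.

4. Finding the most representative plots: the first part of the project, using K-Means to select the plots that best capture the user's interests.

5. Summarizing plots: use graph-based summarization to produce a summary of each plot, which forms part of the UI.

6. Recommendation engine: use a simple predictive machine learning model to recommend new stories.

7. Bringing it all together: write the surrounding structure that ties all the components into one system.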

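1. The program outputs short synopses of five highly distinctive stories, meaning stories whose ratings do the most to tell users' tastes apart. A story like The Godfather, for example, reveals almost nothing about individual taste, because everyone likes that movie.

2. The user rates each one: like, dislike, or neutral.

3. The program takes in the user's preferences for these five stories and outputs the summary of a full story. If the user is interested, it outputs the full story. After each full story, the program asks the user for feedback. It learns from this real-time feedback and tries to make better recommendations (a reinforcement learning system).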

import pandas as pd

data = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')
data.head()


data.drop(['Director','Cast'],axis=1,inplace=True)


"Grace Roberts (played by Lea Leland), marries rancher Edward Smith, who is revealed to be a neglectful, vice-ridden spouse. They have a daughter, Vivian. Dr. Franklin (Leonid Samoloff) whisks Grace away from this unhappy life, and they move to New York under aliases, pretending to be married (since surely Smith would not agree to a divorce). Grace and Franklin have a son, Walter (Milton S. Gould). Vivian gets sick, however, and Grace and Franklin return to save her. Somehow this reunion, as Smith had assumed Grace to be dead, causes the death of Franklin. This plot device frees Grace to return to her father's farm with both children.[1]"


blacklist = []
for i in range(100):
    blacklist.append('['+str(i)+']')


def remove_brackets(string):
    for item in blacklist:
        string = string.replace(item,'')
    return string


data['Plot'] = data['Plot'].apply(remove_brackets)
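
As a possible alternative to the blacklist above, the citation markers can be stripped with a single regular expression. This is a minimal sketch, not the author's code: the names `CITATION` and `remove_citations` are made up for illustration, and unlike the blacklist (which covers `[0]` to `[99]`) the pattern matches a bracketed number of any length.

```python
import re

# Hypothetical helper: matches Wikipedia-style citation markers such as [1] or [123]
CITATION = re.compile(r'\[\d+\]')

def remove_citations(text):
    """Return text with every bracketed number removed."""
    return CITATION.sub('', text)

print(remove_citations("return to her father's farm with both children.[1]"))
```

It could be applied the same way as `remove_brackets`, with `data['Plot'].apply(remove_citations)`. Bracketed text that is not a number, such as `[note]`, is left untouched.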


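The PageRank algorithm computes the "centrality" of the nodes in a graph, which is useful for measuring how much relevant information each sentence carries. The graph is built from bag-of-words feature sequences, with edge weights based on cosine similarity.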
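
To make the idea concrete, here is a small pure-Python sketch of graph-based sentence ranking as described above: each sentence is a node, edges are weighted by the cosine similarity of bag-of-words counts, and PageRank is run by power iteration. It is an illustration under simplifying assumptions, not gensim's implementation: the sentence splitter is naive, no stop words are removed, and the function names (`split_sentences`, `rank_sentences`, `summarize`) and the `damping`, `iterations` and `keep` parameters are made up for this example.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50):
    bags = [Counter(re.findall(r'[a-z0-9]+', s.lower())) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Weighted adjacency matrix: similarity between every pair of distinct sentences
    weights = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives rank from its neighbours in proportion to edge weight
        ranks = [
            (1 - damping) / n + damping * sum(
                ranks[j] * weights[j][i] / out_sum[j] for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return ranks

def summarize(text, keep=2):
    sentences = split_sentences(text)
    if len(sentences) <= keep:
        return text
    ranks = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: ranks[i], reverse=True)[:keep]
    # Emit the chosen sentences in their original order
    return ' '.join(sentences[i] for i in sorted(best))

text = ("The cat chased the mouse. The mouse ran from the cat. "
        "The cat and the mouse are enemies. Bananas taste sweet.")
print(summarize(text, keep=2))
```

The last sentence shares no words with the others, so it has no edges, receives only the baseline rank, and is left out of the summary.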

import gensim

string = '''The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called “iterations”, through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1. However, later versions of PageRank, and the remainder of this p, assume a probability distribution between 0 and 1. Hence the initial value for each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A. At the completion of this iteration, page A will have a PageRank of approximately 0.458.
In other words, the PageRank conferred by an outbound link is equal to the document’s own PageRank score divided by the number of outbound links L( ). In the general case, the PageRank value for any page u can be expressed as: i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v. The algorithm involves a damping factor for the calculation of the pagerank. It is like the income tax which the govt extracts from one despite paying him itself.'''

print(gensim.summarization.summarize(string))  # needs gensim < 4.0; the summarization module was removed in 4.0


In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1. The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links. If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A.


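• If the text is shorter than 500 characters, return the original text as is. Summarizing it would leave too little content.

• If the text consists of a single sentence, gensim cannot process it, because all it does is select the important sentences of a text. We use a TextBlob object, whose .sentences attribute splits the text into sentences. If the first sentence of the text equals the text itself, the text has only one sentence.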

import gensim
from textblob import TextBlob

def summary(x):
    # Short or single-sentence plots are returned unchanged
    if len(x) < 500 or str(TextBlob(x).sentences[0]) == x:
        return x
    else:
        return gensim.summarization.summarize(x)

data['Summary'] = data['Plot'].apply(summary)


"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."


'As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk.'


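import string
import re

def clean(text):
    # Strip punctuation (escaped so the characters are safe inside a regex class) and lowercase
    return re.sub('[%s]' % re.escape(string.punctuation),'',text).lower()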


data['Cleaned'] = data['Plot'].apply(clean)


from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',max_features=500)
X = vectorizer.fit_transform(data['Plot'])


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)  # X is a sparse matrix, which cannot be centered
X = scaler.fit_transform(X)


n_clusters = []
scores = []


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


for n in [3,4,5,6]:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X)
    scores.append(silhouette_score(X,kmeans.predict(X)))
    n_clusters.append(n)


data['Age'] = data['Release Year'].apply(lambda x:2017-x)


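It makes practical sense for Age to start at 0.

The Origin/Ethnicity column matters, because the style of a story can be traced back to where it comes from. The column is categorical, however, with values such as ['American', 'Telugu', 'Chinese']. To turn it into something machine-readable we need to one-hot encode it, which we do with sklearn's OneHotEncoder.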

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
nation = enc.fit_transform(np.array(data['Origin/Ethnicity']).reshape(-1, 1)).toarray()


for i in range(len(nation[0])):
    data[enc.categories_[0][i]] = nation[:,i]
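
For readers new to one-hot encoding, the following pure-Python sketch shows what the encoder produces for a categorical column, including the effect of `handle_unknown='ignore'`: a category not seen during fitting becomes a row of zeros. The helper names `fit_categories` and `one_hot` and the sample values are made up for illustration; this is not how sklearn implements it.

```python
def fit_categories(values):
    # One output column per distinct category, in sorted order
    return sorted(set(values))

def one_hot(value, categories):
    # 1.0 in the matching column, 0.0 elsewhere; an unseen value gives all zeros
    return [1.0 if value == category else 0.0 for category in categories]

origins = ['American', 'Telugu', 'Chinese', 'American']
categories = fit_categories(origins)
encoded = [one_hot(origin, categories) for origin in origins]

print(categories)
for origin, row in zip(origins, encoded):
    print(origin, row)
print('French', one_hot('French', categories))
```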


data['Genre'].value_counts()


top_genres = data['Genre'].value_counts().head(21).index.tolist()  # the index holds the genre names
top_genres.remove('unknown')


def process(genre):
    # Keep the 20 most common named genres; everything else becomes 'unknown'
    if genre in top_genres:
        return genre
    else:
        return 'unknown'

data['Genre'] = data['Genre'].apply(process)


enc1 = OneHotEncoder(handle_unknown='ignore')
genres = enc1.fit_transform(np.array(data['Genre']).reshape(-1, 1)).toarray()


for i in range(len(genres[0])):
    data[enc1.categories_[0][i]] = genres[:,i]


genre_columns = ['action', 'adventure', 'animation', 'comedy', 'comedy, drama', 'crime',
                 'crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery',
                 'romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller',
                 'unknown', 'war', 'western']
data.loc[data['unknown'] == 1, genre_columns] = np.nan  # blank out the genre flags of unknown-genre rows


import re

data['Cleaned'] = data['Plot'].apply(lambda x:re.sub('[^A-Za-z0-9]+',' ',str(x)).lower())


from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',max_features=30)
X = vectorizer.fit_transform(data['Cleaned']).toarray()


keys = vectorizer.get_feature_names_out()  # terms in column order (scikit-learn >= 1.0); vocabulary_.keys() is not ordered by column
for i in range(len(keys)):
    data[keys[i]] = X[:,i]


from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
column_list = ['Age', 'American', 'Assamese', 'Australian', 'Bangladeshi', 'Bengali',
               'Bollywood', 'British', 'Canadian', 'Chinese', 'Egyptian', 'Filipino',
               'Hong Kong', 'Japanese', 'Kannada', 'Malayalam', 'Malaysian', 'Maldivian',
               'Marathi', 'Punjabi', 'Russian', 'South_Korean', 'Tamil', 'Telugu', 'Turkish',
               'man', 'night', 'gets', 'film', 'house', 'takes', 'mother', 'son',
               'finds', 'home', 'killed', 'tries', 'later', 'daughter', 'family',
               'life', 'wife', 'new', 'away', 'time', 'police', 'father', 'friend',
               'day', 'help', 'goes', 'love', 'tells', 'death', 'money',
               'action', 'adventure', 'animation', 'comedy', 'comedy, drama', 'crime',
               'crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery',
               'romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller',
               'war', 'western']
imputed = imputer.fit_transform(data[column_list])


for i in range(len(column_list)):
    data[column_list[i]] = imputed[:,i]


data.drop(['Title','Release Year','Wiki Page','Origin/Ethnicity','unknown','Genre'],axis=1,inplace=True)  # 'Director' and 'Cast' were already dropped earlier


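...and the data is ready, with no missing values. Another interesting aspect of KNN imputation is that it can produce fractional values: a movie can be, say, 20% Western, with the remainder split across one or more other genres.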
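
To see where the fractional values come from, here is a deliberately simplified pure-Python sketch of KNN imputation: a missing entry is filled with the mean of that column over the nearest rows, so averaging 0/1 genre flags yields fractions. It is an assumption-laden illustration rather than sklearn's `KNNImputer` (which, among other differences, can also draw on neighbours that are themselves incomplete); the function name `knn_impute` and the toy rows are invented for this example.

```python
import math

def knn_impute(rows, n_neighbors=2):
    """Fill None entries with the column mean over the nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        known = [i for i, v in enumerate(row) if v is not None]
        # Distance is measured only on the columns this row actually has
        neighbours = sorted(
            complete,
            key=lambda c: math.sqrt(sum((row[i] - c[i]) ** 2 for i in known)),
        )[:n_neighbors]
        filled.append([
            v if v is not None else sum(c[i] for c in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ])
    return filled

# Columns: Age, western, drama
movies = [
    [10, 1, 0],
    [12, 0, 1],
    [60, 0, 1],
    [11, None, None],  # genre unknown
]
print(knn_impute(movies)[-1])
```

The two nearest movies by age are one Western and one drama, so the movie with the unknown genre comes out as half Western and half drama.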

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Xcluster = data.drop(['Plot','Summary','Cleaned'],axis=1)
score = []
for i in [3,4,5,6]:
    kmeans = KMeans(n_clusters=i)
    prediction = kmeans.fit_predict(Xcluster)
    score.append(silhouette_score(Xcluster,prediction))  # one silhouette score per cluster count


from sklearn.cluster import KMeans

Xcluster = data.drop(['Plot','Summary','Cleaned'],axis=1)
kmeans = KMeans(n_clusters=3)
kmeans.fit(Xcluster)
pd.Series(kmeans.predict(Xcluster)).value_counts()


centers = kmeans.cluster_centers_
centers


Xcluster['Label'] = kmeans.labels_


import numpy as np

for cluster in [0,1,2]:
    # Rows assigned to this cluster, without the label column
    subset = Xcluster[Xcluster['Label']==cluster].drop(columns='Label')
    center = centers[cluster]
    # Euclidean distance from every row in the cluster to its center
    scores = pd.DataFrame({
        'Index': subset.index,
        'Distance': np.linalg.norm(subset.values - center, axis=1),
    })
    closest = scores[scores['Distance']==scores['Distance'].min()]
    print('Cluster',cluster,':',closest['Index'].tolist())


data.loc[4114]['Summary']


'On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.'


data.loc[15176]['Summary']


'Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.'


data.loc[9761]['Summary']


'Jewel thief Jack Rhodes, a.k.a. "Jack of Diamonds", is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn\'t cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.'

Great! We now have the three most representative story plots. A human may not see what sets them apart, but to the machine learning model they carry a great deal of information that is ready to use.

## Recommendation engine

The recommendation engine here is simply a machine learning model that predicts which movie plots the user is more likely to rate highly. It takes in features of the movie, such as its age or country, along with a TF-IDF vectorization of the summary, capped at 100 features.

The target for each movie plot is 1 or 0. After training on the data (the stories the user has already rated), the model predicts the probability that the user will rate a story favorably. The story most likely to be liked is then recommended to the user, the user's rating of it is recorded, and the story is added to the training data.

As training data, we use only the attributes in each movie's row of data.

A decision tree classifier is a reasonable choice, because it makes effective predictions, trains quickly, and produces high-variance solutions, which is what a recommendation system is after.

## Bringing it all together

First, we write the code that collects the user's ratings for the three most representative movies. The program makes sure that every input ends up as either 0 or 1.

import time

def ask_binary():
    # Keep asking until the user enters 0 or 1
    time.sleep(0.5)  # Kaggle sometimes has a glitch with inputs
    while True:
        response = input(':: ')
        try:
            if int(response) == 0 or int(response) == 1:
                return int(response)
            else:
                print('Invalid input. Try again')
        except ValueError:
            print('Invalid input. Try again')

snapshots = [
    'On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.',
    'Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.',
    "Jewel thief Jack Rhodes, a.k.a. 'Jack of Diamonds', is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn't cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.",
]

starting = []
print("Indicate whether you like (1) or dislike (0) the following three story snapshots.")
for number, snapshot in enumerate(snapshots, start=1):
    print('\n> > > %d < < <' % number)
    print(snapshot)
    starting.append(ask_binary())


X = data.loc[[4114,15176,9761]].drop(['Plot','Summary','Cleaned'],axis=1)  # same order as the ratings in starting
y = starting
data.drop([4114,15176,9761],inplace=True)


from sklearn.tree import DecisionTreeClassifier

while True:
    subset = data.drop(['Plot','Summary','Cleaned'],axis=1)  # rebuilt each round so already-shown stories are excluded
    dec = DecisionTreeClassifier().fit(X,y)


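    subdf = subset.sample(n=min(10_000, len(subset)))  # select about 1/3 of data
    # predict_proba has one column per class seen in training, so class 1
    # has no column while the user has only given one kind of rating
    classes = list(dec.classes_)
    if 1 in classes:
        probabilities = dec.predict_proba(subdf)[:, classes.index(1)]
    else:
        probabilities = [0.0] * len(subdf)
    dic = pd.DataFrame({'Index': subdf.index.values, 'Probability': probabilities})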


    index = dic.loc[dic['Probability'].idxmax(),'Index']  # story with the highest predicted probability


    print('> > > Would you be interested in this snippet from a story? (1/0/-1 to quit) < < <')
    print(data.loc[index]['Summary'])
    time.sleep(0.5)


    while True:
        response = input(':: ')
        try:
            if int(response) in (-1, 0, 1):  # -1 must be accepted so the user can quit
                response = int(response)
                break
            else:
                print('Invalid input. Try again')
        except ValueError:
            print('Invalid input. Try again')


...we can start adding training data. But first, we must let the user end the loop whenever they want to quit.

    if response == -1:
        break


    X = pd.concat([X,data.loc[[index]].drop(['Plot','Summary','Cleaned'],axis=1)])  # selecting with a list keeps the row as a DataFrame


    if response == 0:
        y.append(0)


    else:
        print('\n> > > Printing full story. < < <')
        print(data.loc[index]['Plot'])
        time.sleep(2)
        print("\n> > > Did you enjoy this story? (1/0) < < <")


        while True:
            response = input(':: ')
            try:
                if int(response) == 0 or int(response) == 1:
                    response = int(response)
                    break
                else:
                    print('Invalid input. Try again')
            except ValueError:
                print('Invalid input. Try again')


...and append 0 or 1 to y accordingly.

        if response == 1:
            y.append(1)
        else:
            y.append(0)


    data.drop(index,inplace=True)
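
The shape of the feedback loop can be seen more easily in a condensed, self-contained sketch. Everything here is a stand-in chosen so that the example runs without any dependencies: a nearest-rated-neighbour score replaces the `DecisionTreeClassifier`, a simulated user replaces `input()`, and the names `score`, `recommend_loop`, `stories` and `get_rating` are hypothetical. Only the loop structure mirrors the program above: score the unrated stories, recommend the best one, record the rating, add it to the training data, and remove it from the pool.

```python
import math

def score(candidate, rated):
    # Stand-in model: the rating (0 or 1) of the closest story rated so far
    nearest = min(rated, key=lambda pair: math.dist(candidate, pair[0]))
    return nearest[1]

def recommend_loop(stories, seed_ratings, get_rating, rounds):
    """stories: {story_id: feature tuple}; seed_ratings: {story_id: 0 or 1}."""
    rated = [(stories[i], r) for i, r in seed_ratings.items()]
    unrated = {i: f for i, f in stories.items() if i not in seed_ratings}
    history = []
    for _ in range(rounds):
        if not unrated:
            break
        # Recommend the unrated story with the highest predicted score
        best = max(unrated, key=lambda i: score(unrated[i], rated))
        rating = get_rating(best)
        # The rated story joins the training data and leaves the candidate pool
        rated.append((unrated.pop(best), rating))
        history.append((best, rating))
    return history

stories = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.9, 0.8), 3: (0.1, 0.2), 4: (0.8, 1.0)}
# Simulated user: likes any story whose first feature is above 0.5
history = recommend_loop(stories, {0: 0, 1: 1}, lambda i: int(stories[i][0] > 0.5), rounds=3)
print(history)
```

Stories close to the one the simulated user liked are recommended first, and no story is recommended twice because each one is removed from the pool once it has been rated.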


https://www.kaggle.com/washingtongold/tell-me-a-story-1-2?scriptVersionId=31773396

https://towardsdatascience.com/tell-me-a-story-ai-one-that-i-like-4c0bc60f46ae

12-04