TED Talks
Environment: Python 2.7, Anaconda, Jupyter Notebook
Dataset: https://www.kaggle.com/rounakbanik/ted-talks
Import the required libraries
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns  # seaborn styling for matplotlib plots (seaborn < 0.8 applies it on import; later versions need sns.set())
import json
from pandas.io.json import json_normalize
from wordcloud import WordCloud, STOPWORDS  # word clouds
df = pd.read_csv('ted_main.csv')
df.columns  # column names (the header row of the dataset)
Index([u'comments', u'description', u'duration', u'event', u'film_date',
u'languages', u'main_speaker', u'name', u'num_speaker',
u'published_date', u'ratings', u'related_talks', u'speaker_occupation',
u'tags', u'title', u'url', u'views'],
dtype='object')
# Reorder the columns
df = df[['name', 'title', 'description', 'main_speaker', 'speaker_occupation', 'num_speaker', 'duration', 'event', 'film_date', 'published_date', 'comments', 'tags', 'languages', 'ratings', 'related_talks', 'url', 'views']]
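As a quick sanity check on the reordering step above, the sketch below confirms that selecting columns by a list only changes their order and does not drop or duplicate any of them. The tiny `sample` frame and the `reorder_columns` helper are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd


def reorder_columns(frame, new_order):
    """Return frame with its columns arranged as in new_order.

    Hypothetical helper: raises ValueError unless new_order lists
    every existing column exactly once.
    """
    if sorted(new_order) != sorted(frame.columns):
        raise ValueError("new_order must list every existing column exactly once")
    return frame[new_order]


# Hypothetical three-column stand-in for ted_main.csv
sample = pd.DataFrame({
    "views": [100, 200],
    "name": ["A: talk one", "B: talk two"],
    "title": ["talk one", "talk two"],
})

reordered = reorder_columns(sample, ["name", "title", "views"])
print(list(reordered.columns))
```

The same check could be applied to `df` and the full column list before overwriting `df`, so that a misspelled or forgotten column name surfaces as an error instead of a silently missing column.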
Features Available
name: The official name of the TED Talk. Includes the title and the speaker.
title: The title of the talk
description: A blurb of what the talk is about.
main_speaker: The first named speaker of the talk.
speaker_occupation: The occupation of the main speaker.
num_speaker: The number of speakers in the talk.
duration: The duration of the talk in seconds.
event: The TED/TEDx event where the talk took place.
film_date: The Unix timestamp of the filming.
published_date: The Unix timestamp for the publication of the talk on TED.com
comments: The number of first level