1. usa.gov Data from bit.ly
1. read the first line of a text file: open(path).readline()
2. convert a JSON string to a dict: json.loads(line)
3. parse every line with a list comprehension: records = [json.loads(line) for line in open(path)]
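The three steps above can be sketched as follows. The notes do not give the file path, so this sketch assumes an in-memory stand-in (`sample`) with two made-up records in the one-JSON-object-per-line layout.

```python
import io
import json

# Hypothetical stand-in for the data file: one JSON object per line.
sample = io.StringIO(
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}\n'
    '{"tz": "", "a": "GoogleMaps/RochesterNY"}\n'
)

# Step 3: parse every line into a dict.
records = [json.loads(line) for line in sample]

# Each record is now an ordinary dict, so fields are read by key.
first_tz = records[0]["tz"]
```

With a real file, `open(path)` would take the place of `sample`.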
1.2 Counting Time Zones in Pure Python
Python standard library module: collections
Creating a dict with a default value for missing keys:
collections.defaultdict
Getting counts:
collections.Counter
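A minimal sketch of both approaches. The `time_zones` list is made-up sample data standing in for the 'tz' field of each record, and `get_counts` is a helper name chosen here for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical time zone strings (sample data, not from the real file).
time_zones = ["America/New_York", "", "America/Chicago", "America/New_York"]

def get_counts(sequence):
    # defaultdict(int) starts every missing key at 0.
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

counts = get_counts(time_zones)

# Counter does the same tally and can return the most frequent items.
top = Counter(time_zones).most_common(1)
```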
1.3 Counting Time Zones with pandas
Turn a list of dicts into a pandas data frame: DataFrame(records)
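For illustration, assuming a small made-up `records` list in which one record has no 'tz' key, the conversion might look like this:

```python
import pandas as pd

# Hypothetical records; the last one has no 'tz' key.
records = [
    {"tz": "America/New_York", "a": "Mozilla/5.0"},
    {"tz": "", "a": "GoogleMaps/RochesterNY"},
    {"a": "Opera/9.80"},
]

# Each dict becomes a row; keys become columns.
frame = pd.DataFrame(records)

# A key missing from a record shows up as NA in that row.
missing_tz = frame["tz"].isna().sum()
```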
methods:
DataFrame.<col>:
get column
DataFrame['col']:
get column
DataFrame.sum():
sum each column
DataFrame.div(DataFrame.sum(axis=1), axis=0):
normalize each row to sum to 1
DataFrame.groupby([]):
group rows by the listed columns
DataFrameGroupBy.size().unstack():
count rows per group and pivot the inner group level into columns
Series.value_counts():
count occurrences of each value
Series.fillna('...'):
replace all NA values with the given argument
Series.plot(kind='barh', stacked=True):
draw a horizontal bar plot (stacked=True is optional)
Series.argsort():
get the integer positions that would sort the values
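The methods listed above can be chained roughly as follows. This is a sketch on a made-up `frame`: the 'os' column and the 'Missing' / 'Unknown' labels are assumptions for illustration. Plotting is left out because it needs matplotlib.

```python
import pandas as pd

# Hypothetical data: a time zone and an operating system label per row.
frame = pd.DataFrame({
    "tz": ["America/New_York", "", None, "America/New_York", "America/Chicago"],
    "os": ["Windows", "Not Windows", "Windows", "Not Windows", "Windows"],
})

# Fill NA values, label empty strings, then count each value.
clean_tz = frame["tz"].fillna("Missing")
clean_tz[clean_tz == ""] = "Unknown"
tz_counts = clean_tz.value_counts()

# Count rows per (tz, os) pair and spread 'os' across the columns.
by_tz_os = frame.assign(tz=clean_tz).groupby(["tz", "os"])
agg_counts = by_tz_os.size().unstack().fillna(0)

# argsort returns positions that sort the row totals in ascending order.
indexer = agg_counts.sum(axis=1).argsort()
sorted_counts = agg_counts.iloc[indexer.to_numpy()]

# Divide each row by its own total so every row sums to 1.
normed = sorted_counts.div(sorted_counts.sum(axis=1), axis=0)
```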
2. MovieLens 1M Data Set
DataFrame methods:
read .dat files:
pandas.read_table('file', sep='', header=None, names=)
merge (join) two or more datasets:
pd.merge(pd.merge(ratings, users), movies)
get statistics grouped by variables:
DataFrame.pivot_table('var', index=[], columns='', aggfunc='mean')
get row counts per group:
DataFrame.groupby('title').size()
select rows by index labels:
DataFrame.loc[<labels>] (DataFrame.ix was removed in pandas 1.0)
sort by a column:
DataFrame.sort_values(by='F', ascending=False)
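A small end-to-end sketch of the read, merge, pivot, and sort calls. All rows here are made up. The '::' separator and the column names follow the MovieLens 1M layout, and `engine='python'` is passed because a multi-character separator is not handled by the default C parser.

```python
import io
import pandas as pd

# Hypothetical stand-in for users.dat: fields separated by '::', no header row.
users_raw = io.StringIO(
    "1::F::1::10::48067\n2::M::56::16::70072\n3::M::25::15::55117\n"
)
users = pd.read_table(
    users_raw, sep="::", header=None, engine="python",
    names=["user_id", "gender", "age", "occupation", "zip"],
)

# Made-up ratings and movies tables with matching key columns.
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})
ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2],
    "movie_id": [10, 10, 10, 20, 20],
    "rating": [5, 3, 4, 2, 4],
})

# merge joins on the column names the two tables share.
data = pd.merge(pd.merge(ratings, users), movies)

# Mean rating per title, one column per gender.
mean_ratings = data.pivot_table(
    "rating", index="title", columns="gender", aggfunc="mean"
)

# Titles ordered by the 'F' column, highest first.
top_female = mean_ratings.sort_values(by="F", ascending=False)
```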
Series methods:
filter index labels by value:
Series.index[Series >= 250]
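As a sketch of how the count, filter, and row selection fit together, on made-up data. The notes use a threshold of 250; this example assumes a threshold of 2 only because the sample is tiny.

```python
import pandas as pd

# Hypothetical ratings, already merged with titles and gender.
data = pd.DataFrame({
    "title": ["Movie A"] * 3 + ["Movie B"] + ["Movie C"] * 2,
    "gender": ["F", "M", "M", "F", "F", "M"],
    "rating": [5, 3, 4, 2, 4, 1],
})
mean_ratings = data.pivot_table(
    "rating", index="title", columns="gender", aggfunc="mean"
)

# Number of ratings per title.
ratings_by_title = data.groupby("title").size()

# Keep only the index labels whose count meets the threshold.
active_titles = ratings_by_title.index[ratings_by_title >= 2]

# Select the matching rows by label.
active_mean_ratings = mean_ratings.loc[active_titles]
```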
2.1 Measuring rating disagreement
DataFrame methods:
Add a column in data frame:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
reverse the row order:
DataFrame[::-1]
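For example, assuming a made-up `mean_ratings` table with 'F' and 'M' columns, adding the difference column and reversing the sorted rows could look like this:

```python
import pandas as pd

# Hypothetical mean ratings per title, split by gender.
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 2.5], "M": [4.0, 3.5, 4.0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# New column: positive values mean men rated the title higher.
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

# Ascending sort puts titles preferred by women first.
sorted_by_diff = mean_ratings.sort_values(by="diff")

# Slicing with a step of -1 reverses the rows.
reversed_order = sorted_by_diff[::-1]
```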
Series methods:
get standard deviation:
Series.std()
sort by value:
Series.sort_values(ascending=False)
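A sketch of measuring disagreement as the standard deviation of ratings per title, on made-up data. Grouping by 'title' before calling `std()` is an assumption about how the Series is obtained.

```python
import pandas as pd

# Hypothetical ratings: Movie A is divisive, Movie B is not.
data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B", "Movie B"],
    "rating": [1, 3, 5, 4, 4, 4],
})

# Sample standard deviation of the ratings for each title.
rating_std_by_title = data.groupby("title")["rating"].std()

# Largest spread first.
most_divisive = rating_std_by_title.sort_values(ascending=False)
```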