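1. count() and countByValue()
The first gives an overall count; the second gives a count per distinct value (a grouped count).
Overall count (count() returns the total number of elements in the RDD, without deduplication):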
num_occupations = user_fields.map(lambda fields: fields[3]).count()
print "num_occupations ",num_occupations
Output:
num_occupations 943
Grouped count:
count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print "countByValue approach:"
print dict(count_by_occupation2)
Output:
{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}
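The same grouped counts can be computed two ways: with countByValue, as above, or MapReduce-style with map and reduceByKey, as in the following code. Both produce the same result.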
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
print "Map-reduce approach:"
print dict(count_by_occupation)
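To make the equivalence of the two approaches concrete, here is a minimal sketch in plain Python rather than Spark. The list `occupations` and the helper `count_with_pairs` are hypothetical stand-ins for the occupation column and for the map plus reduceByKey pipeline; they only imitate the semantics on a local list.

```python
from collections import Counter

# Hypothetical sample standing in for the occupation column (fields[3]).
occupations = ["student", "engineer", "student", "writer", "engineer", "student"]

# Local analogue of count(): the total number of elements, duplicates included.
total = len(occupations)

# Local analogue of countByValue(): occurrences of each distinct value.
by_value = dict(Counter(occupations))


def count_with_pairs(values):
    """Imitate map(lambda v: (v, 1)) followed by reduceByKey(lambda x, y: x + y)."""
    pairs = [(v, 1) for v in values]  # the "map" step
    totals = {}
    for key, one in pairs:  # the "reduce by key" step
        totals[key] = totals[key] + one if key in totals else one
    return totals


by_pairs = count_with_pairs(occupations)
```

Both dictionaries hold the same per-value counts, and those counts add up to the overall total, which mirrors how the 21 occupation counts above add up to 943.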
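4. reduceByKey(lambda x, y: x + y) and reduce(lambda x, y: x + y)
reduce passes the elements of the RDD to the input function two at a time, producing a new value; that value and the next element of the RDD are then passed to the function again, and so on until only a single value remains.
reduceByKey applies reduce to the values of elements that share the same key in an RDD of key-value pairs. The values of all elements with a given key are reduced to a single value, which is then paired with that key to form a new key-value pair.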
max_rating = ratings.reduce(lambda x, y: max(x, y))
print "Max rating: %d" % max_rating
Output:
Max rating: 5
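The difference between reduce and reduceByKey described above can be sketched in plain Python. This is a local illustration under stated assumptions, not Spark's implementation: `reduce_by_key`, `ratings_sample`, and `pairs_sample` are hypothetical names, and the data is made up for the example.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: fold the values that share a key with func."""
    merged = {}
    for key, value in pairs:
        # First value seen for a key is kept as-is; later ones are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


# reduce: the whole collection collapses to one value.
ratings_sample = [3, 5, 1, 4]
max_rating_sample = reduce(lambda x, y: max(x, y), ratings_sample)

# reduceByKey analogue: one reduced value per key.
pairs_sample = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
summed = reduce_by_key(pairs_sample, lambda x, y: x + y)
```

In Spark the function passed to either operation should be commutative and associative, because partial results are combined across partitions in no guaranteed order; this sequential sketch does not exercise that requirement.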
5. stats() produces a profile of the dataset, consisting mainly of the following:
print ratings.stats()
Output:
(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5, min: 1)
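As a rough guide to what each field in this profile means, the sketch below computes the same five statistics for a small local list in plain Python. The function `profile` and its sample input are hypothetical, and the sketch assumes the stdev field is the population standard deviation (dividing by n rather than n - 1).

```python
import math


def profile(values):
    """Compute count, mean, population stdev, max, and min for a list of numbers."""
    n = len(values)
    mean = sum(values) / float(n)
    # Population variance: average squared deviation from the mean.
    variance = sum((v - mean) ** 2 for v in values) / n
    return {
        "count": n,
        "mean": mean,
        "stdev": math.sqrt(variance),
        "max": max(values),
        "min": min(values),
    }


summary = profile([1, 2, 3, 4, 5])
```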