Building a Word Count Application with Spark

Original article, posted 2016-08-29 18:13:55

Notice: All rights reserved. To repost, please contact the author and credit the source: http://blog.csdn.net/u013719780?viewmode=contents


About the author: 风雪夜归子 (Allen), a machine learning engineer who enjoys digging into Machine Learning techniques and has a strong interest in Deep Learning and Artificial Intelligence. He regularly follows competitions on the Kaggle data mining platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to get in touch. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents




In this article we will build a simple word count application.

Creating an RDD

In [1]:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print type(wordsRDD)
<class 'pyspark.rdd.RDD'>
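The second argument to `parallelize` asks Spark to split the list across 4 partitions. As an aside, the chunking can be sketched in plain Python (no Spark needed); the exact boundaries Spark chooses are an implementation detail, so this is only an illustration:

```python
def chunk(xs, n):
    """Split xs into n contiguous chunks of near-equal size."""
    k, r = divmod(len(xs), n)
    out, start = [], 0
    for i in range(n):
        # the first r chunks get one extra element
        end = start + k + (1 if i < r else 0)
        out.append(xs[start:end])
        start = end
    return out

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
print(chunk(wordsList, 4))  # [['cat', 'elephant'], ['rat'], ['rat'], ['cat']]
```

Each chunk would then be processed by a separate task.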

Pluralize each word and test the result

In [2]:
# One way of completing the function
def makePlural(word):
    return word + 's'

print makePlural('cat')
cats

Check that the result is correct; if not, the test reports 'incorrect result: makePlural does not add an s'.

In [3]:
# Make sure to rerun any cell you change before trying the test again
from test_helper import Test
Test.assertEquals(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')
1 test passed.

Apply the makePlural() function to the RDD

In [4]:
# Apply makePlural to every element of the RDD
pluralRDD = wordsRDD.map(makePlural)
print pluralRDD.collect()
['cats', 'elephants', 'rats', 'rats', 'cats']
In [5]:
# TEST Apply makePlural to the base RDD(1c)
Test.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
                  'incorrect values for pluralRDD')
1 test passed.

Use a lambda function to pluralize the words

In [6]:
# The same transformation, written as an anonymous lambda function
pluralLambdaRDD = wordsRDD.map(lambda x: x + 's')
print pluralLambdaRDD.collect()
['cats', 'elephants', 'rats', 'rats', 'cats']
In [7]:
# TEST Pass a lambda function to map (1d)
Test.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
                  'incorrect values for pluralLambdaRDD (1d)')
1 test passed.
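For readers new to `map`, the same transformation can be reproduced with an ordinary Python list comprehension (no Spark involved): the function is applied to every element and a new sequence is returned, leaving the original untouched.

```python
words = ['cat', 'elephant', 'rat', 'rat', 'cat']
# equivalent of wordsRDD.map(lambda x: x + 's'), evaluated eagerly
plurals = [w + 's' for w in words]
print(plurals)  # ['cats', 'elephants', 'rats', 'rats', 'cats']
```

The key difference is that Spark evaluates the `map` lazily, only when an action such as `collect()` is called.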

Compute the length of each word

In [8]:
# Map each plural word to its length
pluralLengths = (pluralRDD
                 .map(lambda x: len(x))
                 .collect())
print pluralLengths
[4, 9, 4, 4, 4]
In [9]:
# TEST Length of each word (1e)
Test.assertEquals(pluralLengths, [4, 9, 4, 4, 4],
                  'incorrect values for pluralLengths')
1 test passed.

Next, create a pair RDD

In [10]:
# Map each word to a (word, 1) pair
wordPairs = wordsRDD.map(lambda x: (x, 1))
print wordPairs.collect()
[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
In [11]:
# TEST Pair RDDs (1f)
Test.assertEquals(wordPairs.collect(),
                  [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],
                  'incorrect value for wordPairs')
1 test passed.

Next we will count the number of occurrences of each word. There are several ways to accomplish this.

In [12]:
# Group the pairs by key; note that groupByKey requires no parameters
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print '{0}: {1}'.format(key, list(value))
rat: [1, 1]
elephant: [1]
cat: [1, 1]
In [13]:
# TEST groupByKey() approach (2a)
Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),
                  [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],
                  'incorrect value for wordsGrouped')
1 test passed.
In [14]:
# Sum the grouped counts for each key
wordCountsGrouped = wordsGrouped.map(lambda (k, v): (k, sum(v)))
print wordCountsGrouped.collect()
[('rat', 2), ('elephant', 1), ('cat', 2)]
In [15]:
# TEST Use groupByKey() to obtain the counts (2b)
Test.assertEquals(sorted(wordCountsGrouped.collect()),
                  [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCountsGrouped')
1 test passed.
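The group-then-sum approach can be sketched in plain Python (no Spark needed). The sketch also shows why `groupByKey` can be memory-hungry: a full list of values is materialised per key before any summing happens.

```python
from collections import defaultdict

wordPairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

# Step 1: collect every value into a per-key list (what groupByKey does)
grouped = defaultdict(list)
for word, n in wordPairs:
    grouped[word].append(n)

# Step 2: sum each list (the mapValues/map step)
counts = {word: sum(ns) for word, ns in grouped.items()}
print(sorted(counts.items()))  # [('cat', 2), ('elephant', 1), ('rat', 2)]
```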

Use reduceByKey() to count the occurrences of each word

In [16]:
# reduceByKey takes a function that combines two values into a single value
wordCounts = wordPairs.reduceByKey(lambda x, y: x + y)
print wordCounts.collect()
[('rat', 2), ('elephant', 1), ('cat', 2)]
In [17]:
# TEST Counting using reduceByKey (2c)
Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCounts')
1 test passed.
In [18]:
# Chain the two steps together
wordCountsCollected = (wordsRDD
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
print wordCountsCollected
[('rat', 2), ('elephant', 1), ('cat', 2)]
In [19]:
# TEST All together
Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCountsCollected')
1 test passed.
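What `reduceByKey` does differently can also be sketched in plain Python: values are folded into a single running total per key as they arrive, so no per-key list of all values is ever built. This is why `reduceByKey` is generally preferred over `groupByKey` for aggregations.

```python
wordPairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

counts = {}
for word, n in wordPairs:
    # combine incrementally, keeping one running total per key
    counts[word] = counts.get(word, 0) + n
print(sorted(counts.items()))  # [('cat', 2), ('elephant', 1), ('rat', 2)]
```

In Spark this incremental combining also happens on each partition before any data is shuffled, which reduces network traffic.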

Count the number of distinct words

In [20]:
uniqueWords = wordCounts.count()
print uniqueWords
3
In [21]:
# TEST Unique words 
Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')
1 test passed.

Compute the average number of occurrences per word

In [22]:
from operator import add
totalCount = (wordCounts
              .map(lambda (k, v): v)
              .reduce(lambda x, y: x + y))
average = totalCount / float(wordCounts.count())
print totalCount
print round(average, 2)
5
1.67
In [23]:
# TEST Mean using reduce 
Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')
1 test passed.
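The arithmetic behind the cell above, checked without Spark: sum the counts, then divide by the number of distinct words.

```python
counts = [('rat', 2), ('elephant', 1), ('cat', 2)]
totalCount = sum(v for _, v in counts)     # 2 + 1 + 2 = 5
average = totalCount / float(len(counts))  # 5 / 3
print(totalCount, round(average, 2))       # 5 1.67
```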

Next, let's look at a complete word count application on a real text file.

First, define a word count function.

In [24]:
# Build a (word, count) pair RDD from an RDD of words
def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words.

    Args:
        wordListRDD (RDD of str): An RDD consisting of words.

    Returns:
        RDD of (str, int): An RDD consisting of (word, count) tuples.
    """
    wordCountsCollected = (wordListRDD
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda x, y: x + y))
    return wordCountsCollected
print wordCount(wordsRDD).collect()
[('rat', 2), ('elephant', 1), ('cat', 2)]
In [25]:
# TEST wordCount function (4a)
Test.assertEquals(sorted(wordCount(wordsRDD).collect()),
                  [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect definition for wordCount function')
1 test passed.
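For sanity-checking on small inputs, a local, non-Spark analogue of the `wordCount()` function above can be written with `collections.Counter`, which plays the role of the map + reduceByKey combination (the name `wordCountLocal` is ours, not from the original lab):

```python
from collections import Counter

def wordCountLocal(words):
    """Return sorted (word, count) pairs for a plain list of words."""
    return sorted(Counter(words).items())

print(wordCountLocal(['cat', 'elephant', 'rat', 'rat', 'cat']))
# [('cat', 2), ('elephant', 1), ('rat', 2)]
```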

Next, strip punctuation from the text and convert all letters to lower case.

In [26]:
# Strip punctuation with a regex built from string.punctuation
import re
import string
def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', text).lower().strip()
print removePunctuation('Hi, you!')
print removePunctuation(' No under_score!')
hi you
no underscore
In [27]:
# TEST Capitalization and punctuation 
Test.assertEquals(removePunctuation(" The Elephant's 4 cats. "),
                  'the elephants 4 cats',
                  'incorrect definition for removePunctuation function')
1 test passed.
In [28]:
import os.path
fileName = os.path.join('/Users/youwei.tan/Downloads', 'shakespeare.txt')

shakespeareRDD = (sc
                  .textFile(fileName, 8)
                  .map(removePunctuation))
print '\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda (l, num): '{0}: {1}'.format(num, l))  # to 'lineNum: line'
                .take(15))
0: the project gutenberg ebook of the complete works of william shakespeare by
1: william shakespeare
2: 
3: this ebook is for the use of anyone anywhere at no cost and with
4: almost no restrictions whatsoever  you may copy it give it away or
5: reuse it under the terms of the project gutenberg license included
6: with this ebook or online at wwwgutenbergorg
7: 
8: this is a copyrighted project gutenberg ebook details below
9: please follow the copyright guidelines in this file
10: 
11: title the complete works of william shakespeare
12: 
13: author william shakespeare
14: 

Before applying the wordCount() function, two tasks remain: 1. split each line on whitespace; 2. filter out empty lines.

In [29]:
shakespeareWordsRDD = shakespeareRDD.flatMap(lambda x: x.split())
shakespeareWordCount = shakespeareWordsRDD.count()
print shakespeareWordsRDD.top(5)
print shakespeareWordCount
[u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds']
903705
In [30]:
# TEST Words from lines 
# This test allows for leading spaces to be removed either before or after
# punctuation is removed.
Test.assertTrue(shakespeareWordCount == 903705 or shakespeareWordCount == 928908,
                'incorrect value for shakespeareWordCount')
Test.assertEquals(shakespeareWordsRDD.top(5),
                  [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],
                  'incorrect value for shakespeareWordsRDD')
1 test passed.
1 test passed.
In [32]:
shakeWordsRDD = shakespeareWordsRDD  # empty strings were already dropped by split()
shakeWordCount = shakeWordsRDD.count()
print shakeWordCount
903705
In [33]:
# TEST Remove empty elements 
Test.assertEquals(shakeWordCount, 903705, 'incorrect value for shakeWordCount')
1 test passed.

An alternative way to split on whitespace and filter out empty lines

In [35]:
shakespeareRDD.map(lambda x: x.split()).filter(lambda x: len(x) > 0).flatMap(lambda x: x).take(10)
Out[35]:
[u'the',
 u'project',
 u'gutenberg',
 u'ebook',
 u'of',
 u'the',
 u'complete',
 u'works',
 u'of',
 u'william']
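Why the `flatMap(lambda x: x.split())` approach used earlier needed no separate empty-line filter: `str.split()` with no argument returns an empty list for a blank or whitespace-only line, so such lines contribute nothing when the results are flattened.

```python
lines = ['the project gutenberg ebook', '', '  ', 'william shakespeare']
# ''.split() and '  '.split() both return [], so blank lines vanish
words = [w for line in lines for w in line.split()]
print(words)  # ['the', 'project', 'gutenberg', 'ebook', 'william', 'shakespeare']
```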

Now count the words

In [36]:
# Take the fifteen most frequent words
top15WordsAndCounts = wordCount(shakeWordsRDD).takeOrdered(15, key=lambda (k, v): -v)
print '\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))
the: 27825
and: 26791
i: 20681
to: 19261
of: 18289
a: 14667
you: 13716
my: 12481
that: 11135
in: 11027
is: 9621
not: 8745
for: 8261
with: 8046
me: 7769
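A local analogue of `takeOrdered(15, key=lambda (k, v): -v)` is `heapq.nlargest`, which returns the n pairs with the highest counts without fully sorting the list; shown here on a small hand-made sample rather than the Shakespeare data.

```python
import heapq

counts = [('the', 27825), ('me', 7769), ('and', 26791), ('i', 20681)]
# keep only the 3 pairs with the largest counts, in descending order
top3 = heapq.nlargest(3, counts, key=lambda kv: kv[1])
print(top3)  # [('the', 27825), ('and', 26791), ('i', 20681)]
```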
