cs120_lab1b Assignment Answers

Word Count Lab: Building a word count application

This lab will build on the techniques covered in the Spark tutorial to develop a simple word count application. The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg.

This could also be scaled to find the most common words in Wikipedia.

During this lab we will cover:

  • Part 1: Creating a base RDD and pair RDDs
  • Part 2: Counting with pair RDDs
  • Part 3: Finding unique words and a mean value
  • Part 4: Apply word count to a file
  • Appendix A: Submitting your exercises to the Autograder

Note that for reference, you can look up the details of the relevant methods in Spark's Python API documentation.

labVersion = 'cs120x-lab1b-1.0.0'
Command took 0.11s 

Part 1: Creating a base RDD and pair RDDs


In this part of the lab, we will explore creating a base RDD with parallelize and using pair RDDs to count words.


(1a) Create a base RDD

We'll start by generating a base RDD by using a Python list and the sc.parallelize method. Then we'll print out the type of the base RDD.


wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print type(wordsRDD)
<class 'pyspark.rdd.RDD'>
Command took 0.08s 

(1b) Pluralize and test

Let's use a map() transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' added to the end. Please replace <FILL IN> with your solution. If you have trouble, the next cell has the solution. After you have defined makePlural you can run the third cell which contains a test. If your implementation is correct it will print 1 test passed.

This is the general form that exercises will take, except that no example solution will be provided. Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more <FILL IN> sections. The cell that needs to be modified will have # TODO: Replace <FILL IN> with appropriate code on its first line. Once the <FILL IN> sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution. The last code cell before the next markdown section will contain the tests.


# TODO: Replace <FILL IN> with appropriate code
def makePlural(word):
    """Adds an 's' to `word`.
    Note:
        This is a simple function that only adds an 's'.  No attempt is made to follow proper
        pluralization rules.
    Args:
        word (str): A string.
    Returns:
        str: A string with 's' added to it.
    """
    return word + 's'
print makePlural('cat')
cats
Command took 0.07s 

# One way of completing the function
def makePlural(word):
    return word + 's'
print makePlural('cat')
cats
Command took 0.04s 

# Load in the testing code and check to see if your answer is correct
# If incorrect it will report back '1 test failed' for each failed test
# Make sure to rerun any cell you change before trying the test again
from databricks_test_helper import Test
# TEST Pluralize and test (1b)
Test.assertEquals(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')
1 test passed.
Command took 0.03s 

(1c) Apply makePlural to the base RDD

Now pass each item in the base RDD into a map() transformation that applies the makePlural() function to each element, and then call the collect() action to see the transformed RDD.


# TODO: Replace <FILL IN> with appropriate code
pluralRDD = wordsRDD.map(makePlural)
print pluralRDD.collect()

(1) Spark Jobs

['cats', 'elephants', 'rats', 'rats', 'cats']
Command took 1.11s 

# TEST Apply makePlural to the base RDD (1c)
Test.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
                  'incorrect values for pluralRDD')

(1) Spark Jobs

1 test passed.
Command took 0.17s 

(1d) Pass a lambda function to map

Let's create the same RDD using a lambda function.


# TODO: Replace <FILL IN> with appropriate code
pluralLambdaRDD = wordsRDD.map(lambda x: x + 's')
print pluralLambdaRDD.collect()

(1) Spark Jobs

['cats', 'elephants', 'rats', 'rats', 'cats']
Command took 0.17s 

# TEST Pass a lambda function to map (1d)
Test.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
                  'incorrect values for pluralLambdaRDD (1d)')

(1) Spark Jobs

1 test passed.
Command took 0.17s 

(1e) Length of each word

Now use map() and a lambda function to return the number of characters in each word. We'll collect this result directly into a variable.


# TODO: Replace <FILL IN> with appropriate code
pluralLengths = (pluralRDD
                 .map(lambda x: len(x))
                 .collect())
print pluralLengths

(1) Spark Jobs

[4, 9, 4, 4, 4]
Command took 0.17s 

# TEST Length of each word (1e)
Test.assertEquals(pluralLengths, [4, 9, 4, 4, 4],
                  'incorrect values for pluralLengths')
1 test passed.
Command took 0.04s 

(1f) Pair RDDs

The next step in writing our word counting program is to create a new type of RDD, called a pair RDD. A pair RDD is an RDD where each element is a pair tuple (k, v), where k is the key and v is the value. In this example, we will create a pair consisting of ('<word>', 1) for each word element in the RDD. We can create the pair RDD using the map() transformation with a lambda function to create a new RDD.


# TODO: Replace <FILL IN> with appropriate code
wordPairs = wordsRDD.map(lambda x: (x, 1))
print wordPairs.collect()

(1) Spark Jobs

[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
Command took 0.17s 

# TEST Pair RDDs (1f)
Test.assertEquals(wordPairs.collect(),
                  [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],
                  'incorrect value for wordPairs')

(1) Spark Jobs

1 test passed.
Command took 0.17s 

Part 2: Counting with pair RDDs


Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.

A naive approach would be to collect() all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations.
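
For concreteness, here is a minimal sketch of that naive driver-side approach, shown only for contrast (it ships every element to the driver, so it cannot scale):

from collections import defaultdict

# Pull the entire dataset into the driver and count locally -- the anti-pattern
naiveCounts = defaultdict(int)
for word in wordsRDD.collect():
    naiveCounts[word] += 1
print dict(naiveCounts)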


(2a) groupByKey() approach

An approach you might first consider (we'll see shortly that there are better ways) is based on using the groupByKey() transformation. As the name implies, the groupByKey() transformation groups all the elements of the RDD with the same key into a single list in one of the partitions.

There are two problems with using groupByKey():

  • The operation requires a lot of data movement to move all the values into the appropriate partitions.
  • The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a, etc.) would be huge and could exhaust the available memory in a worker.

Use groupByKey() to generate a pair RDD of type ('word', iterator).


# TODO: Replace <FILL IN> with appropriate code
# Note that groupByKey requires no parameters
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print '{0}: {1}'.format(key, list(value))

(1) Spark Jobs

rat: [1, 1]
elephant: [1]
cat: [1, 1]
Command took 0.43s 

# TEST groupByKey() approach (2a)
Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),
                  [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],
                  'incorrect value for wordsGrouped')

(1) Spark Jobs

1 test passed.
Command took 0.22s 
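
One detail worth noting: the grouped values are not plain Python lists but pyspark.resultiterable.ResultIterable objects, which is why the cells above convert them with list() before printing. A quick way to see this:

# Peek at the value type produced by groupByKey()
key, value = wordsGrouped.first()
print type(value)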

(2b) Use groupByKey() to obtain the counts

Using the groupByKey() transformation creates an RDD containing 3 elements, each of which is a pair of a word and a Python iterator.

Now sum the iterator using a map() transformation. The result should be a pair RDD consisting of (word, count) pairs.


# TODO: Replace <FILL IN> with appropriate code
wordCountsGrouped = wordsGrouped.mapValues(len)  # len works here because every value is 1; summing is the general form
print wordCountsGrouped.collect()

(1) Spark Jobs

[('rat', 2), ('elephant', 1), ('cat', 2)]
Command took 0.22s 

# TEST Use groupByKey() to obtain the counts (2b)
Test.assertEquals(sorted(wordCountsGrouped.collect()),
                  [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCountsGrouped')

(1) Spark Jobs

1 test passed.
Command took 0.17s 
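
Equivalently, and closer to the wording above about summing the iterator, you could sum each group explicitly; this variant also generalizes to pair RDDs whose values are not all 1:

# Sum the grouped values instead of taking their length
print wordsGrouped.mapValues(lambda counts: sum(counts)).collect()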

(2c) Counting using reduceByKey

A better approach is to start from the pair RDD and then use the reduceByKey() transformation to create a new pair RDD. The reduceByKey() transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. reduceByKey() operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets.
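
For reference, the same reduction can be written with operator.add, which is equivalent to the two-argument lambda used below:

from operator import add

# reduceByKey combines values within each partition first (like a combiner),
# then merges the per-partition partial results across the network
print wordPairs.reduceByKey(add).collect()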


# TODO: Replace <FILL IN> with appropriate code
# Note that reduceByKey takes in a function that accepts two values and returns a single value
wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)
print wordCounts.collect()

(1) Spark Jobs

[('rat', 2), ('elephant', 1), ('cat', 2)]
Command took 0.33s 

# TEST Counting using reduceByKey (2c)
Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCounts')

(1) Spark Jobs

1 test passed.
Command took 0.18s 

(2d) All together

The expert version of the code chains the map() to a pair RDD, the reduceByKey() transformation, and the collect() action in one statement.


# TODO: Replace <FILL IN> with appropriate code
wordCountsCollected = (wordsRDD
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda a, b: a + b)
                       .collect())
print wordCountsCollected

(1) Spark Jobs

[('rat', 2), ('elephant', 1), ('cat', 2)]
Command took 0.33s 

# TEST All together (2d)
Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCountsCollected')
1 test passed.
Command took 0.04s 

Part 3: Finding unique words and a mean value


(3a) Unique words

Calculate the number of unique words in wordsRDD. You can use other RDDs that you have already created to make this easier.


# TODO: Replace <FILL IN> with appropriate code
uniqueWords = wordCounts.count()
print uniqueWords

(1) Spark Jobs

3
Command took 0.18s 

# TEST Unique words (3a)
Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')
1 test passed.
Command took 0.04s 

(3b) Mean using reduce

Find the mean number of words per unique word in wordCounts.

Use a reduce() action to sum the counts in wordCounts and then divide by the number of unique words. First map() the pair RDD wordCounts, which consists of (key, value) pairs, to an RDD of values.


# TODO: Replace <FILL IN> with appropriate code
from operator import add
print wordCounts.collect()
totalCount = (wordCounts
              .map(lambda (a, b): b)
              .reduce(add))
average = totalCount / float(wordCounts.count())
print totalCount
print round(average, 2)

(3) Spark Jobs

[('rat', 2), ('elephant', 1), ('cat', 2)]
5
1.67
Command took 0.38s 
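
As an aside, the same average can be computed with PySpark's built-in helpers: values() drops the keys and mean() averages them in a single action. A minimal sketch:

print round(wordCounts.values().mean(), 2)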

# TEST Mean using reduce (3b)
Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')
1 test passed.
Command took 0.03s 

Part 4: Apply word count to a file


In this section we will finish developing our word count application. We'll have to build the wordCount function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data.


(4a) wordCount function

First, define a function for word counting. You should reuse the techniques that have been covered in earlier parts of this lab. This function should take in an RDD that is a list of words like wordsRDD and return a pair RDD that has all of the words and their associated counts.


# TODO: Replace <FILL IN> with appropriate code
def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words.
    Args:
        wordListRDD (RDD of str): An RDD consisting of words.
    Returns:
        RDD of (str, int): An RDD consisting of (word, count) tuples.
    """
    return wordListRDD.map(lambda x: (x, 1)).reduceByKey(add)  # `add` was imported from operator in (3b)
print wordCount(wordsRDD).collect()

(1) Spark Jobs

[('rat', 2), ('elephant', 1), ('cat', 2)]
Command took 0.27s 

# TEST wordCount function (4a)
Test.assertEquals(sorted(wordCount(wordsRDD).collect()),
                  [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect definition for wordCount function')

(1) Spark Jobs

1 test passed.
Command took 0.28s 

(4b) Capitalization and punctuation

Real world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are:

  • Words should be counted independent of their capitalization (e.g., Spark and spark should be counted as the same word).
  • All punctuation should be removed.
  • Any leading or trailing spaces on a line should be removed.

Define the function removePunctuation that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python re module to remove any text that is not a letter, number, or space. Reading help(re.sub) might be useful. If you are unfamiliar with regular expressions, you may want to review a regular expressions tutorial, and an online regex tester is a great resource for debugging your expression.


# TODO: Replace <FILL IN> with appropriate code
import re
def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.
    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.
    Args:
        text (str): A string.
    Returns:
        str: The cleaned up string.
    """
    # Stripping before removing punctuation can leave spaces where punctuation
    # used to be; the test in (4d) accepts either order of operations.
    return re.sub('[^a-z0-9 ]', '', text.lower().strip())
print removePunctuation('Hi, you!')
print removePunctuation(' No under_score!')
print removePunctuation(' *      Remove punctuation then spaces  * ')
hi you
no underscore
remove punctuation then spaces
Command took 0.07s 

# TEST Capitalization and punctuation (4b)
Test.assertEquals(removePunctuation(" The Elephant's 4 cats. "),
                  'the elephants 4 cats',
                  'incorrect definition for removePunctuation function')
1 test passed.
Command took 0.04s 

(4c) Load a text file

For the next part of this lab, we will use the Complete Works of William Shakespeare from Project Gutenberg. To convert a text file into an RDD, we use the SparkContext.textFile() method. We also apply the recently defined removePunctuation() function using a map() transformation to strip out the punctuation and change all text to lower case. Since the file is large, we use take(15) so that we only print 15 lines.


%fs ls databricks-datasets/cs100/lab1/data-001/
Command took 5.11s 

# Just run this code
import os.path
fileName = "dbfs:/" + os.path.join('databricks-datasets', 'cs100', 'lab1', 'data-001', 'shakespeare.txt')
shakespeareRDD = sc.textFile(fileName, 8).map(removePunctuation)
print '\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda (l, num): '{0}: {1}'.format(num, l))  # to 'lineNum: line'
                .take(15))

(2) Spark Jobs

0: 1609
1: 
2: the sonnets
3: 
4: by william shakespeare
5: 
6: 
7: 
8: 1
9: from fairest creatures we desire increase
10: that thereby beautys rose might never die
11: but as the riper should by time decease
12: his tender heir might bear his memory
13: but thou contracted to thine own bright eyes
14: feedst thy lights flame with selfsubstantial fuel
Command took 1.82s 

(4d) Words from lines

Before we can use the wordCount() function, we have to address two issues with the format of the RDD:

  • The first issue is that we need to split each line by its spaces. This is done in (4d).
  • The second issue is that we need to filter out empty lines. This is done in (4e).

Apply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string split() function. You might think that a map() transformation is the way to do this, but think about what the result of the split() function will be.

Note:

  • Do not use the default implementation of split(), but pass in a separator value. For example, to split line by commas you would use line.split(',').
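
To see why flatMap() rather than map() is the right choice here, consider this small sketch on toy data: map() keeps one output element per input line (a list of words), while flatMap() flattens those lists into a single RDD of words.

linesRDD = sc.parallelize(['cat elephant', 'rat rat cat'])
# map() yields one list per line: [['cat', 'elephant'], ['rat', 'rat', 'cat']]
print linesRDD.map(lambda line: line.split(' ')).collect()
# flatMap() yields a flat RDD of words: ['cat', 'elephant', 'rat', 'rat', 'cat']
print linesRDD.flatMap(lambda line: line.split(' ')).collect()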

# TODO: Replace <FILL IN> with appropriate code
shakespeareWordsRDD = shakespeareRDD.flatMap(lambda x: x.split(' '))
shakespeareWordCount = shakespeareWordsRDD.count()
print shakespeareWordsRDD.top(5)
print shakespeareWordCount

(2) Spark Jobs

[u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds']
928908
Command took 2.42s 

# TEST Words from lines (4d)
# This test allows for leading spaces to be removed either before or after
# punctuation is removed.
Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,
                'incorrect value for shakespeareWordCount')
Test.assertEquals(shakespeareWordsRDD.top(5),
                  [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],
                  'incorrect value for shakespeareWordsRDD')

(1) Spark Jobs

1 test passed.
1 test passed.
Command took 1.21s 

(4e) Remove empty elements

The next step is to filter out the empty elements. Remove all entries where the word is ''.


# TODO: Replace <FILL IN> with appropriate code
shakeWordsRDD = shakespeareWordsRDD.filter(lambda x: x != '')
shakeWordCount = shakeWordsRDD.count()
print shakeWordCount

(1) Spark Jobs

882996
Command took 1.21s 

# TEST Remove empty elements (4e)
Test.assertEquals(shakeWordCount, 882996, 'incorrect value for shakeWordCount')
1 test passed.
Command took 0.01s 

(4f) Count the words

We now have an RDD that contains only words. Next, let's apply the wordCount() function to produce a list of word counts. We can view the top 15 words by using the takeOrdered() action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.

You'll notice that many of the words are common English words. These are called stopwords. In a later lab, we will see how to eliminate them from the results. Use the wordCount() function and takeOrdered() to obtain the fifteen most common words and their counts.


# TODO: Replace <FILL IN> with appropriate code
top15WordsAndCounts = wordCount(shakeWordsRDD).takeOrdered(15, lambda (a, b): -b)
print '\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))

(1) Spark Jobs

the: 27361
and: 26028
i: 20681
to: 19150
of: 17463
a: 14593
you: 13615
my: 12481
in: 10956
that: 10890
is: 9134
not: 8497
with: 7771
me: 7769
it: 7678
Command took 1.71s 

# TEST Count the words (4f)
Test.assertEquals(top15WordsAndCounts,
                  [(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),
                   (u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),
                   (u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],
                  'incorrect value for top15WordsAndCounts')
1 test passed.
Command took 0.07s 
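
As a closing aside, a sortBy()-based alternative to the custom takeOrdered() key is sketched below; takeOrdered() is generally cheaper because it avoids sorting the full RDD, so this is for comparison only.

# Sort descending by the count (the value part of each pair), then take 15
top15ViaSortBy = (wordCount(shakeWordsRDD)
                  .sortBy(lambda (w, c): c, ascending=False)
                  .take(15))
print top15ViaSortBy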