Document distance note

Document Distance

  • Document = sequence of words

    - Ignore punctuation & formatting

  • Word = sequence of alphanumeric characters

  • How to define "distance"?

  • Idea: focus on shared words

  • Word frequencies:

    - D(w) = # occurrences of word w in document D

  • Vector space model

In [231]:
# Initial version of document distance

# This program computes the "distance" between two text files
# as the angle between their word frequency vectors (in radians).

# For each input file, a word-frequency vector is computed as follows:
#    (1) the specified file is read in
#    (2) it is converted into a list of alphanumeric "words"
#        Here a "word" is a sequence of consecutive alphanumeric
#        characters. Non-alphanumeric characters are treated as blanks.
#        Case is not significant.
#    (3) for each word, its frequency of occurrence is determined
#    (4) the word/frequency lists are sorted into alphabetical order

# The "distance" between two vectors is the angle between them.
# If x = (x1, x2, ..., xn) is the first vector (xi = freq of word i)
# and y = (y1, y2, ..., yn) is the second vector,
# then the angle between them is defined as:
#    d(x, y) = arccos(inner_product(x, y) / (norm(x)*norm(y)))
# where:
#    inner_product(x, y) = x1*y1 + x2*y2 + ... + xn*yn
#    norm(x) = sqrt(inner_product(x, x))
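
A tiny worked example (illustrative numbers, not from the lecture files): if x = (1, 2) and y = (2, 1) are the frequency vectors of the same two words in two documents, then inner_product(x, y) = 1*2 + 2*1 = 4 and norm(x) = norm(y) = sqrt(5), so d(x, y) = arccos(4/5) ≈ 0.6435 radians. Identical frequency vectors give arccos(1) = 0, while documents sharing no words give arccos(0) = π/2.
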
In [232]:
import math
import sys
In [233]:
#################################
# Operation 1: read a text file##
#################################
def read_file(filename):
    '''
    Read the text file with the given filename;
    return a list of the lines of text in the file.
    '''
    
    try:
        f = open(filename, 'r')
        return f.readlines()
    except IOError:
        print "Error opening or reading input file: ", filename
        sys.exit()
In [234]:
#################################################
# Operation 2: split the text lines into words ##
#################################################
def get_words_from_line_list(L):
    '''
    Parse the given list L of text lines into words.
    Return list of all words found.
    '''
    
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
        
    return word_list
    
def get_words_from_string(line):
    '''
    Return a list of the words in the given input string,
    converting each word to lower-case.
    
    Input: line(a string)
    Output: a list of strings (each string is a sequence of alphanumeric characters)
    '''
    
    word_list = []    # accumulates words in line
    character_list = []    # accumulates characters in word
    
    for c in line:
        if c.isalnum():
            character_list.append(c)
        elif len(character_list) > 0:
            word = "".join(character_list)
            word = word.lower()
            word_list.append(word)
            character_list = []
    
    if len(character_list) > 0:
        word = "".join(character_list)
        word = word.lower()
        word_list.append(word)
    
    return word_list
In [235]:
# test get_words_from_string
s = "This is a test String!"
word_list = get_words_from_string(s)
print word_list
['this', 'is', 'a', 'test', 'string']
In [236]:
# test get_words_from_line_list
L_test = []
L_test.append("Parse the given list L of text lines into words.")
L_test.append("Return list of all words found.")
L_test.append("get_words_from_line_list")

word_list = get_words_from_line_list(L_test)
word_list
Out[236]:
['parse',
 'the',
 'given',
 'list',
 'l',
 'of',
 'text',
 'lines',
 'into',
 'words',
 'return',
 'list',
 'of',
 'all',
 'words',
 'found',
 'get',
 'words',
 'from',
 'line',
 'list']
In [237]:
##############################################
# Operation 3: count frequency of each word ##
##############################################
def count_frequency(word_list):
    '''
    Return a list giving pairs of form: (word, frequency)
    '''
    
    L = []
    
    for new_word in word_list:
        for entry in L:
            if new_word == entry[0]:
                entry[1] = entry[1] + 1
                break
        else:
            L.append([new_word,1])
    
    return L
In [238]:
# test count_frequency
count_frequency(word_list)
Out[238]:
[['parse', 1],
 ['the', 1],
 ['given', 1],
 ['list', 3],
 ['l', 1],
 ['of', 2],
 ['text', 1],
 ['lines', 1],
 ['into', 1],
 ['words', 3],
 ['return', 1],
 ['all', 1],
 ['found', 1],
 ['get', 1],
 ['from', 1],
 ['line', 1]]
In [239]:
###################################################
# Operation 4: sort words into alphabetic order  ##
###################################################
def insertion_sort(A):
    '''
    Sort list A into increasing order, in place (A is also returned).
    Comparison of [word, frequency] pairs is lexicographic, so the pairs
    end up ordered alphabetically by word.
    '''
    
    for j in range(len(A)):
        key = A[j]
        
        i = j - 1
        while i > -1 and A[i] > key:
            A[i + 1] = A[i]
            i = i - 1
        A[i + 1] = key
    
    return A
    
In [240]:
#########################################################
# Operation 5: compute word frequencies for input file ##
#########################################################
def word_frequencies_for_file(filename):
    '''
    Return alphabetically sorted list of (word, frequency) pairs
    for the given file
    '''
    
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    insertion_sort(freq_mapping)
    
    print "File ", filename, ": "
    print len(line_list), "lines, ",
    print len(word_list), "words, ",
    print len(freq_mapping), "distinct words"
    
    return freq_mapping
    
In [241]:
# test word_frequencies_for_file
file_name = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
freq_mapping = word_frequencies_for_file(file_name)
File  /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 
1057 lines,  8943 words,  2150 distinct words
In [242]:
def inner_product(L1, L2):
    '''
    Inner product between two vectors, where vectors
    are represented as alphabetically sorted (word, freq) pairs.
    
    Example: inner_product([["and",3],["of",2],["the",5]],
                           [["and",4],["in",1],["of",1],["this",2]]) = 14.0
    '''
    
    sum = 0.0
    i = 0
    j = 0
    
    while i < len(L1) and j < len(L2):
        # L1[i:] and L2[j:] yet to be processed
        if L1[i][0] == L2[j][0]:
            # both vectors have this word
            sum += L1[i][1] * L2[j][1]
            i += 1
            j += 1
        elif L1[i][0] < L2[j][0]:
            # word L1[i][0] is in L1 but not L2
            i += 1
        else:
            # word L2[j][0] is in L2 but not L1
            j += 1
    
    return sum
In [243]:
def vector_angle(L1, L2):
    '''
    The input is a list of (word, freq) pairs, sorted alphabetically.
    
    Return the angle between these two vectors.
    '''
    
    numerator = inner_product(L1, L2)
    denominator = math.sqrt(inner_product(L1, L1) * inner_product(L2, L2))
    return math.acos(numerator/denominator)
    #return math.acos(numerator / float(denominator))
In [244]:
# document distance version 1 test
def test_docdist_1():
    filename_1 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
    filename_2 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt"

    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)

    distance = vector_angle(sorted_word_list_1,sorted_word_list_2)

    print "The distance between the documents is: %0.6f (radians)" % distance
    
test_docdist_1()
File  /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 
1057 lines,  8943 words,  2150 distinct words
File  /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 
6667 lines,  49785 words,  3354 distinct words
The distance between the documents is: 0.582949 (radians)
In [245]:
# document distance version 2
# add profiling 

import profile

profile.run("test_docdist_1()")
File  /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 
1057 lines,  8943 words,  2150 distinct words
File  /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 
6667 lines,  49785 words,  3354 distinct words
The distance between the documents is: 0.582949 (radians)
         862559 function calls in 11.596 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   300916    1.131    0.000    1.131    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(close)
        8    0.000    0.000    0.000    0.000 :0(copy)
        2    0.000    0.000    0.000    0.000 :0(count)
       38    0.000    0.000    0.001    0.000 :0(decode)
        2    0.000    0.000    0.000    0.000 :0(digest)
        6    0.000    0.000    0.000    0.000 :0(encode)
        4    0.000    0.000    0.000    0.000 :0(extend)
        2    0.000    0.000    0.000    0.000 :0(get)
        2    0.000    0.000    0.000    0.000 :0(get_ident)
        2    0.000    0.000    0.000    0.000 :0(getattr)
       44    0.000    0.000    0.000    0.000 :0(getpid)
        2    0.000    0.000    0.000    0.000 :0(getvalue)
        4    0.000    0.000    0.000    0.000 :0(group)
        2    0.000    0.000    0.000    0.000 :0(hasattr)
        2    0.000    0.000    0.000    0.000 :0(hexdigest)
   322488    1.204    0.000    1.204    0.000 :0(isalnum)
       94    0.001    0.000    0.001    0.000 :0(isinstance)
        2    0.000    0.000    0.000    0.000 :0(isoformat)
    58736    0.232    0.000    0.232    0.000 :0(join)
   113167    0.423    0.000    0.423    0.000 :0(len)
        2    0.000    0.000    0.000    0.000 :0(locals)
    58728    0.226    0.000    0.226    0.000 :0(lower)
        2    0.000    0.000    0.000    0.000 :0(map)
        2    0.000    0.000    0.000    0.000 :0(max)
        2    0.000    0.000    0.000    0.000 :0(now)
        2    0.000    0.000    0.000    0.000 :0(open)
        4    0.000    0.000    0.000    0.000 :0(range)
        2    0.001    0.001    0.001    0.001 :0(readlines)
       14    0.000    0.000    0.000    0.000 :0(send)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
       52    0.000    0.000    0.000    0.000 :0(sub)
       38    0.000    0.000    0.000    0.000 :0(time)
       10    0.000    0.000    0.000    0.000 :0(update)
        2    0.000    0.000    0.000    0.000 :0(upper)
       38    0.000    0.000    0.000    0.000 :0(utf_8_decode)
       38    0.000    0.000    0.000    0.000 :0(write)
        2    0.000    0.000    0.000    0.000 :0(zmq_poll)
        2    0.000    0.000    0.001    0.001 <ipython-input-233-02a2fcdaef67>:4(read_file)
     7724    3.395    0.000    6.517    0.001 <ipython-input-234-197de0f6765b>:17(get_words_from_string)
        2    1.349    0.675    7.866    3.933 <ipython-input-234-197de0f6765b>:4(get_words_from_line_list)
        2    2.288    1.144    2.309    1.155 <ipython-input-237-8969006c1981>:4(count_frequency)
        2    1.254    0.627    1.254    0.627 <ipython-input-239-3e98c1c9bee1>:4(insertion_sort)
        2    0.001    0.000   11.437    5.719 <ipython-input-240-ba790dd58c4e>:4(word_frequencies_for_file)
        3    0.084    0.028    0.154    0.051 <ipython-input-242-6827588a201a>:1(inner_product)
        1    0.000    0.000    0.154    0.154 <ipython-input-243-db4e3492a3db>:1(vector_angle)
        1    0.001    0.001   11.595   11.595 <ipython-input-244-9eaee09f69ba>:2(test_docdist_1)
        1    0.001    0.001   11.596   11.596 <string>:1(<module>)
        8    0.000    0.000    0.002    0.000 __init__.py:193(dumps)
        2    0.000    0.000    0.000    0.000 __init__.py:52(create_string_buffer)
        2    0.000    0.000    0.000    0.000 attrsettr.py:35(__getattr__)
        8    0.000    0.000    0.000    0.000 encoder.py:101(__init__)
        8    0.000    0.000    0.002    0.000 encoder.py:186(encode)
        8    0.000    0.000    0.001    0.000 encoder.py:212(iterencode)
       52    0.000    0.000    0.001    0.000 encoder.py:33(encode_basestring)
        4    0.000    0.000    0.000    0.000 encoder.py:37(replace)
        2    0.000    0.000    0.000    0.000 hmac.py:100(_current)
        2    0.000    0.000    0.000    0.000 hmac.py:119(hexdigest)
        2    0.000    0.000    0.000    0.000 hmac.py:30(__init__)
        8    0.000    0.000    0.000    0.000 hmac.py:83(update)
        2    0.000    0.000    0.000    0.000 hmac.py:88(copy)
       40    0.000    0.000    0.001    0.000 iostream.py:102(_check_mp_mode)
        2    0.000    0.000    0.000    0.000 iostream.py:123(_flush_from_subprocesses)
        2    0.000    0.000    0.005    0.002 iostream.py:151(flush)
       38    0.001    0.000    0.008    0.000 iostream.py:207(write)
        2    0.000    0.000    0.000    0.000 iostream.py:238(_flush_buffer)
        2    0.000    0.000    0.000    0.000 iostream.py:247(_new_buffer)
       42    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:96(_is_master_thread)
        8    0.000    0.000    0.002    0.000 jsonapi.py:31(dumps)
        2    0.000    0.000    0.000    0.000 jsonutil.py:75(date_default)
        2    0.000    0.000    0.000    0.000 poll.py:77(poll)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000   11.596   11.596 profile:0(test_docdist_1())
        2    0.000    0.000    0.000    0.000 py3compat.py:12(no_code)
        2    0.000    0.000    0.000    0.000 session.py:206(msg_header)
        2    0.000    0.000    0.000    0.000 session.py:211(extract_header)
        2    0.000    0.000    0.000    0.000 session.py:452(msg_id)
        2    0.000    0.000    0.001    0.000 session.py:504(msg_header)
        2    0.000    0.000    0.001    0.000 session.py:507(msg)
        2    0.000    0.000    0.000    0.000 session.py:526(sign)
        2    0.000    0.000    0.003    0.001 session.py:541(serialize)
        2    0.000    0.000    0.004    0.002 session.py:600(send)
        8    0.000    0.000    0.002    0.000 session.py:94(<lambda>)
        2    0.000    0.000    0.001    0.000 socket.py:289(send_multipart)
        2    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
        2    0.000    0.000    0.000    0.000 threading.py:983(ident)
       26    0.000    0.000    0.000    0.000 traitlets.py:420(__get__)
       38    0.000    0.000    0.001    0.000 utf_8.py:15(decode)
        2    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        2    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        2    0.000    0.000    0.000    0.000 uuid.py:546(uuid4)
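
A note on reading the report: ncalls is the number of calls, tottime is the time spent inside the function itself, and cumtime also includes time spent in the functions it calls. The table above is ordered by function name; passing a sort key (an optional variant, not used in the original run) lists the most expensive call chains first, which brings word_frequencies_for_file, get_words_from_line_list, get_words_from_string, count_frequency and insertion_sort to the top:

profile.run("test_docdist_1()", sort="cumulative")
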


In [246]:
# document distance version 3
# replace + with extend 
# list concatenation A + B costs O(|A| + |B|) because it builds a new list,
# while A.extend(B) costs only O(|B|) because it appends to A in place
# (see the sketch below)
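
A minimal sketch of the difference, using made-up sizes rather than the lecture data: repeated concatenation copies everything accumulated so far at every step, so building a list of n words that way costs Θ(n²) overall, while extend only copies each new chunk, Θ(n) in total.

# Illustrative comparison on dummy data (hypothetical sizes, not the lecture files)
chunks = [["word"] * 10 for _ in range(1000)]   # stands in for the per-line word lists

slow = []
for chunk in chunks:
    slow = slow + chunk        # copies all of `slow` every iteration: quadratic overall

fast = []
for chunk in chunks:
    fast.extend(chunk)         # appends in place: linear overall
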
In [247]:
def get_words_from_line_list_3(L):
    '''
    Parse the given list L of text lines into words.
    Return list of all words found.
    '''
    
    word_list = []
    
    for line in L:
        words_in_line = get_words_from_string(line)
        # Using "extend" is much more efficient than concatenation here:
        word_list.extend(words_in_line)
    
    return word_list
    
In [248]:
def word_frequencies_for_file_3(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """

    line_list = read_file(filename)
    word_list = get_words_from_line_list_3(line_list)
    freq_mapping = count_frequency(word_list)
    insertion_sort(freq_mapping)

    print "File",filename,":",
    print len(line_list),"lines,",
    print len(word_list),"words,",
    print len(freq_mapping),"distinct words"

    return freq_mapping
In [249]:
# document distance version 3 test
def test_docdist_3():
    filename_1 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
    filename_2 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt"

    sorted_word_list_1 = word_frequencies_for_file_3(filename_1)
    sorted_word_list_2 = word_frequencies_for_file_3(filename_2)

    distance = vector_angle(sorted_word_list_1,sorted_word_list_2)

    print "The distance between the documents is: %0.6f (radians)" % distance
    
test_docdist_3()
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
In [250]:
profile.run("test_docdist_3()")
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
         870279 function calls in 10.097 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   300916    1.101    0.000    1.101    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(close)
        8    0.000    0.000    0.000    0.000 :0(copy)
        2    0.000    0.000    0.000    0.000 :0(count)
       38    0.000    0.000    0.001    0.000 :0(decode)
        2    0.000    0.000    0.000    0.000 :0(digest)
        6    0.000    0.000    0.000    0.000 :0(encode)
     7728    0.030    0.000    0.030    0.000 :0(extend)
        2    0.000    0.000    0.000    0.000 :0(get)
        2    0.000    0.000    0.000    0.000 :0(get_ident)
        2    0.000    0.000    0.000    0.000 :0(getattr)
       44    0.000    0.000    0.000    0.000 :0(getpid)
        2    0.000    0.000    0.000    0.000 :0(getvalue)
        2    0.000    0.000    0.000    0.000 :0(group)
        2    0.000    0.000    0.000    0.000 :0(hasattr)
        2    0.000    0.000    0.000    0.000 :0(hexdigest)
   322488    1.172    0.000    1.172    0.000 :0(isalnum)
       94    0.000    0.000    0.000    0.000 :0(isinstance)
        2    0.000    0.000    0.000    0.000 :0(isoformat)
    58736    0.226    0.000    0.226    0.000 :0(join)
   113167    0.419    0.000    0.419    0.000 :0(len)
        2    0.000    0.000    0.000    0.000 :0(locals)
    58728    0.219    0.000    0.219    0.000 :0(lower)
        2    0.000    0.000    0.000    0.000 :0(map)
        2    0.000    0.000    0.000    0.000 :0(max)
        2    0.000    0.000    0.000    0.000 :0(now)
        2    0.000    0.000    0.000    0.000 :0(open)
        4    0.000    0.000    0.000    0.000 :0(range)
        2    0.001    0.000    0.001    0.000 :0(readlines)
       14    0.000    0.000    0.000    0.000 :0(send)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
       52    0.000    0.000    0.000    0.000 :0(sub)
       38    0.000    0.000    0.000    0.000 :0(time)
       10    0.000    0.000    0.000    0.000 :0(update)
        2    0.000    0.000    0.000    0.000 :0(upper)
       38    0.000    0.000    0.000    0.000 :0(utf_8_decode)
       38    0.000    0.000    0.000    0.000 :0(write)
        2    0.000    0.000    0.000    0.000 :0(zmq_poll)
        2    0.000    0.000    0.001    0.001 <ipython-input-233-02a2fcdaef67>:4(read_file)
     7724    3.287    0.000    6.332    0.001 <ipython-input-234-197de0f6765b>:17(get_words_from_string)
        2    2.232    1.116    2.254    1.127 <ipython-input-237-8969006c1981>:4(count_frequency)
        2    1.244    0.622    1.245    0.622 <ipython-input-239-3e98c1c9bee1>:4(insertion_sort)
        3    0.083    0.028    0.154    0.051 <ipython-input-242-6827588a201a>:1(inner_product)
        1    0.000    0.000    0.154    0.154 <ipython-input-243-db4e3492a3db>:1(vector_angle)
        2    0.070    0.035    6.432    3.216 <ipython-input-247-39e11d2bc6c1>:1(get_words_from_line_list_3)
        2    0.001    0.000    9.937    4.969 <ipython-input-248-aa12883d24f7>:1(word_frequencies_for_file_3)
        1    0.001    0.001   10.095   10.095 <ipython-input-249-e2ab858cd19a>:2(test_docdist_3)
        1    0.001    0.001   10.096   10.096 <string>:1(<module>)
        8    0.000    0.000    0.002    0.000 __init__.py:193(dumps)
        2    0.000    0.000    0.000    0.000 __init__.py:52(create_string_buffer)
        2    0.000    0.000    0.000    0.000 attrsettr.py:35(__getattr__)
        8    0.000    0.000    0.000    0.000 encoder.py:101(__init__)
        8    0.000    0.000    0.001    0.000 encoder.py:186(encode)
        8    0.000    0.000    0.001    0.000 encoder.py:212(iterencode)
       52    0.000    0.000    0.001    0.000 encoder.py:33(encode_basestring)
        2    0.000    0.000    0.000    0.000 encoder.py:37(replace)
        2    0.000    0.000    0.000    0.000 hmac.py:100(_current)
        2    0.000    0.000    0.000    0.000 hmac.py:119(hexdigest)
        2    0.000    0.000    0.000    0.000 hmac.py:30(__init__)
        8    0.000    0.000    0.000    0.000 hmac.py:83(update)
        2    0.000    0.000    0.000    0.000 hmac.py:88(copy)
       40    0.000    0.000    0.001    0.000 iostream.py:102(_check_mp_mode)
        2    0.000    0.000    0.000    0.000 iostream.py:123(_flush_from_subprocesses)
        2    0.000    0.000    0.005    0.002 iostream.py:151(flush)
       38    0.001    0.000    0.008    0.000 iostream.py:207(write)
        2    0.000    0.000    0.000    0.000 iostream.py:238(_flush_buffer)
        2    0.000    0.000    0.000    0.000 iostream.py:247(_new_buffer)
       42    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:96(_is_master_thread)
        8    0.000    0.000    0.002    0.000 jsonapi.py:31(dumps)
        2    0.000    0.000    0.000    0.000 jsonutil.py:75(date_default)
        2    0.000    0.000    0.000    0.000 poll.py:77(poll)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000   10.097   10.097 profile:0(test_docdist_3())
        2    0.000    0.000    0.000    0.000 py3compat.py:12(no_code)
        2    0.000    0.000    0.000    0.000 session.py:206(msg_header)
        2    0.000    0.000    0.000    0.000 session.py:211(extract_header)
        2    0.000    0.000    0.000    0.000 session.py:452(msg_id)
        2    0.000    0.000    0.001    0.000 session.py:504(msg_header)
        2    0.000    0.000    0.001    0.000 session.py:507(msg)
        2    0.000    0.000    0.000    0.000 session.py:526(sign)
        2    0.000    0.000    0.003    0.001 session.py:541(serialize)
        2    0.000    0.000    0.004    0.002 session.py:600(send)
        8    0.000    0.000    0.002    0.000 session.py:94(<lambda>)
        2    0.000    0.000    0.001    0.000 socket.py:289(send_multipart)
        2    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
        2    0.000    0.000    0.000    0.000 threading.py:983(ident)
       26    0.000    0.000    0.000    0.000 traitlets.py:420(__get__)
       38    0.000    0.000    0.001    0.000 utf_8.py:15(decode)
        2    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        2    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        2    0.000    0.000    0.000    0.000 uuid.py:546(uuid4)


In [251]:
# document distance version 4
# count frequencies using a dictionary: membership tests and updates take
# O(1) time on average, but D.items() comes back unordered, so the result
# still has to be sorted afterwards

def count_frequency_4(word_list):
    '''
    Return a list giving pairs of form: (word, frequency)
    '''
    
    D = {}
    
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    
    return D.items()
In [252]:
def word_frequencies_for_file_4(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """

    line_list = read_file(filename)
    word_list = get_words_from_line_list_3(line_list)
    freq_mapping = count_frequency_4(word_list)
    insertion_sort(freq_mapping)

    print "File",filename,":",
    print len(line_list),"lines,",
    print len(word_list),"words,",
    print len(freq_mapping),"distinct words"

    return freq_mapping
In [253]:
# document distance version 4 test
def test_docdist_4():
    filename_1 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
    filename_2 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt"

    sorted_word_list_1 = word_frequencies_for_file_4(filename_1)
    sorted_word_list_2 = word_frequencies_for_file_4(filename_2)

    distance = vector_angle(sorted_word_list_1,sorted_word_list_2)

    print "The distance between the documents is: %0.6f (radians)" % distance
    
test_docdist_4()
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
In [254]:
profile.run("test_docdist_4()")
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
         864777 function calls in 7.864 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   295412    1.086    0.000    1.086    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(close)
        8    0.000    0.000    0.000    0.000 :0(copy)
        2    0.000    0.000    0.000    0.000 :0(count)
       38    0.000    0.000    0.001    0.000 :0(decode)
        2    0.000    0.000    0.000    0.000 :0(digest)
        6    0.000    0.000    0.000    0.000 :0(encode)
     7728    0.030    0.000    0.030    0.000 :0(extend)
        2    0.000    0.000    0.000    0.000 :0(get)
        2    0.000    0.000    0.000    0.000 :0(get_ident)
        2    0.000    0.000    0.000    0.000 :0(getattr)
       44    0.000    0.000    0.000    0.000 :0(getpid)
        2    0.000    0.000    0.000    0.000 :0(getvalue)
        2    0.000    0.000    0.000    0.000 :0(group)
        2    0.000    0.000    0.000    0.000 :0(hasattr)
        2    0.000    0.000    0.000    0.000 :0(hexdigest)
   322488    1.186    0.000    1.186    0.000 :0(isalnum)
       94    0.000    0.000    0.000    0.000 :0(isinstance)
        2    0.000    0.000    0.000    0.000 :0(isoformat)
        2    0.001    0.001    0.001    0.001 :0(items)
    58736    0.228    0.000    0.228    0.000 :0(join)
   113167    0.416    0.000    0.416    0.000 :0(len)
        2    0.000    0.000    0.000    0.000 :0(locals)
    58728    0.221    0.000    0.221    0.000 :0(lower)
        2    0.000    0.000    0.000    0.000 :0(map)
        2    0.000    0.000    0.000    0.000 :0(max)
        2    0.000    0.000    0.000    0.000 :0(now)
        2    0.000    0.000    0.000    0.000 :0(open)
        4    0.000    0.000    0.000    0.000 :0(range)
        2    0.001    0.000    0.001    0.000 :0(readlines)
       14    0.000    0.000    0.000    0.000 :0(send)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
       52    0.000    0.000    0.000    0.000 :0(sub)
       38    0.000    0.000    0.000    0.000 :0(time)
       10    0.000    0.000    0.000    0.000 :0(update)
        2    0.000    0.000    0.000    0.000 :0(upper)
       38    0.000    0.000    0.000    0.000 :0(utf_8_decode)
       38    0.000    0.000    0.000    0.000 :0(write)
        2    0.000    0.000    0.000    0.000 :0(zmq_poll)
        2    0.000    0.000    0.001    0.001 <ipython-input-233-02a2fcdaef67>:4(read_file)
     7724    3.327    0.000    6.393    0.001 <ipython-input-234-197de0f6765b>:17(get_words_from_string)
        2    1.186    0.593    1.186    0.593 <ipython-input-239-3e98c1c9bee1>:4(insertion_sort)
        3    0.082    0.027    0.153    0.051 <ipython-input-242-6827588a201a>:1(inner_product)
        1    0.000    0.000    0.153    0.153 <ipython-input-243-db4e3492a3db>:1(vector_angle)
        2    0.071    0.035    6.494    3.247 <ipython-input-247-39e11d2bc6c1>:1(get_words_from_line_list_3)
        2    0.017    0.009    0.018    0.009 <ipython-input-251-cdecf979a6e1>:4(count_frequency_4)
        2    0.001    0.000    7.706    3.853 <ipython-input-252-73c39f19d34d>:1(word_frequencies_for_file_4)
        1    0.001    0.001    7.863    7.863 <ipython-input-253-a98f6349bc3e>:2(test_docdist_4)
        1    0.000    0.000    7.864    7.864 <string>:1(<module>)
        8    0.000    0.000    0.002    0.000 __init__.py:193(dumps)
        2    0.000    0.000    0.000    0.000 __init__.py:52(create_string_buffer)
        2    0.000    0.000    0.000    0.000 attrsettr.py:35(__getattr__)
        8    0.000    0.000    0.000    0.000 encoder.py:101(__init__)
        8    0.000    0.000    0.002    0.000 encoder.py:186(encode)
        8    0.000    0.000    0.001    0.000 encoder.py:212(iterencode)
       52    0.000    0.000    0.001    0.000 encoder.py:33(encode_basestring)
        2    0.000    0.000    0.000    0.000 encoder.py:37(replace)
        2    0.000    0.000    0.000    0.000 hmac.py:100(_current)
        2    0.000    0.000    0.000    0.000 hmac.py:119(hexdigest)
        2    0.000    0.000    0.000    0.000 hmac.py:30(__init__)
        8    0.000    0.000    0.000    0.000 hmac.py:83(update)
        2    0.000    0.000    0.000    0.000 hmac.py:88(copy)
       40    0.000    0.000    0.001    0.000 iostream.py:102(_check_mp_mode)
        2    0.000    0.000    0.000    0.000 iostream.py:123(_flush_from_subprocesses)
        2    0.000    0.000    0.005    0.002 iostream.py:151(flush)
       38    0.001    0.000    0.008    0.000 iostream.py:207(write)
        2    0.000    0.000    0.000    0.000 iostream.py:238(_flush_buffer)
        2    0.000    0.000    0.000    0.000 iostream.py:247(_new_buffer)
       42    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:96(_is_master_thread)
        8    0.000    0.000    0.002    0.000 jsonapi.py:31(dumps)
        2    0.000    0.000    0.000    0.000 jsonutil.py:75(date_default)
        2    0.000    0.000    0.000    0.000 poll.py:77(poll)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    7.864    7.864 profile:0(test_docdist_4())
        2    0.000    0.000    0.000    0.000 py3compat.py:12(no_code)
        2    0.000    0.000    0.000    0.000 session.py:206(msg_header)
        2    0.000    0.000    0.000    0.000 session.py:211(extract_header)
        2    0.000    0.000    0.000    0.000 session.py:452(msg_id)
        2    0.000    0.000    0.001    0.000 session.py:504(msg_header)
        2    0.000    0.000    0.001    0.000 session.py:507(msg)
        2    0.000    0.000    0.000    0.000 session.py:526(sign)
        2    0.000    0.000    0.003    0.001 session.py:541(serialize)
        2    0.000    0.000    0.004    0.002 session.py:600(send)
        8    0.000    0.000    0.002    0.000 session.py:94(<lambda>)
        2    0.000    0.000    0.001    0.000 socket.py:289(send_multipart)
        2    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
        2    0.000    0.000    0.000    0.000 threading.py:983(ident)
       26    0.000    0.000    0.000    0.000 traitlets.py:420(__get__)
       38    0.000    0.000    0.001    0.000 utf_8.py:15(decode)
        2    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        2    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        2    0.000    0.000    0.000    0.000 uuid.py:546(uuid4)


In [255]:
# document distance version 5
# split words with string.translate: one translate call per line runs in C,
# replacing the per-character Python loop of get_words_from_string
In [256]:
import string
In [257]:
# global variables needed for fast parsing
# translation table maps upper case to lower case and punctuation to spaces
translation_table = string.maketrans(string.punctuation + string.uppercase,
                                    " " * len(string.punctuation) + string.lowercase)
In [258]:
def get_words_from_line_list_5(L):
    '''
    Parse the given list L of text lines into words.
    Return list of all words found.
    '''
    
    word_list = []
    for line in L:
        words_in_line = get_words_from_string_5(line)
        # Using "extend" is much more efficient than concatenation here:
        word_list.extend(words_in_line)
        
    return word_list

def get_words_from_string_5(line):
    '''
    Return a list of words in the given input string,
    converting each word to lower-case.
    
    Input: line (a string)
    Output: a list of strings
            (each string is a sequence of alphanumeric characters)
    '''
    
    line = line.translate(translation_table)
    word_list = line.split()
    
    return word_list
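
As a quick sanity check (added here; this was not an output cell in the original notebook), the translate-based splitter should produce the same words as the character-by-character version tested earlier:

s = "This is a test String!"
print get_words_from_string_5(s)   # expected: ['this', 'is', 'a', 'test', 'string']
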
In [259]:
def word_frequencies_for_file_5(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """

    line_list = read_file(filename)
    word_list = get_words_from_line_list_5(line_list)
    freq_mapping = count_frequency_4(word_list)
    insertion_sort(freq_mapping)

    print "File",filename,":",
    print len(line_list),"lines,",
    print len(word_list),"words,",
    print len(freq_mapping),"distinct words"

    return freq_mapping
In [260]:
# document distance version 5 test
def test_docdist_5():
    filename_1 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
    filename_2 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt"

    sorted_word_list_1 = word_frequencies_for_file_5(filename_1)
    sorted_word_list_2 = word_frequencies_for_file_5(filename_2)

    distance = vector_angle(sorted_word_list_1,sorted_word_list_2)

    print "The distance between the documents is: %0.6f (radians)" % distance
    
test_docdist_5()
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
In [284]:
profile.run("test_docdist_5()")
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
         51341 function calls in 1.689 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
        6    0.000    0.000    0.000    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(close)
        8    0.000    0.000    0.000    0.000 :0(copy)
        2    0.000    0.000    0.000    0.000 :0(count)
       38    0.000    0.000    0.001    0.000 :0(decode)
        2    0.000    0.000    0.000    0.000 :0(digest)
        6    0.000    0.000    0.000    0.000 :0(encode)
     7728    0.029    0.000    0.029    0.000 :0(extend)
        2    0.000    0.000    0.000    0.000 :0(get)
        2    0.000    0.000    0.000    0.000 :0(get_ident)
        2    0.000    0.000    0.000    0.000 :0(getattr)
       44    0.000    0.000    0.000    0.000 :0(getpid)
        2    0.000    0.000    0.000    0.000 :0(getvalue)
        2    0.000    0.000    0.000    0.000 :0(group)
        2    0.000    0.000    0.000    0.000 :0(hasattr)
        2    0.000    0.000    0.000    0.000 :0(hexdigest)
       94    0.000    0.000    0.000    0.000 :0(isinstance)
        2    0.000    0.000    0.000    0.000 :0(isoformat)
        2    0.035    0.018    0.035    0.018 :0(items)
        8    0.000    0.000    0.000    0.000 :0(join)
    19633    0.077    0.000    0.077    0.000 :0(len)
        2    0.000    0.000    0.000    0.000 :0(locals)
        2    0.000    0.000    0.000    0.000 :0(map)
        2    0.000    0.000    0.000    0.000 :0(max)
        2    0.000    0.000    0.000    0.000 :0(now)
        2    0.000    0.000    0.000    0.000 :0(open)
        4    0.000    0.000    0.000    0.000 :0(range)
        2    0.001    0.001    0.001    0.001 :0(readlines)
       14    0.000    0.000    0.000    0.000 :0(send)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     7724    0.037    0.000    0.037    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
       52    0.000    0.000    0.000    0.000 :0(sub)
       38    0.000    0.000    0.000    0.000 :0(time)
     7724    0.031    0.000    0.031    0.000 :0(translate)
       10    0.000    0.000    0.000    0.000 :0(update)
        2    0.000    0.000    0.000    0.000 :0(upper)
       38    0.000    0.000    0.000    0.000 :0(utf_8_decode)
       38    0.000    0.000    0.000    0.000 :0(write)
        2    0.000    0.000    0.000    0.000 :0(zmq_poll)
        2    0.000    0.000    0.001    0.001 <ipython-input-233-02a2fcdaef67>:4(read_file)
        2    1.207    0.603    1.207    0.603 <ipython-input-239-3e98c1c9bee1>:4(insertion_sort)
        3    0.087    0.029    0.164    0.055 <ipython-input-242-6827588a201a>:1(inner_product)
        1    0.000    0.000    0.164    0.164 <ipython-input-243-db4e3492a3db>:1(vector_angle)
        2    0.018    0.009    0.053    0.026 <ipython-input-251-cdecf979a6e1>:4(count_frequency_4)
        2    0.065    0.032    0.253    0.127 <ipython-input-258-139f245f7c34>:1(get_words_from_line_list_5)
     7724    0.092    0.000    0.159    0.000 <ipython-input-258-139f245f7c34>:15(get_words_from_string_5)
        2    0.001    0.000    1.521    0.760 <ipython-input-259-0f0c6c8368de>:1(word_frequencies_for_file_5)
        1    0.001    0.001    1.689    1.689 <ipython-input-260-a9966778a7e4>:2(test_docdist_5)
        1    0.001    0.001    1.689    1.689 <string>:1(<module>)
        8    0.000    0.000    0.002    0.000 __init__.py:193(dumps)
        2    0.000    0.000    0.000    0.000 __init__.py:52(create_string_buffer)
        2    0.000    0.000    0.000    0.000 attrsettr.py:35(__getattr__)
        8    0.000    0.000    0.000    0.000 encoder.py:101(__init__)
        8    0.000    0.000    0.002    0.000 encoder.py:186(encode)
        8    0.000    0.000    0.001    0.000 encoder.py:212(iterencode)
       52    0.000    0.000    0.001    0.000 encoder.py:33(encode_basestring)
        2    0.000    0.000    0.000    0.000 encoder.py:37(replace)
        2    0.000    0.000    0.000    0.000 hmac.py:100(_current)
        2    0.000    0.000    0.000    0.000 hmac.py:119(hexdigest)
        2    0.000    0.000    0.000    0.000 hmac.py:30(__init__)
        8    0.000    0.000    0.000    0.000 hmac.py:83(update)
        2    0.000    0.000    0.000    0.000 hmac.py:88(copy)
       40    0.000    0.000    0.001    0.000 iostream.py:102(_check_mp_mode)
        2    0.000    0.000    0.000    0.000 iostream.py:123(_flush_from_subprocesses)
        2    0.000    0.000    0.005    0.002 iostream.py:151(flush)
       38    0.001    0.000    0.008    0.000 iostream.py:207(write)
        2    0.000    0.000    0.000    0.000 iostream.py:238(_flush_buffer)
        2    0.000    0.000    0.000    0.000 iostream.py:247(_new_buffer)
       42    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:96(_is_master_thread)
        8    0.000    0.000    0.002    0.000 jsonapi.py:31(dumps)
        2    0.000    0.000    0.000    0.000 jsonutil.py:75(date_default)
        2    0.000    0.000    0.000    0.000 poll.py:77(poll)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    1.689    1.689 profile:0(test_docdist_5())
        2    0.000    0.000    0.000    0.000 py3compat.py:12(no_code)
        2    0.000    0.000    0.000    0.000 session.py:206(msg_header)
        2    0.000    0.000    0.000    0.000 session.py:211(extract_header)
        2    0.000    0.000    0.000    0.000 session.py:452(msg_id)
        2    0.000    0.000    0.001    0.000 session.py:504(msg_header)
        2    0.000    0.000    0.001    0.000 session.py:507(msg)
        2    0.000    0.000    0.000    0.000 session.py:526(sign)
        2    0.000    0.000    0.003    0.001 session.py:541(serialize)
        2    0.000    0.000    0.004    0.002 session.py:600(send)
        8    0.000    0.000    0.002    0.000 session.py:94(<lambda>)
        2    0.000    0.000    0.001    0.000 socket.py:289(send_multipart)
        2    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
        2    0.000    0.000    0.000    0.000 threading.py:983(ident)
       26    0.000    0.000    0.000    0.000 traitlets.py:420(__get__)
       38    0.000    0.000    0.001    0.000 utf_8.py:15(decode)
        2    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        2    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        2    0.000    0.000    0.000    0.000 uuid.py:546(uuid4)


In [262]:
# document distance version 6
# change insertion sort (worst-case Θ(n^2) comparisons) to merge sort (Θ(n log n))
In [263]:
def merge_sort(A):
    """
    Sort list A into order, and return result.
    """
    n = len(A)
    if n <= 1:     # also covers an empty list (avoids infinite recursion)
        return A
    mid = n//2     # floor division
    L = merge_sort(A[:mid])
    R = merge_sort(A[mid:])
    return merge(L,R)

def merge(L,R):
    """
    Given two sorted sequences L and R, return their merge.
    """
    i = 0
    j = 0
    answer = []
    while i<len(L) and j<len(R):
        if L[i]<R[j]:
            answer.append(L[i])
            i += 1
        else:
            answer.append(R[j])
            j += 1
    if i<len(L):
        answer.extend(L[i:])
    if j<len(R):
        answer.extend(R[j:])
    return answer
In [264]:
# test merge_sort
merge_result = merge_sort([1, 81, 65, 68, 34, 21, 10, 7, 9])
print merge_result
[1, 7, 9, 10, 21, 34, 65, 68, 81]
In [265]:
insertion_sort([1, 81, 65, 68, 34, 21, 10, 7, 9])
Out[265]:
[1, 7, 9, 10, 21, 34, 65, 68, 81]
In [266]:
def count_frequency_6(word_list):
    """
    Return a list giving pairs of form: (word,frequency)
    """
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word]+1
        else:
            D[new_word] = 1
    return D.items()
In [267]:
def word_frequencies_for_file_6(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """

    #line_list = read_file(filename)
    #word_list = get_words_from_line_list_5(line_list)
    #freq_mapping = count_frequency_4(word_list)
    #merge_sort(freq_mapping)
    
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency_6(word_list)
    freq_mapping = merge_sort(freq_mapping)

    print "File",filename,":",
    print len(line_list),"lines,",
    print len(word_list),"words,",
    print len(freq_mapping),"distinct words"

    return freq_mapping
In [268]:
# document distance version 6 test
def test_docdist_6():
    filename_1 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
    filename_2 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt"

    sorted_word_list_1 = word_frequencies_for_file_6(filename_1)
    sorted_word_list_2 = word_frequencies_for_file_6(filename_2)

    distance = vector_angle(sorted_word_list_1,sorted_word_list_2)

    print "The distance between the documents is: %0.6f (radians)" % distance
    
test_docdist_6()
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
In [269]:
profile.run("test_docdist_6()")
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
         1077632 function calls (1066628 primitive calls) in 9.615 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   351587    1.301    0.000    1.301    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(close)
        8    0.000    0.000    0.000    0.000 :0(copy)
        2    0.000    0.000    0.000    0.000 :0(count)
       38    0.000    0.000    0.001    0.000 :0(decode)
        2    0.000    0.000    0.000    0.000 :0(digest)
        6    0.000    0.000    0.000    0.000 :0(encode)
     5506    0.021    0.000    0.021    0.000 :0(extend)
        2    0.000    0.000    0.000    0.000 :0(get)
        2    0.000    0.000    0.000    0.000 :0(get_ident)
        2    0.000    0.000    0.000    0.000 :0(getattr)
       44    0.000    0.000    0.000    0.000 :0(getpid)
        2    0.000    0.000    0.000    0.000 :0(getvalue)
        2    0.000    0.000    0.000    0.000 :0(group)
        2    0.000    0.000    0.000    0.000 :0(hasattr)
        2    0.000    0.000    0.000    0.000 :0(hexdigest)
   322488    1.184    0.000    1.184    0.000 :0(isalnum)
       94    0.000    0.000    0.000    0.000 :0(isinstance)
        2    0.000    0.000    0.000    0.000 :0(isoformat)
        2    0.002    0.001    0.002    0.001 :0(items)
    58736    0.229    0.000    0.229    0.000 :0(join)
   255565    0.942    0.000    0.942    0.000 :0(len)
        2    0.000    0.000    0.000    0.000 :0(locals)
    58728    0.220    0.000    0.220    0.000 :0(lower)
        2    0.000    0.000    0.000    0.000 :0(map)
        2    0.000    0.000    0.000    0.000 :0(max)
        2    0.000    0.000    0.000    0.000 :0(now)
        2    0.000    0.000    0.000    0.000 :0(open)
        2    0.000    0.000    0.000    0.000 :0(range)
        2    0.001    0.000    0.001    0.000 :0(readlines)
       14    0.000    0.000    0.000    0.000 :0(send)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
       52    0.000    0.000    0.000    0.000 :0(sub)
       38    0.000    0.000    0.000    0.000 :0(time)
       10    0.000    0.000    0.000    0.000 :0(update)
        2    0.000    0.000    0.000    0.000 :0(upper)
       38    0.000    0.000    0.000    0.000 :0(utf_8_decode)
       38    0.000    0.000    0.000    0.000 :0(write)
        2    0.000    0.000    0.000    0.000 :0(zmq_poll)
        2    0.000    0.000    0.001    0.000 <ipython-input-233-02a2fcdaef67>:4(read_file)
     7724    3.328    0.000    6.404    0.001 <ipython-input-234-197de0f6765b>:17(get_words_from_string)
        2    1.323    0.661    7.727    3.863 <ipython-input-234-197de0f6765b>:4(get_words_from_line_list)
        3    0.083    0.028    0.155    0.052 <ipython-input-242-6827588a201a>:1(inner_product)
        1    0.000    0.000    0.155    0.155 <ipython-input-243-db4e3492a3db>:1(vector_angle)
  11006/2    0.172    0.000    1.703    0.852 <ipython-input-263-b089260f6291>:1(merge_sort)
     5502    0.783    0.000    1.486    0.000 <ipython-input-263-b089260f6291>:13(merge)
        2    0.016    0.008    0.018    0.009 <ipython-input-266-b25e3b7d8d72>:1(count_frequency_6)
        2    0.001    0.000    9.456    4.728 <ipython-input-267-cc5e6fd7e649>:1(word_frequencies_for_file_6)
        1    0.001    0.001    9.614    9.614 <ipython-input-268-a0c32fb6b53e>:2(test_docdist_6)
        1    0.001    0.001    9.615    9.615 <string>:1(<module>)
        8    0.000    0.000    0.002    0.000 __init__.py:193(dumps)
        2    0.000    0.000    0.000    0.000 __init__.py:52(create_string_buffer)
        2    0.000    0.000    0.000    0.000 attrsettr.py:35(__getattr__)
        8    0.000    0.000    0.000    0.000 encoder.py:101(__init__)
        8    0.000    0.000    0.001    0.000 encoder.py:186(encode)
        8    0.000    0.000    0.001    0.000 encoder.py:212(iterencode)
       52    0.000    0.000    0.001    0.000 encoder.py:33(encode_basestring)
        2    0.000    0.000    0.000    0.000 encoder.py:37(replace)
        2    0.000    0.000    0.000    0.000 hmac.py:100(_current)
        2    0.000    0.000    0.000    0.000 hmac.py:119(hexdigest)
        2    0.000    0.000    0.000    0.000 hmac.py:30(__init__)
        8    0.000    0.000    0.000    0.000 hmac.py:83(update)
        2    0.000    0.000    0.000    0.000 hmac.py:88(copy)
       40    0.000    0.000    0.001    0.000 iostream.py:102(_check_mp_mode)
        2    0.000    0.000    0.000    0.000 iostream.py:123(_flush_from_subprocesses)
        2    0.000    0.000    0.005    0.002 iostream.py:151(flush)
       38    0.001    0.000    0.008    0.000 iostream.py:207(write)
        2    0.000    0.000    0.000    0.000 iostream.py:238(_flush_buffer)
        2    0.000    0.000    0.000    0.000 iostream.py:247(_new_buffer)
       42    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:96(_is_master_thread)
        8    0.000    0.000    0.002    0.000 jsonapi.py:31(dumps)
        2    0.000    0.000    0.000    0.000 jsonutil.py:75(date_default)
        2    0.000    0.000    0.000    0.000 poll.py:77(poll)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    9.615    9.615 profile:0(test_docdist_6())
        2    0.000    0.000    0.000    0.000 py3compat.py:12(no_code)
        2    0.000    0.000    0.000    0.000 session.py:206(msg_header)
        2    0.000    0.000    0.000    0.000 session.py:211(extract_header)
        2    0.000    0.000    0.000    0.000 session.py:452(msg_id)
        2    0.000    0.000    0.001    0.000 session.py:504(msg_header)
        2    0.000    0.000    0.001    0.000 session.py:507(msg)
        2    0.000    0.000    0.000    0.000 session.py:526(sign)
        2    0.000    0.000    0.003    0.001 session.py:541(serialize)
        2    0.000    0.000    0.004    0.002 session.py:600(send)
        8    0.000    0.000    0.002    0.000 session.py:94(<lambda>)
        2    0.000    0.000    0.001    0.000 socket.py:289(send_multipart)
        2    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
        2    0.000    0.000    0.000    0.000 threading.py:983(ident)
       26    0.000    0.000    0.000    0.000 traitlets.py:420(__get__)
       38    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
        2    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        2    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        2    0.000    0.000    0.000    0.000 uuid.py:546(uuid4)


In [270]:
# document distance version 7
# no sorting; compute the dot product directly on the dictionaries
In [271]:
def count_frequency_7(word_list):
    '''
    Return a dictionary mapping words to frequency.
    '''
    
    D = {}
    
    for new_word in word_list:
        if new_word in D:
            D[new_word] += 1
        else:
            D[new_word] = 1
        
    return D
In [272]:
def word_frequencies_for_file_7(filename):
    """
    Return a dictionary mapping words to frequencies
    for the given file (no sorting is needed).
    """

    line_list = read_file(filename)
    word_list = get_words_from_line_list_5(line_list)
    freq_mapping = count_frequency_7(word_list)

    print "File",filename,":",
    print len(line_list),"lines,",
    print len(word_list),"words,",
    print len(freq_mapping),"distinct words"

    return freq_mapping
In [273]:
def inner_product_7(D1, D2):
    '''
    Inner product between two vectors, where vectors are
    represented as dictionaries mapping words to frequencies.
    
    Example: inner_product_7({"and":3,"of":2,"the":5},
                           {"and":4,"in":1,"of":1,"this":2}) = 14.0 
    '''
    
    sum = 0.0
    
    for key in D1:
        if key in D2:
            sum += D1[key] * D2[key]
    
    return sum
In [274]:
# test inner_product_7
inner_product_7({"and":3,"of":2,"the":5},
                           {"and":4,"in":1,"of":1,"this":2})
Out[274]:
14.0
In [275]:
def vector_angle_7(L1, L2):
    '''
    The inputs are dictionaries mapping words to frequencies.
    
    Return the angle between these two vectors.
    '''
    
    numerator = inner_product_7(L1, L2)
    denominator = math.sqrt(inner_product_7(L1, L1) * inner_product_7(L2, L2))
    return math.acos(numerator / denominator)
In [276]:
# document distance version 7 test
def test_docdist_7():
    filename_1 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
    filename_2 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt"

    sorted_word_list_1 = word_frequencies_for_file_7(filename_1)
    sorted_word_list_2 = word_frequencies_for_file_7(filename_2)

    distance = vector_angle_7(sorted_word_list_1,sorted_word_list_2)

    print "The distance between the documents is: %0.6f (radians)" % distance
    
#test_docdist_7()
In [277]:
profile.run("test_docdist_7()")
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
         31514 function calls in 0.280 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
        3    0.000    0.000    0.000    0.000 :0(append)
        1    0.000    0.000    0.000    0.000 :0(close)
        4    0.000    0.000    0.000    0.000 :0(copy)
        1    0.000    0.000    0.000    0.000 :0(count)
       38    0.000    0.000    0.001    0.000 :0(decode)
        1    0.000    0.000    0.000    0.000 :0(digest)
        3    0.000    0.000    0.000    0.000 :0(encode)
     7726    0.030    0.000    0.030    0.000 :0(extend)
        1    0.000    0.000    0.000    0.000 :0(get)
        1    0.000    0.000    0.000    0.000 :0(get_ident)
        1    0.000    0.000    0.000    0.000 :0(getattr)
       41    0.000    0.000    0.000    0.000 :0(getpid)
        1    0.000    0.000    0.000    0.000 :0(getvalue)
        1    0.000    0.000    0.000    0.000 :0(group)
        1    0.000    0.000    0.000    0.000 :0(hasattr)
        1    0.000    0.000    0.000    0.000 :0(hexdigest)
       66    0.000    0.000    0.000    0.000 :0(isinstance)
        1    0.000    0.000    0.000    0.000 :0(isoformat)
        4    0.000    0.000    0.000    0.000 :0(join)
       14    0.000    0.000    0.000    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(locals)
        1    0.000    0.000    0.000    0.000 :0(map)
        1    0.000    0.000    0.000    0.000 :0(max)
        1    0.000    0.000    0.000    0.000 :0(now)
        2    0.000    0.000    0.000    0.000 :0(open)
        1    0.000    0.000    0.000    0.000 :0(range)
        2    0.001    0.000    0.001    0.000 :0(readlines)
        7    0.000    0.000    0.000    0.000 :0(send)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     7724    0.036    0.000    0.036    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
       26    0.000    0.000    0.000    0.000 :0(sub)
       38    0.000    0.000    0.000    0.000 :0(time)
     7724    0.031    0.000    0.031    0.000 :0(translate)
        5    0.000    0.000    0.000    0.000 :0(update)
        1    0.000    0.000    0.000    0.000 :0(upper)
       38    0.000    0.000    0.000    0.000 :0(utf_8_decode)
       38    0.000    0.000    0.000    0.000 :0(write)
        1    0.000    0.000    0.000    0.000 :0(zmq_poll)
        2    0.000    0.000    0.001    0.000 <ipython-input-233-02a2fcdaef67>:4(read_file)
        2    0.066    0.033    0.255    0.127 <ipython-input-258-139f245f7c34>:1(get_words_from_line_list_5)
     7724    0.092    0.000    0.159    0.000 <ipython-input-258-139f245f7c34>:15(get_words_from_string_5)
        2    0.015    0.007    0.015    0.007 <ipython-input-271-aaa5b7b7bbbe>:1(count_frequency_7)
        2    0.001    0.000    0.276    0.138 <ipython-input-272-5a73205effc0>:1(word_frequencies_for_file_7)
        3    0.002    0.001    0.002    0.001 <ipython-input-273-d3884678b0f2>:1(inner_product_7)
        1    0.000    0.000    0.002    0.002 <ipython-input-275-90f1605c227b>:1(vector_angle_7)
        1    0.001    0.001    0.280    0.280 <ipython-input-276-19c564a5021b>:2(test_docdist_7)
        1    0.000    0.000    0.280    0.280 <string>:1(<module>)
        4    0.000    0.000    0.001    0.000 __init__.py:193(dumps)
        1    0.000    0.000    0.000    0.000 __init__.py:52(create_string_buffer)
        1    0.000    0.000    0.000    0.000 attrsettr.py:35(__getattr__)
        4    0.000    0.000    0.000    0.000 encoder.py:101(__init__)
        4    0.000    0.000    0.001    0.000 encoder.py:186(encode)
        4    0.000    0.000    0.001    0.000 encoder.py:212(iterencode)
       26    0.000    0.000    0.000    0.000 encoder.py:33(encode_basestring)
        1    0.000    0.000    0.000    0.000 encoder.py:37(replace)
        1    0.000    0.000    0.000    0.000 hmac.py:100(_current)
        1    0.000    0.000    0.000    0.000 hmac.py:119(hexdigest)
        1    0.000    0.000    0.000    0.000 hmac.py:30(__init__)
        4    0.000    0.000    0.000    0.000 hmac.py:83(update)
        1    0.000    0.000    0.000    0.000 hmac.py:88(copy)
       39    0.000    0.000    0.001    0.000 iostream.py:102(_check_mp_mode)
        1    0.000    0.000    0.000    0.000 iostream.py:123(_flush_from_subprocesses)
        1    0.000    0.000    0.002    0.002 iostream.py:151(flush)
       38    0.001    0.000    0.006    0.000 iostream.py:207(write)
        1    0.000    0.000    0.000    0.000 iostream.py:238(_flush_buffer)
        1    0.000    0.000    0.000    0.000 iostream.py:247(_new_buffer)
       40    0.000    0.000    0.000    0.000 iostream.py:93(_is_master_process)
        1    0.000    0.000    0.000    0.000 iostream.py:96(_is_master_thread)
        4    0.000    0.000    0.001    0.000 jsonapi.py:31(dumps)
        1    0.000    0.000    0.000    0.000 jsonutil.py:75(date_default)
        1    0.000    0.000    0.000    0.000 poll.py:77(poll)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.280    0.280 profile:0(test_docdist_7())
        1    0.000    0.000    0.000    0.000 py3compat.py:12(no_code)
        1    0.000    0.000    0.000    0.000 session.py:206(msg_header)
        1    0.000    0.000    0.000    0.000 session.py:211(extract_header)
        1    0.000    0.000    0.000    0.000 session.py:452(msg_id)
        1    0.000    0.000    0.000    0.000 session.py:504(msg_header)
        1    0.000    0.000    0.000    0.000 session.py:507(msg)
        1    0.000    0.000    0.000    0.000 session.py:526(sign)
        1    0.000    0.000    0.001    0.001 session.py:541(serialize)
        1    0.000    0.000    0.002    0.002 session.py:600(send)
        4    0.000    0.000    0.001    0.000 session.py:94(<lambda>)
        1    0.000    0.000    0.000    0.000 socket.py:289(send_multipart)
        1    0.000    0.000    0.000    0.000 threading.py:1152(currentThread)
        1    0.000    0.000    0.000    0.000 threading.py:983(ident)
       13    0.000    0.000    0.000    0.000 traitlets.py:420(__get__)
       38    0.000    0.000    0.001    0.000 utf_8.py:15(decode)
        1    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        1    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        1    0.000    0.000    0.000    0.000 uuid.py:546(uuid4)


In [278]:
# document distance version 8
# Split the whole document into words at once instead of line by line:
# the version-7 profile above is dominated by the 7,724 per-line calls
# to split, translate and get_words_from_string_5.
In [279]:
def read_file_8(filename):
    """ 
    Read the text file with the given filename;
    return a list of the lines of text in the file.
    """
    try:
        f = open(filename, 'r')
        return f.read()
    except IOError:
        print "Error opening or reading input file: ",filename
        sys.exit()
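A minor variant (a sketch, not part of the original notebook): the version above never closes the file handle, so a with-statement form would release it deterministically.

# hypothetical alternative to read_file_8 that closes the file explicitly
def read_file_8_with(filename):
    """
    Read the whole file and return its contents as a single string,
    closing the file handle before returning.
    """
    try:
        with open(filename, 'r') as f:
            return f.read()
    except IOError:
        print "Error opening or reading input file: ", filename
        sys.exit()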
In [280]:
def get_words_from_line_list_8(text):
    '''
    Parse the given text into words.
    Return list of all words found.
    '''
    
    # translation_table is built in an earlier cell of the notebook;
    # a sketch of its likely construction follows this cell
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list
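get_words_from_line_list_8 depends on translation_table, which is defined in an earlier cell. A minimal sketch of how such a table is typically built in Python 2 (an assumption about that cell: punctuation maps to spaces, uppercase maps to lowercase):

import string

# hypothetical reconstruction of the earlier translation_table cell:
# punctuation becomes spaces and uppercase becomes lowercase, so a single
# str.translate() call normalizes the whole document before splitting
translation_table = string.maketrans(string.punctuation + string.uppercase,
                                     " " * len(string.punctuation) + string.lowercase)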
In [281]:
def word_frequencies_for_file_8(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """

    line_list = read_file_8(filename)
    word_list = get_words_from_line_list_8(line_list)
    freq_mapping = count_frequency_7(word_list)

    print "File",filename,":",
    print len(line_list),"lines,",
    print len(word_list),"words,",
    print len(freq_mapping),"distinct words"

    return freq_mapping
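Because read_file_8 now returns the whole file as one string, len(line_list) above counts characters rather than lines, which is why the profiled output below reports 53710 and 268778 "lines". A hypothetical corrected variant would count newline characters instead:

# hypothetical variant that reports an actual line count
def word_frequencies_for_file_8b(filename):
    """
    Return a dictionary mapping each word in the given file to its
    frequency of occurrence; reports a true line count.
    """
    text = read_file_8(filename)
    word_list = get_words_from_line_list_8(text)
    freq_mapping = count_frequency_7(word_list)

    print "File", filename, ":",
    print text.count('\n'), "lines,",
    print len(word_list), "words,",
    print len(freq_mapping), "distinct words"

    return freq_mapping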
In [282]:
# document distance version 8 test
def test_docdist_8():
    filename_1 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt"
    filename_2 = "/home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt"

    sorted_word_list_1 = word_frequencies_for_file_8(filename_1)
    sorted_word_list_2 = word_frequencies_for_file_8(filename_2)

    distance = vector_angle_7(sorted_word_list_1,sorted_word_list_2)

    print "The distance between the documents is: %0.6f (radians)" % distance
    
#test_docdist_8()
In [283]:
profile.run("test_docdist_8()")
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t1.verne.txt : 53710 lines, 8943 words, 2150 distinct words
File /home/will/myspace/mydev/mytest/sparktest/pyspark/6006/unit1/lec01_data/t2.bobsey.txt : 268778 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)
         412 function calls in 0.033 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
       38    0.000    0.000    0.001    0.000 :0(decode)
       38    0.000    0.000    0.000    0.000 :0(getpid)
       38    0.000    0.000    0.000    0.000 :0(isinstance)
        6    0.000    0.000    0.000    0.000 :0(len)
        2    0.000    0.000    0.000    0.000 :0(open)
        2    0.001    0.000    0.001    0.000 :0(read)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        2    0.004    0.002    0.004    0.002 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
       38    0.000    0.000    0.000    0.000 :0(time)
        2    0.001    0.000    0.001    0.000 :0(translate)
       38    0.000    0.000    0.000    0.000 :0(utf_8_decode)
       38    0.000    0.000    0.000    0.000 :0(write)
        2    0.018    0.009    0.018    0.009 <ipython-input-271-aaa5b7b7bbbe>:1(count_frequency_7)
        3    0.002    0.001    0.002    0.001 <ipython-input-273-d3884678b0f2>:1(inner_product_7)
        1    0.000    0.000    0.002    0.002 <ipython-input-275-90f1605c227b>:1(vector_angle_7)
        2    0.000    0.000    0.001    0.001 <ipython-input-279-3764f4237475>:1(read_file_8)
        2    0.000    0.000    0.005    0.002 <ipython-input-280-a37f5ea00197>:1(get_words_from_line_list_8)
        2    0.001    0.000    0.028    0.014 <ipython-input-281-7265f9cada06>:1(word_frequencies_for_file_8)
        1    0.002    0.002    0.033    0.033 <ipython-input-282-d9a6af96d96d>:2(test_docdist_8)
        1    0.000    0.000    0.033    0.033 <string>:1(<module>)
       38    0.000    0.000    0.001    0.000 iostream.py:102(_check_mp_mode)
       38    0.001    0.000    0.004    0.000 iostream.py:207(write)
       38    0.000    0.000    0.001    0.000 iostream.py:93(_is_master_process)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.033    0.033 profile:0(test_docdist_8())
       38    0.000    0.000    0.001    0.000 utf_8.py:15(decode)

