


# Contents of text.txt
1.In her mind she followed the white BUICK  along the road somewhere between here and the Niagara River.
2.There were some sweet machines other than women, an old BUGATTI , a lean Farina coachwork on an American chassis, a Swallow, a type 540- K Mercedes and lots more.
3.Only three standard models- BUICK , Chrysler, and Mercury- had slight year-to-year gains in March sales in the county.
4.The white BUICK  hadn't moved away yet.
5. The simple mechanical strain of overweight, says New York's Dr Norman Jolliffe, can overburden and damage the heart "for much the same reason that a Chevrolet engine in a CADILLAC  body would wear out sooner than if it were in a body for which it was built".
6. It is well to bear in mind that gasoline will cost from 80 to 90 for the equivalent of a United States gallon and while you might prefer a familiar Ford, Chevrolet or even a CADILLAC , which are available in some countries, it is probably wiser to choose the smaller European makes which average thirty, thirty-five and even forty miles to the gallon.
7. Your chauffeur's expenses will average between $7.00 to $12.00 a day, but this charge is the same whether you rent a 7-passenger CADILLAC  limousine or a 4-passenger Peugeot or Fiat 1800.
8.Of course, if you want to throw all caution to the winds and rent an Imperial or CADILLAC  limousine just for you and your bride, you'll have a memorable tour, but it won't be cheap, and it is not recommended unless you own a producing oil well or you've had a winner in the Irish Sweepstakes.
9.They answered him in monosyllables, nods, occasionally muttering in Greek to one another, awaiting the word from Papa, who restlessly cracked his knuckles, anxious to stuff himself into his white CADILLAC  and burst off to the freeway.
10. It was a CADILLAC , black grayed with the dust of the road, its windows closed tight so you knew that the people who climbed out of it would be cool and unwrinkled.
11. Almost immediately Howard and his daughter Debora drove up in the CADILLAC .
12. There was really no reason to refuse, and Linda Kay had never ridden in a CADILLAC .
13. Rates for American cars are somewhat higher, ranging from about $8.00 a day up to $14.00 a day for a CHEVROLET  Convertible, but the rate per kilometer driven is roughly the same as for the larger European models.
14.Friends, a picture magazine distributed by CHEVROLET  dealers, describes a paramilitary organization of employees of the Gulf Telephone Company at Foley, Alabama.
15.He had a perfectly good Audi when he moved here last year.
16.We're getting an Audi.
17. The string was walking round in a circle at the end of the gallops when Bill's Audi drew up.
18. She parked the hired Audi Coupe in front of the wire fence .
19. Sabrina zipped up her anorak as she stepped out into the cold night air and rummaged in her pockets for the keys to the Audi Coupe.
20. Ellwood drove an Audi -- fast but not flashy.
21.They parked the Audi where the guardhouse had once stood, on a small patch of concreted ground to the side of the road.
22.Adam grabbed Billie and hid her behind the Audi, glad that he'd chosen a four-wheel-drive Quattro.
23.He switched the engine on and swung the Audi out of the car-park, down Yorckstrasse towards the outskirts of the city.
24.With his other arm he wrenched the wheel to the right, forced the Audi on to the pavement and against the wall.
25.With the flames engulfing the roof of the Audi, Adam lay across the two front seats, aimed the machine-gun and shot the bomber dead.
26.Adam crawled out of the Audi, grabbed Billie and ran with her before the petrol tanks exploded.
27.As they dragged her away from the flaming Audi, she had turned and seen Adam lying on the road, shielding himself.
28.The Audi slammed into the side of the Volvo and Donna had to use all her strength to keep control of the car.
29.Minutes later cannabis worth 234,000 was found hidden in Melms's Audi at Newhaven, Sussex.
30.Gus halted the Aston Martin at the doorway instead of driving straight on to the garage, and was out of the driving-seat like a greyhound out of a trap, to dart round to the passenger side and hand Charlotte out.

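TF: term frequency, i.e. how often a word occurs in a text.
IDF: inverse document frequency, which is derived from the number of documents in which the word appears: the more documents contain the word, the smaller its IDF.
Let k be the importance of a word in a text. Then k = (number of times the word occurs in the text) * log(total number of documents / number of documents containing the word). A word that appears in almost every document, such as "the", "I", or "are", is clearly not a keyword, and for such a word log(total number of documents / number of documents containing the word) is small. TF-IDF therefore tends to filter out common words and keep the important ones.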
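As a small illustration of the formula for k, here is a minimal sketch on a made-up three-document corpus. The function name `tf_idf_score` and the toy corpus are assumptions introduced for this example only; they are not part of the program in this post.

```python
import math

def tf_idf_score(word, doc, corpus):
    """k = (occurrences of word in doc) * log(total docs / docs containing word)."""
    tf = doc.count(word)                       # term frequency in this document
    df = sum(1 for d in corpus if word in d)   # number of documents containing the word
    if df == 0:
        return 0.0                             # word never appears anywhere
    return tf * math.log(len(corpus) / df)

# Hypothetical toy corpus: each document is a list of words.
corpus = [
    ["the", "white", "buick"],
    ["the", "audi", "coupe"],
    ["the", "audi", "quattro"],
]

print(tf_idf_score("the", corpus[0], corpus))    # 0.0, because log(3/3) = 0
print(tf_idf_score("buick", corpus[0], corpus))  # log(3/1), the largest score in this document
```

The word "the" occurs in every document, so its score collapses to zero, while "buick" occurs in only one document and gets the highest score.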

from collections import Counter
import math
import numpy as np
class tfIdf():
    def __init__(self, path, topK):
        self.path = path
        self.topK = topK

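    def ReadTxtFile(self):
        """Read the txt file at self.path (an absolute path to the txt file).

        Returns a list of sentences (one per line) and a 2-D list holding
        the words of each line.
        """
        dataByLine = []
        dataByWord = []
        with open(self.path, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f.readlines():
                # if(len(line.split('\t')[0]) <= self.topK ):
                #     continue
                dataByLine.append(line.split('\n')[0])
                # Note: [:-1] drops the last space-separated token of each line.
                dataByWord.append(line.split('\n')[0].split(' ')[:-1])
        return dataByLine, dataByWord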

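    def freq(self):
        """Return, for each text, how many times each of its words occurs in that text."""
        _, dataByWord = self.ReadTxtFile()
        freqMat = []
        for line in dataByWord:
            _freqList = []
            temp = Counter(line)
            for word in line:
                # Frequency of this word within its own line.
                _freqList.append(temp[word])
            freqMat = freqMat+[_freqList]
        return freqMat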

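    # The next two methods compute the inverse document frequency of every
    # word in every text:
    #     idf = log("total number of documents" / "number of documents containing the word")
    def _wordCount(self, Word):
        """Count the documents (lines) that contain Word."""
        _, dataByWord = self.ReadTxtFile()
        count = 0
        for line in dataByWord:
            for word in line:
                if word == Word:
                    count = count+1
                    break  # count each document at most once
        return count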

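    def IDF(self):
        """Return, for each text, the inverse document frequency of each of its words."""
        _, dataByWord = self.ReadTxtFile()
        idfMat = []
        for line in dataByWord:
            _idfList = []
            for word in line:
                # idf = log(total number of documents / documents containing the word)
                _idfList.append(math.log(len(dataByWord)/self._wordCount(word)))
            idfMat = idfMat+[_idfList]
        return idfMat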

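    def getKeyWord(self):
        """Use TF-IDF to pick the topK keywords of each text.

        Returns a 2-D list of shape (number of texts) x topK.
        """
        _, dataByWord = self.ReadTxtFile()
        freqMat = self.freq()
        idfMat  = self.IDF()
        keyWord = []
        for i in range(len(freqMat)):
            _keyWordList = []
            for j in range(len(freqMat[i])):
                # TF-IDF score of the j-th word of the i-th text.
                _keyWordList.append(freqMat[i][j]*idfMat[i][j])
            index = np.argsort(_keyWordList)  # ascending order of score
            _keyWord = []
            for item in index:
                _keyWord.append(dataByWord[i][item])
            _keyWord = _keyWord[::-1]  # highest score first
            _keyWord = _keyWord[:self.topK]
            keyWord = keyWord+[_keyWord]
        return keyWord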

if __name__ == '__main__':
    fileName = r'data/text.txt'
    ssh = tfIdf(fileName, topK=6)
    idfMat = ssh.getKeyWord()
    for line in idfMat:
        print(line)

keyWordA=["I", "love", "you", "deeply"]
keyWordB=["I", "don't", "love", "you"]
Then bag=["I", "love", "you", "deeply", "don't"]
A=[1, 1, 1, 1, 0]
B=[1, 1, 1, 0, 1]
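To make the example concrete, the following minimal sketch computes the cosine of the angle between the vectors A and B given above. The helper name `cosine_similarity` is an assumption introduced for illustration. Note that the `dis` method in this post additionally adds 0.01 to the denominator, so its result is slightly smaller than the value computed here.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)

A = [1, 1, 1, 1, 0]
B = [1, 1, 1, 0, 1]
print(cosine_similarity(A, B))  # 3 / (2 * 2) = 0.75
```

The two keyword lists share three of their four words, which is why the value is close to 1.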

from myTFIDF import tfIdf   # import the TF-IDF program we just wrote
from collections import Counter
import numpy as np
import warnings
warnings.filterwarnings("ignore") # suppress warnings

class cosDistance():
    def __init__(self, path):
        self.path = path

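    def dis(self, keyWordA, keyWordB):
        """Compute the cosine distance between two tweets (computed here as cosine similarity).

        Input: the keyword vectors of the two tweets.
        """
        A = []
        B = []
        bag = set(keyWordA+keyWordB)
        counterA = Counter(keyWordA)
        counterB = Counter(keyWordB)
        for word in bag:
            if (counterA[word]):
                A.append(1)
            else:
                A.append(0)
        for word in bag:
            if (counterB[word]):
                B.append(1)
            else:
                B.append(0)
        # The 0.01 in the denominator avoids division by zero.
        cosDis = np.dot(np.array(A), np.array(B))/(np.sum(np.array(A)**2)**0.5 * np.sum(np.array(B)**2)**0.5+0.01)
        return cosDis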








