FM Algorithm: Big Data Lab 3

Lab 3: The FM algorithm

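1: Use the file stream_for_fm.txt as the input to your program and read the data from it (values range from 1 to 2^25);
2: Write an exact algorithm that counts how many distinct elements (number of unique elements) the whole file stream_for_fm.txt contains; [this can be done with a sorted linked list; there are 106862 distinct elements];
3: Use the hash function h(x) = a*x + b, where a and b are two integers chosen at random from 1 to 2^25+1 and x is a number from stream_for_fm.txt. For an element x, let r be the number of trailing 0s in the binary form of h(x), its tail length. After the whole file has been processed, compute R, the largest r (that is, the maximum tail length over the hash values of all elements). Output 2^R as the estimate of the number of distinct elements;

4: Use the combining-estimates technique from page 110 of the textbook to estimate the number of elements. Suppose there are m groups, each containing L hash functions, for m*L hash functions in total. For each hash function in each group, compute its R value after the file has been processed and call it R[i,j]. The average for group i is then R_average[i] = (R[i,1] + R[i,2] + … + R[i,L]) / L. Once all groups are done, we have the array R_average[1], R_average[2], …, R_average[m]; sort it, take the median R_median, and use 2^R_median as the final estimate of the number of elements.
5: Let the true number of elements be N. With m=1 and L=1, repeat step 4 20 times to obtain 20 estimates N1 = 2^R_median[1], …, N20 = 2^R_median[20], then compute the average error error_sum = {[(N1-N)^2 + (N2-N)^2 + (N3-N)^2 + … + (N20-N)^2] / N}^0.5;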
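As a small worked illustration of step 3, the sketch below hashes a short in-memory stream with fixed, hand-picked parameters and prints the tail length of each hash value. The values a = 3 and b = 1, the sample stream and the helper name tail_length are assumptions for illustration, not part of the assignment.

```python
def tail_length(h):
    """Number of trailing 0 bits in h (0 is treated as having tail length 0)."""
    if h == 0:
        return 0
    r = 0
    while h % 2 == 0:
        h //= 2
        r += 1
    return r

a, b = 3, 1                    # illustrative parameters, not drawn at random
stream = [1, 2, 3, 3, 7, 9]    # 5 distinct values
R = 0
for x in stream:
    h = a * x + b
    r = tail_length(h)
    R = max(R, r)
    # show the element, its hash value, the binary form and the tail length
    print(x, h, format(h, "b"), r)
print("R =", R, "estimate =", 2 ** R)
```

Here the largest tail length is 2, so the estimate is 4 against a true count of 5. A single hash function can only ever return a power of two, which is why step 4 combines many of them.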

Exactly counting the distinct elements in the current data stream

def Count_accurate():
    # A set keeps a single copy of each value, so its size is the exact distinct count.
    data_flu = set()
    with open("stream_for_fm.txt", 'r') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                data_flu.add(int(line))
    return len(data_flu)

print(Count_accurate())
106862
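The task statement suggests a sorted linked list for the exact count, whereas the code above uses a set. For comparison, here is a minimal sketch of the sorted-structure idea, using a plain Python list kept in order with the standard bisect module as a stand-in for a linked list. The function name count_distinct_sorted and the in-memory sample are assumptions for illustration.

```python
import bisect

def count_distinct_sorted(values):
    """Count distinct values by inserting each new value into a sorted list."""
    ordered = []
    for v in values:
        pos = bisect.bisect_left(ordered, v)
        # v is already present only if the slot at pos holds an equal value
        if pos == len(ordered) or ordered[pos] != v:
            ordered.insert(pos, v)
    return len(ordered)

sample = [7, 3, 7, 1, 3, 9]
print(count_distinct_sorted(sample))  # 4
```

Inserting into the middle of a list costs time proportional to its length, so for a file with over a hundred thousand distinct values the set-based version is likely the faster choice.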
import random
def count_tail_0(c):
    # Number of trailing 0 bits in the binary form of c (0 is treated as tail length 0).
    count = 0
    if c == 0:
        return 0
    while c % 2 == 0:
        count += 1
        c //= 2
    return count
print(count_tail_0(32692592705143))
0
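The loop in count_tail_0 divides by 2 once per trailing zero. An equivalent way to get the same number, sketched here as an alternative rather than as the author's method, isolates the lowest set bit with c & -c and reads off its position with int.bit_length. The name count_tail_0_bits is hypothetical.

```python
def count_tail_0_bits(c):
    """Trailing-zero count of a non-negative int via its lowest set bit."""
    if c == 0:
        return 0  # same convention as count_tail_0
    lowest = c & -c  # keeps only the lowest 1 bit, e.g. 0b10100 -> 0b100
    return lowest.bit_length() - 1

print(count_tail_0_bits(32692592705143))  # 0, the value is odd
print(count_tail_0_bits(0b10100))         # 2
print(count_tail_0_bits(2 ** 25))         # 25
```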
def Count_estimate():
    # Single-hash FM estimate: 2**R, where R is the largest tail length seen.
    max_R = 0
    a, b = random_hash_parameter()
    print(a, b)
    with open("stream_for_fm.txt", 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            c = a * int(line) + b
            # only the running maximum is needed, so no list of tail lengths is kept
            max_R = max(max_R, count_tail_0(c))
    print("max_r is", max_R)
    return 2**max_R

import random
def random_hash_parameter():
    # Draw a and b uniformly from the integers 1 .. 2**25+1.
    return random.randint(1,2**25+1),random.randint(1,2**25+1)
a,b = random_hash_parameter()
print(a,b)
15734241 4938549
print("Exact number of distinct elements:",Count_accurate())
print("FM estimate of the number of distinct elements:",Count_estimate())
Exact number of distinct elements: 106862
9538267 13344983
max_r is 15
FM estimate of the number of distinct elements: 32768
def random_hansh_parameter(m,l):
    # Draw m*l independent [a, b] pairs, one per hash function.
    final = []
    for i in range(m*l):
        a = random.randint(1,2**25+1)
        b = random.randint(1,2**25+1)
        final.append([a, b])
    return final
# print(random_hansh_parameter(2,2))
def transpose(matrix):
    # Swap the rows and columns of a list-of-lists matrix.
    new_matrix = []
    for i in range(len(matrix[0])):
        matrix1 = []
        for j in range(len(matrix)):
            matrix1.append(matrix[j][i])
        new_matrix.append(matrix1)
    return new_matrix
def RV(m,l):
    R_avarage = []  # final result: m groups, each holding l R values
    count = 0
    all_hansh_parameter = random_hansh_parameter(m,l)  # randomly generate m*l hash parameter pairs
    print(all_hansh_parameter)
    all_hansh_R = []  # tail lengths under every hash function: one row of m*l values per input line
    with open("stream_for_fm.txt",'r') as f:
        while True:
            temp = f.readline()
            if temp == '':
                break
            temp = int(temp.strip())
            data_flu_R = []  # tail lengths of this element under the m*l hash functions
            for i in all_hansh_parameter:
                c = i[0]*temp + i[1]
                data_flu_R.append(count_tail_0(c))
            all_hansh_R.append(data_flu_R)
        all_hansh_R = transpose(all_hansh_R)  # now one row per hash function
        line_max_temp = []
        for i in all_hansh_R:
            line_max = max(i)  # R value of this hash function
            line_max_temp.append(line_max)
            count += 1
            if count % l == 0:  # every l consecutive hash functions form one group
                R_avarage.append(line_max_temp)
                line_max_temp = []
    return R_avarage
# print(RV(2,4))
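RV keeps one row of tail lengths per input line and transposes the whole table at the end, so its memory use grows with the length of the stream. Since only the maximum per hash function is needed, a running maximum is enough. The sketch below illustrates that idea on an in-memory list. The function name grouped_max_tails, its params argument and the sample data are assumptions for illustration, and it does not read stream_for_fm.txt.

```python
def tail_zeros(c):
    """Trailing-zero count, with 0 treated as tail length 0."""
    return (c & -c).bit_length() - 1 if c else 0

def grouped_max_tails(stream, params, l):
    """Return the per-hash maxima R, split into groups of l hash functions."""
    maxima = [0] * len(params)
    for x in stream:
        for k, (a, b) in enumerate(params):
            r = tail_zeros(a * x + b)
            if r > maxima[k]:
                maxima[k] = r
    # consecutive runs of l hash functions form one group
    return [maxima[i:i + l] for i in range(0, len(maxima), l)]

params = [(3, 1), (5, 3), (7, 1), (1, 0)]  # hand-picked (a, b) pairs
stream = [1, 2, 3, 3, 7, 9]
print(grouped_max_tails(stream, params, l=2))  # [[2, 4], [6, 1]]
```

The result has the same shape as the value returned by RV(m,l): m lists of l maxima each. Memory use depends only on the number of hash functions.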

def Count_estimate(m,l):
    rv = RV(m,l)
    print(rv)
    R_average = []
    for i in rv:
        R_average.append(int(sum(i)/l))  # group average of R, truncated to an integer
    R_average.sort()
    R_median = R_average[len(R_average)//2]  # middle element (the upper of the two when m is even)
    return 2**R_median

print("The estimate is",Count_estimate(2,2))
[[14773612, 4461589], [4217790, 9585159], [26052293, 23943193], [15372314, 29157118]]
[[0, 0], [15, 17]]
The estimate is 65536
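To make the combining step easier to follow in isolation, the sketch below applies the same rule as Count_estimate(m,l) (truncated group average, then the middle element of the sorted averages) to a matrix of R values passed in directly. The function name combine_r_values is hypothetical. The first input is the R matrix printed in the run above, and the second is the first R matrix from the (4,4) log further down.

```python
def combine_r_values(rv):
    """Median of truncated group averages of R, returned as 2**R_median."""
    averages = sorted(int(sum(group) / len(group)) for group in rv)
    r_median = averages[len(averages) // 2]  # upper middle when the count is even
    return 2 ** r_median

# group averages 0 and 16 -> upper middle is 16
print(combine_r_values([[0, 0], [15, 17]]))  # 65536
# group averages 18, 4, 9, 4 -> sorted [4, 4, 9, 18] -> 9
print(combine_r_values([[25, 17, 18, 15], [1, 2, 0, 15],
                        [0, 18, 1, 18], [0, 0, 0, 16]]))  # 512
```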
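def error_sum(m,l,t):
    # Root-mean-square error of t independent estimates against the exact count
    # (the squared errors are divided by the number of runs t).
    total = 0  # avoids shadowing the built-in sum
    real_variety = Count_accurate()
    for i in range(t):
        total += (Count_estimate(m,l)-real_variety)**2
    return (total/t)**0.5

print("Average error of (1,1) over 20 runs:",error_sum(1,1,20))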
[[23145178, 14052597]]
[[0]]
[[23788264, 25556650]]
[[1]]
[[11068063, 17640594]]
[[21]]
[[15246022, 33464365]]
[[0]]
[[22006273, 30577114]]
[[18]]
[[6856435, 27989501]]
[[17]]
[[33369251, 20816316]]
[[21]]
[[16824243, 21928881]]
[[17]]
[[7760937, 2810024]]
[[15]]
[[3826403, 19592108]]
[[19]]
[[25420979, 30142290]]
[[15]]
[[21987572, 24782209]]
[[0]]
[[9466184, 23118225]]
[[0]]
[[10303643, 25290342]]
[[16]]
[[20391863, 26914874]]
[[17]]
[[8975542, 201042]]
[[15]]
[[28090245, 18800900]]
[[16]]
[[15408801, 13781627]]
[[17]]
[[6366227, 3552044]]
[[15]]
[[29047138, 3569373]]
[[0]]
Average error of (1,1) over 20 runs: 640979.7506825391
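The reported figure can be checked by hand from the log above. The sketch below takes the 20 R values printed for the (1,1) runs, turns each into an estimate 2**R and computes the root-mean-square error against N = 106862, dividing by the number of runs as error_sum does. The function name rms_error is hypothetical.

```python
def rms_error(estimates, true_count):
    """Root-mean-square error of the estimates against true_count."""
    total = sum((n - true_count) ** 2 for n in estimates)
    return (total / len(estimates)) ** 0.5

# R values copied from the 20 logged (1,1) runs, in order
logged_r = [0, 1, 21, 0, 18, 17, 21, 17, 15, 19,
            15, 0, 0, 16, 17, 15, 16, 17, 15, 0]
estimates = [2 ** r for r in logged_r]
print(rms_error(estimates, 106862))  # about 640979.75, matching the log
```

The two runs with R = 21 (an estimate of 2097152) account for most of the squared error, which shows how sensitive a single-hash estimate is to one unusually long tail.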
print("Average error of (4,4) over 20 runs:",error_sum(4,4,20))
[[17141872, 33238544], [4662871, 19786648], [9198783, 16873846], [20780971, 860552], [7116752, 25265878], [19836096, 25430460], [29115932, 17155707], [2692069, 19510896], [2855284, 14645167], [15332577, 6359770], [27742084, 11633482], [15549425, 21365485], [3629996, 1939163], [13660666, 33331867], [20863532, 17361899], [1747205, 2594966]]
[[25, 17, 18, 15], [1, 2, 0, 15], [0, 18, 1, 18], [0, 0, 0, 16]]
[[23673051, 12572541], [22949855, 21957684], [4957183, 12622501], [17163348, 17988713], [21892248, 7929545], [1273734, 16252747], [30041369, 3350919], [3939779, 5842146], [28080737, 943082], [26568683, 30859508], [20880264, 31506054], [16684946, 7330177], [2656894, 28317350], [30736658, 25112776], [5634726, 9677147], [32363092, 23614237]]
[[17, 19, 15, 0], [0, 0, 16, 16], [21, 17, 1, 0], [22, 17, 0, 0]]
[[25284409, 20335705], [12588476, 17929103], [20084836, 14219391], [3173346, 32849907], [32687352, 1397817], [26670765, 14788980], [9746638, 20050572], [32286355, 18851464], [20423779, 15644881], [22065012, 11650886], [26660361, 26190355], [22984764, 11596545], [18117682, 29216808], [18854953, 3648617], [27676211, 17427131], [22292293, 23193030]]
[[19, 0, 0, 0], [0, 17, 19, 18], [16, 1, 19, 0], [17, 23, 18, 16]]
[[15798364, 7748874], [31709117, 32920912], [13637792, 28034388], [8207017, 13605618], [2163633, 25649806], [242192, 20958410], [4126872, 16250532], [4857021, 169880], [29738935, 11589087], [22136291, 6664153], [22622038, 17132310], [7559459, 21151024], [18300674, 25925426], [31725736, 6261021], [17633564, 5487964], [6321953, 25534592]]
[[1, 15, 2, 17], [16, 1, 2, 19], [17, 15, 17, 16], [17, 0, 19, 16]]
[[31719709, 19258170], [29802272, 22954610], [5446975, 3713021], [11300467, 9227679], [8630636, 21395794], [12935881, 30934411], [23570476, 23129826], [9653876, 17619755], [27085172, 2828284], [3101876, 16727505], [9931446, 8404236], [30810698, 29825358], [15880628, 16568024], [2472349, 23983949], [26093754, 26973484], [16619806, 22713087]]
[[21, 1, 18, 14], [1, 16, 1, 0], [21, 0, 16, 18], [17, 17, 16, 0]]
[[14081417, 25501866], [22288140, 31117644], [13382772, 27033684], [27682404, 1529772], [24895781, 12458622], [26128167, 30563326], [13501346, 22967534], [23399832, 20079718], [19053852, 23228097], [13169316, 24081357], [29757052, 20749979], [320005, 16310995], [2903567, 8388999], [10993910, 1750867], [5318201, 26192489], [13178763, 20940293]]
[[16, 17, 19, 22], [15, 18, 20, 1], [0, 0, 0, 17], [15, 0, 16, 21]]
[[30310176, 17842758], [18144830, 2585679], [30267766, 14171127], [19514706, 23336072], [11384072, 197879], [8150465, 27256824], [2591770, 28258189], [31286939, 10745292], [17095041, 27245473], [25299750, 17191048], [28583489, 29806273], [32455957, 12266208], [24914844, 27029076], [6366519, 22509455], [16061524, 956750], [24499727, 9588052]]
[[1, 0, 0, 17], [0, 18, 0, 16], [15, 18, 17, 19], [21, 14, 1, 19]]
[[6908858, 18279950], [26338524, 25904649], [4437525, 21647427], [14477786, 3264232], [24305017, 27214101], [14118750, 15850420], [26310068, 2448631], [16068823, 33297113], [25451945, 24050446], [14697440, 24710639], [28960389, 15859973], [32828778, 18800592], [11503789, 18837046], [26580957, 15129200], [33315723, 12784630], [7037684, 13552962]]
[[19, 0, 15, 16], [17, 18, 0, 19], [15, 0, 17, 15], [20, 16, 16, 1]]
[[20523378, 1430453], [32641259, 33163948], [31891152, 16656868], [5442980, 16003427], [30961078, 31873525], [29841779, 21386369], [13988593, 29009317], [15675892, 9776397], [28317901, 5457010], [4466575, 28227301], [2933623, 30632073], [2027748, 18988551], [90994, 8456910], [38978, 1005847], [1119462, 28709640], [28870318, 10052228]]
[[0, 15, 2, 0], [0, 19, 20, 0], [16, 17, 15, 0], [19, 0, 18, 18]]
[[8100105, 17521129], [22074566, 5436966], [22586719, 17969470], [19296428, 15542857], [30118605, 9027775], [14212132, 4168922], [1562680, 16274684], [28159984, 11943884], [24539429, 15756622], [23342694, 24361], [32269670, 32722670], [17218208, 15341300], [18754755, 16251303], [25487088, 775041], [32386048, 21365713], [32570167, 8057751]]
[[17, 18, 17, 0], [16, 1, 2, 2], [16, 0, 18, 2], [17, 0, 0, 22]]
[[15432114, 33103186], [6692049, 3079253], [1216766, 18802963], [22463924, 6077449], [11241728, 2171832], [18493582, 3321334], [12941174, 13735114], [14544620, 30091232], [1192583, 18041401], [32317525, 27167369], [16174797, 32090718], [6844053, 9629461], [10632164, 23500665], [1849715, 4597286], [17692374, 20961443], [4817318, 22833537]]
[[19, 16, 0, 0], [3, 17, 16, 18], [16, 20, 17, 16], [0, 17, 0, 0]]
[[33266555, 26102839], [269635, 24683554], [5092621, 608504], [19349688, 28365257], [13695561, 21020481], [5129209, 24428567], [11804870, 33111012], [15042342, 33492667], [3840892, 8396172], [8372058, 20083007], [7400538, 9144692], [6636020, 19074597], [22188981, 29580387], [16559454, 3126364], [915844, 16104216], [18575181, 11278163]]
[[18, 17, 18, 0], [16, 17, 17, 0], [21, 0, 18, 0], [16, 17, 18, 17]]
[[26063571, 20897151], [13763134, 17954145], [12281876, 11925180], [20261491, 13640676], [21746213, 19578636], [26013228, 14394006], [6883564, 11350133], [4660474, 8471577], [23430903, 22975406], [6515409, 16837569], [24273079, 22185973], [27892578, 17217024], [7134008, 4705327], [18658520, 5073976], [15761150, 28150593], [12447678, 15642389]]
[[18, 0, 22, 15], [15, 1, 0, 0], [19, 16, 14, 16], [0, 18, 0, 0]]
[[27152773, 28111016], [21250232, 2686162], [9647805, 11364320], [8494135, 6498521], [33426612, 11024665], [11408819, 31958674], [10774019, 20878432], [31230877, 21645034], [8443101, 2281193], [30193141, 7090260], [30735528, 20283869], [18536176, 22656762], [738293, 9899028], [21546202, 17366631], [2702796, 15389104], [7088968, 22534095]]
[[16, 1, 26, 19], [0, 17, 15, 18], [16, 16, 0, 1], [16, 0, 20, 0]]
[[17348237, 15006802], [22073886, 17179834], [6936700, 2240047], [32319704, 24250821], [26869228, 21861524], [19943285, 6759072], [31870694, 5146523], [509580, 19900914], [3447365, 9173630], [20082799, 10642210], [2250178, 8262038], [25600265, 24872635], [14841172, 12535378], [1614928, 16192849], [20719469, 13178563], [28055511, 28277533]]
[[15, 23, 0, 0], [22, 18, 0, 1], [16, 16, 17, 18], [1, 0, 16, 21]]
[[27509269, 2357567], [27820342, 20798974], [13444334, 7082116], [32577606, 2859611], [19619904, 26885306], [32853505, 21174327], [18880466, 6739336], [14173402, 14479256], [8163485, 4671600], [235936, 19005623], [19830569, 14871240], [16340122, 32264336], [12369069, 3093718], [16308825, 33495140], [11920357, 7911252], [26244531, 31735408]]
[[19, 17, 16, 0], [1, 17, 17, 17], [19, 0, 18, 19], [16, 15, 18, 17]]
[[1164732, 30768603], [7519654, 10634774], [10704397, 17867203], [19171176, 5536456], [31707468, 708838], [1717624, 5725349], [18659880, 23311393], [12713569, 18829183], [33080115, 19119988], [5702114, 26361493], [23652850, 19062083], [28441341, 17009881], [31858670, 29825442], [7059573, 9877133], [16758741, 4362116], [20647822, 29748498]]
[[0, 18, 15, 18], [1, 0, 0, 18], [16, 0, 0, 18], [19, 17, 16, 20]]
[[29825616, 15588100], [12582033, 19370497], [4178128, 4159478], [17778300, 22199182], [33342244, 6808363], [33519104, 14199487], [27127700, 8934596], [8071433, 23112023], [26129882, 27052430], [1239684, 33251928], [22698403, 23913250], [17207090, 514707], [32813176, 25808101], [2572786, 5859571], [29729608, 6780245], [13434021, 21320934]]
[[2, 20, 1, 1], [0, 0, 18, 19], [19, 18, 14, 0], [0, 0, 0, 21]]
[[8896309, 16679033], [74217, 12204207], [29496472, 24526044], [6186754, 15980795], [1532544, 29707641], [4133613, 27975665], [8287934, 27180432], [28055910, 27440963], [1226632, 10829260], [11865056, 10130238], [24555475, 30033960], [15530525, 8266397], [7567913, 7565741], [15261230, 20182840], [29012540, 17657173], [11912317, 16542066]]
[[17, 16, 2, 0], [0, 18, 18, 0], [2, 1, 15, 17], [17, 21, 0, 16]]
[[14950583, 3215769], [7736683, 7922892], [24878318, 23366294], [6117924, 27738932], [33181143, 4072035], [13999184, 2299868], [8595332, 3998120], [23338734, 9446225], [32388539, 4456226], [11060251, 17938551], [32497845, 8756993], [11178148, 20305093], [26273975, 8392692], [16367336, 29346532], [9054842, 12484191], [4727887, 5448069]]
[[14, 17, 17, 20], [17, 2, 19, 0], [17, 18, 17, 0], [19, 2, 0, 19]]
Average error of (4,4) over 20 runs: 101237.29660950058
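Many R values in the logs above are 0 or 1, which pulls the group averages down. One likely cause, offered as an observation rather than something stated in the assignment, is the parity of the parameters. With no modulus in h(x) = a*x + b, an even a and an odd b make every hash value odd, so the tail length is 0 for every input. The sketch below checks this for two logged (1,1) pairs and contrasts it with an odd a. The helper names and the small test range are assumptions for illustration.

```python
def tail_zeros(c):
    """Trailing-zero count, with 0 treated as tail length 0."""
    return (c & -c).bit_length() - 1 if c else 0

def max_tail(a, b, xs):
    """Largest tail length of a*x + b over the inputs xs."""
    return max(tail_zeros(a * x + b) for x in xs)

xs = range(1, 10001)
# logged pair with even a and odd b: every hash value is odd
print(max_tail(23145178, 14052597, xs))  # 0
# logged pair with a divisible by 4 and b = 2 mod 4: every hash value is 2 mod 4
print(max_tail(23788264, 25556650, xs))  # 1
# same b as the first pair but an odd a (a + 1): tail lengths now vary with x
print(max_tail(23145179, 14052597, xs))
```

Restricting a to odd values is one simple way to avoid these degenerate cases, though that goes beyond what the assignment specifies.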
