Hashing


I. What is hashing?

Hashing transforms a string of characters into a usually shorter, fixed-length value or key that represents the original string, so that searching is faster.


II. Common hashing methods

1. Division-remainder method:

The number of items in the table is estimated. That number is then used as the divisor for each original value or key, giving a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce collisions, any search mechanism has to be able to recognize a collision and offer an alternate search mechanism.)
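
The division-remainder idea can be sketched in a few lines of Python. The table size of 11 and the sample keys are hypothetical values picked so that a collision appears. Separate chaining is used here only as one possible way to cope with collisions; the text above does not prescribe a specific strategy.

```python
from collections import defaultdict


def division_remainder_hash(key: int, table_size: int) -> int:
    """Return the remainder of key / table_size as the hashed value."""
    if table_size <= 0:
        raise ValueError("table_size must be positive")
    return key % table_size


def build_chained_table(keys, table_size):
    """Group keys by hashed value so that colliding keys share one slot."""
    table = defaultdict(list)
    for key in keys:
        table[division_remainder_hash(key, table_size)].append(key)
    return dict(table)


# Hypothetical table size and keys, chosen so that two keys collide.
TABLE_SIZE = 11
table = build_chained_table([1234, 5678, 9012], TABLE_SIZE)
print(table)  # {2: [1234, 5678], 3: [9012]}
```

Keys 1234 and 5678 both leave remainder 2, so a lookup in slot 2 has to compare against both stored keys.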

2. Folding method:

This method divides the original value (digits in this case) into several parts, adds the parts together, and then uses the last four digits (or some other arbitrary number of digits that will work) as the hashed value or key.
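
A minimal sketch of folding, assuming the key is a non-negative integer that is split from the left into groups of four digits. The group size, the function name, and the sample key are illustrative assumptions, not part of the description above.

```python
def folding_hash(key: int, part_size: int = 4, hash_digits: int = 4) -> int:
    """Split the decimal digits into parts, add them, keep the last digits."""
    if key < 0:
        raise ValueError("key must be non-negative")
    digits = str(key)
    # Cut the digit string into consecutive chunks of part_size digits.
    parts = [int(digits[i:i + part_size])
             for i in range(0, len(digits), part_size)]
    # Keeping the last hash_digits digits is the same as a modulo by 10**hash_digits.
    return sum(parts) % (10 ** hash_digits)


# 1234 + 5678 + 9012 = 15924, and the last four digits are 5924.
print(folding_hash(123456789012))  # 5924
```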

3. Radix transformation method:

Where the value or key is numeric, the number base (or radix) can be changed, resulting in a different sequence of digits. (For example, a decimal key could be transformed into a hexadecimal key.) High-order digits can be discarded to fit a hash value of uniform length.
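
The radix change can be illustrated as follows. This is a sketch under stated assumptions: the target base of 16, the three-digit output length, and the left padding with zeros for short results are choices made for the example only.

```python
DIGITS = "0123456789abcdef"


def to_radix(key: int, radix: int) -> str:
    """Write a non-negative integer in the given base (2 to 16)."""
    if key < 0 or not 2 <= radix <= len(DIGITS):
        raise ValueError("key must be >= 0 and radix between 2 and 16")
    if key == 0:
        return "0"
    out = []
    while key > 0:
        key, remainder = divmod(key, radix)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def radix_hash(key: int, radix: int = 16, hash_digits: int = 3) -> str:
    """Change the base, then drop high-order digits to get a uniform length."""
    digits = to_radix(key, radix)
    # Keep only the low-order digits; pad short results (an assumption).
    return digits[-hash_digits:].rjust(hash_digits, "0")


print(to_radix(1234567, 16))  # 12d687
print(radix_hash(1234567))    # 687
```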

4. Digit rearrangement method:

This method simply takes part of the original value or key, such as the digits in positions 3 through 6, reverses their order, and uses that sequence of digits as the hash value or key.
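
A short sketch of digit rearrangement, assuming positions are counted from 1 starting at the leftmost digit. The function name and the sample key are hypothetical.

```python
def digit_rearrangement_hash(key: int, start: int = 3, end: int = 6) -> str:
    """Take the digits in positions start..end (1-based) and reverse them."""
    if key < 0:
        raise ValueError("key must be non-negative")
    digits = str(key)
    if not 1 <= start <= end <= len(digits):
        raise ValueError("positions fall outside the key")
    selected = digits[start - 1:end]
    return selected[::-1]


# Positions 3 through 6 of 12345678 are "3456", reversed to "6543".
print(digit_rearrangement_hash(12345678))  # 6543
```

The result is returned as a string so that a leading zero in the reversed digits is not lost.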

III. Hashing algorithms in A/B test sampling:

1. Collision problem:

For example, if you hash 500 million item IDs into 1,000 buckets, the number of items per bucket may be balanced, but the related metrics may not be.
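
The effect can be explored with a small simulation. Everything in it is an assumption made for illustration: the scaled-down sizes (100,000 items, 100 buckets), MD5 as the hash, the `item-<n>` ID format, and a Pareto distribution standing in for a long-tailed per-item metric. With such a metric, bucket totals will typically vary far more than bucket item counts.

```python
import hashlib
import random
from collections import Counter, defaultdict


def bucket_of(item_id: str, n_buckets: int) -> int:
    """Map an item ID to a bucket with a stable hash (MD5, by assumption)."""
    digest = hashlib.md5(item_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def simulate(n_items: int = 100_000, n_buckets: int = 100, seed: int = 42):
    """Return per-bucket item counts and per-bucket metric totals."""
    rng = random.Random(seed)
    counts = Counter()
    metric_sums = defaultdict(float)
    for i in range(n_items):
        bucket = bucket_of(f"item-{i}", n_buckets)
        counts[bucket] += 1
        # Long-tailed synthetic metric (hypothetical, e.g. revenue per item).
        metric_sums[bucket] += rng.paretovariate(1.1)
    return counts, metric_sums


def relative_spread(values) -> float:
    """(max - min) / mean, a crude measure of imbalance across buckets."""
    values = list(values)
    return (max(values) - min(values)) / (sum(values) / len(values))


counts, metric_sums = simulate()
print("count spread: ", round(relative_spread(counts.values()), 3))
print("metric spread:", round(relative_spread(metric_sums.values()), 3))
```

Comparing the two printed spreads shows whether balanced counts also mean balanced metrics for the chosen distribution.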

Possible solutions:
–Identify root cause
–Compute Fisher exact p-value:
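Based on the hypergeometric distribution, Fisher's exact test directly computes the exact p-value used to decide whether to reject the null hypothesis. When the sample size is relatively small, it serves as a complement to the chi-square test.
Chi-square test: if n mutually independent random variables $\xi_1, \xi_2, \ldots, \xi_n$ all follow the standard normal distribution (that is, they are i.i.d. standard normal), then the sum of their squares
$Q = \sum_{i=1}^{n} \xi_i^2$ is a new random variable whose distribution is called the $\chi^2$ distribution (chi-square distribution). The parameter n is called the degrees of freedom; just as a normal distribution with a different mean or variance is a different normal distribution, a different number of degrees of freedom gives a different $\chi^2$ distribution. This is written $Q \sim \chi^2(n)$. The chi-square distribution is thus a new distribution constructed from the normal distribution, and when the degrees of freedom n is large, the $\chi^2$ distribution is approximately normal. For any positive integer n, the chi-square distribution with n degrees of freedom is the probability distribution of such a random variable Q.
When the observed data do not match expectations, the chi-square test is used to check whether the deviation is normal fluctuation or a modeling error.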

–More detailed checks for balance – need more item covariates

–Handling long-tailed data and outliers

–Confirm by using analysis methods that correct for the root cause
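
The two tests mentioned in the list above can be sketched for a 2x2 table using only the standard library. This is an illustration under assumptions, not a reference implementation: the table layout (rows as two groups, columns as two outcomes), the two-sided rule of summing all tables no more probable than the observed one, and the chi-square test without continuity correction are choices made here. The sample counts are hypothetical.

```python
from math import comb, erfc, sqrt


def hypergeom_pmf(a: int, row1: int, row2: int, col1: int) -> float:
    """P(top-left cell == a) for a 2x2 table with fixed margins."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = hypergeom_pmf(a, row1, row2, col1)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for x in range(lowest, highest + 1):
        p = hypergeom_pmf(x, row1, row2, col1)
        # Small tolerance so that ties with the observed table are counted.
        if p <= p_observed * (1 + 1e-9):
            total += p
    return min(1.0, total)


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column must have a non-zero total")
    statistic = n * (a * d - b * c) ** 2 / margins
    # With 1 degree of freedom, P(Q > x) = erfc(sqrt(x / 2)).
    return statistic, erfc(sqrt(statistic / 2))


# Hypothetical small-sample table: rows are two groups, columns are outcomes.
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
stat, p = chi_square_2x2(3, 1, 1, 3)
print(stat, round(p, 4))  # 2.0 0.1573
```

On this small table the two p-values differ noticeably, which is the situation where the exact test is the safer complement to the chi-square approximation.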
