How to Create a Unique 64-bit Integer from a String

PHP provides the popular md5() hash function out of the box, which returns a 32-character hexadecimal string. It’s a great way to generate a fingerprint for a string of arbitrary length. But what if you need to generate an integer fingerprint from a URL?

Challenge

We faced that challenge at RatingWidget when we had to bind our rating widgets to unique Int64 IDs based on the website page they are loaded from. Theoretically we could just store the URLs and query the URL column, but URLs can be very long, and indexing a text column of unknown length is very inefficient.

So if you are working on any kind of dynamic widget development that should load different data based on the URL it’s loaded from, this post will save you tonnes of time.

To simplify the problem, let’s divide it into two sub-challenges:

  1. URL Canonization
  2. String to unique Int64 conversion

URL Canonization

In our case, we wanted to assign a unique Int64 to a page, not to a URL. For instance, http://domain.com?x=1&y=2 and http://domain.com?y=2&x=1 are different URLs, but both of them load the exact same page. Therefore, we wanted to assign them an identical Int64 ID. By canonizing the URLs before mapping them to Int64, we convert equivalent URLs to a uniform representation.

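    function canonizeUrl($url)
    {
        $url = parse_url(strtolower($url));

        // 'path' is absent for URLs such as http://domain.com?x=1, so default it to ''.
        $canonic = $url['host'] . (isset($url['path']) ? $url['path'] : '');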
        
        if (isset($url['query']))
        {
            parse_str($url['query'], $queryString);
            $canonic .= '?' . canonizeQueryString($queryString);
        }
        
        return $canonic;
    }
    
    function canonizeQueryString(array $params)
    {
        if (!is_array($params) || 0 === count($params))
            return '';

        // Urlencode both keys and values.
        $keys = urlencode_rfc3986(array_keys($params));
        $values = urlencode_rfc3986(array_values($params));
        $params = array_combine($keys, $values);
    
        // Parameters are sorted by name, using lexicographical byte value ordering.
        // Ref: Spec: 9.1.1 (1)
        uksort($params, 'strcmp');
    
        $pairs = array();
        foreach ($params as $parameter => $value) 
        {
            $lower_param = strtolower($parameter);
            
            if (is_array($value)) 
            {
                // If two or more parameters share the same name, they are sorted by their value
                // Ref: Spec: 9.1.1 (1)
                natsort($value);
                foreach ($value as $duplicate_value)
                    $pairs[] = $lower_param . '=' . $duplicate_value;
            } 
            else 
            {
                $pairs[] = $lower_param . '=' . $value;
            }
        }            
        
        if (0 === count($pairs))
            return '';
    
        return implode('&', $pairs);
    }

    function urlencode_rfc3986($input) 
    {
        // Recurse by function name; this is a plain function, so $this is not available.
        if (is_array($input))
            return array_map('urlencode_rfc3986', $input);
        else if (is_scalar($input)) 
            return str_replace('+', ' ', str_replace('%7E', '~', rawurlencode($input)));
            
        return '';
    }

Basically, this code reorders the query string parameters in lexicographical order and slightly tweaks the URL encoding based on the RFC 3986 URI syntax standard, to compensate for URL encoding inconsistencies between different browsers and servers.
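
As a rough illustration of the reordering idea, the sketch below uses a simplified stand-in rather than the functions above. `sketchCanonizeUrl` is a hypothetical name, and it leans on PHP's built-in `ksort()` and `http_build_query()` with `PHP_QUERY_RFC3986` (available from PHP 5.4) instead of the custom encoder. Under those assumptions, the two example URLs should collapse to the same string:

```php
<?php
// Simplified sketch (not the article's implementation): drop the scheme and
// fragment, then rebuild the query string with parameters sorted by name.
function sketchCanonizeUrl($url)
{
    $parts = parse_url(strtolower($url));

    $canonic = (isset($parts['host']) ? $parts['host'] : '')
             . (isset($parts['path']) ? $parts['path'] : '');

    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        ksort($params, SORT_STRING); // order parameters by name
        // PHP_QUERY_RFC3986 encodes spaces as %20 instead of "+".
        $canonic .= '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
    }

    return $canonic;
}

$a = sketchCanonizeUrl('http://domain.com?x=1&y=2');
$b = sketchCanonizeUrl('https://domain.com?y=2&x=1');

var_dump($a === $b); // both inputs canonize to the same string
echo $a, "\n";
```

The built-in encoder keeps the sketch short, but it does not reproduce every detail of the longer version, such as sorting duplicate values.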

Notes:

  1. In our case canonizeUrl, the canonization function, gets rid of the protocol. So https://domain.com and http://domain.com are both canonized to domain.com, because we wanted to show the same rating widget on equivalent HTTP and HTTPS pages.

  2. As you may notice, we also ignore everything after the hash mark (the fragment). Therefore, if you would like to generate unique IDs for the different states of an SPA (Single Page Application), like http://my-spa.com/#state1 and http://my-spa.com/#state2, the URL canonization function has to be modified to support that.
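
For the SPA case in note 2, one possible modification is to append the fragment to the canonized string. The sketch below is only an assumption about how that could look: `sketchCanonizeSpaUrl` is a hypothetical name, query-string handling is left out for brevity, and only the host is lowercased so that the fragment keeps its original case.

```php
<?php
// Hypothetical variant: keep the fragment so that SPA states map to
// different canonized strings. Query-string handling is omitted here.
function sketchCanonizeSpaUrl($url)
{
    $parts = parse_url($url);

    $canonic = strtolower(isset($parts['host']) ? $parts['host'] : '')
             . (isset($parts['path']) ? $parts['path'] : '');

    // parse_url() exposes the part after "#" under the 'fragment' key.
    if (isset($parts['fragment']) && '' !== $parts['fragment'])
        $canonic .= '#' . $parts['fragment'];

    return $canonic;
}

$state1 = sketchCanonizeSpaUrl('http://my-spa.com/#state1');
$state2 = sketchCanonizeSpaUrl('http://my-spa.com/#state2');

var_dump($state1 !== $state2); // the two states no longer share one string
```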

Converting a String to a Unique Int64 ID for a MySQL BIGINT Indexed Column

After fooling around with various bit conversion functions like bindec(), decbin(), and base_convert(), we found out that 64-bit integers and PHP do not play well together. None of the mentioned functions consistently supports 64 bits. After digging around on Google, we were led to a post about 32-bit limitations in PHP, which included the suggestion to use GMP, a really cool library for multiple precision integers. Using this library, we managed to create this one-line hash function that generates a 64-bit integer out of an arbitrary-length string.

    function get64BitHash($str)
    {
        return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);
    }
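
To make the one-liner less opaque: it keeps the first 16 hex characters of the MD5 digest (16 × 4 = 64 bits) and rewrites that number in base 10. The sketch below repeats the conversion in plain PHP, without GMP, purely for illustration; `sketchHexToDecimal` and `sketchGet64BitHash` are hypothetical names, not part of the original solution. Note that the result is an unsigned value that can reach 18446744073709551615, so it would presumably need a BIGINT UNSIGNED column rather than a signed BIGINT.

```php
<?php
// Convert a hex string to a decimal string using schoolbook arithmetic,
// so the result is not limited by PHP's native integer size.
function sketchHexToDecimal($hex)
{
    $digits = array(0); // decimal digits, least significant first

    foreach (str_split(strtolower($hex)) as $char) {
        // Multiply the running number by 16 and add the next hex digit.
        $carry = hexdec($char);
        for ($i = 0; $i < count($digits); $i++) {
            $value = $digits[$i] * 16 + $carry;
            $digits[$i] = $value % 10;
            $carry = (int) ($value / 10);
        }
        while ($carry > 0) {
            $digits[] = $carry % 10;
            $carry = (int) ($carry / 10);
        }
    }

    return implode('', array_reverse($digits));
}

// Same idea as get64BitHash(): first 64 bits of the MD5 digest, in base 10.
function sketchGet64BitHash($str)
{
    return sketchHexToDecimal(substr(md5($str), 0, 16));
}

echo sketchHexToDecimal('ffffffffffffffff'), "\n"; // 18446744073709551615
echo sketchGet64BitHash('domain.com?x=1&y=2'), "\n";
```

Where GMP is installed, the original one-liner is the shorter choice; this version only spells out the arithmetic.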

Post factum, we could have implemented the CRC64 algorithm which generates a string checksum and should perform faster than MD5. But the advantage of the technique we’ve used over CRC is that we’ve created a one-way-hash function, so we can reuse it for various cryptography purposes in the code.

To find out more about GMP, see here.

Grand Finale

Combining the URL canonization with the String to Int64 mapping, the final solution looks like this:

    function urlTo64BitHash($url)
    {
        return get64BitHash(canonizeUrl($url));
    }

Collision and Performance Test of get64BitHash

Platform: Intel i3, Windows 7 64 bit, PHP 5.3
Iterations: 10,000,000 times generated get64BitHash
Elapsed Time: 460 milliseconds for every 100,000 generations
Collision: Not found
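
The original post does not show the test harness, so the following is only a guess at how such a check could be written. `sketchCollisionTest` and the generated page URLs are made up for illustration, the iteration count is reduced to keep memory use modest, and timings will differ from the figures above. Because the decimal ID is derived one-to-one from the 16-character hex prefix, comparing prefixes is enough to detect collisions.

```php
<?php
// Hypothetical harness: hash many distinct strings, count repeated
// 64-bit prefixes, and measure the elapsed time.
function sketchCollisionTest($iterations)
{
    $seen = array();
    $collisions = 0;
    $start = microtime(true);

    for ($i = 0; $i < $iterations; $i++) {
        // Made-up input; any set of distinct strings would do.
        $prefix = substr(md5('domain.com/page-' . $i), 0, 16);

        if (isset($seen[$prefix]))
            $collisions++;

        $seen[$prefix] = true;
    }

    return array(
        'collisions' => $collisions,
        'seconds'    => microtime(true) - $start,
    );
}

$result = sketchCollisionTest(100000);
printf("Collisions: %d, elapsed: %.3f s\n", $result['collisions'], $result['seconds']);
```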

Summary

I hope this straightforward solution will save you time on your next project. If you have comments or any additional use-cases where this technique can be applied, please feel free to comment below.

Translated from: https://www.sitepoint.com/create-unique-64bit-integer-string/
