String to Unique Integer
PHP provides the popular md5() hash function out of the box, which returns a 32-character hexadecimal string. It’s a great way to generate a fingerprint for a string of arbitrary length. But what if you need to generate an integer fingerprint out of a URL?
Challenge
We faced that challenge in RatingWidget when we had to bind our rating widgets to unique Int64 IDs based on the website page each widget is loaded from. Theoretically we could just store the URLs and query the URL column, but URLs can be very long, and creating an index on a text column of unknown length is very inefficient.
So if you are working on any kind of dynamic widget development that should load different data based on the URL it’s loaded from, this post will save you tonnes of time.
To simplify the problem, let’s divide it into two sub-challenges:
- URL canonization
- String to unique Int64 conversion
URL Canonization
In our case, we wanted to assign a unique Int64 to a page, not to a URL. For instance, http://domain.com?x=1&y=2 and http://domain.com?y=2&x=1 are different URLs, but both of them load the exact same page. Therefore, we wanted to assign them an identical Int64 ID. By canonizing the URLs before mapping them to Int64, we convert them to a uniform representation.
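function canonizeUrl($url)
{
    // Lowercase the whole URL, then split it into its components.
    $url = parse_url(strtolower($url));
    // The scheme and the #fragment are dropped on purpose (see the notes below).
    // A URL such as http://domain.com?x=1 has no path, so guard both keys.
    $host = isset($url['host']) ? $url['host'] : '';
    $path = isset($url['path']) ? $url['path'] : '';
    $canonic = $host . $path;
    if (isset($url['query']))
    {
        parse_str($url['query'], $queryString);
        $canonic .= '?' . canonizeQueryString($queryString);
    }
    return $canonic;
}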
function canonizeQueryString(array $params)
{
    if (!is_array($params) || 0 === count($params))
        return '';
    // Urlencode both keys and values.
    $keys = urlencode_rfc3986(array_keys($params));
    $values = urlencode_rfc3986(array_values($params));
    $params = array_combine($keys, $values);
    // Parameters are sorted by name, using lexicographical byte value ordering.
    // Ref: Spec: 9.1.1 (1)
    uksort($params, 'strcmp');
    $pairs = array();
    foreach ($params as $parameter => $value)
    {
        $lower_param = strtolower($parameter);
        if (is_array($value))
        {
            // If two or more parameters share the same name, they are sorted by their value
            // Ref: Spec: 9.1.1 (1)
            natsort($value);
            foreach ($value as $duplicate_value)
                $pairs[] = $lower_param . '=' . $duplicate_value;
        }
        else
        {
            $pairs[] = $lower_param . '=' . $value;
        }
    }
    if (0 === count($pairs))
        return '';
    return implode('&', $pairs);
}
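function urlencode_rfc3986($input)
{
    if (is_array($input))
        // Recurse by function name: this is a plain function, so there is no $this.
        return array_map('urlencode_rfc3986', $input);
    else if (is_scalar($input))
        // Restore '~', which RFC 3986 treats as unreserved but older PHP versions escape.
        return str_replace('+', ' ', str_replace('%7E', '~', rawurlencode($input)));
    return '';
}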
Basically, this code reorders the query string parameters lexicographically and slightly tweaks the URL encoding to follow the RFC 3986 URI syntax standard, which compensates for inconsistencies in how different browsers and servers encode URLs.
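To see the reordering idea in isolation, here is a minimal sketch. It is not the function above: canonizeUrlCompact is a hypothetical name, and it swaps the hand-written encoding for PHP's built-in http_build_query() with PHP_QUERY_RFC3986, which assumes PHP 5.4 or later. It also serializes repeated (array) parameters differently from canonizeQueryString, so treat it as an illustration rather than a drop-in replacement.

```php
<?php
// Hypothetical compact variant, assuming PHP 5.4+ for PHP_QUERY_RFC3986.
function canonizeUrlCompact($url)
{
    $parts = parse_url(strtolower($url));
    // Scheme and fragment are ignored, as in the article's canonizeUrl().
    $canonic = (isset($parts['host']) ? $parts['host'] : '')
             . (isset($parts['path']) ? $parts['path'] : '');
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        ksort($params, SORT_STRING); // order parameters by name
        $canonic .= '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
    }
    return $canonic;
}

// Same page, different parameter order and protocol: both canonize identically.
$a = canonizeUrlCompact('http://domain.com?x=1&y=2');
$b = canonizeUrlCompact('https://domain.com?y=2&x=1');
var_dump($a === $b); // bool(true)
```

Both calls produce domain.com?x=1&y=2, so the two URLs end up with the same fingerprint once hashed.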
Notes:
- In our case, canonizeUrl, the canonization function, gets rid of the protocol. So https://domain.com and http://domain.com are both canonized to domain.com, because we wanted to show the same rating widget on the HTTP and HTTPS versions of a page.
- As you may notice, we also ignore everything after the hash mark (the fragment). Therefore, if you would like to generate unique IDs for the different states of an SPA (Single Page Application), like http://my-spa.com/#state1 and http://my-spa.com/#state2, the URL canonization function has to be modified to support that.
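One possible way to support SPA states, sketched under assumptions: withFragment is a hypothetical helper (not part of the original code) that takes an already canonized string plus the original URL and re-attaches the fragment. It keeps the fragment's original case, which may or may not be what your application needs.

```php
<?php
// Hypothetical helper: re-attach the #fragment that canonization drops.
function withFragment($canonic, $originalUrl)
{
    // parse_url() returns null when the requested component is absent.
    $fragment = parse_url($originalUrl, PHP_URL_FRAGMENT);
    if ($fragment === null || $fragment === false || $fragment === '') {
        return $canonic;
    }
    return $canonic . '#' . $fragment;
}

echo withFragment('my-spa.com/', 'http://my-spa.com/#state1'), "\n"; // my-spa.com/#state1
echo withFragment('my-spa.com/', 'http://my-spa.com/'), "\n";        // my-spa.com/
```

Hashing the result instead of the plain canonized URL would give each SPA state its own ID.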
Converting a String to a Unique Int64 ID for a MySQL BIGINT Indexed Column
After fooling around with various bit conversion functions like bindec(), decbin(), and base_convert(), we found out that 64-bit integers and PHP don't play well together. None of the mentioned functions consistently supports 64 bits. After digging around on Google, we were led to a post about 32-bit limitations in PHP, which included the suggestion to use GMP, a really cool library for multiple-precision integers. Using this library, we managed to create this one-line hash function that generates a 64-bit integer out of an arbitrary-length string.
function get64BitHash($str)
{
    // First 16 hex characters of the MD5 (64 bits), converted to a decimal string.
    return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);
}
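One detail worth checking before storing these values: the hash is an unsigned 64-bit number returned as a decimal string, so it can be larger than the maximum of a signed MySQL BIGINT (9223372036854775807). The sketch below illustrates that range; exceedsSignedBigint is a hypothetical helper, and the code assumes the GMP extension is installed. If your values can exceed the signed maximum, a BIGINT UNSIGNED column is the safer choice.

```php
<?php
// The article's hash function, repeated so the sketch is self-contained.
function get64BitHash($str)
{
    return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);
}

// Hypothetical check: is the decimal string above the signed BIGINT maximum?
function exceedsSignedBigint($decimal)
{
    $signedMax = gmp_init('9223372036854775807', 10);
    return gmp_cmp(gmp_init($decimal, 10), $signedMax) > 0;
}

// Largest value 16 hex characters can encode.
$largest = gmp_strval(gmp_init('ffffffffffffffff', 16), 10);
echo $largest, "\n";                     // 18446744073709551615
var_dump(exceedsSignedBigint($largest)); // bool(true)

// A real hash is always a string of decimal digits.
var_dump(ctype_digit(get64BitHash('domain.com'))); // bool(true)
```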
In hindsight, we could have implemented the CRC64 algorithm, which generates a string checksum and should perform faster than MD5. The advantage of the technique we used over CRC is that it is built on a one-way hash function, so the same helper can be reused elsewhere in the code where a one-way fingerprint is needed. Keep in mind that MD5 is no longer considered collision-resistant, so a truncated MD5 should not be relied on for security-sensitive purposes.
To find out more about GMP, see here.
Grand Finale
Combining the URL canonization with the String to Int64 mapping, the final solution looks like this:
function urlTo64BitHash($url)
{
    return get64BitHash(canonizeUrl($url));
}
Collision and Performance Test of get64BitHash
- Platform: Intel i3, Windows 7 64 bit, PHP 5.3
- Iterations: 10,000,000 get64BitHash values generated
- Elapsed time: 460 milliseconds for every 100,000 generations
- Collisions: none found
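As a rough sanity check on that result, and assuming the first 64 bits of MD5 behave like uniformly random values (an assumption, not something the test proves), the birthday approximation estimates the chance of at least one collision among n hashes:

```latex
p \;\approx\; \frac{n(n-1)}{2 \cdot 2^{64}} \;\approx\; \frac{n^{2}}{2^{65}},
\qquad
n = 10^{7} \;\Rightarrow\; p \approx \frac{10^{14}}{3.69 \times 10^{19}} \approx 2.7 \times 10^{-6}
```

So finding no collision in 10,000,000 generations is the expected outcome. The same estimate gives roughly 2.7% at a billion URLs, so "unique" here means "unique with very high probability", and a unique index on the column is a cheap safeguard.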
Summary
I hope this straightforward solution will save you time on your next project. If you have comments or any additional use-cases where this technique can be applied, please feel free to comment below.
Translated from: https://www.sitepoint.com/create-unique-64bit-integer-string/