Grouping of Anagrams in an Array

weixin_26720549

Published on 2020-09-02 23:46:23

Original link: https://medium.com/swlh/grouping-of-anagrams-in-an-array-70cabfd3414

Problem:

Given an array of strings, group anagrams together.

For example, given the following array:

['eat', 'ate', 'apt', 'pat', 'tea', 'now']

Return:

[
  ['eat', 'ate', 'tea'], 
  ['apt', 'pat'], 
  ['now']
]

Solution:

At first glance, this looks like a simple comparison problem. A naive algorithm compares every string with every other string and puts two strings in the same bucket if they are anagrams of each other. What is the complexity of this naive algorithm? If C is the average length of a string and N is the total number of strings, the complexity is O(CN²): each string-to-string comparison takes O(C) time (for example, by comparing letter counts), and there are O(N²) string pairs.
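The naive approach can be sketched as follows. This is only an illustration, not code from the original post: the names `is_anagram` and `group_naive` are hypothetical, and the input is assumed to contain only lowercase letters a to z.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: O(C) anagram check that compares letter counts.
// Assumes both strings contain only lowercase letters a-z.
bool is_anagram(const std::string& a, const std::string& b) {
  if (a.size() != b.size()) return false;
  std::array<int, 26> counts{};
  for (char c : a) { assert(c >= 'a' && c <= 'z'); counts[c - 'a']++; }
  for (char c : b) { assert(c >= 'a' && c <= 'z'); counts[c - 'a']--; }
  for (int n : counts) {
    if (n != 0) return false;
  }
  return true;
}

// Hypothetical naive grouping: every string is compared against one
// representative of each existing bucket, which costs O(N^2)
// comparisons in the worst case (when no two strings are anagrams).
std::vector<std::vector<std::string>> group_naive(
    const std::vector<std::string>& strings) {
  std::vector<std::vector<std::string>> buckets;
  for (const auto& s : strings) {
    bool placed = false;
    for (auto& bucket : buckets) {
      if (is_anagram(bucket.front(), s)) {
        bucket.push_back(s);
        placed = true;
        break;
      }
    }
    if (!placed) buckets.push_back({s});
  }
  return buckets;
}
```

Buckets come out in the order in which their first member appears in the input.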

A little more insight gives a more efficient hashing-based algorithm:

Compute a hash value for every string. The hash function must be chosen carefully so that strings that are anagrams of each other get the same hash value, while strings that are not anagrams of each other get different hash values. The hash function must also be efficient to compute.
After computing the hash value, insert the string into an unordered_map keyed by that value. If the hash function is good enough, the resulting unordered_map contains exactly the right buckets of anagrams.

The remainder of this section develops a few hash functions and discusses their pros and cons.

Hash Function and Prime Factorization

Assign a unique prime number to every character from a to z (for example, the first 26 primes: a:2, b:3, c:5, d:7, ...).
Construct a prime factorization for the given string:

For every character Cᵢ, find the number of times, Xᵢ, that it appears in the string.
Let P(Cᵢ) be the prime number associated with Cᵢ. Calculate P(Cᵢ)^Xᵢ.
The hash value is the product of these P(Cᵢ)^Xᵢ quantities over all characters that appear in the string.

So, assuming the standard prime encoding of characters defined above {a:2, b:3, c:5, d:7, ...}, the hash value for aab or aba is 2² * 3 = 12, and the hash value for ccc is 5³ = 125.
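The arithmetic can be checked with a short sketch that follows the steps literally: count each letter first, then multiply in the matching prime once per occurrence. The names `kPrimes` and `factorization_hash` are hypothetical, and lowercase a to z input is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// The first 26 primes, one per letter: a:2, b:3, ..., z:101.
constexpr std::array<std::uint64_t, 26> kPrimes = {
    2,  3,  5,  7,  11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101};

// Hypothetical helper: product of P(C_i)^X_i over all letters.
// No overflow protection; intended for short strings only.
std::uint64_t factorization_hash(const std::string& s) {
  std::array<int, 26> freq{};  // X_i: occurrences of each letter
  for (char c : s) {
    assert(c >= 'a' && c <= 'z');
    freq[c - 'a']++;
  }
  std::uint64_t h = 1;
  for (int i = 0; i < 26; ++i) {
    for (int k = 0; k < freq[i]; ++k) {
      h *= kPrimes[i];  // multiplying X_i times yields P(C_i)^X_i
    }
  }
  return h;
}
```

With this encoding, `factorization_hash("aab")` and `factorization_hash("aba")` both evaluate to 12, and `factorization_hash("ccc")` evaluates to 125.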

Here is a simple algorithm to bucketize the strings using the hash function.

#include <string>
#include <unordered_map>
#include <vector>

// The first 26 primes, one per letter: a:2, b:3, ..., z:101.
static const int PRIMES[26] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101};

using V = std::vector<std::string>;

// Product of the primes of all characters. Assumes lowercase a-z only;
// the product silently wraps around for long strings.
unsigned long long hash_function(const std::string& str) {
  unsigned long long h = 1;
  for (const char c : str) {
    h *= PRIMES[c - 'a'];
  }
  return h;
}

std::vector<V> anagrams(const V& strings) {
  std::unordered_map<unsigned long long, V> table;
  for (const auto& s : strings) {
    // operator[] creates an empty bucket the first time a key is seen.
    table[hash_function(s)].push_back(s);
  }
  // Reconstruct the buckets from table
  std::vector<V> result;
  for (const auto& kv : table) {
    result.push_back(kv.second);
  }
  return result;
}

Pros:

The hash function is easy to compute and efficient. It takes O(C) time and constant space, where C is the number of characters in the string.
The hash function is good: the Fundamental Theorem of Arithmetic guarantees that every integer greater than 1 has a unique prime factorization, so the product depends only on how often each character appears in the string, not on where it appears. As a result, anagrams have identical hash values, and, as long as the product does not overflow, non-anagrams are guaranteed to have different hash values.
The entire algorithm runs in O(NC) time and takes O(NC) space, where N is the total number of strings and C is the average length of a string.

Cons:

The biggest drawback of this hash function is integer overflow. The C++ standard only guarantees that unsigned long long has at least 64 bits, and on most architectures it occupies exactly 64 bits. So, the largest value it can represent is 2⁶⁴ − 1, or about 1.8 * 10¹⁹. The largest prime number in our map is 101 (corresponding to the letter z). That means that as soon as a string reaches 10 characters, an overflow becomes possible: 101¹⁰ ≈ 1.1 * 10²⁰ > 1.8 * 10¹⁹. Unsigned overflow wraps around silently, so after an overflow two strings that are not anagrams can end up with the same hash value.
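One way to make the overflow visible rather than silent is to check before every multiplication. The sketch below is an assumption-laden illustration, not part of the original post: `checked_prime_hash` and `kPrimes` are hypothetical names, the code needs C++17 for `std::optional`, and the caller still has to decide what to do when no exact hash exists.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>

// The first 26 primes, one per letter: a:2, b:3, ..., z:101.
constexpr std::uint64_t kPrimes[26] = {
    2,  3,  5,  7,  11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101};

// Hypothetical helper: returns the prime-product hash, or std::nullopt
// if the exact product does not fit into 64 bits.
std::optional<std::uint64_t> checked_prime_hash(const std::string& s) {
  constexpr std::uint64_t kMax = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t h = 1;
  for (char c : s) {
    assert(c >= 'a' && c <= 'z');
    const std::uint64_t p = kPrimes[c - 'a'];
    if (h > kMax / p) {
      return std::nullopt;  // h * p would exceed 2^64 - 1
    }
    h *= p;
  }
  return h;
}
```

Under these assumptions a string of nine z characters still produces a value, while a string of ten z characters yields `std::nullopt`.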

std::array and boost::hash

The hash function defined above is transparent, and anyone reading your code should be able to understand it easily. In this section, we instead rely on library support for hashing container types and delegate the hash computation to library code. In particular, we are going to use boost::hash_combine.

Here is a general idea:

Instead of an integer or a long, use a 26-element std::array of per-letter counts as the key for the unordered_map. So, our map type becomes std::unordered_map<std::array<int, 26>, std::vector<std::string>>!
Does the above definition work? Unfortunately, std::hash has no specialization for std::array, so the map cannot hash the key by default and we have to supply a hash from the outside. Looking closely, std::unordered_map has a third template parameter, Hash, which accepts any function-object type whose call operator takes the key and returns a std::size_t.
Once you define the correct hash function and map type, inserting into and iterating over the map is identical to the code above.

Here is the implementation:

#include <array>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

#include <boost/functional/hash.hpp>  // boost::hash_combine

using A = std::array<int, 26>;
using V = std::vector<std::string>;

// Hash functor for the letter-count array. unordered_map expects a type
// here, so the hash is wrapped in a struct instead of a free function.
struct array_hash {
  std::size_t operator()(const A& a) const {
    std::size_t seed = 0;
    for (int i = 0; i < 26; i++) {
      boost::hash_combine(seed, a[i]);
    }
    return seed;
  }
};

std::vector<V> anagrams(const V& strings) {
  std::unordered_map<A, V, array_hash> table;
  for (const auto& str : strings) {
    A counters{};  // value-initialize so that every count starts at 0
    for (const char c : str) {
      counters[c - 'a']++;  // assumes lowercase a-z only
    }
    table[counters].push_back(str);
  }
  // Reconstruct the buckets from table
  std::vector<V> result;
  for (const auto& kv : table) {
    result.push_back(kv.second);
  }
  return result;
}

Pros:

Unlike the previous algorithm, we don't have to invent our own hash function. This is especially useful when you lack domain knowledge about the problem, or are not aware of mathematical properties like the Fundamental Theorem of Arithmetic.
This algorithm, like the previous one, runs in O(NC) time and takes O(NC) space, where N is the total number of strings and C is the average length of a string.

Cons:

The hash function here is opaque, and we need to dig deep into the boost::hash_combine definition to figure out how it works. The hash collisions are also hard to characterize. Collisions only cost time, though, not correctness: the map still compares the full count arrays for equality before it puts two strings in the same bucket.
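To make the combining step less opaque, here is a Boost-free stand-in. It uses the mixing formula that was documented for older Boost releases; newer Boost versions may mix differently, so treat this as an assumption about the general shape rather than the library's current implementation. The names `hash_combine_sketch`, `ArraySketchHash`, and `Table` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using A = std::array<int, 26>;

// Hypothetical stand-in for boost::hash_combine: folds the hash of one
// value into a running seed (formula from older Boost documentation).
inline void hash_combine_sketch(std::size_t& seed, int value) {
  seed ^= std::hash<int>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Function-object type that can serve as the map's Hash parameter.
struct ArraySketchHash {
  std::size_t operator()(const A& a) const {
    std::size_t seed = 0;
    for (int v : a) {
      hash_combine_sketch(seed, v);
    }
    return seed;
  }
};

// Map from letter counts to the strings that share those counts.
using Table = std::unordered_map<A, std::vector<std::string>, ArraySketchHash>;
```

Because the seed depends on every count in order, two equal count arrays always hash to the same value, which is the only property the grouping algorithm requires of the hash.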

Testing:

We have an opportunity to use the UnorderedElementsAre and AnyOf matchers with GUnit. Here are some test cases to try:

Empty array
Single-element array
Array of strings in which no two strings are anagrams
Array in which all strings are anagrams of each other
Complex array like the one in the example

GTEST("Empty array") {
  EXPECT_TRUE(anagrams({}).empty());
}


GTEST("Single element array") {
  const auto result = anagrams({"abc"});
  EXPECT_EQ(1u, result.size());
  EXPECT_THAT(result[0], ElementsAre("abc"));
}


GTEST("No anagrams") {
  const auto result = anagrams({"abc", "cbd", "dab"});
  EXPECT_EQ(3u, result.size());
  // Bucket order is unspecified, so each bucket may hold any one string.
  EXPECT_THAT(result[0],
    AnyOf(ElementsAre("abc"), ElementsAre("cbd"), ElementsAre("dab")));
  EXPECT_THAT(result[1],
    AnyOf(ElementsAre("abc"), ElementsAre("cbd"), ElementsAre("dab")));
  EXPECT_THAT(result[2],
    AnyOf(ElementsAre("abc"), ElementsAre("cbd"), ElementsAre("dab")));
}


GTEST("All anagrams") {
  const auto result = anagrams({"abca", "aabc", "baca", "acab"});
  EXPECT_EQ(1u, result.size());
  EXPECT_THAT(result[0],
    UnorderedElementsAre("abca", "aabc", "baca", "acab"));
}
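The last case in the list, the complex array from the problem statement, has no test above. Below is a framework-free sketch of how it might be checked with plain assert. To keep the snippet compilable on its own it uses a stand-in grouping function built on std::map, which works without a custom hash because std::array supports ordering comparison; this stand-in is not one of the two implementations above. The names `anagrams_standin`, `normalized`, and `check_example` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using V = std::vector<std::string>;

// Stand-in grouping keyed by letter counts (assumes lowercase a-z).
std::vector<V> anagrams_standin(const V& strings) {
  std::map<std::array<int, 26>, V> table;
  for (const auto& s : strings) {
    std::array<int, 26> counts{};
    for (char c : s) counts[c - 'a']++;
    table[counts].push_back(s);
  }
  std::vector<V> result;
  for (auto& kv : table) result.push_back(std::move(kv.second));
  return result;
}

// Sort inside each bucket and across buckets, so that two groupings can
// be compared without depending on bucket or element order.
std::vector<V> normalized(std::vector<V> groups) {
  for (auto& g : groups) std::sort(g.begin(), g.end());
  std::sort(groups.begin(), groups.end());
  return groups;
}

// Checks the example from the problem statement.
void check_example() {
  const V input = {"eat", "ate", "apt", "pat", "tea", "now"};
  const std::vector<V> expected = {
      V{"eat", "ate", "tea"}, V{"apt", "pat"}, V{"now"}};
  assert(normalized(anagrams_standin(input)) == normalized(expected));
}
```

Normalizing both sides plays the same role as UnorderedElementsAre: the check does not depend on the iteration order of the underlying map.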

Originally published at https://cppcodingzen.com on September 10, 2020.
