Database test data generation: How our test data generator makes fake data look real

cumi7754

Original article: https://www.freecodecamp.org/news/how-our-test-data-generator-makes-fake-data-look-real-ace01c5bde4a/

by Tom Winter

How our test data generator makes fake data look real

We recently released DataFairy, a free tool that generates test data. But first, let me tell you the story of how it came about.

This is the story of how we turned a fun open source side project into something that has turned out to be really useful.

This is not about fake news or tricking the masses. But the fact remains that for developers, software testers, and really anyone who has ever given a demo, fake data is essential and is surprisingly difficult to make up off the top of your head.

Our story with fake data starts back when we first developed our SaaS tool, Devskiller. Like all applications, we needed users. We weren’t even looking for paying users at this point. We just needed candidate profiles for our application. What we needed was dummy data that looked real.

We needed a test data generator

We needed fake data for a couple of reasons:

1. We needed to see if our system worked

This meant that we needed to build a number of different dummy profiles to see if the system stored and displayed them correctly.

2. We needed to sell our product

We needed to do demos for our first prospective customers. We wanted to show our customers what the system would look like after 6 months of inviting and testing hundreds of candidates.

Our first thought was to look for an available test data generator. But the problem is that data is hard to fake convincingly.

A lot of data is validated algorithmically

If it was easy to make convincing data, we probably wouldn’t need a tool. But generating data can be tricky for a couple of reasons.

Fake data is more than just random numbers. Take the example of a credit card number. Most credit card numbers can be checked with a checksum formula called the Luhn algorithm. To explain this, we are going to use the example of a Visa card:

How to check if a credit card number is valid

Before you start, it’s important to know that all Visa card numbers start with a 4. Also, they all have either 16 or 13 digits.

Take this Visa card number:

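To validate the number, first double every other digit, starting with the first digit in the sequence. (This starting point holds for a 16-digit number like the one below. In general, the Luhn algorithm doubles every second digit counting from the right, starting with the digit next to the last one.)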

4574487405351567

(4x2), (7x2), (4x2), (7x2), (0x2), (3x2), (1x2), (6x2)

8, 14, 8, 14, 0, 6, 2, 12

If the doubling that you’ve just done results in a number with two digits, add them together to get a single digit number.

8, 5, 8, 5, 0, 6, 2, 3

You then need to go back to the original credit card number and replace the digits that you doubled with their new values.

8554885405652537

Each new value is either the doubled digit itself or, when doubling gave a two-digit number, the sum of its two digits. Now add it all up.

8+5+5+4+8+8+5+4+0+5+6+5+2+5+3+7=80

And then check to see if the sum is evenly divisible by 10. In this case it is, so the number is valid.
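
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in JFairy. The function name `luhn_is_valid` is made up for this example, and it counts from the right so that it also works for numbers that do not have 16 digits.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same result as adding the two digits together
        total += digit
    # Valid when the sum is evenly divisible by 10.
    return len(digits) > 0 and total % 10 == 0


print(luhn_is_valid("4574487405351567"))  # True (the sum is 80)
print(luhn_is_valid("4574487405351568"))  # False (the sum is 81)
```

Subtracting 9 from a doubled digit above 9 is a common shortcut for adding its two digits: 14 becomes 5 either way.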

You need some sort of computational algorithm to validate credit card numbers at scale. But credit card numbers are relatively easy pieces of data to get right. We didn’t just need individual pieces of verifiable data, we needed entire profiles.
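
To show why a single card number is comparatively easy to get right, here is a rough sketch of generating one: pick random digits after the leading 4, then compute the last digit so that the Luhn sum is divisible by 10. The helper names `luhn_check_digit` and `fake_visa_number` are hypothetical and are not taken from JFairy. The output is only shaped like a Visa number for test purposes.

```python
import random


def luhn_check_digit(partial: str) -> int:
    """Return the digit that makes `partial` + digit pass the Luhn check."""
    total = 0
    # The rightmost digit of `partial` will sit next to the check digit,
    # so it is the first one to be doubled.
    for position, ch in enumerate(reversed(partial)):
        digit = int(ch)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def fake_visa_number(rng: random.Random, length: int = 16) -> str:
    """Build a test number that starts with 4 and ends with a valid check digit."""
    body = "4" + "".join(str(rng.randint(0, 9)) for _ in range(length - 2))
    return body + str(luhn_check_digit(body))


print(luhn_check_digit("457448740535156"))  # 7, the last digit of the example above
print(fake_visa_number(random.Random(1)))   # a 16-digit number starting with 4
```

Passing in a seeded `random.Random` keeps the generated test data reproducible between runs.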

Verifiable profiles need different kinds of data that relate to each other logically

Credit card numbers are relatively easy to generate, because they only relate to themselves. But personal identity numbers often relate to other things about a person. Take the Swedish personal identity number, commonly called the personnummer.

For those of you who don’t know, personnummers are designed for paying taxes, sort of like an American Social Security number. But they’re also used as a way to access services like healthcare and schools as well as non-governmental services like credit ratings.

The format of a personnummer is slightly different than that of a credit card. It is a 10 digit number split into a six digit section and a four digit section connected by a hyphen.

Cool fact: Swedes over the age of 100 replace the hyphen in their personnummer with a plus sign.

The first six digits in the personnummer are simple and correspond to the person’s birthday using a YYMMDD format. Of the second 4 digit section, the first three are a serial number. The third serial number digit is odd for males and even for females. The last number is a checksum digit.

So if you take the personnummer:

601128-9235

You know that it is for a man born November 28th, 1960.

60 (year) 11 (month) 28 (day) - (under 100 years old) 92 (serial digits) 3 (odd serial digit for a male) 5 (checksum digit)
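
The same breakdown can be written as a small parser. This is a sketch under the format described above (six date digits, a hyphen or plus sign, three serial digits, one checksum digit). The names `PersonnummerParts` and `parse_personnummer` are invented for this example, and the code does not check the date or the checksum.

```python
from dataclasses import dataclass


@dataclass
class PersonnummerParts:
    birth_yymmdd: str  # six-digit date section, YYMMDD
    over_100: bool     # True when the separator is "+"
    serial: str        # three-digit serial number
    sex: str           # "male" if the third serial digit is odd, else "female"
    checksum: int      # final check digit


def parse_personnummer(value: str) -> PersonnummerParts:
    """Split a 10-digit personnummer such as 601128-9235 into its parts."""
    value = value.strip().replace("\u2013", "-")  # tolerate an en dash separator
    if len(value) != 11 or value[6] not in ("-", "+"):
        raise ValueError("expected YYMMDD-NNNC or YYMMDD+NNNC")
    date_part, separator, tail = value[:6], value[6], value[7:]
    digits = date_part + tail
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("date and serial sections must be digits")
    serial = tail[:3]
    return PersonnummerParts(
        birth_yymmdd=date_part,
        over_100=(separator == "+"),
        serial=serial,
        sex="male" if int(serial[2]) % 2 == 1 else "female",
        checksum=int(tail[3]),
    )


parts = parse_personnummer("601128-9235")
print(parts.birth_yymmdd, parts.sex, parts.over_100, parts.checksum)
# 601128 male False 5
```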

To calculate the checksum, multiply each of the first nine digits of the identity number by the corresponding digit in the number 212121-212.

(6x2)(0x1)(1x2)(1x1)(2x2)(8x1)(9x2)(2x1)(3x2)

12, 0, 2, 1, 4, 8, 18, 2, 6

Just like with the Visa card above, if the product of any of these numbers results in a two digit number, simply add the two digits together.

3, 0, 2, 1, 4, 8, 9, 2, 6

Add all the resulting digits together.

3+0+2+1+4+8+9+2+6=35

To get the checksum digit, subtract the last digit of the added products from 10 (the exception is that if the last digit is zero, the checksum is also zero).

10 - 5 = 5
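
The checksum steps above fit in one short function. This is a minimal sketch. The name `personnummer_checksum` is made up for this example, and the input is assumed to be the first nine digits with the separator removed.

```python
def personnummer_checksum(nine_digits: str) -> int:
    """Compute the checksum digit from the first nine digits (no separator)."""
    weights = [2, 1, 2, 1, 2, 1, 2, 1, 2]  # the digits of 212121-212
    if len(nine_digits) != 9 or not nine_digits.isdigit():
        raise ValueError("expected exactly nine digits")
    total = 0
    for ch, weight in zip(nine_digits, weights):
        product = int(ch) * weight
        # A two-digit product contributes the sum of its digits.
        total += product // 10 + product % 10
    # Subtract the last digit of the sum from 10; a last digit of 0 gives 0.
    return (10 - total % 10) % 10


print(personnummer_checksum("601128923"))  # 5
```

The final `% 10` covers the exception mentioned above: when the sum ends in zero, the checksum is zero and not ten.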

So if you were going to generate a profile of this person, it couldn’t be of a woman born on April 10th, 1916. Her personnummer would have to be something like: 160410+1244. In other words, you couldn’t just come up with a random number and expect it to work with just any fake profile you’ve generated.
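
To make the link between a profile and its number concrete, here is a rough sketch that derives a personnummer-shaped string from a fake person's birth date and sex. It is not JFairy's implementation. The function name `fake_personnummer` is hypothetical, the serial digits are chosen freely, the reference date is passed in explicitly, and the plus sign is applied from age 100, which is a simplification.

```python
import random
from datetime import date


def fake_personnummer(birth_date: date, sex: str, today: date,
                      rng: random.Random) -> str:
    """Build a personnummer-shaped string consistent with a fake profile."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    date_part = birth_date.strftime("%y%m%d")  # YYMMDD
    # Third serial digit: odd for males, even for females.
    parity = 1 if sex == "male" else 0
    serial = f"{rng.randint(0, 99):02d}{rng.randrange(parity, 10, 2)}"
    # Checksum over the nine digits, using the weights 212121-212.
    total = 0
    for ch, weight in zip(date_part + serial, [2, 1, 2, 1, 2, 1, 2, 1, 2]):
        product = int(ch) * weight
        total += product // 10 + product % 10
    checksum = (10 - total % 10) % 10
    # Assumption for this sketch: "+" replaces "-" once the person is 100.
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    separator = "+" if age >= 100 else "-"
    return f"{date_part}{separator}{serial}{checksum}"


rng = random.Random(42)
print(fake_personnummer(date(1916, 4, 10), "female", date(2018, 1, 1), rng))
print(fake_personnummer(date(1960, 11, 28), "male", date(2018, 1, 1), rng))
```

The first result starts with 160410+ and has an even third serial digit. The second starts with 601128- and has an odd one. In both cases the number follows from the profile, not the other way around.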

We needed logical test data

The data would need to relate to each other in a logical way, since the personnummer isn’t the only piece of data that is built on outside information. Most types of identification numbers relate to other information in some way. We simply couldn’t find a test data generator which would do that, so we decided to build our own. It looks like we weren’t the only ones having this problem.

JFairy

As regular contributors to the open source community, we decided that the best way to generate the test data we needed was to build our own library. We called it JFairy, and our goal was for it to generate sets of data that were all verifiable and logically connected.

This way we could populate our app with users. Our user data couldn’t be gibberish, or else it couldn’t be entered into the system. So we put the library to work and it performed better than we could have expected. It even generates data that matches real people from time to time. We found this out because we used Gravatar to show the candidate pictures. We were surprised when a real photo appeared on our test account.

This was really useful when we started shopping around our app. We wanted to show enterprise clients an account with 300 different test candidates on the platform. If we hadn’t built JFairy, we might have all tried to use the app a few times, but there were only five of us on the team. It would have been impractical for the five of us to come up with 300 logically connected fake profiles.

The data generated by JFairy proved to be so convincing that new customers were puzzled as to where we had gotten all of these people to test. In fact, they asked us if we could help them with sourcing new developers, as clearly we were in touch with a number of people who have technical backgrounds, some of whom actually had validated skills.

We needed to let the open source community have a look at JFairy

We realized that this was becoming something bigger than ourselves, so we decided to release the system as open source. The first reason is that we are all avid users of open source code. We know that it’s important to give back to that community in order to get things in return. But on top of that, open source can bring real benefits back to the product. By putting our project out there so that a number of different developers can take a look at it, we can get some new ideas that we would never have considered.

The most notable contributions were the inclusion of new languages. We only built JFairy to generate data for English speakers and Polish speakers. After all, we are rather limited by the languages we know well. But of course, it could be a useful tool for people from any number of different countries. Through open source contributions, we’ve been able to add support for data in Spanish, French, German, Swedish, and Chinese.

We also realized that while we were reaching a great group of users in software developers, JFairy had applications well beyond a community whose members know how to code. So we decided to build on the success of the library and create an app which could support its use for more applications and more people.

DataFairy gives everyone access to fake data

JFairy proved to be super useful for developers who knew how to code, but they weren’t the only people out there who would use the data JFairy generated. Software testers need to be able to populate their systems to see if they work. Salespeople and marketers need data to make their demos look realistic. To make JFairy useful to as many people as possible, we had to make its fake data easy to access.

With that goal in mind, we built DataFairy. DataFairy is an app powered by JFairy so you can access our fake data without having to learn to code first. The data is presented in a neat notebook interface. To get more than one fake profile, you can either generate a new profile or export a bulk list of up to 100 profiles to a CSV file. It is a free and easy way to populate your software with logically connected valid data.

Our plans for DataFairy’s future

DataFairy can always be improved upon and have new features added to it. In addition to our own efforts, we want to stick to the tenets of the open source community. We continue to solicit new languages that we can add to our roster and we have an open GitHub project. We would also love to eventually have users add sample data. This will help us build a community of participants who will help DataFairy grow and become more useful for more people.

Whether you need to download large batches of logically validated data or simply want to have fun reading the profiles that pop up, check out DataFairy.
