从SQL Server表中选择n个随机行

本文翻译自:Select n random rows from SQL Server table

I've got a SQL Server table with about 50,000 rows in it. 我有一个SQL Server表,其中包含大约50,000行。 I want to select about 5,000 of those rows at random. 我想随机选择大约5,000行。 I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND() , and then selecting from that table where the random number column < 0.1. 我想到了一个复杂的方法,创建一个带有“随机数”列的临时表,将我的表复制到其中,循环访问临时表并使用RAND()更新每一行,然后从该表中选择随机数列<0.1。 I'm looking for a simpler way to do it, in a single statement if possible. 我正在寻找一种更简单的方法,如果可能的话,在一个声明中。

This article suggest using the NEWID() function. 本文建议使用NEWID()函数。 That looks promising, but I can't see how I could reliably select a certain percentage of rows. 这看起来很有希望,但我看不出如何可靠地选择一定比例的行。

Anybody ever do this before? 有人曾经这样做过吗? Any ideas? 有任何想法吗?


#1楼

参考:https://stackoom.com/question/3YpU/从SQL-Server表中选择n个随机行


#2楼

Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns. MSDN上的大表中随机选择行有一个简单明了的解决方案,可以解决大规模的性能问题。

  SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

#3楼

If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this: 如果你(不像OP)需要特定数量的记录(这使得CHECKSUM方法难以实现)并且希望比TABLESAMPLE本身提供的更随机的样本,并且还希望比CHECKSUM更快的速度,你可以合并TABLESAMPLE和NEWID()方法,如下所示:

DECLARE @sampleCount int = 50
SET STATISTICS TIME ON

SELECT TOP (@sampleCount) * 
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()

SET STATISTICS TIME OFF

In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. 在我的情况下,这是随机性(它不是真的,我知道)和速度之间最直接的妥协。 Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop off in speed. 根据需要改变TABLESAMPLE百分比(或行) - 百分比越高,样本越随机,但预计速度会线性下降。 (Note that TABLESAMPLE will not accept a variable) (注意TABLESAMPLE不接受变量)


#4楼

Didn't quite see this variation in the answers yet. 尚未在答案中看到这种变化。 I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time. 在给定初始种子的情况下,我需要一个额外的约束来每次选择相同的行集。

For MS SQL: 对于MS SQL:

Minimum example: 最小例子:

select top 10 percent *
from table_name
order by rand(checksum(*))

Normalized execution time: 1.00 标准化执行时间:1.00

NewId() example: NewId()示例:

select top 10 percent *
from table_name
order by newid()

Normalized execution time: 1.02 标准化执行时间:1.02

NewId() is insignificantly slower than rand(checksum(*)) , so you may not want to use it against large record sets. NewId()rand(checksum(*))慢得多,所以你可能不想在大型记录集中使用它。

Selection with Initial Seed: 选择初始种子:

declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */

If you need to select the same set given a seed, this seems to work. 如果您需要选择给定种子的相同集合,这似乎有效。


#5楼

This link have a interesting comparison between Orderby(NEWID()) and other methods for tables with 1, 7, and 13 millions of rows. 此链接在Orderby(NEWID())和具有1,7和13百万行的表的其他方法之间进行了有趣的比较。

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; 通常,当在讨论组中询问有关如何选择随机行的问题时,建议使用NEWID查询; it is simple and works very well for small tables. 它很简单,适用于小桌子。

SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()

However, the NEWID query has a big drawback when you use it for large tables. 但是,当您将NEWID查询用于大型表时,它有一个很大的缺点。 The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. ORDER BY子句将表中的所有行复制到tempdb数据库中,并对它们进行排序。 This causes two problems: 这会导致两个问题:

  1. The sorting operation usually has a high cost associated with it. 分拣操作通常具有与之相关的高成本。 Sorting can use a lot of disk I/O and can run for a long time. 排序可以使用大量磁盘I / O并可以运行很长时间。
  2. In the worst-case scenario, tempdb can run out of space. 在最坏的情况下,tempdb可能会耗尽空间。 In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command. 在最好的情况下,tempdb会占用大量的磁盘空间,如果没有手动收缩命令,它们将永远不会被回收。

What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. 您需要的是一种随机选择不使用tempdb的行的方法,并且随着表变大而不会慢得多。 Here is a new idea on how to do that: 这是一个关于如何做到这一点的新想法:

SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. 这个查询背后的基本思想是我们想要为表中的每一行生成0到99之间的随机数,然后选择随机数小于指定百分比值的所有行。 In this example, we want approximately 10 percent of the rows selected randomly; 在这个例子中,我们希望随机选择大约10%的行; therefore, we choose all of the rows whose random number is less than 10. 因此,我们选择随机数小于10的所有行。

Please read the full article in MSDN . 请阅读MSDN中的完整文章。


#6楼

It appears newid() can't be used in where clause, so this solution requires an inner query: 看来newid()不能在where子句中使用,所以这个解决方案需要一个内部查询:

SELECT *
FROM (
    SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
    FROM MyTable
) vw
WHERE Rnd % 100 < 10        --10%
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值