Java开发者写SQL时常需要注意的8+13个地方

最新推荐文章于 2022-03-26 17:10:25 发布

rosyouth

最新推荐文章于 2022-03-26 17:10:25 发布

阅读量1.7k

点赞数

文章标签：数据库

Java开发者写SQL时常需要注意的8+13个地方

(1)Understanding NULL

Is this predicate correct with respect to NULL?
Does NULL affect the result of this function?

Misunderstanding NULL is probably the biggest mistake a Java developer can make when writing SQL. This is also (but not exclusively) due to the fact that NULL is also called UNKNOWN. If it were only called UNKNOWN, it would be easier to understand. Another reason is that JDBC maps SQL NULL to Java null when fetching data or when binding variables. This may lead to thinking that NULL = NULL (SQL) would behave the same way as null == null (Java).

Train yourself. There’s nothing but explicitly thinking about NULL, every time you write SQL:

SELECT l.*
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL

SELECT l.*
FROM t_left l
WHERE l.value NOT IN
(
SELECT value
FROM t_right r
)

SELECT l.*
FROM t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM t_right r
WHERE r.value = l.value
)

(2)Processing data in Database
Every time you implement a data-centric algorithm in Java, ask yourself: Is there a way to let the database perform that work for me? SQL databases support advanced (and SQL standard!) OLAP features that tend to perform a lot better and are much easier to write. A (non-standard) example is Oracle’s awesome MODEL clause. Just let the database do the processing and fetch only the results into Java memory. Because after all some very smart guys have optimised these expensive products. So in fact, by moving OLAP to the database, you gain two things:

Simplicity. It’s probably easier to write correctly in SQL than in JavaPerformance. The database will probably be faster than your algorithm. And more importantly, you don’t have to transmit millions of records over the wire.

(3)Using UNION ALL instead of UNION
Every time you write a UNION, think if you actually wanted to write UNION ALL. Not only is the removal of duplicates rarely needed (or sometimes even wrong), it is also quite slow for large result sets with many columns, as the two subselects need to be ordered, and each tuple needs to be compared with its subsequent tuple.

Note that even if the SQL standard specifies INTERSECT ALL and EXCEPT ALL, hardly any database implements these less useful set operations.

(4)Using JDBC Pagination to paginate large results

Most databases support some way of paginating ordered results through LIMIT .. OFFSET, TOP .. START AT, OFFSET .. FETCH clauses. In the absence of support for these clauses, there is still the possibility for ROWNUM (Oracle) or ROW_NUMBER() OVER() filtering (DB2, SQL Server 2008 and less), which is much faster than pagination in memory. This is specifically true for large offsets!

(5)Using the MERGE statement

If you’re UPSERTING by chaining INSERT and UPDATE or by chaining SELECT .. FOR UPDATE and then INSERT or UPDATE, think again. Apart from risking race conditions, you might be able to express a simpler MERGE statement.

This isn’t really a mistake, but probably some lack of knowledge or some fear towards the powerful MERGE statement. Some databases know other forms of UPSERT statements, e.g. MySQL’s ON DUPLICATE KEY UPDATE clause. But MERGE is really so powerful, most importantly in databases that heavily extend the SQL standard, such as SQL Server.

(6)Using window functions instead of aggregate functions

When you write a GROUP BY clause in a subquery, think again if this cannot be done with a window function. Before the introduction of window functions, the only means to aggregate data in SQL was by using a GROUP BY clause along with aggregate functions in the projection. This works well in many cases, and if aggregation data needed to be enriched with regular data, the grouped query can be pushed down into a joined subquery.

But SQL:2003 defined window functions, which are implemented by many popular database vendors. Window functions can aggregate data on result sets that are not grouped. In fact, each window function supports its own, independent PARTITION BY clause, which is an awesome tool for reporting.Using window functions will:

Lead to more readable SQL (less dedicated GROUP BY clauses in subqueries)
Improve performance, as a RDBMS is likely to optimise window functions more easily

A window function in SQL:2003 is an aggregate function applied to a partition of the result set.For example,
sum(population) OVER( PARTITION BY city )
calculates the sum of the populations of all rows having the same city value as the current row.

(7)Using in-memory sorting for sort indirections
If you sort any SQL data in memory, think again if you cannot push sorting into your database. This goes along well with pushing pagination into the database.

The SQL ORDER BY clause supports many types of expressions, including CASE statements, which can be very useful for sort indirections.

(8)Always batch-insert large sets of data.
JDBC knows batching, and you should use it. Do not INSERT thousands of records one by one, re-creating a new PreparedStatement every time. If all of your records go to the same table, create a batch INSERT statement with a single SQL statement and multiple bind value sets. Depending on your database and database configuration, you may need to commit after a certain amount of inserted records, in order to keep the UNDO log slim.

(1)Always Using PreparedStatements
默认情况下，总是使用PreparedStatements来代替静态声明语句，而永远不要在你的SQL语句嵌入内联绑定值。

You can omit syntax errors originating from bad string concatenation when inlining bind values.
You can omit SQL injection vulnerabilities from bad string concatenation when inlining bind values.
You can avoid edge-cases when inlining more “sophisticated” data types, such as TIMESTAMP, binary data, and others.
You can keep open PreparedStatements around for a while, reusing them with new bind values instead of closing them immediately (useful in Postgres, for instance).
You can make use of adaptive cursor sharing (Oracle-speak) in more sophisticated databases. This helps prevent hard-parsing SQL statements for every new set of bind values.

there are some rare cases when you actually want to inline bind values in order to give your database’s cost-based optimiser some heads-up about what kind of data is really going to be affected by the query. Typically, this results in “constant” predicates such as:

DELETED = 1
STATUS = 42
But it shouldn’t result in “variable” predicates such as:

FIRST_NAME LIKE “Jon%”
AMOUNT > 19.95

(2)永远不要用select *（这样的查询），它不光会影响你的数据库执行计划，也会对你的Java应用造成不好的影响。也不要在执行不同请求时复用相同的映射。尽量尝试减少映射到你所真正需要的数据。

需要注意的是，想在对象-关系映射（ORMs）上达成这个目标有些难。

(3)SQL developers should be aware of the fact that the JOIN clause is not part of the SELECT statement per se. Always think of your FROM clause to expect a table reference as a whole. If you write a JOIN clause, think of this JOIN clause to be part of a complex table reference:

SELECT c.first_name, c.last_name, o.amount
FROM customer_view c

INNER JOIN order_view o ON c.cust_id = o.cust_id

update dept d

inner join employee e on d.dno=e.dno

set d.empcnt = d.empcnt-1

where e.empname='johnny'

delete e from employee e

inner join dept d on e.dno = d.dno

where d.dname='hr'

基于表引用（table references）的复杂性，一些数据库也接受其它类型的复杂的表引用（table references），像INSERT、UPDATE、DELETE、MERGE。一定要用ANSI 92标准的JOIN语句。

(4)使用LIKE判定时记得ESCAPE

当允许用户对你的SQL查询进行参数输入时，就应该使用ESCAPE关键字。尽管数据中含有百分号（%）的情况很罕见，但下划线（_）还是很常见的：
SELECT * FROM t
WHERE t.x LIKE 'some!_prefix%' ESCAPE '!'

SELECT * FROM t
WHERE REGEXP_LIKE (t.x, '^some_prefix')

(5)NOT (A IN (X, Y)) 和 IN (X, Y) 的布尔值相反?

对于NULLs，这是一个举足轻重的细节！让我们看看 A IN (X, Y) 真正意思吧：

A IN (X, Y)
is the same as A = ANY (X, Y)
is the same as A = X OR A = Y

同样的，NOT (A IN (X, Y))的真正意思：

NOT (A IN (X, Y))
is the same as A NOT IN (X, Y)
is the same as A != ANY (X, Y)
is the same as A != X AND A != Y

看起来和之前说的布尔值相反一样？其实不是。如果X或Y中任何一个为NULL，NOT IN 条件产生的结果将是UNKNOWN，但是IN条件可能依然会返回一个布尔值。

或者换种说话，当 A IN (X, Y) 结果为TRUE或FALSE时，NOT(A IN (X, Y)) 结果为依然UNKNOWN而不是FALSE或TRUE。注意了，如果IN条件的右边是一个子查询，结果依旧。

不信？你自己看SQL Fiddle 去。它说了如下查询给不出结果：

SELECT 1
WHERE 1 IN (NULL)
UNION ALL
SELECT 2
WHERE NOT(1 IN (NULL))

解决方案：当涉及到可为NULL的列时，注意NOT IN条件。

(6)使用行值表达式
不管干什么都可以使用行值表达式。它们会让你的SQL语句更加简洁高效。
行值表达式是SQL一个很棒的特性。SQL是一个以表格为中心的语言，表格又是以行为中心。通过创建能在同等级或行类型进行比较的点对点行模型，行值表达式让你能更容易的描述复杂的判定条件。一个简单的例子是，同时请求客户的姓名:

SELECT c.address
FROM customer c,
WHERE (c.first_name, c.last_name) = (?, ?)

可以看出，就将每行的谓词左边和与之对应的右边比较这个语法而言，行值表达式的语法更加简洁。特别是在有许多独立条件通过AND连接的时候就特别有效。行值表达式允许你将相互联系的条件放在一起。对于有外键的JOIN表达式来说，它更有用：

SELECT c.first_name, c.last_name, a.street
FROM customer c
JOIN address a
ON (c.id, c.tenant_id) = (a.id, a.tenant_id)

不幸的是，并不是所有数据库都支持行值表达式。但SQL标准已经在1992对行值表达式进行了定义，如果你使用他们，像Oracle或Postgres这些的复杂数据库可以使用它们计算出更好的执行计划。在Use The Index, Luke这个页面上有解析。

(7)NOT (A IS NULL)和A IS NOT NULL不一定是一样的

处理NULL值的时候，SQL实现了三值逻辑。这就是我们能用NULL条件来检测NULL值的原因。对么？没错。

但在NULL条件容易遗漏的情况下。要意识到下面这两个条件仅仅在行值表达式（row value expressions）为1的时候才相等：

NOT (A IS NULL)
is not the same as
A IS NOT NULL

如果A是一个大于1的行值表达式（row value expressions），正确的表将按照如下方式转换：

如果A的所有值为NUll，A IS NULL为TRUE
如果A的所有值为NUll，NOT(A IS NULL) 为FALSE
如果A的所有值都不是NUll，A IS NOT NULL 为TRUE
如果A的所有值都不是NUll，NOT(A IS NOT NULL) 为FALSE

Note: For all R, "R IS NOT NULL" has the same result as "NOT
R IS NULL" if and only if R is of degree 1. Table 12, "<null
predicate> semantics", specifies this behavior.

A, B) IS NOT NULL -- equivalent to...
A IS NOT NULL AND B IS NOT NULL -- equivalent to...
NOT(A IS NULL) AND NOT(B IS NULL) -- equivalent to...
NOT(A IS NULL OR B IS NULL)

-- whereas...
NOT((A, B) IS NULL) -- equivalent to...
NOT(A IS NULL AND B IS NULL)

The truth table also nicely documents this:

+----------------+-------+-------------+------------+--------------+
| | R IS | R IS NOT | NOT R IS | NOT R IS NOT |
| Expression | NULL | NULL | NULL | NULL |
+----------------+-------+-------------+------------+--------------+
| degree 1: null | true | false | false | true |
| degree 1: not | false | true | true | false |
| null | | | | |
| degree > 1: | true | false | false | true |
| all null | | | | |
| degree > 1: | false | false | true | true |
| some null | | | | |
| degree > 1: | false | true | true | false |
| none null | | | | |
+----------------+-------+-------------+------------+--------------+

(8)定义足够的限制条件（constraints[PK, FK, UK, NN, CK, DEFAULT]）
尽可能定义足够多的限制条件（constraints）。它们将帮你更好的执行数据库请求。对你的元数据使用限制条件不能更赞了。首先，限制条件可以帮你防止数据质变，光这一点就很有用。但对我来说更重要的是，限制条件可以帮助数据库进行SQL语句转换，数据库可以决定。

Now let’s have another look at NOT NULL constraints. If you’re using Oracle, NULL values will not be part of your index. This doesn’t matter if you’re expressing an IN constraint, for instance:

SELECT * FROM table
WHERE value IN (
SELECT nullable_column FROM ...
)

But what happens with a NOT IN constraint?

SELECT * FROM table
WHERE value NOT IN (
SELECT nullable_column FROM ...
)

Due to SQL’s slightly unintuitive way of handling NULL, there is a slight risk of the second query unexpectedly not returning any results at all, namely if there is at least one NULL value as a result from the subquery. This is true for all databases that get SQL right.

But because the index on nullable_column doesn’t contain any NULL values, Oracle has to look up the complete content in the table, resulting in a FULL TABLE SCAN. Now that is unexpected! Details about this can be seen in this article.

Carefully review all your nullable, yet indexed columns, and check if you really cannot add a NOT NULL constraint to those columns.

哪些值是等价的
哪些子句是冗余的
哪些子句是无效的（例如，会返回空值的语句）

(8)50ms的查询速度并不快
Relational databases are hyper fast. In fact, they’re so fast, they can parse your 20kb query text, calculate 2000-line execution plans, and actually execute that monster in less than a millisecond, if you and your DBA get along well and tune your database to the max.它们会变慢的原因有两种：一是你的应用滥用流行的ORM；二是ORM无法针对你复杂的查询逻辑产生快的SQL语句。遇到这种情况，你就要考虑选择像 JDBC、jOOQ 或MyBatis这样的更贴近SQL核心，能更好的控制你的SQL语句的API。

因此，不要认为查询速度50ms是很快或者可以接受的。完全不是！如果你程序运行时间是这样的，请检查你的执行计划。这种潜在危险可能会在你执行更复杂的上下文或数据中爆发。

(9)table expressions
While common table expressions obviously offer readability improvements, they may also offer performance improvements.

-- "payment" contains only a single payment
-- But it contains all the columns that we'll need afterwards
WITH payment AS (
SELECT cur_id, org_id, amount
FROM payments
WHERE id = :p_id
)
SELECT round(p.amount * e.bid / c.factor, 0)

-- Then, we simply don't need to repeat the
-- currencies / exchange_rates joins twice
FROM payment p
JOIN currencies c ON p.cur_id = c.id
JOIN exchange_rates e ON e.cur_id = p.cur_id
AND e.org_id = p.org_id
Note, that we’ve also replaced table lists with ANSI JOINs as suggested in our previous list.

If you’re lucky enough and you’re using one of those databases that supports window functions, chances are incredibly high (100%) that you also have common table expression support. This is another reason for you to migrate from MySQL to PostgreSQL, or appreciate the fact that you can work on an awesome commercial database.

Common table expressions are like local variables in SQL. In every large statement, you should consider using them, as soon as you feel that you’ve written something before.

(10)Using row value expressions for UPDATEs
Use row value expressions. Where ever you can. They make your SQL code incredibly more expressive, and chances are, they make it faster, as well.

We’ve advertised the use of row value expressions in our previous listing. They’re very readable and intuitive, and often also promote using certain indexes, e.g. in PostgreSQL.

But few people know that they can also be used in an UPDATE statement, in most databases. Check out the following query, which I again found in a customer’s PL/SQL package (simplified again, of course):

UPDATE u
SET n = (SELECT n + 1 FROM t WHERE u.n = t.n),
s = (SELECT 'x' || s FROM t WHERE u.n = t.n),
x = 3;
So this query takes a subquery as a data source for updating two columns, and the third column is updated “regularly”. How does it perform? Moderately:

-----------------------------
| Operation | Name |
-----------------------------
| UPDATE STATEMENT | |
| UPDATE | U |
| TABLE ACCESS FULL| U |
| TABLE ACCESS FULL| T |
| TABLE ACCESS FULL| T |
-----------------------------
Let’s ignore the full table scans, as this query is constructed. The actual query could leverage indexes. But T is accessed twice, i.e. in both subqueries. Oracle didn’t seem to be able to apply scalar subquery caching in this case.

To the rescue: row value expressions. Let’s simply rephrase our UPDATE to this:

UPDATE u
SET (n, s) = ((
SELECT n + 1, 'x' || s FROM t WHERE u.n = t.n
)),
x = 3;
Let’s ignore the funny, Oracle-specific double-parentheses syntax for the right hand side of such a row value expression assignment, but let’s appreciate the fact that we can easily assign a new value to the tuple (n, s) in one go! Note, we could have also written this, instead, and assign x as well:

UPDATE u
SET (n, s, x) = ((
SELECT n + 1, 'x' || s, 3
FROM t WHERE u.n = t.n
));
As you will have expected, the execution plan has also improved, and T is accessed only once:

-----------------------------
| Operation | Name |
-----------------------------
| UPDATE STATEMENT | |
| UPDATE | U |
| TABLE ACCESS FULL| U |
| TABLE ACCESS FULL| T |
-----------------------------

(11)Using PostgreSQL when you could use MySQL

To some, this may appear to be a bit of a hipster discussion. But let’s consider the facts:

MySQL claims to be the “most popular Open Source database”.
PostgreSQL claims to be the “most advanced Open Source database”.
Let’s consider a bit of history. MySQL has always been very easy to install, maintain, and it has had a great and active community. This has lead to MySQL still being the RDBMS of choice with virtually every web hoster on this planet. Those hosters also host PHP, which was equally easy to install, and maintain.

The same can be said about MySQL. MySQL has always worked somehow, but while commercial databases like Oracle have made tremendous progress both in terms of query optimisation and feature scope, MySQL has hardly moved in the last decade.

(12)Using the BOOLEAN type correctly

This is not really a mistake per se. It’s just again something that hardly anyone knows. When the SQL:1999 standard introduced the new BOOLEAN data type, they really did it right. Because before, we already had something like booleans in SQL. We’ve had <search condition> in SQL-92, which are essentially predicates for use in WHERE, ON, and HAVING clauses, as well as in CASE expressions.

SQL:1999, however, simply defined the new <boolean value expression> as a regular <value expression>, and redefined the <search condition> as such:

<search condition> ::=
<boolean value expression>
Done! Now, for most of us Java / Scala / etc. developers, this doesn’t seem like such an innovation. Heck it’s a boolean. Obviuosly it can be interchangeably used as predicate and as variable.

But in the mind-set of the keyword-heavy SQL folks who have taken inspiration from COBOL when designing the language, this was quite a step forward.

Now, what does this mean? This means that you can use any predicate also as a column! For instance:

SELECT a, b, c
FROM (
SELECT EXISTS (SELECT ...) a,
MY_COL IN (1, 2, 3) b,
3 BETWEEN 4 AND 5 c
FROM MY_TABLE
) t
WHERE a AND b AND NOT(c)
This is a bit of a dummy query, agreed, but are you aware of how powerful this is?

Luckily, again, PostgreSQL fully supports this (unlike Oracle, which still doesn’t have any BOOLEAN data type in SQL).

Every now and then, using BOOLEAN types feels very right, so do it! You can transform boolean value expressions into predicates and predicates into boolean value expressions. They’re the same. This makes SQL all so powerful.

(13)Please do take 10 minutes, download Guava, and use its excellent and easy to set up cache, that ships with various built-in invalidation strategies.

http://www.importnew.com/15291.html

http://blog.jooq.org/2013/07/30/10-common-mistakes-java-developers-make-when-writing-sql/

http://blog.jooq.org/2014/05/26/yet-another-10-common-mistakes-java-developer-make-when-writing-sql-you-wont-believe-the-last-one/

http://blog.jooq.org/2013/08/12/10-more-common-mistakes-java-developers-make-when-writing-sql/