50.Tips.and.Tricks.for.MongoDB.Developers --- Optimization Tips

All of the tips summarized below come from "50.Tips.and.Tricks.for.MongoDB.Developers"; if you are interested, the book is worth reading.

Tip #21: Minimize disk access

This tip is about minimizing disk access. Reading from memory is roughly a million
times faster than reading from disk (a disk access takes about 10ms, while a memory
access takes about 10ns). Two fixes come to mind right away: swap the disk for a
faster SSD, or add more RAM. Neither is the best solution. What this tip proposes
instead is to keep the frequently accessed documents/collections in memory at all
times, so that they are never evicted to disk.

Accessing data from RAM is fast and accessing data from disk is slow. Therefore, most
optimization techniques are basically fancy ways of minimizing the amount of disk
accesses.

Fuzzy Math
Reading from disk is (about) a million times slower than reading from memory.
Most spinning disk drives can access data in, say, 10 milliseconds, whereas memory
returns data in 10 nanoseconds. (This depends a lot on what kind of hard drive you
have and what kind of RAM you have, but we’ll do a very broad generalization that is
roughly accurate for most people.) This means that the ratio of disk time to RAM time
is 1 millisecond to 1 nanosecond. One millisecond is equal to one million nanoseconds,
so accessing disk takes (roughly) a million times longer than accessing RAM.
Thus, reading off of disk takes a really long time in computing terms.

Use SSDs
SSDs (solid state drives) are much faster than spinning hard disks for many things,
but they are often smaller, more expensive, are difficult to securely erase, and still
do not come close to the speed at which you can read from memory. This isn’t to
discourage you from using them: they usually work fantastically with MongoDB,
but they aren’t a magical cure-all.
Add more RAM
Adding more RAM means you have to hit disk less. However, adding RAM will
only get you so far—at some point, your data isn’t going to fit in RAM anymore.
So, the question becomes: how do we store terabytes (petabytes?) of data on disk, but
program an application that will mostly access data already in memory and move data
from disk to memory as infrequently as possible?
If you literally access all of your data randomly in real time, you’re just going to need
a lot of RAM. However, most applications don’t: recent data is accessed more than
older data, certain users are more active than others, certain regions have more customers
than others. Applications like these can be designed to keep certain documents
in memory and go to disk very infrequently.
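
As a hedged illustration of that design idea, here is a minimal sketch assuming a
hypothetical events collection with a date field (both names are assumptions, not
from the book): if queries only ever touch recent documents, only the pages holding
those documents need to stay resident in RAM.
// restrict queries to the last week so the working set stays small
var lastWeek = new Date(new Date().getTime() - 7*24*60*60*1000)
db.events.find({"date" : {"$gte" : lastWeek}})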

Tip #22: Use indexes to do more with less memory

The point of this tip: when you run a needle-in-a-haystack query, you can build an
index on the field being queried. This avoids a full table scan; walking the index
tree saves a large amount of memory and speeds up the lookup.

We’ll assume, for this book, that a page of memory is 4KB, although this is not universally
true.
So, let’s say you have a machine with 256GB of data and 16GB of memory. Let’s say
most of this data is in one collection and you query this collection. What does MongoDB
do?
MongoDB loads the first page of documents from disk into memory, and compares
those to your query. Then it loads the next page and compares those. Then it loads the
next page. And so on, through 256GB of data. It can’t take any shortcuts: it cannot
know if a document matches without looking at the document, so it must look at every
document. Thus, it will need to load all 256GB into memory (the OS takes care of
swapping the oldest pages out of memory as it needs room for new ones). This is going
to take a long, long time.
How can we avoid loading all 256GB into memory every time we do a query? We can
tell MongoDB to create an index on a given field, x, and MongoDB will create a tree of
the collection’s values for that field. MongoDB basically preprocesses the data, adding
every x value in the collection to an ordered tree (see Figure 3-2). Each index entry in
the tree contains a value of x and a pointer to the document with that x value. The tree
just contains a pointer to the document, not the document itself, meaning the index is
generally much smaller than the entire collection.
When your query includes x as part of the criteria, MongoDB will notice that it has an
index on x and will look through the ordered tree of values. Now, instead of looking
through every document, MongoDB can say, “Is the value I’m looking for greater than
or less than this tree node’s values? If greater, go to the right, if less, go to the left.” It
continues in this manner until it either finds the value it’s looking for or it sees that the
value it’s looking for doesn’t exist. If it finds the value, it then follows the pointer to
the actual document, loading that document’s page into memory and then returning it.

So, suppose we do a query that will end up matching a document or two in our collection.
If we do not use an index, we must load 64 million pages into memory from disk:
Pages of data: 256GB / (4KB/page) = 64 million pages
Suppose our index is about 80GB in size. Then the index is about 20 million pages in
size:
Number of pages in our index: 80GB / (4KB/page) = 20 million pages
However, the index is ordered, meaning that we don’t have to go through every entry:
we only have to load certain nodes. How many?
Number of pages of the index that must be loaded into memory: ln(20,000,000) =
17 pages
From 64,000,000 down to 17!
OK, so it isn’t exactly 17: once we’ve found the result in the index we need to load the
document from memory, so that’s another size-of-document pages loaded, plus nodes
in the tree might be more than one page in size. Still, it’s a tiny number of pages compared
with traversing the entire collection!
Hopefully you can now picture how indexes help queries go faster.
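
A minimal sketch of the above, assuming a hypothetical foo collection (the name is
an assumption): build the tree of x values, then run the needle-in-a-haystack query
through it. On MongoDB versions contemporary with this book, explain() reports a
BtreeCursor when the index is used, rather than a BasicCursor for a table scan:
> db.foo.ensureIndex({"x" : 1})
> db.foo.find({"x" : 42}).explain()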

Tip #23: Don’t always use an index

This tip is the mirror image of the one above: when the query is not a
needle-in-a-haystack lookup, an index no longer gives us an advantage, because the
index takes up memory and every write must first update the index as well.
Now that I have you reeling with the usefulness of indexes, let me warn you that they
should not be used for all queries. Suppose that, in the example above, instead of
fetching a few records we were returning about 90% of the documents in the collection.
If we use an index for this type of query, we’d end up looking through most of the index
tree, loading, say, 60GB of the index into memory. Then we’d have to follow all of the
pointers in the index, loading 230GB of data from the collection. We’d end up loading
230GB + 60GB = 290GB—more than if we hadn’t used an index at all!
Thus, indexes are generally most useful when you have a small subset of the total data
that you want returned. A good rule of thumb is that they stop being useful once you
are returning approximately half of the data in a collection.

If you have an index on a field but you’re doing a large query that would be less efficient
using that index, you can force MongoDB not to use an index by sorting by
{"$natural" : 1}. This sort means “return data in the order it appears on disk,”
which forces MongoDB to not use an index:
> db.foo.find().sort({"$natural" : 1})
If a query does not use an index, MongoDB does a table scan, which means it looks
through all of the documents in the collection to find the results.
Write speed
Every time a new record is added, removed, or updated, every index affected by the
change must be updated. Suppose you insert a document. For each index, MongoDB
has to find where the new document’s value falls on the index’s tree and then insert it
there. For deletes, it must find and remove an entry from the tree. For updates, it might
add a new index entry like an insert, remove an entry like a delete, or have to do both
if the value changes. Thus, indexes can add quite a lot of overhead to writes.
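
To make that overhead concrete, here is a hedged sketch (the collection and field
names are assumptions): with three indexes on a collection, a single insert has to
update three index trees in addition to writing the document itself.
> db.foo.ensureIndex({"x" : 1})
> db.foo.ensureIndex({"y" : 1})
> db.foo.ensureIndex({"z" : 1})
> db.foo.insert({"x" : 1, "y" : 2, "z" : 3})  // one document write plus three index-tree insertions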

Tip #24: Create indexes that cover your queries 

This tip is simple: when we query on several fields and only return some of them, we
can consider building a compound index over those fields.
If we only want certain fields returned and can include all of these fields in the index,
MongoDB can do a covered index query, where it never has to follow the pointers to
documents and just returns the index’s data to the client. So, for example, suppose we
have an index on some set of fields:
> db.foo.ensureIndex({"x" : 1, "y" : 1, "z" : 1})
Then if we query on the indexed fields and only request the indexed fields returned,
there’s no reason for MongoDB to load the full document:
> db.foo.find({"x" : criteria, "y" : criteria},
... {"x" : 1, "y" : 1, "z" : 1, "_id" : 0})
Now this query will only touch the data in the index; it never has to touch the
collection proper.
Notice that we include a clause "_id" : 0 in the fields-to-return argument. The _id is
always returned, by default, but it’s not part of our index so MongoDB would have to
go to the document to fetch the _id. Removing it from the fields-to-return means that
MongoDB can just return the values from the index.
If some queries only return a few fields, consider throwing these fields into your index
so that you can do covered index queries, even if they aren’t going to be searched on.
For example, z is not used in the query above, but it is a field in the fields-to-return
and, thus, the index.
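
To check that a query is actually covered, you can run explain() on it; on MongoDB
versions contemporary with this book, a covered query reports "indexOnly" : true in
the explain output (the criteria values below are placeholders):
> db.foo.find({"x" : 22, "y" : 33},
... {"x" : 1, "y" : 1, "z" : 1, "_id" : 0}).explain()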

Tip #25: Use compound indexes to make multiple queries fast

This is a follow-up to the tip above: there is some skill to using compound indexes,
too. When indexing several fields, put the most frequently accessed field first.
If possible, create a compound index that can be used by multiple queries. This isn’t
always possible, but if you do multiple queries with similar arguments, it may be.
Any query that matches the prefix of an index can use the index. Therefore, you want
to create indexes with the greatest number of criteria shared between queries.
Suppose that your application runs these queries:
collection.find({"x" : criteria, "y" : criteria, "z" : criteria})
collection.find({"z" : criteria, "y" : criteria, "w" : criteria})
collection.find({"y" : criteria, "w" : criteria})
As you can see, y is the only field that appears in each query, so that’s a very good
candidate to go in the index. z appears in the first two, and w appears in the second
two, so either of those would work as the next option (see more on index ordering in
“Tip #27: AND-queries should match as little as possible as fast as possible”
on page 30 and “Tip #28: OR-queries should match as much as possible as soon
as possible” on page 31).
We want to hit this index as much and as often as possible. If a certain query above is
more important than the others or will be run much more frequently, our index should
favor that one. For example, suppose the first query is going to be run thousands of
times more than the next two. Then we want to favor that one in our index:
collection.ensureIndex({"y" : 1, "z" : 1, "x" : 1})
Then the first query will be as highly optimized as possible, and the next two will use
the index for part of the query.
If all three queries will be run approximately the same amount, a good index might be:
collection.ensureIndex({"y" : 1, "w" : 1, "z" : 1})

Then all three will be able to use the index for the y criteria, the second two will be able
to use it for w, and the middle one will be able to fully use the index.
You can use explain to see how an index is being used on a query:
collection.find(criteria).explain()
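
Since any query matching a prefix of an index can use it, here is a hedged sketch of
which queries the index {"y" : 1, "w" : 1, "z" : 1} can serve (the foo collection and
the concrete values are assumptions):
db.foo.ensureIndex({"y" : 1, "w" : 1, "z" : 1})
db.foo.find({"y" : 10})              // uses the index: y is a prefix
db.foo.find({"y" : 10, "w" : 20})    // uses the index: y, w is a prefix
db.foo.find({"w" : 20})              // cannot use it: w alone is not a prefix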

Tip #26: Create hierarchical documents for faster scans

When an index cannot be created to speed up access, consider hierarchical documents;
the point is to let the scan reach the field we are looking for as quickly as possible.
Keeping your data organized hierarchically not only keeps it organized, but MongoDB
can also search it faster without an index (in some cases).
For example, suppose that you have a query that does not use an index. As mentioned
previously, MongoDB has to look through every document in the collection to see if
anything matches the query criteria. This can take a varying length of time, depending
on how you structure your documents.
Let’s say you have user documents with a flat structure like this:
{
    "_id" : id,
    "name" : username,
    "email" : email,
    "twitter" : username,
    "screenname" : username,
    "facebook" : username,
    "linkedin" : username,
    "phone" : number,
    "street" : street,
    "city" : city,
    "state" : state,
    "zip" : zip,
    "fax" : number
}
Now suppose we query:
> db.users.find({"zip" : "10003"})
What does MongoDB do? It has to look through every field of every document, looking
for the zip field.
By using embedded documents, we can create our own “tree” and let MongoDB do
this faster. Suppose we change our schema to look like this:
{
    "_id" : id,
    "name" : username,
    "online" : {
        "email" : email,
        "twitter" : username,
        "screenname" : username,
        "facebook" : username,
        "linkedin" : username
    },
    "address" : {
        "street" : street,
        "city" : city,
        "state" : state,
        "zip" : zip
    },
    "tele" : {
        "phone" : number,
        "fax" : number
    }
}
Now our query would look like this:
> db.users.find({"address.zip" : "10003"})
And MongoDB would only have to look at _id, name, and online before seeing that
address was a desired prefix and then looking for zip within that.

Tip #27: AND-queries should match as little as possible as fast
as possible

The point of this tip: when querying for documents that must satisfy several criteria,
if we can predict which criterion narrows the candidate set down the most, we should
put it first. In the example below, criterion C should go first. That way the many
documents that fail C are eliminated immediately, and we only have to find the
documents satisfying A and B among what remains.
Suppose we are querying for documents matching criteria A, B, and C. Now, let’s say
that criteria A matches 40,000 documents, B matches 9,000, and C matches 200. If we
query MongoDB with the criteria in the order given, it will not be very efficient:
MongoDB starts from the 40,000 documents matching A and still has to check B and
then C against them. If we instead put C first, MongoDB starts from only the 200
documents matching C and merely has to check B and A against those.
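
A hedged sketch of the reordering (A, B, and C stand for whatever criteria your query
uses, and criteria is a placeholder as elsewhere in this book):
collection.find({"A" : criteria, "B" : criteria, "C" : criteria})  // broadest first: 40,000 candidates survive A
collection.find({"C" : criteria, "B" : criteria, "A" : criteria})  // most selective first: only 200 survive C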

Tip #28: OR-queries should match as much as possible as soon
as possible

This is exactly the opposite of the tip above: here we want to put the criterion that
matches the most documents (A, in the example above) first.
OR-style queries are exactly the opposite of AND queries: try to put the most inclusive
clauses first, as MongoDB has to keep checking documents that aren’t part of the result
set yet for every match.
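
A hedged sketch with $or (again, A, B, and C are placeholders): clauses are checked
in the order given, so leading with the clause that matches the most documents leaves
fewer not-yet-matched documents for the later clauses to examine.
collection.find({"$or" : [{"A" : criteria},   // most inclusive clause first
                          {"B" : criteria},
                          {"C" : criteria}]})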
