Column AND Row Posted by David Dobrin

最新推荐文章于 2024-06-23 06:41:29 发布

Daniel_GJ_TAN

最新推荐文章于 2024-06-23 06:41:29 发布

阅读量460

点赞数

分类专栏： sap hana 文章标签： hana sap

sap hana 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

Let me start with stuff that we all know. HANA is a database. In the database, there are tables. The tables are essentially blank spaces where you can load collections of fields called records. In a row-oriented database, each row in the table contains a record, and each field in the record occupies space, even if there is no actual data there.

Here’s an example. Imagine a table of home addresses, which contains places to put the last name, street, city, and telephone number for a number of people. (Each of these “places” is called a field.) In a row-oriented table, the filled-in table would contain many lines (or “rows”), one for each address. Each line would contain all the values you had for that person, but if, for instance, you didn’t have a telephone number, the space for a telephone number would be left blank.

In a column-oriented table, each field occupies a separate column. When you load an address in, you don’t put everything on the same row. You just add any actual value you have to the end of the column that it belongs to. So you add the last name to the “Last Name” column, the city to the “City” column, etc. And if you don’t have a value, you don’t add it.

So how does the column-oriented table know that data which is now in different rows actually “belongs” to certain other values that you loaded in? (The technical term is “record.”) Well, essentially [handwaving alert!!] each value is tagged, and there’s another table where you use the tags to look up which record the value belongs to.

You can see that a row-oriented table is much easier to understand conceptually (it’s like rows in a spreadsheet), but a column-oriented table is much more compressed and, as it turns out, much faster if you want to get data back out again and work on it.

In a really interesting paper on HANA (Sikka, et al., “Efficient Transaction Processing in SAP HANA Database – The End of a Column Store Myth,”), Sikka characterizes row-oriented tables as “write-optimized,” and column-oriented tables as “read-optimized,” and this is certainly a fair description of the two kinds of tables’ essential strengths. Use rows for transactions, where you’re trying to enter data. Use columns for analytics, where you’re trying to look at data.

It would seem that this choice (row <it>or</it> column) is just built into the nature of the world. It’s either in rows or it’s in columns. And perhaps for that reason, most databases make a choice about how they’re going to operate. Either all the tables in the database are going to be row-oriented, or all the tables in the database are going to be column-oriented.

But not HANA. In HANA, a table can be row-oriented or column-oriented; the programmer (or data designer) simply decides which it is to be.

But that’s not all there is to it. Because, for a programmer, inside HANA, the choice of row <it>or</it> column is not the same as it would be elsewhere. In HANA, if you decide that a table is to be column-oriented, you don’t necessarily give up on write-optimization (what row tables are good at). HANA has set up an internal system for column-oriented tables that essentially [hand-waving alert] allows you to enter data into a buffer that is row-oriented, then, in a series of steps, push the data into the column-oriented table, without impairing the performance of the column-oriented table.

So, in two quite different senses, HANA is both row-oriented <it>and</it> column-oriented. A. It allows you to create either row-oriented tables or column-oriented tables. B. If you create column-oriented tables, the column-oriented table has a row-oriented buffer for data entry which is write-optimized, so effectively [hand-waving alert] the column table behave as if it were a row table.

Write-Optimizing the Column

It’s worthwhile to understand how this is done, because it gives you some insight into how HANA is using in-memory.

The following diagram is drawn from the aforementioned paper.

Figure 4. Overview of the unified table concept

Source: “Efficient Transaction Processing in SAP HANA Database – The End of a Column Store Myth” Vishal Sikka, et al., pp. 731-741.

The columnar table in HANA can act like a row store for one simple reason. It is a row store. As you can see from the diagram above, there is a buffer store (L1) which is row-oriented. Data coming in goes into L1.

Every so often, an L1 buffer is effectively closed, and the data in it is inserted into a column-oriented buffer (L2). This involves the typical column-oriented things, where the values are added to their respective columns and the fact that they belong to a particular record is recorded in another table.

Every so often, the L2 buffer is closed and the data is pushed into the primary columnar store.

To get the data into the columnar table without hurting performance, it first puts the data into a “write-optimized” [row] store (L1). It then takes that data and restructures it, pushing it into a second store (L2), which is column-oriented. Finally, it pushes the data in L2 into the main column-oriented table which was the intended destination all along.

This process would be weirdly byzantine and slow if it weren’t done in-memory (and in parallel). But since L1 and L2 and the final column store are in memory and all the processing needed to make this work is done in memory, it can all be done in the background. And because L1 and L2 are relatively small, the data in them can be made available to other database operations.

So, for instance, if you run a query on the columnar database while some data is being added, the query just checks L1 and L2, as well as the primary store Since L1 and L2 are always relatively small and since the query is done in-memory, the performance cost of doing this is small.

When Do You Use Row, When Column: One Man’s Opinion

You can see, now, why I chose to do some hand-waving when I was explaining the design idea behind HANA. L1, L2, write-optimized, blah, blah, Important to know if you want to know what it does. But hard to grasp if you’re just trying to get the idea.

Besides, a little learning is a dangerous thing. When you know things, you start asking questions. Why provide that much flexibility? What is it good for? How will people use it? What does it enable? If we were working at the hand-waving level, you just wouldn’t be asking questions like that.

But now that I’ve given you the detail, I have some responsibility for explaining what this row-and-column stuff is doing. I myself can’t answer from my own experience. So I did the next best thing. I asked people who have built products in HANA

I started with Adam Thier, whose job title at SAP used to include the words “database” and “architect.”

Why or when do you use row-oriented tables and when do you use columnnar tables?

“I am a financial software lifer [he worked on Essbase at Hyperion and is now working on EPM at SAP] in a long term platonic relationship with boring algorithms like “depreciation” and “allocations”; and I’ve ground my teeth to nubs trying to make basic algebra fly in a relational database. That’s why I jumped ship to Columnar back in the 1990s and never looked back. I typically always start there with the exception of long text or Blobs (for things like descriptions).

“However, even in the world of HANA, that’s not a constant because of the wonders of text analysis.”

So, if you have to do calculations on the data, you use columns, and if you have to do other things, like store text, then you use rows, except when you don’t.

That seems clear enough.

When Do You Use Column, When Row: Another Man’s Opinion

Not content with that, I also asked Vijay Vijayasankar, who has done a fair amount of HANA development and has served as an InnoJam judge with me. He’s the sort of guy who would probably prefer to post on this himself, so I’ll just pick and choose from his answer.

“HANA can have column or row stores; there is no technical limitation. My simple rule of thumb is ‘use a column table until proven wrong.’

“All data is going to be stored in more than one table, so you also need to think about how you will join the tables. The worst performing apps are the ones that try to join columnar and row stores. The reason for the drop in performance is that the result of such a join needs to be temporarily materialized, which kind of defeats the purpose of HANA.’

“There are other constraints. You cannot build a column view if the underlying table is a row table. A row store table is loaded into memory the moment the HANA engine starts, whereas a column gets loaded into memory when a query needs it.

So again, you use columns, except when you use rows, except for when columns are better.

So Does It Matter?

I went back to Adam with a bit of a puzzled look on my face, and said, “?”

“The beauty is that with HANA it doesn’t matter. I can start columnar, and if I stumble I can spend a few hours and reconfigure certain tables over to rows, with a smile on my face the whole time.”

So, in effect, not all that much has changed in HANA? You did columns in the past, and now you’re still doing columns?

“No. I’m doing columns in a different, more transactional way. Remember, I — and a whole bunch of people much smarter than me — who go back to the original MOLAP engines (essentially column stores) never cracked the code on a hybrid database — which meant we never cracked the code on how to do an insert in a MOLAP structure without then taking it off-line to re-index. This isn’t a big deal — it’s a huge deal.”

Simple and Fruitful

In my earlier blog, I said there were too tests to apply to the design of HANA. Is it simplifying? And is it fruitful?

This post might seem to suggest that there’s a problem with the “simplification” criterion. But I think that’s a deep misreading of the situation.

To begin with, it’s clear that even in the simple case, where columnar tables are being created because what you want is ultrafast analysis, the underlying design that I described in the first blog post makes a material difference.

* In this design, for instance, the columnar tables can be updated continuously, whereas they can’t be in most analytic databases. (You need to take time out to reindex.)

* In this design, too, the creation of views–like the multi-dimensional view you see in most analytic databases–is logically separate from the table structure, so you can do lots of different views, including views that do algebra, without rebuilding the tables, and you can change them as much as you want.

* In this design, as well, you can embed things like statistical functions right in the database, whereas in most analytic databases, you have to pull the data out to do a lot of statistical analysis.

And the more complex cases? I’m not sure we know the answer yet. What we do know is that at least two sharp database guys are pretty much saying, “It depends,” when you ask then how to use the “row and column” structure. That could be because the structure isn’t all that useful, but it could also be that the flexibility that Adam describes gives you a lot of options.

At this point, should one have an absolutely clear idea of how people should use this flexibility?

I don’t think so. Certainly, you saw a similar fluidity in original electric car example. The basic design idea was there in the battery-powered ignition system for the 1912 Cadillac. But at the time, no one was quite sure about where it was going to lead. The original designers were mostly just trying to get rid of one limitation in automotive design, the fact that you had to turn a crank to start the car. (According to legend, the inventor came up with the idea after a wayward crank had broken the jaw of a friend of his, who later died of gangrene.) They certainly saw that there was more to the idea than that. But it would take at least ten years of experimentation before the standard modern automotive electrical system would emerge full-blown.

About the Author

I run a small analyst firm in Cambridge, Massachusetts that does strategy consulting in most areas of enterprise applications. I am not a database expert, but for the past year, I have been doing a lot of work with SAP related to HANA, so I’m reasonably familiar with it.

SAP is a customer, but they did not commission this piece, and they did not edit or offer to edit it in any way.

Daniel_GJ_TAN

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Column AND Row Posted by David Dobrin

Let me start with stuff that we all know. HANA is a database. In the database, there are tables. The tables are essentially blank spaces where you can load collections of fields called records. In a
复制链接

扫一扫