Dremel made simple with Parquet

By @J_

Wednesday, 11 September 2013 

 

Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMSs. The performance and compression benefits of storing and processing large amounts of data this way are well documented in the academic literature as well as in several commercial analytical databases.

The goal is to keep I/O to a minimum by reading from a disk only the data required for the query. Using Parquet at Twitter, we experienced a reduction in size by one third on our large datasets. Scan times were also reduced to a fraction of the original in the common case of needing only a subset of the columns. The principle is quite simple: instead of a traditional row layout, the data is written one column at a time. While turning rows into columns is straightforward given a flat schema, it is more challenging when dealing with nested data structures.

We recently introduced Parquet, an open source file format for Hadoop that provides columnar storage. Initially a joint effort between Twitter and Cloudera, it now has many other contributors including companies like Criteo. Parquet stores nested data structures in a flat columnar format using a technique outlined in the Dremel paper from Google. Having implemented this model based on the paper, we decided to provide a more accessible explanation. We will first describe the general model used to represent nested data structures. Then we will explain how this model can be represented as a flat list of columns. Finally we’ll discuss why this representation is effective.

To illustrate what columnar storage is all about, here is an example with three columns A, B and C:

A1 B1 C1
A2 B2 C2
A3 B3 C3

In row-oriented storage, the data is laid out one row at a time:

A1 B1 C1 A2 B2 C2 A3 B3 C3

Whereas in column-oriented storage, it is laid out one column at a time:

A1 A2 A3 B1 B2 B3 C1 C2 C3

There are several advantages to columnar formats.

  • Organizing by column allows for better compression, as data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
  • I/O is reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
  • As we store data of the same type in each column, we can use encodings better suited to modern processors' pipelines by making instruction branching more predictable.

 

The model

To store in a columnar format we first need to describe the data structures using a schema. This is done using a model similar to Protocol Buffers. This model is minimalistic in that it represents nesting using groups of fields and repetition using repeated fields. There is no need for any other complex types like Maps, Lists or Sets, as they can all be mapped to a combination of repeated fields and groups.

The root of the schema is a group of fields called a message. Each field has three attributes: a repetition, a type and a name. The type of a field is either a group or a primitive type (e.g., int, float, boolean, string) and the repetition can be one of the three following cases:

  • required: exactly one occurrence
  • optional: 0 or 1 occurrence
  • repeated: 0 or more occurrences

For example, here’s a schema one might use for an address book:

message AddressBook {
    required string owner;
    repeated string ownerPhoneNumbers;
    repeated group contacts {
        required string name;
        optional string phoneNumber;
    }
}

Lists (or Sets) can be represented by a repeated field.

A Map is equivalent to a repeated field containing groups of key-value pairs where the key is required, as shown below.
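
For example, a map from string keys to string values could be declared like this (a sketch of my own in the same schema notation; the message and field names are hypothetical, not from the original example):

message HypotheticalStringMap {
    repeated group entries {
        required string key;
        optional string value;
    }
}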

 

Columnar format

A columnar format provides more efficient encoding and decoding by storing together values of the same primitive type. To store nested data structures in columnar format, we need to map the schema to a list of columns in a way that we can write records to flat columns and read them back to their original nested data structure. In Parquet, we create one column per primitive type field in the schema. If we represent the schema as a tree, the primitive types are the leaves of this tree.

In the AddressBook example, seen as a tree, the primitive fields owner, ownerPhoneNumbers, contacts.name and contacts.phoneNumber are the leaves.

To represent the data in columnar format, we create one column per primitive-type field, that is, one per leaf of the schema tree.

The structure of the record is captured for each value by two integers called repetition level and definition level. Using definition and repetition levels, we can fully reconstruct the nested structures. This will be explained in detail below.

Definition levels

To support nested records we need to store the level for which the field is null. This is what the definition level is for: from 0 at the root of the schema up to the maximum level for this column. When a field is defined then all its parents are defined too, but when it is null we need to record the level at which it started being null to be able to reconstruct the record.

In a flat schema, an optional field is encoded on a single bit using 0 for null and 1 for defined. In a nested schema, we use an additional value for each level of nesting (as shown in the example below). Finally, if a field is required, it does not need a definition level.

For example, consider the simple nested schema below:

message ExampleDefinitionLevel {
    optional group a {
        optional group b {
            optional string c;
        }
    }
}

It contains one column: a.b.c where all fields are optional and can be null. When c is defined, then necessarily a and b are defined too, but when c is null, we need to save the level of the null value. There are 3 nested optional fields so the maximum definition level is 3.

Here is the definition level for each possible case:

a is null                      -> definition level 0
a.b is null (a defined)        -> definition level 1
a.b.c is null (a.b defined)    -> definition level 2
a.b.c is defined               -> definition level 3

The maximum possible definition level is 3, which indicates that the value is defined. Values 0 to 2 indicate at which level the null field occurs.
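
To make the rule concrete, here is a minimal Python sketch of my own (not from the original post) computing the definition level of a.b.c for a record represented as nested dicts:

# Minimal sketch (not from the original post): definition level of the
# optional path a.b.c for a record represented as nested dicts.
def definition_level(record):
    a = record.get("a")
    if a is None:
        return 0              # a is null
    b = a.get("b")
    if b is None:
        return 1              # a.b is null
    if b.get("c") is None:
        return 2              # a.b.c is null
    return 3                  # the value is defined (maximum level)

print(definition_level({}))                          # 0
print(definition_level({"a": {}}))                   # 1
print(definition_level({"a": {"b": {}}}))            # 2
print(definition_level({"a": {"b": {"c": "foo"}}}))  # 3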

A required field is always defined and does not need a definition level. Let’s reuse the same example with the field b now required:

message ExampleDefinitionLevel {
    optional group a {
        required group b {
            optional string c;
        }
    }
}

The maximum definition level is now 2 as b does not need one. The definition level for the fields below b changes as follows:

a is null                        -> definition level 0
a.b.c is null (a and b defined)  -> definition level 1
a.b.c is defined                 -> definition level 2

Since b is required, it does not add a level: whenever a is defined, b is defined too.

Making definition levels small is important as the goal is to store the levels in as few bits as possible.

Repetition levels

To support repeated fields we need to store when new lists are starting in a column of values. This is what the repetition level is for: it is the level at which we have to create a new list for the current value. In other words, the repetition level can be seen as a marker of when to start a new list and at which level. For example, consider a list of lists of strings: a repeated group level1 containing a repeated string field level2. The level2 column will contain the flattened string values, each annotated with a repetition level.

The repetition level marks the beginning of lists and can be interpreted as follows:

  • 0 marks every new record and implies creating a new level1 and level2 list
  • 1 marks every new level1 list and implies creating a new level2 list as well.
  • 2 marks every new element in a level2 list.

Put another way, the repetition level is the level of nesting at which a new value is inserted.

A repetition level of 0 marks the beginning of a new record. In a flat schema there is no repetition and the repetition level is always 0. Only repeated fields need a repetition level: optional and required fields are never repeated and can be skipped when computing repetition levels. A sketch of this logic follows.
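
Here is a minimal Python sketch of my own (not from the original post) that assigns repetition levels to the values of one level1/level2 record, following the rules above (empty lists are ignored for simplicity; they are handled by definition levels):

# Minimal sketch (not from the original post): repetition levels for the
# values of one record of a list of lists of strings (level1 / level2).
def repetition_levels(record):
    first_in_record = True
    for level1_list in record:
        first_in_level1 = True
        for value in level1_list:
            if first_in_record:
                yield 0, value        # 0 starts a new record
            elif first_in_level1:
                yield 1, value        # 1 starts a new level1 list
            else:
                yield 2, value        # 2 continues the current level2 list
            first_in_record = False
            first_in_level1 = False

print(list(repetition_levels([["a", "b"], ["c", "d"]])))
# [(0, 'a'), (2, 'b'), (1, 'c'), (2, 'd')]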

Striping and assembly

Now, using the two notions together, let's consider the AddressBook example again. The maximum definition and repetition levels for each column are listed below; they are smaller than the depth of the column whenever a field on the path is required (no definition level needed) or not repeated (no repetition level needed):

owner:                 max definition level 0 (required), max repetition level 0 (no repetition)
ownerPhoneNumbers:     max definition level 1, max repetition level 1
contacts.name:         max definition level 1 (name is required), max repetition level 1
contacts.phoneNumber:  max definition level 2, max repetition level 1

In particular for the column contacts.phoneNumber, a defined phone number will have the maximum definition level of 2, and a contact without a phone number will have a definition level of 1. In the case where contacts are absent, it will be 0.

To illustrate, consider the following two records:

AddressBook {
    owner: "Julien Le Dem",
    ownerPhoneNumbers: "555 123 4567",
    ownerPhoneNumbers: "555 666 1337",
    contacts: {
        name: "Dmitriy Ryaboy",
        phoneNumber: "555 987 6543",
    },
    contacts: {
        name: "Chris Aniszczyk"
    }
}

AddressBook {
    owner: "A. Nonymous"
}

We’ll now focus on the column contacts.phoneNumber to illustrate this.

Once projected onto that column, the records have the following structure:

AddressBook {
    contacts: {
        phoneNumber: "555 987 6543"
    }
    contacts: {
    }
}

AddressBook {

}

To write the column (R = repetition level, D = definition level), we iterate through the record data for this column; a sketch of this striping step follows the list:

  • contacts.phoneNumber: “555 987 6543”
    • new record: R = 0
    • value is defined: D = maximum (2)
  • contacts.phoneNumber: null
    • repeated contacts: R = 1
    • only defined up to contacts: D = 1
  • contacts: null
    • new record: R = 0
    • only defined up to AddressBook: D = 0
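
A minimal Python sketch of my own (not from the original post, with records represented as plain dicts) of this striping step for contacts.phoneNumber:

# Minimal sketch (not from the original post): emit (R, D, value) triples
# for the contacts.phoneNumber column of one AddressBook record.
def stripe_phone_numbers(address_book):
    contacts = address_book.get("contacts", [])
    if not contacts:
        yield 0, 0, None                      # no contacts: defined only up to AddressBook
        return
    for i, contact in enumerate(contacts):
        r = 0 if i == 0 else 1                # 0 starts the record, 1 repeats contacts
        phone = contact.get("phoneNumber")
        d = 2 if phone is not None else 1     # 2 = defined, 1 = defined up to contacts
        yield r, d, phone

print(list(stripe_phone_numbers({"contacts": [{"phoneNumber": "555 987 6543"}, {}]})))
# [(0, 2, '555 987 6543'), (1, 1, None)]
print(list(stripe_phone_numbers({})))
# [(0, 0, None)]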

 

The column then contains the following data (R = repetition level, D = definition level):

R=0  D=2  "555 987 6543"
R=1  D=1  NULL
R=0  D=0  NULL

 

Note that NULL values are represented here for clarity but are not stored at all. A definition level strictly lower than the maximum (here 2) indicates a NULL value.

To reconstruct the records, we iterate back through the column; a sketch of this assembly step follows the list:

  • R=0, D=2, Value = “555 987 6543”:
    • R = 0 means a new record. We recreate the nested records from the root until the definition level (here 2)
    • D = 2 which is the maximum. The value is defined and is inserted.
  • R=1, D=1:
    • R = 1 means a new entry in the contacts list at level 1.
    • D = 1 means contacts is defined but not phoneNumber, so we just create an empty contacts.
  • R=0, D=0:
    • R = 0 means a new record. We recreate the nested records from the root down to the definition level.
    • D = 0 means contacts is null, so the record is just an empty AddressBook.
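
A minimal Python sketch of my own (not from the original post) of this assembly step, rebuilding the projected records from the (R, D, value) triples:

# Minimal sketch (not from the original post): rebuild the projected
# AddressBook records from the (R, D, value) triples of contacts.phoneNumber.
def assemble(triples):
    records = []
    for r, d, value in triples:
        if r == 0:
            records.append({"contacts": []})  # R = 0: start a new (empty) record
        if d == 0:
            continue                          # contacts is null: leave the record empty
        contact = {}                          # R in {0, 1} opens a new contacts entry
        if d == 2:
            contact["phoneNumber"] = value    # maximum level: the value is defined
        records[-1]["contacts"].append(contact)
    return records

print(assemble([(0, 2, "555 987 6543"), (1, 1, None), (0, 0, None)]))
# [{'contacts': [{'phoneNumber': '555 987 6543'}, {}]}, {'contacts': []}]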

 

Storing definition levels and repetition levels efficiently

With regard to storage, this effectively boils down to creating three sub-columns for each primitive type: the repetition levels, the definition levels and the values. However, the overhead of storing these sub-columns is low thanks to the columnar representation: levels are bound by the depth of the schema and can be stored efficiently using only a few bits per value (a single bit stores levels up to 1, 2 bits store levels up to 3, 3 bits store levels up to 7). In the address book example above, the column owner has a depth of one and the column contacts.name has a depth of two. The levels always have zero as a lower bound and the depth of the column as an upper bound. Even better, fields that are not repeated do not need a repetition level and required fields do not need a definition level, bringing down the upper bound.
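
The number of bits needed per level follows directly from the maximum level; a minimal sketch of my own (not from the original post):

# Minimal sketch (not from the original post): bits needed to store a level
# whose maximum possible value is max_level.
import math

def bits_for_level(max_level):
    if max_level == 0:
        return 0                              # always zero: can be omitted entirely
    return math.ceil(math.log2(max_level + 1))

for m in (0, 1, 3, 7):
    print(m, bits_for_level(m))               # 0 -> 0, 1 -> 1, 3 -> 2, 7 -> 3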

In the special case of a flat schema with all fields required (equivalent of NOT NULL in SQL), the repetition levels and definition levels are omitted completely (they would always be zero) and we only store the values of the columns. This is effectively the same representation we would choose if we had to support only flat tables.

These characteristics make for a very compact representation of nesting that can be efficiently encoded using a combination of run-length encoding and bit packing. A sparse column with many null values compresses to almost nothing; similarly, an optional column that is always set costs very little overhead to store millions of 1s. In practice, the space occupied by levels is negligible. This representation is a generalization of how we would represent the simple case of a flat schema: writing all values of a column sequentially and using a bitfield for storing nulls when a field is optional.
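
As a toy illustration of why a mostly-constant level column costs so little (my own sketch, not the encoding Parquet actually uses on disk), a simple run-length encoding collapses it to a handful of runs:

# Minimal sketch (not from the original post): run-length encode a list of
# levels into (count, level) pairs.
def run_length_encode(levels):
    runs = []
    for level in levels:
        if runs and runs[-1][1] == level:
            runs[-1][0] += 1                  # extend the current run
        else:
            runs.append([1, level])           # start a new run
    return [tuple(run) for run in runs]

print(run_length_encode([1, 1, 1, 1, 0, 1, 1]))
# [(4, 1), (1, 0), (2, 1)]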

Get Involved

Parquet is still a young project; to learn more about the project see our README or look for the “pick me up!” label on GitHub. We do our best to review pull requests in a timely manner and give thorough and constructive reviews.

You can also join our mailing list and tweet at @ApacheParquet to join the discussion.

 
