SSAS Partition Slicing

Summary:
SSAS uses partitions to contain cube data.  Small cubes might use only one partition, but for non-trivial cubes, cube designers create multiple partitions to make the data easier to manage and to split it into groups.  This document discusses how the server uses "slices" to examine partitions at query time.  The slices are stored as ranges of internal surrogate keys, and can be thought of as a very high-level index.

Using more attributes in the partition definition can potentially improve performance by allowing partitions to be excluded.  Benefiting user queries, however, requires some consideration of how partition slices are used internally.  Note that even when slices do not exclude partitions, we still expect the map indexes to be efficient at retrieving only the necessary data.

Slice description:
By "slice" we mean one or more members that limit the multidimensional space.  Queries may have a "slice" as a result of including a set of members on an axis, or by placing members or a set in the where clause.  Query slices are always a finite set of members.  The formula engine may decompose MDX queries into many storage engine queries, which have a simple specification of granularity of each attribute, and a set of slices that limit the contents.

Cube partitions each have a slice for each attribute.  This slice is either set by the administrator as part of the partition definition, or, for MOLAP partitions, created automatically while the bitmap indexes for the partition are built.  There is one slice per attribute in the partition.  (Similarly, there is one bitmap index per attribute in the partition.)  Because the server automatically discovers the slice, it is sometimes referred to as the "auto slice".

The partition slice values are stored in a file in the partition directory, with the name "Info.*.xml".  The information stored is the minimum and maximum member IDs present in the partition, for each attribute.  The member IDs are internal to the OLAP server and are essentially surrogate keys, created as a contiguous set of 32-bit integers starting with 1.  This matters because the slice is stored only as a min/max range.  It might be beneficial to change the SQL query for the dimension attributes so that the order in which dimension members are returned matches the partition slicing scheme.  Then the alignment of the internal surrogate keys helps the server eliminate partitions from consideration at query time.  See the Design Example section below.
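
As an illustration only (this is not the actual server code, and the row layout is hypothetical), the auto slice can be thought of as the per-attribute minimum and maximum of the surrogate keys present in the partition:

def build_auto_slice(rows):
    # rows: hypothetical partition rows as dicts of attribute name -> surrogate key ID
    slices = {}  # attribute -> (min_id, max_id)
    for row in rows:
        for attr, key_id in row.items():
            lo, hi = slices.get(attr, (key_id, key_id))
            slices[attr] = (min(lo, key_id), max(hi, key_id))
    return slices

# Example: a partition holding one week of days (Day IDs 1101..1107, Week ID 159).
rows = [{"ID Day": d, "ID Week": 159} for d in range(1101, 1108)]
print(build_auto_slice(rows))  # {'ID Day': (1101, 1107), 'ID Week': (159, 159)}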

The slice is created during index creation.  Index creation is controlled by the number of rows, specified in the config file as <IndexBuildThreshold>, with a default of 4096.  Partitions with fewer rows will not have indexes built -- and therefore will not have an "auto slice".

Optimal Number and Size of Partitions

There is not necessarily an optimal number or size of partitions, but some general guidelines exist.  In the Analysis Services 2005 Performance Guide (see link below), we suggest at most about 2000 partitions, with a maximum size of about 2 GB or 10-15 million rows.  Some sites successfully use much larger partitions.  As for the number of partitions: too few can miss opportunities to use slicing or can lengthen the processing window; too many can take more work to manage, slow some operations during processing (improved after SP2 was released), and slow service startup because of the many more files on disk.  Partitions that are too small might not offer any benefit, and partitions that are too large might push the processing window past what is available.

Query Time Behavior

At query time all partitions are scanned for possible inclusion, by checking the query slices and partition slices for intersection.  This check is sequential, but it is fast because it involves only integer range comparisons.  Partitions that are included are queried in parallel; partitions that are excluded are not touched further.
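
A minimal sketch of that inclusion test, assuming a query slice is represented as a set of surrogate key IDs per attribute and a partition slice as a (min, max) range per attribute (both hypothetical representations):

def partition_may_contain(query_slices, partition_slices):
    # query_slices: attribute -> set of surrogate key IDs the query slices on
    # partition_slices: attribute -> (min_id, max_id) range stored for the partition
    for attr, members in query_slices.items():
        if attr not in partition_slices:
            continue                      # no auto slice stored for this attribute
        lo, hi = partition_slices[attr]
        if not any(lo <= m <= hi for m in members):
            return False                  # no overlap on this attribute: exclude
    return True                           # partition must be read (or its indexes used)

# A query slicing on days 1465..1492 against a partition covering days 1101..1107:
print(partition_may_contain({"ID Day": set(range(1465, 1493))},
                            {"ID Day": (1101, 1107)}))   # False -> partition excluded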

There are SQLProfiler trace events for each partition that is used.  Which trace event is emitted depends on whether the data is read from the fact data or from an aggregation, and each partition can have a separate aggregation design.

Troubleshooting Information

It is important for very large cubes that a partition is not touched unless the query depends upon it.  The information generally used to troubleshoot is:

  • Trace event Query Subcube Verbose.  Check to see which attributes have slices.
  • Info.*.xml file, located in the partition directory.  Make sure slices are defined for the attributes used in the query.
  • Trace events can be used to check which partitions are being read.  See below.

In SSAS2005 SP2 and earlier, the following behaviors may have a negative impact on the number of partitions touched.

  • Aggregations having fewer than <IndexBuildThreshold> rows are not excluded based on slice, even if a slice range exists for the fact table data.  One possible workaround is to lower the value of the <IndexBuildThreshold> configuration parameter in the server configuration file.  However, this is not recommended, as it may result in a large number of additional indexes, slowing processing performance.
  • Queries using "OR" slices do not use partition slices.  The OR slice can be identified by a "+" in the trace event Query Subcube Verbose.

Both of these issues are expected to be addressed in a future Analysis Services release.

Trace Events

These are the relevant trace events to examine. 

Query Subcube Verbose.  This trace event is emitted after the internal storage engine subcube query has completed.  The duration shows the total time for creating the data cache.  Note that a query executes only one subcube query at a time (although many partitions can be read concurrently within it), so the start and end times will not overlap those of other Query Subcube Verbose events.  Here is an example portion of the output.

D:3 (Time) [1 8 52 * 0 0 0 0 0]  =>   ((All)):[All]  (Season):[Winter 2007]  (Period):[JANUARY_2007]  (Week):*  (Day):0  (Dimdate):0  (Year):0  (Half):0  (Qtr):0


The numbers in square brackets [1 8 52 * 0 ...] have these meanings:

Single number - The internal surrogate key ID for the member participating in the slice.  Note that the first element is often the All level, which by definition only has 1 member, and therefore its ID will always be 1.  The member name is displayed.

'*' - The answer must have granularity for this attribute, and all members are used.  There is no slice on this attribute.

0 - The answer should not include granularity for this attribute. 

'+' - A slice is defined with more than 1 member.  This is sometimes called an "OR" slice, because the equivalent SQL query would be of the form: where city = 'Seattle' or city = 'Redmond'.  The list of members is not displayed.  For simple MDX queries it is sometimes possible to infer the list.
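
As an illustration of these token meanings (the attribute names are taken from the example above; the helper itself is hypothetical and not part of any SSAS API):

def describe_vector(vector, attribute_names):
    out = []
    for name, token in zip(attribute_names, vector.split()):
        if token == "0":
            out.append(f"{name}: no granularity")
        elif token == "*":
            out.append(f"{name}: granularity, all members (no slice)")
        elif token == "+":
            out.append(f"{name}: granularity, sliced to a set of members (OR slice)")
        else:
            out.append(f"{name}: granularity, sliced to member ID {token}")
    return out

for line in describe_vector("1 8 52 * 0 0 0 0 0",
                            ["(All)", "Season", "Period", "Week", "Day",
                             "Dimdate", "Year", "Half", "Qtr"]):
    print(line)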

Query Subcube.  This trace event is not as human-readable as Query Subcube Verbose.  The format is a bitmap of attribute granularities, with dimensions delimited by a comma (",").

The same format of information is emitted by the Get Data From Aggregation trace event, and is also inserted into the QueryLog relational database table if that is enabled in the configuration file.  You can cut and paste this string directly into the AggManager sample to help create aggregations that match your queries, either instead of or in addition to using the QueryLog relational table.  (We point this out because the format can be transferred, not to suggest creating aggregations by just cutting and pasting; designing the set of aggregations is beyond the scope of this document.)

TextData from Query Subcube event : 00000000,000000000000000000000000,01110000

In this example, there are 3 dimensions, and the query has granularity only on attributes in the 3rd dimension (which, for this example, happens to be the Time dimension).
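
A small sketch of how this bitmap can be decoded, assuming the dimension names and attribute ordering shown here (they are placeholders for whatever the cube actually defines):

def decode_subcube(bitmap, dimension_names):
    # bitmap: e.g. "00000000,000000000000000000000000,01110000"
    # returns, per dimension, the positions of attributes with granularity requested
    result = {}
    for dim_name, bits in zip(dimension_names, bitmap.split(",")):
        result[dim_name] = [i for i, b in enumerate(bits) if b == "1"]
    return result

print(decode_subcube("00000000,000000000000000000000000,01110000",
                     ["Location", "Product", "Time"]))
# {'Location': [], 'Product': [], 'Time': [1, 2, 3]} -> granularity only on Time attributes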

Get Data From Aggregation.  This trace event is emitted when data is read from an aggregation in a partition (rather than from the lowest-level fact data).  It is not emitted if the partition can be excluded based on the partition's "auto slice".

Note that the start and end times will typically overlap those of other events, because the storage engine reads partition data in parallel; the times will be contained within the time range of Query Subcube and Query Subcube Verbose.

The ObjectPath trace column tells which partition is being read.  Note that Query Subcube and Query Subcube Verbose also have an ObjectPath, but without the partition information, since the subcube query represents the union of all of the measure group's partitions.  The format is as follows.  Note that for named instances the server name might be of the form MachineName\ServerInstanceName.

ServerName . DatabaseID . CubeID . MeasureGroupID . PartitionName

The TextData column contains the name of the aggregation used and the list of attribute granularities for that aggregation.  These will always "cover" (be a superset of) the granularity of the Query Subcube.  Here is an example:

TextData from Query Subcube event : 00000000,000000000000000000000000,01110000

TextData from Get Data From Aggregation event: Aggregation 1  00000000,000000000000000000000000,01110111
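
A minimal sketch of that "covers" relationship, using the two bitmaps above (the helper is illustrative only):

def covers(aggregation_bitmap, subcube_bitmap):
    # every attribute requested by the subcube (bit = 1) must be present in the aggregation
    agg = aggregation_bitmap.replace(",", "")
    sub = subcube_bitmap.replace(",", "")
    return all(a == "1" for a, s in zip(agg, sub) if s == "1")

print(covers("00000000,000000000000000000000000,01110111",
             "00000000,000000000000000000000000,01110000"))   # True: the aggregation can answer it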

Progress Report Begin, Progress Report End.  These trace events are emitted when fact data is read from a partition (rather than from an aggregation).  They are not emitted if the partition can be excluded based on the partition's "auto slice".

Note that the start and end times will typically overlap those of other events, because the storage engine reads partition data in parallel; the times will be contained within the time range of Query Subcube and Query Subcube Verbose.

The ObjectName trace column tells which partition is being read.  The TextData column contains the same information.  Here is an example:

TextData from Progress Report Begin event : Started reading data from the 'MyPartitionName' partition.


Example of Partition Slice Information

The exact MDX query is unimportant, so we will just look at the Query Subcube Verbose and some partition information.

Query Subcube Verbose, from SQLProfiler trace:

Dimension 0 [Location] (0 0 0 0 0 0 0 0)  [Dimlocation]:0  [Chain]:0  [Division]:0  [Region]:0  [Area]:0  [Store]:0  [Location Type]:0  [Grouped Stores Ind]:0
Dimension 1 [Product] (0 2 2 14 137 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)  [Dimproduct]:0  [Company]:[C SAMPLE]  [Brand]:[B 1 SAMPLE]  [Group]:[G 105 SAMPLE]  [Dept]:[D 1125 SAMPLE]  [Category]:0  [Range]:0  [Style]:0  [Colour]:0  [Size 1]:0  [Size 2]:0  [Size]:0  [Season Ind]:0  [Replenishment]:0  [Clearance]:0  [RSP]:0  [Primary Supplier]:0  [Supplier Color]:0  [SKU]:0  [Size Indicator1]:0  [VPN]:0  [Sku Description]:0  [Latest Cost]:0  [Original RSP]:0
Dimension 2 [Time] (0 * * * + 0 0 0)  [Dimdate]:0  [Week]:*  [Period]:*  [Season]:*  [Day]:+  [Year]:0  [Qtr]:0  [Half]:0

The "0" means no granularity for that attribute.

The "*" means include granularity for that attribute, with all members.

The "+" means include granularity for that attribute, with a set of members.

A number means include granularity for that attribute, with one specific member.  The number is the internal surrogate key.

This subcube query included a slice on a set of days.  Though the set members are not included in the trace event, for the actual query the set of [Time].[Day] for this example was { 1465, 1466, ... 1492 }.

So for this query we can expect that partitioning by [Product] or [Time] would be effective at reducing query time, because the slice information acts as a high-level index and reduces the work for the query.

SQLProfiler Trace information:

For this example, aggregations were able to answer the subcube query, so the relevant trace events are Progress Report Begin and Progress Report End.  Here are the relevant columns for one trace event.  (Note that in SQLProfiler this is displayed horizontally as one row.)

EventClass: Progress Report Begin

EventSubclass: 14 - Query

TextData: Started reading data from the 'Aggregation 9' aggregation.

DatabaseName: Sample

ObjectName: Aggregation 9

ObjectID: Aggregation 9

ObjectPath: MyMachine.Sample.Sample SMDB.Fact Sales SKU Store Day 0.Sales_W200639.Aggregation 9

Note that the partition name is part of the ObjectPath, a trace column that is disabled by default.

Here is a subset of Info.*.xml for a customer partition covering year 2005, week 36.  We can see that if the query slices on any of these Product or Time properties with members outside the stored ranges, the partition can be excluded.  Note that some ranges are narrow (few members) and some are broad.  For example, the range for [Product].[Colour] is 2...736, which is not surprising, since the products sold in this week probably include every colour.  If there were a goal of also partitioning by colour or some other product attribute, that would help queries slicing by that attribute.

For the example subcube query, we expect this partition to be excluded based on [Time].[Day].  It actually was excluded because none of the days in the query's set fall within the inclusive range 1101...1107 for the Day property.

      <MapDataIndex>
        <DimensionID>Product</DimensionID>
        <PropertyID>ID Company</PropertyID>
        <m_MinIndex>2</m_MinIndex>
        <m_MaxIndex>2</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Product</DimensionID>
        <PropertyID>ID Brand</PropertyID>
        <m_MinIndex>2</m_MinIndex>
        <m_MaxIndex>2</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Product</DimensionID>
        <PropertyID>ID Group</PropertyID>
        <m_MinIndex>2</m_MinIndex>
        <m_MaxIndex>21</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Product</DimensionID>
        <PropertyID>ID Style</PropertyID>
        <m_MinIndex>22</m_MinIndex>
        <m_MaxIndex>325486</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Product</DimensionID>
        <PropertyID>ID SKU</PropertyID>
        <m_MinIndex>122</m_MinIndex>
        <m_MaxIndex>1307861</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Product</DimensionID>
        <PropertyID>Colour</PropertyID>
        <m_MinIndex>2</m_MinIndex>
        <m_MaxIndex>736</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Product</DimensionID>
        <PropertyID>Size Description1</PropertyID>
        <m_MinIndex>2</m_MinIndex>
        <m_MaxIndex>2205</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>(All)</PropertyID>
        <m_MinIndex>1</m_MinIndex>
        <m_MaxIndex>1</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>ID Season</PropertyID>
        <m_MinIndex>11</m_MinIndex>
        <m_MaxIndex>11</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>ID Period</PropertyID>
        <m_MinIndex>39</m_MinIndex>
        <m_MaxIndex>39</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>ID Week</PropertyID>
        <m_MinIndex>159</m_MinIndex>
        <m_MaxIndex>159</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>ID Day</PropertyID>
        <m_MinIndex>1101</m_MinIndex>
        <m_MaxIndex>1107</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>Dimdate</PropertyID>
        <m_MinIndex>1101</m_MinIndex>
        <m_MaxIndex>1107</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>ID Year</PropertyID>
        <m_MinIndex>6</m_MinIndex>
        <m_MaxIndex>6</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>Id Trdhalf</PropertyID>
        <m_MinIndex>8</m_MinIndex>
        <m_MaxIndex>8</m_MaxIndex>
      </MapDataIndex>
      <MapDataIndex>
        <DimensionID>Time</DimensionID>
        <PropertyID>ID Qtr</PropertyID>
        <m_MinIndex>14</m_MinIndex>
        <m_MaxIndex>14</m_MaxIndex>
      </MapDataIndex>
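
As an illustration, the following sketch reads MapDataIndex ranges like the ones above and applies the exclusion test for the example day set.  The real Info.*.xml has more surrounding structure; here the elements are wrapped in a dummy root just so the snippet parses.

import xml.etree.ElementTree as ET

snippet = """<root>
  <MapDataIndex>
    <DimensionID>Time</DimensionID>
    <PropertyID>ID Day</PropertyID>
    <m_MinIndex>1101</m_MinIndex>
    <m_MaxIndex>1107</m_MaxIndex>
  </MapDataIndex>
</root>"""

ranges = {}
for node in ET.fromstring(snippet).iter("MapDataIndex"):
    key = (node.findtext("DimensionID"), node.findtext("PropertyID"))
    ranges[key] = (int(node.findtext("m_MinIndex")), int(node.findtext("m_MaxIndex")))

query_days = set(range(1465, 1493))            # the { 1465, 1466, ... 1492 } slice
lo, hi = ranges[("Time", "ID Day")]
print(any(lo <= d <= hi for d in query_days))  # False -> the partition is excluded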

Design Example

In this example, it is known that for a cube, users often want to query for a particular city and time range.  It is common to partition by time, and to improve query time we can also partition by geographical region.  For the real customer cube, one city represents more than 50% of the records.  So we want to create a partitioning scheme that looks something like this:

2001, BigCity
2002, BigCity
2003, BigCity
2004, BigCity
2005, BigCity
2001, OtherCities
2002, OtherCities
2003, OtherCities
2004, OtherCities
2005, OtherCities

 

Remember that query time is improved by excluding partitions, and that is done by ensuring the ID ranges do not overlap.  The IDs are surrogate keys created during dimension processing.  So for this example, the dimension processing query for Country, Province, and City must return BigCity either before or after all the other cities (and provinces and countries).  If, for example, BigCity falls in the middle, we get these ranges:

BigCity partition: MinIndex = 500, MaxIndex = 500

OtherCities partition: MinIndex = 1, MaxIndex = 1000

Therefore a query slicing on SmallCity with ID = 123 would exclude the BigCity partitions, but a query slicing on BigCity would not be able to exclude the OtherCities partitions, and would take longer.

The preferred result is a set of exclusive (non-overlapping) ranges, as shown below.  Then a query slicing on BigCity will exclude 50% of the partitions, and a query slicing on a small city will exclude the other 50%.  This is optimal.

BigCity partition: MinIndex = 1, MaxIndex = 1

OtherCities partition: MinIndex = 2, MaxIndex = 1000

So we know the goal: after processing, check the Info.*.xml files to ensure the ranges are exclusive, then run representative MDX queries and check the Query Subcube Verbose trace events.
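
A small sketch of that verification step, assuming the (min, max) range for one attribute has been read from each partition's Info.*.xml (partition names and ranges here are hypothetical):

def overlapping_pairs(partition_ranges):
    # partition_ranges: partition name -> (min_id, max_id) for one attribute
    # An empty result means the ranges are exclusive (no two ranges overlap).
    items = sorted(partition_ranges.items(), key=lambda kv: kv[1])
    return [(a, b) for (a, (_, a_hi)), (b, (b_lo, _))
            in zip(items, items[1:]) if b_lo <= a_hi]

# Desired layout: BigCity's ID range sits entirely before the other cities.
print(overlapping_pairs({"BigCity_2005": (1, 1), "OtherCities_2005": (2, 1000)}))      # []
# Undesired layout: BigCity's ID falls inside the OtherCities range.
print(overlapping_pairs({"BigCity_2005": (500, 500), "OtherCities_2005": (1, 1000)}))  # overlap reported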

Ordering the rows during dimension processing can be done by using a named query as the source of the dimension in the DSV; one could define a different named query for each attribute.  There is a chance this could hurt dimension processing performance if any operation needs a join.  Another approach is to substitute a view for the dimension table to perform the appropriate sorting.  Note that at present we have not actually done this, but we believe the approach will work.

Further Reading

Analysis Services 2005 Performance Guide

http://download.microsoft.com/download/8/5/e/85eea4fa-b3bb-4426-97d0-7f7151b2011c/SSAS2005PerfGuide.doc

AggManager, part of the SP2 samples.

http://www.microsoft.com/downloads/details.aspx?FamilyID=e719ecf7-9f46-4312-af89-6ad8702e4e6e&DisplayLang=en

[end]

 

 

Comments

garyf123 said:

Does this mean it is using the slice or not?

The "0" means no granularity for that attribute.

I have my cube partitioned but it is still scanning the partitions...

Dimension 0 [Dim Num Bedroom] (0)  [Num Bedrooms]:0

Dimension 1 [Dim Property Type] (0)  [Property Type]:0

Dimension 2 [Dim Square Feet] (0)  [Square Feet]:0

Dimension 3 [Dim Year Built] (0)  [Year Built]:0

Dimension 4 [Dim Year Sold] (0)  [Year Sold]:0

Dimension 5 [Dim Num Bath] (0)  [Num Baths]:0

Dimension 6 [Dim LotSize SQFT] (0)  [LotSize SQFT]:0

Dimension 7 [Dim Geo] (0 * 0 0 0 0 0 0 0 0 0 0)  [Zip Plus4]:0  [State]:*  [State Full Name]:0  [State Median]:0  [Nation Median]:0  [County Name]:0  [County Median]:0  [City]:0  [City Median]:0  [Zip Code]:0  [Zip5 Median]:0  [Zip Plus4 Median]:0

Dimension 8 [Dim Lat 4] (-)  [Lat Level4]:-

Dimension 9 [Dim Long 4] (-)  [Long Level4]:-

Dimension 10 [Dim Lat] (0 0 0 0 0)  [Lat Level4]:0  [Lat Level2]:0  [Lat Level1]:0  [Lat Level0]:0  [Lat Level3]:0

Dimension 11 [Dim Long] (0 0 0 0 0)  [Long Level4]:0  [Long Level2]:0  [Long Level1]:0  [Long Level0]:0  [Long Level3]:0

Reposted from: https://www.cnblogs.com/yanbinboy/archive/2010/01/13/1646720.html
