Data Densification (Domain Completion and Domain Padding), Cubes, and Big Data – Chicago Crime

Many data-related questions that Tableau newbies have are not related to data preparation, joins, unions, data blending, or data structures. Some of those questions are as follows:

  • I just created a table calculation and observed that the view displays numbers that don't exist in the underlying data. Why?
  • We use SAP Business Warehouse (SAP BW) in my organization. What should I know about how Tableau works with cubes?
  • How does Tableau work with big data?

     This chapter will continue the data discussion from the previous chapters by addressing the topics these three preceding questions target, respectively: data densification, working with cubes, and Tableau and big data.

     A few people have worked diligently to provide resources that were very helpful while writing this chapter. Joe Mako has championed data densification—be sure to check out his video on Vimeo (https://vimeo.com/107354333) for a deep dive into this challenging topic. The Tableau Online Help (https://help.tableau.com/current/pro/desktop/en-us/cubes.htm) has an invaluable series of articles and accompanying workbooks for understanding the ins and outs of working with OLAP cubes. Lastly, the Tableau help pages also provide an excellent resource on Tableau and big data: https://help.tableau.com/current/pro/desktop/en-us/cubes.htm

In this chapter, we will discuss the following topics:

  • Using the OData connector
  • Introducing data densification
  • Domain completion
  • Domain padding
  • Working with cubes
  • Tableau and big data

     We will start this chapter by introducing the OData connector. It is a great way to practice data visualization because it allows you access to many different data sources. We will use it in the Introducing data densification section to access and manipulate the Chicago Crime dataset.

Using the OData connector

     The exercises in this chapter can be followed by downloading the workbook associated with this chapter from https://public.tableau.com/app/profile/marleen.meier or by using the OData connector directly. The following steps explain how datasets from the web can be loaded directly into Tableau by using an OData endpoint:
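To make the OData connection concrete, the sketch below parses the kind of JSON payload an OData endpoint returns. The payload shape and field names here are illustrative assumptions, not the Chicago Crime schema; a real endpoint would be fetched over HTTP rather than built inline.

```python
import json

# A hypothetical OData v4 JSON payload; a real endpoint (for example,
# one exposed by the Chicago data portal) would be fetched over HTTP.
sample_response = json.dumps({
    "@odata.context": "https://example.org/odata/$metadata#Crimes",
    "value": [
        {"CaseNumber": "HY123", "PrimaryType": "THEFT", "Year": 2015},
        {"CaseNumber": "HY124", "PrimaryType": "BATTERY", "Year": 2015},
    ],
})

def parse_odata(payload: str):
    """Return the list of records from an OData JSON payload."""
    return json.loads(payload)["value"]

records = parse_odata(sample_response)
print(len(records))  # 2
```

OData responses carry their records in a top-level `value` array, which is why the helper simply indexes into it.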

     Now that we have the data connection all set up, we can continue with our first real exercise about data densification. Knowing about data densification will help you troubleshoot if you see unexpected output in your dashboard, and to set up your visualization the right way to achieve your goals without having to troubleshoot at all.

Introducing data densification

     Data densification is a largely undocumented aspect of Tableau that can be useful in many circumstances but can also be confusing when encountered unexpectedly. This section will provide information about data densification with the intent of dispelling confusion and providing the Tableau author with sufficient knowledge to use this feature to their advantage.

     To begin understanding data densification, four terms should be defined: data densification, sparse data, domain completion, and domain padding. In addition to the definitions, each term will be discussed in detail by using examples to help improve understanding: 

  • Data densification: A behavior wherein Tableau displays marks in the view for which there is no corresponding underlying data. Data densification is a broad term indicating that missing values or records are "filled in".

  • Sparse data: An intersection of one or more dimensions and one measure for which there is no value. Data with missing values (such as data that doesn't have a record for every single date, or that only contains records for products that have been ordered, as opposed to all products in inventory) is referred to as sparse data.

  • Domain completion: The addition of marks to a sparsely populated view that causes all possible dimension/measure combinations to display results (filling in missing intersections or dimensional values).

  • Domain padding: The addition of marks to the visualization through range-aware dimensions (that is, date and bin) even when there is no underlying data associated with those marks (filling in missing dates or bin values).
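These four terms can be illustrated with a small sketch in plain Python (the location and year values are made up for illustration). The sparse input has values for only two dimension intersections; "densifying" it fills in every location/year combination with a default:

```python
from itertools import product

# Sparse input: only two (location, year) intersections have values.
sparse = {("AIRPORT", 2015): 12, ("STREET", 2016): 30}

locations = sorted({loc for loc, _ in sparse})
years = sorted({yr for _, yr in sparse})

# Densify: every location/year combination gets a value (default 0),
# analogous to domain completion filling missing intersections.
dense = {(loc, yr): sparse.get((loc, yr), 0) for loc, yr in product(locations, years)}

print(len(sparse), len(dense))  # 2 4
```

The cross product of the dimension domains is exactly what Tableau materializes as extra marks when domain completion is active.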

The definitions should be clear now; let's continue with hands-on exercises.

Domain completion

     There are two types of data densification: domain completion and domain padding. Domain completion is the more complex of the two and can be deployed cleverly to solve sparse data issues but may also appear unexpectedly and prove a challenge to address. 

     Grasping domain completion requires a good understanding of dimensions and measures, discrete and continuous fields, and partitioning and addressing within table calculations. The first two sets of terms, dimensions/measures and discrete/continuous, are discussed in https://blog.csdn.net/Linli522362242/article/details/124207205, Getting Up to Speed – A Review of the Basics. The last set of terms, partitioning and addressing, was discussed in detail in https://blog.csdn.net/Linli522362242/article/details/124550022, Table Calculations.

     Now, let's consider how domain completion can be deployed, when it's helpful, and when it can be a problem.

Deploying domain completion

     Domain completion can be activated in numerous and sometimes surprising and confusing ways. Adjusting the arrangement of pills on shelves, toggling dimensions between discrete and continuous, switching view types on the Marks card, and adjusting partitioning and addressing can all impact domain completion activation. Although examples of every activation possibility will not be covered in this book, a review of typical domain completion scenarios should prove helpful.

Activating domain completion in a crosstab

The following steps will guide you through an example of domain completion.

  • 1. Navigate to the worksheet entitled DC - Crosstab.
  • 2. Ensure that Analysis | Table Layout | Show Empty Rows and Show Empty Columns are both deselected.
  • 3. In the Chicago Crime dataset, create a calculated field named Index with the code
    INDEX()
  • 4. Add a Location Description filter to view only a few locations, for example, all the ones beginning with AIRPORT. Then place Location Description on the Rows shelf.
  • 5. Place Date on the Columns shelf (right-click, Change Data Type | Date). Leave it at the year aggregation and change it to a discrete value. Note, as shown in the following screenshot, that the view is sparsely populated:
  • 6. Place Index on the Detail shelf on the Marks card. Note that in the following screenshot the view now reflects domain completion (the addition of marks to a sparsely populated view that causes all possible dimension/measure combinations to display results); that is, the view is fully populated. Adding a table calculation such as Index to Detail causes Tableau to fill in the missing intersections, so data densification is activated.
  • 7. Right-click on YEAR(Date) and select Continuous. Note that data densification is deactivated:
  • 8. Reset YEAR(Date) to Discrete and then right-click on Index and select Edit Table Calculation.
  • 9. In the resulting dialog box, select Specific Dimensions and then observe the results for each of the following selections:

    • Location Description: Activates domain completion (Location Description is addressed; the remaining dimension, Year of Date, partitions the calculation)
    • Year of Date: Activates domain completion (Year of Date is addressed; the remaining dimension, Location Description, partitions the calculation)
    • Location Description and Year of Date: Deactivates domain completion (both dimensions are addressed)
    • No selection: Deactivates domain completion (both Location Description and Year of Date partition the calculation)

The preceding exercise illustrates the following rule for deploying domain completion:

• Given a crosstab with discrete dimensions on the Rows and Columns shelves, utilizing a table calculation (in this case, the Index field) in which at least one dimension (but not all dimensions) is addressed activates domain completion 

     A key term in the preceding rule may have been confusing: addressed. Partitioning and addressing were covered in https://blog.csdn.net/Linli522362242/article/details/124550022, Table Calculations, but will be defined again here to ensure understanding of this rule. Consider the following, from the Tableau documentation (https://help.tableau.com/current/pro/desktop/en-us/calculations_tablecalculations.htm): 

     The dimensions that define how to group the calculation, that is, that define the scope of the data it is performed on, are called partitioning fields. The table calculation is performed separately within each partition. The remaining dimensions, upon which the table calculation is performed, are called addressing fields, and determine the direction of the calculation. 

     When editing a table calculation, you can choose to select/deselect specific dimensions.

  • When a dimension is selected, that dimension is used to address the table calculation.
  • When a dimension is not selected, the dimension is used to partition the table calculation.
  • The following screenshot of a Table Calculation editing dialog box demonstrates addressing on Location Description and partitioning on Year of Date:

We will now look at the remaining steps involved in activating domain completion in a crosstab: 

  • 1. Duplicate the worksheet from the previous exercise, DC – Crosstab. Name the new worksheet DC – Crosstab II.
    Duplicate creates a new version of a worksheet, dashboard, or story that you can modify without affecting the original sheet.
  • 2. Right-click on Index and select Compute Using | Cell. Note that the view is sparsely populated.
  • 3. Select Analysis | Table Layout | Show Empty Columns. The view is now fully populated:

This exercise illustrates the following rules for deploying domain completion:

  • Given a crosstab with discrete dimensions on the Rows and Columns shelves, selecting Compute Using | Cell deactivates domain completion
  • Given a crosstab with discrete dimensions on the Rows and Columns shelves, selecting Analysis | Table Layout | Show Empty Rows/Columns activates domain completion

     Setting Compute Using | Cell may raise a question: what about the other Compute Using options, such as Table (across) and Table (down)? These options are actually all variations of partitioning and addressing.

Activating domain completion through view types

We will now look into activating domain completion through view types.

  • 1. Duplicate the worksheet from the previous exercise, DC – Crosstab II. Name the new worksheet DC – View Types.
  • 2. Remove Index from the Marks card and deselect Analysis | Table Layout | Show Empty Columns. The view is now sparsely populated.
    The result is a view without data densification showing only 165 marks.
  • 3. Change the Marks type from Automatic to Line. The view is now fully populated with more marks:
    The result is a view with data densification (domain completion is activated) showing 260 marks.
  • 4. Choose each view type option on the Marks card and observe which view types activate domain completion and which do not.

The preceding exercise illustrates the following rule for deploying domain completion:

  • Given a view with discrete dimensions on the Rows and Columns shelves, selecting the Line, Area, or Polygon view types from the Marks card activates domain completion

The usefulness of domain completion 

     Domain completion can be useful in many circumstances. In fact, you may have gleaned some uses from the previous exercises, even though they were designed merely for illustration purposes. The following exercise demonstrates using domain completion to display "No Data" for cells without a value in a sparsely populated view.

Labeling nulls

Let us look at the following steps to begin the exercise:

  • 1. Duplicate the worksheet from the previous exercise, DC – View Types. Name the new worksheet DC – Labelling Nulls.
  • 2. Adjust the duplicated worksheet so that the view type is set to Text. Also ensure that only Location Description and Year are deployed on the view. Be sure to leave Location Description on the Filters shelf so that a few locations are displayed. 
  • 3. Create a calculated field named No Data with the following code:
    IF  ISNULL( COUNT([Case Number]) )
        THEN 'No Data'
    ELSE
        'Data'
    END
  • 4. Place Ward and No Data on the Text shelf. Note that the text 'No Data' does not display:
  • 5. Place Index on the Detail shelf. Note that the text 'No Data' does display. The domain completion portion of the exercise is now complete but consider making the visualization more appealing by utilizing a shape:
    • Change the Mark type to Shape
    • Click Shape on the Marks card and select a suitable shape

Adjustments:

  • Make sure the entire Location Description field is displayed
  • Resize the shape
  • Center the shape
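The logic of the No Data calculated field can be sketched in plain Python. This is a minimal analogy, not Tableau's implementation: a cell is represented as a list of Case Number values, and an empty cell (where COUNT() would return null) gets the 'No Data' label.

```python
# A cell is a list of Case Number values; an empty cell means COUNT()
# would return null, which the calculated field labels 'No Data'.
def label_cell(case_numbers):
    count = len([c for c in case_numbers if c is not None])
    return "No Data" if count == 0 else "Data"

print(label_cell([]))         # No Data
print(label_cell(["HY123"]))  # Data
```

The key point from the exercise carries over: without domain completion the empty cells never exist in the view, so there is nothing for the label to attach to.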

     I hope this exercise has given you a practical example of how domain completion works. But what if you want to get rid of it altogether? We'll discuss this in the next section. 

Removing unwanted domain completion 

      After being made aware of domain completion, a user will no longer be confused when unwanted marks display in the view. But knowing how to remove those unwanted marks from the display can be a challenge. The following exercise shows a scenario of undesired domain completion and how to address the problem. The goal of the exercise is to display each year that a given crime has been reported, with an accompanying filter to adjust which crimes display based on the number of years of data available.

  • 1. In the workbook associated with this chapter, navigate to the DC - Year Count worksheet.
  • 2. Select the Chicago Crime dataset in the Data pane.
  • 3. Optionally, in order to get the exact same view as in Figure 6.13, place Description on the Filter shelf and select AGG PO HANDS NO/MIN INJURY.
  • 4. Place Description and Date on the Rows shelf, make YEAR(Date) discrete, and put Ward on the Text shelf. Format as desired. Note the missing value for 2001, as shown in the following screenshot. This is not an instance of data densification, since the dataset actually has a row for 2001, despite the value for Ward being empty in that year; therefore, 2001 has a null value.
  • 5. In order to remove the fields containing null values, click on the drop-down menu associated with SUM(Ward) and select Dimension. In the resulting dialog box, select Special | Non-null values.
  • 6. Create a Calculated Field named Year Count with the following code:
    SIZE()
  • 7. If you put the Description pill on the filters shelf in step 3, please remove it for the following steps.
  • 8. Place a continuous instance of Year Count on the Columns shelf (then convert it to discrete). Note that the resulting number, 6411, represents every column in the view: all dimensions are addressed by default, so domain completion is deactivated.
  • 9. Right-click on Year Count and select Compute Using | Date (Date is now addressed). It appears as if the scenario discussed in step 4 has returned (the dataset has a row for 2001 even though Ward is empty in that year, so 2001 has a null value). However, although the issue looks the same in the view, the underlying problem differs. Year Count is a table calculation and has caused domain completion: with discrete dimensions on the Rows and Columns shelves, utilizing a table calculation in which at least one dimension (but not all dimensions) is addressed activates domain completion. So this time we see one row for each year in the whole dataset, even though, for example, the combination of $300 AND UNDER and 2001 does not exist in the dataset. We see every year because of the table calculation Year Count:
  • 10. Right-click on Year Count and select Edit Table Calculation.
  • 11. In the resulting dialog box, select Specific Dimensions. Make sure that Description and Year of Date are both checked and, by dragging and dropping, placed in the order shown in Figure 6.17. Leave At the level at Deepest (Year of Date) and set Restarting every to Description. Sort order can remain as Specific Dimensions, and if you check the Show calculation assistance box, you will see yellow highlighting in the viz indicating the data points related to the selection. All dimensions are now addressed, so domain completion is deactivated.
  • 12. If you have more than one value for Year Count, complete the exercise by moving Year Count from the Columns shelf to the Filters shelf. Format as desired. This allows you to filter on dimensions with the same number of rows in the partition (here, 4): each year that a given crime has been reported is displayed, with an accompanying filter (Year Count) to adjust which crimes display based on the number of years of data available.
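The Year Count field uses SIZE(), which returns the number of rows in the current partition. The sketch below mimics that in plain Python under a simplifying assumption (one row per description/year combination, with made-up values), partitioning by crime description:

```python
from collections import Counter

# Assumption: one row per (crime description, year) combination.
rows = [("THEFT", 2001), ("THEFT", 2002), ("BATTERY", 2002)]

# SIZE() returns the number of rows in the current partition; here the
# partition is the crime description, so the result is the number of
# years each crime appears in.
partition_sizes = Counter(desc for desc, _ in rows)
year_count = {row: partition_sizes[row[0]] for row in rows}

print(year_count[("THEFT", 2001)])  # 2
```

Filtering on this value is the Python analog of moving Year Count to the Filters shelf: it keeps only crimes reported in a given number of years.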


     The relevant part of the preceding rule for this exercise is "utilizing a table calculation in which at least one dimension (but not all dimensions) is addressed activates domain completion." The domain completion occurred when first deploying Year Count, which is a table calculation.

     Upon changing the addressing and partitioning of Year Count so that all dimensions were addressed (that is, no dimensions were partitioned), the issue was resolved.

     Congratulations, you can cross domain completion off your "to learn" list. As mentioned in the introduction to this chapter, next in line is domain padding. You will want to know how to use domain padding because it helps you when working with dates and bins. 

Domain padding

     The second type of data densification is known as domain padding. It is fairly straightforward and, unlike domain completion, is discussed in the Tableau documentation. We will now consider how domain padding is deployed and when it's useful. 

Deploying domain padding

     You may recall that one of the ways to deploy domain completion is Analysis | Table Layout | Show Empty Rows/Columns. The same is true of domain padding, as illustrated in the following exercise, which demonstrates how to toggle domain padding on and off: 

  • 1. In the workbook associated with this chapter, select Data | New Data Source and connect to the Sample Superstore Excel workbook that ships with Tableau. It is located in My Tableau Repository | Datasources (for example, \OneDrive\Documents\My Tableau Repository\Datasources\2022.1\en_US-US).
  • 2. In the resulting instance of the Data Source page, double-click on the Orders and Returns tables. This will cause a relationship (https://blog.csdn.net/Linli522362242/article/details/124335986) to be created on the Order ID field.
  • 3. Name the data source Superstore – Returns.
  • 4. Navigate to the worksheet entitled DP – Show Missing Values and select the Superstore – Returns data source that was just created. 
  • 5. Place Ship Mode and State on the Rows shelf. Next, place Region on the Filters shelf and select West. Lastly, put SUM(Sales) on the Text shelf. Note that, for example, Wyoming only appears for the Standard Class shipping mode (Tableau simply hides the null rows):
  • 6. Select Analysis | Table Layout | Show Empty Rows (turning on domain padding). The empty values are caused by the filter:

     And you can see domain padding in action. All states are showing now, even the ones that were filtered out by Region: West. The filter simply removes the sales amounts but still shows all possible states.

     It was necessary to point to a live data source, such as the instance of Superstore that ships with Tableau, because using an extract would not, in this case, activate domain padding even if Analysis | Table Layout | Show Empty Rows was selected; the functionality is lost as soon as you switch to an extract (in my own testing, however, I did not encounter this situation).

The usefulness of domain padding

     Domain padding is often useful when working with dates with gaps. Such gaps occur when some dates have associated values and some dates do not. As shown in the following example, returns do not occur every day in the Superstore dataset. Since a visualization that displays dates with gaps could be confusing, it might be helpful to fill in those gaps. 

Using domain padding to fill in date gaps 

We will now try to fill in date gaps: 

  • 1. In the workbook associated with this chapter, navigate to the worksheet entitled DP – Filling Date Gaps.
  • 2. Select the Superstore – Returns data source that was created in the previous exercise.
  • 3. Place a discrete instance of MDY(Order Date) on the Columns shelf (use the Custom date options if needed) and place Sales on the Rows shelf. Note that every mark in the view is equally spaced, regardless of the length of time between dates:
  • 4. Right-click on MDY(Order Date) and select Show Missing Values (turning on domain padding).
  • 5. Right-click on SUM(Sales) on the Rows shelf and select Format. In the resulting format window, choose the Pane tab and select Marks: Show at Default Value. Note that the distance between marks is now based on the length of time between dates, and all dates with no value are displayed with Sales values of zero.
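The effect of Show Missing Values on a date axis can be sketched in plain Python (the dates and amounts below are made up): a sparse day-to-sales mapping is expanded onto the full daily range, with missing days defaulting to zero.

```python
from datetime import date, timedelta

# Sparse daily sales: two days with values, two days missing in between.
sales = {date(2022, 1, 1): 120.0, date(2022, 1, 4): 75.0}

start, end = min(sales), max(sales)
padded = {}
d = start
while d <= end:
    padded[d] = sales.get(d, 0.0)  # missing days default to zero
    d += timedelta(days=1)

print(len(padded))  # 4
```

Because date is a range-aware dimension, Tableau can enumerate the missing members itself, which is exactly what the loop over the full range does here.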

Problems of domain padding 

     Since domain padding can be toggled on or off through specific commands in Tableau (that is, Show Missing Values and Show Empty Rows/Columns), it's typically not a problem. There are a few scenarios, however, when domain padding may cause confusion, one of which is covered in the following example. 

From a domain-padded visualization to a crosstab

Let's have a look at creating a crosstab from a domain-padded visualization:

  • 1. In the workbook associated with this chapter, navigate to the worksheet entitled DP – From Viz to Crosstab.
  • 2. Select the Superstore – Returns data source.
  • 3. Right-click on Discount in the data pane and select Create | Bins.... In the resulting dialog box, choose a bin size of 0.05.
  • 4. Place the newly created Discount (bin) dimension on the Columns shelf.
  • 5. Right-click on Discount (bin) and ensure that Show Missing Values is selected (turning on domain padding).
  • 6. Right-click and drag the Discount field from the Data pane to the Rows shelf. Select CNT(Count) as the measure. Note that some of the bins have no values. For example, as shown in the following screenshot, the 0.35 bin has no associated value:
  • 7. Duplicate the sheet as a crosstab by right-clicking on the worksheet tab and selecting Duplicate as Crosstab.

         A crosstab (sometimes referred to as a Pivot Table) is a table that summarizes data in rows and columns of text. It is a convenient way to display the numbers associated with the data view.

    Duplicate as Crosstab : command inserts a new worksheet or worksheets into your workbook and populates the sheet with a cross-tab view of the data from the original worksheet.

    Note that Show Missing Values is still activated for Discount (bin):
  • 8. Complete the exercise by right-clicking on Discount (bin) and deselecting Show Missing Values:
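Bin-based domain padding can be sketched in plain Python (the discount values are made up to reproduce an empty 0.35 bin): compute a 0.05-wide bin for each value, then emit every bin from the smallest to the largest, including empty ones.

```python
def bin_index(value, size=0.05):
    # Work in integer hundredths to avoid floating-point binning errors.
    return int(round(value * 100)) // int(round(size * 100))

discounts = [0.0, 0.1, 0.1, 0.45]

counts = {}
for v in discounts:
    i = bin_index(v)
    counts[i] = counts.get(i, 0) + 1

# Pad the domain: emit every bin between the smallest and largest,
# including empty ones such as the 0.35 bin.
lo, hi = min(counts), max(counts)
padded = {i: counts.get(i, 0) for i in range(lo, hi + 1)}
labels = {i: round(i * 0.05, 2) for i in padded}

print(labels[7], padded[7])  # 0.35 0
```

Like dates, bins are range-aware, so the missing members between the minimum and maximum can be generated without any underlying rows.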

     Utilizing Show Missing Values for bins or dates is often helpful in a visualization but may not be helpful in a crosstab view. This is especially true if there are many rows or columns without values.

     A special case of data densification appears with cubes, which is why we will discuss them next. Even if you don't work with cubes now, I recommend you follow along to get a full picture of data densification in Tableau. 

Working with cubes 

     For the most part, Tableau's behavior is uniform across data sources. For example, the experience of working with an Access database and an Oracle database is very similar. Of course, different data sources will have their nuances but, in general, Tableau attempts to make working with different data sources a seamless experience. However, working with cubes (that is, multidimensional data sources) is quite different. Major differences include the inability to alias, create calculated fields on dimensions, create groups or hierarchies, change aggregation types, and generate extracts. 

As of the time of writing, the central cube-related article is located at https://help.tableau.com/current/pro/desktop/en-us/cubes.htm

     The preceding article provides detailed information and is worth studying in detail if you work with cubes. Provided with the article are many links and examples on how to reproduce typical Tableau behavior when working in a cube environment. 

     Although the Tableau Online Help provides many workarounds for these shortcomings, in this section, we want to mention the topic of using data blending to work more effectively with cubes. 

     In order to complete the following cube-related exercises, you must have access to Microsoft Analysis Services with an instance of the AdventureWorks cube (AdventureWorks is the sample cube that ships with Microsoft Analysis Services). Detailed instructions for installing SQL Server, Analysis Services, and the accompanying AdventureWorks cube are available at MSDN ( https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver15&tabs=ssms). A search engine query on hh403424 will also return the link to these instructions. 

Data blending for continuous months

     Typically, a cube includes one or more date hierarchies. When Tableau is connected to a cube, the members of a date hierarchy that display in the Data pane behave like strings. Thus, Tableau's built-in hierarchy capabilities that are usually available when working with dates in relational data sources are not available. This limitation can be partially overcome by coding Multidimensional Expressions (MDX) queries in Tableau, but to achieve all the Tableau date capabilities, a data blend is necessary.  

We will now look into data blending and cubes: 

  • 1. In the workbook associated with this chapter, navigate to the worksheet entitled Cube – Continuous Days.
  • 2. Connect to the AdventureWorks data source through Data | New Data Source | Microsoft Analysis Services (refer to the introduction to this section, Working with cubes, to learn how to install Analysis Services and the AdventureWorks cube: https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver15&tabs=ssms and https://docs.microsoft.com/en-us/analysis-services/multidimensional-tutorial/lesson-1-defining-a-data-source-view-within-an-analysis-services-project?view=asallproducts-allversions).
  • 3. In the dimension portion of the Data pane, expand Date and locate the Date field.
  • 4. Note that different versions of the AdventureWorks cube have slightly different naming conventions. So, don't worry too much about the naming convention in the figures.
  • 5. Right-click on Date and select Change Data Type | Date. Drag Date to the Columns shelf, and place Internet Sales Amount on the Rows shelf:
         The current viz looks a bit overwhelming. We have lots of data points and can't really see the difference between days. Let's continue by aggregating the data to a single value per month.
  • 6. Select the worksheet entitled Cube – Preparing Dates.
  • 7. Expand Date and then Calendar, and lastly Date.Calendar:
  • 8. Drag Calendar Year and Month to the Rows shelf. Also, drag Internet Sales Amount to the Text shelf. Note that including Internet Sales Amount forces every month to display.
  • 9. Select Worksheet Menu| Export | Crosstab to Excel.
  • 10. Adjust the Excel spreadsheet to look like the following screenshot. Be sure to replace the Internet Sales Amount column with a column named Blended Month, which should include first-of-the-month date information for each row, for example, 06/01/2011. Also, copy and paste Calendar Year so that every cell is populated:
  • 11. Save the Excel spreadsheet as AWDates.xlsx
    Besides, we can create AWDates.xlsx ourselves using a few Excel formulas:

    How to extract text before or after a space in Excel:

    # Get the first character of the Calendar Year header in A1 ("C"):
    =LEFT($A$1,1)

    # Find the position of the space in the Calendar Year header (returns 9):
    =FIND(" ",$A$1,1)

    # Extract the single character after the space (position 10, "Y"):
    =MID($A$1,FIND(" ",$A$1)+1,1)

    # Extract the year from the mm/d/YYYY date in C2:
    =YEAR(C2)

    # Concatenate the pieces into a "CY 2011"-style Calendar Year label:
    =CONCATENATE(LEFT($A$1,1), MID($A$1,FIND(" ",$A$1)+1,1), " ", YEAR(C2))

    # Convert the date in C2 to a "June 2011"-style month label:
    =TEXT(C2,"mmmm YYYY")

  • 12. In Tableau, select the worksheet entitled Cube – Continuous Months.
  • 13. Select the AdventureWorks ( AdventureWorksMultiDimensional ) dataset and drag Internet Sales Amount to the Rows shelf.
  • 14. Add AWDates.xlsx as a new dataset.
  • 15. Select Data | Edit Data Relationships and set the relationships so that Date.Date.Calendar.Calendar Year blends to Calendar Year and Date.Date.Calendar.Month blends to Month:

    Note: this blending operation cannot be done in Tableau Desktop alone; we need to use Microsoft Analysis Services.
  • 16. In the Data pane, select the AWDates data source and blend on Calendar Year and Month by clicking the gray broken chainlink icons.
  • 17. Right-click and drag Blended Month to the Columns shelf and select the continuous Month option:
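The Excel preparation of AWDates can be sketched in Python as well. This is a minimal sketch assuming, as the Excel formulas above do, that the header cell holds the text "Calendar Year" and that the cube's members use "CY 2011"-style year labels and "June 2011"-style month labels:

```python
from datetime import date

# A1 holds the column header "Calendar Year"; C2 holds a
# first-of-the-month date (the Blended Month column).
header = "Calendar Year"
blended_month = date(2011, 6, 1)

# =CONCATENATE(LEFT($A$1,1), MID($A$1,FIND(" ",$A$1)+1,1), " ", YEAR(C2))
initials = header[0] + header[header.index(" ") + 1]
calendar_year = f"{initials} {blended_month.year}"

# =TEXT(C2, "mmmm YYYY")
month_label = blended_month.strftime("%B %Y")

print(calendar_year, "|", month_label)  # CY 2011 | June 2011
```

The point of both versions is the same: the blended columns must match the cube's member formats exactly, or the blend will not link.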

     In steps 1-5, we demonstrated a simple way to create a view with a continuous date when connected to a cube. No MDX code was necessary. However, these same steps also reveal the shortcomings of working with a cube, including that many of the choices normally available for dates in Tableau are not accessible. This is a problem that cannot be overcome by restructuring the cube or writing MDX code.

     The remaining steps in the exercise demonstrate how this problem can be overcome through data blending. Although the data blend created in the exercise only provides month-level granularity, it would not be difficult to include day-level granularity. The end result is a demonstration that the date capabilities normally expected in Tableau can be made available when connected to a cube. Furthermore, if the blended data source remains small, the impact on performance should be negligible. 

     The last missing piece for working with cubes is the data blending for hierarchies, aliasing, and grouping. Follow along with the steps in the next section to understand how Tableau handles those for you. 

Data blending for hierarchies, aliasing, and grouping 

     MDX can be used to provide some of the functionality normally available in Tableau that is otherwise missing when connected to a cube. For example, although you cannot create groups in Tableau when accessing a cube data source, MDX can be passed to the cube by using a calculated member to create groups. Instructions for how to do this are provided here: https://help.tableau.com/current/pro/desktop/en-us/cubes.htm. Similarly, it is possible to use MDX to create hierarchies or to alias dimension members, but that requires knowledge of MDX that most Tableau authors do not possess. 

     This exercise will demonstrate how to use data blending to accomplish hierarchies, aliasing, and grouping when accessing a cube: 

  • 1. Create an Excel spreadsheet named Aliasing, with the following data:
    Group          Hemispheres
    Europe         Eastern Hemisphere
    North America  Western Hemisphere
    Pacific        Eastern Hemisphere
  • 2. In the workbook associated with this chapter, navigate to the Cube –Hierarchy worksheet.
  • 3. If you have not already added the AdventureWorks dataset, connect through Data | New Data Source | Microsoft Analysis Services (see the introduction to this section, Working with cubes, to learn how to install Analysis Services and the AdventureWorks cube). See also the Microsoft Docs article Install SQL Server Analysis Services and this walkthrough: https://www.youtube.com/watch?v=90DVrlzl8bs
  • 4. Select the AdventureWorks dataset and place Sales Amount on the Text shelf.
  • 5. Add the newly created Excel spreadsheet, Aliasing, as a data source.
  • 6. In the Aliasing source, blend on Group by clicking the gray broken chainlink icon next to the field Group:
  • 7. Create a hierarchy on Group and Hemispheres by selecting both in the Data pane, right-clicking, and selecting Hierarchy | Create Hierarchy. Name the hierarchy as desired.
  • 8. Make Hemispheres the first dimension in the hierarchy by dragging it above Group:
  • 9. Place Hemispheres and Group on the Rows shelf:
  • 10. Create a worksheet called Cube – Aliasing and Grouping.
  • 11. Select the AdventureWorks dataset and place Internet Sales Amount on the Text shelf.
  • 12. Select the Aliasing dataset, right-click on Group, and select Duplicate. Name the duplicate field Regional Groups.
  • 13. Right-click on Regional Groups and select Aliases.... Adjust the aliasing so that Europe is aliased as EMEA, Pacific is aliased as AsiaPac, and North
    America is left unchanged.
  • 14. Blend on Group by clicking the gray broken chainlink icon.
    Note: for the blend to take effect, the linking field Group (marked with the chain-link icon) must also be used in the view, for example on the Rows shelf. This field-in-view requirement is a key difference between data blending and a relationship.
     Typically, dimensions are not enabled for secondary data sources in a data blend; however, in the preceding case, since the secondary data source is blended on Group and there is a one-to-many relationship between Hemispheres and Group, both of these fields can be used as dimensions in the view. Nonetheless, for unrestricted use of dimensions, always use the primary data source. 

     One point of interest in the preceding exercise is the necessity of creating Regional Groups in the secondary data source by duplicating Group. This was required to successfully alias the dimension members. Attempting to alias members directly within Group breaks the data blend, as shown in the following screenshot: 

     This section has shown how we can leverage hierarchies with cube data sources by combining cube fields with fields from a secondary data source. We needed to duplicate the Group field in order to alias it, but otherwise we were able to show that aliasing and grouping are possible with cubes. For example, you might have products stored in one dimension of the cube and countries in another; Tableau will allow you to combine them with grouping.

     Are cubes here to stay? Maybe not. The advent of big data solutions and the continued evolution of the Relational Database Management System (RDBMS) may make cubes obsolete. Cubes pre-aggregate data, which can make certain analytic and reporting needs very quick; however, as RDBMS and big data solutions become faster, easier to query, and more sophisticated, the pre-aggregation paradigm may become obsolete. Some argue that cubes are already obsolete. Also, cubes have serious drawbacks, including a lack of performance advantage for non-aggregated data (which is typical at lower levels of granularity), challenges with more complex analytic needs, difficulty mastering the MDX language, and the fact that many BI vendors do not cater to cubes. This last point is true of Tableau. Although Tableau can be used with cubes, there are challenges to overcome. In contrast, the topic we will discuss next, working with big data, is very straightforward.

Tableau and big data 

     Perhaps the first challenge of big data is defining it adequately. It's a term so widely used as to be almost meaningless. For example, some may refer to data exceeding 1,048,576 rows as big data (which is the row limit in Excel 2010 and 2013), while others would only apply the term to datasets in the multiple-petabyte range. Definitions found on Wikipedia (https://en.wikipedia.org/wiki/Big_data) and Webopedia are so broad as to encompass both of these examples. True, it is probably simplistic to consider data that merely exceeds Excel's row limitation as big data; nevertheless, from the perspective of an individual for whom Excel is the traditional data-processing application, the preceding definitions fit.

     Rather than try to provide an adequately narrow definition of what is essentially a buzzword, this section will primarily focus on one aspect of big data: massively parallel processing. However, before we begin, let's consider a couple of housekeeping items. First, when I have been asked about Tableau and big data, the intent of the question has invariably been about Tableau's performance when working with large datasets. Since Chapter 13, Improving Performance, is dedicated to Tableau's performance, this section will not address performance thoroughly. Second, for the user who works predominantly or exclusively with Excel, exceeding the row limit is a real problem for which a solution may prove helpful. We'll cover this next.

Addressing Excel's row limitation

     As mentioned in the section Tableau and big data, exceeding Excel's row limitation should not be considered big data. Nevertheless, that limitation can be an issue, and telling a Tableau author to use a database is often not helpful. A Tableau author may indeed want to utilize a database but may not have direct access. For example, the data that resides in the database may be exported through a web application and then imported into an Excel workbook that Tableau utilizes as a data source. If this process is performed weekly or monthly and each import involves 10,000+ rows, it won't be long before the Excel row limit is hit. In the next exercise, we will see one strategy for addressing this issue. 

     Note that this exercise is Windows-centric, but similar tasks can be accomplished in a macOS environment. Follow these steps to learn how to work around Excel's row limitation:

  • 1. In the workbook associated with this chapter, navigate to the dashboard entitled Excel Row Limit. That dashboard includes a link (https://github.com/PacktPublishing/Mastering-Tableau-2021) to GitHub where you can find the files necessary to complete this exercise. 
  • 2. Download the files, then open Spreadsheet_2 in Excel or a text editor and remove the header, that is, the first row. Save the spreadsheet as a .csv file.
  • 3. Place copies of the two files, Spreadsheet_1 and Spreadsheet_2, in a new directory, for example: I:\Tableau\1Mastering-Tableau-2021-main\New DIrectory
  • 4. Open Command Prompt. This can be quickly done by pressing Windows + R and then entering cmd.
  • 5. In Command Prompt, type the following, then press Enter:
    cd [filepath to the newly created directory]
    for example: cd I:\Tableau\1Mastering-Tableau-2021-main\New DIrectory
  • 6. In Command Prompt, enter the following, then press Enter:
    for %f in (*.csv) do type "%f" >> output.csv

    In the current directory, this simple FOR loop iterates over every file ending with .csv and appends its contents to a file called output.csv (see https://www.online-tech-tips.com/free-software-downloads/combine-text-files/ for other ways to combine text files). Note that output.csv itself matches the *.csv pattern, so delete it before rerunning the command; otherwise the combined file may be appended to itself.

    Note that the resulting .csv file (that is, output.csv ) will not open successfully in Excel but, as shown in the following steps, it can be used as a data source in Tableau. 
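The Command Prompt loop above is Windows-only. As a hedged, cross-platform alternative, here is a minimal Python sketch that does the same job: it concatenates every .csv file in a directory into output.csv. Unlike the manual approach in step 2, this sketch keeps only the first file's header automatically, so no file needs to be edited by hand. The function name and file layout are illustrative, not from the book.

```python
# Cross-platform sketch: concatenate all .csv files in a directory into
# output.csv, keeping only the first file's header row.
import glob
import os

def combine_csvs(directory, output_name="output.csv"):
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    out_path = os.path.join(directory, output_name)
    header_written = False
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in paths:
            # Skip the output file itself so it is never read into itself.
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue
            with open(path, "r", encoding="utf-8") as f:
                lines = f.readlines()
            if not lines:
                continue
            if header_written:
                lines = lines[1:]  # drop this file's duplicate header row
            else:
                header_written = True
            out.writelines(lines)
    return out_path
```

Because the duplicate headers are dropped in code, the resulting output.csv can be opened in Tableau directly, just like the file produced by the FOR loop.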

  • 7. In Tableau, press Ctrl + D to open a new data source, and in the resulting window, select Text file:

  • 8. Connect to the output.csv data source. 
  • 9. Place Number of Records on the Text shelf and observe that the total equals the number of rows in Spreadsheet_1 (849,616 after subtracting the header row) plus the number of rows in Spreadsheet_2:
  • This exercise could also be replaced by Tableau's built-in union feature, which automatically appends tables that share the same structure. This is described in detail in All About Data – Joins, Blends, and Data Structures (https://blog.csdn.net/Linli522362242/article/details/124335986). Note that in step 2 we removed the header from Spreadsheet_2, so it is best to add it back before trying the union approach.

     The scheduling of wildcard unions through the command line is available in Tableau Prep. You can find the documentation here: https://help.tableau.com/current/prep/en-us/prep_save_share.htm#commandline

     Talking about big data goes hand in hand with parallel processing. To make it easier for you to understand big data platforms and hence how you can build high-performing dashboards, we will take a quick turn into the world of massively parallel processing. 

Massively parallel processing 

     Big data may be semi-structured or unstructured. The massively parallel processing (MPP) architecture structures big data to enable easy querying for reporting and analytic purposes. MPP systems are sometimes referred to as shared-nothing systems. This means that data is partitioned across many servers (otherwise known as nodes) and each server processes queries locally.

Let's explore MPP in detail using the following diagram as a point of reference:

Please see the following explanation of the diagram:

  • 1. The process begins with the Client issuing a query that is then passed to the Master Node.
  • 2. The Master Node contains information, such as the data dictionary and session information, that it uses to generate an execution plan designed to retrieve the needed information from each underlying Node.
  • 3. Parallel Execution represents the implementation of the execution plan generated by the Master Node.
  • 4. Each underlying Node executes the unique query it has received and then passes the results to the Master Node. Each Node is actually a standalone server with its own RAM, disk space, and operating system.
  • 5. The Master Node assembles the results, which are then passed to the Client.
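The five steps above can be sketched in a few lines of code. This is a toy model, not a real MPP engine: each "node" is just a local data partition (shared-nothing), the master fans the same aggregate query out to every node in parallel, and then assembles the partial results. All data values and node counts here are made up for illustration.

```python
# Toy model of the MPP scatter-gather flow described above.
from concurrent.futures import ThreadPoolExecutor

# Shared-nothing layout: each node holds only its own (year, sales) rows.
node_partitions = [
    [("2008", 120), ("2008", 80)],   # node 1
    [("2009", 200)],                 # node 2
    [("2008", 50), ("2009", 30)],    # node 3
]

def node_query(partition, year):
    """Runs locally on one node: sum sales for the given year."""
    return sum(value for y, value in partition if y == year)

def master_query(year):
    """Master node: fan the query out in parallel, then assemble results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda p: node_query(p, year), node_partitions)
    return sum(partials)
```

The key property the sketch illustrates is that no node ever reads another node's data; only small partial results travel back to the master, which is what makes the architecture scale by simply adding nodes.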

  • On the plus side, MPP systems are easily scalable. Once the initial architecture is set up, adding additional hard drive space and processing power can be as easy as adding additional servers.
  • On the downside, MPP systems can be costly to implement—requiring thousands or even tens of thousands of servers, along with associated failover requirements and highly skilled, costly labor for support.

     For Tableau and big data work, it is very important to know the partitions that your IT team has put in place for parallel processing; these might be, for example, year, month, and day columns. If you filter on the same columns that are used as partitions in the database, your Tableau dashboard will perform much better than if it queries on an arbitrary field in the dataset. So:

  • First, put filters in place, preferably on the partition columns.
  • Then build your dashboard on a subset of the data.
  • Once you are all done, load the data you need and check the performance again.
  • More information regarding performance is given in Chapter 13, Improving Performance.
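Why does filtering on a partition column help so much? The sketch below illustrates partition pruning with a hypothetical year-partitioned dataset: a filter on the partition column lets the engine scan only the matching partition, while a filter on any other field forces a full scan of every partition. The data layout and function are invented for illustration only.

```python
# Illustration of partition pruning: data is partitioned by year.
partitions = {
    "2007": [{"month": 1, "crimes": 300}, {"month": 2, "crimes": 280}],
    "2008": [{"month": 1, "crimes": 310}],
}

def rows_scanned(year_filter=None):
    """Return how many rows the engine must read for a query."""
    if year_filter is not None:
        # Prune: only the one matching partition is scanned.
        rows = partitions.get(year_filter, [])
    else:
        # No partition filter: every partition must be scanned.
        rows = [row for part in partitions.values() for row in part]
    return len(rows)
```

In a real warehouse the same principle applies at a far larger scale, which is why a Tableau filter on the partition column can turn a multi-terabyte scan into a scan of a single partition.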

     As a result of the expense, many organizations opt for a cloud-based solution, such as Amazon Redshift or Google BigQuery. Amazon Redshift uses an MPP system, while Google BigQuery uses tree architecture, which is a little different but still takes advantage of MPP techniques.

Building a visualization with Google BigQuery

     BigQuery is one big data solution among many. We have chosen it for this section because readers can test it for free with a Gmail account, and because it requires no installation, which makes it a good platform for an exercise in this book.

     In order to build a visualization with Google BigQuery, you will need to first set up access to BigQuery. The following exercise will point you in the right direction. Once you have set up access to BigQuery, you will be able to connect to the BigQuery sample datasets. In the remainder of the exercise, you will build a visualization while connected to BigQuery.

     Assuming you have a good internet connection, the performance will likely exceed what you experience when working with a local copy of an extracted data source of a similar size.

Let's have a look at how we can use Google BigQuery in our Tableau dashboard:

  • 1. Log into your Google account, navigate to https://cloud.google.com/bigquery/ , and follow the provided instructions to try BigQuery for free.
  • 2. In the workbook associated with this chapter, navigate to the BigQuery worksheet.
  • 3. Press Ctrl + D to connect to a data source. In the resulting window, select Google BigQuery and, when prompted, provide your Gmail login information.
  • 4. In the Data Source page, choose the publicdata project, the samples dataset, and the natality table.
  • 5. The natality table provides birth demographics for the United States from 1969 to 2008:
  • 6. In the BigQuery worksheet Data pane, double-click natality(Count).
  • 7. From the Data pane, drag Month to the Columns shelf, and set it to Discrete.
  • 8. Right-click on the y axis, natality(Count), and select Edit Axis. Deselect Include Zero.
  • 9. Format as desired. This visualization displays the number of infants born in each month from 1969 to 2008:

     Knowing the fields on which the big data engine is partitioned will help a lot when facing performance issues. If, for example, your data is partitioned by Year, always try to make use of this field in filters and calculations. Chapter 13, Improving Performance, contains more information on this topic. 

     As you can see, BigQuery allows us to visualize datasets containing millions of records quickly and easily. 

Summary 

     We began this chapter with a discussion of data densification and discovered that there are two types of data densification: domain completion and domain padding. When reviewing these two types of data densification, we learned how each can be deployed, when each is useful, and when each can be problematic. Next, we learned how to work with cubes. We discussed the Tableau Online Help articles as well as how to use data blending with cubes. Finally, we explored big data. We also surveyed MPP, and walked through an example of how to use Tableau to connect to Google BigQuery.

     In the next chapter, we will explore another functionality: level of detail calculations. Although table calculations remain an important part of everyday Tableau usage, we will discover how level of detail calculations can easily accomplish tasks that were previously only possible through complex table calculations.
