The Tableau Data-Handling Engine
By Data Science Online Training, posted 06/25/19 00:28:06. Category: education
The preceding diagram shows that the DHE consists of a run time module (VizQL) and two layers of abstraction (Metadata and Connection). Let's begin at the bottom of the graphic by considering the first layer of abstraction, Connection.
The most fundamental aspect of the Connection is a path to the data source. The path should include attributes for the database, tables, and views, as applicable. The Connection may also include joins, custom SQL, data-source filters, and more. In keeping with Tableau's philosophy that data access should be easy for business users to modify and adjust (see the previous section), each of these aspects of the Connection is easily modifiable. For example, an author may choose to join an additional table or modify a data-source filter. Note that the Connection does not contain any of the actual data. Although an author may choose to create a data extract based on data accessed through the Connection, that extract is separate from the Connection.
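To make that separation concrete, here is a minimal, purely illustrative Python sketch (the class and attribute names are hypothetical, not Tableau's actual internals): a Connection describes how to reach the data and can be modified freely, but holds no rows itself.

```python
from dataclasses import dataclass, field

# Hypothetical sketch (not Tableau's real API): a Connection holds only
# *how* to reach the data -- path, joins, filters -- never the data itself.
@dataclass
class Connection:
    database: str
    tables: list                                     # tables/views involved
    joins: list = field(default_factory=list)        # (left, right, key) triples
    custom_sql: str = ""
    data_source_filters: list = field(default_factory=list)

conn = Connection(
    database="WorldIndicators",
    tables=["World Indicators$"],
    data_source_filters=["Year >= 2000"],
)

# Adjusting the Connection (here, adding a join) changes only the
# definition -- no data moves until a query or extract is run.
conn.joins.append(("World Indicators$", "Regions$", "Country"))
print(len(conn.joins))  # 1
```

Any extract built from this definition would live alongside it, not inside it, which is why swapping or editing the Connection is cheap.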
The next layer of abstraction is the Metadata layer. Its most fundamental aspect is the determination of each field as a measure or a dimension. When connecting to relational data, Tableau makes the measure/dimension determination based on heuristics that consider the data itself as well as the data source's data types. Other aspects of the metadata include aliases, data types, defaults, roles, and more. Additionally, the Metadata layer encompasses author-generated fields such as calculations, sets, groups, hierarchies, bins, and so on. Because the Metadata layer is completely separate from the Connection layer, it can be reused with other Connections; that is, the same metadata definitions can be applied to different data sources.
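Tableau's actual heuristics are proprietary, but the idea can be sketched as follows. This is an assumption-laden illustration (the function name and specific rules are hypothetical): both the declared data type and the data itself inform whether a field defaults to dimension or measure.

```python
# Rough sketch of a measure/dimension heuristic. Tableau's real rules are
# not public; this only mirrors the idea that data type AND content matter.
def classify_field(name, dtype, sample_values):
    """Return 'dimension' or 'measure' for a relational field."""
    if dtype in ("string", "date", "boolean"):
        return "dimension"
    # Numeric fields named like identifiers usually label categories.
    if any(hint in name.lower() for hint in ("id", "code", "year")):
        return "dimension"
    # Low-cardinality integers often behave like categories too.
    if dtype == "integer" and len(set(sample_values)) <= 12:
        return "dimension"
    return "measure"

print(classify_field("Infant Mortality Rate", "float", [2.3, 4.1]))  # measure
print(classify_field("Country Code", "string", ["US", "FR"]))        # dimension
```

Whatever the default, the next section shows that an author can override it per worksheet without touching the Metadata layer at all.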
VizQL is generated when a user places a field on a shelf. The VizQL is then translated into Structured Query Language (SQL), Multidimensional Expressions (MDX), or Tableau Query Language (TQL) and passed to the backend data source via a driver. The following two aspects of the VizQL module are of primary importance:
- VizQL allows the author to change field attributions on the fly
- VizQL enables table calculations
Let's consider each of these aspects of VizQL via examples:
Changing field attribution example
An analyst is considering infant mortality rates around the world. Using data from http://data.worldbank.org/, they create the following worksheet by placing AVG(Infant Mortality Rate) and Country on the Columns and Rows shelves, respectively. AVG(Infant Mortality Rate) is, of course, treated as a measure in this case:
Next, they create a second worksheet to analyze the relationship between Infant Mortality Rate and Health Exp/Capita (that is, health expenditure per capita). In order to accomplish this, they define Infant Mortality Rate as a dimension, as shown in the following screenshot:
Studying the SQL generated by VizQL to create the preceding visualization is particularly insightful:
SELECT ['World Indicators$'].[Infant Mortality Rate] AS [Infant Mortality Rate],
AVG(['World Indicators$'].[Health Exp/Capita]) AS [avg:Health Exp/Capita:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Infant Mortality Rate]
The GROUP BY clause clearly communicates that Infant Mortality Rate is treated as a dimension. The takeaway is to note that VizQL enabled the analyst to change the field usage from measure to dimension without adjusting the source metadata. This on-the-fly ability enables creative exploration of the data not possible with other tools and avoids lengthy exercises attempting to define all possible uses for each field.
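The shape of that query is easy to reproduce outside Tableau. The following sketch uses Python's built-in sqlite3 with simplified stand-in table and column names (not the actual World Indicators schema) to show the same pattern: the field used as a dimension appears in the GROUP BY clause, while the measure is aggregated within each group.

```python
import sqlite3

# Stand-in data: country, infant mortality rate, health expenditure per capita.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE indicators (country TEXT, infant_mortality REAL, health_exp_capita REAL)"
)
con.executemany(
    "INSERT INTO indicators VALUES (?, ?, ?)",
    [("A", 5.0, 100.0), ("B", 5.0, 300.0), ("C", 9.0, 50.0)],
)

# Infant mortality used as a DIMENSION: it drives GROUP BY, and the measure
# (health expenditure per capita) is averaged within each group.
rows = con.execute(
    """SELECT infant_mortality, AVG(health_exp_capita)
       FROM indicators
       GROUP BY infant_mortality
       ORDER BY infant_mortality"""
).fetchall()
print(rows)  # [(5.0, 200.0), (9.0, 50.0)]
```

Used as a measure instead, infant_mortality would move inside an aggregate function and drop out of the GROUP BY, which is exactly the switch VizQL makes on the fly.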
You can view the code that Tableau generates and passes to the data source via the Performance Recorder, which is accessible via Help | Settings and Performance | Start Performance Recording. See Improving Performance for additional details.
Table calculation example
In the following example, note that Sales on the right axis is set to Quick Table Calculation | Moving Average. Calculating a moving average, running total, or other such comparison calculation can be quite challenging to accomplish in a data source. Not only must a data architect consider which comparison calculations to include in the data source, but they must also determine the dimensions for which these calculations are relevant. VizQL greatly simplifies such challenges via table calculations:
The relevant portion of SQL generated by the preceding worksheet shows that the table calculation is not performed by the data source. Instead, it is performed in Tableau by the VizQL module:
SELECT SUM([Orders$].[Sales]) AS [sum:Sales:ok],
DATEADD(month, DATEDIFF(month, 0, [Orders$].[Order Date]), 0) AS [tmn:Order Date:ok]
FROM [dbo].[Orders$] [Orders$]
GROUP BY DATEADD(month, DATEDIFF(month, 0, [Orders$].[Order Date]), 0)
To reiterate, nothing in the preceding call to the data source generates the moving average. Only an aggregated total is returned and Tableau calculates the moving average via VizQL.
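That division of labor can be sketched in a few lines of Python (the numbers are hypothetical monthly aggregates; the window logic approximates a trailing moving average that shrinks at the start of the partition, where fewer values are available):

```python
# Sketch of the VizQL-side step: the data source only supplies monthly
# SUM(Sales); the moving average is computed locally, after the query returns.
monthly_sales = [100.0, 200.0, 150.0, 250.0, 300.0]  # hypothetical aggregates

def moving_average(values, window=3):
    """Trailing moving average over `window` points, using however many
    values are available near the start of the series."""
    result = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        result.append(sum(values[start:i + 1]) / (i - start + 1))
    return result

print([round(v, 2) for v in moving_average(monthly_sales)])
# [100.0, 150.0, 150.0, 200.0, 233.33]
```

Because this step runs in Tableau rather than in the database, the data architect never has to pre-build a moving-average column for every dimension an analyst might explore.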
This overview of the Tableau Data-Handling Engine (DHE) demonstrates a flexible approach to interfacing with data. Knowledge of the DHE is helpful for understanding the parameters for Tableau data readiness. Two major takeaways from this section are as follows:
- It is not necessary to explicitly define data types and roles for optimal Tableau usage
- Comparison calculations such as moving averages and running totals can be addressed by table calculations in Tableau, and thus do not need to be calculated in advance
Knowledge of these two takeaways can reduce data preparation and data modeling efforts, and thus help streamline the overall data mining lifecycle.
We began this chapter with a discussion of the Tableau Data-Handling Engine (DHE). This illustrated the flexibility Tableau provides in working with data. It is important to understand the DHE in order to ensure that data-mining efforts are intelligently focused; otherwise, effort may be wasted on activities that are not relevant to Tableau.
Next, we discussed data-mining and knowledge-discovery process models with an emphasis on CRISP-DM. The purpose of this discussion was to get an appropriate bird's-eye view of the scope of the entire data-mining effort. Tableau authors (and certainly end users) can become so focused on the reporting produced in a deployment that they forget or short-change the other phases, particularly Data Preparation.
Our last focus in this chapter was on the phase that can be the most time consuming and labor intensive, namely, Data Preparation. We considered using Tableau for surveying and cleansing data. The data-cleansing capabilities represented by the regular expression functions are particularly intriguing.