Thursday, February 27, 2014

Bridge tables and the purist's connundrum

Bridge tables are sometimes an essential requirement for dimensional modeling. Although bridge tables may not be the most ideal solution, certain situations warrant their usage. Simply put, bridge tables are required when a fact row might correspond to more than one row in the dimension table. There are however many cases where this could occur. Ragged hierarchies, multivalued attributes, many-to-many dimensions are some of the main scenarios where bridge tables are required. Another common pattern across bridge tables is that they are used when there is frequent changes or a possibility of frequent changes to the information contained in the dimension tables. Of these examples we will focus on the scenario of multivalued dimensions.

After choosing the grain of  the fact table, we sometimes discover that a single fact table row could contain multiple rows in the dimension table. In such situations, we "snowflake" the dimension and connect a bridge table to it. Let us take the example where there are multiple suppliers for an item. In order to build the bridge, we require to create a supplier "group" dimension table which contains a separate supplier group primary key and attributes of the group. After this, we link the supplier group bridge table to this newly created supplier group dimension containing all possible supplier key values for the specific supplier group key. The supplier group also links to a supplier dimension containing attributes of the supplier. The bridge creation process is complete. It should be noted that the same creation process can be applied for a many-to-many dimension relation.

Bridge tables relieve much complexity from the dimensional schema, but add some complexity to the query. From a dimensional modelling purists point of view however, this is not the preferred solution. In an ideal situation, all dimension tables must be connected to the fact table, and not other dimension tables. A purist therefore proposes that we create an artificial fact table resulting in a multi-star schema model (Reference 1). In our above example, we will have a fact table called item_supplier_fact containing a row for every supplier-item. Any calculations and derived attributes pertaining to this grouping will be placed in the new artificial fact table.


References:

1) Darmawikarta, D. (2009, December). Solving Many-to-Many Relationship in Dimensional Modeling. Retrieved from PACKT Publishing: http://www.packtpub.com/article/solving-many-to-many-relationship-dimensional-modeling


Thursday, February 13, 2014

Type of fact tables and their applications

We will now look at the different types of fact tables used for building data warehouses . There are three major fact table types that we will discuss today: transaction snapshot, periodic snapshot, and accumulating snapshot fact table. Each of these three kinds of fact tables are used for different purposes. Generally more than one type of fact table is used in conjunction with another to obtain a "holistic business picture".

Transaction fact tables record information at the lowest possible grain level: the transaction. Transaction fact tables are therefore more densely populated than the other types. They also capture the most amount of information pertaining to a particular transaction. A date key is associated with a transaction fact table and it represents the specific date on which the transaction was "processed" or "completed". Given below is an example of a basic e-commerce order fact table.



Periodic snapshot fact tables record information over a period of time. They are one of the two kinds of time-based fact tables that can be implemented. Periodic snapshots could aggregate information over a day, week, month, or even year. Since data is aggregated over a period of time, this fact table is clearly of a higher grain level than the transaction fact table. Periodic snapshot is useful to obtain summary level performance of a business process over a period of time to drive decision making. But to get complete information at the transaction level we will still need to couple this data with a transaction fact table.
Periodic snapshots are also unique because they contain semi-additive facts. Semi-additive facts can be aggregated across some dimensions but not across others; the latter is most often a time dimension. An example of a semi additive fact would be the final state of inventory in a warehouse at the end of the week. It would not make sense to add the inventory levels across different weeks but it would make sense to add this fact across different warehouses at the end of a particular week.




Accumulating snapshot fact tables are the third type of fact tables. They can be described similar to transaction facts tables or periodic snapshot fact tables depending on the way we look at them. They are similar to transaction tables because every row contains information of a single event and are similar to periodic snapshot tables because they record information over the entire lifetime of the event. Accumulating snapshots require multiple updates on the same row for every "milestone"  of the event. As a result, a single row has multiple date keys to capture the milestones for the event. The different stages of a particular customer order with dates corresponding to order entry, shipment processing, order execution and shipment completion could be an example for the e-commerce scenario.
Although accumulating snapshots have multiple date key facts on a single row, they still reference a single date dimension table which acts as a role-playing dimension.

With this blog piece hopefully we should have a better understanding of the different types of fact tables and their possible applications

Thursday, February 6, 2014

Delving deeper into grain, facts, and dimensions

I will focus on grain, facts, and dimensions in this blog, since these are important topics that provides the foundation to the rest our BI learning.

Choosing the right grain level:

As discussed in the earlier blog, the grain measures the business process or event in question. Choosing the grain is critical because it determines the level of analyses that you can do with the data in your data warehouse (DW). As a result, it is unwise to begin the design by selecting a higher level of detail as your grain. Storing the most atomic level possible will allow the application to "roll up" data or summarize based on requests from the end users. For example, in terms of time period of the business process, if we choose the grain as a month, it would not be possible for business users to select custom time frame represented by a particular number of days. Selecting the grain as a day-level transaction data however provides the ability to customize for this purpose. Therefore selecting the right grain is the first critical step in the design of a DW.

Constructing dimensions and facts:

Allocating the right attributes to the facts and dimensions tables is another important aspect of a correct DW design. As discussed in the earlier blog there can be attributes which can be ambiguous in terms of facts and dimensions. Simply put, facts are items which we can quantify. They can be categorized as: additive, non-additive, and semi-additive. The vast majority of facts that we encounter are completely additive. Non-additive facts are facts that cannot be aggregated under any dimension (because they don't make sense through aggregation). Examples of non-additive facts may be numerical attribute, such as unit price, that make sense only from the grain or transaction perspective. Ratios are classified as non-additive facts. There may also be facts that are additive only under certain dimensions and hierarchies of the model. These facts are called semi-additive facts. We also have a fact type called derived facts which are not explicitly present in the source operational database but are computed and stored in the fact table for convenience purpose. Derived facts avoid business users the hassle of having to compute the value through a query during runtime. Through derived facts, the risk of committing calculation errors is also eliminated.

Although we have said that numerical attributes would most probably be categorized as facts, what would we do in case the attribute is numerical but represents a classification? For example: 1- Freshman, 2- Sophomore, 3-Junior, 4-Senior. In this case, the attributes come under the dimension table. Attributes that are used to categorize items or attributes that describe the 'who', 'what', 'when', 'where', 'how, and 'why' can be placed under dimensions (Note: Attributes which are both categorical and quantifiable should be stored in both the fact and dimension tables). Dimensions often comprise of fields that are often used to "filter" the final result set. For dimension tables, the primary key is often not the natural key that is obtained from the operational source database. The source database stores only one record for changes over time. The source natural key may also be poorly administered. As a result, a new dimension surrogate key is created as the primary key of the dimension table.

Hopefully this blog piece further helps in the development of a BI dimensional model.