Saturday, July 25, 2009

Extracting, Transforming, and Loading Data

After designing the relational data warehouse to support the needs of the business, the next step is to identify the data sources, extract the data, transform data as needed, and then load the data into the relational warehouse. This process is generically called extraction, transformation, and loading, or ETL. There are many ETL tools available and while most can connect to all the common data sources, the real differences lie in their abilities to control the flow of data and provide transformations to the data.

Data transformation is required because data stored in disparate source systems may be stored in different formats, with different codes, or using different data types. For example, one system might store a Yes as 1 and a No as 0, while another system may store the strings 'Y' and 'N.' If data is pulled from two systems and a Yes/No field is needed, then the data must be transformed so that it is consistent in the warehouse. Data consistency is one of the cornerstones of a data warehouse; all data must be in the same format for analysis to be effective.

Other issues arise when considering items such as currencies. In a global business, there are often source systems in each country recording receivables, and normally the business systems store these receivables in the local currency. If a business has branches in London, Paris, Tokyo, and New York, it's likely that data is stored in Pounds, Euros, Yen, and US Dollars. In building the warehouse, sales measured in Yen cannot simply be added to sales in US Dollars in order to get non-European sales. Instead, the data must be transformed by applying some form of currency conversion to the data before storing it in the warehouse. Whether this is done by using a day-to-day exchange rate table, the average exchange rate over a period of time, or any other method, is strictly up to the business. Therefore, business requirements often come into play during the design phase of the ETL process.

While this section on ETL will be relatively brief, there is one point that is absolutely essential: ETL often consumes the majority of the effort in building a data mart or warehouse. Some projects spend up to 80 percent of the total time in the ETL phase. This is due to a number of factors, including the complexity of identifying and extracting back end data, applying complex transformations to data to meet business rules, handling data for slowly changing dimensions, and working with dirty data. Most warehousing projects continue to tweak and maintain their ETL process long after the warehouse is in production; be sure to include ongoing maintenance in any project and budget projections.

No comments:

Post a Comment