Thursday, August 20, 2009

Banking Data Warehousing

Large U.S. retail banks are building data warehouses and centralizing disconnected data marts. But warehouse-supported marts are proliferating, promising continued challenges to achieving a single enterprise customer view.

Despite widely reported disappointments, and some outright failures, of data warehouse initiatives, Gartner research shows that a data-warehouse-based architecture is the architecture of choice for customer information analysis and decision support among large U.S. retail banks. In a survey of U.S. retail banks with deposits of more than $1 billion, more than half of those with deposits of more than $4 billion said they use a data warehouse, or a data warehouse with associated data marts, with the percentage even higher for the very largest banks. In our research, the smaller banks (deposits between $1 billion and $4 billion) were more likely to say they use a series of unlinked data marts or rely on an operational customer information file (CIF) for analysis and decision support.

What to install to use Data Mining

SQL Server 2005 comes with many components. This document describes what components are necessary to perform data mining once you have the SQL Server 2005 beta.

Analysis Services

Analysis Services is the only required component to install on the server. If you want to do data mining against existing SQL Server 2000 databases or other data sources (DB2, Oracle, Access, etc.), this is the component you need to install.

Reporting Services

Install Reporting Services if you want to be able to create reports that work against your data mining models.

SQL Server Database Services

You only need to install the SQL Server relational engine if you want to use it as a data source, or if you want to use the Data Mining samples and tutorials.

Data Transformation Services

Installing DTS causes the DTS service to be installed on your server, allowing scheduled packages to run. Install this if you want to use the integrated data mining tasks and transforms on your server.

Workstation Components, etc.

Install the Workstation Components, etc., on any client machine that will be used for creating mining models, authoring reports and DTS packages, or managing Analysis Services. The Workstation Components work equally well when installed on the same machine as the server.

Advanced

You will need to click the Advanced button to install the samples and sample databases. All the samples are located under Client Components/Documentation and Samples. The pertinent samples and databases to install are:

Databases:

AdventureWorksDW Sample Data Warehouse (requires that you install SQL Server Database Services)

AdventureWorks Sample OLAP

Samples:

Analysis Services Samples

Data Transformation Services Samples

Note that installing Samples installs the Samples installation packages. Links to run these are found under Start/Microsoft SQL Server 2005/Install Samples.

Data Mining Dimensions

What is a data mining dimension?

A DM dimension is a dimension with a special parent-child hierarchy that's based on relationships discovered in your data by applying data mining, as opposed to a regular dimension where the hierarchies are user-defined. For example, you might discover interesting groups of customers by building a mining model that applies the Microsoft_Clustering algorithm on demographic data in your Customers dimension. A DM dimension based on this mining model can be used to browse your customer sales data and slice it by the customer groups found by the mining model.
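As a rough, hypothetical sketch (not from the original post): if the same demographic attributes lived in a relational Customers table, a comparable clustering model could be defined directly in DMX as shown below. The column names and cluster count are illustrative only; in the OLAP scenario described above, the model would instead be built from the Customers dimension through the Data Mining Wizard.

CREATE MINING MODEL [CustomerSegments]
(
    [CustomerKey]     LONG KEY,           -- case key
    [Age]             LONG CONTINUOUS,    -- hypothetical demographic inputs
    [YearlyIncome]    DOUBLE CONTINUOUS,
    [NumberCarsOwned] LONG DISCRETE
)
USING Microsoft_Clustering (CLUSTER_COUNT = 8)

Once such a model is processed, each customer is assigned to one of the discovered clusters, which is exactly the kind of grouping a data mining dimension exposes as a browsable hierarchy.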

How do I create and use a data mining dimension?

When you build a mining model based on an OLAP cube using the Data Mining Wizard in Business Intelligence Development Studio, the last dialog in the wizard allows you to create an associated data mining dimension, as well as a new cube that links to the measure groups in the source cube and includes the DM dimension. When you browse the new cube, you can slice the data in the original cube using the new hierarchy discovered by the mining model.

You can also create a data mining dimension (and a cube that uses it) outside of the Data Mining wizard by selecting an existing OLAP mining model in the mining model editor and picking "Create a Data Mining Dimension" from either the Mining Model menu or the context (right-click) menu.

How does it work?

A data mining dimension is processed with a data source view that points to a DMX query, which fetches data from an OLAP-specific view of the source mining model's content. You can run this query yourself (substituting the name of your OLAP mining model) to see what it returns:

SELECT * FROM [<mining model name>].DIMENSION_CONTENT

As part of the data mining dimension processing, a special index is built that maps cases in the mining model's source OLAP dimension to members in the data mining dimension (which represent a hierarchical view of nodes in the mining model content). This index is used when querying fact data using the data mining dimension.

The data mining dimension and its source mining model have to reside in the same Analysis Services database.

Which algorithms support data mining dimensions?

You can build data mining dimensions based on OLAP mining models that use the Microsoft_Decision_Trees, Microsoft_Clustering, Microsoft_Association_Rules or Microsoft_Sequence_Clustering algorithms. In addition, third-party plug-in algorithms may choose to support data mining dimensions.


Predicting future steps when you do not have enough history

No History? No Worries!

Say you’re launching a new product and you want to predict what sales might look like in the next few months. Classical time series prediction does not work in this scenario because you don’t have historical sales data for the product. However, new SQL Server 2008 extensions to the Microsoft_Time_Series algorithm and DMX allow you to easily apply the patterns from another similarly-behaving time series to solve this conundrum.

Predicting Post-IPO Stock Values for VISA

In this example, we illustrate how to use the PredictTimeSeries function with the REPLACE_MODEL_CASES parameter to obtain predictions for a time series for which we do not have enough historic data. The power of this combination comes into play when we have a time series with too few historic data points to build a realistic model, but we know that the series follows a pattern similar to another time series for which we do have enough historic data to build a model.
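In general form (hypothetical model, column and input names; the concrete query for the Visa example appears later in this post), the DMX for this technique looks roughly like this:

SELECT
    PredictTimeSeries([SalesModel].[Sales], 10, REPLACE_MODEL_CASES)
FROM [SalesModel]
NATURAL PREDICTION JOIN
    (SELECT 1 AS [TimeStamp], 120.0 AS [Sales]
     UNION
     SELECT 2 AS [TimeStamp], 135.5 AS [Sales]) AS t
-- [SalesModel], [Sales] and the two input rows above are placeholders for illustration

The rows supplied after NATURAL PREDICTION JOIN temporarily replace the trailing cases of the model, and the model's patterns are then used to forecast the next values of the new series.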

Here’s an Excel 2007 workbook that has 73 historic data points representing post-IPO daily closing values for the MasterCard stock and just 8 corresponding values for Visa (since we’re doing this 8 days after the Visa IPO). Our goal is to use the MasterCard stock history to derive better predictions for the Visa stock.

We will use the SQL Server 2008 Data Mining Client Add-in for Excel to build and query the time series model:

1. Make sure you have Excel 2007 with the DM Client add-in installed.

2. Save the workbook with the MasterCard/Visa stock data to your local disk and open in Excel 2007.

3. To create a model for the MasterCard stock data on the first worksheet, click on the “Data Mining” tab and select the “Forecast” task.

4. Select “Next” on the first page of the forecast wizard “Getting Started with the Forecast Wizard”. (Note: This page might not appear if you previously selected the option to skip the welcome page.)

5. On the second page, “Select Data Source”, select the table containing the MasterCard data and click the “Next” button.

6. On the “Forecasting” page, select the time stamp column to be the first column, named “TimeStamp”.

7. In the input columns grid, de-select the “TimeStamp” column and select the “MasterCard” column, then click “Next”.

8. On the last page of the wizard, rename the structure “MasterCardStructure” and the model “MasterCardModel”, leave the default selections to browse the model after it is created and to allow drill through, and click “Finish” to end the wizard and proceed to build the model.
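Behind the scenes, the add-in defines a time series mining structure and model and trains it from the worksheet data. As a rough sketch of the equivalent DMX (not shown in the original post; the data source and source table names are assumptions):

-- Define the model; the wizard does this for you
CREATE MINING MODEL [MasterCardModel]
(
    [TimeStamp]  LONG KEY TIME,             -- time key
    [MasterCard] DOUBLE CONTINUOUS PREDICT  -- series to forecast
)
USING Microsoft_Time_Series
WITH DRILLTHROUGH

-- Train it from the source data (data source and table names are hypothetical)
INSERT INTO [MasterCardModel] ([TimeStamp], [MasterCard])
OPENQUERY([MyDataSource], 'SELECT [TimeStamp], [MasterCard] FROM [MasterCardHistory]')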

The MasterCard model historic data and the first 10 predicted values are illustrated in the following graph:

Now, use the same steps to create a time series model for the Visa stock using the 8 historical data points on the second workbook sheet. You will see right away that the model will not generate meaningful predictions due to the lack of sufficient historic data points. The VisaModel historic data and the next 10 predicted values are illustrated in the following graph:

Better Predictions Using REPLACE_MODEL_CASES

A better approach is to use the knowledge that the Visa and MasterCard stocks have a similar pattern and to use the model built for MasterCard to obtain predictions for the Visa stock values. Here’s how (again using the Data Mining Client Add-in for Excel):

1. Select the “Query” task from the “Data Mining” ribbon and click "Next" on the introductory page.

2. Select the “MasterCardModel” model and click the “Advanced” button.

3. On the “Data Mining Advanced Query Editor” page, click the “Edit Query” button and select “Yes” in the dialog warning that “Any changes to the query text will not be preserved when you switch back to the design mode.”

4. Type the following query:

SELECT
    (SELECT $Time, [MasterCard] AS [Visa]
     FROM PredictTimeSeries([MasterCardModel].[MasterCard], 10, REPLACE_MODEL_CASES)) AS Predictions
FROM [MasterCardModel]
NATURAL PREDICTION JOIN
    (SELECT 1 AS [TimeStamp], 64.35 AS [MasterCard] UNION
     SELECT 2 AS [TimeStamp], 62.76 AS [MasterCard] UNION
     SELECT 3 AS [TimeStamp], 64.48 AS [MasterCard] UNION
     SELECT 4 AS [TimeStamp], 66.11 AS [MasterCard] UNION
     SELECT 5 AS [TimeStamp], 69 AS [MasterCard] UNION
     SELECT 6 AS [TimeStamp], 75.1 AS [MasterCard] UNION
     SELECT 7 AS [TimeStamp], 82.75 AS [MasterCard] UNION
     SELECT 8 AS [TimeStamp], 82.86 AS [MasterCard]) AS t

5. Click “Finish” and select the results of the query to be copied into a new worksheet.

The results should look like this:

When the REPLACE_MODEL_CASES parameter is used, the PredictTimeSeries function returns the requested number of predictions, obtained by replacing the last historic points of the given model with the new values provided in the query. In our case, the last 8 data points of the MasterCardModel are replaced with the values we generate on the fly using the SELECT and UNION statements in the input set specified after the NATURAL PREDICTION JOIN keywords. The MasterCardModel equations are then used to predict the next 10 values for the Visa stock series.

To see the power of this method, we can compare the predictions obtained using the MasterCard model (Predictions.Visa) with the predictions generated by the VisaModel model, which was built using only the limited set of 8 Visa stock values (Predictions.Visa2). The results are illustrated in the following graph:

So there you go - you have a new tool in your arsenal when you don’t have enough data to make accurate time series predictions. Enjoy!


Wednesday, August 19, 2009

Quick Visualization of irs.gov Search Queries

Here is a quick visualization I did in honor of April 15th to investigate what people are looking for on tax day…

This “query tree” shows the most frequent searches starting with the term “irs”. Each branch in the tree represents a query where the words are sized according to frequency of occurrence. I like how you can see at a glance what the most popular tax forms are by following the “irs tax form …” branch. Apparently form 8868, Application for Extension of Time To File, is in high demand.

It was created by uploading the search queries from AOL users that led to clicks on irs.gov during spring 2006 to Concentrate, which generated the query tree. This image is a snapshot of an interactive Flash visualization in Concentrate, where the focus term was “irs”. Looking at query patterns like this can help you get an idea of what people are looking for and how to better organize your site so they can find it quickly.

The interactive flash visualization was developed by Chris Gemignani using Flare with some input from Zach Gemignani and myself and inspiration from the Many Eyes WordTree.

The raw data is from the released AOL Search data sample, and consists of the subset of unique queries leading to clicks on irs.gov from March to May 2006. The IRS queries used to make the visualization can be downloaded here: irs.gov.queries.csv (191K)

Here are the top queries in the file:

Query Searches
irs 4787
irs.gov 2282
www.irs.gov 1975
internal revenue service 1154
irs forms 608
tax forms 361
irs tax forms 196
internal revenue 158
taxes 142
wheres my refund 139
federal tax forms 125
irs refunds 106

Research on Data Mining

It’s time for some holidays at DMR. I will be back in the blogosphere at the beginning of August. In the meantime, here are some links that may interest you:

Hope to see you soon on Data Mining Research.

Tuesday, August 11, 2009

SAS tutorial - generation of a sample business datawarehouse scenario

This tutorial shows how to use SAS to implement an ETL process that generates a star schema data warehouse.

We assume that you already have basic SAS/BASE knowledge and are familiar with the SAS environment, assigning libraries, and running SAS programs.

The aim of this tutorial is to generate a data warehouse structure that helps monitor the performance of a sample business scenario described here:
A manufacturing company Data Warehouse Business Scenario: requirements from a palm and tropical plants nursery that is implementing an analysis and reporting system to track sales, costs, forecasts and business performance management figures.

Tutorial overview:
- The first step will be to read the dimensions and populate sample dimension data.
- Then a fact table will be created.
- In the next step we will randomly generate three years of sales transactions for the fact table. To generate the numbers we will use SAS random number generators (uniform and other distributions).
- The final tasks will be to validate and extract the generated data and feed the reporting application.

Please also be aware that SAS is a very powerful and flexible system, and the things we show in this tutorial can be done in many different ways. This is just one way to get the expected results.

SAS tutorial chapters


  • 1. Load extracts into SAS - A couple of programs responsible for loading the dimension extracts into SAS and an example of how to dynamically create additional dimension tables
  • 2. Populate dimensions - Example of how to populate random dimension keys in a fact table
  • 3. Generate measures - Random generation of a given set of measures. The measures are generated randomly, however they follow the business rules described in the business scenario
  • 4. Sales fact table - In this lesson we create a fact table with sales figures designed in a star schema data warehouse architecture. Additionally, we perform a statistical analysis of the newly generated data
  • 5. Costs fact table - We randomly generate a fact table with costs. Costs are divided into fixed and variable costs and allocated at the year and month level of detail
  • 6. SAS ETL Process - Run the whole process in a sequence which may be considered a representation of an ETL process in SAS. The process could also be set up in SAS ETL Studio or SAS Warehouse Administrator

  • Header and trailer structured textfile processing in SAS - Example from the Data Warehousing Tutorial

  • Data Warehousing ETL tutorial

    The ETL and Data Warehousing tutorial is organized into lessons representing various business intelligence scenarios, each of which describes a typical data warehousing challenge.
    This guide might be considered as an ETL process and Data Warehousing knowledge base with a series of examples illustrating how to manage and implement the ETL process in a data warehouse environment.

    The purpose of this tutorial is to outline and analyze the most widely encountered real-life data warehousing problems and challenges that need to be addressed during the design and architecture phases of a successful data warehouse project deployment.

    The DW tutorial shows how to feed data warehouses in organizations operating in a wide range of industries.
    Each provided topic is thoroughly analyzed, discussed and a recommended approach is presented to help understand how to implement the ETL process.

    Going through the sample implementations of the business scenarios is also a good way to compare BI and ETL tools and get to know the different approaches to designing the data integration process. It also helps identify the strong and weak points of various ETL and data warehousing applications.

    This tutorial shows how to use the following ETL and datawarehousing tools: Datastage, SAS, Pentaho, Cognos and Teradata.


    Data Warehousing & ETL Tutorial lessons

    Etl Tools Info portal

    ETL-Tools.Info portal provides information about different business intelligence tools and datawarehousing solutions, with a main focus on ETL process and tools. On our pages you will find both general articles with high-level information on various Business Intelligence applications and architectures, as well as technical documents, with a low-level description of the presented solutions and detailed tutorials.
    Great attention is paid to the Datastage ETL tool, and we provide a number of Datastage examples, Datastage tutorials, best practices and resolved problems with real-life examples.
    There is also a wide range of information on the rapidly growing Open Source Business Intelligence (OSBI) market, with emphasis on applications from the Pentaho BI family, including a Pentaho tutorial.
    We also provide a SAS Guide with tutorial, which illustrates the vision of SAS on Business Intelligence, Data Warehousing and ETL process.
    We have recently added a new ETL case study section (an ETL course with examples) which presents a set of business cases, each of which illustrates a typical data warehousing problem. We analyze the cases thoroughly and propose the most efficient and appropriate approach to solving those problems by showing sample ETL process designs and DW architectures.
    Microsoft users may be very interested in exploring our Excel BI crosstabs section with FAQ and sample solutions.

    What is Business Intelligence?

    Business intelligence is a broad set of applications, technologies and knowledge for gathering and analyzing data for the purpose of helping users make better business decisions.
    The main challenge of Business Intelligence is to gather and serve organized information regarding all relevant factors that drive the business, and to enable end users to access that knowledge easily and efficiently, in effect maximizing the success of the organization.

    Business intelligence produces analyses and provides in-depth knowledge about performance indicators such as a company's customers, competitors, business counterparts, economic environment and internal operations, to help make effective, good-quality business decisions.

    From a technical standpoint, the most important areas that Business Intelligence (BI) covers are:

    Data Warehouse, Data Mart, Data Mining, and Decision Support Resources

    Featured Resources

  • First Place Learning : Data Warehouse, Data Mart, Data Mining, and Decision Support
  • Additional Resources

  • KDNuggets : Data Mining and Knowledge Discovery Resource center ****
  • Alacrity, Inc. : Integrated Data Intelligence Software
  • Allen, Davis, and Associates : Data Warehousing Career Newsletter
  • AlphaBlox : Data Analysis Software
  • AltaPlan : OLAP Links
  • Aonix : Object Oriented Modeling Tool and Cleansing Software
  • Attar Software : Data Mining / Neural Nets
  • Bill Inmon : Leading Data Management and Data Warehouse Speaker and Writer
  • Brio Technologies : Brio Web Warehouse and Decision Support Suite
  • Bull : Data Warehousing Solutions
  • Business Intelligence : The OLAP Report -- Richard Creeth
  • Business Objects, Incorporated : WebIntelligence for enterprise decision support
  • Cognos, Incorporated : Data Warehousing Software Tool Suite
  • CDI : Creative Data Inc : Data Warehouse Consulting and Training -- good Data Warehouse Links
  • D2K, Incorporated : "Turning Data into Knowledge"
  • DataFlux : Data Quality and Integration Software
  • DataMirror : Data Integration, Data Protection, Data Audit Solutions
  • datawarehouseconsulting.com : Data Warehouse Consulting
  • datawarehousing.com : Data Warehousing Portal
  • Data Warehousing Institute : Conferences and whitepapers
  • Decision Point Applications : Packaged Data Warehouse Solutions
  • Decision Technology : DecisionCentric® Server
  • Decision Works : Data Warehouse Consulting and Education
  • Decision Works Ltd : TRACK Objects software and consulting
  • Dimensional Insight, Inc. : Reporting and analysis software
  • DM Review : Leading Data Management Industry Publication
  • Don Meyer & Associates : Data Warehousing Consultants
  • DW Soft : Data Warehouse Software and Service using Microsoft Repository
  • Epsilon Data Management : Databased Marketing Services and Training
  • Evolutionary Technologies, Inc. : ETI*EXTRACT(r) Tool Suite for Data Warehousing and Data Migration
  • FileTek : Software for managing massive amounts of atomic data
  • First Logic : Customer Data Management Software
  • Hyperion : ESSBASE - High Speed OLAP Processor
  • IBM : DataGuide
  • Informatica : The Data Mart Company
  • Information Builders : Data management software
    Data Warehousing, Decision Support, Middleware, Data Access, ...
  • Intelligent Solutions, Inc. : Claudia Imhof / Data Warehousing and Data Modeling
  • IRI : CoSORT ETL Software
  • Kalido : Software for adaptive enterprise data warehousing and master data management
  • Kenan Systems Corporation : Market Analysis Software
  • Megaputer Intelligence : Data Mining and Warehousing
  • Micro Strategy : Relational OLAP (ROLAP) Software and Services
  • MiningCo : Data Mining plus excellent data management articles
  • Nautilus Systems, Inc. : Data Warehousing, Data Mining, and Data Visualization software
  • netcarve Technologies GmbH : Data Warehousing and Data Mining Solutions
  • NetScheme Solutions Inc.
  • Open Technologies, Inc. : Data Warehouse Consulting and Recruiting Specialists
  • OSMC : Data Warehousing Consultants
  • Paralogic, Inc. : Data Warehousing and Decision Support Consulting / Lexington, MA
  • Perl.Com : Perl is a scripting language useful for extracting and loading data
  • Pervasive : Data Integration
  • Pilot Software : Customer and Market Data Analysis Software
  • Poinpoint Solutions Inc. : Data Warehousing solutions for the insurance industry
  • PLATINUM technology, inc. : Data management software and consulting
  • Princeton Softech : Database Management and Data Warehouse Software
  • Query Object Systems Corp : Business Solution Components
  • Ralph Kimball Associates : A pioneer and leader in the Data Warehouse field
  • Redbrick Systems : Multidimensional database software
  • Redbrick Whitepaper : Server Requirements
  • Retek : Data Warehousing for Retail Industry
  • Rocket Software : Business Intelligence
  • Rulequest Research : Data Mining Tools
  • saleslobby.com : Whitepaper - Building the Customer Data Warehouse
  • Salford Systems : CART software for tree-structure, non-parametric data analysis
  • SAS Institute : Data Warehouse and Data Mining Software
  • Seagate Software : Crystal Reports
  • Silvon Software : Supply Chain Data Warehousing
  • SolutionsIQ : Data Warehousing Solutions
  • Speedware Corp : Business Intelligence Software
  • Sybase, Inc. : Data Warehousing Database Software
  • Teleran : Data Warehousing and eCommerce Solutions
  • Teradata / NCR : Database machine
  • Software AG : Data modeling and data warehouse course outlines
  • Thinking Machines Corporation : Data mining software for loyalty management systems
  • Trillium Software : Data Cleansing and Data Reengineering
  • Universal Data Solutions, LLC : Len Silverston - Data Modeling and Data Warehouse - Coauthor of 'The Data Model Resource Book'
Thursday, August 6, 2009

    Datasets for Association Rule Mining

    A normal transaction consists of a transaction id and a list of items in each row or line. Sometimes the items are represented as boolean values: 0 if the item is not bought, 1 if it is. But the commonly used format for market basket data is simply numeric item identifiers, without any other information:
    1 3 5 9 11 20 31 45 49
    3 7 11 12 15 20 43...
    This format has to be converted before it can be used by ARMiner and ARtool, since those tools can only evaluate binary data; both tools come with a special converter for that purpose, which has to be run before analyzing the data. WEKA needs a special ASCII data format (*.arff) containing information about the attributes and a boolean representation of the items. Since there is no single common input format, it is impossible to evaluate the same dataset in one format with different tools. In this paper, we present a dataset generator that is able to generate datasets readable by ARMiner, ARtool, WEKA and other data mining tools. Additionally, the generator can produce large market basket datasets with timestamps to simulate transactions in both retail and e-commerce environments.

    Datasets for Market Basket Analysis

    Since market basket analysis is an important tool for improved marketing, many software solutions are available for this purpose. Business tools like Microsoft SQL Server 2005 or IBM DB2 Data Warehouse focus on the analysis of relational business data. Software tools which are available for free download, like ARMiner, ARtool, or WEKA, are more dedicated to research: they not only analyze data but also give additional information on the effectiveness of the algorithms applied. So, in order to generate data that can be used by those tools, we have to investigate which kinds of datasets can be generated.
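    As a rough illustration of the relational approach mentioned above (a sketch added here, not part of the original text), a market basket model in SQL Server 2005 could be defined in DMX along the following lines; the model, table and column names and the parameter values are purely hypothetical:

    CREATE MINING MODEL [MarketBasket]
    (
        [TransactionID] LONG KEY,          -- one case per shopping basket
        [Items] TABLE                      -- nested table holding the items bought
        (
            [Item] TEXT KEY
        )
    )
    USING Microsoft_Association_Rules (MINIMUM_SUPPORT = 0.01, MINIMUM_PROBABILITY = 0.40)

    The nested Items table plays the role of the item list in the transaction formats described above, with one row per item bought in a given transaction.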