Saturday, July 25, 2009

Data Mining with C# and ADO.NET - ' Creating a Mining Model '

Data mining is used for prediction analysis by many big companies. The technology lets a retailer use collections of data gathered from customers with similar purchases to predict, for instance, which items a male smoker aged 25-30 will buy at his local store. The information gathered (and later mined) is contained in something called a trained data model, about which you'll learn more in this article.

Ostensibly, you could build all the tools yourself, but you don't need to do so. To help companies fish out data and make predictions on trained data models, Microsoft came up with a special query language for data mining, called DMX. The DMX query language uses a simplified wrapper of the ADO object model to browse and manipulate data that's contained in an SQL Server 2000/2005 Analysis Services database.

Most of the time, you create mining models using the built-in wizards in SQL Server Analysis Services. Your web applications will most often be used to glean prediction results from these trained mining models, and to report it back to the user. In this article, I introduce you to DMX usage with ADO.NET by connecting to an OLE-DB provided for SQL Server Analysis Services.

Note: I assume you have some knowledge of SQL Server, so I only touch on Analysis Services as it pertains to DMX queries and the ADO objects the queries use for data access.

At first glance, this type of database is similar to a relational database with rows and keys, but the schemas for describing the data is far different. I don't need to go into any major details on data types, or the different data mining algorithms you can apply to a model; these subjects could span an entire book. I'll just explain these things as they pertain to the examples shown in this article.

The best way to demonstrate this code is to create, train, and do predictions against a new mining model. You can use this code as a starting point for your own applications.

Creating a Mining Model

The first thing to do is include a reference to the OLE DB library for ADO.NET:

<%@ Import Namespace="System.Data.OleDb%>

Next, set up a connection to MS Analysis Services 2005 (MSAS):

OleDbConnection conn = new OleDbConnection();

conn.ConnectionString =
"Provider=MSOLAP.3; Data Source=localhost; " + "Initial Catalog=MyCatalog";
conn.Open();

To get this code working through a Web page, you need to grant the ASPNET user permissions to the catalog through MSAS.using Management Studio. Also, you cannot use an IP address for the DataSource property of the connection string. MSAS will give you an error, as it requires a machine name.

Databases in Analysis Services are often referred to as catalogs. In this example, we connect to a database named MyCatalog.

Next, we use a DMX Create statement to create a new mining model. The purpose of this model is to predict the percentage of customers who will choose a certain member card, based on some of their demographics. The DMX statement for model creation is similar to SQL's version for creating a table:

String CreateModel = "Create mining model MemberCard_Prediction" +

"(" +
"CustomerID long key," +
"Gender text discrete," +
"Age long continuous," +
"Profession text discrete," +
"Income long continuous," +
"Houseowner text discrete," +
"MemberCard text discrete predict" +
")" +
"Using Microsoft_Decision_Trees";
OleDbCommand CMD = new OleDbCommand(CreateModel, conn);
CMD.ExecuteNonQuery();

In the statement above, after the declaration of the data type for each column, we also add a content type. A content type, such as Continuous or Discrete, tells the algorithm we're applying to the mining model (in this example, Microsoft Decision Trees) how the data in the columns is to be distributed. Numbers, for example, are usually a continuous distribution, such as Age or Income. Names and demographics, such as Profession and Gender, are discrete; that is, there is no pattern. The last line of the Create statement uses the Predict keyword on the Membercard column, telling the algorithm that all other columns will predict the outcome of this column for the model.

Training a Mining Model

Training the model consists of the data mining algorithm testing input cases and looking for correlations in the data. Once these correlations are identified, the model is repopulated with these new patterns.

Model processing starts over as new data is introduced into the model; over time, the patterns are further revised, making predictions more precise.

To populate the model with data, we use the DMX Insert statement:

String PipeDataToModel = "INSERT  INTO MemberCard_Prediction"

+ "(CustomerId, Gender, Age, Profession, Income, HouseOwner, MemberCard)"
+ "OpenQuery(customerdbsource,"
+ "'Select CustomerId, gender, age, profession, income, houseowner, membercard
FROM customers')";
OleDbCommand CMD = new OleDbCommand(PipeDataToModel, conn);
CMD.ExecuteNonQuery();

You can see by this query that we're creating a roadmap between a table called Customers in a database on a SQL Server, defined by the datasource customerdbsource, and our mining model. The model is a reflection of that table (Customers), column for column, so it is easy to populate.

OpenQuery is a DMX function for performing DMX queries against relational databases from inside an OLE-DB session connection. The relational datasource (datasource to a relational database) is created inside MSAS for that catalog. The most common way is to use Business Intelligence Studio to create datasources and to manage your MSAS project. After the data is piped into the model, you can use your algorithm of choice to do test cases and to identify patterns.

No comments:

Post a Comment