Tuesday, July 7, 2009

Text Mining and Regular Expressions

I've been spending quite a lot of time in the bowels of a text mining project recently, mostly in the text/concept extraction phase. We're using the SPSS Text Mining tool for the work so far. (As a quick aside, the text mining book I've enjoyed reading the most in recent months is the one by Weiss, Indurkhya, Zhang, and Damerau.)

The most difficult part of the project has been that all of the text is highly customized lingo, a language of its own, as it appears in the notes sections of the documents we are reading. Therefore, we can't use the typical linguistic extraction techniques, and are instead relying heavily on regular expressions. That certainly takes me back a few years! I used to use regular expressions mostly in shell programming (Bourne shell, C shell, Korn shell, and later Bash).
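To give a flavor of what regex-based concept extraction can look like, here is a minimal sketch in Python rather than in the SPSS tool. The abbreviations, the concept names, and the `extract_concepts` helper are all invented for illustration; they are not taken from the project's notes.

```python
import re

# Hypothetical shorthand-to-concept patterns. The real lingo in the notes
# is not shown in this post, so these are made-up examples.
PATTERNS = {
    "callback_requested": re.compile(r"\bcb\s*req(?:uested|d)?\b", re.IGNORECASE),
    "left_voicemail": re.compile(r"\blft?\s*v/?m\b", re.IGNORECASE),
    "no_answer": re.compile(r"\bn/?a\b", re.IGNORECASE),
}


def extract_concepts(note):
    """Return the sorted names of all concepts whose pattern matches the note."""
    return sorted(
        name for name, pattern in PATTERNS.items() if pattern.search(note)
    )


if __name__ == "__main__":
    print(extract_concepts("Cust cb req 7/7, lft vm"))
```

Each concept gets its own compiled pattern, so adding a new piece of shorthand only means adding one dictionary entry. The word-boundary anchors (`\b`) keep short abbreviations from matching inside longer words.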

I must say it has been very productive, though it also makes me appreciate the language rules that our notes don't follow in any consistent way. As I am able, I'll post more specifics on this project.

Regarding books on regular expressions, I found the Unix books weren't very good on this topic. However, O'Reilly's Mastering Regular Expressions is quite good.
