Rube Goldberg Keyword Analysis Part 2

It has been a while since Part 1, and at this point, we have a series of scripts in place that are chugging away, autonomously extracting data from our source systems and uploading key text elements into a database table. The data gathering routines are stable and running on a daily basis without errors, doing the work so we do not have to. Now that we have our grand table of text data that grows with each passing day, it is time to do something with it. This post is going to demonstrate a very simple keyword analysis that identifies text elements that contain certain words.

The simplest way to code a keyword analysis is to create a script that embeds our favorite keywords directly into a filter or computed field:

DEFINE FIELD WhoaLookAtThis COMPUTED
AS 'WhoaLookAtThis'
'BRIBES!!!' IF FIND('BRIBE', UPPER(TextField))
'Maybe a bribe?' IF FIND('FACILITATE', UPPER(TextField))
'Definitely a bribe…or a purchase of lubricants' IF FIND('GREASE', UPPER(TextField))
''
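
Once the field exists, surfacing the hits is a one-liner. This is a sketch using standard ACL filtering; the field name comes from the definition above:

SET FILTER TO NOT ISBLANK(WhoaLookAtThis)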

And that wraps up this post. Thanks for reading.

…Not so fast. This post is supposed to channel Rube Goldberg – it is time to add some complexity to our process. First off, I do not like embedding data into my scripts, and a list of keywords we want to identify is more data than it is script. If I want to add, remove, or change my list of keywords, I need to change the script. My philosophy on scripting is that once my scripts are submitted to production, I do not want to touch them again. Doing so risks introducing errors into the script. Thus, I need a way to update my keyword list without having to change a single line of code.

Time to channel Rube Goldberg.

What are some other options? We could save the list of keywords in a text file. Text files are easy to store and maintain, and if we include the file in our version-controlled script library, we can even force updates to go through our change management processes. When we perform a keyword analysis, we simply import the list of keywords, generate a computed field (or filter) from that list, and extract the matches.
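
A rough sketch of that import step is below. The IMPORT DELIMITED parameters here (encoding, separator, field name, and width) are illustrative placeholders and would need adjusting to your actual file, so treat this as a starting point rather than a known-good recorded command:

IMPORT DELIMITED TO Keywords "Keywords.fil" FROM "Path\To\Library\Keywords.txt" 0 SEPARATOR TAB QUALIFIER NONE STARTLINE 1 KEEPTITLE FIELD "Keyword" C AT 1 DEC 0 WID 100 PIC "" AS ""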

But I generally do not like text files as my permanent means of storage for data sets that are expected to change over time. Instead, we could save the field definition in a workspace. If we reference the text field using variable substitution and define that variable before activating the workspace, we can make the workspace applicable to any table. So if we had a workspace called 'Keywords.wsp' with the following content:

WhoaLookAtThis COMPUTED
'BRIBES!!!' IF FIND('BRIBE', UPPER(%v_TextField%))
'Maybe a bribe?' IF FIND('FACILITATE', UPPER(%v_TextField%))
'Definitely a bribe…or a purchase of lubricants' IF FIND('GREASE', UPPER(%v_TextField%))
''

We could then create our keyword analysis field using the following commands:

v_TextField = 'TextField'
ACTIVATE WORKSPACE "Path\To\Library\Keywords.wsp"

I really like this option because it uses ACL infrastructure and is very simple to code and maintain. However, it has one disadvantage – I cannot easily mix and match separate lists of keywords. If I created 2 workspaces with different sets of keywords focused on different analyses and then wanted to perform a third analysis using a mix of keywords from both files, I would have to create a third workspace and duplicate my code. I am aiming for flexibility, so this is not ideal.

A third option, and the one I implemented, is to manage the keywords in a database. You might be getting the idea by now that I have a certain affinity for databases, which I will not deny. But part of the reason for that is the versatility they provide. This is how I structured the table of keywords to support my analysis:


  • Plain Word – This describes the word I am searching for in plain English, but it is NOT the actual search term. This field is helpful when you have a keyword identified in multiple languages.
  • Description – An extended description of what I am searching for.
  • Language – Identifies the language of the search term.
  • Category – Allows me to categorize my search terms for easy filtering.
  • Search Term – This is the actual search term I use in ACL.
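
For reference, here is a minimal sketch of how such a table might be declared in generic SQL. The plainword and regexterm column names match what the generation script later in this post expects; everything else is illustrative:

CREATE TABLE keyword_list (
    plainword   VARCHAR(100) NOT NULL, -- plain-English word; one word may appear in several languages
    description VARCHAR(500),          -- extended description of what is being searched for
    language    VARCHAR(50),           -- language of the search term
    category    VARCHAR(100),          -- grouping for easy filtering
    regexterm   VARCHAR(500) NOT NULL  -- quoted regular expression used as the actual search term
);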

I decided to take advantage of AN11’s regular expression capabilities and make all of my keyword search terms quoted regular expressions. This allows me to define the actual keywords I want to find with great flexibility. For example, by using a single regular expression that looks like this:

'\bGRE(E|A)SE?\s+(PAY(MENT)?|PMT)\b'

I can identify the following phrases, and also neatly filter out purchases of lubricants:

Grease Payment, Greese Payment, Grease Pay, Greese Pay, Grease Pmt, Greese Pmt, Greas Payment, Grees Payment, Greas Pay, Grees Pay, Greas Pmt, Grees Pmt

Thus, a single search term can encompass a variety of alternatives, including possible misspellings, while also excluding false positives. This is why I decided to use regular expressions and why they are worth learning if you want to perform any kind of textual analysis.
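
Plugged into ACL, a single hand-written computed field using that term would look something like this (REGEXFIND takes the string first and the pattern second; the field names here are illustrative):

DEFINE FIELD PossibleBribe COMPUTED
AS 'PossibleBribe'
'Possible grease payment' IF REGEXFIND(UPPER(TextField), '\bGRE(E|A)SE?\s+(PAY(MENT)?|PMT)\b')
''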

Now, the trick is to transform the list of keywords from a database into something that can be used in ACL. I decided to write a script that creates a computed field of regular expressions from a specified input table. That script, called ‘KeywordMatchField.aclscript’, is below:

COMMENT
** This script creates an external script that will perform a REGEXFIND of keywords on a specified field
**
** The required input values are:
** A table must be open containing keyword search terms. The table must contain the following columns:
** A column called 'plainword' with the word being searched for
** A column called 'regexterm' with the regex pattern to search for
** v_TextFieldName – The name of the text field in a different table to perform the search on
**
** The output is a file called FieldDef.aclscript that can be used to create a new field called
** 'MatchedKeyword' that tests for all regexterms in the currently-open keyword table.
END

[Some validation code has been removed for readability]

COMMENT Create the header of the field definition for the MatchedKeyword field
EXECUTE 'cmd /u /c echo DEFINE FIELD MatchedKeyword COMPUTED > "FieldDef.aclscript"'
EXECUTE 'cmd /u /c echo AS "MatchedKeyword" >> "FieldDef.aclscript"'

COMMENT Export the regex terms in the current table to create search terms in the output script file
EXPORT ASCII FIELDS '"' + ALLTRIM(plainword) + '" IF REGEXFIND(UPPER(ALLTRIM(%v_TextFieldName%)), ' + ALLTRIM(regexterm) + ')' TO "FieldDef.aclscript" APPEND UNICODE

COMMENT Create a default value equal to a blank text value in the output script file
EXECUTE 'cmd /u /c echo "" >> "FieldDef.aclscript"'

[Some clean-up code has been removed for readability]

This script creates a file in the project folder called 'FieldDef.aclscript' containing the code needed to define a computed field that will search for all of the keywords we downloaded from the database. The initial EXECUTE commands use the command line's 'echo' command to write text to a specified file; in this case, we are outputting the DEFINE FIELD header. We do not have to specify the full path to 'FieldDef.aclscript' because ACL executes cmd from the current project directory, which is where any file references default to when full paths are not provided. The EXPORT ASCII command then exports our table of keywords in the format needed for the computed field's conditions. Finally, the last EXECUTE command outputs an empty string for the field's default value.
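
To make that concrete: if the keyword table contained a single row with a plainword of 'GREASE PAYMENT' and the grease-payment regex from earlier, and v_TextFieldName were set to 'TextField', the generated FieldDef.aclscript would contain something like:

DEFINE FIELD MatchedKeyword COMPUTED
AS "MatchedKeyword"
"GREASE PAYMENT" IF REGEXFIND(UPPER(ALLTRIM(TextField)), '\bGRE(E|A)SE?\s+(PAY(MENT)?|PMT)\b')
""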

To perform the actual keyword analysis on our table, we can use the following commands, assuming we have already imported our keywords from the database:

OPEN MyTableOfKeywords
v_TextFieldName = 'TextField'
DO "Path\To\Library\KeywordMatchField.aclscript"
OPEN MyTableOfTextData
DO "FieldDef.aclscript"

EXTRACT FIELDS ALL IF NOT ISBLANK(MatchedKeyword) TO MatchedKeywords OPEN
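
Because the EXTRACT command leaves MatchedKeywords open, a quick frequency count of which keywords are actually hitting takes one more standard command (a sketch; how you report on the results is a matter of taste):

CLASSIFY ON MatchedKeyword TO SCREEN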

This approach, while complicated to set up, provides a great deal of flexibility and reuse. I have found it effective and a good fit for our automated processes. You may find that a simpler method of keyword analysis meets your needs; alternatively, you may find this method has too many limitations. Ultimately, the simplest approach that meets your needs and future expectations is the one you should implement. I do not present this methodology as the only 'good' way to perform a simple keyword analysis, but rather in the hope that the concepts discussed here will give you ideas and show what is possible when you begin to build an analytics environment around a tool like ACL.


2 thoughts on “Rube Goldberg Keyword Analysis Part 2”

  1. Another great post, Tom. Do you have any recommended beginner references for learning about regular expressions?
    Instead of EXECUTE 'cmd /u /c echo', I use a stub file with one record, and just use Export. Personal preference. Maybe easier to understand for most people.


    1. Hi Chris, thanks for the feedback.

      I like ACL’s introduction in the language reference to REGEXFIND: http://docs.acl.com/acl/11x/topic/com.acl.language.help/lang_ref_functions/r_regexfind.html?cp=2_0_92

      I have also found this site to be very comprehensive for learning the basics and the internals of regex engines:
      http://www.regular-expressions.info/tutorial.html

      Finally, I have found this site to be very helpful in debugging my expressions, or even for just playing around:
      https://regex101.com/

      Using a stub file is definitely a good, valid option. I even have a script that creates stub files for me for situations such as this. But I decided to use the command line's echo command here because it resulted in a few fewer lines of code. Definitely a matter of personal preference rather than best practice or 'thou shalt do it this way'.
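
      For anyone curious, a rough sketch of the stub-file variant, assuming a one-record table named Stub and mirroring the EXPORT syntax from the post:

      OPEN Stub
      EXPORT ASCII FIELDS 'DEFINE FIELD MatchedKeyword COMPUTED' TO "FieldDef.aclscript" UNICODE
      EXPORT ASCII FIELDS 'AS "MatchedKeyword"' TO "FieldDef.aclscript" APPEND UNICODE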

