Rube Goldberg Keyword Analysis Part 1

The matter of simplicity in analytics raises the question: what is simplicity? Is simplicity the act of writing simple code that is easy to read and performs a task in a simple manner? Is simplicity the act of writing complex (albeit expressive and easy-to-read) code that takes a complex task and makes it simple? Perhaps it is something else entirely. I tend to believe that simplicity is about writing code that makes complex tasks simple, even though the code behind the simplification effort may itself be complex in order to provide the façade of simplicity. Code sometimes has to be complex so it can handle enough variation to be reusable and extensible. So, without further ado, here is part 1 of how I simplify the task of keyword analysis using a complex process.

Getting Data

Every analysis requires data, and for keyword analysis, the data we need is text. I decided to start with text data from our ERP system, SAP. This is primarily because we have DirectLink and can easily access most of the text fields stored by SAP. Some of you may have access to other data on which you want to perform keyword analysis, perhaps e-mails, CRM, or SRM data. Where you begin is a strategic decision you have to make. I recommend starting with an easy-to-access source and expanding after the initial implementation.

One of the daunting challenges I faced was extracting the data. SAP systems generate hundreds of thousands of records (perhaps millions, depending on the company) of potential text data each day from accounting documents, sales documents, vendor purchases, etc. Pulling all of this data on a monthly schedule was not feasible. Performing a daily analysis was also unsatisfactory for two reasons:

  • We provide monthly reporting.
  • If a keyword is added in the middle of a month, we would have to re-run the analysis for the new keyword.

The answer I arrived at was to split the responsibilities of importing data and analyzing data into two different, and separately scheduled, routines (kind of sounds like that modularization technique someone once talked about). One set of scripts would import text data from SAP on a daily basis, and another would perform keyword mining on a monthly basis. Splitting the responsibilities seems like a no-brainer solution now, but at the time it was my first exposure to the value of separating data import scripts from data analysis scripts.

Because the data would not be analyzed on the same schedule as it would be extracted, I needed a place to store it. There were a number of solutions to consider: the data could be stored in flat text files, in AuditExchange’s data repository, or in a database. I ultimately decided to use a database to store the text data. Some of the reasons for this decision were:

  • Text files do not allow for indexing or filtering – the entire text file must be imported before filters can be applied. I wanted to store at least two rolling years of text data, which would allow me to easily handle ad-hoc requests, should they materialize. Text files holding two years of text data would be enormous and take too much time to import.
  • I wanted the text data to be accessible outside of ACL, which meant AuditExchange was not an ideal solution.
  • I wanted to become more familiar with a leading relational database management system (RDBMS) and better understand the complexities of managing databases.

Because we have AuditExchange, we already had a database installed on our server which we could utilize: PostgreSQL. Had we not already had this database, I still would have chosen it as the platform for storing text data. Not only is PostgreSQL a powerful, enterprise-class relational database that is fully ACID compliant and closely conforms to the SQL standard, it is also open source. For departments on a tight budget, it is a compelling choice.

I decided to use a single table to store text data for all of the transactions we extract. The field structure is something like this (a rough SQL sketch follows the list):

  • Record ID (auto-incrementing integer)
  • System identifier
  • A few fields we can use for the text metadata, such as document types, entity information, etc.
  • A date field identifying the date the text value was created
  • The field that contains the text
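
For illustration, a minimal PostgreSQL definition of such a table might look like the sketch below. The table and column names are purely illustrative (not the ones we actually use), and the metadata fields would depend on your source systems.

-- Illustrative sketch of the single text table described above
CREATE TABLE text_data (
    record_id    bigserial PRIMARY KEY,   -- auto-incrementing integer
    system_id    varchar(20),             -- source system identifier
    doc_type     varchar(20),             -- example metadata field
    entity       varchar(20),             -- example metadata field
    created_date date,                    -- date the text value was created
    text_value   text                     -- the text itself
);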

If I were designing the process again, I might consider splitting the text from the metadata and taking advantage of the relational aspect of the database. Such a structure might look something like this (again, a rough SQL sketch follows the list):

  • Metadata Table
    • Record ID (auto-incrementing integer)
    • System identifier
    • Metadata fields
    • Date field(s)
  • Text Table
    • Record ID
    • Text data
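
A PostgreSQL sketch of that alternative, again with purely illustrative names, could be:

-- Illustrative two-table version: metadata in one table, text in another
CREATE TABLE text_metadata (
    record_id    bigserial PRIMARY KEY,   -- auto-incrementing integer
    system_id    varchar(20),             -- source system identifier
    doc_type     varchar(20),             -- example metadata field
    created_date date                     -- date the text value was created
);

CREATE TABLE text_values (
    record_id  bigint REFERENCES text_metadata (record_id),  -- ties the text back to its metadata
    text_value text                                           -- the text itself
);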

In many cases, a single record of data from our source systems can contain multiple fields of text. For example, an accounting document line item might contain a text field and a reference field. In these situations, each text field is uploaded into a separate record in the database. So a line item that looks like this:

[Image: an accounting document line item containing both a text field and a reference field]

Becomes this in the database:

[Image: the same line item stored as two separate records in the database table]

I found this structure to be the easiest way to merge text from any source into a single table. I do not have to worry about the structure of data from any single source. All I have to do is create an Extract, Transform, and Load (ETL) routine for any source I desire that converts the data into the database’s structure. Extending the analysis to additional sources of data is as simple as creating the necessary ETL script – no additional infrastructure work has to be completed. This part of the keyword analysis has thus become plug and play.
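
To make the one-record-per-text-field idea concrete, here is a hypothetical example using the illustrative text_data table sketched earlier and entirely made-up values. A single line item carrying both a text field and a reference field produces two rows:

-- One source line item with two text fields becomes two database records (values are made up)
INSERT INTO text_data (system_id, doc_type, created_date, text_value)
VALUES ('SAP', 'KR', '2017-05-01', 'Payment for consulting services'),  -- the line item's text field
       ('SAP', 'KR', '2017-05-01', 'INV-10045');                        -- the line item's reference field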

The last task of significance is to create a script that loads data from ACL into the database. I created a dedicated script for this that can work with any ACL table; this is achieved by parameterizing the script with the values that are required (another benefit of modularization). We start by exporting the transformed data to a UTF-8-encoded text file, as discussed in the prior post:

v_SaveTo = "SomePath\MyFile.txt"
DO "Path\To\Library\ExportToUtf8.aclscript"

Then we provide the necessary parameters for the script that imports the data to the PostgreSQL database:

v_TextFile = v_SaveTo
v_Server = "servername"
v_Db = "MyDatabase"
v_Username = "me"
v_PgTable = "texttable"
v_PgFields = "document,date,text"
DO "Path\To\Library\ImportToPostgre.aclscript"

The ImportToPostgre script contains a lot of validation commands, which I will exclude here. The most relevant piece of the script is an EXECUTE command that calls the PostgreSQL command line. (This assumes the PostgreSQL command-line tools are on the PATH environment variable; if they are not, you can either add them to PATH or enter the full path to psql in the EXECUTE command.)

EXECUTE 'psql -h %v_Server% -U %v_Username% -d %v_Db% -c "\copy %v_PgTable% (%v_PgFields%) FROM "%v_TextFile%" WITH csv DELIMITER E"\037" HEADER ENCODING "UTF8" QUOTE E"\b";"'

You will notice that no password is provided in the EXECUTE command. Because we are using EXECUTE, any password that is provided would be saved to the log in plain text, and I try to avoid passwords in my ACL log files. Fortunately, PostgreSQL supports a client-side password file that allows password-less connections. On Windows the file is called pgpass.conf and, by default, lives in the %APPDATA%\postgresql\ directory (the PGPASSFILE environment variable can point it somewhere else). Each line of the file follows this syntax:

HostName:Port:Database:Username:Password

You can have as many lines in pgpass.conf as you need. In the example of our keyword analysis, the relevant entry in the pgpass.conf file might look like this:

servername:1234:MyDatabase:me:MyPassword

A couple of things to consider are:

  • The password is stored in plain text in the pgpass.conf file, so you need to ensure the file is kept in a secure location.
  • You can specify the wildcard character * to allow access to any database or from any host. Use this judiciously, if at all (see the example after this list). In the example provided, only user ‘me’ can get to MyDatabase via port 1234 from host servername without authenticating. This means ‘me’ either has to be a process running on the server, or I have to be logged into the server using something like a remote desktop connection. The more specific you are with the pgpass entry, the more limited the unauthenticated login will be. Ideally, we would prefer that every call to the database be authenticated, but until ACL implements an EXPORT ODBC command with password parameters, this is the only way I have found to fully automate the data transfer.
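
To make the wildcard point concrete, a looser entry reusing the same hypothetical credentials might look like this:

servername:*:*:me:MyPassword

That entry lets user ‘me’ reach any database on servername over any port without a password, which is more convenient but a far larger exposure than the specific entry shown above.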

With the pgpass.conf file properly configured and the data transfer script created, the first part of the keyword analysis is completed. At this point, we are collecting text information from our ERP system on a daily basis and storing it for future use. The next post will focus on the execution of the actual keyword analysis.

2 thoughts on “Rube Goldberg Keyword Analysis Part 1”

  1. I’m enjoying reading about all of your ACL experience, and your ideas on modularization and code reusability.

    We have thought about running daily imports from SAP, but we’re storing all of our data on our AX server and it’s too easy to corrupt an ACL history file if the fields don’t line up perfectly before you use APPEND.

    You mentioned that pulling all of your data on a monthly schedule was not feasible. Can you expand on your challenges there?

    We’ve been able to pull all of our data monthly without many problems. We separated our Direct Link imports into 10 main scripts (5 for Submit and 5 for Retrieve), with each main script calling multiple individual Submit or Retrieve scripts for individual SAP tables. Our 5 main scripts are for reference tables, smaller transactional tables, then larger transactional tables. We use 5 separate main scripts so that if one fails, we can easily re-run just that one script.

    We extract about 60 GB of data every month over a 3-night period, covering 3 years’ worth of data. Our largest table, BSEG, is 33 GB with almost 90 million records. BSEG takes about 4 hours for the Submit to finish, and 45 minutes for the Retrieve to finish.

    We run after our backups are finished, and we start running one night after our SAP monthly close, to avoid system slowdowns.

  2. Hi Chris,

    Thanks for your comment and I apologize for the delayed response (I took some much-needed vacation time).

    The challenge with pulling data monthly is twofold. First, the volume of data is rather large. On any given month we might have 4-5 million accounting document headers comprising 30-40 million line items. Second, our scripts are designed to run queries in the foreground rather than the background. This is the setup I was hired into and have continued using. I know there are methods to run queries in the background and then have automated scripts test their status. I have seen an example on ScriptHub and it looks trivial to implement, but it is also not a high priority for me.

    Because we run in the foreground, we have about 20 minutes to get our data before the query times out. This means we have to split our imports into chunks of time. Something like BKPF will usually get split into weekly or bi-monthly chunks. If you throw BSEG into the mix then we have to pull each day separately. Now, we have an additional challenge in that we are also limited to somewhere between 600 and 800 query calls for, I think, any given day. So as our CCM program expands I have to be very careful about how many times I am hitting our system. Thus, it made more sense for me to just pull the data daily and hit the system once per day rather than hit it 15 to 30 times once per month. An added benefit of running data-intensive queries daily is that it becomes ‘normal’ for the system. Rather than causing a spike in resource needs once per month, those queries kind of become background noise. I like to think it helps me fly under IT’s radar a bit.

    I find it interesting that you are able to pull 3 years of data each month. I wish I could do something like that, but I don’t think we could ever do that, even running in the background. We usually pull a single month of data each month, so we are only performing incremental analyses. For something like duplicate invoices, we will pull up to 12 months of historical data, and that takes a very long time to complete, so we only do that in exceptional cases.
