The matter of simplicity in analytics raises the question: what is simplicity? Is simplicity the act of writing simple code that is easy to read and performs a task in a simple manner? Is it the act of writing complex (albeit expressive and easy-to-read) code that takes a complex task and makes it simple? Perhaps it is something else entirely. I tend to believe that simplicity is about writing code to make complex tasks simple, but the code behind the simplification effort can itself be complex in order to provide the façade of simplicity. Code sometimes has to be complex so it can handle enough variation to be reusable and extensible. So without further ado, here is part 1 of how I simplify the task of keyword analysis using a complex process.
Every analysis requires data, and for keyword analysis, the data we need is text. I decided to start with text data from our ERP system, SAP. This is primarily because we have DirectLink and can easily access most of the text fields stored by SAP. Some of you may have access to other data on which you want to perform keyword analysis, perhaps e-mails, CRM, or SRM data. Where you begin is a strategic decision you have to make. I recommend starting with an easy-to-access source and expanding after the initial implementation.
One of the daunting challenges I faced was extracting the data. SAP systems generate hundreds of thousands of records (perhaps millions, depending on the company) of potential text data each day from accounting documents, sales documents, vendor purchases, etc. Pulling all of this data on a monthly schedule was not feasible. Performing a daily analysis was also unsatisfactory for two reasons:
- We provide monthly reporting.
- If a keyword is added in the middle of a month, we would have to re-run the analysis for the new keyword.
The answer I arrived at was to split the responsibilities of importing data and analyzing data into two different, and separately scheduled, routines (kind of sounds like that modularization technique someone once talked about). One set of scripts would import text data from SAP on a daily basis and another would perform keyword mining on a monthly basis. Splitting the responsibilities seems like a no-brainer solution now, but at the time it was my first exposure to the value of separating data import scripts from data analysis scripts.
Because the data would not be analyzed on the same schedule as it was extracted, I needed a place to store it. There were a number of solutions to consider: the data could be stored in flat text files, in AuditExchange’s data repository, or in a database. I ultimately decided to use a database to store the text data. Some of the reasons for this decision were:
- Text files do not allow for indexing or filtering – the entire file must be imported before filters can be applied. I wanted to store at least two rolling years of text data, which would let me easily handle ad-hoc requests, should they materialize. Text files with two years of text data would be enormous and take too long to import.
- I wanted the text data to be accessible outside of ACL, which meant AuditExchange was not an ideal solution.
- I wanted to become more familiar with a leading relational database management system (RDBMS) and better understand the complexities of managing databases.
Because we have AuditExchange, we already had a database installed on our server which we could utilize: PostgreSQL. If we did not already have this database, I still would have chosen it as the platform for storing text data. Not only is PostgreSQL a powerful, enterprise-class relational database that is fully ACID-compliant and conforms closely to the SQL standard, it is also open source. For departments on a tight budget, it is a compelling choice.
I decided to use a single table to store text data for all of the transactions we extract. The field structure is something like this:
- Record ID (auto-incrementing integer)
- System identifier
- A few fields we can use for the text metadata, such as document types, entity information, etc.
- A date field identifying the date the text value was created
- The field that contains the text
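As a sketch, that single table could be created with DDL along these lines; the table and column names here are illustrative guesses, not the actual schema:

```sql
-- Illustrative schema only: names and sizes are assumptions.
CREATE TABLE texttable (
    record_id   serial PRIMARY KEY,  -- auto-incrementing integer
    system_id   varchar(10),         -- source system identifier
    doc_type    varchar(4),          -- metadata: document type
    entity      varchar(10),         -- metadata: entity information
    created_on  date,                -- date the text value was created
    text_value  text                 -- the text itself
);
```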
If I were designing the process again, I might consider splitting the text from the metadata to take advantage of the relational aspect of the database. Such a structure might look something like this:
- Metadata Table
  - Record ID (auto-incrementing integer)
  - System identifier
  - Metadata fields
  - Date field(s)
- Text Table
  - Record ID
  - Text data
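Sketched as DDL (again with invented names), the split design relies on the Record ID to join text back to its metadata:

```sql
-- Illustrative only: a normalized variant of the single-table design.
CREATE TABLE metadata (
    record_id  serial PRIMARY KEY,
    system_id  varchar(10),
    doc_type   varchar(4),
    created_on date
);

CREATE TABLE text_data (
    record_id  integer REFERENCES metadata (record_id),
    text_value text
);
```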
In many cases, a single record of data from our source systems can contain multiple fields of text. For example, an accounting document line item might contain both a text field and a reference field. In these situations, each text field is uploaded as a separate record in the database, so one line item fans out into one database row per text field.
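To make the fan-out concrete, here is a minimal sketch in Python; the field names and values are invented for illustration and are not from the actual SAP extract:

```python
# One hypothetical source line item with two text-bearing fields.
line_item = {
    "document": "190000123",
    "date": "2015-06-30",
    "text": "Payment for services, phase 1",
    "reference": "INV-4711",
}

# Each text-bearing field becomes its own record in the database table,
# sharing the same metadata (document number, date, and so on).
rows = [
    {
        "system": "SAP",
        "document": line_item["document"],
        "date": line_item["date"],
        "text": line_item[field],
    }
    for field in ("text", "reference")
]
```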
I found this structure to be the easiest way to merge text from any source into a single table. I do not have to worry about the structure of data from any single source. All I have to do is create an Extract, Transform, and Load (ETL) routine for any source I desire that converts the data into the database’s structure. Extending the analysis to additional sources of data is as simple as creating the necessary ETL script – no additional infrastructure work has to be completed. This part of the keyword analysis has thus become plug and play.
The last remaining task of significance is to create a script that loads data from ACL into the database. I created a dedicated script that accomplishes this and can work with any ACL table by parameterizing the script with the required values (another benefit of modularization). We start by exporting the transformed data to a UTF8-encoded text file, as discussed in the prior post:
v_SaveTo = "SomePath\MyFile.txt"
Then we provide the necessary parameters for the script that imports the data to the PostgreSQL database:
v_TextFile = v_SaveTo
v_Server = "servername"
v_Db = "MyDatabase"
v_Username = "me"
v_PgTable = "texttable"
v_PgFields = "document,date,text"
The ImportToPostgre script contains a number of validation commands, which I will exclude here. The most relevant piece of the script is an EXECUTE command that calls the PostgreSQL command line. (This assumes the psql executable is in the PATH environment variable; if it is not, you can either add it to PATH or enter the full path to psql in the EXECUTE command.)
EXECUTE 'psql -h %v_Server% -U %v_Username% -d %v_Db% -c "\copy %v_PgTable% (%v_PgFields%) FROM ''%v_TextFile%'' WITH csv DELIMITER E''\037'' HEADER ENCODING ''UTF8'' QUOTE E''\b'';"'
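The non-printing characters in that command are worth unpacking: DELIMITER E'\037' separates fields with the ASCII unit separator (hex 1F), which should never occur in real text, and QUOTE E'\b' sets the quote character to backspace so that ordinary quotes and commas in the text pass through untouched. A small Python sketch (file contents invented for illustration) shows the round trip:

```python
import csv
import io

# Simulated export: fields separated by the ASCII unit separator
# (\x1f, octal 037), matching DELIMITER E'\037' in the \copy command.
# Quoting is effectively disabled by using backspace as the quote
# character, matching QUOTE E'\b'.
rows = [["190000123", "2015-06-30", 'He said "urgent", pay now']]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\x1f", quotechar="\b",
                    quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerow(["document", "date", "text"])  # HEADER row expected by \copy
for row in rows:
    writer.writerow(row)

# Reading it back the way \copy would: commas and double quotes in the
# text survive because neither is the delimiter or the quote character.
buf.seek(0)
parsed = list(csv.reader(buf, delimiter="\x1f", quotechar="\b"))
```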
You will notice that no password is provided in the EXECUTE command. Because we are using EXECUTE, any password provided would be saved to the ACL log in plain text, and I try to avoid passwords in my log files. Fortunately, PostgreSQL supports a client password file, called pgpass.conf on Windows, which allows password-less connections.
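Per the PostgreSQL documentation, each line of pgpass.conf follows this format:

```
hostname:port:database:username:password
```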
You can have as many lines in pgpass.conf as you need. For our keyword analysis, the relevant entry might look like this:
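Plugging in the connection values used above (the password shown is only a placeholder):

```
servername:1234:MyDatabase:me:MyPassword
```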
A couple of things to consider are:
- The password is stored in plain text in the pgpass.conf file, so you need to ensure the file is kept in a secure directory.
- You can specify the wildcard character * to allow access to any database or from any host. Use this judiciously, if at all. In the example provided, only user ‘me’ can reach MyDatabase via port 1234 from host servername without authenticating, which means ‘me’ either has to be a process running on the server or I have to be logged into the server using something like a remote desktop connection. The more specific the pgpass entry, the more limited the unauthenticated login will be. Ideally, every call to the database would be authenticated, but until ACL implements an EXPORT ODBC command with password parameters, this is the only way I have found to fully automate the data transfer.
With the pgpass.conf file properly configured and the data transfer script created, the first part of the keyword analysis is complete. At this point, we are collecting text data from our ERP system on a daily basis and storing it for future use. The next post will focus on executing the actual keyword analysis.