I call myself a data analyst. My company also calls me a data analyst. But for some time now I have had a nagging feeling that I am unworthy of the title, because I don’t do data analyst things such as regressions and machine learning. In fact, I really don’t touch anything resembling statistics in my analytical work. Yet, when I look at data analyst roles outside of audit departments, statistics plays a major role. Sure, I can understand complex ERP systems such as SAP, am well-versed in SQL and ACLScript, have built software tools using C#, and have implemented sophisticated ETL and analytical routines to identify control weaknesses and drive improvement. But I have never run a regression, and that makes me feel like an inadequate, second-rate data analyst. There are days when I feel like a child playing in a sandbox while the ‘adults’ are off doing big, important things.
I don’t like that feeling, so I have been charting a course in my professional life to steer myself toward integrating statistical tools into my toolbox. There is a stack of books on my desk covering statistics, data science using Python, and machine learning. I also try to read journal and magazine articles that will expose me to this new way of thinking and analyzing data. Finally, I am teaching myself Python and have pushed for my professional training to focus on practical implementations of statistical modelling in Python. However, despite all of this, I have never been able to visualize how we could integrate statistical modelling and machine learning into our risk and controls analyses. I understood the concepts, but I had no roadmap and no picture of what the final ‘physical’ product might look like. Without that roadmap, I have found it impossible to begin. I need an idea of what I am aiming for before I can take the first steps.
In my limited reading and experience, statistical datasets are usually collections of samples, with each record being a unique sample and each field a (usually numeric) value describing a measurement or property of the sample. Modelling and machine learning, if my understanding of them is not completely off-base, then look for relationships and patterns between these measurements and use them for predictive or scoring purposes. There is a lot of math that goes on to identify and describe those relationships, most of which I do not understand. However, programming languages such as R and Python do understand the math. This means we can leverage these languages to focus less on the math and more on understanding the effects of the inputs on the model outputs.
The problem I have always had is that the data we usually deal with when designing analytics for risks and controls looks and behaves nothing like the typical datasets for statistics and machine learning. Take an example of duplicate payments: we might take a table of invoices, clean it up, and compare the invoices to each other. Our output is a table of attributes describing the duplicated invoices. At most, we might have a handful of columns containing the invoice value. The majority of the fields contain text or date values describing various properties used by the system and business to manage the invoicing process:
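As a rough sketch of what such an analytic might produce (the column names and values here are invented for illustration, not our actual schema), a simplified duplicate-payment test in pandas could look like this:

```python
import pandas as pd

# Hypothetical invoice extract; real data would come from the ERP (e.g. SAP).
invoices = pd.DataFrame({
    "invoice_num": ["A100", "A100", "B200", "C300"],
    "vendor_id":   [501, 501, 502, 503],
    "amount":      [1250.00, 1250.00, 80.00, 430.00],
    "doc_date":    ["2019-03-01", "2019-03-04", "2019-03-02", "2019-03-05"],
    "dept":        ["D10", "D10", "D20", "D30"],
})

# Flag invoices sharing vendor, invoice number, and amount -- a naive
# stand-in for the cleaned-up matching a real analytic would perform.
dupes = invoices[invoices.duplicated(
    subset=["vendor_id", "invoice_num", "amount"], keep=False)]
print(dupes)
```

Note that the output is mostly text and date attributes, with only one numeric column, which is exactly the mismatch described above.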
By contrast, a typical dataset used for statistical purposes usually looks something like this:
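For illustration (the measurements here are invented), a typical statistical dataset is a flat table of numeric observations, one sample per row, every field a measurement:

```python
import pandas as pd

# Each row is one sample; each column is a numeric measurement of it.
samples = pd.DataFrame({
    "height_cm": [172.1, 165.3, 180.4],
    "weight_kg": [68.0, 59.5, 82.3],
    "age_years": [34, 28, 45],
})
print(samples.describe())
```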
How do we get data that looks and behaves like that when we deal with discrete transactions? What would such data even mean in our context of risk and controls? I had very few ideas and no clue where to even begin. And then I had a bit of an Archimedes moment in the shower. Assume we have 42 different analytics we run on a monthly basis to test various strategic controls around all of the major processes in our company. Also assume each of the exceptions in each of the analytics has some sort of organizational identifier, such as a profit center or a department code. Finally, assume we can also obtain financial information or other key metrics by this organizational identifier. We could use all of this data to craft a dataset which will be the starting point of our statistical models or machine learning algorithms, and which will look and behave like the statistical dataset above. The process to build this table is straightforward: summarize the exceptions by organizational unit, then join each summarized analytic into a single table along with the summarized metrics and financial measurements. The result will be a table of samples, one per entity in our organization, with fields containing the results of our tests and the entity’s financial performance.
The output of this aggregation might look something like this, with 42 columns for analytic results, a column to identify the department (and perhaps a date), and any number of columns for the financial measures:
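A minimal pandas sketch of that aggregation, using two analytics instead of 42 (the analytic names, department codes, and metrics are all invented for illustration):

```python
import pandas as pd

# Exceptions from each analytic, tagged with the analytic's name and the
# organizational identifier -- in practice, 42 such result tables.
exceptions = pd.DataFrame({
    "analytic": ["dup_payments", "dup_payments",
                 "late_approvals", "late_approvals"],
    "dept":     ["D10", "D20", "D10", "D10"],
})

# Count exceptions per department per analytic, then pivot wide so that
# each analytic becomes one column of the master table.
counts = (exceptions.groupby(["dept", "analytic"]).size()
          .unstack(fill_value=0))

# Financial measures summarized by the same organizational identifier.
financials = pd.DataFrame({
    "dept":         ["D10", "D20", "D30"],
    "gross_margin": [0.31, 0.27, 0.35],
}).set_index("dept")

# Join into one table: one row per entity, test results plus metrics.
master = counts.join(financials, how="outer").fillna(0)
print(master)
```

Departments with no exceptions (here `D30`) still appear as samples, with zero counts, which is itself meaningful data for a model.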
From this starting point, we can begin to build our models. I am thinking out loud at this point, but perhaps we can use this data to identify which of our analytics correlate to changes in gross margin, or lost customers, or some other highly valued company metric. We can then use this information to assess the risk of various processes to the overall organization and make better selections for our audit schedule. We may be able to determine in near real time not only where our control weaknesses lie, but also what their impact will be. If we approach the data from the other direction, from the metrics or financial measurements to the analytic results, we could tie this data into our organization’s forecasts and start predicting where we might find control weaknesses in the near future. We also have some flexibility here; not only can we build our models at a strategic, organization level, but we can also build more focused models to assess specific processes or departments. Perhaps we even take it to the transactional level and build models around discrete transactions such as purchase orders, invoices, and inventory adjustments. These models could be adaptive and learn from prior false positives to get better over time at identifying risky or fraudulent transactions. The result could be a library of models, each one utilizing a subset of fields from our master table and providing a different assessment of the organization.
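Thinking out loud in code as well: a first pass at the correlation idea could simply rank analytics by how strongly their exception counts track a valued metric. Everything below is invented illustration on a toy master table, not a fitted production model:

```python
import pandas as pd

# Hypothetical master table: exception counts per analytic plus one metric,
# one row per department per period.
master = pd.DataFrame({
    "dup_payments":   [3, 0, 7, 1, 5],
    "late_approvals": [1, 2, 9, 0, 6],
    "gross_margin":   [0.33, 0.35, 0.22, 0.36, 0.27],
})

# Pearson correlation of each analytic's exception count with the metric;
# strongly negative values flag analytics worth a closer look when
# prioritizing the audit schedule.
corr = master.corr()["gross_margin"].drop("gross_margin")
print(corr.sort_values())
```

Correlation is only a starting point, of course; the same table feeds regression or machine learning models once the relationships look promising.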
The models would be built on regression, machine learning, and other statistical algorithms utilized by more traditional data analysis disciplines. Even better, we can build those models in R and/or Python, which ACL has kindly integrated into their product. This allows us to work the models into our existing workflows. My current vision of the process involves having an ACL script that will perform the necessary summarizations and joins to build the master table. Each model will then have its own ACL master script which will export the table to a delimited text file. Then, using ACL’s new functions, it will call an R or Python script to perform the actual modelling. The results of the modelling can then be imported back into ACL and processed as needed.
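A sketch of what the Python side of that handoff might look like (the file names, column names, and the choice of an ordinary least-squares fit are all placeholder assumptions; the ACL master script would handle the export and re-import around this):

```python
import numpy as np
import pandas as pd

def run_model(infile: str, outfile: str) -> pd.DataFrame:
    """Read the master table exported by ACL, fit a model, write results back."""
    master = pd.read_csv(infile)
    # Ordinary least squares: predict the metric from the exception counts.
    X = master[["dup_payments", "late_approvals"]].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept term
    y = master["gross_margin"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Entities whose metric deviates most from what the exception counts
    # predict (large residuals) may deserve audit attention.
    master["predicted"] = X @ coef
    master["residual"] = y - master["predicted"]
    master.to_csv(outfile, index=False)  # ACL re-imports this file
    return master

# Stand-in for the delimited file the ACL master script would export.
pd.DataFrame({
    "dept":           ["D10", "D20", "D30", "D40", "D50"],
    "dup_payments":   [3, 0, 7, 1, 5],
    "late_approvals": [1, 2, 9, 0, 6],
    "gross_margin":   [0.33, 0.35, 0.22, 0.36, 0.27],
}).to_csv("master_table.csv", index=False)

scored = run_model("master_table.csv", "model_results.csv")
print(scored[["dept", "residual"]])
```

The delimited-file round trip keeps the two worlds decoupled: the ACL scripts own the data preparation, and the Python script owns the modelling.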
Interestingly, the key to making all of this work is a suite of solid, informative, and accurate analytics that effectively measure the risks and controls of our organization’s processes. Fortunately, this is our bread and butter. We should all know risks and controls to some degree, and we should all have familiarity with our organization’s processes. This puts us in a unique position to effectively model our organization’s risks and mitigations and extract key insights from our existing analytical efforts. We are well-positioned with our knowledge, access to data, and ability to control the flow all the way from the detailed analytics to the statistical models.
While it seems relatively simple in hindsight, this train of thought was a recent revelation to me, and it has given me hope and a sense of optimism. I finally feel like I know what the end product may look like and I have a path to get there. Most important of all, this revelation has given me a newfound appreciation of the importance of the non-statistical analyses we perform. If we want to model risks and controls, we need superb analytics to give us the data from which we can build our models. Without that solid foundation, the statistical models will suffer and there will be little on which machine learning can work its magic. The work and analyses we currently do are not second-rate; they are some of the most important pieces of the puzzle. Without them, the old adage of ‘Garbage In, Garbage Out’ will ring true and render our newfound tools and capabilities mostly meaningless.