I call myself a data analyst. My company also calls me a data analyst. But for some time now I have had a nagging feeling that I am unworthy of the title, because I don’t do data analyst things such as regression and machine learning. In fact, I really don’t touch anything resembling statistics in my analytical work. Yet, when I look at data analyst roles outside of audit departments, statistics plays a major role. Sure, I can understand complex ERP systems such as SAP, am well-versed in SQL and ACLScript, have built software tools using C#, and have implemented sophisticated ETL and analytical routines to identify control weaknesses and drive improvement. But I have never done a regression, and that makes me feel like an inadequate, second-rate data analyst. There are days when I feel like a child playing in a sandbox while the ‘adults’ are off doing big, important things.

I don’t like that feeling, so I have been charting a course in my professional life to steer myself toward integrating statistical analytical tools into my toolbox. There is a stack of books on my desk covering statistics, data science using Python, and machine learning. I also try to read journal and magazine articles that will expose me to this new way of thinking and analyzing data. Finally, I am teaching myself Python and have pushed for my professional training to focus on practical implementations of statistical modelling using Python. However, despite all of this, I have never been able to visualize how we could integrate statistical modelling and machine learning into our risk and controls analyses. I understand the concepts, but I had no roadmap and no understanding of what the final ‘physical’ product might look like. Without that roadmap, I have found it impossible to begin. I need to have an idea of what I am aiming for before I can take the first steps.

In my limited reading and experience, statistical datasets are usually collections of samples, with each record being a unique sample and each field a (usually numeric) value describing a measurement or property of the sample. Modelling and machine learning, if my understanding of them is not completely off-base, then look for relationships and patterns between these measurements and use them for predictive or scoring purposes. There is a lot of math that goes on to identify and describe those relationships, most of which I do not understand. However, programming languages such as R and Python do understand the math. This means we can leverage these languages to focus less on the math and more on understanding the implications of the inputs on the model outputs.
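To illustrate what I mean by letting the language handle the math, here is a minimal sketch using made-up numbers: scikit-learn fits a linear regression without us ever touching the underlying least-squares algebra, leaving us to interpret the fitted slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy dataset: each row is a sample, each column a numeric measurement.
# The values are invented purely for illustration.
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # predictor
y = np.array([2.1, 3.9, 6.2, 8.1])           # response

# The library handles the math; our job is to interpret the relationship.
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)
```

The output is a slope and intercept we can reason about in business terms, which is exactly the division of labor described above.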

The problem I have always had is that the data we usually deal with when designing analytics for risks and controls looks and behaves nothing like the typical datasets for statistics and machine learning. Take an example of duplicate payments: we might take a table of invoices, clean it up, and compare the invoices to each other. Our output is a table of attributes describing the duplicated invoices. At most, we might have a handful of columns containing the invoice value. The majority of the fields contain text or date values describing various properties used by the system and business to manage the invoicing process:
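As a rough sketch of that kind of analytic, a duplicate-payment test can be reduced to grouping invoices on a matching key. The table and the key columns here are hypothetical stand-ins, not fields from any real ERP:

```python
import pandas as pd

# Hypothetical invoice extract; column names are illustrative only.
invoices = pd.DataFrame({
    "invoice_id": ["A1", "A2", "A3", "A4"],
    "vendor_id":  ["V10", "V10", "V22", "V10"],
    "invoice_no": ["1001", "1001", "2002", "1003"],
    "amount":     [500.00, 500.00, 75.25, 120.00],
    "doc_date":   ["2017-03-01", "2017-03-05", "2017-03-02", "2017-03-09"],
})

# Flag invoices sharing the same vendor, invoice number, and amount.
key = ["vendor_id", "invoice_no", "amount"]
dupes = invoices[invoices.duplicated(subset=key, keep=False)]
print(dupes)
```

Notice that the output is mostly identifiers, text, and dates describing the exceptions, which is precisely why it looks nothing like a statistics-ready table of numeric measurements.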

By contrast, a typical dataset used for statistical purposes usually looks something like this:

How do we get data that looks and behaves like that when we deal with discrete transactions? What would such data even mean in our context of risk and controls? I had very few ideas and no clue where to even begin. And then I had a bit of an Archimedes moment in the shower. Assume we have 42 different analytics we run on a monthly basis to test various strategic controls around all of the major processes in our company. Also assume each exception in each analytic has some sort of organizational identifier, such as a profit center or a department code. Finally, assume we can also obtain financial information or other key metrics by this organizational identifier.

We could use all of this data to craft a dataset which will be the starting point of our statistical models or machine learning algorithms. This dataset will look and behave like the statistical dataset above. The process to build this table is simple: we summarize/aggregate the exceptions by organizational unit. Each summarized analytic would then be joined into a single table along with the summarized metrics and financial measurements. The result will be a table of samples representing each entity in our organization, with the fields containing the results of our tests and the entity’s financial performance.
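The summarize-then-join step above can be sketched in a few lines of pandas. The two analytics, the department codes, and the metric are all hypothetical; a real run would fold in all 42 analytics the same way:

```python
import pandas as pd

# Hypothetical exception listings from two analytics (a real run would have 42).
dup_pay = pd.DataFrame({"dept": ["D1", "D1", "D2"], "amount": [500, 120, 75]})
late_po = pd.DataFrame({"dept": ["D1", "D2", "D2"], "days_late": [10, 20, 30]})

# Financial measures keyed by the same organizational identifier.
financials = pd.DataFrame({"dept": ["D1", "D2"], "gross_margin": [0.31, 0.27]})

# Summarize each analytic's exceptions by department...
a1 = dup_pay.groupby("dept").size().rename("dup_pay_count")
a2 = late_po.groupby("dept").size().rename("late_po_count")

# ...then join the summaries and the metrics into one samples-by-measurements table.
master = financials.set_index("dept").join([a1, a2]).fillna(0).reset_index()
print(master)
```

Each row is now a department (a sample) and each column a numeric measurement, which is exactly the shape statistical tools expect.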

The output of this aggregation might look something like this, with 42 columns for analytic results, a column to identify the department (and perhaps a date), and any number of columns for the financial measures:

From this starting point, we can begin to build our models. I am thinking out loud at this point, but perhaps we can use this data to identify which of our analytics correlate with changes in gross margin, or lost customers, or some other highly valued company metric. We can then use this information to assess the risk of various processes to the overall organization and make better selections for our audit schedule. We may be able to determine in near real time not only where our control weaknesses lie, but also what their impact will be. If we approach the data from the other direction, from the metrics or financial measurements to the analytic results, we could tie this data into our organization’s forecasts and start predicting where we might find control weaknesses in the near future. We also have some flexibility here; not only can we build our models at a strategic, organization level, but we can also build more focused models to assess specific processes or departments. Perhaps we even take it to the transactional level and build models around discrete transactions such as purchase orders, invoices, and inventory adjustments. These models could be adaptive and learn from prior false positives to get better over time at identifying risky or fraudulent transactions. The result could be a library of models, each one utilizing a subset of fields from our master table and providing a different assessment of the organization.
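As a first pass at the "which analytics correlate with gross margin" idea, a plain correlation matrix over the master table would already be informative. The numbers below are invented to make the point:

```python
import pandas as pd

# Hypothetical master table: one row per department per month, values invented.
master = pd.DataFrame({
    "dup_pay_count": [2, 5, 1, 7, 3],
    "late_po_count": [1, 0, 2, 1, 4],
    "gross_margin":  [0.31, 0.22, 0.33, 0.18, 0.28],
})

# Which analytics move with gross margin? Pairwise correlation is a first pass.
corrs = master.corr()["gross_margin"].drop("gross_margin")
print(corrs)
```

A strongly negative coefficient for an analytic would suggest its exceptions track margin erosion, making that control a candidate for the audit schedule; a proper model would of course need to go beyond correlation.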

The models would be built on regression, machine learning, and other statistical algorithms utilized by more traditional data analysis disciplines. Even better, we can build those models in R and/or Python, which ACL has kindly integrated into their product. This allows us to work the models into our existing workflows. My current vision of the process involves having an ACL script that will perform the necessary summarizations and joins to build the master table. Each model will then have its own ACL master script, which will export the table to a delimited text file. Then, using ACL’s new functions, it will call an R or Python script to perform the actual modelling. The results of the modelling can then be imported back into ACL and processed as needed.
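The Python side of that round trip might look something like the sketch below. The first block simply stands in for ACL's export step; the file names and the "scoring" are placeholder assumptions, not a real model:

```python
import pandas as pd

# Stand-in for ACL's export: write a small master table to a delimited file.
pd.DataFrame({
    "dept": ["D1", "D2", "D3"],
    "dup_pay_count": [2, 5, 1],
    "late_po_count": [1, 0, 3],
}).to_csv("master_table.csv", index=False)

# The modelling script ACL would call: read the export, score it, and write
# results back out for ACL to import. The score here is a naive placeholder
# (total exception count), not a fitted statistical model.
master = pd.read_csv("master_table.csv")
count_cols = [c for c in master.columns if c.endswith("_count")]
master["risk_score"] = master[count_cols].sum(axis=1)
master.to_csv("model_results.csv", index=False)
print(master)
```

Swapping the placeholder score for a regression or machine learning model would not change the shape of this pipeline, which is what makes the ACL-to-Python hand-off attractive.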

Interestingly, the key to making all of this work is a suite of solid, informative, and accurate analytics that effectively measure the risks and controls of our organization’s processes. Fortunately, this is our bread and butter. We should all know risks and controls to some degree, and we should all have familiarity with our organization’s processes. This puts us in a unique position to effectively model our organization’s risks and mitigations and extract key insights from our existing analytical efforts. We are well-positioned with our knowledge, access to data, and ability to control the flow all the way from the detailed analytics to the statistical models.

While it seems relatively simple in hindsight, this train of thought was a recent revelation to me, and it has given me hope and a sense of optimism. I finally feel like I know what the end product may look like and I have a path to get there. Most important of all, this revelation has given me a newfound appreciation of the importance of the non-statistical analyses we perform. If we want to model risks and controls, we need superb analytics to give us the data from which we can build our models. Without that solid foundation, the statistical models will suffer and there will be little on which machine learning can work its magic. The work and analyses we currently do are not second-rate; they are some of the most important pieces of the puzzle. Without them, the old adage of ‘Garbage In, Garbage Out’ will ring true and render our newfound tools and capabilities mostly meaningless.

Great post Tom. I feel the same way as you about the fact we don’t really use statistics and other tools so I have taken a couple of specializations so far on Coursera.org (Python from University of Michigan and Statistics with R from Duke). You may want to check out those courses and others.

I have also been working on making a single table like the one shown in your diagram. We refer to it as our risk scoring. I haven’t had much time to work on it, as I have been trying to develop an advanced workbook in Tableau to represent the results of multiple analytics, since not all of our analytics use ACL.

I do disagree that a typical dataset used for statistical purposes does not contain text and dates (but what do I know, I’m also a statistics newbie). In my limited use of statistics lately, most of my datasets have included text and dates. For my last statistics project (homework), I analyzed home sales where each observation (i.e., record) was a home sale that occurred over a period of time. We had to perform several analyses on this data, including creating summary statistics, creating plots, and predicting prices for future home sales using linear regression. My model included several text variables.
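For readers wondering how text variables enter a regression like the one described: one common approach is to encode them as one-hot (dummy) columns. The home-sale data below is invented for illustration, not the commenter's actual homework set:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative home-sale data; values and the "style" categories are made up.
sales = pd.DataFrame({
    "sqft":  [1200, 1500, 1100, 1800, 1600, 1300],
    "style": ["ranch", "colonial", "ranch", "colonial", "colonial", "ranch"],
    "price": [150_000, 210_000, 140_000, 260_000, 225_000, 162_000],
})

# The text variable becomes a numeric dummy column the regression can use.
X = pd.get_dummies(sales[["sqft", "style"]], columns=["style"], drop_first=True)
model = LinearRegression().fit(X, sales["price"])
print(model.predict(X[:1]))
```

So text and date fields are not a barrier in themselves; they just need an encoding step before the math sees them.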

If you do manage to use statistics more in your work, I would like to hear about it!


Hi Barry, thanks for the comment! Statistical analysis is very much a new thing to me and I still have a lot to learn.

I appreciate your input about datasets used for statistical analysis. In my limited exposure, the datasets I have seen were structured similarly to the one mentioned in my post. Obviously, that exposure was not comprehensive. It is good to know that I can, and will, be working with datasets structured similarly to the ones I am currently familiar with. So thanks for the correction.

I appreciate the information about Coursera. I have heard about them, but have not tried any of the specializations or courses. I will take a look.

I intend to write future posts about our statistical journey. I am anticipating lots of missteps and lessons learned, which I will try to document for posterity’s sake.


Hi Tom. Nice post. I’m looking forward to hearing about your statistical journey. Jeremy Brown has a few related posts on the ACL web site, and he’s a good resource for ideas. We’re also planning to use the results of our individual analytics as input into an overall model. You mentioned statistics on entities like Department. We’re planning to generate similar scorecards by Employee and by Vendor. Let us know if you find some interesting and/or unexpected results.

A few recommendations from Jeremy:

General R and Python things to check out:

R IDE: RStudio

Python IDE: PyCharm

R/Python social network analysis: igraph

R decision tree package: rpart

R random forest package: randomForest


Hi Chris, thanks for letting me know about Jeremy. I will look out for future posts of his. It looks like he has already done what we are trying to do and knows how to speak the language of risks and controls.

I appreciate the recommendations. I have been going back and forth between Spyder and PyCharm and have not been able to decide which IDE I want to use. A recommendation from someone like Jeremy is definitely of value in the decision.
