Software applications commonly make use of tree structures to arrange items in a hierarchical manner. File systems, database indexes, bills of material, HR organizational data, etc are all usually structured in a tree format. The persisted data that represents these structures are often stored in a dictionary of parent/child relationships, looking something like this:
If we were to expand this data into its full form, it would look like the following tree:
Alternatively, we could present this data as a list of paths, which is the format I find most useful in my analysis routines:
We may find it helpful in our data analyses to be able to expand hierarchical data so we can analyze various aspects of the tree. For example, we may wish to expand a bill of material from the final product into all of the components and routings that comprise that product. We could use this information to perform verifications of cost rollups. I once performed an analysis of forecasting models that were designed to be representative of a typical high-option-content product. To accomplish this, each option package was quantified as a ratio that approximated the historical take rate. So, if option A was purchased 30% of the time and option B was purchased 70% of the time, options A and B were entered into the bill of material at quantities of 0.3 and 0.7, respectively. You can imagine my surprise when the total of the options was less than 1. We were not fully costing our forecasting models, which was contributing to unfavorable variances in our monthly budget-to-actual reporting.
To help me accomplish the analysis, I had to explode our bill of materials. At least, that’s what I call the process of expanding the tree structure down to its deepest level. I am not sure if anyone else calls the process ‘exploding’, but I always have, and so I will continue to do so in this post.
Since that analysis I have needed to explode many datasets. Each time I would create a script, from the ground up, that was designed for each specific circumstance. About a month ago I finally got tired of (re-)writing these scripts and designed a generic explosion script. I will walk through that script and its logic. I always find these scripts to be delightfully challenging, which is perhaps why it took so long for me to finally create a general-purpose utility script. These scripts are also quite exciting because, rather than iterating through a single path of the tree as their owner applications do (and which, to be honest, is a very trivial exercise), we will be iterating through every branch starting at every level.
Because the generic script contains some complexities which may obfuscate the logic, let’s start by talking through the logic at a high level. We start with a table of parent-child relationships, as seen at the top of this post, but which I will present here again for ease of reference:
In essence, we want to repeatedly join the child field back to the parent field of the same table. With this dataset, we would do this twice:
Rather than specify how many times we need to iterate down the tree, we want the script to detect when the last level is reached. This will allow the script to adapt seamlessly to trees of any size, though we may want to consider implementing a maximum loop count in case the data contains circular references (generally, this should not happen, but I have seen it).
The script requires a few input parameters before it can run:
t_ParentChildTable – The name of the table containing the parent-child relationships
v_ParentField – The name of the field containing the parent value
v_ChildField – The name of the field containing the child value
v_ExplodeOut – The name of the new table that will contain the exploded dataset.
v_QtyPerField – The name of a field that describes how many children belong in the parent. You would most commonly find this in bills of material. This parameter should be optional because not all trees have a quantity-per relationship.
We start by first identifying all of the fields in the table that are not the parent, child, or quantity-per field. We compile a list of these fields so we can extract them with our exploded dataset.
OPEN %t_ParentChildTable% SORT ON %v_ParentField% TO Explode_T01 SAVE LAYOUT TABLE TO Explode_T02 OPEN Explode_T02 LOCATE IF field_name = v_ParentField v_ExplodeMax1 = field_length LOCATE IF field_name = v_ChildField v_ExplodeMax2 = field_length v_ExplodeKeyLength = MAXIMUM(v_ExplodeMax1 v_ExplodeMax2) EXTRACT FIELDS ALL IF NOT MATCH(field_name v_ParentField v_ChildField v_QtyPerField) TO Explode_T03 OPEN v_OtherFields = BLANKS(16000) GROUP v_OtherFields = ALLTRIM(v_OtherFields) + BLANKS(1) + ALLTRIM(field_name) END v_OtherFields = ALLTRIM(v_OtherFields)
This logic works by saving the list of fields in the table and then iterating through the fields in the GROUP command which are not the parent, child, or quantity-per fields. The result is a variable that contains the remaining field names, delimited by spaces. So, if our table looks like this:
…then v_OtherFields will be equal to ‘Attribute1 Attribute2’.
One other function of the above code is to provide a facility to normalize the parent and child fields. For the purposes of the explosion, the parent and child fields are interchangeable. This means they need to be the same length in the table layout. Lines 5 through 9 identify the length of the longer of the two fields so we can normalize them during the explosion.
As a side note, the SORT command on line 2 is there so we do not have to continually SECSORT in an upcoming join. For large tree datasets, this can significantly improve the script’s execution time.
Now that we have our list of all of the other fields, we can perform the first level explosion. This level is the easiest because we are essentially creating a copy of the existing table:
COMMENT Extract the first explosion level and create the collection table v_ExplosionLevel = 1 v_ExplodeExtractFields = "SUBSTRING(%v_ParentField% 1 %v_ExplodeKeyLength%) AS 'TopLevel'" v_ExplodeExtractFields = v_ExplodeExtractFields + " SUBSTRING(%v_ParentField% 1 %v_ExplodeKeyLength%) AS 'Parent'" v_ExplodeExtractFields = v_ExplodeExtractFields + " SUBSTRING(%v_ChildField% 1 %v_ExplodeKeyLength%) AS 'Child'" v_ExplodeExtractFields = v_ExplodeExtractFields + " SUBSTRING(ALLTRIM(%v_ParentField%)+'\'+ALLTRIM(%v_ChildField%) 1 1000) AS 'ExplosionPath'" v_ExplodeExtractFields = v_ExplodeExtractFields + " v_ExplosionLevel AS 'ExplosionLevel'" IF NOT ISBLANK(v_QtyPerField) v_ExplodeExtractFields = v_ExplodeExtractFields + " %v_QtyPerField%+0 AS '%v_QtyPerField%'" v_ExplodeExtractFields = v_ExplodeExtractFields + " %v_OtherFields%" OPEN %t_ParentChildTable% EXTRACT FIELDS %v_ExplodeExtractFields% TO Explode_T04 OPEN DELETE v_ExplodeExtractFields OK EXTRACT FIELDS ALL TO Explode_T20
The first part of this code builds the field listing for the EXTRACT command. It adds normalization routines for the parent and child fields, creates an expression to build each level’s path, optionally adds the quantity-per field, and then tacks on the rest of the fields from the original table to the end. Once we have our EXTRACT fields ready we perform an initial extraction to our working table, Explode_T04. We then finish by performing one final extraction to our aggregation table, Explode_T20.
Now that we have extracted the first level to both a working table and an aggregation table, we can call the loop that will explode the rest of the tree structure. I end up calling 1 of 2 scripts, depending on whether or not a quantity-per field was provided. I found that using two separate scripts rather than a single, combined script kept the logic cleaner, even though it duplicates the code.
COMMENT Iterate through each level until we run out of levels or reach level 20 DO Explode2_1 WHILE WRITE1 > 0 AND v_ExplosionLevel 0 AND v_ExplosionLevel < 20 AND ISBLANK(v_QtyPerField)
I will only review the logic of the script that runs when a quantity-per field is provided. The other script’s logic is identical with the exception that the quantity-per calculations are not performed.
The looping script starts by immediately increasing the explosion level counter. This variable keeps track of how many levels we have iterated through the tree data. In addition to providing informational value to the user, it is also used to exit the script if more than 20 levels is exceeded. We increment immediately because we have already extracted level 1; if we did not immediately increment we would have 2 levels being merged into level 1.
Once we increment the counter, a many-to-many join is run between the children of the current level and the parent of the next level. The results of this join represent all of the children of the current level who themselves have children. Notice in the join how the child of the parent table becomes the new parent, and the child of the child table becomes the new child. It is this shifting of key fields that allows the explosion to operate correctly.
COMMENT Increment the current explosion level v_ExplosionLevel = v_ExplosionLevel + 1 COMMENT Get the next set of parent/child relationships OPEN ExplodeSubset_T05 OPEN ExplodeSubset_T02 SECONDARY JOIN MANY PKEY Child FIELDS TopLevel Child AS 'Parent' ExplosionPath SKEY %v_ParentField% WITH %v_ChildField% AS 'Child' %v_QtyPerField% %v_OtherFields% PRESORT TO ExplodeSubset_T06 OPEN CLOSE SECONDARY
Once the new level is identified, we calculate the new quantity per. The script is designed to calculate extended quantities based on the values of all of the prior levels. So, if Level 1 uses 2 units, and level 2 uses 5 units, the extended value of level 3 is 10 units. After the new quantity-per value is calculated, the data is extracted back to table T05 for the next explosion and then extracted a second time to our aggregation table.
COMMENT Calculate the actual quantity per based on the values calculated through the entire explosion DEFINE FIELD NewQtyPer COMPUTED SourceQtyPer * %v_QtyPerField% COMMENT Extract the new set of relationships and append them to the collection table EXTRACT FIELDS TO ExplodeSubset_T05 OPEN SUBSTRING(TopLevel+BLANKS(v_ExplodeKeyLength) 1 %v_ExplodeKeyLength%) AS 'TopLevel' SUBSTRING(Parent+BLANKS(v_ExplodeKeyLength) 1 %v_ExplodeKeyLength%) AS 'Parent' SUBSTRING(Child+BLANKS(v_ExplodeKeyLength) 1 %v_ExplodeKeyLength%) AS 'Child' SUBSTRING(ALLTRIM(ExplosionPath)+'\'+ALLTRIM(Child) 1 1000) AS 'ExplosionPath' v_ExplosionLevel AS 'ExplosionLevel' %v_QtyPerField%+0 AS '%v_QtyPerField%' %v_OtherFields% EXTRACT FIELDS ALL TO ExplodeSubset_T20 APPEND
This looping script continues to run until the join produces no records or the loop counter exceeds 20. Once the script stops looping, our data is fully exploded. The only other transformation I like to perform is to sort the records on the explosion path so the table is easily readable by a human user:
COMMENT Sort the final table by explosion path OPEN ExplodeSubset_T20 SORT ON ExplosionPath TO "%v_ExplodeSubsetOut%" OPEN
And that is how to perform a basic explosion of tree data in ACL. You can use these scripts stand-alone or as a starting point for new scripts that perform more specialized operations. For example, if you have tree data comprising millions of records, it may be infeasible (i.e. time consuming) to explode all of the data. These scripts can be adjusted to incorporate a sub-population of parent values from which the explosion is performed. This can save a lot of time and allow you to focus your efforts on the data that is really important. I leave the actual coding of this change to you; but if you get stuck feel free to reach out.