Danger! Unexploded Data

Software applications commonly make use of tree structures to arrange items in a hierarchical manner. File systems, database indexes, bills of material, HR organizational data, etc are all usually structured in a tree format. The persisted data that represents these structures are often stored in a dictionary of parent/child relationships, looking something like this:

013_01

If we were to expand this data into its full form, it would look like the following tree:

013_02

Alternatively, we could present this data as a list of paths, which is the format I find most useful in my analysis routines:

A\B\C1\D
A\B\C2\E
B\C1\D
B\C2\E
C1\D
C2\E

We may find it helpful in our data analyses to be able to expand hierarchical data so we can analyze various aspects of the tree. For example, we may wish to expand a bill of material from the final product into all of the components and routings that comprise that product. We could use this information to perform verifications of cost rollups. I once performed an analysis of forecasting models that were designed to be representative of a typical high-option-content product. To accomplish this, each option package was quantified as a ratio that approximated the historical take rate. So, if option A was purchased 30% of the time and option B was purchased 70% of the time, options A and B were entered into the bill of material at quantities of 0.3 and 0.7, respectively. You can imagine my surprise when the total of the options was less than 1. We were not fully costing our forecasting models, which was contributing to unfavorable variances in our monthly budget-to-actual reporting.

To help me accomplish the analysis, I had to explode our bill of materials. At least, that’s what I call the process of expanding the tree structure down to its deepest level. I am not sure if anyone else calls the process ‘exploding’, but I always have, and so I will continue to do so in this post.

Since that analysis I have needed to explode many datasets. Each time I would create a script, from the ground up, that was designed for each specific circumstance. About a month ago I finally got tired of (re-)writing these scripts and designed a generic explosion script. I will walk through that script and its logic. I always find these scripts to be delightfully challenging, which is perhaps why it took so long for me to finally create a general-purpose utility script. These scripts are also quite exciting because, rather than iterating through a single path of the tree as their owner applications do (and which, to be honest, is a very trivial exercise), we will be iterating through every branch starting at every level.

Because the generic script contains some complexities which may obfuscate the logic, let’s start by talking through the logic at a high level. We start with a table of parent-child relationships, as seen at the top of this post, but which I will present here again for ease of reference:

013_01

In essence, we want to repeatedly join the child field back to the parent field of the same table. With this dataset, we would do this twice:

013_03

Rather than specify how many times we need to iterate down the tree, we want the script to detect when the last level is reached. This will allow the script to adapt seamlessly to trees of any size, though we may want to consider implementing a maximum loop count in case the data contains circular references (generally, this should not happen, but I have seen it).

The script requires a few input parameters before it can run:

t_ParentChildTable – The name of the table containing the parent-child relationships
v_ParentField – The name of the field containing the parent value
v_ChildField – The name of the field containing the child value
v_ExplodeOut – The name of the new table that will contain the exploded dataset.
v_QtyPerField – The name of a field that describes how many children belong in the parent. You would most commonly find this in bills of material. This parameter should be optional because not all trees have a quantity-per relationship.

We start by first identifying all of the fields in the table that are not the parent, child, or quantity-per field.  We compile a list of these fields so we can extract them with our exploded dataset.

OPEN %t_ParentChildTable%
SORT ON %v_ParentField% TO Explode_T01
SAVE LAYOUT TABLE TO Explode_T02
OPEN Explode_T02
LOCATE IF field_name = v_ParentField
v_ExplodeMax1 = field_length
LOCATE IF field_name = v_ChildField
v_ExplodeMax2 = field_length
v_ExplodeKeyLength = MAXIMUM(v_ExplodeMax1 v_ExplodeMax2)
EXTRACT FIELDS ALL IF NOT MATCH(field_name v_ParentField v_ChildField v_QtyPerField) TO Explode_T03 OPEN
v_OtherFields = BLANKS(16000)
GROUP
  v_OtherFields = ALLTRIM(v_OtherFields) + BLANKS(1) + ALLTRIM(field_name)
END
v_OtherFields = ALLTRIM(v_OtherFields)

This logic works by saving the list of fields in the table and then iterating through the fields in the GROUP command which are not the parent, child, or quantity-per fields. The result is a variable that contains the remaining field names, delimited by spaces. So, if our table looks like this:

013_04

…then v_OtherFields will be equal to ‘Attribute1 Attribute2’.

One other function of the above code is to provide a facility to normalize the parent and child fields. For the purposes of the explosion, the parent and child fields are interchangeable. This means they need to be the same length in the table layout. Lines 5 through 9 identify the length of the longer of the two fields so we can normalize them during the explosion.

As a side note, the SORT command on line 2 is there so we do not have to continually SECSORT in an upcoming join. For large tree datasets, this can significantly improve the script’s execution time.

Now that we have our list of all of the other fields, we can perform the first level explosion. This level is the easiest because we are essentially creating a copy of the existing table:

COMMENT Extract the first explosion level and create the collection table
v_ExplosionLevel = 1
v_ExplodeExtractFields = "SUBSTRING(%v_ParentField% 1 %v_ExplodeKeyLength%) AS 'TopLevel'"
v_ExplodeExtractFields = v_ExplodeExtractFields + " SUBSTRING(%v_ParentField% 1 %v_ExplodeKeyLength%) AS 'Parent'"
v_ExplodeExtractFields = v_ExplodeExtractFields + " SUBSTRING(%v_ChildField% 1 %v_ExplodeKeyLength%) AS 'Child'"
v_ExplodeExtractFields = v_ExplodeExtractFields + " SUBSTRING(ALLTRIM(%v_ParentField%)+'\'+ALLTRIM(%v_ChildField%) 1 1000) AS 'ExplosionPath'"
v_ExplodeExtractFields = v_ExplodeExtractFields + " v_ExplosionLevel AS 'ExplosionLevel'"
IF NOT ISBLANK(v_QtyPerField) v_ExplodeExtractFields = v_ExplodeExtractFields + " %v_QtyPerField%+0 AS '%v_QtyPerField%'"
v_ExplodeExtractFields = v_ExplodeExtractFields + " %v_OtherFields%"

OPEN %t_ParentChildTable%
EXTRACT FIELDS %v_ExplodeExtractFields% TO Explode_T04 OPEN
DELETE v_ExplodeExtractFields OK

EXTRACT FIELDS ALL TO Explode_T20

The first part of this code builds the field listing for the EXTRACT command. It adds normalization routines for the parent and child fields, creates an expression to build each level’s path, optionally adds the quantity-per field, and then tacks on the rest of the fields from the original table to the end. Once we have our EXTRACT fields ready we perform an initial extraction to our working table, Explode_T04. We then finish by performing one final extraction to our aggregation table, Explode_T20.

Now that we have extracted the first level to both a working table and an aggregation table, we can call the loop that will explode the rest of the tree structure. I end up calling 1 of 2 scripts, depending on whether or not a quantity-per field was provided. I found that using two separate scripts rather than a single, combined script kept the logic cleaner, even though it duplicates the code.

COMMENT Iterate through each level until we run out of levels or reach level 20
DO Explode2_1 WHILE WRITE1 > 0 AND v_ExplosionLevel  0 AND v_ExplosionLevel < 20 AND ISBLANK(v_QtyPerField)

I will only review the logic of the script that runs when a quantity-per field is provided. The other script’s logic is identical with the exception that the quantity-per calculations are not performed.

The looping script starts by immediately increasing the explosion level counter. This variable keeps track of how many levels we have iterated through the tree data. In addition to providing informational value to the user, it is also used to exit the script if more than 20 levels is exceeded. We increment immediately because we have already extracted level 1; if we did not immediately increment we would have 2 levels being merged into level 1.

Once we increment the counter, a many-to-many join is run between the children of the current level and the parent of the next level. The results of this join represent all of the children of the current level who themselves have children. Notice in the join how the child of the parent table becomes the new parent, and the child of the child table becomes the new child. It is this shifting of key fields that allows the explosion to operate correctly.

COMMENT Increment the current explosion level
v_ExplosionLevel = v_ExplosionLevel + 1

COMMENT Get the next set of parent/child relationships
OPEN ExplodeSubset_T05
OPEN ExplodeSubset_T02 SECONDARY
JOIN MANY PKEY Child FIELDS TopLevel Child AS 'Parent' ExplosionPath SKEY %v_ParentField% WITH %v_ChildField% AS 'Child' %v_QtyPerField% %v_OtherFields% PRESORT TO ExplodeSubset_T06 OPEN
CLOSE SECONDARY

Once the new level is identified, we calculate the new quantity per. The script is designed to calculate extended quantities based on the values of all of the prior levels. So, if Level 1 uses 2 units, and level 2 uses 5 units, the extended value of level 3 is 10 units. After the new quantity-per value is calculated, the data is extracted back to table T05 for the next explosion and then extracted a second time to our aggregation table.

COMMENT Calculate the actual quantity per based on the values calculated through the entire explosion
DEFINE FIELD NewQtyPer COMPUTED SourceQtyPer * %v_QtyPerField%

COMMENT Extract the new set of relationships and append them to the collection table
EXTRACT FIELDS TO ExplodeSubset_T05 OPEN
SUBSTRING(TopLevel+BLANKS(v_ExplodeKeyLength) 1 %v_ExplodeKeyLength%) AS 'TopLevel'
SUBSTRING(Parent+BLANKS(v_ExplodeKeyLength) 1 %v_ExplodeKeyLength%) AS 'Parent'
SUBSTRING(Child+BLANKS(v_ExplodeKeyLength) 1 %v_ExplodeKeyLength%) AS 'Child'
SUBSTRING(ALLTRIM(ExplosionPath)+'\'+ALLTRIM(Child) 1 1000) AS 'ExplosionPath'
v_ExplosionLevel AS 'ExplosionLevel'
%v_QtyPerField%+0 AS '%v_QtyPerField%'
%v_OtherFields%

EXTRACT FIELDS ALL TO ExplodeSubset_T20 APPEND

This looping script continues to run until the join produces no records or the loop counter exceeds 20. Once the script stops looping, our data is fully exploded. The only other transformation I like to perform is to sort the records on the explosion path so the table is easily readable by a human user:

COMMENT Sort the final table by explosion path
OPEN ExplodeSubset_T20
SORT ON ExplosionPath TO "%v_ExplodeSubsetOut%" OPEN

And that is how to perform a basic explosion of tree data in ACL. You can use these scripts stand-alone or as a starting point for new scripts that perform more specialized operations. For example, if you have tree data comprising millions of records, it may be infeasible (i.e. time consuming) to explode all of the data. These scripts can be adjusted to incorporate a sub-population of parent values from which the explosion is performed. This can save a lot of time and allow you to focus your efforts on the data that is really important. I leave the actual coding of this change to you; but if you get stuck feel free to reach out.

Advertisements

7 thoughts on “Danger! Unexploded Data

  1. Another great post, Thomas. Something that might cause a problem in your scripts….

    SUBSTR(string, start, length)

    SUBSTR will not pad with spaces on the right if the length passed to the function is greater than the original string length.

    The documentation for the SUBSTR command is incorrect, even though I reported it to ACL back in April 2015 and they said they would correct it.

    So I created a new enhancement request to fix the SUBSTR documentation:
    https://community.acl.com/s/idea/087C0000000Q84hIAC

    Like

    1. Hi Chris,

      Thanks for your comment! Now that you mention it I do remember the issue with SUBSTRING you mentioned. But something was not adding up, because my scripts have worked multiple times in my production environment. So I did some research, and this is what I found:

      When using SUBSTRING to assign to a variable, you are correct: spaces are not padded on the right. If we assign v_Len a value of SUBSTRING(“A” 1 10), the length of v_Len will be 1, not 10.

      However, when SUBSTRING is used in DEFINE FIELD or EXTRACT-like commands, the spaces do get padded. For example, if we have a field called ‘TestField’ that has a length of 1, both of these commands result in a field with a length of 10:

      DEFINE FIELD Test2 COMPUTED SUBSTRING(TestField 1 10)

      EXTRACT FIELDS SUBSTRING(TestField 1 10) AS ‘Test2’ TO NewTable

      So, it appears that SUBSTRING is being a bit inconsistent. But at the end of the day, the scripts as I have posted them will work because they are not using SUBSTRING in the context of assigning a value to a variable.

      Like

  2. Hi
    I finally made it to your blog and was impressed by the first post – can’t wait to read more. I did something similar in SAP where I wanted to expand the hierarchy for Cost centers. However, I used a series of RELATE commands until the 11 layers were all represented in a string showing parent child grandchild ….. etc.

    Like

    1. Thanks for taking the time to read a bit from my humble corner of the blogosphere!

      Using a relation never really entered my mind as being usable for this purpose, so thanks for sharing. It would likely be faster because the index only has to get created once and no temporary tables would be required. If memory serves, only 18 tables can be related in a single chain. Is this an inherent limitation of using a relation for this task or would you know of a way to circumvent it?

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s