When Unicode=Unicode Returns False

Every now and then I stumble across something I did not know that I did not know while working on an analytic project.  Because I come from a Finance/Accounting background rather than an IT background, this most often happens because of a technical aspect of the project.  Character encoding is one such example I stumbled across during one of my first projects as a professional data analyst. I was designing a database that would allow us to move our Continuous Controls Monitoring (CCM) results from Excel reports to a centralized storage location. I decided to piggy-back on the PostgreSQL installation from AuditExchange to create a database specifically for CCM results. To get the data into the database, I decided on a process that would export ACL tables as delimited text files and then import those text files to the database. All went well until I tried to import a text file into Postgres and I received an error message about an invalid byte sequence for encoding “UTF8”. My Unicode text file from the Unicode version of ACL would not import into a Unicode PostgreSQL database.

What?

Long story short, I had inadvertently discovered that ‘Unicode’ is a broad term that covers a whole family of encodings: UTF8, UTF16 (UCS2, in both Little Endian and Big Endian flavors, is essentially a fixed-width predecessor of UTF16), UTF32, and so on. The Unicode edition of ACL writes files in UCS2 Little Endian, while PostgreSQL expects UTF8. This means text files exported by ACL cannot be read directly by PostgreSQL, or by any other program that expects UTF8 encoding.
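To see the difference at the byte level, here is a quick check you can run in any PowerShell console (the string is an arbitrary example; .NET's ‘Unicode’ encoding is UTF16 Little Endian, the same layout ACL writes):

# The same three characters produce different byte sequences under each encoding
$text = "ACL"
[System.Text.Encoding]::UTF8.GetBytes($text)    # 65 67 76 (one byte per character)
[System.Text.Encoding]::Unicode.GetBytes($text) # 65 0 67 0 76 0 (UCS2/UTF16 LE, two bytes per character)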

So, how to get data from ACL into PostgreSQL? The answer is a rather simple PowerShell command that will perform the conversion for you:

PowerShell.exe (Get-Content "path_to_file") | Set-Content "path_to_file" -Encoding UTF8

Set-Content creates a file with the encoding specified by the -Encoding parameter. It needs input, which Get-Content supplies from before the pipe; the pipe tells PowerShell to feed the object created by Get-Content into Set-Content. The parentheses around Get-Content matter: they force the whole file to be read into memory before Set-Content opens it for writing, which is why the same path can safely appear on both sides. The really cool part about this sequence is that it overwrites the UCS2-encoded file with UTF8-encoded content in one single action; no secondary files are created which then have to be cleaned up later.
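If you want to verify the result, check the file's first few bytes. A UCS2 Little Endian file starts with the byte order mark FF FE. One caveat worth knowing: in Windows PowerShell, Set-Content -Encoding UTF8 writes a UTF8 byte order mark (EF BB BF), while newer cross-platform PowerShell versions omit the BOM by default. A quick inspection sketch, with path_to_file as the placeholder:

# Print the first three bytes of the file in hex
[System.IO.File]::ReadAllBytes("path_to_file")[0..2] | ForEach-Object { "{0:X2}" -f $_ }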

Because the PowerShell command is a single line, we can easily embed it into ACL’s EXECUTE command and call it from our scripts. If we replace path_to_file with a substituted variable, we can create re-usable code to convert any file:

EXECUTE 'PowerShell.exe (Get-Content "%v_SaveTo%") | Set-Content "%v_SaveTo%" -Encoding UTF8'
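For clarity, here is what the operating system actually receives once ACL substitutes the variable (the path is a hypothetical example):

PowerShell.exe (Get-Content "C:\CCM\AP_Results.txt") | Set-Content "C:\CCM\AP_Results.txt" -Encoding UTF8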

I only ever use this EXECUTE command in the context of exporting a file from ACL, so I took the concept one step further and created a script that both exports the data and converts it to UTF8. This script uses the modularization techniques discussed in a prior post, which let me use it in place of a normal EXPORT command:

COMMENT
** This script exports the currently open table to a delimited text file and converts it to UTF8 encoding
**
** The v_SaveTo variable is required and tells the script which file to export the table to
** A table must be open for this script to run
**
** The output of this script is a delimited text file encoded in UTF8
END

COMMENT Use a delimiter that will (realistically) never show up in text fields
v_Sep = CHR(31)

DELETE SCRIPT ExportToUtf8 OK
DELETE SCRIPT ExportToUtf82 OK
v_Library = "Path\To\Library" IF FTYPE("v_Library") = "U"

COMMENT Make sure the v_SaveTo variable is provided
PAUSE "v_SaveTo was not provided" IF FTYPE("v_SaveTo") = "U"
ESCAPE ALL IF FTYPE("v_SaveTo") = "U"

COMMENT Export the current table to a delimited file
EXPORT FIELDS ALL UNICODE DELIMITED TO "%v_SaveTo%" KEEPTITLE SEPARATOR "%v_Sep%" QUALIFIER NONE

COMMENT Convert the exported delimited file to a UTF8 encoded text file
EXECUTE 'PowerShell.exe (Get-Content "%v_SaveTo%") | Set-Content "%v_SaveTo%" -Encoding UTF8'

DELETE SCRIPT ExportToUtf8 OK
DELETE SCRIPT ExportToUtf82 OK

With this script, transferring data from ACL to the database became possible. We are now able to store our CCM results for historical reporting and analysis. Best of all, the entire process is completely automated and reliable, keeps the system resource footprint small, and uses only tools that are already available on the server or PC.
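If you ever need to convert a whole folder of exports outside of ACL, the same one-liner scales up. A minimal sketch, assuming a hypothetical C:\CCM\Exports folder:

# Convert every delimited text file in the folder to UTF8, in place
Get-ChildItem "C:\CCM\Exports" -Filter *.txt | ForEach-Object {
    (Get-Content $_.FullName) | Set-Content $_.FullName -Encoding UTF8
}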

If you want a quick overview of character encodings and their history, this article is a good starting point.
