À Does Not Equal À, A Unicode Story

I apologize for the delay in my monthly postings. January was a busy month, and I needed a break to focus on work and home.

I had hoped my first post regarding Unicode would also be my last. Unicode and its many flavors tend to be a distracting annoyance while developing analytics. Our applications usually handle translating the various encodings in the background while we focus on getting our logic correct. Unfortunately, sometimes issues bubble up to the surface and we have to deal with the annoying details directly.

I recently ran across an example that started with regular expressions executed on non-Latin alphabets. I was using an expression something like this:

v_Text = REGEXREPLACE(FieldName '\W' BLANKS(1))

It is a pretty simple and standard regular expression to remove non-alphanumeric characters from a text string. I was using it to clear away punctuation during a process to normalize names from data sources I obtained from two disparate systems, with the intention of later joining those data sources to identify matching records.
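
For Latin-script names the expression behaves exactly as intended. Here is a quick command-line check with a made-up name; the apostrophe, comma, hyphen, and spaces are all non-word characters, and each is replaced with a single blank:

COMMENT Illustrative only: returns something like "O Brien  Mary Anne".
CALCULATE REGEXREPLACE("O'Brien, Mary-Anne" '\W' BLANKS(1))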

I discovered that this was not working correctly when our monthly reports identified exceptions that were not actually exceptions. The common factor among the false positives was that all of the names were written in non-Latin characters, in my case simplified Chinese.

For those of you not familiar with regular expressions, the ‘\w’ character class matches alphanumeric characters (plus the underscore), while ‘\W’ matches everything else. However, ACL’s implementation of regular expressions does not extend the \w character class to Unicode letters outside of the Latin alphabet. This means letters in writing systems such as simplified Chinese, Thai, and Hindi all match the \W character class, and in my function they were being replaced with blanks. As a result, completely different Chinese names were being identified as matches because they all resolved to blank strings.
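
Here is a minimal way to see the failure from the command line. The name below is just an illustrative simplified Chinese value, and the result rests on the behavior described above: every character matches \W, so the whole value comes back as blanks.

COMMENT Illustrative only: ACL's \W treats the Chinese characters as non-word characters, so the entire name is replaced with blanks.
CALCULATE REGEXREPLACE("张伟" '\W' BLANKS(1))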

In ACL’s defense, they state in their documentation that Unicode is not fully supported in the regex functions. I happened to stumble across a use-case where it was not supported and my analytic development tests did not include non-Latin samples. So the root cause of my issue was ignoring ACL’s documentation and not properly testing my code before release to production.

But first, I had a problem to fix. After some digging I discovered a character class that does cover Unicode letters, and it looks very similar to the class we all know and love. Rather than using \w to identify letters, we can use the POSIX-like class [:w:]. So identifying a word character goes from:

REGEXFIND(Text '\w')

to

REGEXFIND(Text '[:w:]')

And identifying characters that are not word characters goes from:

REGEXFIND(Text '\W')

to

REGEXFIND(Text '[^[:w:]]')
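
To confirm the difference from the command line, an illustrative simplified Chinese value should return T with \W (every character is treated as a non-word character) and F with the negated POSIX-like class, assuming the behavior described above:

COMMENT \W flags the Chinese characters as non-word characters; [^[:w:]] finds nothing to flag.
CALCULATE REGEXFIND("张伟" '\W')
CALCULATE REGEXFIND("张伟" '[^[:w:]]')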

There are other POSIX-like classes that can be used in our regular expressions; I have not explored any besides [:w:] and cannot comment on their effectiveness.

Changing my regular expression to the POSIX-like character class allowed my script to correctly identify and clean non-word characters and eliminated the false positive issue. It also caused me to start thinking a bit more about improving our normalization techniques and led me to discover something interesting about Unicode: when normalizing and matching text, À may not always match À. In fact, any accented or marked Latin letter can be represented in one of two ways:

  • The letter can be hard-coded as a single, precomposed character.
  • The letter can be composed of a base character with one or more diacritic marks, with each mark being its own character.

Attempts to match a hard-coded letter to a letter combined with diacritic marks will fail. Try this in your ACL command line if you have the Unicode version of ACL:

CALCULATE CHR(241)=HTOU("006E0303")
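
The comparison returns F even though both sides render the same ‘ñ’ glyph on screen, which is what makes this so easy to miss. Checking the lengths also exposes the difference; if my understanding of the decomposed form is right, these should return 2 and 1 respectively:

COMMENT The decomposed form is stored as two characters (the base letter plus the combining tilde); the hard-coded form is one.
CALCULATE LEN(HTOU("006E0303"))
CALCULATE LEN(CHR(241))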

To further add complexity to the issue, it would be grand if we could match A to À, because the world is full of poor spellers, lazy spellers, or people like me who are 2% fluent in a foreign language and have no clue when to use accents. In an ideal world, the letters in the string produced by the following command (“ñnñ”) would all normalize to ‘n’:

CALCULATE HTOU("006E0303")+ALLTRIM(REGEXREPLACE(HTOU("006E0303") '[^[:w:]]' ' '))+CHR(241)

One interesting side-effect of using the POSIX-like class to match word characters is that diacritic marks end up getting stripped away. You can see this for yourself in ACL with this command:

CALCULATE REGEXREPLACE(HTOU("006E0303") '[^[:w:]]' ' ')

The hex code 006E0303 is the decomposed form of ‘ñ’: the base letter ‘n’ (006E) followed by a combining tilde (0303). As you can see in ACL, the mark is removed because it does not have the ‘Letter’ property in Unicode. This means our transition to [:w:] has already solved half of the problem of normalizing our letters.
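
The hard-coded accented letters are a different story: they do carry the ‘Letter’ property, so [^[:w:]] leaves them alone. If the behavior described above holds, this command should return ‘ñ’ unchanged:

COMMENT The hard-coded, single-character ‘ñ’ counts as a word character, so nothing is replaced.
CALCULATE REGEXREPLACE(CHR(241) '[^[:w:]]' ' ')

Fortunately those letters are finite in number, and they can be normalized with an explicit mapping like so: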

COMMENT Capture the source field length so the cleaned value can be padded back to a fixed width inside the GROUP.
v_NormalizeLettersLen = LEN(%v_NormalizeLettersField%)
COMMENT Uppercase each value, strip characters that lack the Unicode 'Letter' property (this also removes combining marks), pad to the original length, then fold each hard-coded accented letter to its base letter.
GROUP
  v_NormalizeLettersText = UPPER(%v_NormalizeLettersField%)
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[^[:w:]\s]' '') + BLANKS(v_NormalizeLettersLen)
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÀÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦȺ]' 'A')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ƁɃḂ]' 'B')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÇĆĈĊČȻ]' 'C')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ĎĐḊ]' 'D')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÈÉÊËĒĔĖĘĚȄȆȨɆ]' 'E')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[Ḟ]' 'F')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ĜĞĠĢǤǦǴ]' 'G')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ĤĦȞ]' 'H')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÌÍÎÏĨĪĬĮİǏȈȊ]' 'I')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[Ĵ]' 'J')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ĶǨ]' 'K')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ĹĻĽĿŁ]' 'L')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[Ṁ]' 'M')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÑŃŅŇǸ]' 'N')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÒÓÔÕÖØŌŎŐǑǪǬǾȌȎȪȬȮȰ]' 'O')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[Ṗ]' 'P')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ŔŖŘȐȒɌ]' 'R')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ŚŜŞŠȘṠ]' 'S')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ŢŤŦȚȾṪ]' 'T')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÙÚÛÜŨŪŬŮŰŲǓǕǗǙǛȔȖɄ]' 'U')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ŴẀẂẄ]' 'W')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ÝŶŸȲɎỲ]' 'Y')
  v_NormalizeLettersText = REGEXREPLACE(v_NormalizeLettersText '[ŹŻŽȤ]' 'Z')
  EXTRACT FIELDS ALL v_NormalizeLettersText AS '%v_NormalizeLettersNewField%' TO "%v_NormalizeLettersOut%" OPEN
END
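
The script relies on a few substitution variables being set by the caller before it runs. A hypothetical invocation might look like the following, where the table, field, and script names are all made up for illustration:

OPEN Vendors
COMMENT Field to normalize, name of the new normalized field, and the output table.
v_NormalizeLettersField = 'Vendor_Name'
v_NormalizeLettersNewField = 'Vendor_Name_Norm'
v_NormalizeLettersOut = 'Vendors_Normalized'
DO SCRIPT Normalize_Letters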

These changes to our normalization process help to ensure we obtain the most accurate matches with the information we have today. But as anyone who has tried to normalize knows, normalization is a journey of continuous improvement rather than a reachable end goal. Unicode is still mostly a black box to me and I do not know if these changes will cause issues in other areas or if further improvements are possible. For now, our state today seems to be an improvement over our state yesterday. In a process that relies heavily on iterative improvement, that is the best anyone can ask for.
