
Friday, 24 May 2019

apl-reggie: Regular Expressions made easier in APL

What's APL?

APL is an acronym: it stands for A Programming Language.

I first met APL at the IBM education centre in Sudbury Towers in London. I was a student reading Maths at Cambridge University, and IBM asked me to do a summer research project into a new technology called Computer Assisted Instruction. (I wonder what happened to that crazy idea?)

APL was one of the first languages to offer a REPL (Read Evaluate Print Loop), so it looked like good technology for exploratory programming.

APL was created by a mathematician. Its notation and syntax rationalise mathematical notation, and it was designed to describe array (tensor) operations naturally and consistently.

For a while in the '70s and '80s APL ruled the corporate IT world. These days it's used to solve problems that involve complex calculations on large arrays. It's not yet used as widely as it should be by AI researchers or data scientists, but I think it will be, for reasons that deserve a separate blog post.

I use it a lot, as I have done throughout most of my professional life. These days APL is well integrated with current technology. There's a bi-directional APL to Python bridge, and APL programs sit naturally in version control systems like Git and on hosting services like GitHub.

The leading commercial implementation is Dyalog APL, and there's an official port that runs on the Raspberry Pi. It's free for non-commercial use. Dyalog APL's IDE is called RIDE; it runs in a browser and you can use it to connect to a local or remote APL session.

One feature of Dyalog APL is support for Perl-style regexes (regular expressions).

Regular expressions are useful but hard to read. A while ago I blogged about reggie-dsl, a Python library that allows you to write readable regular expressions. I mentioned that Morten Kromberg and I were experimenting with an APL version of reggie. apl-reggie is now ready to share.

APL already has great tools for manipulation of character data. Many text processing tasks can be solved simply and concisely using APL's array-processing primitives.

As a simple example, imagine that you want to sanitize some text in the way that 18th-century authors did, by replacing the vowels in rude words with asterisks.

I'll save your blushes by using 'bigger' as the word to be sanitized.

In APL you can find which characters are vowels by using ∊, the membership function.
    
     'bigger' ∊ 'aeiou'
0 1 0 0 1 0

The boolean result has a 1 corresponding to each vowel in the character vector, and a zero for each non-vowel.

You can compose the membership function with its right argument to produce a new vowel-detecting function:

     vi ← ∊∘'aeiou' ⍝ short for 'vowels in'
     vi 'bigger'
0 1 0 0 1 0


You can combine that with @ (the 'at' operator) to replace vowels with asterisks:

  
     ('*'@(∊∘'aeiou')) 'bigger'
b*gg*r
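
For readers who know Python better than APL, the same two steps can be sketched there for comparison. This is only an illustrative translation and not part of apl-reggie; the function names `vowels_in` and `sanitize` are my own.

```python
def vowels_in(text, vowels="aeiou"):
    """Return a 0/1 mask marking which characters are vowels (like APL's ∊)."""
    return [int(ch in vowels) for ch in text]


def sanitize(text, vowels="aeiou"):
    """Replace every vowel with an asterisk (like '*'@(∊∘'aeiou') in APL)."""
    return "".join("*" if ch in vowels else ch for ch in text)


print(vowels_in("bigger"))  # [0, 1, 0, 0, 1, 0]
print(sanitize("bigger"))   # b*gg*r
```

The APL versions work on the whole array at once; the Python sketch has to spell out the loop over characters.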

If you want to do more complex pattern matching, regular expressions are a good solution.

Here's apl-reggie code to recognise and analyse telephone numbers in North American format:

d3←3 of digit
d4←4 of digit
local←osp('exchange'defined d3)dash('number'defined d4)
area←optional osp('area'defined lp d3 rp)
international←'i'defined optional escape'+1'
number←international area local


You can use it like this:
    
     '+1 (123) 345-2192' match number

and here is the result:
i +1
area (123)
exchange 345
number 2192
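
To see what the definitions above are standing in for, here is a rough hand-written equivalent using Python's standard `re` module. It is a sketch under assumptions, not the regex that apl-reggie actually generates: I am assuming `osp` means an optional space and that `lp`, `rp` and `dash` are the literal characters, and the helpers `named` and `optional` are hypothetical names invented for this illustration.

```python
import re


def named(name, pattern):
    """Wrap a pattern in a named capture group (hypothetical helper)."""
    return f"(?P<{name}>{pattern})"


def optional(pattern):
    """Make a pattern optional using a non-capturing group (hypothetical helper)."""
    return f"(?:{pattern})?"


d3 = r"\d{3}"
d4 = r"\d{4}"
osp = " ?"  # assumed meaning of osp: an optional space

local = osp + named("exchange", d3) + "-" + named("number", d4)
area = optional(osp + named("area", r"\(" + d3 + r"\)"))
international = named("i", optional(re.escape("+1")))
number = re.compile(international + area + local)

m = number.fullmatch("+1 (123) 345-2192")
print(number.pattern)
print(m.groupdict())
```

Printing `number.pattern` shows the raw regular expression that the named building blocks expand to, which is the kind of text you would otherwise have to write and maintain by hand.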

The original idea for reggie (and apl-reggie) came from a real application that processed CDRs (call detail records).

CDRs are records created by Telcos; they describe phone calls and other billable services. There are standards for CDRs. The example given below is a slightly simplified version of the real format.

N,+448000077938,+441603761827,09/08/2015,07:00:12,2

That's a record of a normal (N-type) call from +448000077938 to +441603761827, made on the 9th of August 2015. It was made just after 7 AM, and it lasted for just 2 seconds.

Here's the declarative code that defines the format of a CDR:

r←cdr
call_type←'call_type'defined one_of'N','V','D'
number←plus,12 15 of digit
dd←2 of digit
year←4 of digit
date←'date'defined dd,slash,dd,slash,year
time←'time'defined dd,colon,dd,colon,dd
duration←'duration'defined digits
cc←'class'defined 0 50 of capital
r←csv call_type('caller'defined number)('called'defined optional number)date time duration cc
 

And here's the result when you run it on that record (note the trailing comma, which marks the empty class field):

    'N,+448000077938,+441603761827,09/08/2015,07:00:12,2,' match cdr
call_type N
caller +448000077938
called +441603761827
date 09/08/2015
time 07:00:12
duration 2
class 
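
For comparison, here is a rough Python sketch of the same CDR format written directly with the standard `re` module. It is my own approximation, not the expression apl-reggie builds: I am assuming that `csv` joins the fields with commas, that `capital` means A to Z, and that the date is day/month/year, as the description of the record suggests.

```python
import re
from datetime import datetime

number = r"\+\d{12,15}"
date = r"\d{2}/\d{2}/\d{4}"
time = r"\d{2}:\d{2}:\d{2}"

# One named group per field, in the same order as the APL definition.
fields = [
    r"(?P<call_type>[NVD])",
    rf"(?P<caller>{number})",
    rf"(?P<called>(?:{number})?)",  # the called number is optional
    rf"(?P<date>{date})",
    rf"(?P<time>{time})",
    r"(?P<duration>\d+)",
    r"(?P<class>[A-Z]{0,50})",      # may be empty
]
cdr = re.compile(",".join(fields))  # assumed behaviour of csv

record = "N,+448000077938,+441603761827,09/08/2015,07:00:12,2,"
parsed = cdr.fullmatch(record).groupdict()

# Combine the date and time fields, reading the date as day/month/year.
started = datetime.strptime(parsed["date"] + " " + parsed["time"], "%d/%m/%Y %H:%M:%S")

for name, value in parsed.items():
    print(name, value)
print(started)
```

In this sketch the record must end with a comma, because the class field is always present even when it is empty.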



apl-reggie is now a public repository on GitHub.

Feel free to ask questions in the comments below. You may also get some help via the Dyalog support forums, although Dyalog didn't write the software and it's not officially supported.

If you want to experiment with this unique language you can do so at tryapl.org.


