Unix legend, who owes us nothing, keeps fixing foundational AWK code
A Princeton professor, finding some time for himself in a summer academic lull, sent an email to an old friend a couple of months ago. Brian Kernighan said hello, asked how their US visit was going, and sent hundreds of lines of code that could add Unicode support to AWK, a text analysis tool he helped build for Unix at Bell Labs in 1977.
“I’ve tested this quite a bit, but more testing is clearly needed,” Kernighan wrote in an email posted as a kind of pseudo-commit to the onetrueawk repository by longtime maintainer Arnold Robbins. “Once I figure out how… I’ll try to submit a pull request. I wish I could understand git better, but despite your help, I still don’t have the right understanding, so it might take a while.”
Kernighan is the “K” in AWK, a special-purpose language for extracting and manipulating text that was key to Unix’s pipeline features and to interoperability between systems. A working awk implementation (AWK is the language, awk the command that invokes it) is critical to both the Single UNIX Specification and IEEE POSIX certification for interoperability. There are countless variants of awk, but “One True AWK,” sometimes known as nawk, is the version based on Kernighan’s 1985 book The AWK Programming Language and his subsequent input.
Kernighan is also the “K” in “K&R C,” the seminal 1978 book The C Programming Language, which he co-wrote with Dennis Ritchie and which remains with programmers, mentally and in battered paper form. His roots go much deeper. Kernighan learned C at Bell Labs and convinced its creator, Dennis Ritchie, to collaborate on a book to spread the knowledge. That book spawned the “One True Brace Style,” the endless debate that goes along with it, and the structure that underpins most modern code.
Kernighan also named Unix and was the first to demonstrate the “Hello, world” code example. He spoke with Richard Jensen of Ars Technica about 50 years of Unix history.
The onetrueawk repository, where Kernighan appeared in late May, is a relatively quiet place with 21 contributors, 46 GitHub users watching, and commits appearing every few months. As noted by The Register, Kernighan’s Unicode fix became known mainly because it was mentioned in an interview with the professor on Computerphile’s YouTube channel.
“It’s always been embarrassing that AWK only works with ASCII, or maybe 8-bit input, but doesn’t really handle Unicode at all,” Kernighan told interviewer Professor Brailsford. “A few months ago I spent some time working (laughs) with an incredibly old program. I have it at the moment where it actually handles UTF-8 input and output, so you can have regular expressions that, you know, pick up Japanese characters and stuff.”
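To make the distinction concrete, here is a minimal sketch in Python (not AWK, and not Kernighan’s actual C changes) of the gap between byte-oriented and character-oriented text handling. The sample string and variable names are made up for illustration; the point is only that a tool that counts and matches bytes sees fragments of a Japanese character, while one that decodes UTF-8 sees the character itself.

```python
# Illustrative only: contrasts byte-oriented and character-oriented handling.
# The sample text and variable names are hypothetical, not from onetrueawk.
import re

text = "awk 日本語"
raw = text.encode("utf-8")

# A byte-oriented tool sees 13 bytes; a Unicode-aware one sees 7 characters.
byte_count = len(raw)
char_count = len(text)

# Character-level regex: match a run of kanji
# (CJK Unified Ideographs, U+4E00 through U+9FFF).
kanji = re.findall(r"[\u4e00-\u9fff]+", text)

# Byte-level regex: "." matches one byte, which here is only the first
# third of the three-byte UTF-8 encoding of the first kanji.
first_byte_after_space = re.search(rb" (.)", raw).group(1)

print(byte_count, char_count, kanji, first_byte_after_space)
```

In the byte-level case, a single-character pattern captures only the lead byte of a three-byte sequence, which is roughly the kind of mismatch a UTF-8-aware regular expression engine has to avoid.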
Kernighan, now 80, casually mentions in the interview that he has also put together something “quick and dirty” to allow AWK to process CSV files.