Saturday, May 31, 2008

Regular Expressions

In a recent job interview (ahem), I was asked what I knew about regular expressions. I had to say, honestly, that I had probably heard the term before but that I didn't know what they were. So post-interview, I did some digging in my Unix books and on the web to discover what was meant by the term. This turned out to be more difficult than I thought, because what I was reading was not making much sense to me, and the grep command in the Unix shell seemed more mysterious than ever. Finally, I came across a 10+-year-old edition of Mastering Regular Expressions in netLibrary, and several things clicked.

Having been a professional librarian for nearly two years, a graduate student for the previous two, and a paraprofessional who did reference for four years, I've been asked to do computer-based searching on a daily basis for over six years. Fortunately, thanks to algorithmic search engines (like Google), this task has become very simple. But I'm spoiled. I came to libraries after the miserable days of search engines that only matched text strings. In those medieval days, there was no forgiveness for spelling errors or misplaced spaces, no helpful "Did you mean . . . ?" features or "related searches" that got thrown up for your convenience if you typed "freinds" instead of "friends." When a library patron comes in and asks for a book with "snow flower" in the title, a useful shortcut (given the fact that our library catalog is still quite unforgiving) is to search those words in Amazon.com or Google, which nearly always works.

"Traditional" computer-based library catalogs rely on "wildcard" characters. If you're unsure of the spelling of "friends" or "weird," you can substitute a nonalphabetic character for one or more letters in the text string. Hence, "friends" can be rendered "fr??nds" or "fr*nds," and the computer will find:
  • for "fr??nds," all seven-character text strings that begin with "fr" and end with "nds"
  • for "fr*nds," all text strings (of any length) that begin with "fr" and end with "nds"
In this case, you use an "expression" of alpha and non-alpha characters to search for actual text strings inside a group of files.
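
For the curious, here is a rough sketch of those two wildcard searches in Python. It assumes that Python's fnmatch module, which follows Unix shell wildcard rules, behaves the way a catalog's wildcards do; a real catalog may use different symbols or rules. The word list and the wildcard_search helper are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical list of "titles" to search; not real catalog data.
words = ["friends", "freinds", "fronds", "frnds", "friendship", "fiends"]

def wildcard_search(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern.

    '?' stands for exactly one character, '*' for any run of characters
    (including none at all).
    """
    return [w for w in candidates if fnmatchcase(w, pattern)]

# Seven-character strings that begin with "fr" and end with "nds".
two_unknown = wildcard_search("fr??nds", words)

# Strings of any length that begin with "fr" and end with "nds".
any_length = wildcard_search("fr*nds", words)

print(two_unknown)
print(any_length)
```

Note that "fr??nds" catches the misspelled "freinds" along with "friends," which is exactly why the trick is handy when you are unsure of a spelling.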

In regular expressions proper, this concept is taken to extremes, and you are required to know many different symbols for extremely precise searches. I am just learning these, so I don't yet have them down, but I know that in many tools (Perl, sed, and awk, for example) you place your expression between forward slashes like this:

/expression/

What goes between the slashes would be a group of symbols like ^[0-9].n.*g$ that would let you search for something as specific as "all of the records that start with a number, have 'n' as the third character, and end with a 'g.'" Here ^ marks the start of the line, [0-9] matches a single digit, each . matches any one character, .* matches any run of characters, and $ marks the end of the line. This is a very powerful way to do targeted searching for, say, thousands of lines of open source library information system computer code, or multiple databases of patron and MARC records. For instance. :-)
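
To make that kind of search concrete, here is a minimal sketch using Python's built-in re module. Python takes the pattern as a plain string, so there are no surrounding slashes. The pattern is my own attempt at "starts with a number, has 'n' as the third character, and ends with a g," and the records are invented for illustration, not real patron or MARC data.

```python
import re

# ^      start of the record
# [0-9]  one digit
# .      any single character (the second character)
# n      a literal "n" as the third character
# .*     any run of characters
# g$     a literal "g" at the very end
pattern = re.compile(r"^[0-9].n.*g$")

# Hypothetical records, one per string.
records = ["12n-catalog", "4xnotching", "7band saw rig", "a1nothing", "99n"]

# Keep only the records the pattern matches.
matches = [r for r in records if pattern.search(r)]

print(matches)
```

The last three records fail for different reasons: one has "a" rather than "n" in the third position, one starts with a letter, and one does not end in "g."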

Of course for all of our other searching needs, we'll stick to Google!
