Letters to Mom: Regular Expressions

OK, we're going to take a trip to Geek World today and talk about Regular Expressions.

One of the things that computers are really good at is finding things, right? You type in what you want to find and the computer goes off and finds "matches". The trick, of course, is being able to clearly tell the computer what you want to find. Most people have experience searching for things on the web, where you simply throw as many terms at the search engine as you can to try to narrow down the results list.

Let's say you are searching for a dentist in your area who does laser teeth whitening or something like that. You could search for "Dentist" but the list of results would be rather large and mostly irrelevant to your area. You could refine the search by typing "Dentist laser" and that might narrow it down some. You could type "Dentist Laser Uxbridge teeth" and that would be very precise but that might be overly restrictive because you might not get Dentists in the next town over. So clearly, you have to find just the right combination of specificity and generality to get the results you want.

But I'm not going to talk about web search engines. I really want to talk about searching for text in a file - say looking for the word "shrimp" in a Word document. Obviously, that's pretty easy to do. You click Edit, Find..., type in "shrimp", hit OK, and Word shows you all the occurrences of shrimp in the document. OK, but what if you wanted to find both shrimp and shrimps? Or what if you only wanted to find shrimp if it was the first word in a sentence? What if you wanted to find the word shrimp only if it was the first word in a sentence or if it was followed by the word Gumbo and make sure the 's' in shrimp was capitalized? That's probably easy enough to do by hand but what if you had a file with 500,000 lines of purchase descriptions from the Bubba Gump Shrimp Factory that you had to process like that?

You need a way to precisely define a pattern that will match this relatively obscure combination of letters and possibly perform an operation on some of those characters automatically. You need a Regular Expression.

A Regular Expression - or RegEx - is a pattern matching language that is used extensively in Geekdom. It's a truly powerful and complex system that lets you perform amazing feats of text manipulation. Here is a simple Regular Expression that matches an email address. A web developer might use this to verify that someone registering is entering a proper email address:

^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$

Yeah, so that's a bunch of gobbledygook, right? Yup, with great power comes great...difficulty. It's very difficult to express a text pattern that is precise where you need it to be but flexible enough to include the variations you want to allow. An email address is something@something.something. You know the @ sign is in there but the other stuff is pretty much whatever you want, right? Well, no. Email addresses can't have spaces in them so you have to throw those out. The simple pattern above allows letters, digits and underscores before the @ sign but not dots or dashes, even though real email addresses can contain both. There are a few other rules about a proper email address as well. So, specifying exactly what you want to match can get tricky. It's especially tricky because you have to use characters - the very things you are trying to find - to describe the pattern you are trying to match. The result can be extremely cryptic as you can see above - and that is a simple one!
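For the curious, here is a minimal sketch of how a developer might try that pattern out in Python. The function name and the sample addresses are made up for illustration; the pattern itself is the one shown above.

```python
import re

# The pattern from the post, unchanged.
EMAIL_PATTERN = re.compile(r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$")


def looks_like_email(text):
    """Return True if text matches the simple pattern above."""
    return EMAIL_PATTERN.match(text) is not None


# Hypothetical sample addresses, just to see what passes and what doesn't.
for candidate in ["mom@example.com", "my mom@example.com", "first.last@example.com"]:
    print(candidate, looks_like_email(candidate))
```

The address with a space is rejected, but so is the one with a dot before the @ sign, which shows how a "simple" pattern can be stricter than you meant it to be.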

There is a funny saying that everyone learns when first starting with Regular Expressions. It goes something like this:

"So you have a problem you want to solve with a Regular Expression. Now you have two problems."

Indeed, you could easily spend more time writing and testing a regular expression to make a change than you would have if you had just paged through the entire document and made the change by hand. I know. I've done it.

As I kind of hinted, RegEx's are used to find and optionally replace so here's a simple example. Different text editors allow you to specify the search pattern and the replace pattern in different ways but for now we'll just show things like this:

Search: cat
Replace: dog

The characters 'cat' are a RegEx pattern. Same with dog. They are "literals" and do exactly what you would expect - they match themselves.

If I wanted to find all the occurrences of either 'cat' or 'Cat' and replace it with 'dog', I could do something like this:

Search: (c|C)at
Replace: dog

The '(c|C)' pattern says lowercase 'c' *or* (the | symbol) uppercase 'C', followed by a and t. The parentheses keep the "or" limited to that first letter.
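Here is roughly what that search and replace looks like in Python, as a small sketch. The sample sentence is made up, and the square-bracket version is a common shorthand for "any one of these characters" that does the same job here.

```python
import re

# Hypothetical sample text.
text = "My cat chased the neighbor's Cat."

# (c|C) means "lowercase c or uppercase C".
with_group = re.sub(r"(c|C)at", "dog", text)

# [cC] means "one character that is either c or C"; no pipe is needed inside brackets.
with_class = re.sub(r"[cC]at", "dog", text)

print(with_group)
print(with_class)
```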

So, there are special symbols used in the expressions to add flexibility. Of course, this use of special symbols presents a problem. The vertical bar symbol - called a pipe symbol - is a special symbol that means "or" - this *or* that. But what if you really wanted to find a pipe symbol in your text? You need a way to tell the pattern to not treat it as meaning "or" but to actually match itself. To do that, you "escape" the symbol by putting a '\' in front of it like this: \|. So if we wanted to find something like 'car|truck' and replace it with 'car and truck', we would say:

Search: car\|truck
Replace: car and truck

Yeah, I hear you. What if you want to find the '\' symbol? Well, you escape that with a '\' so it looks like this: '\\'.
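A short Python sketch of escaping, assuming a made-up sample sentence that contains both a pipe and a backslash. The re.escape helper at the end is a standard Python function that adds the backslashes for you.

```python
import re

# Hypothetical sample text with one pipe and one backslash in it.
text = r"Choose car|truck and save it under C:\rides"

# \| matches a real pipe character instead of meaning "or".
fixed = re.sub(r"car\|truck", "car and truck", text)

# \\ matches a single real backslash.
backslash_count = len(re.findall(r"\\", text))

# re.escape builds the escaped pattern from plain text.
escaped = re.escape("car|truck")

print(fixed)
print(backslash_count)
print(escaped)
```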

Now back to our example of replacing cat or Cat with dog. We wrote the expression to do that but it might have a problem. What if the word 'catalog' was in our document? The RegEx would convert that to 'dogalog', which is probably not something we want. It's a perfect example of how RegEx's can do things you never intended if you aren't properly precise. So actually, what we wanted to do instead of replacing 'cat' with 'dog' is replace ' cat ' when it is surrounded by spaces with ' dog ' surrounded by spaces. That would limit it to the *word* cat and we wouldn't accidentally match other words that happen to have c-a-t in them.

Except, what if cat is at the end of a sentence and is followed by a period not a space? We want to include that right? So we really want to find a space, c-a-t, and either a space or a period. Or a comma. Or a semi-colon. Maybe it's c-a-t followed by anything *but* another letter. That's an example of how a RegEx might *not* do things you intended because your pattern isn't properly lenient.
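One common shortcut for this, which the discussion above doesn't use, is the \b "word boundary" marker. It matches the position between a word character and anything else, so it handles spaces, periods, commas and the ends of the line all at once. A minimal Python sketch with a made-up sentence:

```python
import re

# Hypothetical sample text.
text = "The cat sat. Cat, catalog; cat."

# Without boundaries, "catalog" gets mangled too.
naive = re.sub(r"(c|C)at", "dog", text)

# \b on both sides limits the match to the whole word.
careful = re.sub(r"\b(c|C)at\b", "dog", text)

print(naive)
print(careful)
```

One caveat: \b counts digits and underscores as word characters too, so "cat9" would be left alone just like "catalog".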

Developing a pattern that does match what you do want and doesn't match what you don't is a real challenge. Whole books have been written on developing Regular Expressions and the web is full of tutorial sites for it if you are interested. What prompted this post was a problem I had the other day. I had a file that consisted of 5000 lines of information that I needed to import into a database. It was arranged something like this:

[barcode],[Date] [Author] [Notes]

[barcode],[Date] [Author] [Notes]

or

ISG000002421, 10-23-2009 10:23:34 Jsmith This is the description, of product A...

ISG000002456, 10-24-2009 9:10:04 JDoe More notes that contain text...
and so on...

Pretty straightforward, but there were some problems. First, I needed to separate the barcode section from the rest of the text with a pipe '|' instead of a comma. Easy enough, it would seem: just find all the commas and replace them with a |. Not so fast though, bucko. There might be commas in the notes section that I don't want to replace, so I had to find only the commas that came after the barcode value. Hmm, there might be barcodes mentioned in the notes section though, so I really want to find only the commas that come after the barcodes that start at the beginning of the line. Here's the pattern that matches that:

Search: ^ISG\d+,

The caret - ^ - matches the beginning of the line, the ISG matches itself, the \d matches a digit (zero through nine), the + means one or more of the preceding item (the digit), and then come the comma and a space that you can't see. OK, so that's what I want to find but what do I want to replace it with? I can't just replace it with a | because I'd be replacing the whole pattern - the barcode and the comma - with just the pipe character. I'd lose the barcode. Since I'm matching different barcode patterns - the \d+ that matches sequences of digits - I don't really know what to replace it with. Fortunately, RegEx's will remember the text that they match and let you put those matches back during the replace. Here's how:
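To see that search pattern at work, here is a small Python sketch. The first sample line is from the file layout above; the second line's note, which mentions another barcode, is invented to show that only barcodes at the start of a line are picked up. In Python the caret only matches at every line start when the MULTILINE flag is set.

```python
import re

# Two sample lines; the barcode inside the second note is hypothetical.
lines = """ISG000002421, 10-23-2009 10:23:34 Jsmith This is the description, of product A...
ISG000002456, 10-24-2009 9:10:04 JDoe See also ISG000002421, same vendor..."""

# re.MULTILINE makes ^ match at the start of every line, not just the first.
matches = re.findall(r"^ISG\d+, ", lines, flags=re.MULTILINE)

print(matches)
```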

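Search: ^(ISG\d+),
Replace: \1\|

I've added parentheses around the barcode pattern. This says "remember whatever matches this, I'm going to use it later". Then I use \1 in the replacement pattern, which stands for whatever text was matched inside the parentheses in the search, followed by the pipe character (which I escape because it normally means "or" in a regular expression and I want it to actually mean | here; many tools treat a pipe on the replace side as plain text and don't need the escape). The replacement has no caret: the ^ only marks a position in the search, and in most tools it would be inserted as a literal character if it appeared in the replacement.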
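The same remember-and-put-back idea can be sketched in Python, where \1 works the same way in the replacement. This is an illustration, not the exact editor command from above: Python needs no escape for the pipe on the replace side, and this version leaves the space after the comma in place so it ends up after the pipe.

```python
import re

line = "ISG000002421, 10-23-2009 10:23:34 Jsmith This is the description, of product A..."

# (ISG\d+) remembers the barcode; \1 puts it back, followed by a plain pipe.
converted = re.sub(r"^(ISG\d+),", r"\1|", line, flags=re.MULTILINE)

print(converted)
```

Notice that the comma inside the notes is untouched, because only the comma right after a barcode at the start of the line matches.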

So that should fix that, but there were a few other problems. The notes section sometimes had quotes around it and I didn't want that. I didn't want to remove any quotes that were inside the notes, just the ones that might be around the whole notes section. That was pretty easy too:

Search: ^(ISG\d+\| )"(.+)"$
Replace: \1\2

This says "find everything up to the first quote character and remember it in pattern 1. Then match everything up to the quote just before the end of the line and remember it in pattern 2." Replace all that with what is in pattern 1 and pattern 2. Since the quotes were not in the patterns that were remembered, they get dropped.
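Here is that quote-stripping step as a Python sketch. Both sample lines are invented for illustration: the first is wrapped in quotes and also has quotes inside the note, the second has no wrapping quotes at all.

```python
import re

# Hypothetical sample lines, already converted to use a pipe after the barcode.
lines = [
    'ISG000002421| "10-23-2009 10:23:34 Jsmith Customer asked for "rush" delivery"',
    "ISG000002456| 10-24-2009 9:10:04 JDoe More notes that contain text...",
]

# Group 1 is the barcode part, group 2 is everything between the outer quotes.
unquoted = [re.sub(r'^(ISG\d+\| )"(.+)"$', r"\1\2", line) for line in lines]

for line in unquoted:
    print(line)
```

The quotes around "rush" survive because .+ grabs as much as it can, so only the very first and very last quotes fall outside the remembered groups. The second line doesn't match the pattern and passes through unchanged.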

The last problem was tougher. It turned out that some of the notes actually contained multiple entries. So the notes section might look like this for some lines:

ISGxxx | 10-24-2009 10:2:23 This is note 1 11-2-2009 9:3:13 This is note 2...

I really needed to separate those multiple notes into separate lines for the same barcode. I won't go into the pattern that I built that let me do that. Suffice it to say that it was rather complex. It probably took me about 30 minutes to figure out the patterns that I needed but it probably saved me about 3 hours of tedious, error-prone hand-editing of that 5000-line file. Fun stuff.
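For anyone wondering what such a split could look like, here is one possible approach sketched in Python. This is not the pattern I actually built; it is a guess that assumes every note starts with a date and time stamp shaped like the ones above, and the function name and barcode are made up.

```python
import re

# Assumed shape of a date-time stamp such as "11-2-2009 9:3:13".
STAMP = r"\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{1,2}:\d{1,2}"


def split_notes(line):
    """Return one line per time-stamped note, each prefixed with the barcode."""
    barcode, rest = line.split(" | ", 1)
    # Split at whitespace that is immediately followed by a stamp.
    # The (?=...) part looks ahead without consuming the stamp itself.
    notes = re.split(r"\s+(?=" + STAMP + r")", rest)
    return [barcode + " | " + note for note in notes]


# Hypothetical barcode; the notes follow the layout shown above.
sample = "ISG000002456 | 10-24-2009 10:2:23 This is note 1 11-2-2009 9:3:13 This is note 2"
for row in split_notes(sample):
    print(row)
```

A note that happens to quote a date and time in its text would be split in the wrong place, which is exactly the kind of thing that makes the real pattern complex.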

1 comment:

Catholic Mom said...: Okay. I'm not sure what twisted thing in my mind prompted me to continue reading this to the end of the post. I'm thoroughly impressed that you can figure that gobbly-gook out, and completely sure there is a reason I do not do anything remotely like that for a living! So glad you solved your puzzle!; 3:15 PM

Saturday, March 06, 2010
