Saturday, March 27, 2010

Back from the Brink

Dontcha just love computers?

It's weird how the simplest of changes can cascade into a rolling thunder of destruction and mayhem. Okay, nothing actually caught fire, but this has been a rough few days for the Hudgins' Mac.

Here's the story. The newest version of the Mac operating system - 10.6 (named Snow Leopard) - was released last August or September. I didn't upgrade to it because 1) I'm cheap and 2) I didn't see a compelling need to do so. Snow Leopard was slated as a release with a lot of fixes under the hood and without a lot of fancy features. It was, however, supposed to speed things up. Any hint of an idea where this is going???

So, a couple of weeks ago, I got the urge to start work on another iPhone application. Actually, it will eventually be an iPad application - for the new tablet computer that Apple is releasing. It's basically a giant iPhone-looking device but it has some new features and I thought it would be fun to write a program for it. Well, I went to download the software development kit for it and found out that I needed to upgrade to Snow Leopard in order to use it. What the heck, it's about time anyway right?

I ordered it from Amazon and it came in a few days. I went through the install process and got it all up and running and was amazed - at how slow everything was. Programs were taking forever to launch and then ran at a snail's pace once they were up. Oh crap! This is not good. I asked Mr. Google what the heck was up with this and was greeted by page after page of people having the same problem - terrible slowdowns after upgrading. Hmm. I guess I should have done that search *before* I did the upgrade...

I limped along this way for a few days and ran into the next problem. I went to start Parallels - the program that lets me run Windows programs (Quicken) on the Mac. When it tried to start, it politely told me that the current version wouldn't run on Snow Leopard and I would have to upgrade it (Parallels) to the latest version. Argh! Okay, I bought that online and downloaded it and installed it. That program is a pretty big drain on the Mac anyway and now, with the machine limping along as it was, it was essentially unusable. Something had to be done.

I should point out that I had chosen to do an upgrade to Snow Leopard as opposed to a clean re-install. There are two schools of thought on this. Some people think that you should just wipe your disk clean and start fresh while others maintain that the upgrade process is good enough that you don't need to "clean house" beforehand. Of course, if you do the clean install, you've got to have good backups of all your stuff because a clean install means deleting everything on the disk. I actually have two methods of backing up the computer - I use Time Machine, which takes hourly backups of the system and stores them on a different disk, and I also have Mozy, which backs up over the network to some big disk in the sky somewhere.

Despite all those precautions, there is always a fear that you've missed some critical file in your backup plan or something won't come back when you try to restore. Not to mention that a fresh install also implies that you reinstall all your applications (as opposed to recovering them from the backup) so you need to have all your registration codes for all those programs you bought. That's why I decided to do the upgrade instead of the clean install.

After trying every manner of voodoo to get my system even workable, I decided that I needed to wipe it out and start clean. I went through all my files and made sure I had backups of everything as best I could and then fired up the Disk Utility to wipe things out. I swear I stared at the "Do It!" button for about 5 minutes before casting my bits into oblivion.

Once the installation was complete, I cautiously fired up the browser to see if things were back to normal or hopefully, even faster. When programs launch on the Mac, they "bounce" in the Dock at the bottom of the screen as they load. Before the reinstall, my web browser was taking about 25 bounces before a window would appear and even then, it wasn't fully loaded. After the re-install, it launches in about 2 bounces. Oh yeah!

I've spent the better part of today getting things re-installed and working again. I've re-downloaded all kinds of programs, pulled activation keys out of old emails, and basically tweaked things back to the way they were before the big reset. The whole license key thing is fraught with opportunities for disaster. I thought I had one today when I tried to reinstall Parallels. When I bought the upgrade earlier, I just installed it over top of the old version and it went fine - aside from being slow.

When I reinstalled it today, it asked for my Activation key which I got when I bought the upgrade, but then it asked for the Activation key for the original version that I upgraded over. Oh cool! I think I bought that about two years ago. I have no idea what that code is! I looked through old emails, looked at the manual, found the original install disk - no Activation key. I called support and asked them if they had any record of my Activation key. Nope, they don't have it and can't generate one for me. Great! The guy on the phone did say that the key should be on the original CD. I double-checked the CD again, this time looking at the back of the CD jacket, and voilà! There was the key. I entered that in and was able to complete the installation. Wow - that was close!

There are probably other problems still lurking - like programs that I rarely use that I didn't re-install and will go to use one day, or files that I haven't restored from backup. We'll see.

It has been a long and arduous process but I'm so glad to have my speedy computer back that I think it was worth it. I think...

Saturday, March 06, 2010

Regular Expressions

OK, we're going to take a trip to Geek World today and talk about Regular Expressions.

One of the things that computers are really good at is finding things right? You type in what you want to find and the computer goes off and finds "matches". The trick, of course, is being able to clearly tell the computer what you want to find. Most people have experience searching for things on the web where you simply throw as many terms at the search engine as you can to try and narrow down the results list.

Let's say you are searching for a dentist in your area who does laser teeth whitening or something like that. You could search for "Dentist" but the list of results would be rather large and mostly irrelevant to your area. You could refine the search by typing "Dentist laser" and that might narrow it down some. You could type "Dentist Laser Uxbridge teeth" and that would be very precise but that might be overly restrictive because you might not get Dentists in the next town over. So clearly, you have to find just the right combination of specificity and generality to get the results you want.

But I'm not going to talk about web search engines. I really want to talk about searching for text in a file - say looking for the word "shrimp" in a Word document. Obviously, that's pretty easy to do. You click Edit, Find..., type in "shrimp", hit OK, and Word shows you all the occurrences of shrimp in the document. OK, but what if you wanted to find both shrimp and shrimps? Or what if you only wanted to find shrimp if it was the first word in a sentence? What if you wanted to find the word shrimp only if it was the first word in a sentence or if it was followed by the word Gumbo and make sure the 's' in shrimp was capitalized? That's probably easy enough to do by hand but what if you had a file with 500,000 lines of purchase descriptions from the Bubba Gump Shrimp Factory that you had to process like that?

You need a way to precisely define a pattern that will match this relatively obscure combination of letters and possibly perform an operation on some of those characters automatically. You need a Regular Expression.

A Regular Expression - or RegEx - is a pattern matching language that is used extensively in Geekdom. It's a truly powerful and complex system that lets you perform amazing feats of text manipulation. Here is a simple Regular Expression that matches an email address. A web developer might use this to verify that someone registering is entering a proper email address:

^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$

Yeah, so that's a bunch of gobbledygook, right? Yup, with great power comes great...difficulty. It's very difficult to express a text pattern that is precise where you need it to be but flexible enough to include variations that you want to be flexible about. An email address is something@something.something. You know the @ sign is in there but the other stuff is pretty much whatever you want, right? Well, no. Email addresses can't have spaces in them so you have to throw those out. Real addresses can use dots, underscores, and dashes, but this simple pattern only allows letters, digits, and underscores before the @. There are a few other rules about a proper email address as well. So, specifying exactly what you want to match can get tricky. It's especially tricky because you have to use characters - the very things you are trying to find - to specify the pattern you are trying to match. The result can be extremely cryptic as you can see above - and that is a simple one!
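To see what that pattern actually accepts, here is a small Python sketch (Python is my choice here, not something the pattern requires, and the sample addresses and the `looks_like_email` name are made up for illustration). It suggests the pattern is stricter than real mail systems are, which is part of what makes email matching hard.

```python
import re

# The email pattern from above, compiled once for reuse.
EMAIL = re.compile(r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$")

def looks_like_email(text):
    """Return True if text matches the simple email pattern."""
    return EMAIL.match(text) is not None

# Made-up sample addresses to try against the pattern.
samples = [
    "bubba@gump.com",           # plain address: accepted
    "bubba gump@shrimp.com",    # contains a space: rejected
    "bubba@gump",               # no .something at the end: rejected
    "forrest.gump@shrimp.com",  # dot before the @: rejected by this pattern
]

for address in samples:
    print(address, looks_like_email(address))
```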

There is a funny saying that everyone learns when first starting with Regular Expressions. It goes something like this:

"So you have a problem you want to solve with a Regular Expression. Now you have two problems."


Indeed, you could easily spend more time writing and testing a regular expression to make a change than you would have if you had just paged through the entire document and made the change by hand. I know. I've done it.


As I kind of hinted, RegEx's are used to find and optionally replace so here's a simple example. Different text editors allow you to specify the search pattern and the replace pattern in different ways but for now we'll just show things like this:


Search: cat
Replace: dog


The characters 'cat' are a RegEx pattern. Same with dog. They are "literals" and do exactly what you would expect - they match themselves. 


If I wanted to find all the occurrences of either 'cat' or 'Cat' and replace it with 'dog', I could do something like this:


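Search: (c|C)at
Replace: dog


The '(c|C)' pattern says lowercase 'c' *or* (the | symbol) uppercase 'C', followed by a and t. (Square brackets would also work, written as '[cC]at' with no pipe, because inside brackets the pipe is just another character to match.)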
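Here is roughly how that search and replace might look in Python, as a minimal sketch. The sample sentence is made up, and the alternation is written with parentheses.

```python
import re

# Made-up sample text containing both spellings.
text = "My cat chased the neighbor's Cat."

# (c|C) means 'c' or 'C'; inside [...] a pipe would be a literal character.
result = re.sub(r"(c|C)at", "dog", text)

print(result)  # My dog chased the neighbor's dog.
```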

So, there are special symbols used in the expressions to add flexibility. Of course, this use of special symbols presents a problem. The vertical bar symbol - called a pipe symbol - is a special symbol that means "or" - this *or* that. But what if you really wanted to find a pipe symbol in your text? You need a way to tell the pattern to not treat it as meaning "or" but to actually match itself. To do that, you "escape" the symbol by putting a '\' in front of it like this: \|. So if we wanted to find something like 'car|truck' and replace it with 'car and truck', we would say:


Search: car\|truck
Replace: car and truck


Yeah, I hear you. What if you want to find the '\' symbol? Well, you escape that with a '\' so it looks like this: '\\'.
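A quick Python sketch of both escapes, using made-up sample strings. Python's raw strings (the r"..." form) keep the backslashes from being interpreted before the regex engine sees them.

```python
import re

# Escaped pipe: matches the literal text 'car|truck'.
escaped = re.sub(r"car\|truck", "car and truck", "Pick one: car|truck")

# Unescaped pipe: means 'car' OR 'truck', so each word is replaced separately.
unescaped = re.sub(r"car|truck", "X", "car|truck")

# Escaped backslash: finds each literal '\' in a made-up Windows-style path.
backslashes = re.findall(r"\\", r"C:\temp\file")

print(escaped)           # Pick one: car and truck
print(unescaped)         # X|X
print(len(backslashes))  # 2
```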


Now back to our example of replacing cat or Cat with dog. We wrote the expression to do that but it might have a problem. What if the word 'catalog' was in our document? The RegEx would convert that to dogalog, which is probably not something we want. It's a perfect example of how RegEx's can do things you never intended if you aren't properly precise. So actually, what we wanted to do instead of replacing 'cat' with 'dog' is replace ' cat ' when it is surrounded by spaces with ' dog ' surrounded by spaces. That would limit it to the *word* cat and we wouldn't accidentally match other words that happen to have c-a-t in them.

Except, what if cat is at the end of a sentence and is followed by a period not a space? We want to include that right? So we really want to find a space, c-a-t, and either a space or a period. Or a comma. Or a semi-colon. Maybe it's c-a-t followed by anything *but* another letter. That's an example of how a RegEx might *not* do things you intended because your pattern isn't properly lenient. 
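One common way to say "c-a-t with no letter on either side" is the word-boundary marker \b, which is supported by Python and many other regex flavors, though not necessarily by every text editor. Here is a minimal Python sketch with a made-up sentence:

```python
import re

# Made-up sample text with 'cat' as a word and inside another word.
text = "The cat sat on the catalog. Cat, meet cat."

# Without boundaries, 'catalog' gets mangled too.
naive = re.sub(r"[cC]at", "dog", text)

# \b matches the edge of a word, so only the standalone word is replaced,
# whether it is followed by a space, a comma, or a period.
careful = re.sub(r"\b[cC]at\b", "dog", text)

print(naive)    # The dog sat on the dogalog. dog, meet dog.
print(careful)  # The dog sat on the catalog. dog, meet dog.
```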

Developing a pattern that does match what you do want and doesn't match what you don't is a real challenge. Whole books have been written on developing Regular Expressions and the web is full of tutorial sites for it if you are interested. What prompted this post was a problem I had the other day. I had a file that consisted of 5000 lines of information that I needed to import into a database. It was arranged something like this:

[barcode],[Date] [Author] [Notes]

[barcode],[Date] [Author] [Notes]

or

ISG000002421, 10-23-2009 10:23:34 Jsmith This is the description, of product A...

ISG000002456, 10-24-2009 9:10:04 JDoe More notes that contain text...
and so on...


Pretty straightforward, but there were some problems. First, I needed to separate the barcode section from the rest of the text with a pipe '|' instead of a comma. Easy enough, it would seem: just find all the commas and replace them with a |. Not so fast though, bucko. There might be commas in the notes section that I don't want to replace, so I had to find only the commas that came after the barcode value. Hmm, there might be barcodes mentioned in the notes section though, so I really want to find only the commas that come after the barcodes that start at the beginning of the line. Here's the pattern that matches that:

Search: ^ISG\d+,

The caret - ^ - matches the beginning of the line, the ISG matches itself, the \d matches a digit (zero through 9) and the + means one or more of the preceding items (the digit), and then the comma and a space that you can't see. OK, so that's what I want to find but what do I want to replace it with? I can't just replace it with a | because I'd be replacing the whole pattern - the barcode and the comma - with just the pipe character. I'd lose the barcode. Since I'm matching different barcode patterns - the \d+ that matches sequences of digits - I don't really know what to replace it with. Fortunately, RegEx's will remember the patterns that they match and let you put those matches back during the replace. Here's how:

Search: ^(ISG\d+),
Replace: \1|

I've added parentheses around the barcode pattern. This says "remember this pattern, I'm going to use it later". Then I use \1 in the replacement pattern, which corresponds to whatever text was matched in the parentheses in the search, followed by the pipe character. The caret belongs only in the search pattern; in a replacement it would be inserted as a literal ^. The pipe only means "or" on the search side, so most tools take it as-is in the replacement; if your editor treats it specially there, escape it as \|.
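Here is a sketch of the same remember-and-replace idea in Python, run against two made-up lines in the file's format. Two things here are my assumptions: the trailing space is left out of the search pattern so that it survives after the pipe, and re.MULTILINE is used so the caret matches at the start of every line rather than only the start of the whole text.

```python
import re

# Made-up sample lines; the second mentions a barcode inside its notes.
lines = (
    "ISG000002421, 10-23-2009 10:23:34 Jsmith This is the description, of product A\n"
    "ISG000002456, 10-24-2009 9:10:04 JDoe See also ISG000002421, which is similar"
)

# (ISG\d+) remembers the barcode as group 1; \1 puts it back before the pipe.
fixed = re.sub(r"^(ISG\d+),", r"\1|", lines, flags=re.MULTILINE)

print(fixed)
```

Only the comma right after the leading barcode changes; the commas in the notes, including the one after the barcode mentioned mid-line, are left alone.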

So that should fix that but there were a few other problems. The notes section sometimes had quotes around them and I didn't want that. I didn't want to remove any quotes that were inside the notes, just the ones that might be around the whole notes section. That was pretty easy too:

Search: ^(ISG\d+\| [^"]*)"(.+)"$
Replace: \1\2

This says "find everything up to the first quote character and remember it in pattern 1. Then match everything up to the quote before the end of the line and remember it in pattern 2. Replace all that with what is in pattern 1 and pattern 2." The [^"]* means "any run of characters that aren't a quote", which lets pattern 1 stretch across the date and author. Since the quotes were not in the patterns that were remembered, they get dropped.
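A Python sketch of the quote-stripping step, on made-up lines. I'm assuming the date and author sit between the pipe and the opening quote, so the first group here uses [^"]* to cover everything up to the first quote. The `strip_outer_quotes` name is made up for illustration.

```python
import re

# Group 1: barcode, pipe, and everything before the first quote.
# Group 2: the notes, without the surrounding quotes.
OUTER_QUOTES = re.compile(r'^(ISG\d+\| [^"]*)"(.+)"$')

def strip_outer_quotes(line):
    """Drop quotes wrapped around the whole notes section, keep inner ones."""
    return OUTER_QUOTES.sub(r"\1\2", line)

quoted = 'ISG000002456| 10-24-2009 9:10:04 JDoe "He said "fresh" shrimp only"'
plain = "ISG000002421| 10-23-2009 10:23:34 Jsmith No quotes here"

print(strip_outer_quotes(quoted))
print(strip_outer_quotes(plain))
```

Because (.+) is greedy, it runs to the end of the line and then backs up to the last quote, so quotes inside the notes stay put. Lines without surrounding quotes don't match and pass through unchanged.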

The last problem was tougher. It turned out that some of the notes actually contained multiple entries. So the notes section might look like this for some lines:

ISGxxx | 10-24-2009 10:2:23 This is note 1 11-2-2009 9:3:13 This is note 2...

I really needed to separate those multiple notes into separate lines for the same barcode. I won't go into the pattern that I built that let me do that. Suffice it to say that it was rather complex. It probably took me about 30 minutes to figure out the patterns that I needed but it probably saved me about 3 hours of tedious, error-prone hand-editing of that 5000-line file. Fun stuff.
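For the curious, here is one way the splitting could be approached in Python. This is not the pattern I actually built, just a rough sketch: it assumes every note starts with a date and time shaped like 10-24-2009 10:2:23, and the `split_notes` name and the sample line are made up.

```python
import re

# Assumed timestamp shape: M-D-YYYY H:M:S, with one or two digits per part.
STAMP = r"\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{1,2}:\d{1,2}"

# A note is a timestamp plus the text up to the next timestamp or line end.
NOTE = re.compile(rf"({STAMP}) (.*?)(?= {STAMP} |$)")

def split_notes(line):
    """Return one 'barcode | timestamp text' line per note found."""
    barcode, _, rest = line.partition("|")
    barcode = barcode.strip()
    return [f"{barcode} | {stamp} {text}"
            for stamp, text in NOTE.findall(rest.strip())]

sample = "ISG000002456 | 10-24-2009 10:2:23 This is note 1 11-2-2009 9:3:13 This is note 2"
for entry in split_notes(sample):
    print(entry)
```

The (?= ... ) part is a lookahead: it checks that another timestamp (or the end of the line) comes next without consuming it, so the following note still starts with its own timestamp.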