Monday, April 04, 2011

Big Data

It's the new buzzword. Big Data refers to the vast quantities of information being generated and collected by companies, web sites, governments, whatever. More specifically, it refers to what those entities might want to *do* with all that data.

Working at EMC, we've been dealing with large amounts of data for a long time. Our products (in general) are gigantic disk drives that store and protect "Mission Critical" data for companies large and small. Our Marketing people are quick to point out mind-boggling statistics - the amount of information produced last year is larger than all of the information produced in all the previous years combined - things like that.

Information production is only getting faster and faster too. It's one thing to have to store all that information but increasingly, companies want to be able to *mine* that information. As the amount of information grows, data analysts can apply statistical methods to look for patterns in the data and 1) determine behaviors and 2) predict activity.

So, analysts can look at sales figures and see that people are buying more of one product than another and adjust inventory levels or do other things to make sure their business is positioned correctly. They can also look at the data, combine it with other data and create models that let them predict what people are going to do when this or that changes.
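To make that concrete, here's a minimal sketch in Python of fitting a trend line to sales figures. The weekly numbers are made up, and real systems use far fancier models, but the idea of projecting demand from past data is the same:

    # Toy example: fit a straight line to weekly unit sales and
    # project next week's demand (the numbers are hypothetical).
    weeks = [1, 2, 3, 4, 5, 6]
    sales = [120, 135, 128, 150, 162, 170]  # units sold each week

    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(sales) / n

    # Ordinary least-squares slope and intercept.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, sales))
             / sum((x - mean_x) ** 2 for x in weeks))
    intercept = mean_y - slope * mean_x

    forecast = intercept + slope * 7  # project week 7
    print("Projected demand for week 7: about %.0f units" % forecast)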

Now, analysts have been doing this kind of thing for a long time - it's not really new. What is new is the amount of data being processed and the need to process it very, very quickly - real-time analytics.

Real-time analytics means looking at the data as it comes in and analyzing it right then. In the past, the analysis had to be performed on small subsets of the data in the "Data Warehouse". The analysis systems were not big enough or powerful enough to plow through all the data, so they had to take a sample and hope that it held enough information to provide meaningful insight. Plus, it took hours and hours to run those models and get an answer.
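A rough way to picture the difference: a batch system re-scans stored history after the fact, while a streaming system keeps running summaries that it updates the instant each record arrives. Something like this sketch, where the sales events are invented:

    # Streaming-style analytics: update summary statistics as each
    # record arrives, instead of re-scanning the warehouse afterward.
    class RunningStats:
        def __init__(self):
            self.count = 0
            self.total = 0.0

        def update(self, value):
            self.count += 1
            self.total += value

        def mean(self):
            return self.total / self.count if self.count else 0.0

    stats = RunningStats()
    for amount in [19.99, 42.50, 7.25]:  # pretend these are live register events
        stats.update(amount)
        print("running average sale: %.2f" % stats.mean())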

Obviously, if you don't get a big enough sample, you can arrive at inaccurate conclusions. For example, take a look at stock market values over any period of time. Depending on which week you happen to pick, you might conclude that the market is going up, down, or staying the same, but that might not represent the larger trend. To get a "better" picture, you really need to look at more data - data that covers a longer period of time. In general, some patterns don't emerge until you get a sufficiently broad look at the data.

Thus the dilemma: you need large samples of data to analyze, and the bigger the sample, the longer it takes to analyze. But to beat your competition, you need results in seconds, not the hours or days of the traditional systems. You want to be able to look at the cash register as the clerk is scanning items, find what else that customer has bought from you, and offer them an accessory that would beautifully match the dress they just bought today and the shoes they bought last month. Winning!
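Going back to the stock market example for a second, here's a toy illustration of that sampling trap, using made-up daily prices - a one-week window and the full history can disagree about which way things are heading:

    # Made-up prices: a long upward climb, then one rough week at the end.
    prices = [100 + 0.5 * day for day in range(60)]  # steady climb
    prices += [130, 128, 125, 123, 120]              # a bad final week

    def trend(series):
        # Crude trend: did the series end higher than it started?
        return "up" if series[-1] > series[0] else "down"

    print("the last week says the market is going:", trend(prices[-5:]))
    print("the full history says it is going:    ", trend(prices))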

So the term Big Data refers not only to the *amount* of data out there but also to how to process and use that data to gain an edge in business. At work, we've been getting into this more and more. We're no longer interested in just storing the data for our customers; we need to help them mine it and get value out of it.

There are lots of interesting applications. We are currently working with Utility companies to help them figure out how to manage Smart Meter Data. It used to be that the power company would come by every month and read your power meter to figure out how much electricity you used so they could send you a bill. They got a little more advanced by installing meters that they could read from a truck as it drove by your house - no need to get out, find the meter, and write it down.

Enter the age of Smart Meters. These meters will now look at your power consumption and send it back to the Utility company every 15 minutes. So, instead of getting 12 readings a year from each customer, they are now getting about 35,000 readings per year from each customer. That's nearly 3,000 times more information than they had been getting previously.
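The arithmetic, for the record - the exact count depends on the meter, but the order of magnitude is the point:

    readings_per_day  = (60 // 15) * 24         # one reading every 15 minutes = 96
    readings_per_year = readings_per_day * 365  # 35,040
    old_readings      = 12                      # one manual read per month
    print(readings_per_year, readings_per_year // old_readings)  # 35040 2920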

Not only are they looking for ways to manage this info, they are looking for ways to *use* it - beyond just sending you your bill. For example, it's really, really expensive and hard to build a new power plant, but there is a never-ending demand for power. And it's the peak power demand that is killing the power companies. When everyone comes home at night and cooks dinner, or washes clothes on Saturday, they have to have enough capacity to handle the peaks, but that capacity goes unused in the valleys. If they could lower the peaks, they wouldn't have to build more plants.

Enter the Smart Meter. The power company can offer you an incentive and say: if you reduce your consumption from 5:00 pm to 11:00 pm - our peak demand time - we'll give you a rebate (or some other incentive) on your bill. With the Smart Meter, they can tell not only how much power you use but when you use it, and try to adjust your behavior. In some scenarios, they can even tell what kinds of things are using your power and send you a letter that says "We see you have a 1995 Kenmore model C-RAP dishwasher. Newer models use much less energy, so we'll give you a rebate if you replace it."
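Here's a sketch of what that time-of-use analysis might look like, given readings tagged with the hour they were taken. The data layout and the 50% threshold are my own inventions for illustration, not anything a real utility necessarily uses:

    # Each reading: (hour_of_day, kilowatt_hours). Hypothetical customer data.
    readings = [(7, 0.4), (12, 0.3), (18, 1.8), (19, 2.1), (20, 1.6), (23, 0.5)]

    PEAK_START, PEAK_END = 17, 23  # 5:00 pm to 11:00 pm

    peak_kwh  = sum(kwh for hour, kwh in readings if PEAK_START <= hour < PEAK_END)
    total_kwh = sum(kwh for _, kwh in readings)

    # Customers doing most of their consuming at peak are the ones worth incenting.
    if peak_kwh / total_kwh > 0.5:
        print("offer this customer the off-peak rebate")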

They can also look at the grid of meters and get a better picture of their delivery system. With real-time analytics of the information coming in, they can detect, say, voltage variations in a particular neighborhood. They can see that one particular transformer is common to all the affected meters and, using their statistical models, predict that it will fail in two weeks. They can then roll a truck to replace it *before* it blows, avoiding downtime, angry customers, and unfavorable news reports.
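One way to picture that last trick: group the meters reporting abnormal voltage by the transformer that feeds them, and see whether a single piece of equipment explains them all. The meter-to-transformer map and the voltage threshold below are invented for illustration:

    from collections import Counter

    # Hypothetical: which transformer feeds each meter, plus live voltages.
    transformer_for = {"m1": "T7", "m2": "T7", "m3": "T7", "m4": "T2"}
    voltages = {"m1": 108, "m2": 107, "m3": 109, "m4": 120}  # nominal is 120 V

    # Flag meters whose voltage has sagged, then find the common transformer.
    sagging = [m for m, v in voltages.items() if v < 114]
    suspects = Counter(transformer_for[m] for m in sagging)

    transformer, hits = suspects.most_common(1)[0]
    if hits == len(sagging):
        print("all sagging meters share transformer %s -" % transformer,
              "roll a truck before it blows")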

Cool.
