Our Legacy Big Data Problem

A few days ago, I discussed how Big Data, as a technology, has relevance as a means to gain Insight. This is all fine and good, but is it a technology that we need in the Content Management space? Moore’s law seems to be keeping our data in good shape.

Except…

…inside every piece of content is information. It isn’t unstructured, it just isn’t in a structure readily interpreted by machines. That structure is what provides context and that context is the key to extracting insight.

Now extend that out to Petabytes. That is Big Data.

Same Problem, New Decade

This isn’t new. I think that the most consistent problem I’ve had with Content Management deployments over the years is the search engine. Search engines couldn’t always keep up with the rate of ingestion. When they did, the search results were usually imperfect.

That’s how bad things are with extracting information from within our Content. If we enter a search phrase and get the document we wanted, we call that success.

But is it?

Let’s take Immigration. What if I wanted to know…

  • How many immigrants came from Venezuela last year?
  • How many lived in a small town named Macarao?
  • How many entered the country last week?
  • Has there been a surge of applications from any single location?
  • Are there commonalities in the content which might indicate fraud?

This is important. Some answers can be derived from metadata. Others cannot. Some require instant analysis. If there is a surge of fraud, it needs to be detected immediately so people can act quickly.
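To make the metadata-versus-content split concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the records, the field names (`country`, `town`, `entered`, `text`), and the helper functions are invented, and a real deployment would sit on a repository and an analysis platform rather than a list in memory.

```python
from collections import Counter
from datetime import date

# Hypothetical sample records; field names and values are invented for illustration.
records = [
    {"country": "Venezuela", "town": "Macarao", "entered": date(2012, 1, 10),
     "text": "Applicant was born in Macarao."},
    {"country": "Venezuela", "town": None, "entered": date(2012, 5, 2),
     "text": "Grew up in Macarao before moving to Caracas."},
    {"country": "Venezuela", "town": "Caracas", "entered": date(2012, 5, 3),
     "text": "Has never visited Macarao."},
    {"country": "Colombia", "town": "Bogota", "entered": date(2012, 5, 4),
     "text": "Employer letter attached."},
]


def count_by(items, field):
    """Count records per value of a metadata field."""
    return Counter(item[field] for item in items)


def entered_since(items, start):
    """Count records whose entry date is on or after `start`."""
    return sum(1 for item in items if item["entered"] >= start)


def surges(items, field, threshold):
    """Return the field values that appear at least `threshold` times."""
    counts = count_by(items, field)
    return sorted(value for value, n in counts.items() if n >= threshold)


def keyword_hits(items, phrase):
    """Naive content search: counts any mention, with no notion of context."""
    return sum(1 for item in items if phrase in item["text"])


# Answerable from metadata alone.
from_venezuela = count_by(records, "country")["Venezuela"]
recent_entries = entered_since(records, date(2012, 5, 1))
surge_countries = surges(records, "country", threshold=3)

# The town question is where metadata and keyword search both fall short.
macarao_by_metadata = count_by(records, "town")["Macarao"]
macarao_by_keyword = keyword_hits(records, "Macarao")
```

In this made-up sample, two applicants actually lived in Macarao. The metadata finds one of them, because the town field was never filled in for the other. The keyword search finds three, because it also counts the applicant who "has never visited" the town. Neither gets it right, and that gap is the context problem.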

These are just some questions that might be locked in the Content. We don’t know. We CAN’T know until we can intelligently analyze the data.

When you consider the volume of legal immigrants in the United States, it doesn’t take Watson to determine that there are petabytes of content just waiting to provide answers.

Yes and….?

The question out there is: can Big Data help? Here we are, stuck trying to solve these old problems in our Content Management systems, and we aren’t sure if we can create that cross-over. That is the real question: can Hadoop be leveraged to provide real-time insights into our business using the information inside our Content?

Eleven years ago, I posited that I would never sell a search engine that could correctly interpret context. I told everyone that they would have to wait for my son to sell them the solution.

That statement was a combination of two factors: the need for better algorithms to perform the analysis and the need for machines powerful enough to execute them in a timely fashion. The combination of Big Data and the Cloud solves the latter by going parallel in a way that most organizations will be able to afford in the next few years.

The final question is, can we get the content into Hadoop? Can we derive enough structure from the content to populate Hadoop in such a way that there is something meaningful to analyze?

Or to put it bluntly, am I or my son going to be implementing the solution?