Thursday, July 2, 2015

The Problem With Big Data

Over the past few years, Big Data has been a marquee term in the technology space. Many organizations have tried to tackle the fundamental problems associated with Big Data analytics, and heavy investment has flowed into key areas, mainly scalability and request efficiency. Even recently, IBM made a huge investment in the Big Data space. Ultimately, Big Data technologies are meant to answer questions based on observed trends that would be impossible for a user, or groups of users, to determine themselves. Although these technologies are quite impressive, they often lack an easy way to COLLECT metrics without involving someone familiar with each component's specific outputs. This is the problem with Big Data: the question of "how do I get the data I want consumed by the Big Data solution?" is one that no one has an industry-recognized answer for.

"Just Get Us the Data, and We'll Take it From There"

If you read my previous article, you know I am borderline obsessed with the phrase "there has to be a better way". There are plenty of examples of Big Data tools utilizing common communication frameworks and protocols. The real effort in this process is how an application owner pulls that data out and submits it to the Big Data solution with minimal, effective effort. Many times, I see organizations put the burden of outputting the data on developers, asking them either to write REST interfaces into the application or, even worse (from a performance perspective), to write the data out to log files. Both of these efforts end up solving the problem, but they can introduce security issues, alongside the fact that you are asking a developer to implement a brand-new component in the solution for one-off requests. Try this exercise:

  1. Think of a question you would like to ask your application
  2. Write down all of the metrics or points of data required to come up with an answer
  3. Picture all of the individuals required to get at each one of those points of data
  4. What if one of those individuals defines a metric differently? What would be the impact on the answer?
  5. Could anyone maliciously use this data if it were accessed?

These are just the topics you need to cover to answer one question. The only way to prevent this from becoming unbearable is to get to the same results using a different path.

If You Are Relying on Logs, You Are Doing IT Wrong

Log files are open to interpretation. Did a developer come up with the string that is written? Then there are probably edge cases where that output is not right. Basically, stop writing log files to get at specific points of data. There are solutions out there that will instrument the application and provide a much richer context of what is going on within the stack, context that could never be captured in a string written to the file system. Instead of writing one-off messages to solve one-off problems, work on implementing a single framework that can be utilized across the organization to get at all points of data.
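To make the contrast concrete, here is a minimal sketch, not tied to any specific product: the first half shows the log-parsing approach and its fragility, the second half a tiny hypothetical shared metrics client (the `Metrics` class and its method names are illustrative assumptions) that hands consumers typed values instead of strings.

```python
import re
from collections import defaultdict

# The log-file approach: every consumer must parse this string and agree
# on its exact format. Change the wording and the regex silently breaks.
log_line = "processed order 1234 in 87ms"
match = re.search(r"in (\d+)ms", log_line)
duration_from_log = int(match.group(1)) if match else None

# A structured alternative: a tiny, shared metrics client (hypothetical
# sketch). Names and values are explicit, so there is nothing to parse.
class Metrics:
    """Collects named counters and timings in a well-defined shape."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, value=1):
        self.counters[name] += value

    def timing(self, name, millis):
        self.timings[name].append(millis)

metrics = Metrics()
metrics.incr("orders.processed")
metrics.timing("orders.duration_ms", 87)

# Consumers read typed values, not strings -- no regex, no edge cases.
print(metrics.counters["orders.processed"])   # 1
print(metrics.timings["orders.duration_ms"])  # [87]
```

The point is not this particular class, but that a single shared interface like it can be reused for every metric across the organization, instead of a new log format per request.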

The Three Points

Coming full circle, a real Big Data implementation is composed of three components (similar to monitoring tools): data collection, data analysis, and data presentation. I get the feeling there are a large number of players trying to corner the latter two areas. The data collection area is still extremely green in my eyes, and I am eagerly waiting for someone to really make a play for answering that question.
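As a conceptual sketch only (the function names, endpoints, and sample numbers are all illustrative, not from any product), the three components can be pictured as a tiny pipeline:

```python
from statistics import mean

# 1. Data collection: the green area -- getting raw points out of the app.
def collect():
    return [
        {"endpoint": "/checkout", "ms": 120},
        {"endpoint": "/checkout", "ms": 80},
        {"endpoint": "/search", "ms": 35},
    ]

# 2. Data analysis: turn raw points into an answer.
def analyze(points):
    by_endpoint = {}
    for p in points:
        by_endpoint.setdefault(p["endpoint"], []).append(p["ms"])
    return {ep: mean(ms) for ep, ms in by_endpoint.items()}

# 3. Data presentation: show the answer to a human.
def present(summary):
    return "\n".join(f"{ep}: {avg:.0f}ms avg" for ep, avg in sorted(summary.items()))

print(present(analyze(collect())))
```

The analysis and presentation stages above are where most vendors compete; the `collect()` step, trivial here, is exactly the part that remains unsolved at organizational scale.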