Introduction

Despite the press coverage and boardroom discussion, most companies have not yet adopted big data technology. When considering big data, every business leader asks the same basic questions.

  • Do I actually have big data?
  • What strategic advantage will I gain?
  • Where do I start?
  • What hardware and software are needed?
  • What sort of team needs to be assembled?
  • What are reasonable timelines and objectives?

Our purpose here is to answer these questions and provide business leaders a basic game plan for adopting big data technology.

Why big data

Big enough data

Big data is any data large enough that it cannot be easily processed or managed on traditional systems. The name “big data” is actually misleading. You do not need petabytes of data to have a big data problem. A better name might be “big enough data.” If your data processing job is taking hours to complete, then you could benefit from big data technology.

A more accurate name still would be “massively distributed computing.” It’s not very catchy, but that is the essence of big data technology: its power comes from enabling multiple machines to work on the same problem at the same time. For instance, a job that takes 24 hours to run on one server can be completed in about an hour on 24 servers, provided the work divides evenly among them.
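The 24-server figure is the ideal case. As a rough sketch of the arithmetic, the first function below assumes the job splits perfectly with no coordination cost, and the second assumes that some fraction of the job cannot be parallelized (the relationship usually called Amdahl's law). The function names and the 5% serial fraction are illustrative assumptions, not figures from this paper.

```python
def ideal_runtime_hours(single_server_hours: float, servers: int) -> float:
    """Run time if the work splits evenly and coordination costs nothing."""
    return single_server_hours / servers


def runtime_with_serial_part(single_server_hours: float, servers: int,
                             serial_fraction: float) -> float:
    """Run time when part of the job must still run on a single machine."""
    serial = single_server_hours * serial_fraction
    parallel = single_server_hours * (1.0 - serial_fraction) / servers
    return serial + parallel


# A 24-hour job spread over 24 servers.
print(ideal_runtime_hours(24, 24))
# The same job if an assumed 5% of it cannot be parallelized.
print(round(runtime_with_serial_part(24, 24, 0.05), 2))
```

Even a small serial portion dominates once the parallel portion has been spread thinly, which is one reason to measure a real workload during a proof of concept rather than rely on the ideal figure.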

Perhaps we should just call it Hadoop, which has become synonymous with a host of open source projects that encompass most of what we call big data technology.

Three basic concepts

There are three basic concepts which underlie big data technology:

  • Data is shared across servers
  • Processing is parallel
  • Processing is pushed to the data

First, data is shared across servers. No longer is one machine large enough to hold all our data. However, we want all the data to be accessible together. The answer is to make the storage of several machines look like one big hard drive. That is what technologies such as Hadoop Distributed File System (HDFS) and Amazon Simple Storage Service (S3) do.

Second, processing is parallel. To process the data sequentially would take too long. Instead, the data must be processed in parallel. That means, in many cases, fundamentally changing the way we write software. Many technologies have emerged to make this easier, often even trivial.

Third, processing is pushed to the data. Traditional application architectures pushed the data to the processing. No network is big enough or fast enough anymore to move all our data. On the other hand, the computation code is much smaller than the data it processes. We must therefore bring the code to the data: distribute the processing across multiple servers and have each one compute as close to its share of the data as possible. That is what technologies such as Hadoop MapReduce, Spark and Flink do.

There are dozens of big data technologies and products, and more announced every week, but they each leverage, expand or support these three basic concepts.
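The three concepts can be sketched in a few lines of ordinary Python. This is a single-process toy, not how Hadoop or Spark is actually used: plain lists stand in for data blocks held on different servers, threads stand in for separate machines, and the names `partitions`, `count_words` and `word_count` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Concept 1: the data set is split into partitions, each held by a different
# "server". Plain lists stand in for blocks of a distributed file system.
partitions = [
    ["big data", "big enough data"],
    ["data is shared", "processing is parallel"],
    ["processing is pushed to the data"],
]


def count_words(lines):
    """Concept 3: the small piece of code that is sent to each partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def word_count(parts):
    # Concept 2: all partitions are processed at the same time. Threads stand
    # in for separate machines in this single-process sketch.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_counts = list(pool.map(count_words, parts))
    # Only the small per-partition summaries are brought together and merged.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


totals = word_count(partitions)
print(totals["data"])  # 4
```

The point of the design is that `count_words` travels to the data and only its compact result travels back, so the volume moved over the network stays small no matter how large each partition grows.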

The business objectives

You are not collecting and processing data for the sake of processing data. Your objective is to use data to do what you do better. More than that, you want to keep getting better at using data this way. You are hoping to achieve a strategic breakthrough: to transform your business so that it uses data in ways that are not apparent today. At the very least, you do not want to be left behind by other companies that are transforming themselves into data-driven enterprises. Companies now at the forefront of their industries have come to understand that their data is a valuable asset which they can leverage to drastically improve the way they do business.

You want your firm to evolve to a higher level of analytics. There are four levels of business analytics:

  1. Descriptive
  2. Diagnostic
  3. Predictive
  4. Prescriptive

One begins with descriptive analytics, which is basic reporting of what has happened, such as sales reports or sentiment surveys. Descriptive analytics provides only hindsight, but it is a necessary building block for moving to higher levels of analytics.

Diagnostic analytics is the application of statistical tools to discover trends and correlations. It seeks to answer why events happen, but it is still retrospective.

The next stage in the evolution is predictive analytics, which seeks to answer what will happen. A common example is sales forecasting.

The final objective, however, is prescriptive analytics, which is the ability to answer how to make something happen. An example of prescriptive analytics is price optimization.

Level  Name          Question
1      Descriptive   What happened?
2      Diagnostic    Why did it happen?
3      Predictive    What will happen?
4      Prescriptive  How do we make it happen?
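To make the four levels concrete, here is a toy walk-through on four months of sales. The figures are invented purely for illustration, and the straight-line relationship between price and units sold is a simplifying assumption of this sketch, not a method this paper prescribes.

```python
# Hypothetical monthly figures, invented for illustration only.
prices = [10.0, 11.0, 12.0, 13.0]   # price charged each month
units = [100.0, 90.0, 80.0, 70.0]   # units sold each month

# 1. Descriptive: what happened?
revenue = [p * u for p, u in zip(prices, units)]
total_revenue = sum(revenue)

# 2. Diagnostic: why did it happen? Fit units = intercept + slope * price
#    by ordinary least squares to see how price relates to volume.
n = len(prices)
mean_p = sum(prices) / n
mean_u = sum(units) / n
slope = (sum((p - mean_p) * (u - mean_u) for p, u in zip(prices, units))
         / sum((p - mean_p) ** 2 for p in prices))
intercept = mean_u - slope * mean_p


# 3. Predictive: what will happen at a given price?
def predicted_units(price):
    return intercept + slope * price


# 4. Prescriptive: how do we make it happen? Among candidate prices,
#    choose the one with the highest predicted revenue.
candidates = [p / 2 for p in range(10, 41)]  # 5.00 to 20.00 in 0.50 steps
best_price = max(candidates, key=lambda p: p * predicted_units(p))

print(total_revenue, slope, predicted_units(14.0), best_price)
```

Each level reuses the output of the one before it, which is why descriptive reporting has to be in place before the higher levels are attempted.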

Sounds perfect: use data to know what actions to take in order to achieve our business goals. How do we get started?

Four phases

By planning or circumstance, most companies follow four phases in adopting big data technology:

  1. PROOF OF CONCEPT (POC)
  2. RESEARCH AND DEVELOPMENT
  3. FIRST PRODUCTION
  4. ENTERPRISE INFRASTRUCTURE

Consider it a natural progression. We will briefly describe these four phases here and then explore each of them deeper over the following chapters.

Proof of concept

The first phase is proof of concept. It usually takes 1 to 3 months. The objective is to prove the hype. Can big data technology actually deliver an order of magnitude improvement over traditional technology?

The company picks a single, well-defined business objective and implements a solution using big data technology. The business objective can be either one that is already being met but is causing problems, or one that is considered too hard or too expensive to solve with traditional technology.

We recommend that the company pick an existing process that takes several hours to run. We call it a “big enough” data challenge. We also recommend that companies evaluate different technology alternatives during this phase.

Research and development

The second phase is research and development. It is usually 3 to 6 months in duration. The objective is to learn how to build a big data solution as a team.

The company takes the proof of concept or a similar business objective, expands it and turns it into a solution that business users can actually use. The results are treated as beta and run in parallel with any system that is to be replaced.

First production

The third phase is first production. It usually takes another 3 to 6 months beyond research and development. The objective is to move the solution from beta to a proper production system that business users can rely on.

In this phase, the information technology department, operations, support and infrastructure all become integrally involved.

Enterprise infrastructure

The fourth phase is enterprise infrastructure. It can take 6 to 12 months to complete. The objective is to create a platform the whole company can use. How to implement this phase is beyond the scope of this paper. However, each of the previous three phases is designed to prepare you for success in this phase.

Not exactly linear

The four phases are not as clear-cut as they may first seem. They do follow sequentially, but there can be significant overlap.

More than one department may conduct proof of concept projects. In fact, this should be encouraged so long as there is open communication between the teams. Different departments may have significantly different requirements. The sooner these differences are discovered, the less impact they will have on cost and risk later on.

Different business solutions can move through the research and development phase and first production in parallel. The company may want more than one production solution in place before it decides to move into the enterprise infrastructure phase.

Strategic considerations

In each phase, there are a number of strategic decisions to consider. The primary ones are:

  • Software
  • Hardware and networking
  • Team composition
  • Training
  • Business processes
  • Roadblocks

In the following chapters, we will describe each of the four phases in detail and discuss how to address these strategic considerations.

The foremost strategic consideration is that big data technology is constantly moving. It is a natural tendency to make a decision, draw a line in the sand and say, “There, that’s done. This is the way it will be.” Your big data decisions will need to evolve continuously. Build a team and infrastructure that will move with the technology, rather than one that merely meets it as it exists today. Your company needs to build wagons, not forts.