
Research and development



The second phase of big data adoption is research and development. You have completed the proof of concept. You have learned more about the technology and proven to yourselves as a team that big data has the potential to provide significant improvements for your business. Now it is time to solve a real business problem.

The objective in the previous POC phase was to prove the technology to yourself. The primary objective of this phase is to prove the technology to the business. Some other secondary objectives are:

  • Learning the technologies in depth
  • Getting used to managing a cluster, including installing, configuring and monitoring services
  • Building a core team that can create big data solutions
  • Solving a real business problem
  • Having business users beta test the solution

If you are taking on an existing problem, then you will build your solution in parallel. Most likely, you will consume the same data as the existing business application and process it on the new platform. At the end of the phase, you should be able to demonstrate that the big data platform produces the same or better results in less time.

If you are taking on a new problem, then at the end of the phase you should have a minimum viable application that is demonstrable to the business.


The research and development phase should take 3 to 6 months, depending on the scope of the problem and the size of your team. You should not attempt to solve a problem that will take more than 6 months. The team will be discouraged and the business needs to see results.


A big data research and development project follows seven basic steps.

  1. Pick a business problem
  2. Have beta users
  3. Expand the team
  4. Expand the cluster
  5. Learn the technology
  6. Build the solution
  7. Run the process in parallel

These steps build upon the work and experience your team gained during the proof of concept project. As such, there will be fewer details to cover.

We will discuss how to conduct each of these steps.

Pick the business problem

As with the POC, your success is determined to a large degree by which business problem you choose to solve. If the problem is too large or complex, then you are setting yourself up for failure or, at the very least, expense and delay. If the problem is too small, then the business will be underwhelmed by your results.

The R&D project does not have to be the same problem used in the POC, but it helps. It should have a larger scope than the POC. You will want to pick a project that you can take into production. However, do not treat this phase as a production development. It is still too early for that.

Start with a batch process. Real-time streaming is the cutting edge for now, and it will eventually become mainstream, but it is not there yet. Manage your risk and start with batch or micro-batch processes, such as Spark. You will have time to evolve as you learn.
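
The micro-batch idea itself is simple enough to sketch in a few lines of plain Python: treat the incoming stream as a series of small batches, each handed to ordinary batch code. Spark's streaming model works on the same principle, though this standalone sketch is only an illustration, not Spark's API.

```python
def micro_batches(records, batch_size):
    """Group a (potentially unbounded) record iterator into small batches,
    the way a micro-batch engine turns a stream into a series of batch jobs."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, partial batch

# Each batch can then be processed by the same code you wrote for batch jobs.
totals = [sum(batch) for batch in micro_batches(range(10), 4)]
# totals is [6, 22, 17]: three small batch jobs instead of one record at a time
```

The point of starting here is that your batch logic carries over unchanged; only the grouping layer is new.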

Have beta users

In this phase, you need an engaged customer. Picking your beta users goes hand-in-hand with picking the business problem. One can usually find a technology enthusiast on the business side, or someone who seeks out and embraces change.

Your internal customers should be your best champions. On the other hand, an unsatisfied customer can kill your project. You will need to manage customer expectations. Hadoop is not a panacea, a solution to all your technology woes. Hadoop does not replace all existing technology. You may be surprised that the legacy relational database does some things remarkably better than Hadoop. Your customer should know not only the strengths but also the weaknesses of Hadoop.

Expand the team

If you found one person who had all the skills you needed in the POC phase, lucky you, but that one engineer will not be enough for R&D. You will need a team. There are three distinct roles in big data development: data administrator, data engineer and data scientist.

  • Data administrators are responsible for creating, configuring and monitoring the cluster. A data administrator must have Linux administration experience and should have experience in networks and database administration. They will need specialized knowledge with Hadoop. There are hundreds, perhaps thousands, of parameters to set in configuring a cluster. Some can have significant impact on the performance of your data processes.
  • Data engineers are responsible for the loading, transformation, storage and processing of data on the cluster. A data engineer must have experience with Java or Scala and should have experience with SQL databases. They will need specialized knowledge with Hadoop, MapReduce, Spark and NoSQL.
  • Data scientists are responsible for finding meaningful information in the data. A data scientist must know R or Python and should know machine learning libraries.

If you did not get contract support from a Hadoop distributor during the POC phase, then you should give it serious consideration for the R&D phase. A single phone call or email can provide the answer you need within a matter of hours, where your team might otherwise spend days of frustration trying to research and solve the problem on their own. The Hadoop technology base is huge and constantly changing.

No one can know it all. Most distributors now provide proactive help. They monitor the performance of your cluster and provide recommendations on how to improve it.

Expand the cluster

The cluster used in the POC is not going to be big enough. You will need more resources. You will want to rebuild the cluster with high-availability. Create separate control nodes and worker nodes. Your Hadoop distributor should be able to provide you reference hardware implementations.

In this phase, we recommend that you use cloud computing for your R&D cluster. It is not a matter of direct cost. In fact, cloud computing will probably be more expensive. Rather, it is opportunity cost. Your organization is still learning. You could take weeks to buy 10 expensive servers only to find out in 3 months that they are the wrong configuration. It happens. On the other hand, you could start up a cluster in the cloud in a day and three months later change the hardware configuration completely with a click of a mouse.

Learn the technology

During the POC, your team only became acquainted with Hadoop technology. Now is the time to dive deep and learn.

If you have experienced Hadoop developers, then have them cross train your other team members. Otherwise, spend the time and money to send your engineers to training. A week of hands-on instruction is worth four weeks of web searching, trial and error. Do not get me wrong. You want your team to conduct trial and error, but their effort will be much more valuable if it has a foundation of knowledge from which to expand.

If you are an agile development shop, you can still run your R&D project in sprints. You will not have user stories for much of your work. Instead, you can have hypothesis stories. A user story enables a developer to focus on solving a customer requirement as quickly and succinctly as possible. Likewise, a hypothesis story enables a developer to focus on solving a technology problem or question as quickly and succinctly as possible. It is all too easy to get lost testing things simply because they are there. A hypothesis tells you what you are testing and why you are testing it.

Build the solution

Learning the technology is the research part; building a solution is the development part. Having a business problem to solve will focus your efforts and make the learning more practical and effective. It also communicates the results of your learning much better to the business. Remember, though, that you are not building production code. This is an R&D project. What time you would spend hardening your code you should spend instead experimenting.

When building your solution, take the simplest, most brute-force approach first. See if it works. Establish a baseline, then evolve your solution with more efficient processes and structures. Distributed computing requires a different way of programming. Sometimes an approach that would be more efficient in a traditional object-oriented application will be highly inefficient in a distributed process. You will only learn by trial and error.

Not all of the work is data science. Much of the work will be getting the data into the cluster and making it usable. You should treat these processes as first-class engineering problems and not some side distraction. Working in an enterprise, you are accustomed to extract-transform-load (ETL) processes and probably have commercial tools within the firm to handle such jobs. As noted in the POC phase, you should load data directly into Hadoop as is. In other words, you should build extract-load-transform (ELT) processes, not ETL processes. There is a significant difference.
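
The ELT distinction can be sketched in a few lines of Python. The file paths and field names here are hypothetical, and an ordinary directory stands in for HDFS; the point is the ordering: land the raw records untouched, then transform the stored copy as a separate step.

```python
import csv
import os
import shutil

RAW_DIR = "raw"          # hypothetical landing zone, standing in for HDFS
CURATED_DIR = "curated"  # hypothetical output area for transformed data

def extract_load(source_file):
    """Extract and load: copy the source file into the landing zone as is,
    with no parsing or cleansing on the way in."""
    os.makedirs(RAW_DIR, exist_ok=True)
    return shutil.copy(source_file, RAW_DIR)

def transform(raw_file):
    """Transform runs later, against the raw copy, so it can be rerun or
    revised without going back to the source system."""
    os.makedirs(CURATED_DIR, exist_ok=True)
    out_path = os.path.join(CURATED_DIR, os.path.basename(raw_file))
    with open(raw_file, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "amount"])
        writer.writeheader()
        for row in reader:
            # cleansing happens here, after load, not before
            writer.writerow({"id": row["id"].strip(),
                             "amount": f"{float(row['amount']):.2f}"})
    return out_path
```

In an ETL tool the cleansing would sit in front of the load; here the raw data is preserved, which is exactly what makes reprocessing and experimentation cheap on the cluster.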

In the POC you probably ran all your processes manually. In this phase, you will need to automate your processes. There are several Apache projects for handling scheduling and automation. Oozie is the most widely supported. Each has its advantages and disadvantages. You should allot time to find which one meets your needs.
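
As an illustration, a minimal Oozie workflow chains two shell actions with explicit success and failure transitions. The script names and `appPath` are placeholders, and the exact schema versions available depend on your distribution.

```xml
<workflow-app name="daily-ingest" xmlns="uri:oozie:workflow:0.5">
  <start to="load-raw"/>
  <action name="load-raw">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>load_raw.sh</exec>
      <file>${appPath}/load_raw.sh</file>
    </shell>
    <ok to="transform"/>
    <error to="fail"/>
  </action>
  <action name="transform">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>transform.sh</exec>
      <file>${appPath}/transform.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Ingest failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

A coordinator definition would then run this workflow on a schedule; evaluating how naturally your pipeline fits this action-and-transition model is part of choosing the right tool.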

Run the process in parallel

There are two paths to take in your R&D effort:

  1. create a new process that was impossible or impractical to solve before
  2. replace an existing process in a much faster or more efficient way

If you took the second path, then you will need to run your new process in parallel with the old process in order to validate your success with the business. They need confidence in the process so that you can move on to the next phase. You should build automated reconciliation between the new data and the old data. Of course, you should run that reconciliation in the Hadoop cluster. Reconciliation is a natural MapReduce problem.
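
To see why reconciliation maps naturally onto MapReduce, consider a sketch in plain Python (the record layout and key choice are illustrative): the map phase tags every record from both systems with its source, and the reduce phase groups by key and flags any disagreement.

```python
from collections import defaultdict

def map_record(source, record):
    """Map phase: emit a (key, (source, value)) pair for every record,
    whether it came from the old system or the new one."""
    key, value = record              # e.g. (account_id, balance)
    return (key, (source, value))

def reduce_keys(pairs):
    """Reduce phase: group by key and compare the values from each source;
    any key missing on one side or disagreeing in value is a break."""
    grouped = defaultdict(dict)
    for key, (source, value) in pairs:
        grouped[key][source] = value
    return {key: by_source for key, by_source in grouped.items()
            if by_source.get("old") != by_source.get("new")}

# Usage: feed both systems' outputs through the same map function.
old = [("acct-1", 100.0), ("acct-2", 250.0)]
new = [("acct-1", 100.0), ("acct-2", 250.5), ("acct-3", 75.0)]
pairs = [map_record("old", r) for r in old] + [map_record("new", r) for r in new]
breaks = reduce_keys(pairs)
# breaks flags acct-2 (value mismatch) and acct-3 (missing from the old system)
```

On a real cluster the grouping is the shuffle, so the comparison parallelizes across keys for free, which is what makes daily reconciliation over large datasets practical.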

Once your beta users are satisfied with the results, it is time to move on to the next phase: first production.


Finding experienced engineers

Experience in big data is in extremely high demand. If you can find available consultants or employees, they will be expensive. One alternative is to grow your own, that is, to pick smart employees and pay to have them trained in the technology.

If cost and time are constraints, then you should consider having offshore development teams that have on-shore leadership. The offshore team should be as close to your time zone as possible to enable team interaction. We recommend 4 hours of overlap between the onshore and offshore teams. Otherwise, there will be too many delays as issues take more than one business day to resolve.

Knowing when things go wrong

When your beta users see data that does not match their expectations, they will be left to wonder what went wrong.

  • Is the source data corrupted?
  • Is there a bug in the processing?
  • Is the model or expectation wrong?

Human nature being what it is, users will usually assume that there is something wrong with the data or the code before assuming there is something wrong with their model.

You should also have unit tests for your code. It is not straightforward with MapReduce code, but it is possible. You will also need to create some ad hoc means to check your data integrity. We will discuss these issues more in the next phase.
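
One practical pattern, sketched here in Python rather than Java, is to keep the map and reduce logic in plain functions that take and return ordinary values, so they can be unit tested without a cluster or framework context objects. The word-count logic is just a placeholder for your own job.

```python
def map_words(line):
    """Pure map logic: no framework context object, just a line in
    and a list of (key, count) pairs out."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Pure reduce logic: sum the counts for a single key."""
    return (word, sum(counts))

# These tests run anywhere Python runs; no cluster needed.
def test_map_words():
    assert map_words("Big Data big") == [("big", 1), ("data", 1), ("big", 1)]

def test_reduce_counts():
    assert reduce_counts("big", [1, 1, 1]) == ("big", 3)

test_map_words()
test_reduce_counts()
```

The thin framework wrapper around these functions stays untested or gets a single integration test, while the business logic, where the bugs actually live, is covered cheaply.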

Overwhelmed with the technology

Apache Hadoop is a very large and ever-growing ecosystem of technologies. Hortonworks provides support for 22 Hadoop-related open source projects (or applications), but that is only a subset of all the projects. In many cases, there are two or more projects that serve the same or overlapping purposes. And that is just the projects adopted by the Apache Foundation. There are dozens of other open source projects released and supported by large technology companies that are not managed by the Apache Foundation. New projects are announced every month. These projects are interdependent, and each of them is releasing new features all the time. It can be overwhelming.

Pick and choose your scope. You cannot learn everything at once. Be aware of the various projects but focus on the ones that you need. It is very easy to chase after a new project simply because it is new. Remember that it usually takes a year or longer for these projects to mature and stabilize into something you can deploy into production.

In order to keep pace with changes, consider allocating 10 percent of your team’s time (four hours per week) to continually read up and test new technologies.