Proof of concept

Overview

Objectives

The first phase in big data adoption is proof of concept. A better name might be proof of value, since you are proving the value of the technology to your company rather than the technology itself, but the term proof of concept is more readily understood.

The primary objective of this phase is to prove the technology to yourself. Some secondary objectives are:

  • Become familiar with the various technologies such as HDFS, map-reduce, Hive, Spark, big tables, data flows, etc.
  • Become familiar with a big data cluster and the various services
  • Select a software or consulting vendor to work with in future phases

Remember that a proof of concept is about learning. You do not have to get everything right. Failing is learning. Sometimes failing is the most valuable learning experience. For instance, suppose you pick a vendor for the POC that does not work out. Congratulations, you are successful: you now know which vendor you will not work with in future phases. Plus, you probably know a lot more about what to look for and how to pick a vendor that will work out.

Timeline

The timeline for this phase can vary, but we recommend you plan on three months with six weeks of actual development work. Getting the hardware and network resources you will need will take time. If you are using the cloud, you will still need time for your company to create accounts with the cloud provider, address security concerns and establish safeguards.

Process

A big data proof of concept follows nine basic steps.

  1. Pick the team
  2. Pick the problem
  3. Establish a baseline
  4. Assemble the hardware
  5. Install the software
  6. Load the data
  7. Create your processing
  8. Run your benchmark
  9. Share your results

We will discuss how to conduct each of these steps.

Pick the team

Picking the team starts with identifying who in your company will lead the project. The team should be as small as possible. You will need four sets of skills for the project:

  • Subject matter expert, someone who knows the business side of the problem you are tackling
  • Senior Java or Scala developer, since most of the big data technologies are written in these languages.
  • Senior database developer, someone who is an expert in SQL and understands how databases work
  • Senior network and Linux administrator, since big data runs on Linux and is network intensive.

You may find someone in your company who has all four skill sets. If so, then congratulations, your team is one person. More likely, you will have three or four different employees. Most likely of all, you will have only one or two employees on the project and will need consultants to support you in the effort. Having someone from outside who already has experience will help tremendously.

Picking the team also means picking your software partner. You should pick a company that provides a supported distribution of Hadoop. There are several to choose from, including Hortonworks and Cloudera. We will go into why you need a distribution in the Install the Software section below.

Each vendor has its strengths and weaknesses, primarily in the software they support and the services they provide. If the choice is not obvious, then you should pick two or three of the vendors and run the proof of concept with each. It will add time and cost but will lower uncertainty, which is definitely a prime objective of the project.

Pick the problem

Success in the proof of concept hinges on defining an objective that is clear, relevant and practical. You do not need to have petabytes of data to process. For instance, you do not need to start processing Twitter feeds just so you have a lot of data. Focus on the processing time, not the data size. We recommend that you pick an existing process that takes several hours to run. We call it a “big enough” data challenge. It should be gigabytes of data if possible. Megabytes is probably too small and terabytes is fine but not necessary.

Another option is to replace an existing data warehouse process. Many companies face the challenge that their data warehouse content and processing loads keep growing linearly or even exponentially, while their hardware and processing power grow stepwise or not at all. Big data provides a less expensive means to create your analytics.

You will not process the data the same way on big data as you do in the data warehouse. You will not be creating a star schema, for instance. However, you should be able to provide the same end results, that is, the same report table. The data should take much less time to create and be much faster to access. In fact, you should shoot for 10 to 100 times faster.

Establish a baseline

In order to prove success, you need data. You need to know where you are starting in order to know how far you have come. If you have picked an existing process, then establishing a baseline is much simpler: you simply observe the characteristics of what is running now. The three primary characteristics are:

  1. DATA VOLUME
  2. PROCESSING TIME
  3. HARDWARE

Data volume is the size of the data (gigabytes or terabytes) that is being processed. There is the base amount of accumulated data and the amount of data that is added each day. If data is added with files, then you will need the number of files and the distribution in sizes. For example, you may receive 10 to 100 files each day with a range in size from 60 to 120 GB each, with a daily average of 60 files and an average size of 75 GB.

Processing time is both CPU time and end-to-end wall clock time. The extract-transform-load (ETL) process time should not be ignored. In fact, you may find that faster ETL is one of the biggest gains in the proof of concept. Loading data into traditional databases is usually complicated, brittle and slow.

For hardware, you need the number of processors, the CPU speed, the amount of memory (RAM) and the amount of disk space. Moving from a traditional system to a distributed system like Hadoop means that you will move from one server to multiple servers, but these metrics will be important when presenting the results.

Depending on the particular problem, other metrics may be relevant. However, this covers the basics.

Assemble the hardware

Big data is distributed processing, which means that it runs on multiple servers. With most new technology, a developer just downloads the software to their laptop and starts hacking. You can’t do that with Hadoop. It requires a minimum of four servers to work properly. Most developers do not have four servers lying around that they can play with to learn a new technology.

You should consider using cloud computing for your POC. Hadoop distributors have made getting a cluster running in the cloud very easy. However, for some companies, the cloud is either not an option or a proof of concept in itself.

We recommend, as a minimum, eight computers with 8 GB of RAM, dual processors and 1 TB hard drives. The more hardware you have, the better. The servers should be identical if at all possible.

Note that Hadoop requires Linux. Check the Hadoop distribution for which Linux distributions it supports. Picking the right Linux distribution for your Hadoop will make installation much easier.

You will need to get your network administrators involved. You will need fixed IP addresses for the servers and names assigned in your domain name service (DNS).

You will also need to have the servers on the same router, otherwise you will choke the office network with traffic. Hadoop is very network intensive. Your servers also need internet access to download all the various software needed. Hadoop is actually many different projects, so automated download is essential for easy installation.

Install the software

You should use a Hadoop distribution provided by a vendor. Yes, Hadoop, Spark and the whole zoo of technologies are open source, but building, installing, configuring and integrating all of those packages yourself can be challenging. And once you have it running, you would need to upgrade the software with each release, which seems to occur constantly. It would require a great deal of expertise and time that can be better spent on solving business problems specific to your company. A Hadoop distributor such as Hortonworks manages this pain for you. Plus, they offer limited-time free versions of their distributions specifically for proofs of concept.

Hadoop distributors have made deploying a cluster much easier. Even with using a distribution, however, installing a Hadoop cluster can be complicated, especially when doing it for the first time. There are literally hundreds of parameters to set in Hadoop. We recommend that you bring in help or have access to support from your Hadoop distributor in case you run into problems.

Do not be afraid to wipe it all out and start over again. It is better to experiment and learn how to configure a cluster during the POC; mistakes become much more expensive in the later phases.

Load the data

If possible, load all your data as-is into HDFS. Avoid any transformation before loading. There are three reasons for this: traceability, flexibility and speed.

  1. If the original files are in HDFS, you can trace how you transformed the data from one form to another all the way back to the source.
  2. You have the flexibility to transform the source into other formats at a later time.
  3. You have the full speed of Hadoop parallel processing to do all these transformations.

Keep track of how long it takes to load the files. This will be helpful later when calculating your benchmarks and presenting results. Consider that a huge time cost in traditional systems is the actual ingest of data into the database. If your source data is delimited text files, you can declare a folder with the metadata in Hive (one of the databases in Hadoop), upload the files as-is to an HDFS folder and have the data queryable in seconds.
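
To make this concrete, here is a minimal Spark (Scala) sketch of that pattern; the table name, columns, delimiter and HDFS path are hypothetical. It declares a Hive external table over files that were uploaded as-is, so the data becomes queryable without any transformation.

  import org.apache.spark.sql.SparkSession

  object RegisterRawData {
    def main(args: Array[String]): Unit = {
      // Hive support lets Spark create and query tables in the Hive metastore.
      val spark = SparkSession.builder()
        .appName("register-raw-data")
        .enableHiveSupport()
        .getOrCreate()

      // Declare an external table over the folder of delimited files.
      // The files stay exactly where they were uploaded; only metadata is created.
      spark.sql(
        """CREATE EXTERNAL TABLE IF NOT EXISTS raw_sales (
          |  order_id    STRING,
          |  customer_id STRING,
          |  order_date  STRING,
          |  amount      DOUBLE
          |)
          |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          |LOCATION 'hdfs:///data/raw/sales'""".stripMargin)

      // The raw files are immediately queryable.
      spark.sql("SELECT COUNT(*) FROM raw_sales").show()

      spark.stop()
    }
  }

The load itself is just a file copy into the HDFS folder; nothing is parsed or rewritten until a query actually runs.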

Be aware that some Hadoop data ingest tools can overwhelm a traditional database you might be reading data from. Hadoop by nature wants to operate in parallel. With a simple command line instruction, you could have every processor on every server in your cluster requesting data from the database, which could easily flood the database’s connection pool.
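
If you do pull data from a traditional database, you can cap that parallelism explicitly. The sketch below, again Scala with hypothetical connection details and a hypothetical orders table, uses Spark's JDBC reader and limits the number of partitions so the cluster opens only a handful of connections rather than one per core.

  import org.apache.spark.sql.SparkSession

  object ControlledJdbcIngest {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("controlled-jdbc-ingest")
        .getOrCreate()

      // numPartitions caps how many concurrent connections Spark opens,
      // so the cluster does not flood the database's connection pool.
      val orders = spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dw-host:5432/sales")
        .option("dbtable", "orders")
        .option("user", "etl_user")
        .option("password", sys.env("DW_PASSWORD"))
        .option("partitionColumn", "order_id")
        .option("lowerBound", "1")
        .option("upperBound", "10000000")
        .option("numPartitions", "8")   // at most 8 parallel connections
        .load()

      // Land the data in HDFS as Parquet for downstream processing.
      orders.write.mode("overwrite").parquet("hdfs:///data/raw/orders")

      spark.stop()
    }
  }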

Create your processing

Use the technology that is most comfortable for your team. If your team is strong in relational databases, then use Hive and write your jobs as SQL joins. If your team is strong in scripting languages such as Python, then use Pig to write your data processing jobs. If you have Java or Scala programmers, then use classic map-reduce or, better yet, Spark to write your processes.
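
As an illustration of the Spark route, here is a short Scala sketch with hypothetical table and column names. It joins two of the raw tables and produces the kind of end-result report table a warehouse job might otherwise have built.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  object DailySalesReport {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("daily-sales-report")
        .enableHiveSupport()
        .getOrCreate()

      // Read the tables declared over the as-is files in HDFS.
      val sales     = spark.table("raw_sales")
      val customers = spark.table("raw_customers")

      // Reproduce the end result of the existing job: revenue per region per day.
      val report = sales
        .join(customers, "customer_id")
        .groupBy(col("order_date"), col("region"))
        .agg(sum("amount").as("revenue"), count(lit(1)).as("orders"))

      // Write the same report table the business already consumes.
      report.write.mode("overwrite").saveAsTable("poc_daily_sales_report")

      spark.stop()
    }
  }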

Keep the data processing focused and as simple as possible while still meaningful. You are proving the technology, not building an application. Your process does not need to be production grade code. It just needs to get the data from source to target at a speed and scale that is a significant improvement for your business.

Run your benchmark

As mentioned earlier, consider the data ingest time as part of your benchmarks. It is a matter of seconds to load megabyte text files into HDFS and have the data immediately queryable in Hive. That is not the case for traditional databases. Highlight that advantage to your audience.

Your processing time should scale nearly linearly with cluster size. For example, if a job takes 2 hours on 4 servers, it should take roughly 1 hour on 8 servers. Demonstrating this linear scalability should be easy during the POC. Seeing is believing.
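
A simple way to capture that number is to time the full run yourself. The Scala sketch below wraps the hypothetical report query from the previous step in a wall clock measurement; run it on clusters of different sizes and compare the figures.

  import org.apache.spark.sql.SparkSession

  object BenchmarkRun {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("poc-benchmark")
        .enableHiveSupport()
        .getOrCreate()

      // Measure end-to-end wall clock time for one full run of the POC job.
      val start = System.nanoTime()

      spark.sql(
        """INSERT OVERWRITE TABLE poc_daily_sales_report
          |SELECT s.order_date, c.region, SUM(s.amount) AS revenue, COUNT(*) AS orders
          |FROM raw_sales s JOIN raw_customers c ON s.customer_id = c.customer_id
          |GROUP BY s.order_date, c.region""".stripMargin)

      val elapsedMinutes = (System.nanoTime() - start) / 1e9 / 60.0
      // Record this figure for each cluster size (for example 4 versus 8 servers)
      // to show how close to linear the scaling actually is.
      println(f"Run completed in $elapsedMinutes%.1f minutes")

      spark.stop()
    }
  }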

Share your results

Do not hide your light under a bushel. Share your results with the rest of the company as broadly as possible. Publish on an internal wiki or social media if you have it. Have presentations with question and answer sessions.

Some lessons learned from other POCs:

  • Spend at least half of your time explaining the basics of what Hadoop is and the business problem you were solving
  • Present solid, specific metrics regarding the servers used, the amount of data and the time to process
  • Do not belittle the technology or the solution you are replacing.
  • Emphasize that Hadoop does not replace all other existing technologies. It does, however, make certain types of problems easier, faster or even possible to solve.
  • If you have worked with a software or consulting vendor, have them participate in the presentation but do not let them turn it into a sales pitch.

Remember that the POC is just the first step of your journey. The presentation is vital for you to muster the internal support necessary to move on to the next phases.

Challenges

There are four common challenges that arise in conducting a big data proof of concept:

  1. Underestimating the infrastructure lead time
  2. Picking the wrong problem
  3. Letting the IT department lead
  4. Underestimating internal opposition

We will quickly discuss each of these challenges.

Underestimating the infrastructure lead time

In a large firm, getting the servers and networking you need to conduct the proof of concept is usually the hardest and longest task. Large firms have strict network policies and longer purchasing cycles. Ordering servers and getting them installed can take weeks.

If you decide to use a cloud provider such as Amazon, Microsoft or Google, you will still face lead times, especially if your firm has never used the cloud before. You will need to set up accounts and policies. You will also face security issues on what data can be uploaded to the cloud.

Picking the wrong problem

Do not pick a problem that is too big. You do not want to get bogged down in a lot of development work. You should keep your team small and your deadline tight.

Do not confuse a proof of concept with developing a product. Your objective is to learn the technology and what it can do for your business. The proof of concept is only the first step. You will have time to get a working beta product in the next phase.

However, do not pick a problem that is too trivial. If you do, then your results and benchmarks will be meaningless or dismissed as irrelevant to the business.

Letting the IT department lead

If you are a large firm, do not use your information technology (IT) department to lead or even conduct your big data proof of concept.

Yes, the IT department has the skills in software, hardware and networking needed. Yes, you will probably need their help with getting hardware and networking setup. However, the primary purpose of the IT department is to keep production systems up and running smoothly. All other purposes are secondary. Keeping production systems running smoothly requires minimizing or eliminating change, disruption and risk. Successful IT departments are risk averse by the very nature of their mission.

Creating a proof of concept is a research and development effort. A successful proof of concept is all about change, disruption and risk. You need a research and development culture to conduct a successful proof of concept, and that culture is nearly the polar opposite of an IT department's. It is not just about knowing the technology; it is about having a culture that embraces risk.

Underestimating internal opposition

Prepare for the naysayers. There are people within your firm who identify their value with the technology that they support. They can easily perceive big data technology as a threat to themselves and to the firm. It is important to emphasize that big data technology does not replace all other existing technology. It is also important not to belittle the solution that you are benchmarking against in the POC. Focusing on the positive will help the company accept and embrace that big data technology makes certain types of solutions easier, faster or even possible.