First production

Overview

Objectives

The third phase of big data adoption is deploying your first production application. You have completed your research and development project. You have a data process running in parallel with a legacy process, or you have a completely new data process that was unattainable with previous technology. The primary objective of this phase is to enable users to run the business relying on the data produced by the cluster.

Up till now, your Hadoop cluster has been a sandbox that you can play in. From now on, the business will be relying on the data. If a process fails or the cluster goes down, there will be consequences.

Timeline

The research and development phase should have reduced the risk and complexity of this phase. The first production phase should take 3 to 6 months, depending on the scope of the problem and the size of your team.

Process

A big data first production project consists of seven basic steps:

Bring in support
Split the cluster
Establish security
Build quality checks
Build the user interface
Improve the cluster
Improve the processes

Most of these steps can proceed in parallel. They build upon the work and experience your team gained during the proof of concept (POC) project and the research and development (R&D) project. As such, there will be fewer details to cover.

We will discuss how to conduct each of these steps.

Bring in IT support

We strongly recommended that you not let the information technology (IT) department run the POC or R&D phases of your big data development. The reason is that successful IT departments reduce risk, while R&D is about embracing risk.

You have now entered the phase of big data adoption when you want to begin lowering risk. It does not mean that you will stop experimenting. Big data technology is still young, so it will continue to evolve at a frantic pace. However, from now on, your production big data will be separated from your experimental big data.

As such, it is time to get the IT department involved in your big data efforts. You will need to have IT support trained as data administrators in order to monitor and maintain your production cluster.

Split the cluster

You will have noticed during the R&D phase that the cluster seems to gradually grow in size. That is natural and will continue. At this point, however, you will need to make a big increase in the size of your Hadoop infrastructure. You should create two clusters: a development (DEV) cluster and a production (PROD) cluster. There is no practical need for creating a test, user acceptance testing (UAT) or other stages that you might have in your other application environments. It is overkill. Hadoop is a distributed resource sharing platform. Your DEV platform can serve all of those other roles. What is important is that you have one cluster that your engineers can break and another that your internal customers can rely on to be always up and available.

You have two options. You can either keep the existing cluster as DEV and create a new cluster for PROD, or you can designate your R&D cluster as PROD and create a new DEV cluster. There is no right or wrong approach. For the most part, it will depend on your IT policies.

Establish security

Your production cluster will need to be secure. Hadoop came out of a laboratory, so security was not a pressing issue in its first versions. Security in a distributed processing environment is not trivial. However, Hadoop has been in the enterprise for some time now, so there are excellent enterprise-strength security options available. For the most part, these solutions are built on Kerberos.

Your IT department will need to drive this process. Most Hadoop distribution vendors provide special professional services to help with setting up a secure cluster.

Your security should focus on enabling two things: monitoring resource usage and controlling data access. In this phase, you probably only have one department using the cluster. However, Hadoop is a shared resource. You should treat the first production project as the first tenant of many in a multitenant environment.

Build your quality checks

In the R&D phase, we discussed the challenge of knowing when things go wrong. In a production environment, it is critical that you monitor the quality of the data. You will be loading data from different sources, from different departments within the company or different data vendors. The source of the data is out of your control, so you cannot assume that it is correct.

There are basic monitors, such as alerting if files do not arrive by a certain time. There are basic checks, such as checksums or parsing to ensure the files are not corrupted. You should go beyond these basics and check the data itself with statistical analysis. For instance, if it is account data, then there should be a stable distribution for the number of accounts and transactions received each day. Your quality check processes should automatically alert if there is a significant deviation from that distribution.
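As an illustration of such a statistical check, here is a minimal sketch in Python. It assumes that you keep a short history of daily record counts and that a simple z-score test is good enough to flag a suspicious load. The function name `is_anomalous`, the threshold of three standard deviations and the sample counts are all hypothetical; a real check would be tuned to your own data.

```python
from statistics import mean, stdev


def is_anomalous(history, today, threshold=3.0):
    """Return True if today's count deviates from the historical mean
    by more than `threshold` sample standard deviations (z-score test)."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical: any change is a deviation.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily account counts from the previous week.
daily_accounts = [10120, 9980, 10050, 10210, 9900, 10075, 10010]

for count in (10060, 4200):
    status = "ALERT" if is_anomalous(daily_accounts, count) else "ok"
    print(count, status)
```

With these sample numbers, a load of 10060 accounts passes, while a load of 4200 raises an alert. The same pattern can be applied to transaction counts, file sizes or any other daily measure.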

Build your user interface

Customers focus on what they understand. You should not expect your business users to understand big tables and MapReduce, so do not expect much feedback on the Hadoop technology itself. They will understand the data and the user interface, however, so expect plenty of feedback on those.

There are several applications on the market that provide self-service data analytics for Hadoop. Work with business users to build the data sets that work best with these tools. Business users should be able to visually explore the data to find trends and anomalies.

There are also analyst notebooks such as Jupyter and Zeppelin for the power users. The clever business analyst who builds complex models in Excel will love these applications. The analyst notebooks put machine learning algorithms in the hands of your users. As you progress towards becoming a data-driven enterprise, you should encourage each department to have its own data science capabilities.

Improve the cluster

Improving the cluster can mean increasing the capacity by adding more worker nodes. Hadoop is linearly scalable, so you should see incremental and proportional improvements in performance for every server you add to the cluster.

Improving the cluster also means tuning the configuration settings. There are literally a thousand configuration properties you can set and adjust that alter the behavior and performance of the cluster. You will need to adjust the parameters to match the type of jobs that you run on your cluster. You should rely on your Hadoop distribution partner to help in this effort; it is a key part of the value that they add.

Improve the processes

We recommend that in most cases you take the simplest, most direct approach in your first solution to a big data problem. In other words, use the brute force approach and get something working. In massively parallel processing, the brute force approach often works surprisingly well. You may waste hours of development work creating an optimized or elegant solution that does not significantly improve the performance of the application. Go with “good enough” first. You can always make it better later.

It is the nature of big data to constantly grow in size and complexity. A process that finished quickly enough with 1 terabyte (TB) may not meet the service level agreement (SLA) at 10 TB. As such, you will need to refactor your solution periodically. For instance, partitioning the data can have a significant impact on performance in certain scenarios.
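To see why partitioning can help, consider the following toy sketch in Python. It is not Hadoop code; it only illustrates the idea that once records are grouped by a partition key (here a hypothetical `date` field), a query for one day reads only that day's partition instead of scanning everything. The function names and sample transactions are made up for illustration.

```python
from collections import defaultdict


def partition_by(records, key):
    """Group records into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)


def total_for_day(partitions, day):
    """Sum amounts for one day, touching only that day's partition."""
    return sum(r["amount"] for r in partitions.get(day, []))


# Hypothetical transaction records.
transactions = [
    {"date": "2024-01-01", "amount": 120},
    {"date": "2024-01-01", "amount": 30},
    {"date": "2024-01-02", "amount": 75},
    {"date": "2024-01-03", "amount": 10},
]

by_date = partition_by(transactions, "date")
print(total_for_day(by_date, "2024-01-01"))  # 150
```

The benefit depends on the queries you run. Partitioning by date helps a daily report, but does little for a query that filters on some other field.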

Your performance over time should look like a saw blade: sharp, immediate improvements after each refactoring, followed by gradual degradation as the data grows.

Challenges

Some of the challenges that you faced in the R&D phase, such as finding experienced engineers, will persist. You will also face new challenges in the first production phase.

Resource conflicts

Your big data application is now in production, so a delay or failure has consequences for the business. It is easy to guarantee resources when you have only one tenant on the cluster. Things become a bit more complicated as you expand the usage of the cluster and add more tenants and more processes.

With YARN, Hadoop has evolved to enable multi-tenant usage of the cluster. Enabling the functionality and avoiding all conflicts, however, are not the same. There are always finite resources. You will face the need for setting priorities between tenants and even between processes. This will require negotiation, leadership and some accounting.
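The accounting part can start very simply. The sketch below, in Python, assumes that each tenant has negotiated a percentage share of the cluster, loosely in the spirit of percentage-based scheduler queues. It is not YARN configuration; the tenant names, the shares and the 2048 GB of cluster memory are hypothetical.

```python
def guaranteed_memory(total_gb, shares):
    """Split cluster memory between tenants by percentage share.

    `shares` maps a tenant name to its percentage of the cluster.
    The percentages must add up to 100.
    """
    if abs(sum(shares.values()) - 100) > 1e-9:
        raise ValueError("tenant shares must add up to 100 percent")
    return {tenant: total_gb * pct / 100 for tenant, pct in shares.items()}


# Hypothetical tenants and shares, agreed through negotiation.
shares = {"finance": 50, "marketing": 30, "adhoc": 20}
allocation = guaranteed_memory(2048, shares)
for tenant, gb in allocation.items():
    print(f"{tenant}: {gb:.0f} GB guaranteed")
```

Forcing the shares to add up to 100 percent makes the trade-off explicit: giving one tenant more means taking it from another.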

Knowing what is in production

A major transition for the first production cluster is how changes are deployed. Up till now, your engineers probably worked directly on the cluster, editing and testing in an ad hoc fashion. Now that has to change.

You must control how code changes are introduced into the cluster. Do not allow developers to load scripts into the production cluster directly. Use DevOps practices: deploy new code by script and audit all changes. You should be able to know what version of what code was running at any given time.
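One way to answer the question of what was running at a given time is to keep an append-only deployment log. The following Python sketch is an assumption about how such a log might look, not a description of any particular DevOps tool. The class `DeploymentLog`, the version numbers and the script contents are hypothetical. Each entry stores the deployment time, a version label and a SHA-256 digest of the deployed script.

```python
import hashlib
from bisect import bisect_right
from datetime import datetime


class DeploymentLog:
    """Append-only record of which code version was deployed and when."""

    def __init__(self):
        self._entries = []  # (deployed_at, version, sha256), in time order

    def record(self, deployed_at, version, script_bytes):
        """Record a deployment and return the digest of the script."""
        if self._entries and deployed_at < self._entries[-1][0]:
            raise ValueError("deployments must be recorded in time order")
        digest = hashlib.sha256(script_bytes).hexdigest()
        self._entries.append((deployed_at, version, digest))
        return digest

    def version_at(self, moment):
        """Return (version, sha256) live at `moment`, or None if nothing was."""
        times = [entry[0] for entry in self._entries]
        index = bisect_right(times, moment)
        if index == 0:
            return None
        _, version, digest = self._entries[index - 1]
        return version, digest


# Hypothetical deployments.
log = DeploymentLog()
log.record(datetime(2024, 1, 5, 9, 0), "1.0.0", b"SELECT 1;")
log.record(datetime(2024, 2, 1, 9, 0), "1.1.0", b"SELECT 2;")
print(log.version_at(datetime(2024, 1, 20))[0])  # 1.0.0
```

The digest lets you verify that the script on the cluster is byte-for-byte the one that was approved, which a version label alone cannot do.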

Creating a data-driven culture

Our discussion has focused on technology and its implementation, but the technology is not an end in itself. The data produced and processed by the technology is not even an end in itself. They are but means to an end, which is to change the way you conduct business, to embrace data as a core to decision making.

At the end of it all, big data is about culture change, which is much harder than mere technology implementation. Big data is a tool, a powerful tool and, yes, even an expensive tool, but unless you change the way you do business and embrace the power the tool provides, your success will be limited. Without culture change, you will not achieve the full benefit of the costs you have incurred.
