August 7, 2018 - Richard Tomlinson | Big Data Ecosystem

Typical Cloud BI Deployment Patterns with Arcadia Data

An increasing number of our customers and prospects are asking us what options we provide for cloud BI with Arcadia Enterprise. Some customers are just experimenting, others are moving their test and dev environments off premises, and a few brave souls are all-in on cloud for their enterprise production applications. The key driver for this shift is cost reduction. Why pay for expensive hardware, data centers, and the resources to manage all of this when you can offload that expense to a cloud services provider? Sounds like a good idea, right? Well, the answer is maybe, depending on how you want to leverage cloud.

This article focuses on how customers are deploying Arcadia Data in the cloud and discusses the pros and cons of the various approaches.

Architecture Patterns

We have seen three main ways in which people are already deploying Arcadia Data in the cloud, so let’s discuss these in turn.

Basic Lift and Shift

By far the most common (and admittedly unexciting) architecture pattern for Arcadia Data in the cloud is what I call basic lift and shift. I call it this because it is exactly the same pattern as a typical on-premises deployment, simply picked up and moved into an identical architecture in the cloud. This is outlined in the following graphic.

In this diagram, the customer has chosen to run their on-prem Apache Hadoop distribution (e.g., CDH) within virtual machines on a typical cloud compute service like Amazon EC2. The white boxes represent Hadoop data nodes on which ArcEngine is running (represented as the red circles) alongside other compute engines such as Apache Impala, Apache Hive, and Apache Spark (represented by the white circles). On the very same Hadoop data nodes, HDFS is used for data storage. The raw data is spread across the data nodes accordingly (smaller white rectangles above), the idea being that each node processes its local data in parallel before the results are merged and passed back to the requesting application. Arcadia Data analytical views (red cubes) are also stored next to the raw data in HDFS. This model is often called a coupled architecture, in that compute and storage live together on the same physical node.
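If you want to see that colocation for yourself, HDFS will tell you exactly which data nodes hold the blocks of a given data set. The following is a minimal sketch, assuming a standard HDFS deployment with the hdfs CLI available on the node running it; the path /data/events is purely a hypothetical example.

```python
import subprocess

# Ask HDFS where the blocks of a (hypothetical) raw data set physically live.
# In the coupled model, these block locations are the same nodes that run
# ArcEngine, Impala, Hive, and Spark, which is what makes local, parallel
# processing of the data possible.
report = subprocess.run(
    ["hdfs", "fsck", "/data/events", "-files", "-blocks", "-locations"],
    capture_output=True,
    text=True,
    check=True,
)
print(report.stdout)
```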

This is definitely the simplest model to deploy, and it does reduce some costs versus the on-prem model, because the customer pays the cloud service provider to host the environment rather than acquiring their own hardware and paying for the associated data center and the resources needed to support it. However, they are really only saving the costs of running the hardware rather than the complete environment. The customer is still responsible for paying for and managing the Hadoop distribution within this environment, which is significant. A big disadvantage of this architecture compared with other cloud models is that scaling up and down is very hard. If the customer needs to add or remove nodes, they have to rebalance the entire HDFS file system to redistribute the data evenly. This is an extremely time-consuming and costly exercise that often takes the cluster out of commission for a significant period of time.
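To make that cost concrete, here is a minimal sketch of the kind of step an administrator typically has to run after changing the node count, assuming the standard HDFS balancer is used and the hdfs CLI is on the PATH; the 10 percent threshold is just an illustrative value.

```python
import subprocess

# After adding or removing data nodes, HDFS does not move existing blocks on
# its own. An administrator runs the balancer, which shuffles blocks between
# nodes until each node's disk usage is within the given threshold of the
# cluster average. On large clusters this can run for hours or days and
# consumes network and disk bandwidth the whole time, which is the
# rebalancing cost described above.
subprocess.run(["hdfs", "balancer", "-threshold", "10"], check=True)
```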

Decoupled Lift and Shift

The next most common architecture pattern that our customers use when deploying Arcadia Data is what I call decoupled lift and shift. I should probably think of a better name for this pattern, but you get the general idea. The basic difference from the previous model is that the storage tier has been decoupled from the compute tier. This is shown in the following graphic.

We see that the compute engines (ArcEngine, etc.) still run inside the on-prem Hadoop distribution, which is hosted in the cloud using the appropriate compute service. The big difference here, however, is that no data is stored on the local disks of the compute nodes. In this model, the raw data (and the Arcadia Data analytical views) are stored in the cloud service provider’s storage service. This is typically object storage, which is very different from HDFS and much better suited to the cloud for a variety of reasons.

Since all of the Hadoop distribution vendors now support object storage in addition to HDFS, this decoupling is possible. The data is simply uploaded to cloud storage and accessed over the network from the compute tier. The big advantage of this model is that the storage and compute tiers can be scaled independently. If an organization wishes to increase (or decrease) compute power, it just adds or removes compute nodes. There is no need to rebalance data across local disks as in the prior model, since the storage tier is a completely separate environment. Similarly, if more storage is needed, it can be scaled up independently without adding compute nodes.
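As an illustration of the decoupling, the sketch below uses boto3 to upload a raw data file to Amazon S3; the compute tier then reads it over the network through an object-store URI (for example an s3a:// path) rather than from local disk. The bucket, file, and key names are hypothetical, and the sketch assumes AWS credentials are already configured.

```python
import boto3

# Upload a raw data file to object storage. Nothing about this step depends
# on the size or shape of the compute tier.
s3 = boto3.client("s3")
s3.upload_file(
    "events_2018_08.parquet",                 # local file (hypothetical)
    "example-analytics-bucket",               # bucket name (hypothetical)
    "raw/events/events_2018_08.parquet",      # object key (hypothetical)
)

# Compute engines on the separately scaled cluster would then reference the
# data over the network, e.g. via an object-store URI such as
#   s3a://example-analytics-bucket/raw/events/
# instead of an HDFS path on local disk.
```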

This model starts to introduce the ability to build a somewhat elastic Hadoop architecture, since customers can provision, scale up, scale down, and deprovision compute nodes as needed for a particular project or workload. Being able to scale down and even deprovision the compute cluster saves money, since most cloud providers charge on a metered, pay-as-you-go basis. If you don’t need the services, you can remove them and stop paying without impacting the most valuable asset of all: the data.
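To give a flavour of what “scale down and stop paying” looks like in practice, here is a hedged sketch using boto3 against EC2: worker instances that run the compute engines are launched for a workload and terminated when it finishes, while the data in object storage is untouched. The AMI ID, instance type, and tag values are hypothetical placeholders, not a recommendation.

```python
import boto3

ec2 = boto3.client("ec2")

# Provision extra compute nodes for a heavy workload (hypothetical AMI with
# the Hadoop distribution and ArcEngine pre-installed).
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI id
    InstanceType="r4.4xlarge",         # illustrative instance type
    MinCount=4,
    MaxCount=4,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "hadoop-compute"}],
    }],
)
worker_ids = [i["InstanceId"] for i in response["Instances"]]

# ... run the workload ...

# Deprovision the compute nodes when the workload is done. The raw data and
# analytical views in object storage are unaffected, so billing for these
# instances simply stops.
ec2.terminate_instances(InstanceIds=worker_ids)
```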

The model does have some disadvantages as a cloud architecture, however, as it requires a lot of manual intervention. Scaling up or down is usually a manual process, in that someone on the customer side has to decide how many nodes are required and when to provision and deprovision them. Additionally, the customer still owns the overall management and administration of the Hadoop system. This is essentially a hosted model, and the cloud service provider (e.g., Microsoft, Amazon, Google) is not on the hook for SLAs related to the availability and usage of Hadoop. That remains the responsibility of the customer and/or the Hadoop distribution provider, which means there are still multiple throats to choke.

Decoupled Managed

The final design pattern for cloud is what I call the decoupled managed architecture. This is the most complete cloud model in that all of the components (compute, storage, and the Hadoop distribution) are provided and managed by the cloud service provider. This is seen in the next graphic.

This option is most attractive to customers that want to offload the management and costs of Hadoop entirely to the cloud service provider. For example, an Amazon EMR (Elastic MapReduce) architecture not only leverages S3 for storage and EC2 for compute, but the entire data management software stack on top of this infrastructure is also provided and managed by Amazon. Arcadia Data is then orchestrated within this environment by the Amazon management and administration tools.
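Purely as an illustration, a managed cluster of this kind might be provisioned with a call like the boto3 sketch below, which asks EMR to stand up a cluster on EC2 with logs and data living in S3. The cluster name, release label, instance types, bucket, and IAM roles are assumptions for the example; installing Arcadia Data on EMR involves additional steps (typically bootstrap actions) that are not shown here.

```python
import boto3

emr = boto3.client("emr")

# Ask the managed service to provision the whole stack: the EC2 instances,
# the Hadoop/Hive/Spark software, and the wiring between them.
cluster = emr.run_job_flow(
    Name="arcadia-demo-cluster",                     # placeholder name
    ReleaseLabel="emr-5.16.0",                       # illustrative release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    LogUri="s3://example-analytics-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "r4.2xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", cluster["JobFlowId"])
```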

Another big advantage of this pattern is that it is much more elastic than our previous examples. Using Amazon’s EMR Management Console, for example, it is easy to configure the system to add and remove compute nodes containing ArcEngine processes on a predetermined schedule, or even dynamically as workloads require. The cloud Hadoop distribution and all of the surrounding software are written from the ground up to work together in the same environment, making it much simpler to provision and manage the environment over time. There are also cost benefits, since all of the components of the architecture are owned by the same vendor.
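The same kind of rule can also be expressed through EMR’s auto-scaling API rather than the console. The boto3 sketch below attaches a policy that adds nodes when YARN reports low available memory; the cluster and instance-group IDs are placeholders, and the thresholds, capacities, and cool-down period would need to be tuned to the actual workload.

```python
import boto3

emr = boto3.client("emr")

# Attach an auto-scaling policy to an existing instance group so that EMR
# adds two nodes whenever available YARN memory stays below 15% for 5 minutes.
emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",            # placeholder cluster id
    InstanceGroupId="ig-XXXXXXXXXXXXX",     # placeholder instance group id
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [{
            "Name": "ScaleOutOnLowMemory",
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 2,
                    "CoolDown": 300,
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "ComparisonOperator": "LESS_THAN",
                    "EvaluationPeriods": 1,
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "Namespace": "AWS/ElasticMapReduce",
                    "Period": 300,
                    "Statistic": "AVERAGE",
                    "Threshold": 15.0,
                    "Unit": "PERCENT",
                }
            },
        }],
    },
)
```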

Summary

We have outlined the three most common cloud deployment patterns above, but they are not the only architectures available to customers. Many customers mix and match their on-prem and cloud architectures in a hybrid model, where some elements of the implementation are in the cloud while other components are still hosted in the customer’s own data centers. Most recently, we have even seen a movement to remove all of the layering from the cloud architecture completely. In this environment, all components, including storage, compute, metadata, orchestration and administration, identity management, and so on, are separated into distinct platform (PaaS) and infrastructure (IaaS) services that talk to each other remotely across the network. In this model, no Hadoop distribution (as we currently know it) is required at all.

Is your Arcadia Data or other BI deployment in the cloud today? Does it map to any of the patterns described above? Let us know; we would love to understand more about your environment and your plans for cloud with Arcadia Data. Drop us a note!
