This blog post is the first in a series based on the ebook Modern Business Intelligence: Leading the Way to Big Data Success. We wrote this book because it’s such an exciting time for data-driven companies. They’re exploring new technologies and new approaches to create competitive advantage and even market disruption. Traditional approaches to data management are no longer enough because they can’t efficiently handle the volume, velocity, and variety of data found in today’s business environment. Fortunately, data technologies have matured, and organizations can now successfully use their big data to drive business operations. What’s next for these organizations? A focus on end users, particularly those in BI and analytics, and on self-service BI, which is especially important for gaining competitive advantage through data agility.
We believe our book provides a window into emerging trends in big data and the need for analytics solutions that emphasize more user-friendly approaches, such as more sophisticated visualization techniques. Throughout the book, we’ll share our observations of the industry and what today’s businesses need to consider to remain competitive.
Stay tuned to our blog, where we’ll feature posts summarizing individual chapters from the ebook. The first chapter provides a brief overview of the big data ecosystem, with an emphasis on Apache Hadoop and Apache Spark, as well as a few other platforms. Here is a brief summary.
The Big Data Ecosystem Starts with Apache Hadoop
Hadoop adoption continues to grow at a dramatic pace. In fact, Hadoop is predicted to grow at a compound annual growth rate of 59% through 2020, according to Research and Markets. At that rate it would more than double every two years, a pace at least in line with IDC’s predictions of growth in overall data volume. In addition, Forrester Research predicted that eventually all enterprises will adopt and deploy Hadoop somewhere in their organization. A client survey by Gartner in September 2016 showed that 73% of organizations either have invested or plan to invest in big data, and that number increases to 86% for large enterprises.
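As a quick check on what a 59% compound annual growth rate implies, here is a minimal Python sketch of the arithmetic. The function names are ours, chosen for illustration, and the calculation assumes the rate stays constant from year to year.

```python
import math

def growth_multiple(cagr: float, years: float) -> float:
    """Total growth factor after `years` of compounding at annual rate `cagr`."""
    return (1.0 + cagr) ** years

def doubling_time_years(cagr: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + cagr)

cagr = 0.59  # the 59% figure quoted above, held constant (an assumption)
two_year_multiple = growth_multiple(cagr, 2)
years_to_double = doubling_time_years(cagr)

print(f"Growth over two years: {two_year_multiple:.2f}x")
print(f"Doubling time: {years_to_double:.2f} years")
```

Under that assumption, the result is growth of roughly 2.5x over two years, or a doubling about every year and a half.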
Hadoop is Considered the “De Facto Standard” for Data Lakes
Hadoop’s main feature is its ability to capture and store multiple types of data. It typically uses the MapReduce engine, which splits a job into tasks that run in parallel on the data nodes of a deployment. Hadoop also provides a comprehensive framework for high-level big data analytics, and it has built-in redundancy (data blocks are replicated across nodes) to prevent data loss should a node fail.
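To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop’s API: the function names and the two-partition input are hypothetical, and the map step that a cluster would run in parallel on separate data nodes simply runs in a loop here.

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of the input."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle_phase(mapped_partitions):
    """Group emitted values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_partitions):
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, split across two "data nodes"
partitions = [
    ["big data needs big storage"],
    ["hadoop stores big data"],
]

# On a real cluster each partition would be mapped in parallel on its own node
mapped = [map_phase(partition) for partition in partitions]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each partition is mapped independently, the work can be spread over as many nodes as there are partitions, which is the property that lets this model scale.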
IT and BI Users Alike Are Empowered to Use Hadoop
It’s important to keep in mind that Hadoop isn’t a rogue set of technologies. It embraces popular technologies like SQL, RDBMSs, and OLAP, so it has attracted IT professionals who have been trained in traditional query languages. A whole ecosystem of SQL-on-Hadoop and OLAP-on-Hadoop offerings has emerged, such as:
- Apache Hive. Considered the first native SQL-on-Hadoop engine, Apache Hive is most often used with traditional BI tools for batch and ETL processes.
- Apache Impala. Also a SQL-on-Hadoop engine, Impala differs from Hive in that it is built for low-latency, interactive queries, which makes it a good fit for BI and analytics operations.
- Apache Drill. Drill is the first SQL query engine that uses a schema-free JSON object model. It defines schema dynamically instead of requiring predefined schemas.
- Spark SQL. Spark SQL uses DataFrames, which support both structured and semi-structured data. It provides a domain-specific language (DSL) to work with DataFrames in Scala, Java, or Python.
Since Hadoop provides support for these traditional approaches, it empowers BI professionals and other IT users to benefit from big data analytics.
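The idea of defining schema dynamically, mentioned for Apache Drill above, can be illustrated with a small plain-Python sketch. This is not how Drill is implemented, and the `infer_schema` helper and sample records are hypothetical; it only shows field names and types being discovered from the data at read time rather than declared up front.

```python
import json

def infer_schema(json_lines):
    """Discover field names and the value types observed for each field."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

# Hypothetical records: fields may appear, disappear, or hold nulls
records = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "raj", "clicks": 7, "country": "IN"}',
    '{"user": "lee", "clicks": null}',
]

schema = infer_schema(records)
for field, types in sorted(schema.items()):
    print(field, sorted(types))
```

No table definition was needed before reading the records, and the new `country` field was picked up as soon as it appeared.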
Adding a Little Spark to the Ecosystem
Most likely you’ve already heard of Apache Spark, the general-purpose processing engine built around speed, ease of use, and sophisticated analytics. It plays well with others: it works with Hadoop and can access data sources like HDFS and Apache HBase.
There are a number of reasons why it’s been so widely adopted:
1) It processes data in memory. This is typically faster than MapReduce, which writes intermediate results to disk between processing steps.
2) Spark supports multiple languages through its APIs for Java, Python, Scala, and R, along with SQL, all of which are popular with data scientists and BI professionals.
3) Spark can be used in almost any setting, whether it’s cloud, on-premises, or a hybrid.
4) You can also use Spark to perform advanced data operations that you commonly use with databases such as “filter,” “join,” and “group by.”
5) Spark provides its own graph data support via its GraphX library, so you can quickly analyze relationships between entities.
It’s important to keep in mind that no technology, even Spark, will handle all of your big data needs, and Spark has some limitations you should be aware of. First, it’s generally geared toward technical audiences, not business analysts. Spark lacks business user-oriented visualizations, which is why modern visualization technologies like Arcadia Enterprise play an important role. Second, Spark doesn’t have its own file management system, so it relies on external storage such as HDFS. Third, it doesn’t support true record-by-record real-time processing: its streaming model processes data in batches or micro-batches, and each batch has to be scheduled and executed separately.
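The micro-batch model can be sketched in a few lines of plain Python. This is not Spark’s API; the helper names are hypothetical, and batches are cut by record count here for simplicity, whereas a streaming engine would typically cut them by time interval.

```python
from itertools import islice

def micro_batches(events, batch_size):
    """Yield successive small batches from an incoming stream of events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    """Stand-in for a per-batch job; here it just totals the values."""
    return sum(batch)

# Hypothetical event stream of numeric readings
events = [5, 1, 4, 2, 8, 3, 7]

# Each batch is processed as its own separate unit of work
batch_totals = [process_batch(batch) for batch in micro_batches(events, batch_size=3)]
print(batch_totals)
```

An event is only processed once its batch is complete, so latency is bounded by the batch interval rather than by the arrival of each individual record.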
Other Data Platform Players in the Market
As you can see in the well-known Data Platforms Map from 451 Research, the data platform landscape is very complex. Many technologies have emerged since the early days of RDBMSs. NoSQL, NewSQL, and object stores are just a few of the other players in the market. Read Chapter 1 in our ebook to find out more about these platforms.
Don’t Be Left Behind
As companies seek more data, more agility, and more insights, all at lower cost, a new paradigm is required. Simply applying legacy BI technologies to the modern world of big data is not going to cut it. Data-driven organizations are adopting new approaches such as SQL-on-Hadoop, OLAP on big data, Spark, NoSQL, NewSQL, object stores, and native visual analytics platforms like Arcadia Data.
In the next blog post in this series, we’ll look at BI and analytics and discuss the present and future of enterprise reporting along with the rise of self-service BI reporting. We’ll also highlight a Hadoop analytics case study that features a powerful self-service BI environment that enables agility and collaboration across teams. All that and more is in Chapter 2: BI and Analytics Meet Business Transformation. Be sure to check out the entire ebook.