Big data should have been the rocket fuel that launched business intelligence (BI) into the stratosphere, so why are so many companies still stuck in neutral when it comes to harnessing the power of Apache Hadoop, or, more specifically, BI on Hadoop, for improved business insights?
It’s because BI has failed to keep up with the changing data landscape. The cost, rigidity and overhead requirements of legacy BI systems have confined them to a limited set of use cases. Even though Hadoop celebrated its 10th birthday last year, most legacy BI programs still struggle to take advantage of the flexibility and cost innovations it introduced.
The roots of this disconnect go back to the legacy architectures of most popular BI systems, which were developed in the 1980s and 1990s. BI introduced some important innovations, such as multidimensional modeling and rich visualizations, but those benefits came with costs and compromises that haunt legacy systems to this day. Among them are the following.
- Dedicated servers are often required to run BI models, and the cost of that overhead forces organizations to limit usage. Some BI platforms still require dedicated servers built on expensive and proprietary “scale-up” architectures (rather than “scale-out,” parallel or distributed ones), which require wholesale replacement of processors, memory and storage.
- Legacy BI systems also grew up around a client/server architecture, requiring extensive server processing power and large file transfers to and from client PCs. This was the most efficient processing model at the time, but today it’s a relic. The need to continually pass data back and forth consumes bandwidth and bogs down performance. Some tools even store data locally on the PC, creating a gaping security hole.
- Many BI platforms work only on highly structured and cleaned data stored in the rows and columns of relational tables. Extensive preparation is needed prior to loading this data into the BI server, which limits the universe of business users who can take advantage of the tools. While some legacy BI platforms have evolved to handle unstructured data, doing so requires a Herculean data transformation effort that few companies can afford.
- Most BI tools also can’t process real-time or streaming data, which IDC estimates will make up about one-quarter of the information organizations will need to handle by 2025.
- Reliance on dedicated BI servers also means organizations must make copies of production data, which introduces a variety of security, data integrity, and consistency problems. Best practices for database management dictate that copies should be avoided, but BI servers require them.
Fast-forward 20 years, and it is clear this legacy has held back the promise of BI. For many organizations, populating the BI data store still involves excruciating manual effort to transform, normalize and validate data, a process that may consume days or even weeks of a data scientist’s time. And the share of corporate data that BI tools can process is shrinking. Only about 20% of corporate data is stored in a structured format; the other 80% is in free-form documents, emails, text messages, PDFs and multimedia – and the volume is growing rapidly.
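That transform-normalize-validate grind can be sketched in a few lines. This is a minimal, hypothetical example – the field names, date formats and records are invented for illustration, not taken from any particular BI product:

```python
from datetime import datetime

# Hypothetical raw export: inconsistent date formats, stray whitespace, bad rows.
raw_records = [
    {"date": "2017-03-01", "region": " EMEA ", "amount": "1200.50"},
    {"date": "03/02/2017", "region": "APAC", "amount": "980"},
    {"date": "not-a-date", "region": "AMER", "amount": "oops"},
]

def normalize(record):
    """Transform, normalize and validate one record; return None if invalid."""
    date = None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):  # try each known date format
        try:
            date = datetime.strptime(record["date"], fmt).date()
            break
        except ValueError:
            continue
    if date is None:
        return None  # validation failure: unparseable date
    try:
        amount = float(record["amount"])
    except ValueError:
        return None  # validation failure: non-numeric amount
    return {
        "date": date.isoformat(),
        "region": record["region"].strip(),  # normalize whitespace
        "amount": amount,
    }

clean = [r for r in (normalize(r) for r in raw_records) if r is not None]
print(clean)  # two valid rows survive; the malformed third is dropped
```

Multiply this by hundreds of fields and dozens of sources, and the days-to-weeks figure above stops looking like an exaggeration.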
Collectively, these limitations bind users’ hands just as the big data revolution promises to set them free. The BI industry has come up with a variety of work-around approaches to connecting legacy tools with Hadoop data stores. These include query accelerators, software data transformation layers, and SQL-on-Hadoop translators. All have problems, however. Software layers introduce performance penalties, make monitoring a chore and add to IT management overhead. SQL-on-Hadoop projects each support a different subset of the ANSI SQL standard that is the default for nearly every business intelligence suite, meaning that the results of translated queries may be incomplete or even wrong.
BI on Hadoop Without Compromise
There is another approach worth considering: Replace legacy BI tools with a modern, integrated and massively parallel infrastructure that incorporates Hadoop as a native data store and integrates a business semantic layer as an integral component rather than a bolted-on extra.
One of Hadoop’s greatest virtues is its low-cost, clustered architecture that scales out linearly as nodes are added. Extracting data from a Hadoop cluster to load into a BI server makes no sense if operations can be performed directly on the cluster. Yet most legacy BI tools force this intermediate stage. A better solution runs the BI engine in parallel directly on the Hadoop data nodes, minimizing extracts, copies, translation servers and other intermediate steps.
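The scale-out principle behind this is simple: each node aggregates its own partition, and only the small partial results travel over the network. The toy sketch below simulates it with in-process dictionaries standing in for data nodes (node names and sales figures are invented):

```python
from collections import Counter

# Simulated data nodes, each holding its own partition of (region, amount) events.
node_partitions = {
    "node-1": [("EMEA", 100), ("APAC", 50), ("EMEA", 25)],
    "node-2": [("APAC", 70), ("AMER", 10)],
    "node-3": [("AMER", 40), ("EMEA", 5)],
}

def local_aggregate(rows):
    """Runs where the data lives: each node sums only its own partition."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals

# Scatter: aggregate on every node in parallel (sequentially here, for clarity).
partials = [local_aggregate(rows) for rows in node_partitions.values()]

# Gather: merge the small partial results instead of shipping every raw row.
merged = sum(partials, Counter())
print(dict(merged))  # {'EMEA': 130, 'APAC': 120, 'AMER': 50}
```

Extracting the data to a BI server instead would mean moving all seven raw rows (or, in practice, billions) across the wire before any computation starts.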
Modern BI solutions also enable users to ditch the PC. For data analytics and visualization, today’s browsers are just as capable as desktop computers, as well as being more secure and simpler to manage. They’re easier to update and configure, and they enable users to access their models from anywhere without requiring a download or browser plugin.
BI servers optimized for big data engines can also seamlessly accommodate a wide range of new data types – such as free-form text and semi-structured machine logs – without extensive cleansing and preparation. They can also field information from streaming sources such as Confluent’s KSQL for Apache Kafka, a factor that will only become more important as BI evolves from its decision-support roots into a real-time recommendation engine.
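To see why semi-structured machine logs need no upfront schema, consider the sketch below. The log format is hypothetical – a timestamp followed by key=value pairs, loosely modeled on common device logs – and the parser derives structure at read time rather than requiring cleansing beforehand:

```python
import re

# Hypothetical semi-structured machine log lines (format invented for illustration).
log_lines = [
    "2018-01-15T09:30:00 device=stb-042 event=channel_change value=7",
    "2018-01-15T09:30:05 device=stb-017 event=power_on",
]

PATTERN = re.compile(r"(?P<ts>\S+)\s+(?P<kv>.*)")

def parse_line(line):
    """Turn one log line into a dict; fields vary per line, no fixed schema."""
    m = PATTERN.match(line)
    record = {"ts": m.group("ts")}
    for pair in m.group("kv").split():
        key, _, value = pair.partition("=")
        record[key] = value
    return record

rows = [parse_line(line) for line in log_lines]
print(rows[0]["event"])  # channel_change
```

Note that the second line simply lacks a `value` field; schema-on-read tolerates that, whereas a relational BI loader would reject the row or demand a transformation step first.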
Structured data will be around for a long time to handle business’ transactional needs, but the most exciting new opportunities are in applications that also integrate unstructured sources. If your BI engine can’t combine sales data with machine-generated data from set-top boxes or manufacturing equipment, correlate it with sentiment analysis derived from Twitter conversations, or measure the effect of a new ad campaign on demographic customer segments, then it’s probably time to think about a new approach based on today’s most powerful big data technology.