August 30, 2018 - John Thuma | Big Data Ecosystem

What about Excel? Can I connect Excel to Hadoop?

A few days ago someone asked me why people aren't using Microsoft Excel more with Apache Hadoop. I started to think about it, and on the surface it made sense. After all, the biggest databases and most important applications in the world run on Microsoft Excel, right? Some will laugh at this statement, but there is truth to it.

I started poking around the internet and stumbled upon a page that showed me how to connect Microsoft Excel to Hive and Hadoop. It claimed that all you need is an ODBC driver and a connection string, and you could pull data out of Hadoop and into Excel. Hmmmm?? Intriguing! In my 30 years of experience, I have seen some pretty snazzy and amazing Excel workbooks. Still… Excel has major limitations in this use case, and it is not an enterprise BI application. Like many big data applications, it is incomplete by itself, requiring other tools to make it work well with big data.
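For the curious, here is a minimal sketch of that ODBC route in Python (pyodbc) rather than Excel's import wizard. The driver name, host, table, and credentials are all placeholders; the exact connection-string keywords depend on which vendor's Hive ODBC driver you install.

    # A minimal sketch of the ODBC route described above, using Python's
    # pyodbc instead of Excel's built-in import wizard. Driver name, host,
    # and credentials are placeholders -- the exact keywords depend on
    # which vendor's Hive ODBC driver you install.
    import pyodbc

    conn_str = (
        "Driver={Hortonworks Hive ODBC Driver};"  # assumption: vendor driver name
        "Host=hive.example.com;"                  # hypothetical HiveServer2 host
        "Port=10000;"
        "HiveServerType=2;"
        "AuthMech=3;"                             # 3 = username/password in most Hive drivers
        "UID=analyst;PWD=secret;"
    )

    conn = pyodbc.connect(conn_str, autocommit=True)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM sales LIMIT 10")  # 'sales' is a hypothetical table
    for row in cursor.fetchall():
        print(row)
    conn.close()

That works fine for ten rows. The trouble starts when the table has ten billion.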

To put it simply, Excel has major limitations when it comes to scale. These limitations present real challenges, since Apache Hadoop can house trillions of records of all types. With Excel, you are limited to 1,048,576 rows and 16,384 columns per worksheet, and each cell can only contain 32,767 characters. You are also limited by the amount of available memory and system resources on the machine running the Excel application. So by no means is MS Excel a big data tool that can be used for big data analytics or data munging. Excel is generally a desktop application running on a standard personal computer, and it does not scale out. But even if it could scale, you would still be limited by the number of rows and columns specified above.
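To put some numbers on that, here is a back-of-the-napkin Python check against Excel's documented ceilings. The 2-billion-row table is a made-up example, not a benchmark.

    # Excel's documented hard limits per worksheet; the 2-billion-row
    # table below is a made-up example, not a real measurement.
    EXCEL_MAX_ROWS = 1_048_576
    EXCEL_MAX_COLS = 16_384   # each cell also caps out at 32,767 characters

    def fits_in_excel(row_count: int, col_count: int) -> bool:
        """True only if a result set fits on a single worksheet."""
        return row_count <= EXCEL_MAX_ROWS and col_count <= EXCEL_MAX_COLS

    print(fits_in_excel(2_000_000_000, 50))   # False
    print(2_000_000_000 // EXCEL_MAX_ROWS)    # 1907 -- roughly that many worksheets needed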

Excel is not an enterprise-level tool, meaning it lacks controls, authorization, authentication, security, backup, and the simple ability to share outcomes. Wait a minute, you say! I can put a password on my Excel spreadsheet, and I can back it up. True, but these are not robust security measures; they can be cracked by a $39 software product called Passware (www.passware.com/excel). Once your data leaves Hadoop, or the data lake, many bad things can happen: it can get lost, accidentally emailed to someone who shouldn't have it, and/or copied and sent around the organization. Having a centralized security model is key, and keeping your data in a safe and encrypted space is important. Having data in a local Excel file means you can easily combine it with other data sources and render once de-identified data identifiable. As your friends who know anything about HIPAA compliance and protected health information (PHI) can tell you, once data leaves the central governed repository it should be assumed unsafe and a potential security/privacy risk to your organization. Excel is definitely not an enterprise tool, and as we will find out, it is really an incomplete solution for big data analytics.

Excel always leaves me wanting more. More size, more scalability, and more enterprise features. So what do we have to do? We have to build OLAP cubes and rollup databases that shrink the data down to sizes small enough for Excel-scale analytics. This takes away the ability to drill down to your base records. You also have to put extra software on top of Excel, like PivotTable Services. There is no end in sight to the exotic contraptions I can build to make Excel feel like a big data analytics tool. But this brings complexity, development cost, and many additional tools to the party. It also means ETL, which can lead to latency and stale data. The other issue is that if I have a bunch of Excel workbooks with sensitive data in them, I had better have the entire hard drive encrypted while the data is at rest. Most large organizations have this capability, as well as LDAP and other directory services to lock down the device. However, there is very little to stop someone from combining data sets to re-identify patients, payment card data, and other highly sensitive data. Excel by itself is not the answer.
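Here is what that "shrink it first" workaround typically looks like in practice: a rough Python/HiveQL sketch that pre-aggregates server-side so only a summary ever reaches the worksheet. The DSN, table, and column names are hypothetical.

    import pyodbc

    # 'HiveProd' is a hypothetical DSN; 'sales' and its columns are made up.
    conn = pyodbc.connect("DSN=HiveProd;UID=analyst;PWD=secret", autocommit=True)
    cursor = conn.cursor()

    # Hive computes the subtotals server-side; only the summary crosses the wire.
    # The drill-down to base records is gone the moment this runs.
    cursor.execute("""
        SELECT region, department, SUM(net_sales) AS total_sales
        FROM sales
        GROUP BY region, department WITH ROLLUP
    """)
    summary = cursor.fetchall()   # thousands of rows instead of trillions
    conn.close()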

What you need is a single tool that provides a single pane of glass for analytics against big data. A tool that implements security from the source, like Apache Ranger or Sentry. The tool should also enable row-level security and stop people from bringing in foreign data sources that can be combined to re-identify anonymous data. You also need a tool that is simple to use. Everything should be drag and drop, the tool should help you make choices about visualizations, and it should enable you to accelerate objects without hiring a band of code 'Special Forces' to make it happen. Accelerating a dashboard for performance should be easy enough for business users to do themselves, without IT needing to be involved. You also need a tool that enables natural language processing, or what I like to call 'Search.' Gone are the days when the only way to search was 'SELECT COUNT(*) FROM….' No way! Instead, you want to ask human-readable questions like: "Give me all net sales in the SE from the toys department." You want all of this through an easy-to-use, locked-down browser interface that you can use from your mobile phone. What you want is Arcadia Enterprise! Check out this video to learn more: Datasets and Dashboards
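For contrast, here is roughly the SQL a search-style question like that might resolve to behind the scenes. The table and column names are illustrative only; this is the general idea, not Arcadia's actual query translation.

    # Illustrative only -- a made-up mapping from a plain-English question
    # to the SQL it might resolve to. Not Arcadia's actual translation.
    question = "Give me all net sales in the SE from the toys department"
    question_sql = """
        SELECT SUM(net_sales) AS net_sales
        FROM sales
        WHERE region = 'SE' AND department = 'Toys'
    """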

