Implementing data security controls can be hard, so it is up to technology vendors to work together to make security as simple as possible. This is especially true in big data environments, where there are many pieces to integrate. When it comes to working with the Hortonworks Data Platform (HDP), tight integration with Apache Ranger is essential. Fittingly, Hortonworks shared some integration stories in a breakout session on partner integrations with Apache Ranger at the recent DataWorks Summit (June 13-15). The talk, titled "Partner Ecosystem Showcase for Apache Ranger and Apache Atlas," was led by Srikanth Venkat, senior director of product management at Hortonworks, and Ali Bajwa, principal partner solutions engineer at Hortonworks.
Srikanth kicked off the presentation by giving an overview of both Apache Ranger and Apache Atlas, two open source projects that address big data security and governance. Not surprisingly, the room was nearly at capacity, as security and data governance continue to be top-of-mind issues for many big data professionals. If you're not familiar with Ranger, it is a technology that provides centralized authorization controls for data in Apache Hadoop. It provides policy-based access controls on many different resources in Hadoop, so you can tag data sets and then apply authorization policies to those tagged assets. Atlas is a data governance engine that includes lineage tracking as well as the tagging capabilities that Ranger uses for policy-based authorization. Both Ranger and Atlas have strong communities that contribute to innovation, which is especially important for encouraging other technology vendors to join in.
Arcadia Data, Talend, and Protegrity are three such technology vendors that have tight integrations with the Ranger/Atlas ecosystem. Each company briefly described their integration and explained why it was important to end users.
Laurent Bride, CTO of Talend, was first to discuss his company's integration. As a data integration provider, Talend lets users transform data in HDP to prepare it for end-user needs. And since data integration pipelines go hand-in-hand with data lineage tracking, Talend's integration with the Atlas APIs gives users a central interface for viewing data lineage. He showed a demo in which he ran a transformation pipeline and then displayed the lineage in a graphical format with Atlas.
Sunil Sabat, who handles partner solutions at Protegrity, talked about how his company provides tokenization, a data protection technique that substitutes non-sensitive placeholder values for sensitive data such as PII and PHI (including data regulated under HIPAA). He noted the difference between data access and data protection: Ranger is used to prevent certain data elements from being shown, while Protegrity obfuscates sensitive data. In other words, Ranger hides the data you are not authorized to see, whereas Protegrity lets you see a modified version of it. The latter is useful when your analytics depend on the existence of a sensitive field (e.g., querying on a "social security number" column) but do not need to know the real value.
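To make the access-vs.-protection distinction concrete, here is a minimal sketch in Python. The `deny_column` function models the access-control behavior (the field simply disappears), while `tokenize_ssn` models format-preserving substitution. This is purely illustrative and not Protegrity's actual algorithm: real tokenization uses a secure token vault or vetted format-preserving encryption, not a plain hash, and the function and secret names here are invented for the example.

```python
import hashlib

def deny_column(row: dict, column: str) -> dict:
    """Access-control style (Ranger-like): the unauthorized column is withheld."""
    return {k: v for k, v in row.items() if k != column}

def tokenize_ssn(ssn: str, secret: str = "demo-secret") -> str:
    """Replace an SSN with a deterministic, format-preserving token.
    Illustrative only -- not a real tokenization scheme."""
    digest = hashlib.sha256((secret + ssn).encode()).hexdigest()
    nums = "".join(c for c in digest if c.isdigit())[:9].ljust(9, "0")
    return f"{nums[:3]}-{nums[3:5]}-{nums[5:]}"

row = {"name": "Alice", "ssn": "123-45-6789"}
hidden = deny_column(row, "ssn")                    # column disappears entirely
masked = {**row, "ssn": tokenize_ssn(row["ssn"])}   # column keeps its shape
```

Because the token is deterministic, two records with the same SSN still get the same token, so joins and group-bys keep working even though the real value is never exposed.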
In between Laurent's and Sunil's talks, Shant Hovsepian, CTO and co-founder of Arcadia Data, took the stage. Shant started with an admittedly shameless plug for his Twitter handle, @SuperDuperShant, urging the audience to follow him and professing embarrassment at having fewer followers than his mother. He then gave a brief overview of Arcadia Data, provider of a native visual analytics platform architected for big data. By "native visual analytics," he meant that Arcadia Data (or more specifically, its product, Arcadia Enterprise) runs in-cluster in HDP and is managed by Apache Ambari. His key points were that Arcadia Enterprise was the first native visual analytics platform for big data and the only in-cluster BI platform, consisting of two main components: a visualization front end and a distributed analytics engine. The engine provides the granular drill-downs and performance of OLAP, but without the heavy IT intervention of building and maintaining cubes.
Shant said that business users and IT professionals tend to be at odds with one another: business users want fast access to data, while IT wants to maintain control. Solving that conflict is a top objective, since organizations need to balance internal transparency with control over who sees what. That's where Ranger comes in for analytics with Arcadia Data on HDP. Shant espoused the elegance and advantages of Ranger in HDP, pointing out:
- Centralization. IT teams love having centralized authorization and auditing across Hadoop components because access can be provisioned and reviewed in a single location.
- Resource authorization. Ranger provides access authorization based on resources and supports role-based access control (RBAC) in addition to policy-based access control (PBAC), both of which simplify the process of provisioning security.
- Policy-based behavior. Capabilities such as column-level masking, geo-location based access, and time-based access all represent powerful features that big data users need.
- Extensible architecture. With its plug-in and API-oriented architecture, Ranger is easy to integrate with.
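As a taste of what those capabilities look like in practice, here is a sketch of a Ranger column-masking policy expressed as the JSON body one would POST to Ranger's public REST API (`/service/public/v2/api/policy`). The service, database, table, and user names are placeholders I've invented for the example, and you should verify the exact field names against your Ranger version's documentation.

```python
import json

# Hypothetical column-masking policy: analysts querying sales.customers
# see only the last four digits of the ssn column.
masking_policy = {
    "service": "hdp_hive",                      # assumed Hive service name
    "name": "mask-ssn-for-analysts",
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["customers"]},
        "column":   {"values": ["ssn"]},
    },
    "dataMaskPolicyItems": [{
        "users": ["analyst"],
        "accesses": [{"type": "select", "isAllowed": True}],
        "dataMaskInfo": {"dataMaskType": "MASK_SHOW_LAST_4"},
    }],
}

body = json.dumps(masking_policy)
# Submitted with an authenticated POST, e.g.:
# requests.post(f"{ranger_url}/service/public/v2/api/policy",
#               json=masking_policy, auth=(user, password))
```

The point of the plug-in architecture is that a partner product never has to interpret this policy itself: the Ranger plug-in embedded in the Hadoop component enforces it, and the partner simply inherits the decision.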
Then Shant turned to what he called his "money slide," where he explained why integration with Ranger was so valuable from the Arcadia Data perspective. First, he acknowledged that security code is very complicated: hard to get right and easy to get wrong. Fortunately, the strong community around Ranger and Atlas helps ensure security is done right, which means technology partners don't have to take on the task themselves. Second, in the self-deprecating manner common among engineers, he claimed laziness as the reason for not building a dedicated security model into Arcadia Data; of course, that is just another way of saying there is no need to reinvent the wheel.
His third and final point kicked off an interesting discussion that included some audience participation. The topic was zero-knowledge proofs, which he claimed are typically covered only in graduate-level cryptography classes. A quick poll of the audience revealed that only a handful were familiar with the topic. He continued with a nice example: he claimed to have the magical power to quickly count the number of leaves on any tree, and he asked the audience what would be a good way to verify his claim. Some ideas around trust and small-scale testing were dismissed, as these would not provide sufficient proof.
What he was getting at was a way to prove his magical power without revealing anything more about it other than the fact that he had such power. That's what a zero-knowledge proof is: the ability to prove something without revealing any additional knowledge. His answer was quite simple. He would first state the number of leaves on a tree, and then, with his back turned, ask a verifier to optionally remove a leaf from the tree, possibly as determined by a coin flip. Then, if he could provide a count that coincided with the verifier's action (i.e., the same number if no leaf was removed, or one fewer leaf if one was removed), that would be a good indication that his power was real. Of course, one test would not be sufficient, but with enough iterations, Shant could prove with very high probability that his power was real (the key here is that the proof is probabilistic, not deterministic, as most other proofs are).
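The protocol is easy to simulate, which also makes the probabilistic nature of the proof concrete. The sketch below (my own illustration; names like `run_protocol` and `Cheater` are invented) shows that an honest prover always passes, while a prover who merely guesses survives each round with probability 1/2, so the chance of fooling the verifier over n rounds shrinks to 2^-n.

```python
import random

def run_protocol(prover, true_count, rounds=20, rng=None):
    """One verification session: each round, the verifier secretly removes a
    leaf (or not) on a coin flip, then asks the prover for the current count.
    Returns True only if the prover is correct in every round."""
    rng = rng or random.Random()
    count = true_count
    for _ in range(rounds):
        if rng.random() < 0.5:
            count -= 1              # verifier's secret action
        if prover(count) != count:  # prover announces the count
            return False
    return True

def honest(tree_count):
    """The claimed magical power: the prover can always read the true count."""
    return tree_count

class Cheater:
    """No magical power: knows the starting count but cannot see the tree,
    so each round it must guess whether a leaf was removed (50/50)."""
    def __init__(self, start, rng):
        self.claim, self.rng = start, rng
    def __call__(self, _unseen_count):
        if self.rng.random() < 0.5:
            self.claim -= 1
        return self.claim

# A guesser survives each round with probability 1/2, so after n rounds the
# chance of fooling the verifier is 2**-n: probabilistic, not deterministic.
soundness_error = lambda n: 0.5 ** n
```

With 20 rounds the soundness error is already below one in a million, which is why repeating the simple leaf test is enough to make the claim convincing without ever revealing how the power works.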
His discussion was an interesting way of reinforcing the point that from an Arcadia Data standpoint, we don’t care about the details of the security model implementation—we just want a verifiable answer on whether a user has access to specific data or not. This standpoint once again suggests that relying on a technology like Ranger is the right way to handle security in a big data environment.
For the sake of staying on time, Shant wrapped up his presentation with a fast summary of his remaining slides. During the next 30 seconds (literally), he covered many points, but one area of particular interest was a summary of Arcadia Enterprise’s security features:
- A single copy of data reduces the storage footprint, allows a single policy definition, and makes compliance easier.
- Enterprise-grade architecture includes support of key security frameworks like Kerberos, LDAPS/AD, PAM, SAML, etc.
- Integrated access with Ranger eliminates the risk of mismatched policies.
Security is only one reason why Arcadia Enterprise is ideal for big data. Shant also mentioned that the analytics acceleration engine caches and pre-computes query components to vastly improve performance, scale, and concurrency. Another noteworthy point is that Arcadia Data has several certification badges with Hortonworks (HDP, YARN, Security, Operations) which you can see here: https://hortonworks.com/partner/arcadia-data/.
The audience seemed pleased with the overall session, and hopefully they got a good sense of the importance and popularity of Ranger and Atlas, as well as the value of the partner technologies. Arcadia Data had many good conversations about native visual analytics on HDP, and we look forward to doing more work with Hortonworks. The DataWorks Summit was a great success as well, with lots of end user customers looking to evolve their big data projects and support more use cases and departments on their system. Discussions with visitors at our booth revealed a more sophisticated crowd than in previous years, which demonstrates stronger adoption of big data technologies in the market today. In most cases, our message resonated with our visitors because they’ve already experienced the challenges of traditional BI in a big data environment, and they’re looking for a more agile and scalable solution. DataWorks Summit attendees are clearly very serious about Hadoop and the value it offers to business users, and these are the types of professionals that we help. With the many great meetings we’ve had over the three-day summit, we were happy to be a part of it and we look forward to the next venue.