A Brief Overview of the Big Data Ecosystem
(Hadoop, Spark, and Beyond)
Other Platforms: NoSQL, NewSQL, Object Stores
Other technologies in the big data ecosystem that work well alongside Hadoop, SQL-on-Hadoop, and Spark are worth mentioning here. These classes of technologies include NoSQL, NewSQL, and object stores. Each is intended to solve big data challenges with a different approach, as noted below. These technologies are typically used together with Hadoop/Spark to handle big data analytics, since none of them was designed for analytics to the degree that Hadoop and Spark were. Still, their complementary use with Hadoop/Spark makes them an important part of any modern data architecture.
Arguably, the most significant of these technologies is the NoSQL database, which was originally created to overcome the limitations of RDBMSs when handling big data. Instead of the well-defined, tabular data model specified by the relational model in RDBMSs, NoSQL databases used “denormalized” data models that allowed you to put as much data as you wanted into a single record. This meant that if you wanted to save or retrieve information about a specific entity, such as a person or an invoice, you only needed to access a single database record. This way of modeling data enabled other advantages, such as a scale-out architecture that leveraged low-cost, commodity hardware, just like Hadoop. Combined with the simplicity of the database reads and writes, this made NoSQL databases faster and much more cost-effective than RDBMSs for these workloads.
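To make the single-record idea concrete, here is a minimal sketch in plain Python that models the same invoice two ways. The invoice data, field names, and helper functions are all hypothetical and chosen only for illustration; no real database or NoSQL API is involved, and the in-memory lists and dictionaries merely stand in for tables and documents.

```python
# Hypothetical invoice data, modeled two ways. Nothing here talks to a real database.

# Relational style: one invoice is spread across three normalized "tables".
customers = [{"customer_id": 1, "name": "Ada Lopez"}]
invoices = [{"invoice_id": 100, "customer_id": 1, "date": "2017-03-01"}]
line_items = [
    {"invoice_id": 100, "sku": "A-1", "qty": 2, "unit_price": 10.0},
    {"invoice_id": 100, "sku": "B-7", "qty": 1, "unit_price": 5.5},
]


def load_invoice_relational(invoice_id):
    """Reassemble one invoice by looking up rows in three tables (a join)."""
    inv = next(r for r in invoices if r["invoice_id"] == invoice_id)
    cust = next(r for r in customers if r["customer_id"] == inv["customer_id"])
    items = [
        {k: v for k, v in row.items() if k != "invoice_id"}
        for row in line_items
        if row["invoice_id"] == invoice_id
    ]
    return {
        "invoice_id": invoice_id,
        "date": inv["date"],
        "customer": cust["name"],
        "items": items,
    }


# Denormalized style: the whole invoice lives in a single record under one key.
invoice_documents = {
    100: {
        "invoice_id": 100,
        "date": "2017-03-01",
        "customer": "Ada Lopez",
        "items": [
            {"sku": "A-1", "qty": 2, "unit_price": 10.0},
            {"sku": "B-7", "qty": 1, "unit_price": 5.5},
        ],
    }
}


def load_invoice_document(invoice_id):
    """Fetch the invoice with a single key lookup; no join is needed."""
    return invoice_documents[invoice_id]


def invoice_total(invoice):
    """Sum quantity times unit price over the invoice's line items."""
    return sum(item["qty"] * item["unit_price"] for item in invoice["items"])
```

Both loaders return the same invoice, but the denormalized version touches exactly one record. The trade-off, which this sketch does not show, is that a customer's name is repeated in every invoice record rather than stored once.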
And while NoSQL databases often displaced RDBMSs for certain workloads, they are certainly not a drop-in replacement. The term NoSQL was originally interpreted as “‘No’ to SQL” for a short while, until the industry acknowledged that NoSQL should not be viewed as a direct replacement for RDBMSs. The name was then reinterpreted as “not only SQL” to suggest that NoSQL and relational databases could complement each other in a data center.
Some of the early technologies that helped give rise to the NoSQL movement include MongoDB, CouchDB, Apache Cassandra, and Apache HBase, though many other NoSQL databases are available today. The early origin of the term “NoSQL” is actually a little more complicated than described above, but one thing to note here is the irony of the name. While the pioneering NoSQL databases had no SQL interface, more and more NoSQL databases today are adopting SQL as a query language, making NoSQL an increasingly obsolete name. Just as it did for Hadoop, SQL promises to make NoSQL databases easier to adopt and deploy in production environments.
NewSQL represents an emerging class of technologies that retain the key characteristics of RDBMSs but are architected to overcome their scale limitations. Since traditional RDBMSs were designed to run on a single hardware server, handling more data and more users typically meant upgrading your hardware. With NewSQL, you get the benefits of RDBMSs while also gaining the ability to scale out on commodity hardware (again, like Hadoop), which is necessary for growing big data volumes. NewSQL databases typically run in memory to overcome the latency of disk accesses and of coordinating data sets across nodes in a cluster. Since NewSQL databases are still in their early stages, they are less certain than other emerging technologies to impact the BI world in the near future. As such, much of the discussion in this book will not specifically call them out as data platforms for BI. Examples of NewSQL databases include MemSQL, VoltDB, and ClustrixDB.
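The difference between upgrading a single server and scaling out across several can be illustrated with a small sketch of hash partitioning, one common way to spread rows over the nodes of a cluster. This is a generic illustration under stated assumptions, not a description of how MemSQL, VoltDB, or ClustrixDB actually place data; the node names, key format, and function names are hypothetical.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node for a row key using a stable hash (hypothetical placement rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def distribute(keys, nodes):
    """Group row keys by the node that would own them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


# Hypothetical three-node cluster of commodity servers.
cluster = ["node-a", "node-b", "node-c"]
row_keys = [f"customer-{i}" for i in range(1000)]
placement = distribute(row_keys, cluster)
```

Each node ends up owning only a share of the rows, so capacity grows by adding nodes rather than by buying a bigger server. Note that this simple modulo scheme would move most keys whenever the node count changes; production systems generally use more careful partitioning strategies.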
Finally, object stores represent yet another approach to addressing big data challenges. These are simply low-cost places to store large volumes of data. Object stores are ideal for “objects,” which in this context are large files. Each object/file is typically accessed via a standard URL. The advantages are purely about cost and convenience, as there’s no compute layer that processes the data. These advantages have made object stores very popular for big data environments. Also, many websites take advantage of object stores because they are one of the least expensive ways to store data that needs to be accessed across the cloud. Many popular technologies, including Spark, can leverage object stores such as Amazon S3 as the means for storing big data.
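The URL-addressed, storage-only nature of an object store can be sketched with a toy in-memory stand-in. The class below is hypothetical and is not the Amazon S3 API or any real client library; the bucket name, object keys, and the `s3://bucket/key` URL form are used purely as illustrative assumptions.

```python
from urllib.parse import urlparse


class InMemoryObjectStore:
    """Toy stand-in for an object store: each bucket maps object keys to byte blobs."""

    def __init__(self):
        self._buckets = {}

    @staticmethod
    def _split(url):
        """Split a URL such as s3://bucket/path/to/object into (bucket, key)."""
        parsed = urlparse(url)
        key = parsed.path.lstrip("/")
        if not parsed.netloc or not key:
            raise ValueError(f"expected scheme://bucket/key, got {url!r}")
        return parsed.netloc, key

    def put(self, url, data):
        """Store a whole object under its URL, replacing any previous version."""
        bucket, key = self._split(url)
        self._buckets.setdefault(bucket, {})[key] = bytes(data)

    def get(self, url):
        """Return the whole object stored under the URL."""
        bucket, key = self._split(url)
        return self._buckets[bucket][key]

    def list(self, bucket, prefix=""):
        """List object keys in a bucket that start with the given prefix."""
        return sorted(k for k in self._buckets.get(bucket, {}) if k.startswith(prefix))


# Hypothetical bucket and object names.
store = InMemoryObjectStore()
store.put("s3://example-bucket/logs/2017-03-01.csv", b"user,clicks\nada,3\n")
store.put("s3://example-bucket/logs/2017-03-02.csv", b"user,clicks\nada,5\n")
```

The store only saves, returns, and lists whole objects. Any filtering or aggregation has to happen in whatever reads the objects, which is the role an engine such as Spark plays when it uses an object store as its storage layer.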