This blog was first published on Medium.
WARNING: This is not for the high-tech unicorns, you mythical beasts who sparkle SQL and Java and make code bloom wherever you go. This is for the regular person who wants to understand Apache Spark at a pedestrian level. There are many resources online that help you take a deeper dive into Apache Spark, but it can be difficult for the code-illiterate to make heads (or tails) of it. No offense to the Unicorns, they are pretty awesome!
This is for the curious minds that want to learn a little about Apache Spark. Maybe you will want to take it further and learn more; that would be swell.
We will follow the who, what, where, when, why, and how pattern for these articles. I hope you enjoy them. Also, if you have a topic or technology request, let me know in the comments below.
BEFORE WE GET STARTED, LET’S DISCUSS WHAT APACHE SPARK IS NOT: a stationary electric charge, typically produced by friction, that causes sparks or crackling or the attraction of dust or hair. (Static Electricity)
Data scientists use Apache Spark to perform advanced data analytics. Python brings an extensive set of advanced analytical functions that can be performed on data in Spark. Python is one of the more popular languages in the data science community and is supported by Spark via a toolset called pySpark.
Data engineers are data designers or builders. They usually assist data scientists and application developers in the data curation journey. They develop the architecture for the organization based on use cases and needs.
Application developers can build solutions using Apache Spark. These applications are generally for analytical and business intelligence purposes. Spark is great for data analysis style applications and not for transaction processing applications.
BOTTOM LINE: Apache Spark requires a decent amount of technical knowledge to get working. The average business person will need lots of help to get up and running on Apache Spark (and that is being generous). It is for programmers, data scientists, and highly technical unicorns.
Apache Spark is a distributed processing system used for big data workloads. A distributed system is a collection of many servers all working together like a team. A distributed system can handle bigger datasets, do more data crunching, and solve big data problems because the work is spread across many computers. Think of it this way: if you want to move a couch, it will take more than one person!
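To make the couch analogy concrete, here is a tiny plain-Python sketch (no Spark involved) of splitting one job across several workers and then combining their results. The numbers, chunk size, and worker count are all made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# One big "couch" of work: sum the numbers 1 through 100.
numbers = list(range(1, 101))

# Split the work into chunks of 25, one chunk per worker.
chunks = [numbers[i:i + 25] for i in range(0, len(numbers), 25)]

# Each worker sums its own chunk; the partial results come back to us.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Combine the partial results into the final answer.
total = sum(partial_sums)
print(total)  # 5050
```

Spark does the same kind of split-work-combine dance, just across many machines instead of a few threads on one laptop.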
Apache Spark also takes advantage of in-memory caching for fast analytic queries for any data size. An in-memory cache is designed to store data in RAM and not on disk. You can use languages like Scala, Python, R, and SQL to leverage Apache Spark.
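Here is a loose, plain-Python sketch of the in-memory caching idea (this is not actual Spark code): the first lookup does the slow "disk" work, and every lookup after that is answered straight from RAM. The function and key names are invented for the example:

```python
import time

# A hypothetical slow data source, standing in for reading from disk.
def load_from_disk(key):
    time.sleep(0.01)  # simulate slow disk I/O
    return key.upper()

cache = {}  # the in-memory cache: a plain dictionary living in RAM

def load(key):
    # Serve from RAM when we can; fall back to the slow "disk" otherwise.
    if key not in cache:
        cache[key] = load_from_disk(key)
    return cache[key]

load("spark")           # first call: slow path, populates the cache
result = load("spark")  # second call: answered straight from RAM
print(result)
```

Spark's caching is far more sophisticated, but the payoff is the same: data you keep in memory is dramatically faster to reuse than data you fetch from disk every time.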
BOTTOM LINE: Apache Spark was created in 2009 and ready for prime time by 2012, because MapReduce (see the WHY section below) was slow and complicated, and people wanted something faster and easier. Apache Spark is fast and far simpler to program than MapReduce. Imagine shoving a bunch of data into computer memory and being able to read it, process it, or do something with it rapidly. That is Apache Spark.
Apache Spark can be installed on-premises and even on your laptop. You can have multi-node instances as well. Spark is also available in the cloud, and you can easily stand up an instance in Amazon Web Services or Microsoft Azure. There is also Databricks, a managed cloud platform for Spark founded by Spark’s original creators. Databricks lets you stand up a cluster very quickly and has an easy-to-use web interface.
BOTTOM LINE: Apache Spark is open source, so go crazy and install it yourself on your laptop and give it a try. It comes with some examples too. If you want to go big with Spark, I would recommend trying it out on Databricks first or with a cloud provider like AWS or Azure.
Spark is useful for people who know how to use it and how to work around its limitations. A simple demo is not enough; don’t be fooled by word count examples. It isn’t always easy: working with data at scale can be a rough ride, especially when it comes to memory issues.
The big gotcha comes when you go from a single-node instance to a multi-node cluster environment. That is when the challenges of big data kick in: larger data sizes, heavier workloads, and a more complex environment to manage. It is like the old saying: “the more moving parts, the greater the opportunity for things to break.”
BOTTOM LINE: Tread lightly and be patient. Apache Spark requires some serious technical talent to get working. Don’t be fooled by simple examples and demonstrations; make sure you see a real example at big data scale. Big data means massive data sets, far bigger than anything that can fit in an Excel spreadsheet.
Spark was developed in 2009 in UC Berkeley’s AMPLab, open-sourced in 2010, and later donated to the Apache Software Foundation. By 2012 it was ready for prime time. It grew out of frustrations with the MapReduce cluster computing paradigm, the initial programming framework of Apache Hadoop. MapReduce forces a linear dataflow structure on distributed programs: a MapReduce program reads input data from disk, maps a function across the data, reduces the results of the map, and stores the reduction results back on disk. In layman’s terms, MapReduce was slow! SLOW + BIG DATA = NO JOY, and thus we get Spark.
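That map-then-reduce dataflow can be sketched in plain Python, with no Hadoop or Spark required. The sample lines and helper names here are made up purely for illustration:

```python
from collections import Counter
from functools import reduce

# Two pretend "input files" worth of text.
lines = ["big data is big", "spark makes big data fun"]

# MAP: turn each line into a list of (word, 1) pairs.
mapped = [[(word, 1) for word in line.split()] for line in lines]

# REDUCE: merge all the pairs into one set of per-word totals.
def merge(counts, pairs):
    for word, n in pairs:
        counts[word] += n
    return counts

totals = reduce(merge, mapped, Counter())
print(totals["big"])  # 3
```

In real MapReduce, every one of those intermediate results would be written to disk between steps; Spark keeps them in memory, which is the heart of its speed advantage.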
BOTTOM LINE: In the world of big data, developers got frustrated with the performance of MapReduce. Since necessity is the mother of invention, we got Apache Spark.
If you are reading this article, you are curious about what Apache Spark is and how it works. If you muster up the courage, you too can get it running on your laptop. Give it a try. Here is an excellent article on installing a single node instance of Apache Spark: Install Spark on Mac (PySpark)
I have faith that you too can find your INNER-UNICORN!
BOTTOM LINE: Getting started with a single node instance of Apache Spark is fun and easy to do! It shouldn’t take very long, roughly an afternoon. There is a big difference between a single node (one computer) and a multi-node cluster (many computers) of Apache Spark. If you really want to impress your friends in Technology-land, have a pySpark shell window open the next time you present (see below). It will blow their minds! (Make sure to have a serious look of confidence on your face, then change the subject.)
CONCLUSION: Spark is an incredible technology and this article only scratches the surface. I didn’t go into RDDs and other aspects, but maybe I will in a future article. Hope to see you next time when I talk about Apache Hadoop!
CALL TO ACTION: In the meantime, if you want to do something really super easy and fun check out Arcadia Instant. It is a free version of our technology. If you are curious to learn more, ping me and I will show you around!