Using Apache Spark for the First Time
by Jim Scott
If you read our recent blog post, “Spark 101: What Is It, What It Does, and Why It Matters,” you know that Spark is a general-purpose data processing engine that can be used for stream processing, machine learning, data integration, and interacting with and exploring data. You can read more about it in the Getting Started with Spark: From Inception to Production ebook. In this blog post, I’ll show you how to get started with Spark.
Spark has a very low entry barrier to get started, which eases the burden of learning a new toolset. It is straightforward to download Spark and configure it in standalone mode on a laptop or server for learning and exploration. This low barrier to entry makes it relatively easy for individual developers and data scientists to get started with Spark, and businesses to launch pilot projects that do not require complex re-tooling or interference with production systems.
Apache Spark is open source software, and can be freely downloaded from the Apache Software Foundation. Spark requires at least version 7 of Java. Building Spark from source also requires at least version 3.0.4 of Maven, although the pre-built packages used below do not; other build dependencies, such as Scala and Zinc, are downloaded and configured automatically as part of the build process.
Follow these simple steps to download Java, Spark, and Hadoop, and get them running on a laptop (in this case, one running Mac OS X). If you do not currently have the Java JDK (version 7 or higher) installed, download it and follow the steps to install it for your operating system.
Visit the Spark downloads page, choose a pre-built package, and download Spark. Double-click the archive file to expand its contents so you can use them.
A Spark Demo
Open a terminal, navigate to the newly created directory, and start Spark's interactive shell with:
./bin/spark-shell
A series of messages will scroll past as Spark and Hadoop are configured. Once the scrolling stops, you will see the scala> prompt.
At this prompt, let’s create some test data: a simple sequence of numbers from 1 to 50,000.
val data = 1 to 50000
Now, let’s place these 50,000 numbers into a Resilient Distributed Dataset (RDD) which we’ll call sparkSample. It is this RDD upon which Spark can perform analysis.
val sparkSample = sc.parallelize(data)
Now we can filter the data in the RDD to find any values less than 10.
sparkSample.filter(_ < 10).collect()
Spark reports the result as an array containing the values less than 10, namely 1 through 9.
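If you want to double-check the expected result without a Spark installation, the same logic can be sketched with a plain Scala collection, since filter on an RDD mirrors filter on an ordinary sequence:

```scala
// Plain-Scala sketch of the demo's logic (no Spark required):
// a Range stands in for the RDD, and filter keeps values below 10.
val data = 1 to 50000
val result = data.filter(_ < 10)
println(result.mkString(", "))  // prints 1, 2, 3, 4, 5, 6, 7, 8, 9
```

The difference in Spark is that sc.parallelize distributes the data across the cluster, and collect() gathers the filtered values back to the driver.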
As you have hopefully discovered, Spark is relatively simple to use. You can get more information from the Quick Start guide, which is aimed at developers familiar with either Python or Scala, or from the tutorial for the MapR Sandbox, a simplified deployment of Hadoop. You can also learn more about build options in Spark's online documentation, including optional integration with data storage systems such as Hadoop's HDFS or Hive.
In this blog post, we provided a guide for getting started with Spark, and included installation instructions and a demo for working with some test data.
If you have any additional questions about using Spark, please ask them in the comments section below.
- [Book Discussion] - Getting Started with Apache Spark
- Getting Started with Spark on MapR Sandbox
- The Essential Apache Spark Cheat Sheet
- Explore Apache Spark Resources & Product Information