Data Engineering Notes | Scala Spark 3

0. Introduction to Spark and Scala

Spark

Spark: a unified analytics engine for large-scale data processing.

  • fast: runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
    • a DAG (directed acyclic graph) engine optimizes workflows

Apache Spark consists of the Spark Core engine, Spark SQL, Spark Streaming, MLlib, GraphX and SparkR.

  • Spark Core: provides in-memory computing and the ability to reference datasets stored in external storage systems. Its RDD (resilient distributed dataset) abstraction speeds up data sharing in distributed processing; engines like MapReduce write intermediate results to permanent storage such as HDFS or S3 between steps, which is slow because of the serialization, deserialization and I/O involved. (A short sketch follows this list.)
  • Spark SQL: lets users perform ETL (extract, transform and load) on data from a variety of sources in formats like JSON, Parquet or Hive, and then execute ad-hoc queries.
  • Spark Streaming
  • MLlib
  • GraphX
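The Spark Core and Spark SQL bullets above can be made concrete with a short Scala sketch. This is a minimal illustration, not code from these notes: the input path data/people.json and its age column are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

object ComponentsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ComponentsDemo")
      .master("local[*]")
      .getOrCreate()

    // Spark Core: cache an RDD in memory so the second action below
    // does not recompute it from scratch.
    val nums = spark.sparkContext.parallelize(1 to 1000000).cache()
    println(s"count = ${nums.count()}, sum = ${nums.sum()}")

    // Spark SQL: load a JSON source and run an ad-hoc query.
    // data/people.json and the age column are made-up examples.
    val people = spark.read.json("data/people.json")
    people.createOrReplaceTempView("people")
    spark.sql("SELECT COUNT(*) AS adults FROM people WHERE age > 30").show()

    spark.stop()
  }
}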

1. Installation on macOS

See https://spark.apache.org/docs/latest/ (updated on Sep 7, 2021)

% /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
% softwareupdate --install -a
% brew install openjdk@8
% sudo ln -sfn /usr/local/opt/openjdk@8/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-8.jdk
% java -version
openjdk version "1.8.0_302"
% brew install scala@2.12
% brew install apache-spark
% spark-shell
Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.12)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :q
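While in the shell, you can evaluate a quick expression to confirm everything works, using the sc SparkContext that spark-shell creates automatically (the sum of 1..100 is 5050):

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0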

In the macOS Finder, press Command+Shift+G and enter /usr/local/Cellar

  • we can see the installed versions of Java, Scala and Spark there
  • Common issue: setting PATH in the shell profile
% echo 'export PATH="/usr/local/opt/openjdk@8/bin:$PATH"' >> ~/.zshrc
% source ~/.zshrc

export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2/libexec
export PYTHONPATH=/usr/local/Cellar/apache-spark/3.1.2/libexec/python/:$PYTHONPATH

Download IntelliJ IDEA and install the Scala plugin.

  • Run HelloWorld.scala
package com.sundogsoftware.spark

import org.apache.spark._
import org.apache.log4j._

object HelloWorld {
  def main(args: Array[String]): Unit = {

    // Quiet the log output down to errors only
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Run locally, using every core of the machine
    val sc = new SparkContext("local[*]", "HelloWorld")

    // Load the ratings file as an RDD of lines and count them
    val lines = sc.textFile("data/ml-100k/u.data")
    val numLines = lines.count()

    println("Hello world! The u.data file has " + numLines + " lines.")

    sc.stop()
  }
}
/*
/Library/Java/JavaVirtualMachines/openjdk-8.jdk/Contents/Home/bin/java
Hello world! The u.data file has 100000 lines.

Process finished with exit code 0
*/
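HelloWorld above drives Spark through the classic SparkContext API. Since Spark 2.x the unified entry point is SparkSession, which wraps a SparkContext; here is a minimal equivalent sketch (HelloWorldSession is a hypothetical name; the data path is the same MovieLens file as above):

package com.sundogsoftware.spark

import org.apache.spark.sql.SparkSession
import org.apache.log4j._

object HelloWorldSession {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    // SparkSession is the Spark 2+/3 entry point; the underlying
    // SparkContext is still available as spark.sparkContext.
    val spark = SparkSession.builder()
      .appName("HelloWorldSession")
      .master("local[*]")
      .getOrCreate()

    val numLines = spark.read.textFile("data/ml-100k/u.data").count()
    println(s"Hello world! The u.data file has $numLines lines.")

    spark.stop()
  }
}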
