Amazon offers a lot of cloud-based services, so getting started with AWS can be tough. Which one should you use? The service names are rarely self-explanatory, and things get even more confusing when several services have to be combined.

The idea was to deploy a Spring Boot application performing ETL (extract, transform, load). Any application can be deployed using EC2. But there is one special thing: it is a Spark application. Therefore we opted for EMR.


Note: EMR stands for Elastic MapReduce. It is a big data platform that provides Apache Spark, Hive, Hadoop and more. This managed Hadoop framework makes it possible to process vast amounts of data across dynamically scalable Amazon EC2 instances.

S3 (Simple Storage Service) is a scalable distributed storage system, Amazon’s equivalent to HDFS and probably the most widely used AWS service. It lets you upload anything (Word documents, text, images, videos, etc.) and retrieve the data instantly when needed. It can be accessed from an EC2 instance.


I want to show how to use EMR and what my teammate and I went through.

One of the most challenging problems was the conflict between the application's dependencies and those provided by the AWS Spark platform. Additional configuration turned out to be crucial to make it work.

Infrastructure

Before getting into the step-by-step solution, here is an overview of how everything fits together, just to give a picture of what I am talking about.

AWS infrastructure diagram

The application's static resources were placed in an S3 bucket (etl-web). To use them, the application needs just one property:

spring.resources.static-locations: http://etl-web.s3-website.eu-central-1.amazonaws.com

That was fairly easy. But let's move on to the serious backend matters.

Modify project configuration

First, we had to make sure that all the Spark dependencies are provided by EMR, not by the application JAR.

compile("info.fingo.etl:etl-plugins-api:${versions.pluginsApi}") {
    exclude group: 'org.apache.spark', module: 'spark-catalyst_2.11'
}
compileOnly("org.apache.spark:spark-core_2.11:${versions.spark}")
compileOnly("org.apache.spark:spark-catalyst_2.11:${versions.spark}")
compileOnly("org.apache.spark:spark-sql_2.11:${versions.spark}")
compileOnly("org.datanucleus:datanucleus-core:${versions.datanucleus}")  // force to use newer version
compileOnly("com.databricks:spark-xml_2.11:${versions.sparkXml}")
compileOnly("org.apache.spark:spark-hive_2.11:${versions.spark}")
compileOnly("org.apache.spark:spark-hive-thriftserver_2.11:${versions.spark}")

Note: Gradle’s compileOnly ensures that dependencies are not included on the runtime classpath.


Secondly, we excluded logback from the project. 

exclude group: 'ch.qos.logback', module: 'logback-classic'

Lastly, we created a separate, AWS-dedicated application profile in Gradle. In this profile, the application is submitted as a Spark job in cluster mode.

The other profile declares the Spark dependencies with the compile clause and makes it possible to start the application with Spark embedded or in Standalone Mode.

The AWS deployment also required a few properties to be overridden (they are passed in the spark-submit script):

spring.resources.static-locations: http://etl-web.s3-website.eu-central-1.amazonaws.com 
spark.ui.port: 18080 
logging.config: /home/hadoop/config/logger.xml

Among its web interfaces, EMR exposes the Spark History Server UI on port 18080.

Dependencies conflict

After creating a profile for building a Spark-compatible JAR, we had to deal with libraries which could not be excluded.

“Why?”, one may ask. Well, one example is Jackson. This set of libraries (jackson-core, jackson-annotations, jackson-databind) is an internal Spring Boot dependency, and its versions are incompatible with the ones shipped with Spark. The problem shows up at runtime: the application will not start. There seems to be nothing we can do in Gradle to fix the situation. That is what dependency hell looks like.

Some of these libraries are used across the app and have different interfaces than those provided by Spark.

How to handle this? There are two helpful properties that can be passed to the spark-submit script.

"spark.driver.userClassPathFirst": "true",
"spark.jars": "s3://etl-deps/jackson-databind.jar,s3://etl-deps/jackson-annotations.jar,s3://etl-deps/jackson-core.jar,s3://etl-deps/guava.jar,s3://etl-deps/etl-plugins-api.jar,s3://etl-deps/etl-plugins.jar"

The first one tells Spark to give the user-supplied JARs precedence over its own when loading classes in the driver. With the second one we add the listed JAR files to the classpath. Unfortunately, each JAR must be passed explicitly: there is no way of telling Spark to look into a specific directory with JARs.
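Because every JAR has to be listed, the value of spark.jars can be assembled by a small helper instead of being typed by hand. The following is only a sketch: it assumes the conflicting JARs sit in a single local directory, and the function name join_jars and the /tmp/deps path are hypothetical, not part of our setup.

```shell
#!/bin/bash
# Build a comma-separated list of every JAR found in a directory,
# in the format expected by the spark.jars property.
join_jars() {
  local dir="$1"
  local list=""
  local jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue      # the glob matched nothing: skip
    list="${list:+$list,}$jar"     # prepend a comma only after the first entry
  done
  printf '%s\n' "$list"
}

# Usage (hypothetical directory):
#   bin/spark-submit --conf spark.jars="$(join_jars /tmp/deps)" ...
```

The helper only covers local paths; for JARs stored on S3 the list would still have to be produced some other way.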

Locally, spark-submit execution goes as follows: 

$ bin/spark-submit --deploy-mode client \
    --class org.springframework.boot.loader.PropertiesLauncher \
    --driver-java-options -Dloader.main=info.fingo.etl.EtlApplication \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.jars="/tmp/guava-26.0-android.jar,/tmp/jackson-annotations-2.8.0.jar,/tmp/jackson-core-2.8.10.jar,/tmp/jackson-databind-2.8.11.2.jar" \
    /tmp/application-2.7.0-SNAPSHOT.jar

The JARs of the conflicting libraries have to be stored somewhere. For AWS, we placed them on S3.
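The local file names above carry version numbers, while the names on S3 do not. One possible way to bridge the two is sketched here; it is an assumption about how such an upload could be scripted, not a record of what we ran. The helper names strip_version and print_uploads are hypothetical, and the script only prints the aws s3 cp commands instead of executing them.

```shell
#!/bin/bash
# Derive the unversioned name used on S3 from a versioned JAR file name,
# e.g. jackson-core-2.8.10.jar -> jackson-core.jar.
strip_version() {
  # Drop everything from the first "-<digit>" up to the .jar extension.
  printf '%s\n' "$1" | sed -E 's/-[0-9][0-9A-Za-z.-]*\.jar$/.jar/'
}

# Print (not run) one upload command per JAR in the given directory.
print_uploads() {
  local dir="$1" bucket="$2" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue    # the glob matched nothing: skip
    printf 'aws s3 cp %s s3://%s/%s\n' \
      "$jar" "$bucket" "$(strip_version "$(basename "$jar")")"
  done
}

# Usage (hypothetical directory): review the output, then pipe it to a shell.
#   print_uploads /tmp/deps etl-deps
```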

Application configuration 

OK, now that we are quite close to completing the deployment, we need to make sure that the configuration is passed to Spring Boot. To accomplish that, we wrote a bootstrap action script that copies two files from S3 to a local directory on the EMR instance.


Note: Bootstrap actions are scripts that are run on the cluster before it launches. To be more precise: before the application we specified is run.


#!/bin/bash
# Copy the Spring Boot configuration files from S3 to the local file system
mkdir -p /home/hadoop/
aws s3 cp s3://etl-deps/application-aws.properties /home/hadoop/
aws s3 cp s3://etl-deps/logger.xml /home/hadoop/
 

These two files cannot be accessed by Spring Boot directly from S3 (we cannot set a path to S3 in the properties file), so they need to be available locally.

Ready to go  

We are ready to launch the cluster on AWS. It takes only four simple steps.

1: Software and Steps 

Select the applications which should be available in EMR. In the spark-submit script we tell Spring to use application-aws.properties.

Create cluster: step 1

Additionally, we need to pass the software settings with the Spark properties configuration.

[
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.driver.userClassPathFirst": "true",
      "spark.jars": "s3://etl-deps/jackson-databind.jar,s3://etl-deps/jackson-annotations.jar,s3://etl-deps/jackson-core.jar,s3://etl-deps/guava.jar,s3://etl-deps/etl-plugins-api.jar,s3://etl-deps/etl-plugins.jar",
      "spark.sql.warehouse.dir": "s3://etl/warehouse"
    }
  }
]

2: Hardware 

In this step we can configure the hardware, e.g. the number of nodes, their type (Master/Core/Task), instance type (model), auto-scaling options, etc.

Desired configuration (performance tests have shown it to have the best time-efficiency-to-cost ratio for our case):

Create cluster: step 2

Sample Auto Scaling for Task nodes:

Scale out Task nodes

Scale in Task nodes

3: General cluster settings 

In this step we can set the cluster name and select the bootstrap action scripts.

Create cluster: step 3

4: Security 

Configure security groups, e.g. open port 8081 to inbound/outbound TCP connections.


Note: If you plan to use SSH, then apart from creating an EC2 key pair, you should also ensure that the security group for the Master node allows inbound SSH traffic (port 22) from the public IP address you are using.


Create cluster: step 4

Create cluster 

In the last step, we create and launch the cluster by clicking the “Create cluster” button.
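For repeatable launches, the same choices can also be expressed with the AWS CLI instead of the console. The sketch below is not the configuration we used: the release label, instance type, node counts, key pair name, cluster name, and the file names configurations.json (the software settings JSON from step 1) and bootstrap.sh are all assumptions. The step that runs spark-submit and the security group settings are left out. Setting DRY_RUN prints the command instead of executing it.

```shell
#!/bin/bash
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Launch an EMR cluster with Spark, custom software settings and a bootstrap action.
create_etl_cluster() {
  local release_label="$1"   # assumed: an EMR release shipping a matching Spark version
  local instance_type="$2"   # assumed: EC2 instance type for all nodes
  local key_name="$3"        # assumed: name of an existing EC2 key pair

  run aws emr create-cluster \
    --name etl \
    --release-label "$release_label" \
    --applications Name=Spark \
    --configurations file://configurations.json \
    --bootstrap-actions Path=s3://etl-deps/bootstrap.sh \
    --instance-groups \
      "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$instance_type" \
      "InstanceGroupType=CORE,InstanceCount=2,InstanceType=$instance_type" \
    --ec2-attributes "KeyName=$key_name" \
    --use-default-roles
}

# Usage (all three values are placeholders):
#   DRY_RUN=1 create_etl_cluster <release-label> <instance-type> <key-pair-name>
```

Running it with DRY_RUN first makes it easy to review the full command before any billable resources are created.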


Note: Unlike an EC2 instance, a running cluster cannot be stopped or paused. For EMR, termination is the only option.


Opening ETL 

After starting the cluster, choose it from the list. Its public address is shown below.

Summary tab of ETL cluster with master public DNS

In the example, the public address is ec2-18-184-190-45.eu-central-1.compute.amazonaws.com. You can type this address into the browser along with port 8081.

ETL application running on Amazon EMR

Voilà, ETL is working.