EMR PySpark Batch Processing Project
Project Description:
In this project I implemented a batch processing pipeline using Amazon EMR and PySpark on fictional supermarket data. The following technologies were used:
PySpark
AWS
EMR
S3
Athena
EC2 Key Pairs
Architecture
EMR Setup:
Navigate to the Amazon EMR page and click on Create cluster.
Give the cluster a name and choose Spark as the application.
Be sure to set automatic termination after a specific idle time, since EMR does not have a free tier.
In the security configuration, select an existing EC2 key pair if you have one, or create a new one and download the .pem file. This is needed to access the master node with PuTTY (PuTTY requires the .pem to be converted to a .ppk with PuTTYgen).
Next, create IAM roles for the service role and the instance profile.
Then navigate to the cluster's security group and edit the inbound rules to allow SSH from your IP. This is needed to be able to connect to the master node with PuTTY.
Now open PuTTY and enter the public DNS of the master node.
Enter the path to the key file of the key pair associated with the EMR cluster and connect.
You will be connected to the master node. Log in as the user hadoop. You can now run Spark code on your cluster.
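For reference, a comparable cluster can also be created programmatically. Below is a minimal boto3 sketch, not the exact console setup used in this project; the cluster name, key pair, bucket, and instance types are hypothetical placeholders, and the IAM roles shown are the EMR defaults.

```python
import boto3

# Minimal sketch of creating a comparable EMR cluster with boto3.
# All names (cluster, key pair, bucket) are hypothetical placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="supermarket-batch-cluster",          # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",                 # any recent EMR release with Spark
    Applications=[{"Name": "Spark"}],
    LogUri="s3://your-log-bucket/emr-logs/",   # hypothetical log bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "your-key-pair",         # key pair used for PuTTY/SSH access
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Auto-terminate after one hour idle, since EMR has no free tier.
    AutoTerminationPolicy={"IdleTimeout": 3600},
    JobFlowRole="EMR_EC2_DefaultRole",         # instance profile
    ServiceRole="EMR_DefaultRole",             # service role
)
print("Cluster ID:", response["JobFlowId"])
```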
Running PySpark code on EMR:
Once you are connected to the cluster, use the vi editor to create a script file.
Enter the PySpark code you want to run in that file.
In this case I have written a simple PySpark script that filters data stored in an S3 bucket.
Run the spark-submit command to execute your script.
Please find below the screenshots and the code.
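As a rough illustration, a filtering script of this kind might look like the sketch below. The bucket paths, file name, and the Total column are hypothetical placeholders, not the actual dataset schema.

```python
# filter_sales.py -- run on the master node with:
#   spark-submit filter_sales.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SupermarketFilter").getOrCreate()

# Read the raw supermarket data from S3 (hypothetical bucket/path).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://your-bucket/input/supermarket_sales.csv"))

# Keep only high-value transactions ("Total" is a hypothetical column).
filtered = df.filter(F.col("Total") > 100)

# Write the result back to S3 as Parquet for downstream querying.
filtered.write.mode("overwrite").parquet("s3://your-bucket/output/filtered/")

spark.stop()
```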
Setting up an Athena table without Glue Crawler:
I have set up an Athena table manually, without using a Glue crawler.
The columns need to be added manually in the Athena table creation window.
I have added screenshots detailing the steps to create an Athena table.
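The console form ultimately produces a CREATE EXTERNAL TABLE statement. As a sketch of the equivalent programmatic route, the DDL can also be submitted through boto3; the database, table name, columns, and S3 locations below are hypothetical placeholders, not the project's actual schema.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# DDL equivalent to adding the columns manually in the console.
# Table name, columns, and S3 locations are hypothetical placeholders.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS supermarket_sales (
    invoice_id string,
    branch     string,
    city       string,
    total      double
)
STORED AS PARQUET
LOCATION 's3://your-bucket/output/filtered/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
```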