Kafka Athena Streaming Project
Project Description:
In this project I implemented an end-to-end data engineering pipeline that simulates real-time stock market data using Kafka. The following technologies were used:
Programming Language - Python
Apache Kafka
Cloud - AWS
EC2
S3
Glue Crawler
Glue Catalog
Athena
Architecture
EC2 and Kafka Setup:
Choose Launch instance from the AWS EC2 page.
Select a free-tier-eligible Amazon Machine Image. I have chosen Amazon Linux for this project.
Choose a free-tier instance type from the instance type drop-down.
Select an existing key pair if available, or create a new one for this EC2 instance.
Allow SSH traffic from your IP, or from all IPs (not recommended).
Choose the required storage. Up to 30 GB is free, though nowhere near that much is needed for this project.
Launch the instance.
We can connect to the instance with the key pair using PuTTY, or connect directly from the EC2 console.
The next step is to set up Kafka on the EC2 instance.
Please follow the steps highlighted below to set up Kafka. In short: install Java, download and extract Kafka, start ZooKeeper, start the Kafka broker, and set the broker's advertised listener to the instance's public IP in server.properties so the producer and consumer can reach it from outside AWS.
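Kafka also needs a topic for the stock records. That is normally done with the kafka-topics.sh script that ships with Kafka, but as a minimal sketch it can also be done from Python with kafka-python's admin client (the broker address and the topic name demo_stock below are placeholders, not the project's actual values):

# Sketch: create the Kafka topic from Python instead of kafka-topics.sh.
# Assumes kafka-python is installed (pip install kafka-python); the broker
# address and topic name are placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="<EC2-PUBLIC-IP>:9092")
admin.create_topics([NewTopic(name="demo_stock", num_partitions=1, replication_factor=1)])
admin.close()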
Python code to simulate streaming stock data:
I used a sample stock dataset and the following code to simulate streaming data.
I created a producer object and used a while loop to continuously send records to the Kafka topic, as sketched below.
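A minimal sketch of that producer, assuming kafka-python and pandas are installed and the sample data sits in a CSV file (the file name indexProcessed.csv, broker address, and topic name are placeholders):

# Producer sketch: pick a random row from the sample stock CSV and send it
# to the Kafka topic as JSON, in an endless loop with a short pause.
import json
import time
import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_csv("indexProcessed.csv")  # placeholder sample dataset

while True:
    record = df.sample(1).to_dict(orient="records")[0]
    producer.send("demo_stock", value=record)  # placeholder topic name
    time.sleep(1)  # throttle so the stream is easy to follow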
Python code to get the data from the consumer and upload it to an S3 bucket:
I set up an S3 bucket.
Then I created a consumer object and dumped each record as JSON to the previously created bucket, as sketched below.
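A minimal sketch of that consumer, assuming the s3fs library for the upload and AWS credentials already configured on the machine (the bucket name kafka-stock-data-demo is a placeholder):

# Consumer sketch: read records off the Kafka topic and write each one to
# the S3 bucket as its own JSON file. Assumes kafka-python and s3fs are
# installed and AWS credentials are available (e.g. via aws configure).
import json
from kafka import KafkaConsumer
from s3fs import S3FileSystem

consumer = KafkaConsumer(
    "demo_stock",                              # placeholder topic name
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",  # placeholder broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

s3 = S3FileSystem()

for count, message in enumerate(consumer):
    # One JSON object per file keeps the layout simple for the Glue crawler.
    with s3.open(f"s3://kafka-stock-data-demo/stock_market_{count}.json", "w") as f:
        json.dump(message.value, f)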
With both scripts running, data starts landing in the S3 bucket.
Setting up a Glue crawler to crawl the S3 bucket:
Navigate to the AWS Glue Console.
Navigate to the crawler page and click Create crawler.
Click Next and, for the "Is your data already mapped to Glue tables?" question, check the "Not yet" option.
Select Add data source and choose S3. Click Browse S3 and choose the bucket where the data is streamed.
In the next window, choose an IAM role if you already have one, or create a new IAM role for Glue and give it S3 access.
On the next page, create a new database or choose an existing one.
Run the crawler.
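Once the crawler exists, it can also be started programmatically. A minimal sketch with boto3 (the crawler name stock-data-crawler is a placeholder):

# Sketch: start the Glue crawler from Python instead of the console.
# Assumes boto3 is installed and AWS credentials are configured.
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="stock-data-crawler")  # placeholder crawler name
print(glue.get_crawler(Name="stock-data-crawler")["Crawler"]["State"])  # e.g. RUNNING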
Exploring the data in Athena:
Navigate to the Athena console.
In the query editor you will find the database you created in the crawler step.
It will contain a table named after the S3 bucket.
The table keeps picking up new records as more data is dumped into the S3 bucket; this constant growth is showcased in the screenshots.
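One way to watch that growth, as a minimal sketch, is to re-run a row count through the Athena API with boto3 (the database, table, and results-bucket names are placeholders; Athena needs an S3 output location for query results):

# Sketch: run a row-count query against the crawled table via the Athena API.
# Re-running it while the consumer is live shows the count increasing.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM stock_data_table",   # placeholder table
    QueryExecutionContext={"Database": "stock_data_db"},   # placeholder database
    ResultConfiguration={"OutputLocation": "s3://athena-results-demo/"},
)
print(response["QueryExecutionId"])  # use this id to fetch the results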