# BLOG2: ETL using AWS Glue x Lambda x Sqlectron x S3
Creating an S3 bucket as the source:
I have provided a happiness-survey dataset that records whether people from various cities are happy or unhappy. You can download this dataset from Kaggle, or use any other dataset from Kaggle.
Note: If you are using a different dataset, create the table columns accordingly in Sqlectron.
Now, create an S3 bucket with the default settings and upload happydata.csv into it. This file is used for the first ETL job run. Later, we will upload a second CSV file to check whether the updated data gets added to the table in Sqlectron.
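If you prefer to script this step instead of using the console, here is a minimal boto3 sketch. The bucket name is a placeholder, and bucket creation in regions other than us-east-1 also needs a LocationConstraint:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name -- use your own globally unique name.
bucket_name = "my-etl-source-bucket"

# Create the bucket with default settings (outside us-east-1, pass CreateBucketConfiguration).
s3.create_bucket(Bucket=bucket_name)

# Upload the first CSV file that the initial ETL job run will read.
s3.upload_file("happydata.csv", bucket_name, "happydata.csv")
```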
Glue Setup:
The first step is to create an IAM role, since AWS Glue will use it in all the following sections to read data from the source and write it to the target.
For the purpose of running this project, I have attached full-access policies for the services below to my IAM role. Grant full access only if you run into issues while working through this configuration; otherwise, it is advised to grant only the required access:
-AdministratorAccess
-AWSGlueServiceRole
-CloudWatch full access
-Glue full access
-RDS full access
-S3 Source bucket access
Assuming you have created an IAM role with the above policies attached, let’s create a crawler and a connection.
Creating source crawler:
Go to AWS Glue and select Crawlers under Data Catalog in the left panel. Select the ‘Create crawler’ option, provide the name “TargetS3”, and hit the Next button as shown below;
In the data source step, we specify where this crawler will fetch the S3 file from so it can read the schema (structure) from the file (happydata.csv). The schema is the structure and the data types used in the columns of the file. In our case, happydata.csv contains integer values, so the crawler will infer the columns as bigint when it runs and store the schema in the Glue database.
The data source should look like this:
Provide the IAM role we created above:
In the next step, create a database to store the schema of the S3 file we read. Hit the ‘Add database’ button, provide a name for this source database, then come back to this screen, hit the refresh button, and select your newly created database. I have created one with the name ‘s3-glue-output’.
In the next step, review the config and hit the ‘create crawler’ button on the bottom right!
Select the crawler and hit the run button from the top right.
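The same crawler can also be created and started with boto3. This is only a sketch; the role name and bucket path are placeholders from this walkthrough, while the crawler name and Glue database match the ones used above:

```python
import boto3

glue = boto3.client("glue")

# Placeholder role and bucket path -- substitute the ones you created earlier.
glue.create_crawler(
    Name="TargetS3",
    Role="my-glue-etl-role",                       # IAM role from the Glue setup section
    DatabaseName="s3-glue-output",                 # Glue database that stores the inferred schema
    Targets={"S3Targets": [{"Path": "s3://my-etl-source-bucket/"}]},
)

# Equivalent of hitting the Run button in the console.
glue.start_crawler(Name="TargetS3")
```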
The below image is from the database schema:
Let us set up Sqlectron:
Download Sqlectron and follow the steps below for setup;
-Go to AWS RDS to set up MySQL database.
-Select the Production template. For Availability & Durability, choose Single DB instance.
-Keep the DB cluster identifier as it is. Provide ‘admin’ as the master username and a password of your choice.
-Select db.t2.micro or db.t3.micro as the DB instance class if available; otherwise, choose the class with the lowest vCPUs so that it costs little or nothing at all.
-Storage type as GP2.
-Choose the default VPC. Public access ‘Yes’.
-Security group as default and Database port as ‘3306‘.
-Database authentication as password authentication and that’s it. You are good to go!!
-Remember to change the inbound/outbound rules of the security group of your RDS instance. It should look like this:
The 1st rule is added so that Sqlectron can connect to our RDS instance.
The 2nd rule is added so that the Glue connection can be set up successfully.
If you face any issues while connecting from Sqlectron, check that you have selected ‘Yes’ for public access on the instance and that its security group looks like the image above.
Once you have set up your RDS database, we can connect to it from Sqlectron using the username and password we provided.
-Once the RDS MySQL instance is up, copy the endpoint from the ‘Connectivity & security’ section of your RDS console and paste it as shown in the image below.
Hit the connect button from the screen as below:
You can see our database name. Enter the CREATE TABLE statement in the console and hit Execute.
Now, in order to connect the two data stores, we have to create an AWS Glue connection.
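The exact CREATE TABLE statement depends on your dataset. Below is a minimal sketch that assumes the all-integer columns described earlier; the column names, endpoint, and password are placeholders, and the statement is run here through Python/pymysql only as an alternative to typing it into the Sqlectron editor:

```python
import pymysql  # pip install pymysql

# Placeholder connection details -- use your RDS endpoint, user, and password.
conn = pymysql.connect(
    host="your-rds-endpoint.rds.amazonaws.com",
    user="admin",
    password="your-password",
    database="techbaar",
)

# Placeholder column names -- match them to the headers in your happydata.csv.
create_table_sql = """
CREATE TABLE IF NOT EXISTS happydata (
    infoavail     BIGINT,
    housecost     BIGINT,
    schoolquality BIGINT,
    policetrust   BIGINT,
    streetquality BIGINT,
    happy         BIGINT
);
"""

with conn.cursor() as cur:
    cur.execute(create_table_sql)
conn.commit()
conn.close()
```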
Creating connection:
Provide a name, Connection type as ‘Amazon RDS’, and database engine as ‘MySQL’.
Go to the connection access tab and choose the JDBC URL. If the RDS configuration is set up properly, your database instance URL will be available in the list. Provide the same username and password you used while creating your DB instance in RDS, and hit the ‘Create connection’ button on the bottom right.
Once this is done, it will automatically select the VPC, Subnet, and security group.
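For reference, the same connection can be defined with boto3. Every value below (endpoint, credentials, subnet, security group, availability zone) is a placeholder to be replaced with your own:

```python
import boto3

glue = boto3.client("glue")

# Placeholder endpoint, credentials, and network details -- use your own values.
glue.create_connection(
    ConnectionInput={
        "Name": "rds-mysql-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://your-rds-endpoint.rds.amazonaws.com:3306/techbaar",
            "USERNAME": "admin",
            "PASSWORD": "your-password",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-xxxxxxxx",
            "SecurityGroupIdList": ["sg-xxxxxxxx"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```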
Please verify the following with respect to the connection:
-The subnets associated with the VPC are the same as those used by the connection.
-There is a route table associated with the VPC.
-There is a gateway VPC endpoint for S3, and that endpoint is associated with the same route tables.
-There are two security groups:
The one you specified while setting up Sqlectron.
The default (or your own) security group with the inbound and outbound rules specified in the screenshots below.
Once the above configuration has been set up, click on Actions and select ‘Test connection’.
The connection is successful if you see the confirmation below:
**In case you see an error, an AWS pop-up will provide you with a resolution link.**
Creating target crawler:
Use the same steps as for the source crawler; just change the data source as below;
In the path, techbaar is the database name we created in Sqlectron, and % specifies all the schemas/tables inside it.
We shall run this crawler so that it can fetch the schema of the Sqlectron table.
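As with the source crawler, this can be scripted with boto3. The crawler name, role, and Glue database below are placeholders (the post only fixes the connection name and the techbaar/% path):

```python
import boto3

glue = boto3.client("glue")

# Placeholder crawler name, role, and Glue database -- use the names from your setup.
glue.create_crawler(
    Name="TargetRDS",
    Role="my-glue-etl-role",
    DatabaseName="rds-glue-output",                # a separate Glue database for the MySQL schema
    Targets={
        "JdbcTargets": [
            {"ConnectionName": "rds-mysql-connection", "Path": "techbaar/%"}
        ]
    },
)

glue.start_crawler(Name="TargetRDS")
```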
Run the Target crawler now and you will see the below schema in your target crawler’s database:
Creating an ETL Job:
This job will run whenever we add a file to the S3 bucket: the upload calls Lambda function 1, which runs the TargetS3 crawler, and once the crawler has completed successfully, Lambda function 2 triggers the ETL job.
-In the job details, provide the job name and the IAM role, and keep the other settings as they are.
I prefer the old UI for creating the job; you can use the new one if you wish!
Our ETL job configuration is done. Let us run this job now; we should then see all the entries from happydata.csv in S3 appear in the table in Sqlectron.
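Behind the scenes, the visual editor generates a PySpark script. A minimal sketch of roughly what that script boils down to is shown below; the catalog database and table names (and the column mapping) are placeholders that depend on what your two crawlers actually created:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table that the TargetS3 crawler catalogued from happydata.csv
# (the table name is a placeholder -- check your Glue database for the real one).
source = glue_context.create_dynamic_frame.from_catalog(
    database="s3-glue-output", table_name="happydata_csv"
)

# Map each source column to the matching column in the MySQL table (placeholder mapping).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("happy", "bigint", "happy", "bigint")],  # extend for every column
)

# Write into the MySQL table catalogued by the target crawler, via the Glue connection.
glue_context.write_dynamic_frame.from_catalog(
    frame=mapped, database="rds-glue-output", table_name="techbaar_happydata"
)

job.commit()
```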
At present, the table is empty. Now, when we run the job...
As you can see in the above image, all the entries have been copied from the source (S3) to the target (the MySQL table viewed in Sqlectron).
Creating Lambda functions to automatically trigger when an S3 object is uploaded and to execute the ETL job:
Open the Lambda console, choose Python as the runtime, paste the code from the link into the editor, and hit the Deploy button at the top:
Remember to provide Glue and S3 access to this Lambda function’s IAM role.
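The linked code isn’t reproduced here, but a function of this kind can be as small as the sketch below; it assumes the TargetS3 crawler name used earlier:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by the S3 "object created" event notification configured below;
    # it simply starts the source crawler so the new file's schema is picked up.
    glue.start_crawler(Name="TargetS3")
    return {"statusCode": 200, "body": "TargetS3 crawler started"}
```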
Now go to your S3 bucket → Properties → Event notification, and create an event as below:
And choose the destination as Lambda:
Now, whenever a new file is uploaded to the bucket, the TargetS3 crawler will be triggered automatically, and once it reaches the ‘Succeeded’ state, the ETL Lambda trigger function will execute and start the ETL job.
Creating ETL lambda trigger function:
-Create a Lambda function, choose Python as the runtime, and paste the code from the link.
Provide CloudWatch full access and Glue full access to its IAM role.
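Again, the linked code isn’t reproduced here, but a sketch of such a function is below; the job name is a placeholder for whatever you named your ETL job:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Fired by the CloudWatch/EventBridge rule when the crawler state changes to Succeeded.
    # Placeholder job name -- use the name you gave your ETL job.
    response = glue.start_job_run(JobName="s3-to-rds-etl-job")
    return {"statusCode": 200, "body": response["JobRunId"]}
```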
Now, go to the CloudWatch console and create an event rule to trigger the above Lambda function.
-Provide a rule name → hit Next
-Event source as AWS events or EventBridge partner events
-For sample events, choose AWS events and pick ‘Glue Crawler State Change’ from the list.
-For the creation method, choose Custom pattern, and in the event pattern editor paste the JSON code from the link (a sketch of this pattern is shown after this list) → hit Next
-For Target 1, select AWS service, choose Lambda function as the target, and pick the function you just created for the ETL job run → hit Next
-No need to provide tags; go to the review and create step, check your rule, and hit the ‘Create rule’ button on the bottom right.
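For reference, here is a boto3 sketch of an event pattern that matches the TargetS3 crawler reaching the Succeeded state; the rule name and Lambda ARN are placeholders, and the same pattern dictionary can be pasted as JSON into the console’s custom pattern editor. Note that when the rule is created from code rather than the console, you also have to grant it permission to invoke the function (the console does this for you):

```python
import json
import boto3

events = boto3.client("events")

# Match the TargetS3 crawler changing state to Succeeded.
pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Crawler State Change"],
    "detail": {"crawlerName": ["TargetS3"], "state": ["Succeeded"]},
}

events.put_rule(Name="targets3-crawler-succeeded", EventPattern=json.dumps(pattern))

# Point the rule at the ETL-trigger Lambda (placeholder ARN).
events.put_targets(
    Rule="targets3-crawler-succeeded",
    Targets=[{
        "Id": "etl-trigger-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:etl-trigger",
    }],
)
```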
Your Event should look like below:
We are done with the setup of this project!!
Now, when you upload an updated file to the S3 bucket, our 1st Lambda function will be triggered and it will start the TargetS3 crawler. Once the crawler finishes and its state changes from Running to Succeeded, the CloudWatch event rule will capture the change and trigger our 2nd Lambda function, which will execute the ETL job!
Congratulations on executing this project and coming to the end of it!!🎉
If you face any issues in configurations then kindly drop a mail(jayshah1299@gmail.com) to me for assistance!💻
You can run an ad or news item in our newsletters, which gets you ahead by showcasing your product/article/project/innovation in tech and bringing your target audience closer to you.