Implement and manage data ingestion pipelines from diverse sources such as Kafka, RDBMS (Postgres) using CDC (Change Data Capture), and file systems (CSV), following Medallion Architecture principles (see the ingestion sketch after this list)
Develop and optimize data transformations using PySpark and SQL to handle data volumes ranging from megabytes to gigabytes, depending on the source (transformation sketch below)
Conduct unit and integration testing to ensure the accuracy and reliability of data transformations and pipelines (testing sketch below)
Work with AWS technologies, including S3 for data storage and Docker on AWS for containerized applications (S3 sketch below)
Implement and manage infrastructure as code using Terraform, for example creating S3 buckets and managing Databricks service principals
Deploy and manage solutions through CI/CD pipelines, particularly CircleCI, to ensure automated and repeatable deployments
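As a flavour of the ingestion work, below is a minimal PySpark sketch of a bronze-layer ingest, assuming a SparkSession with the Delta Lake and Kafka connector packages available; the bucket, topic, and path names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

# File-based source: land raw CSV as-is into the bronze layer.
raw_csv = (
    spark.read
    .option("header", "true")
    .csv("s3://example-bucket/landing/orders/")  # hypothetical bucket
)
raw_csv.write.format("delta").mode("append").save("s3://example-bucket/bronze/orders/")

# Streaming source: read a Kafka topic (e.g. CDC events from Postgres)
# and append the raw key/value payload to bronze for later parsing.
cdc_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "pg.public.orders")           # hypothetical topic
    .load()
)
query = (
    cdc_stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")
    .start("s3://example-bucket/bronze/orders_cdc/")
)
```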
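The transformation work typically looks like this hedged sketch of a bronze-to-silver step, shown in both the DataFrame API and Spark SQL; the table and column names are assumptions, not a real schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-transform").getOrCreate()

bronze = spark.read.format("delta").load("s3://example-bucket/bronze/orders/")

silver = (
    bronze
    .dropDuplicates(["order_id"])                                  # de-duplicate on the key
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))   # enforce types
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("order_id").isNotNull())                         # drop unusable rows
)

# The same logic expressed in Spark SQL, for teams that prefer it:
bronze.createOrReplaceTempView("bronze_orders")
silver_sql = spark.sql("""
    SELECT DISTINCT order_id,
           CAST(amount AS DECIMAL(18,2)) AS amount,
           TO_TIMESTAMP(order_ts)        AS order_ts
    FROM bronze_orders
    WHERE order_id IS NOT NULL
""")

silver.write.format("delta").mode("overwrite").save("s3://example-bucket/silver/orders/")
```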
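Testing in this stack is commonly done with pytest against a local SparkSession, roughly as below; the function under test (clean_orders) and its columns are hypothetical.

```python
import pytest
from pyspark.sql import SparkSession, functions as F


def clean_orders(df):
    """Example transformation: drop rows with a null order_id."""
    return df.filter(F.col("order_id").isNotNull())


@pytest.fixture(scope="session")
def spark():
    # Single-threaded local session keeps tests fast and deterministic.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_clean_orders_drops_null_ids(spark):
    df = spark.createDataFrame(
        [("o1", 10.0), (None, 5.0)],
        ["order_id", "amount"],
    )
    result = clean_orders(df)
    assert result.count() == 1
    assert result.first()["order_id"] == "o1"
```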
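On the AWS side, interacting with S3 from Python is usually done via boto3, as in this short sketch; the bucket and key names are placeholders, and credentials are assumed to come from the environment.

```python
import boto3

s3 = boto3.client("s3")

# Upload a local export into the landing zone (placeholder names).
s3.upload_file("exports/orders.csv", "example-bucket", "landing/orders/orders.csv")

# List what landed, to confirm the upload.
response = s3.list_objects_v2(Bucket="example-bucket", Prefix="landing/orders/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```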
Requirements:
Minimum 4-5 years of professional experience
Proficiency in SQL and Python
Strong experience with AWS cloud services
Hands-on experience with Databricks
Knowledge of ETL processing
Effective communication skills in English (minimum B2 level)