Design and build automated data pipelines to ingest, process, and transform data from various sources.
Leverage Python and PySpark for data processing and implementing transformation algorithms.
Work with Jupyter notebooks to document and test data processing workflows.
Optimize and ensure scalability of ETL (Extract, Transform, Load) processes.
Collaborate with other data engineers, analysts, and architects to build a robust data architecture (Data Fabric).
Ensure data quality, availability, and integrity throughout the processing pipeline.
Work with cloud-based tools and databases on Azure.
Minimize processing times and ensure timely delivery of data to end-users.
Our requirements
Total experience – 4 years or more.
Expertise in Python – advanced knowledge of Python and its data processing libraries, such as pandas and NumPy.
Experience with PySpark – hands-on experience with Apache Spark and its integration with large-scale data processing systems.
Experience with ETL and data pipelines – designing, implementing, and optimizing ETL processes in cloud or on-premises environments.
Experience with cloud platforms – Azure.
SQL proficiency – solid experience working with relational databases (e.g., PostgreSQL, MySQL, SQL Server).
Experience with Jupyter notebooks – using Jupyter for prototyping and testing data transformation processes.
Understanding of data architecture (Data Fabric) – knowledge of building data infrastructure and integrating multiple data sources across the organization.
Analytical skills – ability to solve complex problems and optimize data processing workflows.
Collaboration and communication skills – ability to work effectively in a cross-functional team and communicate with various stakeholders.