In modern data engineering, effective CI/CD pipelines are crucial for maintaining efficient workflows and delivering high-quality data products on time. DataOps CI/CD provides a powerful platform for automating these pipelines, allowing developers to define, execute, and monitor their build, test, and deployment processes seamlessly. However, achieving granular control over job execution based on various conditions can pose challenges. In this article, we'll explore a real-world scenario and demonstrate how to control the execution of downstream jobs based on the success or failure of upstream jobs, while allowing certain failures to be tolerated without halting the pipeline.
Scenario Overview:
Consider a scenario where a software development team is managing a complex data pipeline consisting of multiple ingestion and curation jobs. The team wants to ensure that downstream curation jobs are only executed if upstream ingestion jobs succeed. However, they also want to accommodate certain failures in the ingestion process without halting the pipeline entirely. Achieving this fine-grained control over job execution requires careful orchestration within the DataOps CI/CD pipeline.
Example Implementation: Let's dive into a concrete implementation to show how this workflow control can be achieved in DataOps CI/CD. We'll break it down into two main parts: a parent job responsible for data ingestion and a child job responsible for data curation.
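Both jobs assume the pipeline declares its two stages up front. A minimal sketch of that top-level configuration (the stage names match those used in the jobs below; the rest of the pipeline file is omitted):

stages:
  - ingestion
  - curation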
Parent Job (Data Ingestion):
parent_job:
  stage: ingestion
  allow_failure: true          # tolerated ingestion failures do not halt the pipeline
  script:
    - echo "Executing data ingestion process"
    # Actual ingestion script goes here
  after_script:
    # after_script runs whether the job succeeded or failed
    - echo "PARENT_JOB_STATUS=$CI_JOB_STATUS" > exported_variables.env
  artifacts:
    when: always               # upload the artefact even when the job fails
    reports:
      dotenv: exported_variables.env
In this job, the data ingestion process is executed, and the job status (taken from the predefined $CI_JOB_STATUS variable) is written to exported_variables.env as PARENT_JOB_STATUS in an after_script, which runs whether the script succeeds or fails. The file is published as a dotenv artefact so downstream jobs can read the status, while allow_failure: true and artifacts: when: always ensure that a failed ingestion neither halts the pipeline nor suppresses the artefact.
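If only certain ingestion failures should be tolerated, rather than any failure at all, allow_failure can also be scoped to specific exit codes (a GitLab-style keyword; this sketch assumes DataOps CI/CD supports it and uses a hypothetical run_ingestion.sh script that exits with code 137 for a tolerable, partial failure):

parent_job:
  stage: ingestion
  allow_failure:
    exit_codes:
      - 137                    # only this exit code is tolerated; any other failure halts the pipeline
  script:
    - ./run_ingestion.sh       # hypothetical ingestion script
  after_script:
    - echo "PARENT_JOB_STATUS=$CI_JOB_STATUS" > exported_variables.env
  artifacts:
    when: always
    reports:
      dotenv: exported_variables.env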
Child Job (Data Curation):
child_job:
  stage: curation
  script:
    - |
      if [[ "$PARENT_JOB_STATUS" == "success" ]]; then
        echo "Parent job was successful, proceeding with data curation."
        # Actual curation process goes here
      else
        echo "Parent job failed or had allowed failure, taking alternative action."
        # Perform alternative action or retain old data
      fi
  dependencies:
    - parent_job               # pull the dotenv artefact exported by the parent job
In this job, the script checks the status of the parent job via the $PARENT_JOB_STATUS variable inherited from the dotenv artefact. If the parent job was successful, the data curation process proceeds as usual. Otherwise, alternative actions can be taken, such as retaining old data or skipping the curation process altogether.
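As a variation, the same artefact hand-off can be expressed with needs instead of dependencies, which also lets curation start as soon as ingestion finishes rather than waiting for the whole stage to complete. A minimal sketch, assuming the same parent_job and dotenv artefact as above:

child_job:
  stage: curation
  needs:
    - job: parent_job
      artifacts: true          # pull the dotenv report so PARENT_JOB_STATUS is available
  script:
    - echo "Parent job finished with status: $PARENT_JOB_STATUS"
    # Same success/failure branching as in the script above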
Conclusion:
By leveraging DataOps CI/CD features such as environment variables, artefacts, and job dependencies, we've demonstrated how to achieve fine-grained control over job execution in a complex data pipeline scenario. This example showcases the flexibility and power of DataOps CI/CD in orchestrating workflows to meet specific requirements. As data engineering teams continue to evolve their CI/CD practices, mastering workflow control becomes essential for optimising efficiency and ensuring the reliable delivery of data products.