Best Practices for Combining DataOps CI/CD Rules Using the Extends Keyword

  • 7 February 2024

DataOps offers powerful features for automating the testing, building, and deployment of your data management projects. Among these features is the ability to define job rules, which dictate when a job should be executed based on certain conditions. However, when using the extends keyword to include additional logic from multiple base jobs, there are some important considerations to keep in mind to ensure the rules are combined effectively.

Understanding the Problem

Let's start by understanding the issue at hand. When using the extends keyword to inherit rules from multiple base jobs, DataOps’s underlying subsystem does not merge the lists of rules. Instead, it replaces them. This means that only the rules from the last base job included via extends will take effect. As a result, the rules defined in earlier base jobs are overridden by those in the last one. 

In essence, a job can list several base jobs under extends, but for a list-valued key such as rules it ends up with exactly one list: the one from the base job listed last. The entries from the other base jobs are not aggregated. The outcome is predictable, but it is easy to overlook if you expect the rule sets to be combined.
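The replacement behaviour described above can be sketched with two small base jobs. This is a minimal illustration, not a configuration from a real project: the names .base_branch_rules, .base_flag_rules, example_job, and the variables SOURCE, MODE, and RUN_FLAG are all hypothetical, and the commented result assumes the usual merge semantics in which mapping keys are merged while lists are replaced.

```yaml
# Hypothetical base jobs, for illustration only.
.base_branch_rules:
  variables:
    SOURCE: "branch"
  rules:
    - if: '$CI_COMMIT_REF_NAME == "master"'
      when: on_success

.base_flag_rules:
  variables:
    MODE: "flag"
  rules:
    - if: '$RUN_FLAG == "TRUE"'
      when: on_success

example_job:
  extends:
    - .base_branch_rules
    - .base_flag_rules   # listed last, so its rules list wins
  script:
    - echo "example"

# Expected effective configuration once extends is resolved
# (assumption: mappings are merged, lists are replaced):
#
# example_job:
#   variables:
#     SOURCE: "branch"   # mapping keys from both base jobs survive
#     MODE: "flag"
#   rules:               # only the list from the last base job survives
#     - if: '$RUN_FLAG == "TRUE"'
#       when: on_success
#   script:
#     - echo "example"
```

Note that the branch condition from .base_branch_rules is silently dropped; no warning is raised for the discarded list.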

Example Scenario

Consider a scenario where you have a set of rules for determining when ingestion jobs should run in your CI/CD pipeline. Initially, you define rules to determine if ingestion should occur based on branch names and environment variables. Later, you introduce additional rules to choose between full or incremental ingestion.

.should_run_ingestion:
  rules:
    # Run on master or qa
    - if: '$CI_COMMIT_REF_NAME == "master" || $CI_COMMIT_REF_NAME == "qa"'
      when: on_success
    # Run if FORCE_INGESTION is set to this job name
    - if: '$FORCE_INGESTION == $INGESTION_JOB && $INGESTION_JOB != null'
      when: on_success
    # Run if FORCE_ALL_INGESTION is set then ingest all jobs
    - if: '$FORCE_ALL_INGESTION == "TRUE"'
      when: on_success

.should_run_if_incremental:
  rules:
    # Run if INCREMENTAL_LOAD is set then run incremental on all jobs
    - if: '$INCREMENTAL_LOAD == "TRUE"'
      when: on_success

.should_not_run_if_incremental:
  rules:
    # Run if INCREMENTAL_LOAD is not set then run full ingestion
    - if: '$INCREMENTAL_LOAD != "TRUE"'
      when: on_success

Problem and Solution

In this example, the branch and force conditions live in .should_run_ingestion, while the incremental and full-load conditions each live in their own base job. If a job extends .should_run_ingestion together with one of the other two, only the rules from the base job listed last take effect. For instance, a job that lists .should_run_if_incremental last keeps only the INCREMENTAL_LOAD check and loses the branch, FORCE_INGESTION, and FORCE_ALL_INGESTION conditions, leading to unexpected behaviour.
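Applied to the base jobs from the example scenario, the problem looks roughly like this. The job name ingest_orders and its script line are hypothetical, and the outcomes listed in the comments assume standard rules evaluation, where a job is left out of the pipeline when none of its rules match.

```yaml
# Hypothetical job that tries to combine two rule sets.
ingest_orders:
  extends:
    - .should_run_ingestion
    - .should_run_if_incremental   # listed last, so only its rules are kept
  script:
    - echo "Running incremental ingestion"

# Rules the job is effectively left with:
#   - if: '$INCREMENTAL_LOAD == "TRUE"'
#     when: on_success
#
# Consequences (assuming no rule match means the job is not added):
#   INCREMENTAL_LOAD is "TRUE" on any branch            -> job is added
#   INCREMENTAL_LOAD is not "TRUE" on master or qa      -> job is not added
#   FORCE_ALL_INGESTION is "TRUE", INCREMENTAL_LOAD not -> job is not added
```

The first consequence is usually the surprising one: the job is no longer restricted to master and qa, because the branch check lived in the discarded list.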

Revised Approach

Instead of relying on several base jobs being merged, make each base job self-contained. Consolidate the rules into three distinct sets based on the different conditions: full ingestion, incremental ingestion, and no incremental ingestion. Each set spells out its complete condition by combining the INCREMENTAL_LOAD check with the branch and force conditions in a single if expression. Then extend each job with the one set of rules that matches its requirements.

# Full ingestion rules
.should_run_ingestion:
  rules:
    # Run on master or qa
    - if: '$CI_COMMIT_REF_NAME == "master" || $CI_COMMIT_REF_NAME == "qa"'
      when: on_success
    # Run if FORCE_INGESTION is set to this job name
    - if: '$FORCE_INGESTION == $INGESTION_JOB && $INGESTION_JOB != null'
      when: on_success
    # Run if FORCE_ALL_INGESTION is set then ingest all jobs
    - if: '$FORCE_ALL_INGESTION == "TRUE"'
      when: on_success
    #- when: never

# Incremental ingestion rules
.should_run_if_incremental:
  rules:
    # Run if INCREMENTAL_LOAD is set then run incremental on all jobs
    - if: '$INCREMENTAL_LOAD == "TRUE" && $FORCE_ALL_INGESTION == "TRUE"'
      when: on_success
    - if: '$INCREMENTAL_LOAD == "TRUE" && ($CI_COMMIT_REF_NAME == "master" || $CI_COMMIT_REF_NAME == "qa")'
      when: on_success
    - if: '$INCREMENTAL_LOAD == "TRUE" && $FORCE_INGESTION == $INGESTION_JOB && $INGESTION_JOB != null'
      when: on_success
    # Otherwise, do not run
    - when: never

.should_not_run_if_incremental:
  rules:
    # uses variable set in variables.yml
    # Run if INCREMENTAL_LOAD is not set then run oneoff ingestion on all jobs
    - if: '$INCREMENTAL_LOAD != "TRUE" && $FORCE_ALL_INGESTION == "TRUE"'
      when: on_success
    - if: '$INCREMENTAL_LOAD != "TRUE" && ($CI_COMMIT_REF_NAME == "master" || $CI_COMMIT_REF_NAME == "qa")'
      when: on_success
    - if: '$INCREMENTAL_LOAD != "TRUE" && $FORCE_INGESTION == $INGESTION_JOB && $INGESTION_JOB != null'
      when: on_success
    - when: never

Explanation

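In this revised approach, the rules for each type of ingestion scenario (full ingestion, incremental ingestion, and no incremental ingestion) are grouped into separate base jobs. Each base job defines the complete set of conditions under which ingestion should occur, so a job never has to pull rules from more than one base job. In the two load-mode-specific sets, the final "when: never" entry makes the fallback explicit: if none of the conditions match, the job does not run.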

Job Extension

When defining individual jobs, extend them with the appropriate set of rules based on their ingestion needs.

job1:
  extends:
    - .should_run_ingestion
  # Additional job configuration

job2:
  extends:
    - .should_run_if_incremental
  # Additional job configuration

job3:
  extends:
    - .should_not_run_if_incremental
  # Additional job configuration
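To make the pattern more concrete, here is a minimal sketch of what a fuller job definition might look like. The job name ingest_customers, the stage name, and the script line are hypothetical placeholders. Setting INGESTION_JOB in the job's variables block is an assumption about how the FORCE_INGESTION comparison is fed; check how your own project defines that variable.

```yaml
# Hypothetical job: name, stage, and script are placeholders.
ingest_customers:
  extends:
    - .should_run_if_incremental       # exactly one base job that defines rules
  stage: ingestion                     # assumed stage name
  variables:
    INGESTION_JOB: ingest_customers    # assumption: compared against $FORCE_INGESTION
  script:
    - echo "Running incremental ingestion for $INGESTION_JOB"
```

A job can still list further base jobs under extends for other settings. The thing to avoid is listing more than one base job that defines rules, since only the last of those lists would be kept.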

Conclusion

By organising the rules into distinct sets based on the different scenarios, and then extending individual jobs with the appropriate set of rules, you can ensure that the CI/CD pipeline behaves as expected according to the specific requirements of each job. This approach maintains clarity and flexibility while avoiding the pitfalls of rule overriding when using the extends keyword in DataOps CI/CD configurations.

For example, if a job should run regardless of the ingestion mode but still consider other conditions, you would extend .should_run_ingestion. If a job should only run for full ingestion, you'd extend .should_not_run_if_incremental, and for incremental ingestion, you'd extend .should_run_if_incremental.

This approach not only ensures that the rules are applied correctly but also makes it easier to maintain and understand the logic behind job execution in your pipeline.

