Is there a pattern for generating tasks at pipeline runtime based on external data?


Have you encountered this use case before? How are others solving it? 

There are two ways to do this: 

  • Within a single job: We do this regularly, since one job can still saturate all the resources on a host. The pipeline takes the form create_list_of_jobs | parallel --jobs 800% do_some_work.sh

    With --jobs 800%, GNU parallel runs 8 jobs per CPU core (32 concurrent jobs on a 4-core machine). It takes the first 32 parameters and calls do_some_work.sh once per parameter. As soon as the first call finishes, it starts the 33rd, and so on until all 5000 are completed. 

    This is a very effective way to perform a lot of work in parallel while still respecting upstream API concurrency limits. If your upstream only allows you 16 simultaneous connections, cap the job limit with --jobs 16. Note that there are many variations on this we can help you with. 

  • With many jobs: In DataOps.live, you can create jobs dynamically or programmatically. This involves a small script that essentially produces a large YAML block of job definitions.
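The single-job fan-out described in the first bullet can be sketched as below. The work unit is stood in for by a hypothetical echo command, and POSIX xargs -P is used as a portable stand-in for GNU parallel's --jobs, in case parallel is not installed; the GNU parallel form is shown in a comment.

```shell
# GNU parallel form (as in the post, with a concurrency cap of 16):
#   create_list_of_jobs | parallel --jobs 16 do_some_work.sh
#
# Portable stand-in: -P 4 caps concurrent worker processes at 4,
# and -I{} passes each input line as a single argument.
seq 1 8 \
  | xargs -P 4 -I{} sh -c 'echo "processed $1"' _ {} \
  | sort
# prints "processed 1" through "processed 8", one per line
```

As with parallel --jobs, the -P value is what you tune to match an upstream connection limit; the sort at the end only makes the nondeterministic completion order readable.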
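The many-jobs approach can be sketched as a trivial generator script that loops over external data and emits one YAML job per item. Everything here is illustrative: the table names, the ingest_ job naming, the stage value, and ingest.sh are assumptions for the sketch, not a DataOps.live schema.

```shell
# Hypothetical generator: emit one GitLab-CI-style YAML job per table.
# In practice the list would come from external data (an API, a query, a file).
for table in customers orders payments; do
cat <<EOF
ingest_${table}:
  stage: ingest
  script:
    - ./ingest.sh ${table}
EOF
done
```

The generated YAML block is then fed back into the pipeline so that each item becomes its own job.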

For more information about what you can do with REST APIs, check out Using the REST API.
