Advanced Disciplines

Jobs, CronJobs, and Operational Workflows

Batch workloads are where retries become storms. Learn Job and CronJob semantics, then design workflows that don’t amplify failure.

Authored as doctrine; evaluated as systems craft.

Doctrine

A Job is a contract: reach completion under a retry policy. A CronJob is a contract: start Jobs on a schedule under a declared concurrency policy. Both can become incidents when retries amplify external failures.

Kubblai doctrine: define failure posture explicitly. Retries without budgets are sabotage.

  • Define backoff and deadlines. Decide what ‘give up’ means.
  • Make Jobs idempotent or make side effects explicitly transactional.
  • Treat schedules as operational promises; monitor them.
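
One way to write down what "give up" means is a Job that combines a retry budget, a deadline, and a pod failure policy. The sketch below is illustrative only: the name `export-report` and the exit code 42 as a "do not retry" signal are assumptions, and `podFailurePolicy` needs a reasonably recent Kubernetes version.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: export-report              # hypothetical name
spec:
  backoffLimit: 3                  # retry budget: up to 3 retries after the first failure
  activeDeadlineSeconds: 900       # hard stop for the whole Job, retries included
  podFailurePolicy:
    rules:
      - action: FailJob            # assumed convention: exit code 42 means "retrying will not help"
        onExitCodes:
          containerName: export
          operator: In
          values: [42]
      - action: Ignore             # evictions and preemptions do not consume the retry budget
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never         # required when podFailurePolicy is set
      containers:
        - name: export
          image: alpine:3.20
          command: ["sh", "-c", "echo exporting"]
```

The point of the policy is that a non-retriable failure ends the Job on the first attempt instead of spending the whole budget against a system that has already said no.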

Retry semantics and backpressure

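Jobs retry failed pods up to `backoffLimit` (default 6), with an exponentially growing delay between attempts. CronJobs overlap unless told otherwise: `concurrencyPolicy` defaults to `Allow`, so a slow run can still be active when the next one starts. Under partial outages, overlap plus retries can create a self-inflicted load storm.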

If the job touches external systems, retries must be aligned with downstream rate limits and failure posture.

  • Use `activeDeadlineSeconds` to bound time spent in a broken state; it caps the whole Job, retries included, and takes precedence over `backoffLimit`.
  • Use `concurrencyPolicy: Forbid` when overlap is dangerous.
  • Prefer smaller, observable units of work over monolith jobs.
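
As a sketch of "smaller, observable units", an Indexed Job can split the work into numbered pieces and cap how many run at once. The name `reindex-shards`, the shard count, and the limits are assumptions chosen for illustration; size `parallelism` against whatever the downstream system can absorb.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: reindex-shards             # hypothetical name
spec:
  completionMode: Indexed          # each pod receives JOB_COMPLETION_INDEX (0 to completions-1)
  completions: 10                  # ten independent units of work
  parallelism: 2                   # at most two pods hit the downstream system at once
  backoffLimit: 4                  # retry budget shared across all ten indexes
  activeDeadlineSeconds: 1800      # whole-Job time bound
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: alpine:3.20
          command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```

Here `parallelism` acts as backpressure: even when every shard is failing, no more than two pods retry against the dependency at any moment, and the per-index pods show which unit of work is stuck.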

Cleanup and history discipline

Left unchecked, CronJobs generate history that becomes noise. Keep enough to debug, not enough to drown.

Use TTL cleanup for Jobs where appropriate. Deleting a Job also deletes its pods and their logs, so ship logs and metrics elsewhere first.

  • Set `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` deliberately.
  • Consider `ttlSecondsAfterFinished` for ephemeral jobs.
  • Capture job outcomes as metrics; don’t depend on object archaeology.
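
For a one-off Job with no CronJob history limits to lean on, TTL cleanup might look like the following. The name and the one-hour window are assumptions; pick a TTL long enough for a human or a collector to read the result.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-cleanup            # hypothetical name
spec:
  ttlSecondsAfterFinished: 3600    # delete the Job and its pods one hour after it finishes
  backoffLimit: 1
  activeDeadlineSeconds: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cleanup
          image: alpine:3.20
          command: ["sh", "-c", "echo cleaning"]
```

The same field can be set under a CronJob's `jobTemplate.spec`; in that case whichever mechanism fires first, the TTL or the history limit, removes the Job.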

A minimal CronJob manifest

A safe default posture for scheduled work: forbid overlap, bound runtime, keep history small.

CronJob (minimal posture)

yaml

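apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-check
spec:
  schedule: "0 2 * * *"              # every day at 02:00
  concurrencyPolicy: Forbid          # skip a run if the previous one is still active
  successfulJobsHistoryLimit: 2      # keep the last two successful Jobs
  failedJobsHistoryLimit: 3          # keep the last three failed Jobs for debugging
  jobTemplate:
    spec:
      backoffLimit: 2                # at most two retries per scheduled run
      activeDeadlineSeconds: 600     # stop the Job after ten minutes, retries included
      template:
        spec:
          restartPolicy: Never       # failed pods are replaced, not restarted in place
          containers:
            - name: check
              image: alpine:3.20
              command: ["sh","-c","echo ok"]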

Field notes

CronJobs fail silently when no one watches them. Treat batch as production: alerts, dashboards, and explicit ownership.
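
A minimal sketch of such an alert, assuming Prometheus scrapes kube-state-metrics and that the CronJob is the `nightly-check` example from the manifest above. The group name, the 26-hour threshold, and the `severity` and `owner` labels are placeholders, not recommendations.

```yaml
groups:
  - name: batch-cronjobs             # hypothetical rule group
    rules:
      - alert: CronJobNoRecentSuccess
        # Fires when the last successful run is older than the daily schedule plus slack.
        expr: time() - kube_cronjob_status_last_successful_time{cronjob="nightly-check"} > 26 * 3600
        for: 10m
        labels:
          severity: page             # assumed routing label
          owner: platform-team       # explicit ownership; hypothetical team name
        annotations:
          summary: "CronJob nightly-check has not succeeded in over 26 hours"
```

Alerting on time since last success catches both failed runs and runs that never started. A CronJob that has never succeeded may expose no such series at all, so a companion check for the missing metric is worth considering.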

If you run migrations as Jobs, define rollback posture. Migrations are rarely reversible; treat them as governance events.