~/jonas — a job queue on Postgres SKIP LOCKED

2026-05-21 · ~7 min

litequeue exists because I wanted a job queue with exactly one operational dependency, and that dependency was already running. I have a Postgres in every project. I do not want a second stateful thing to back up, monitor, and explain to future-me at 02:00.

The whole trick is one clause: FOR UPDATE SKIP LOCKED. It lets N workers poll the same table and each walk away with a disjoint set of rows, no coordination, no advisory-lock bookkeeping. Postgres has had it since 9.5. It is not new and it is not clever. That is exactly why I trust it.

The claim

A job is a row. Claiming a job is a single statement that finds the oldest runnable row, skips anything another worker already has locked, marks it running, and returns it:

-- Claim the oldest runnable job for worker $1 in a single statement.
UPDATE jobs
SET status     = 'running',
    locked_at  = now(),
    locked_by  = $1,
    attempts   = attempts + 1
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'queued'
      AND run_after <= now()
    ORDER BY run_after
    LIMIT 1
    FOR UPDATE SKIP LOCKED  -- skip rows another worker has locked
)
RETURNING id, payload;

The inner SELECT ... FOR UPDATE SKIP LOCKED takes a row lock; the outer UPDATE flips the status inside the same transaction. Two workers running this concurrently never collide — the second one's SELECT simply skips the locked row and grabs the next one. When the transaction commits, the lock is gone and the row is now running, which the predicate excludes, so nobody picks it up twice.

That is the entire concurrency story. No queue partitions, no consumer groups, no rebalancing. Add a worker, it competes for rows. Kill a worker, the others don't notice.

Visibility timeouts without a reaper

The annoying part of any at-least-once queue is the crashed worker: it claimed a job, marked it running, then died before finishing or failing. The naive fix is a background goroutine that periodically scans for stale running rows and resets them. I had that. I deleted it last week.

The reaper is just a query, so I folded it into the claim itself. A job is runnable if it is queued, or if it is running but its lock is older than the visibility timeout:

WHERE (status = 'queued'
       OR (status = 'running' AND locked_at < now() - $2::interval))
  AND run_after <= now()

Now a stalled job is reclaimed by whoever next polls, on the same code path as a fresh one. No separate timer, no second query, nothing to get out of sync. The attempts counter still increments, so a job that crashes a worker repeatedly will eventually cross a retry ceiling and get parked in a dead-letter status instead of looping forever.

The best background job is the one you didn't have to schedule. If a maintenance task can ride along on the hot path, it can't drift, can't fail silently, and can't be the thing you forgot to deploy.

What this is not

This is not a million-jobs-per-second design. The claim does an index scan and an update per job; throughput is bounded by how fast one Postgres can churn small write transactions. For me that's comfortably into the thousands per second on cheap hardware, which is several orders of magnitude more than I need. The day I outgrow it I will know, because I will have an entirely different set of problems and a budget to match.

It is also not ordered across workers: rows are claimed in run_after order, but two workers can finish them in either order. If you need strict per-key ordering you add a partition column and claim within a key, which SKIP LOCKED handles fine but which I won't pretend is free.

Indexes, briefly

One partial index does almost all the work:

CREATE INDEX jobs_runnable
ON jobs (run_after)
WHERE status IN ('queued', 'running');

Completed and dead rows fall out of the index entirely, so the table can accumulate history without the claim getting slower. I sweep terminal rows to a jobs_archive table on a cron, but even that is optional — it's housekeeping, not correctness.

The config files for all of this get diffed with toml-diff before deploy, which is the other reason toml-diff 0.4 shipped the week I was wiring up litequeue's worker pool: I kept getting fooled by reordered keys in the worker config and wanted a diff that ignored order. Boring tools feeding boring tools.

If you have a Postgres and a problem that looks like a queue, try the one claim statement before you stand up a broker. I'd guess most people who reach for Kafka here never needed it. I certainly didn't.
