This document provides a comprehensive breakdown of all metrics emitted by the Batch Queue and Fair Queue systems, including what they mean and how to identify degraded system states.
The batch queue system consists of two layers:
- BatchQueue (`batch_queue.*`) - High-level batch processing metrics
- FairQueue (`batch-queue.*`) - Low-level message queue metrics (with `name: "batch-queue"`)
Both layers emit metrics that together provide full observability into batch processing.
These metrics track batch-level operations.
| Metric | Description | Labels |
|---|---|---|
| `batch_queue.batches_enqueued` | Number of batches initialized for processing | envId, itemCount, streaming |
| `batch_queue.items_enqueued` | Number of individual batch items enqueued | envId |
| `batch_queue.items_processed` | Number of batch items successfully processed (turned into runs) | envId |
| `batch_queue.items_failed` | Number of batch items that failed processing | envId, errorCode |
| `batch_queue.batches_completed` | Number of batches that completed (all items processed) | envId, hasFailures |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `batch_queue.batch_processing_duration` | Time from batch creation to completion | ms | envId, itemCount |
| `batch_queue.item_queue_time` | Time from item enqueue to processing start | ms | envId |
These metrics track the underlying message queue operations. With the batch queue configuration, they are prefixed with `batch-queue.`.
| Metric | Description |
|---|---|
| `batch-queue.messages.enqueued` | Number of messages (batch items) added to the queue |
| `batch-queue.messages.completed` | Number of messages successfully processed |
| `batch-queue.messages.failed` | Number of messages that failed processing |
| `batch-queue.messages.retried` | Number of message retry attempts |
| `batch-queue.messages.dlq` | Number of messages sent to dead letter queue |

| Metric | Description | Unit |
|---|---|---|
| `batch-queue.message.processing_time` | Time to process a single message | ms |
| `batch-queue.message.queue_time` | Time a message spent waiting in queue | ms |

| Metric | Description | Labels |
|---|---|---|
| `batch-queue.queue.length` | Current number of messages in a queue | fairqueue.queue_id |
| `batch-queue.master_queue.length` | Number of active queues in the master queue shard | fairqueue.shard_id |
| `batch-queue.inflight.count` | Number of messages currently being processed | fairqueue.shard_id |
| `batch-queue.dlq.length` | Number of messages in the dead letter queue | fairqueue.tenant_id |
Understanding how metrics relate helps diagnose issues:
- `batches_enqueued × avg_items_per_batch ≈ items_enqueued`
- `items_enqueued = items_processed + items_failed + items_pending`
- `batches_completed ≤ batches_enqueued` (lag indicates processing backlog)
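
The relationships above can be turned into quick sanity checks. The following is a minimal sketch, assuming you already have counter totals for the same time window; the function names and the sample numbers are illustrative and not part of the BatchQueue code.

```python
def pending_items(items_enqueued: int, items_processed: int, items_failed: int) -> int:
    """Derive items_pending from items_enqueued = processed + failed + pending."""
    pending = items_enqueued - items_processed - items_failed
    if pending < 0:
        # More items finished than were enqueued: the counters cover
        # different windows or one of them was reset.
        raise ValueError("processed + failed exceeds enqueued; counters are inconsistent")
    return pending


def batch_lag(batches_enqueued: int, batches_completed: int) -> int:
    """Batches started but not yet completed (a growing value means backlog)."""
    return batches_enqueued - batches_completed


# Illustrative numbers only.
print(pending_items(1000, 900, 40))  # 60
print(batch_lag(20, 17))             # 3
```

Counters can reset when a process restarts, so comparisons like these are most reliable on increases over a fixed window rather than on raw totals.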
Symptoms:
- `batch_queue.items_processed` rate drops to 0
- `batch-queue.inflight.count` is 0
- `batch-queue.master_queue.length` is growing
Likely Causes:
- Consumer loops crashed
- Redis connection issues
- All consumers blocked by concurrency limits
Actions:
- Check webapp logs for "BatchQueue consumers started" message
- Verify Redis connectivity
- Check for "Unknown concurrency group" errors
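
The three symptoms above only indicate a stall when they occur together. As a rough sketch of that combined check, assuming a series of samples taken over the observation window (the `Sample` type and its field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Sample:
    items_processed_rate: float  # items/sec, from batch_queue.items_processed
    inflight_count: int          # from batch-queue.inflight.count
    master_queue_length: int     # from batch-queue.master_queue.length


def processing_stopped(samples: list[Sample]) -> bool:
    """True when nothing is processed or in flight while queued work keeps growing."""
    if len(samples) < 2:
        return False  # not enough data to see a trend
    idle = all(s.items_processed_rate == 0 and s.inflight_count == 0 for s in samples)
    growing = samples[-1].master_queue_length > samples[0].master_queue_length
    return idle and growing


stalled = [Sample(0, 0, 4), Sample(0, 0, 7), Sample(0, 0, 11)]
quiet = [Sample(0, 0, 0), Sample(0, 0, 0)]
print(processing_stopped(stalled))  # True
print(processing_stopped(quiet))    # False: idle, but no work is waiting
```

Requiring a growing master queue avoids flagging periods where the system is simply idle.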
Symptoms:
- `batch_queue.item_queue_time` p99 > 60 seconds
- `batch-queue.queue.length` growing continuously
- `batch-queue.inflight.count` at max capacity
Likely Causes:
- Processing is slower than ingestion
- Concurrency limits too restrictive
- Global rate limiter bottleneck
Actions:
- Increase `BATCH_QUEUE_CONSUMER_COUNT`
- Review concurrency limits per environment
- Check `BATCH_QUEUE_GLOBAL_RATE_LIMIT` setting
Symptoms:
- `batch_queue.items_failed` rate > 5% of `items_processed`
- `batch-queue.messages.dlq` increasing
Likely Causes:
- TriggerTaskService errors
- Invalid task identifiers
- Downstream service issues
Actions:
- Check `errorCode` label distribution on `items_failed`
- Review batch error records in database
- Check TriggerTaskService logs
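
To make the `errorCode` check concrete, here is a small sketch that computes the failure ratio and ranks error codes by their share of failures. It assumes failure counts have already been grouped by `errorCode`; the error code strings below are placeholders, not values the system is known to emit.

```python
def failure_ratio(items_failed: float, items_processed: float) -> float:
    """items_failed relative to items_processed, the ratio the 5% symptom refers to."""
    if items_processed == 0:
        return float("inf") if items_failed > 0 else 0.0
    return items_failed / items_processed


def error_code_shares(failed_by_code: dict[str, int]) -> list[tuple[str, float]]:
    """Share of all failures per errorCode, largest first."""
    total = sum(failed_by_code.values())
    if total == 0:
        return []
    ranked = sorted(failed_by_code.items(), key=lambda kv: kv[1], reverse=True)
    return [(code, count / total) for code, count in ranked]


# Placeholder error codes for illustration.
counts = {"EXAMPLE_CODE_A": 3, "EXAMPLE_CODE_B": 1}
print(failure_ratio(4, 100) > 0.05)  # False
print(error_code_shares(counts))     # [('EXAMPLE_CODE_A', 0.75), ('EXAMPLE_CODE_B', 0.25)]
```

A single dominant code usually points to one root cause, while an even spread is more consistent with a downstream outage.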
Symptoms:
- `batch_queue.batches_enqueued` - `batch_queue.batches_completed` is increasing over time
- `batch-queue.master_queue.length` trending upward
Likely Causes:
- Sustained high load
- Processing capacity insufficient
- Specific tenants monopolizing resources
Actions:
- Monitor DRR deficit distribution across tenants
- Consider scaling consumers
- Review per-tenant concurrency settings
Symptoms:
- Some `envId` labels show much higher `item_queue_time` than others
- DRR logs show "tenants blocked by concurrency" frequently
Likely Causes:
- Concurrency limits too low for high-volume tenants
- DRR quantum/maxDeficit misconfigured
Actions:
- Review `BATCH_CONCURRENCY_*` environment settings
- Adjust DRR parameters if needed
Symptoms:
- `batch_queue.item_queue_time` has periodic spikes
- Logs show "Global rate limit reached, waiting"
Likely Causes:
- `BATCH_QUEUE_GLOBAL_RATE_LIMIT` is set too low
Actions:
- Increase the global rate limit if the system can handle more throughput
- Otherwise, accept the spikes as intentional throttling
```promql
# Throughput
rate(batch_queue_items_processed_total[5m])
rate(batch_queue_items_failed_total[5m])

# Success Rate
rate(batch_queue_items_processed_total[5m]) /
(rate(batch_queue_items_processed_total[5m]) + rate(batch_queue_items_failed_total[5m]))

# Batch Completion Rate
rate(batch_queue_batches_completed_total[5m]) / rate(batch_queue_batches_enqueued_total[5m])

# Item Queue Time (p50, p95, p99)
histogram_quantile(0.50, rate(batch_queue_item_queue_time_bucket[5m]))
histogram_quantile(0.95, rate(batch_queue_item_queue_time_bucket[5m]))
histogram_quantile(0.99, rate(batch_queue_item_queue_time_bucket[5m]))

# Batch Processing Duration
histogram_quantile(0.95, rate(batch_queue_batch_processing_duration_bucket[5m]))

# Current backlog
batch_queue_master_queue_length
batch_queue_inflight_count

# DLQ (should be 0)
batch_queue_dlq_length
```
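
When reading the quantile queries, it helps to remember that `histogram_quantile` estimates a value by linear interpolation inside the bucket that contains the requested rank, so the result depends on the bucket boundaries. The sketch below approximates that calculation in Python for cumulative buckets. The bucket boundaries are invented for illustration and are not the boundaries the batch queue histograms actually use.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with an +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Linear interpolation inside the matching bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical queue-time buckets in milliseconds.
buckets_ms = [(1000.0, 50), (5000.0, 90), (30000.0, 99), (math.inf, 100)]
print(histogram_quantile(0.50, buckets_ms))  # 1000.0
print(histogram_quantile(0.75, buckets_ms))  # 3500.0
```

Because the histograms are recorded in milliseconds, a threshold such as 60 seconds corresponds to a quantile value of 60000.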
| Condition | Severity | Threshold |
|---|---|---|
| Processing stopped | Critical | items_processed rate = 0 for 5min |
| High failure rate | Warning | items_failed / items_processed > 0.05 |
| Queue time p99 | Warning | > 30 seconds |
| Queue time p99 | Critical | > 120 seconds |
| DLQ length | Warning | > 0 |
| Batch completion lag | Warning | batches_enqueued - batches_completed > 100 |
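
The thresholds in the table can be expressed as a single evaluation function, which is handy for testing alert logic before encoding it in an alerting system. This is a sketch only: the `Snapshot` type and its field names are assumptions, and the values would need to come from your own metrics queries.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    items_processed_rate: float     # items/sec over the evaluation window
    items_failed_rate: float        # items/sec over the same window
    zero_throughput_seconds: float  # how long the items_processed rate has been 0
    queue_time_p99_ms: float        # histograms are recorded in milliseconds
    dlq_length: int
    batches_enqueued: int
    batches_completed: int


def evaluate_alerts(s: Snapshot) -> list[tuple[str, str]]:
    """Return (condition, severity) pairs using the thresholds from the table."""
    alerts = []
    if s.zero_throughput_seconds >= 300:
        alerts.append(("Processing stopped", "Critical"))
    if s.items_processed_rate > 0 and s.items_failed_rate / s.items_processed_rate > 0.05:
        alerts.append(("High failure rate", "Warning"))
    if s.queue_time_p99_ms > 120_000:
        alerts.append(("Queue time p99", "Critical"))
    elif s.queue_time_p99_ms > 30_000:
        alerts.append(("Queue time p99", "Warning"))
    if s.dlq_length > 0:
        alerts.append(("DLQ length", "Warning"))
    if s.batches_enqueued - s.batches_completed > 100:
        alerts.append(("Batch completion lag", "Warning"))
    return alerts


snapshot = Snapshot(10.0, 1.0, 0, 45_000, 2, 500, 450)
for condition, severity in evaluate_alerts(snapshot):
    print(f"{severity}: {condition}")
```

The critical queue-time check runs first so that a single snapshot never reports both severities for the same condition.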
| Variable | Impact |
|---|---|
| `BATCH_QUEUE_CONSUMER_COUNT` | More consumers = higher throughput, lower queue time |
| `BATCH_QUEUE_CONSUMER_INTERVAL_MS` | Lower = more frequent polling, higher throughput |
| `BATCH_QUEUE_GLOBAL_RATE_LIMIT` | Caps max items/sec, increases queue time if too low |
| `BATCH_CONCURRENCY_FREE/PAID/ENTERPRISE` | Per-tenant concurrency limits |
| `BATCH_QUEUE_DRR_QUANTUM` | Credits per tenant per round (fairness tuning) |
| `BATCH_QUEUE_MAX_DEFICIT` | Max accumulated credits (prevents starvation) |
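
To illustrate how a quantum and a deficit cap interact, here is a simplified deficit round robin (DRR) round in Python. It is a textbook-style sketch, not the FairQueue implementation: it assumes every item costs one credit, and the `blocked` set stands in for tenants that cannot be served because of concurrency limits.

```python
from collections import deque


def drr_round(queues, deficits, quantum, max_deficit, blocked=frozenset()):
    """Run one scheduling round and return the items served, in order."""
    served = []
    for tenant, queue in queues.items():
        if not queue:
            deficits[tenant] = 0  # idle tenants do not bank credits
            continue
        # Grant this round's credits, capped at max_deficit.
        deficits[tenant] = min(deficits.get(tenant, 0) + quantum, max_deficit)
        if tenant in blocked:
            continue  # credits accumulate (up to the cap) while blocked
        while queue and deficits[tenant] >= 1:
            served.append(queue.popleft())
            deficits[tenant] -= 1
    return served


queues = {
    "A": deque(["a1", "a2", "a3", "a4", "a5", "a6"]),
    "B": deque(["b1", "b2", "b3", "b4", "b5", "b6"]),
}
deficits = {}
print(drr_round(queues, deficits, quantum=2, max_deficit=3, blocked={"B"}))  # ['a1', 'a2']
print(drr_round(queues, deficits, quantum=2, max_deficit=3, blocked={"B"}))  # ['a3', 'a4']
print(drr_round(queues, deficits, quantum=2, max_deficit=3))  # ['a5', 'a6', 'b1', 'b2', 'b3']
```

In this model, tenant B catches up with a burst of three items once unblocked, which is bounded by the deficit cap rather than by how long it was blocked.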
When investigating batch queue issues:
- Check consumer status: Look for "BatchQueue consumers started" in logs
- Check Redis: Verify connection and inspect keys with prefix `engine:batch-queue:`
- Check concurrency: Look for "tenants blocked by concurrency" debug logs
- Check rate limits: Look for "Global rate limit reached" debug logs
- Check DRR state: Query `batch:drr:deficit` hash in Redis
- Check batch status: Query `BatchTaskRun` table for stuck `PROCESSING` batches