Background Jobs

Background jobs and system maintenance tasks including token refresh, webhook delivery, state cleanup, and database optimization.

Updated Dec 16, 2025

Background Jobs and System Maintenance

AuthOS runs several background jobs to ensure reliability, performance, and data integrity. These jobs run automatically on startup and operate continuously throughout the application lifecycle.

Overview

Background jobs handle critical operational tasks:

  • Token Refresh: Proactively refresh expiring OAuth tokens
  • Webhook Delivery: Reliable webhook delivery with automatic retry
  • State Cleanup: Remove expired OAuth and SAML authentication states
  • Database Optimization: Performance optimization for database operations

All background jobs log their activity and handle failures gracefully without affecting the main application.


Architecture: Transactional Outbox Pattern

Overview

The platform implements the Transactional Outbox pattern using the system_jobs table to ensure reliable background job processing with exactly-once delivery guarantees.

Why Transactional Outbox?

Traditional background job systems can lose jobs if the application crashes after committing a database transaction but before enqueuing the job. The Transactional Outbox pattern solves this by:

  1. Atomic Writes: Jobs are inserted into the database within the same transaction as the business logic
  2. Persistent Queue: Jobs survive application restarts and crashes
  3. Exactly-Once Processing: Each job is processed exactly once, even under failures
  4. Order Preservation: Jobs can be prioritized and scheduled for future execution

System Jobs Table Schema

CREATE TABLE system_jobs (
    id TEXT PRIMARY KEY,
    job_type TEXT NOT NULL,           -- "send_email", "deliver_webhook", "stream_audit_logs"
    payload TEXT NOT NULL,             -- JSON-encoded job data
    status TEXT NOT NULL,              -- "pending", "processing", "completed", "failed"
    priority INTEGER NOT NULL DEFAULT 0,
    max_retries INTEGER NOT NULL DEFAULT 3,
    attempt_count INTEGER NOT NULL DEFAULT 0,
    worker_id TEXT,                    -- ID of worker processing this job (hostname-UUID)
    scheduled_for TEXT NOT NULL,       -- ISO 8601 timestamp
    last_attempt_at TEXT,
    completed_at TEXT,
    failed_at TEXT,
    error_message TEXT,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);

Indexes for Performance

The table includes optimized indexes for efficient job processing:

-- Primary query: Fetch pending jobs ready for processing
CREATE INDEX idx_system_jobs_status_scheduled
ON system_jobs (status, scheduled_for, priority);

-- Query: Filter jobs by type
CREATE INDEX idx_system_jobs_type
ON system_jobs (job_type);

Job Types

The system supports multiple job types:

| Job Type          | Purpose                           | Payload                                      |
|-------------------|-----------------------------------|----------------------------------------------|
| send_email        | Transactional email delivery      | Email address, subject, body (HTML/text)     |
| deliver_webhook   | Webhook event delivery with retry | Webhook ID, event type, payload, delivery ID |
| stream_audit_logs | SIEM log streaming                | Config ID, audit type, batch ID              |
| custom            | Extensible custom jobs            | Any JSON payload                             |

Job Lifecycle

  1. Enqueue (Transaction):

    // Within a database transaction
    JobQueueService::enqueue(
        db_transaction,
        JobType::SendEmail,
        &email_payload,
        10,               // priority
        3,                // max_retries
        Some(Utc::now()), // scheduled_for
    ).await?;
    
  2. Pending: Job sits in the queue with status = 'pending'

  3. Processing: Job processor picks up the job:

    • Updates status = 'processing'
    • Increments attempt_count
    • Updates last_attempt_at
  4. Completion:

    • Success: status = 'completed', completed_at set
    • Failure: status = 'failed', failed_at and error_message set
    • Retry: status = 'pending', scheduled_for updated with exponential backoff
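
The status transitions in the lifecycle above form a small state machine. A minimal sketch of those rules (the enum and function names are illustrative, not the platform's actual code):

```rust
// Illustrative model of the job status transitions described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobStatus {
    Pending,
    Processing,
    Completed,
    Failed,
}

// Returns the next status for an event, or None if the transition is invalid.
fn transition(current: JobStatus, event: &str) -> Option<JobStatus> {
    use JobStatus::*;
    match (current, event) {
        (Pending, "claim") => Some(Processing),
        (Processing, "success") => Some(Completed),
        (Processing, "retry") => Some(Pending),    // rescheduled with backoff
        (Processing, "exhausted") => Some(Failed), // attempt_count >= max_retries
        _ => None,
    }
}

fn main() {
    assert_eq!(transition(JobStatus::Pending, "claim"), Some(JobStatus::Processing));
    assert_eq!(transition(JobStatus::Processing, "retry"), Some(JobStatus::Pending));
    // A completed job is never claimed again.
    assert_eq!(transition(JobStatus::Completed, "claim"), None);
    println!("ok");
}
```

Note that a retried job goes back to pending rather than to a separate "retrying" state, so the same claim path handles first attempts and retries uniformly.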

Job Processor

The Job Processor runs as a background worker that atomically claims and processes jobs from the system_jobs table.

Schedule: Runs continuously, sleeping for 10 seconds when the queue is empty

Atomic Claiming: Uses database-level locking for safe horizontal scaling:

  • PostgreSQL/MySQL: SELECT ... FOR UPDATE SKIP LOCKED via SeaORM
  • SQLite: Optimistic locking with transaction isolation

Worker Identification: Each worker has a unique worker_id (format: hostname-UUID) for traceability.

Processing Flow:

1. Atomically claim next pending job (lock row, update status)
2. Execute job handler based on job_type
3. Mark as completed or failed
4. Retry with backoff if attempt_count < max_retries
5. Immediately claim next job or sleep if queue empty

Horizontal Scaling: Multiple workers can safely run concurrently. The FOR UPDATE SKIP LOCKED pattern ensures:

  • Each job is processed exactly once
  • No race conditions between workers
  • Workers don’t block each other
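
For SQLite, the optimistic claim amounts to a compare-and-swap on the status column: update the row only if it is still pending, and treat zero affected rows as "another worker got there first". A simplified in-memory model of that rule (the real implementation issues the equivalent UPDATE through SeaORM):

```rust
// In-memory model of optimistic job claiming: a claim succeeds only if
// the job is still "pending"; a second claim on the same job fails.
#[derive(Debug)]
struct Job {
    id: String,
    status: String,
    worker_id: Option<String>,
}

// Mirrors: UPDATE system_jobs SET status = 'processing', worker_id = ?
//          WHERE id = ? AND status = 'pending'   (then check rows_affected == 1)
fn try_claim(job: &mut Job, worker: &str) -> bool {
    if job.status == "pending" {
        job.status = "processing".to_string();
        job.worker_id = Some(worker.to_string());
        true
    } else {
        false
    }
}

fn main() {
    let mut job = Job { id: "job-1".into(), status: "pending".into(), worker_id: None };
    assert!(try_claim(&mut job, "host-a-1234"));  // first worker wins
    assert!(!try_claim(&mut job, "host-b-5678")); // second worker is skipped
    assert_eq!(job.worker_id.as_deref(), Some("host-a-1234"));
    println!("ok");
}
```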

Retry Strategy

Failed jobs are automatically retried with exponential backoff:

| Attempt | Delay | Total Time |
|---------|-------|------------|
| 1       | 0s    | 0s         |
| 2       | 30s   | 30s        |
| 3       | 60s   | 1m 30s     |
| 4       | 120s  | 3m 30s     |

After max_retries attempts, jobs are marked as permanently failed.
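
The schedule above is a first immediate attempt followed by doubling delays starting at 30 seconds. A small sketch of that formula (the constants match the table; the function name is illustrative):

```rust
// Delay in seconds before attempt `n` (1-based), matching the table above:
// attempt 1 runs immediately, then 30s, 60s, 120s, ...
fn retry_delay_secs(attempt: u32) -> u64 {
    if attempt <= 1 {
        0
    } else {
        30u64 * 2u64.pow(attempt - 2)
    }
}

fn main() {
    assert_eq!(retry_delay_secs(1), 0);
    assert_eq!(retry_delay_secs(2), 30);
    assert_eq!(retry_delay_secs(3), 60);
    assert_eq!(retry_delay_secs(4), 120);
    println!("ok");
}
```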

Benefits

Reliability

  • Jobs are never lost due to application crashes
  • Atomic writes ensure consistency between business logic and job creation
  • Automatic retry with exponential backoff

Observability

  • All jobs persisted in database with full history
  • Track job status, retry count, and error messages
  • Query job metrics and failure patterns

Scalability

  • Multiple workers can process jobs concurrently
  • Priority-based processing for critical jobs
  • Scheduled jobs for delayed execution

Transactions

  • Jobs can be enqueued within database transactions
  • Jobs are only created if the transaction commits
  • No orphaned jobs from failed transactions

Example: Webhook Delivery with Outbox

// API endpoint that needs to send a webhook
pub async fn create_user(db: &DatabaseConnection, user_data: UserData) -> Result<User> {
    // Start transaction
    let txn = db.begin().await?;

    // Business logic: Create user
    let user = UserStore::create(&txn, &user_data).await?;

    // Enqueue webhook job (within same transaction)
    JobQueueService::enqueue(
        DB::Txn(&txn),
        JobType::DeliverWebhook,
        &WebhookJobPayload {
            webhook_id: "webhook-123",
            event_type: "user.created",
            payload: json!({ "user_id": user.id }),
            delivery_id: Uuid::new_v4().to_string(),
        },
        10,   // priority
        5,    // max_retries
        None, // scheduled_for: send immediately
    ).await?;

    // Commit transaction (atomically creates user + webhook job)
    txn.commit().await?;

    Ok(user)
}

Monitoring

Query job metrics directly from the database:

-- Pending jobs count
SELECT COUNT(*) FROM system_jobs WHERE status = 'pending';

-- Failed jobs in last 24 hours
SELECT * FROM system_jobs
WHERE status = 'failed'
  AND failed_at > datetime('now', '-24 hours');

-- Average attempt count for completed jobs
SELECT AVG(attempt_count) FROM system_jobs
WHERE status = 'completed';

-- Job processing rate (last hour)
SELECT
  job_type,
  COUNT(*) as total,
  SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as completed,
  SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed
FROM system_jobs
WHERE created_at > datetime('now', '-1 hour')
GROUP BY job_type;

Production Considerations

  1. Job Retention: Implement cleanup for old completed/failed jobs
  2. Dead Letter Queue: Monitor permanently failed jobs for manual intervention
  3. Alerts: Set up alerts for high failure rates or pending job backlog
  4. Scaling: Add more worker processes to increase throughput
  5. Monitoring: Track job processing latency and success rates

Token Refresh Job

Purpose

Automatically refreshes OAuth access tokens before they expire to ensure uninterrupted service access. This prevents users from experiencing authentication failures due to expired tokens.

Schedule

Runs every 5 minutes (300 seconds)

Behavior

  1. Queries database for tokens expiring within the next 1 hour
  2. For each expiring token:
    • Determines correct OAuth credentials (BYOO or platform)
    • Calls provider’s token refresh endpoint
    • Encrypts and stores new access/refresh tokens
    • Updates token expiration timestamp
  3. Skips GitHub tokens (GitHub refresh tokens are optional and not supported by the job)
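
The "expiring within the next hour" check in step 1 can be sketched with the standard library (the helper name and explicit window parameter are ours; the real filter runs as a SQL query):

```rust
use std::time::{Duration, SystemTime};

// True if the token expires within `window` from now (or is already
// expired), i.e. the refresh job should pick it up this cycle.
fn needs_refresh(expires_at: SystemTime, window: Duration) -> bool {
    match expires_at.duration_since(SystemTime::now()) {
        Ok(remaining) => remaining <= window,
        Err(_) => true, // already expired
    }
}

fn main() {
    let now = SystemTime::now();
    let one_hour = Duration::from_secs(3600);
    // Expires in 30 minutes: inside the 1-hour refresh window.
    assert!(needs_refresh(now + Duration::from_secs(1800), one_hour));
    // Expires in 2 hours: left alone until a later cycle.
    assert!(!needs_refresh(now + Duration::from_secs(7200), one_hour));
    println!("ok");
}
```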

Supported Providers

  • Microsoft: Full refresh token support
  • Google: Full refresh token support
  • GitHub: Not supported (skipped)

Token Types Handled

BYOO Tokens

For organizations using Bring Your Own OAuth (BYOO), the job:

  • Retrieves organization-specific OAuth credentials from database
  • Decrypts the client secret using encryption service
  • Uses organization credentials for token refresh

Platform Tokens

For users authenticating via platform OAuth:

  • Uses platform-wide OAuth credentials
  • Applies to both platform owners and regular users
  • Supports both admin and end-user authentication flows

Encryption Support

The job supports both encrypted and plaintext token storage:

  • Encrypted: Uses EncryptionService to decrypt refresh tokens and encrypt new tokens
  • Plaintext: Falls back to unencrypted storage if encryption is unavailable

Error Handling

  • Individual token refresh failures are logged but don’t stop the job
  • Job continues running even if some tokens fail to refresh
  • Failed refreshes are retried on the next job cycle

Logging

Token refresh job started
Found 5 tokens to refresh
Refreshed token for identity: abc123
Failed to refresh token for xyz789: Provider error

Webhook Delivery Job

Purpose

Provides reliable webhook delivery with automatic retry and exponential backoff. Ensures that webhook events are delivered even if the recipient endpoint is temporarily unavailable.

Schedule

Runs every 30 seconds

Retry Configuration

| Parameter        | Value                   |
|------------------|-------------------------|
| Maximum Retries  | 5 attempts              |
| Initial Delay    | 5 seconds               |
| Maximum Delay    | 30 minutes              |
| Backoff Strategy | Exponential with jitter |

Delivery Process

  1. Fetches up to 100 pending webhook deliveries from database
  2. For each delivery:
    • Retrieves webhook configuration (URL, secret, events)
    • Generates HMAC-SHA256 signature for payload
    • Sends POST request with headers:
      • Content-Type: application/json
      • X-Webhook-Signature: sha256=...
      • X-Webhook-Timestamp: {unix_timestamp}
    • Processes response:
      • 2xx Success: Mark as delivered
      • Non-2xx or Error: Schedule retry with backoff

Exponential Backoff

Retry delays follow exponential backoff with jitter:

| Attempt | Base Delay | Max Delay | Jitter |
|---------|------------|-----------|--------|
| 1       | 5s         | 30m       | 0-9s   |
| 2       | 10s        | 30m       | 0-9s   |
| 3       | 20s        | 30m       | 0-9s   |
| 4       | 40s        | 30m       | 0-9s   |
| 5       | 80s        | 30m       | 0-9s   |

Jitter (0-9 seconds) prevents thundering herd when multiple webhooks fail simultaneously.
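
The table values follow base_delay × 2^(attempt−1), capped at 30 minutes, plus jitter. A sketch with the jitter passed in explicitly so the values are checkable (the real job draws the 0-9s jitter randomly):

```rust
// Webhook retry delay in seconds: exponential from a 5s base, capped at
// 30 minutes, plus caller-supplied jitter (randomized 0-9s in practice).
fn webhook_delay_secs(attempt: u32, jitter_secs: u64) -> u64 {
    const BASE: u64 = 5;
    const MAX: u64 = 30 * 60; // 30 minutes
    // 2^(attempt - 1), with the shift clamped to avoid overflow.
    let exp = BASE.saturating_mul(1u64 << attempt.saturating_sub(1).min(62));
    exp.min(MAX) + jitter_secs
}

fn main() {
    assert_eq!(webhook_delay_secs(1, 0), 5);
    assert_eq!(webhook_delay_secs(2, 0), 10);
    assert_eq!(webhook_delay_secs(5, 0), 80);
    assert_eq!(webhook_delay_secs(12, 0), 1800); // capped at 30m
    assert_eq!(webhook_delay_secs(3, 7), 27);    // 20s base + 7s jitter
    println!("ok");
}
```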

Permanent Failures

After 5 failed attempts, deliveries are marked as permanently failed:

  • Status set to failed
  • No further retry attempts
  • Error details stored in database for debugging

Security

Each webhook includes an HMAC-SHA256 signature generated with the webhook secret:

X-Webhook-Signature: sha256={hex_encoded_hmac}

Recipients should verify this signature to confirm authenticity.

Timeout

HTTP requests timeout after 30 seconds to prevent hanging connections.

Logging

Webhook delivery job started
Processing 3 pending webhook deliveries
Webhook delivery abc123 succeeded
Webhook delivery xyz789 failed with status 503
Webhook delivery xyz789 scheduled for retry at 2025-01-15T10:35:00Z
Webhook delivery failed123 permanently failed after 5 attempts

OAuth State Cleanup Job

Purpose

Removes expired OAuth authentication states from the database to maintain database health and prevent state table bloat.

Schedule

Runs every 10 minutes (600 seconds)

Behavior

  1. Queries oauth_states table for expired entries
  2. Deletes all states past their expiration timestamp
  3. Logs count of deleted states

State Lifecycle

OAuth states are temporary tokens used during the OAuth flow:

  • Created: When user initiates OAuth login
  • Used: When OAuth callback is processed
  • Expired: After configured TTL (typically 10-15 minutes)
  • Deleted: By this cleanup job

Database Impact

Prevents unbounded growth of the oauth_states table, maintaining:

  • Query performance
  • Disk space efficiency
  • Index efficiency

Logging

OAuth state cleanup job started
Cleaned up 42 expired OAuth states

If no states are expired, the job runs silently without logging.
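
The cleanup itself is a single DELETE on expired rows; the filtering rule can be sketched over an in-memory list (struct and function names are illustrative):

```rust
use std::time::{Duration, SystemTime};

struct OAuthState {
    id: String,
    expires_at: SystemTime,
}

// Mirrors: DELETE FROM oauth_states WHERE expires_at < now
// Returns how many states were removed, as logged by the job.
fn cleanup_expired(states: &mut Vec<OAuthState>, now: SystemTime) -> usize {
    let before = states.len();
    states.retain(|s| s.expires_at >= now);
    before - states.len()
}

fn main() {
    let now = SystemTime::now();
    let mut states = vec![
        OAuthState { id: "a".into(), expires_at: now - Duration::from_secs(60) },
        OAuthState { id: "b".into(), expires_at: now + Duration::from_secs(600) },
    ];
    assert_eq!(cleanup_expired(&mut states, now), 1); // "a" was expired
    assert_eq!(states[0].id, "b");
    println!("ok");
}
```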


SAML State Cleanup Job

Purpose

Removes expired SAML authentication states from the database, similar to OAuth state cleanup.

Schedule

Runs every 10 minutes (600 seconds)

Behavior

  1. Queries saml_states table for expired entries
  2. Deletes all states past their expiration timestamp
  3. Logs count of deleted states

State Lifecycle

SAML states track SAML SSO authentication flows:

  • Created: When SAML authentication is initiated
  • Used: When SAML response is processed
  • Expired: After configured TTL
  • Deleted: By this cleanup job

Logging

SAML state cleanup job started
Cleaned up 15 expired SAML states

Database WAL Checkpointing

Purpose

Aggressively checkpoint the SQLite Write-Ahead Log (WAL) to optimize database performance under heavy load.

Schedule

Runs every 10 seconds

Behavior

Periodically checkpoints the WAL, folding logged writes back into the main database file. This keeps the WAL file from growing unbounded and ensures consistent read performance under heavy load.

Performance Impact

  • Benefit: Maintains consistent read performance
  • Cost: Additional background disk operations
  • Frequency: Every 10 seconds
  • Optimization: Designed for read-heavy workloads like AuthOS

Job Lifecycle

Startup

All background jobs start automatically when the application launches:

Token refresh job started
OAuth state cleanup job started
SAML state cleanup job started
Webhook delivery job started

Runtime

Jobs run independently with these characteristics:

  • Each job operates in its own execution context
  • Failures in one job don’t affect other jobs
  • Jobs continue running until application shutdown
  • Graceful error handling ensures job stability

Shutdown

Jobs terminate gracefully when the application shuts down, ensuring clean completion of in-progress operations.


Monitoring Background Jobs

Health Indicators

Monitor these signals to ensure jobs are running correctly:

Token Refresh

  • Check logs for Refreshed token for identity messages
  • Monitor for Failed to refresh token errors
  • Track count of tokens needing refresh

Webhook Delivery

  • Watch for Webhook delivery X succeeded messages
  • Alert on permanently failed deliveries
  • Monitor retry counts and delays

State Cleanup

  • Verify Cleaned up X expired states logs appear regularly
  • Monitor table row counts to detect cleanup failures

Troubleshooting

Token Refresh Not Working

Symptom: Tokens expiring without refresh

Causes:

  • Encryption service unavailable
  • Invalid OAuth credentials
  • Provider API down

Resolution:

  • Check encryption key configuration
  • Verify OAuth client ID/secret are correct
  • Check provider API status

Webhook Deliveries Failing

Symptom: All webhooks timing out or failing

Causes:

  • Recipient endpoint down
  • Firewall blocking outbound requests
  • Network connectivity issues

Resolution:

  • Test webhook URL manually with curl
  • Check firewall/security group rules
  • Verify network connectivity

State Tables Growing

Symptom: oauth_states or saml_states table size increasing

Causes:

  • Cleanup job not running
  • Database errors preventing deletion
  • Expiration logic misconfigured

Resolution:

  • Check job logs for errors
  • Verify database write permissions
  • Inspect state expiration timestamps

Performance Characteristics

Resource Usage

| Job                 | CPU    | Memory | I/O              |
|---------------------|--------|--------|------------------|
| Token Refresh       | Low    | Low    | Low              |
| Webhook Delivery    | Medium | Low    | Medium (network) |
| OAuth State Cleanup | Low    | Low    | Low              |
| SAML State Cleanup  | Low    | Low    | Low              |
| WAL Checkpointing   | Low    | Low    | High (writes)    |

Database Load

Background jobs are designed to minimize database impact:

  • Token refresh: Processes tokens efficiently in batches
  • Webhook delivery: Processes up to 100 webhooks per cycle
  • State cleanup: Efficient single-query cleanup operations
  • Database optimization: Lightweight performance maintenance

Scalability

All jobs scale with platform usage:

  • Token refresh: Scales with active users
  • Webhook delivery: Scales with event volume
  • State cleanup: Scales with authentication rate
  • WAL checkpoint: Independent of usage

Configuration

Background jobs run with the following schedules:

Token Refresh

  • Frequency: Every 5 minutes
  • Refresh Window: Tokens expiring within 1 hour
  • Providers Supported: Microsoft, Google (GitHub refresh tokens are not used)

Webhook Delivery

  • Frequency: Every 30 seconds
  • Max Retries: 5 attempts per webhook
  • Retry Strategy: Exponential backoff from 5 seconds to 30 minutes

State Cleanup

  • Frequency: Every 10 minutes
  • Cleanup Target: Expired OAuth and SAML authentication states

Database Optimization

  • Frequency: Every 10 seconds
  • Purpose: Performance optimization for read-heavy workloads

Production Considerations

Logging

All jobs log to standard output/error. Configure your deployment to:

  • Capture and aggregate logs (e.g., CloudWatch, Datadog)
  • Set up alerts for error patterns
  • Monitor job execution frequency

Encryption Service

Token refresh requires encryption service for BYOO tokens. Ensure:

  • ENCRYPTION_KEY environment variable is set
  • Encryption key is properly rotated
  • Key ID matches encrypted token metadata

Database Backups

WAL checkpointing ensures database consistency:

  • Safe to back up main database file
  • WAL file can be backed up separately
  • Point-in-time recovery supported

High Availability

For multi-instance deployments:

  • Each worker has a unique worker_id for traceability
  • Atomic job claiming with FOR UPDATE SKIP LOCKED prevents duplicate processing
  • Multiple workers can safely process jobs concurrently
  • Database-level locking ensures exactly-once delivery
  • No leader election required—all workers are equal