Background Jobs and System Maintenance
AuthOS runs several background jobs to ensure reliability, performance, and data integrity. These jobs run automatically on startup and operate continuously throughout the application lifecycle.
Overview
Background jobs handle critical operational tasks:
- Token Refresh: Proactively refresh expiring OAuth tokens
- Webhook Delivery: Reliable webhook delivery with automatic retry
- State Cleanup: Remove expired OAuth and SAML authentication states
- Database Optimization: Performance optimization for database operations
All background jobs log their activity and handle failures gracefully without affecting the main application.
Architecture: Transactional Outbox Pattern
Overview
The platform implements the Transactional Outbox pattern using the system_jobs table to ensure reliable background job processing: jobs are never lost, and atomic claiming prevents two workers from processing the same job.
Why Transactional Outbox?
Traditional background job systems can lose jobs if the application crashes after committing a database transaction but before enqueuing the job. The Transactional Outbox pattern solves this by:
- Atomic Writes: Jobs are inserted into the database within the same transaction as the business logic
- Persistent Queue: Jobs survive application restarts and crashes
- Safe Concurrent Processing: Atomic claiming ensures each job is handled by at most one worker at a time, even under failures
- Order Preservation: Jobs can be prioritized and scheduled for future execution
System Jobs Table Schema
CREATE TABLE system_jobs (
id TEXT PRIMARY KEY,
job_type TEXT NOT NULL, -- "send_email", "deliver_webhook", "stream_audit_logs"
payload TEXT NOT NULL, -- JSON-encoded job data
status TEXT NOT NULL, -- "pending", "processing", "completed", "failed"
priority INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER NOT NULL DEFAULT 3,
attempt_count INTEGER NOT NULL DEFAULT 0,
worker_id TEXT, -- ID of worker processing this job (hostname-UUID)
scheduled_for TEXT NOT NULL, -- ISO 8601 timestamp
last_attempt_at TEXT,
completed_at TEXT,
failed_at TEXT,
error_message TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
Indexes for Performance
The table includes optimized indexes for efficient job processing:
-- Primary query: Fetch pending jobs ready for processing
CREATE INDEX idx_system_jobs_status_scheduled
ON system_jobs (status, scheduled_for, priority);
-- Query: Filter jobs by type
CREATE INDEX idx_system_jobs_type
ON system_jobs (job_type);
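A sketch of the claim query these indexes serve; the actual SQL is generated by SeaORM at runtime, so the exact shape shown here is an assumption:

```sql
-- Fetch the next pending job that is due, highest priority first.
-- FOR UPDATE SKIP LOCKED applies on PostgreSQL/MySQL; SQLite relies on
-- optimistic locking instead (see the Job Processor section).
SELECT *
FROM system_jobs
WHERE status = 'pending'
  AND scheduled_for <= :now
ORDER BY priority DESC, scheduled_for ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```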
Job Types
The system supports multiple job types:
| Job Type | Purpose | Payload |
|---|---|---|
| send_email | Transactional email delivery | Email address, subject, body (HTML/text) |
| deliver_webhook | Webhook event delivery with retry | Webhook ID, event type, payload, delivery ID |
| stream_audit_logs | SIEM log streaming | Config ID, audit type, batch ID |
| custom | Extensible custom jobs | Any JSON payload |
Job Lifecycle
1. Enqueue (Transaction):
// Within a database transaction
JobQueueService::enqueue(
    db_transaction,
    JobType::SendEmail,
    &email_payload,
    10,               // priority
    3,                // max_retries
    Some(Utc::now()), // scheduled_for
).await?;
2. Pending: Job sits in the queue with status = 'pending'
3. Processing: Job processor picks up the job:
   - Updates status = 'processing'
   - Increments attempt_count
   - Updates last_attempt_at
4. Completion:
   - Success: status = 'completed', completed_at set
   - Failure: status = 'failed', failed_at and error_message set
   - Retry: status = 'pending', scheduled_for updated with exponential backoff
Job Processor
The Job Processor runs as a background worker that atomically claims and processes jobs from the system_jobs table.
Schedule: Runs continuously, sleeping for 10 seconds when the queue is empty
Atomic Claiming: Uses database-level locking for safe horizontal scaling:
- PostgreSQL/MySQL: SELECT ... FOR UPDATE SKIP LOCKED via SeaORM
- SQLite: Optimistic locking with transaction isolation
Worker Identification: Each worker has a unique worker_id (format: hostname-UUID) for traceability.
Processing Flow:
1. Atomically claim next pending job (lock row, update status)
2. Execute job handler based on job_type
3. Mark as completed or failed
4. Retry with backoff if attempt_count < max_retries
5. Immediately claim next job or sleep if queue empty
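The flow above can be sketched with an in-memory queue standing in for the system_jobs table; the names (Job, claim_next, run_handler) are illustrative, not the actual AuthOS API:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

#[derive(Debug)]
struct Job {
    job_type: String,
    attempt_count: u32,
    max_retries: u32,
}

// Stand-in for "atomically claim next pending job": against a real
// database this is SELECT ... FOR UPDATE SKIP LOCKED plus a status update.
fn claim_next(queue: &Mutex<VecDeque<Job>>) -> Option<Job> {
    queue.lock().unwrap().pop_front()
}

// Dispatch on job_type (step 2 of the processing flow).
fn run_handler(job: &Job) -> Result<(), String> {
    match job.job_type.as_str() {
        "send_email" | "deliver_webhook" | "stream_audit_logs" => Ok(()),
        other => Err(format!("unknown job type: {other}")),
    }
}

fn process_available_jobs(queue: &Mutex<VecDeque<Job>>) -> (u32, u32) {
    let (mut completed, mut failed) = (0, 0);
    // Step 5: keep claiming until the queue is empty; a real worker
    // would then sleep ~10 seconds before polling again.
    while let Some(mut job) = claim_next(queue) {
        job.attempt_count += 1; // recorded alongside last_attempt_at
        match run_handler(&job) {
            Ok(()) => completed += 1, // mark as completed
            Err(_) if job.attempt_count < job.max_retries => {
                // Retry: re-enqueue for a later attempt (with backoff in practice)
                queue.lock().unwrap().push_back(job);
            }
            Err(_) => failed += 1, // permanently failed
        }
    }
    (completed, failed)
}

fn main() {
    let queue = Mutex::new(VecDeque::from([
        Job { job_type: "send_email".into(), attempt_count: 0, max_retries: 3 },
        Job { job_type: "bogus".into(), attempt_count: 0, max_retries: 3 },
    ]));
    let (completed, failed) = process_available_jobs(&queue);
    println!("completed={completed} failed={failed}");
}
```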
Horizontal Scaling: Multiple workers can safely run concurrently. The FOR UPDATE SKIP LOCKED pattern ensures:
- Each pending job is claimed by at most one worker at a time
- No race conditions between workers
- Workers don’t block each other
Retry Strategy
Failed jobs are automatically retried with exponential backoff:
| Attempt | Delay | Total Time |
|---|---|---|
| 1 | 0s | 0s |
| 2 | 30s | 30s |
| 3 | 60s | 1m 30s |
| 4 | 120s | 3m 30s |
After max_retries attempts, jobs are marked as permanently failed.
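The schedule in the table reduces to a small function; the name and seconds-based units are illustrative:

```rust
// Retry schedule from the table above: no delay on the first attempt,
// then a 30-second base delay that doubles on each subsequent attempt.
fn retry_delay_secs(attempt: u32) -> u64 {
    if attempt <= 1 {
        0
    } else {
        30 * 2u64.pow(attempt - 2) // 30s, 60s, 120s, ...
    }
}

fn main() {
    for attempt in 1..=4 {
        println!("attempt {attempt}: delay {}s", retry_delay_secs(attempt));
    }
}
```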
Benefits
Reliability
- Jobs are never lost due to application crashes
- Atomic writes ensure consistency between business logic and job creation
- Automatic retry with exponential backoff
Observability
- All jobs persisted in database with full history
- Track job status, retry count, and error messages
- Query job metrics and failure patterns
Scalability
- Multiple workers can process jobs concurrently
- Priority-based processing for critical jobs
- Scheduled jobs for delayed execution
Transactions
- Jobs can be enqueued within database transactions
- Jobs are only created if the transaction commits
- No orphaned jobs from failed transactions
Example: Webhook Delivery with Outbox
// API endpoint that needs to send a webhook
pub async fn create_user(db: &DatabaseConnection, user_data: UserData) -> Result<User> {
    // Start transaction
    let txn = db.begin().await?;

    // Business logic: Create user
    let user = UserStore::create(&txn, &user_data).await?;

    // Enqueue webhook job (within same transaction)
    JobQueueService::enqueue(
        DB::Txn(&txn),
        JobType::DeliverWebhook,
        &WebhookJobPayload {
            webhook_id: "webhook-123",
            event_type: "user.created",
            payload: json!({ "user_id": user.id }),
            delivery_id: Uuid::new_v4().to_string(),
        },
        10,   // priority
        5,    // max_retries
        None, // scheduled_for: send immediately
    ).await?;

    // Commit transaction (atomically creates user + webhook job)
    txn.commit().await?;
    Ok(user)
}
Monitoring
Query job metrics directly from the database:
-- Pending jobs count
SELECT COUNT(*) FROM system_jobs WHERE status = 'pending';
-- Failed jobs in last 24 hours
SELECT * FROM system_jobs
WHERE status = 'failed'
AND failed_at > datetime('now', '-24 hours');
-- Average attempt count for completed jobs
SELECT AVG(attempt_count) FROM system_jobs
WHERE status = 'completed';
-- Job processing rate (last hour)
SELECT
job_type,
COUNT(*) as total,
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as completed,
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed
FROM system_jobs
WHERE created_at > datetime('now', '-1 hour')
GROUP BY job_type;
Production Considerations
- Job Retention: Implement cleanup for old completed/failed jobs
- Dead Letter Queue: Monitor permanently failed jobs for manual intervention
- Alerts: Set up alerts for high failure rates or pending job backlog
- Scaling: Add more worker processes to increase throughput
- Monitoring: Track job processing latency and success rates
Token Refresh Job
Purpose
Automatically refreshes OAuth access tokens before they expire to ensure uninterrupted service access. This prevents users from experiencing authentication failures due to expired tokens.
Schedule
Runs every 5 minutes (300 seconds)
Behavior
- Queries the database for tokens expiring within the next hour
- For each expiring token:
- Determines correct OAuth credentials (BYOO or platform)
- Calls provider’s token refresh endpoint
- Encrypts and stores new access/refresh tokens
- Updates token expiration timestamp
- Skips GitHub tokens (GitHub refresh tokens are complex/optional)
Supported Providers
- Microsoft: Full refresh token support
- Google: Full refresh token support
- GitHub: Not supported (skipped)
Token Types Handled
BYOO Tokens
For organizations using Bring Your Own OAuth (BYOO), the job:
- Retrieves organization-specific OAuth credentials from database
- Decrypts the client secret using encryption service
- Uses organization credentials for token refresh
Platform Tokens
For users authenticating via platform OAuth:
- Uses platform-wide OAuth credentials
- Applies to both platform owners and regular users
- Supports both admin and end-user authentication flows
Encryption Support
The job supports both encrypted and plaintext token storage:
- Encrypted: Uses EncryptionService to decrypt refresh tokens and encrypt new tokens
- Plaintext: Falls back to unencrypted storage if encryption is unavailable
Error Handling
- Individual token refresh failures are logged but don’t stop the job
- Job continues running even if some tokens fail to refresh
- Failed refreshes are retried on the next job cycle
Logging
Token refresh job started
Found 5 tokens to refresh
Refreshed token for identity: abc123
Failed to refresh token for xyz789: Provider error
Webhook Delivery Job
Purpose
Provides reliable webhook delivery with automatic retry and exponential backoff. Ensures that webhook events are delivered even if the recipient endpoint is temporarily unavailable.
Schedule
Runs every 30 seconds
Retry Configuration
| Parameter | Value |
|---|---|
| Maximum Retries | 5 attempts |
| Initial Delay | 5 seconds |
| Maximum Delay | 30 minutes |
| Backoff Strategy | Exponential with jitter |
Delivery Process
- Fetches up to 100 pending webhook deliveries from database
- For each delivery:
- Retrieves webhook configuration (URL, secret, events)
- Generates HMAC-SHA256 signature for payload
- Sends POST request with headers:
  - Content-Type: application/json
  - X-Webhook-Signature: sha256=...
  - X-Webhook-Timestamp: {unix_timestamp}
- Processes response:
- 2xx Success: Mark as delivered
- Non-2xx or Error: Schedule retry with backoff
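Header assembly can be sketched as follows; the function name is hypothetical and the signature value is a placeholder for the real HMAC-SHA256 digest:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Build the three delivery headers. The caller supplies the hex-encoded
// HMAC-SHA256 digest of the payload (a placeholder is used in main).
fn delivery_headers(signature_hex: &str) -> Vec<(String, String)> {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_secs();
    vec![
        ("Content-Type".into(), "application/json".into()),
        ("X-Webhook-Signature".into(), format!("sha256={signature_hex}")),
        ("X-Webhook-Timestamp".into(), ts.to_string()),
    ]
}

fn main() {
    for (name, value) in delivery_headers("deadbeef") {
        println!("{name}: {value}");
    }
}
```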
Exponential Backoff
Retry delays follow exponential backoff with jitter:
| Attempt | Base Delay | Max Delay | Jitter |
|---|---|---|---|
| 1 | 5s | 30m | 0-9s |
| 2 | 10s | 30m | 0-9s |
| 3 | 20s | 30m | 0-9s |
| 4 | 40s | 30m | 0-9s |
| 5 | 80s | 30m | 0-9s |
Jitter (0-9 seconds) prevents thundering herd when multiple webhooks fail simultaneously.
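The delay calculation can be sketched as below; the constants match the tables above, while the jitter source (subsecond clock nanos in place of a real RNG) is an illustrative stand-in:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const INITIAL_DELAY_SECS: u64 = 5;
const MAX_DELAY_SECS: u64 = 30 * 60; // 30-minute cap

// Base delay: 5s on attempt 1, doubling each attempt, capped at 30m.
fn base_delay_secs(attempt: u32) -> u64 {
    (INITIAL_DELAY_SECS << attempt.saturating_sub(1)).min(MAX_DELAY_SECS)
}

// Full delay with 0-9 seconds of jitter to avoid a thundering herd.
fn delay_with_jitter(attempt: u32) -> Duration {
    let jitter = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos() as u64 % 10)
        .unwrap_or(0);
    Duration::from_secs(base_delay_secs(attempt) + jitter)
}

fn main() {
    for attempt in 1..=5 {
        println!(
            "attempt {attempt}: base {}s, with jitter {:?}",
            base_delay_secs(attempt),
            delay_with_jitter(attempt)
        );
    }
}
```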
Permanent Failures
After 5 failed attempts, deliveries are marked as permanently failed:
- Status set to failed
- No further retry attempts
- Error details stored in database for debugging
Security
Each webhook includes an HMAC-SHA256 signature generated with the webhook secret:
X-Webhook-Signature: sha256={hex_encoded_hmac}
Recipients should verify this signature to confirm authenticity.
Timeout
HTTP requests timeout after 30 seconds to prevent hanging connections.
Logging
Webhook delivery job started
Processing 3 pending webhook deliveries
Webhook delivery abc123 succeeded
Webhook delivery xyz789 failed with status 503
Webhook delivery xyz789 scheduled for retry at 2025-01-15T10:35:00Z
Webhook delivery failed123 permanently failed after 5 attempts
OAuth State Cleanup Job
Purpose
Removes expired OAuth authentication states from the database to maintain database health and prevent state table bloat.
Schedule
Runs every 10 minutes (600 seconds)
Behavior
- Queries the oauth_states table for expired entries
- Deletes all states past their expiration timestamp
- Logs count of deleted states
State Lifecycle
OAuth states are temporary tokens used during the OAuth flow:
- Created: When user initiates OAuth login
- Used: When OAuth callback is processed
- Expired: After configured TTL (typically 10-15 minutes)
- Deleted: By this cleanup job
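The cleanup predicate reduces to a timestamp comparison; this in-memory sketch uses illustrative names, whereas the real job issues a single DELETE against oauth_states:

```rust
use std::time::{Duration, SystemTime};

struct OAuthState {
    state: String,
    expires_at: SystemTime,
}

// Keep only states that have not yet expired; report how many were removed.
fn cleanup_expired(states: Vec<OAuthState>, now: SystemTime) -> (Vec<OAuthState>, usize) {
    let before = states.len();
    let kept: Vec<_> = states.into_iter().filter(|s| s.expires_at > now).collect();
    let deleted = before - kept.len();
    (kept, deleted)
}

fn main() {
    let now = SystemTime::now();
    let states = vec![
        OAuthState { state: "live".into(), expires_at: now + Duration::from_secs(600) },
        OAuthState { state: "stale".into(), expires_at: now - Duration::from_secs(60) },
    ];
    let (kept, deleted) = cleanup_expired(states, now);
    println!("Cleaned up {deleted} expired OAuth states; {} remain", kept.len());
}
```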
Database Impact
Prevents unbounded growth of the oauth_states table, maintaining:
- Query performance
- Disk space efficiency
- Index efficiency
Logging
OAuth state cleanup job started
Cleaned up 42 expired OAuth states
If no states are expired, the job runs silently without logging.
SAML State Cleanup Job
Purpose
Removes expired SAML authentication states from the database, similar to OAuth state cleanup.
Schedule
Runs every 10 minutes (600 seconds)
Behavior
- Queries the saml_states table for expired entries
- Deletes all states past their expiration timestamp
- Logs count of deleted states
State Lifecycle
SAML states track SAML SSO authentication flows:
- Created: When SAML authentication is initiated
- Used: When SAML response is processed
- Expired: After configured TTL
- Deleted: By this cleanup job
Logging
SAML state cleanup job started
Cleaned up 15 expired SAML states
Database WAL Checkpointing
Purpose
Aggressively checkpoints the SQLite Write-Ahead Log (WAL) to optimize database performance under heavy load.
Schedule
Runs every 10 seconds
Behavior
Periodically checkpoints the WAL, merging committed pages from the WAL back into the main database file. This keeps the WAL file from growing without bound and maintains efficient database file sizes and consistent read performance under heavy load.
Performance Impact
- Benefit: Maintains consistent read performance
- Cost: Additional background disk operations
- Frequency: Every 10 seconds
- Optimization: Designed for read-heavy workloads like AuthOS
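The checkpoint itself can be issued as a single SQLite statement; the mode below is illustrative, since the exact mode AuthOS uses is not documented here:

```sql
-- TRUNCATE merges all committed WAL frames back into the main database
-- file and resets the WAL to zero length.
PRAGMA wal_checkpoint(TRUNCATE);
```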
Job Lifecycle
Startup
All background jobs start automatically when the application launches:
Token refresh job started
OAuth state cleanup job started
SAML state cleanup job started
Webhook delivery job started
Runtime
Jobs run independently with these characteristics:
- Each job operates in its own execution context
- Failures in one job don’t affect other jobs
- Jobs continue running until application shutdown
- Graceful error handling ensures job stability
Shutdown
Jobs terminate gracefully when the application shuts down, ensuring clean completion of in-progress operations.
Monitoring Background Jobs
Health Indicators
Monitor these signals to ensure jobs are running correctly:
Token Refresh
- Check logs for "Refreshed token for identity" messages
- Monitor for "Failed to refresh token" errors
- Track count of tokens needing refresh
Webhook Delivery
- Watch for "Webhook delivery X succeeded" messages
- Alert on "permanently failed" deliveries
- Monitor retry counts and delays
State Cleanup
- Verify "Cleaned up X expired states" logs appear regularly
- Monitor table row counts to detect cleanup failures
Troubleshooting
Token Refresh Not Working
Symptom: Tokens expiring without refresh
Causes:
- Encryption service unavailable
- Invalid OAuth credentials
- Provider API down
Resolution:
- Check encryption key configuration
- Verify OAuth client ID/secret are correct
- Check provider API status
Webhook Deliveries Failing
Symptom: All webhooks timing out or failing
Causes:
- Recipient endpoint down
- Firewall blocking outbound requests
- Network connectivity issues
Resolution:
- Test webhook URL manually with curl
- Check firewall/security group rules
- Verify network connectivity
State Tables Growing
Symptom: oauth_states or saml_states table size increasing
Causes:
- Cleanup job not running
- Database errors preventing deletion
- Expiration logic misconfigured
Resolution:
- Check job logs for errors
- Verify database write permissions
- Inspect state expiration timestamps
Performance Characteristics
Resource Usage
| Job | CPU | Memory | I/O |
|---|---|---|---|
| Token Refresh | Low | Low | Low |
| Webhook Delivery | Medium | Low | Medium (network) |
| OAuth State Cleanup | Low | Low | Low |
| SAML State Cleanup | Low | Low | Low |
| WAL Checkpointing | Low | Low | High (writes) |
Database Load
Background jobs are designed to minimize database impact:
- Token refresh: Processes tokens efficiently in batches
- Webhook delivery: Processes up to 100 webhooks per cycle
- State cleanup: Efficient single-query cleanup operations
- Database optimization: Lightweight performance maintenance
Scalability
All jobs scale with platform usage:
- Token refresh: Scales with active users
- Webhook delivery: Scales with event volume
- State cleanup: Scales with authentication rate
- WAL checkpoint: Independent of usage
Configuration
Background jobs run with the following schedules:
Token Refresh
- Frequency: Every 5 minutes
- Refresh Window: Tokens expiring within 1 hour
- Providers Supported: Microsoft, Google (GitHub refresh tokens are not used)
Webhook Delivery
- Frequency: Every 30 seconds
- Max Retries: 5 attempts per webhook
- Retry Strategy: Exponential backoff from 5 seconds to 30 minutes
State Cleanup
- Frequency: Every 10 minutes
- Cleanup Target: Expired OAuth and SAML authentication states
Database Optimization
- Frequency: Every 10 seconds
- Purpose: Performance optimization for read-heavy workloads
Production Considerations
Logging
All jobs log to standard output/error. Configure your deployment to:
- Capture and aggregate logs (e.g., CloudWatch, Datadog)
- Set up alerts for error patterns
- Monitor job execution frequency
Encryption Service
Token refresh requires encryption service for BYOO tokens. Ensure:
- ENCRYPTION_KEY environment variable is set
- Encryption key is properly rotated
- Key ID matches encrypted token metadata
Database Backups
WAL checkpointing ensures database consistency:
- Safe to back up main database file
- WAL file can be backed up separately
- Point-in-time recovery supported
High Availability
For multi-instance deployments:
- Each worker has a unique worker_id for traceability
- Atomic job claiming with FOR UPDATE SKIP LOCKED prevents duplicate processing
- Multiple workers can safely process jobs concurrently
- Database-level locking prevents two workers from claiming the same job
- No leader election required; all workers are equal
Related Documentation
- Health Checks - Monitor service health
- Webhooks - Configure webhooks for events
- Authentication Concepts - OAuth and token management