Background Jobs and System Maintenance
AuthOS runs several background jobs to ensure reliability, performance, and data integrity. These jobs run automatically on startup and operate continuously throughout the application lifecycle.
Overview
Background jobs handle critical operational tasks:
- Token Refresh: Proactively refresh expiring OAuth tokens
- Webhook Delivery: Reliable webhook delivery with automatic retry
- State Cleanup: Remove expired OAuth and SAML authentication states
- Database Optimization: Performance optimization for database operations
All background jobs log their activity and handle failures gracefully without affecting the main application.
Architecture: Transactional Outbox Pattern
Overview
The platform implements the Transactional Outbox pattern using the system_jobs table to ensure reliable background job processing: jobs are never lost, and atomic claiming prevents two workers from processing the same job.
Why Transactional Outbox?
Traditional background job systems can lose jobs if the application crashes after committing a database transaction but before enqueuing the job. The Transactional Outbox pattern solves this by:
- Atomic Writes: Jobs are inserted into the database within the same transaction as the business logic
- Persistent Queue: Jobs survive application restarts and crashes
- Safe Concurrent Processing: Atomic claiming ensures each job is handled by at most one worker at a time, even under failures
- Order Preservation: Jobs can be prioritized and scheduled for future execution
System Jobs Table Schema
CREATE TABLE system_jobs (
id TEXT PRIMARY KEY,
job_type TEXT NOT NULL, -- "send_email", "deliver_webhook", "stream_audit_logs"
payload TEXT NOT NULL, -- JSON-encoded job data
status TEXT NOT NULL, -- "pending", "processing", "completed", "failed"
priority INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER NOT NULL DEFAULT 3,
attempt_count INTEGER NOT NULL DEFAULT 0,
worker_id TEXT, -- ID of worker processing this job (hostname-UUID)
scheduled_for TEXT NOT NULL, -- ISO 8601 timestamp
last_attempt_at TEXT,
completed_at TEXT,
failed_at TEXT,
error_message TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
Indexes for Performance
The table includes optimized indexes for efficient job processing:
-- Primary query: Fetch pending jobs ready for processing
CREATE INDEX idx_system_jobs_status_scheduled
ON system_jobs (status, scheduled_for, priority);
-- Query: Filter jobs by type
CREATE INDEX idx_system_jobs_type
ON system_jobs (job_type);
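A sketch of the claim query these indexes serve; the actual SQL is generated by SeaORM at runtime, so the exact shape shown here is an assumption:

```sql
-- Fetch the next pending job that is due, highest priority first.
-- FOR UPDATE SKIP LOCKED applies on PostgreSQL/MySQL; SQLite relies on
-- optimistic locking instead (see the Job Processor section).
SELECT *
FROM system_jobs
WHERE status = 'pending'
  AND scheduled_for <= :now
ORDER BY priority DESC, scheduled_for ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```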
Job Types
The system supports multiple job types:
| Job Type | Purpose | Payload |
|---|---|---|
| send_email | Transactional email delivery | Email address, subject, body (HTML/text) |
| deliver_webhook | Webhook event delivery with retry | Webhook ID, event type, payload, delivery ID |
| stream_audit_logs | SIEM log streaming | Config ID, audit type, batch ID |
| custom | Extensible custom jobs | Any JSON payload |
Job Lifecycle
1. Enqueue (Transaction):
// Within a database transaction
JobQueueService::enqueue(
    db_transaction,
    JobType::SendEmail,
    &email_payload,
    10,               // priority
    3,                // max_retries
    Some(Utc::now()), // scheduled_for
).await?;
2. Pending: Job sits in the queue with status = 'pending'
3. Processing: Job processor picks up the job:
   - Updates status = 'processing'
   - Increments attempt_count
   - Updates last_attempt_at
4. Completion:
   - Success: status = 'completed', completed_at set
   - Failure: status = 'failed', failed_at and error_message set
   - Retry: status = 'pending', scheduled_for updated with exponential backoff
Job Processor
The Job Processor runs as a background worker that atomically claims and processes jobs from the system_jobs table.
Schedule: Runs continuously, sleeping for 10 seconds when the queue is empty
Atomic Claiming: Uses database-level locking for safe horizontal scaling:
- PostgreSQL/MySQL: SELECT ... FOR UPDATE SKIP LOCKED via SeaORM
- SQLite: Optimistic locking with transaction isolation
Worker Identification: Each worker has a unique worker_id (format: hostname-UUID) for traceability.
Processing Flow:
1. Atomically claim next pending job (lock row, update status)
2. Execute job handler based on job_type
3. Mark as completed or failed
4. Retry with backoff if attempt_count < max_retries
5. Immediately claim next job or sleep if queue empty
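The flow above can be sketched with an in-memory queue standing in for the system_jobs table; the names (Job, claim_next, run_handler) are illustrative, not the actual AuthOS API:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

#[derive(Debug)]
struct Job {
    job_type: String,
    attempt_count: u32,
    max_retries: u32,
}

// Stand-in for "atomically claim next pending job": against a real
// database this is SELECT ... FOR UPDATE SKIP LOCKED plus a status update.
fn claim_next(queue: &Mutex<VecDeque<Job>>) -> Option<Job> {
    queue.lock().unwrap().pop_front()
}

// Dispatch on job_type (step 2 of the processing flow).
fn run_handler(job: &Job) -> Result<(), String> {
    match job.job_type.as_str() {
        "send_email" | "deliver_webhook" | "stream_audit_logs" => Ok(()),
        other => Err(format!("unknown job type: {other}")),
    }
}

fn process_available_jobs(queue: &Mutex<VecDeque<Job>>) -> (u32, u32) {
    let (mut completed, mut failed) = (0, 0);
    // Step 5: keep claiming until the queue is empty; a real worker
    // would then sleep ~10 seconds before polling again.
    while let Some(mut job) = claim_next(queue) {
        job.attempt_count += 1; // recorded alongside last_attempt_at
        match run_handler(&job) {
            Ok(()) => completed += 1, // mark as completed
            Err(_) if job.attempt_count < job.max_retries => {
                // Retry: re-enqueue for a later attempt (with backoff in practice)
                queue.lock().unwrap().push_back(job);
            }
            Err(_) => failed += 1, // permanently failed
        }
    }
    (completed, failed)
}

fn main() {
    let queue = Mutex::new(VecDeque::from([
        Job { job_type: "send_email".into(), attempt_count: 0, max_retries: 3 },
        Job { job_type: "bogus".into(), attempt_count: 0, max_retries: 3 },
    ]));
    let (completed, failed) = process_available_jobs(&queue);
    println!("completed={completed} failed={failed}");
}
```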
Horizontal Scaling: Multiple workers can safely run concurrently. The FOR UPDATE SKIP LOCKED pattern ensures:
- Each pending job is claimed by at most one worker at a time
- No race conditions between workers
- Workers don’t block each other
Retry Strategy
Failed jobs are automatically retried with exponential backoff:
| Attempt | Delay | Total Time |
|---|---|---|
| 1 | 0s | 0s |
| 2 | 30s | 30s |
| 3 | 60s | 1m 30s |
| 4 | 120s | 3m 30s |
After max_retries attempts, jobs are marked as permanently failed.
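The schedule in the table reduces to a small function; the name and seconds-based units are illustrative:

```rust
// Retry schedule from the table above: no delay on the first attempt,
// then a 30-second base delay that doubles on each subsequent attempt.
fn retry_delay_secs(attempt: u32) -> u64 {
    if attempt <= 1 {
        0
    } else {
        30 * 2u64.pow(attempt - 2) // 30s, 60s, 120s, ...
    }
}

fn main() {
    for attempt in 1..=4 {
        println!("attempt {attempt}: delay {}s", retry_delay_secs(attempt));
    }
}
```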
Benefits
Reliability
- Jobs are never lost due to application crashes
- Atomic writes ensure consistency between business logic and job creation
- Automatic retry with exponential backoff
Observability
- All jobs persisted in database with full history
- Track job status, retry count, and error messages
- Query job metrics and failure patterns
Scalability
- Multiple workers can process jobs concurrently
- Priority-based processing for critical jobs
- Scheduled jobs for delayed execution
Transactions
- Jobs can be enqueued within database transactions
- Jobs are only created if the transaction commits
- No orphaned jobs from failed transactions
Example: Webhook Delivery with Outbox
// API endpoint that needs to send a webhook
pub async fn create_user(db: &DatabaseConnection, user_data: UserData) -> Result<User> {
    // Start transaction
    let txn = db.begin().await?;

    // Business logic: Create user
    let user = UserStore::create(&txn, &user_data).await?;

    // Enqueue webhook job (within same transaction)
    JobQueueService::enqueue(
        DB::Txn(&txn),
        JobType::DeliverWebhook,
        &WebhookJobPayload {
            webhook_id: "webhook-123",
            event_type: "user.created",
            payload: json!({ "user_id": user.id }),
            delivery_id: Uuid::new_v4().to_string(),
        },
        10,   // priority
        5,    // max_retries
        None, // scheduled_for: send immediately
    ).await?;

    // Commit transaction (atomically creates user + webhook job)
    txn.commit().await?;
    Ok(user)
}
Monitoring
Query job metrics directly from the database:
-- Pending jobs count
SELECT COUNT(*) FROM system_jobs WHERE status = 'pending';
-- Failed jobs in last 24 hours
SELECT * FROM system_jobs
WHERE status = 'failed'
AND failed_at > datetime('now', '-24 hours');
-- Average attempt count for completed jobs
SELECT AVG(attempt_count) FROM system_jobs
WHERE status = 'completed';
-- Job processing rate (last hour)
SELECT
job_type,
COUNT(*) as total,
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as completed,
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed
FROM system_jobs
WHERE created_at > datetime('now', '-1 hour')
GROUP BY job_type;
Production Considerations
- Job Retention: Implement cleanup for old completed/failed jobs
- Dead Letter Queue: Monitor permanently failed jobs for manual intervention
- Alerts: Set up alerts for high failure rates or pending job backlog
- Scaling: Add more worker processes to increase throughput
- Monitoring: Track job processing latency and success rates
Token Refresh Job
Purpose
Automatically refreshes OAuth access tokens before they expire to ensure uninterrupted service access. This prevents users from experiencing authentication failures due to expired tokens.
Schedule
Runs every 5 minutes (300 seconds)
Behavior
- Queries the database for tokens expiring within the next hour
- For each expiring token:
- Determines correct OAuth credentials (BYOO or platform)
- Calls provider’s token refresh endpoint
- Encrypts and stores new access/refresh tokens
- Updates token expiration timestamp
- Skips GitHub tokens (GitHub refresh tokens are complex/optional)
Supported Providers
- Microsoft: Full refresh token support
- Google: Full refresh token support
- GitHub: Not supported (skipped)
Token Types Handled
BYOO Tokens
For organizations using Bring Your Own OAuth (BYOO), the job:
- Retrieves organization-specific OAuth credentials from database
- Decrypts the client secret using encryption service
- Uses organization credentials for token refresh
Platform Tokens
For users authenticating via platform OAuth:
- Uses platform-wide OAuth credentials
- Applies to both platform owners and regular users
- Supports both admin and end-user authentication flows
Encryption Support
The job supports both encrypted and plaintext token storage:
- Encrypted: Uses EncryptionService to decrypt refresh tokens and encrypt new tokens
- Plaintext: Falls back to unencrypted storage if encryption is unavailable
Error Handling
- Individual token refresh failures are logged but don’t stop the job
- Job continues running even if some tokens fail to refresh
- Failed refreshes are retried on the next job cycle
Logging
Token refresh job started
Found 5 tokens to refresh
Refreshed token for identity: abc123
Failed to refresh token for xyz789: Provider error
Webhook Delivery Job
Purpose
Provides reliable webhook delivery with automatic retry and exponential backoff. Ensures that webhook events are delivered even if the recipient endpoint is temporarily unavailable.
Schedule
Runs every 30 seconds
Retry Configuration
| Parameter | Value |
|---|---|
| Maximum Retries | 5 attempts |
| Initial Delay | 5 seconds |
| Maximum Delay | 30 minutes |
| Backoff Strategy | Exponential with jitter |
Delivery Process
- Fetches up to 100 pending webhook deliveries from database
- For each delivery:
- Retrieves webhook configuration (URL, secret, events)
- Generates HMAC-SHA256 signature for payload
- Sends POST request with headers:
  - Content-Type: application/json
  - X-Webhook-Signature: sha256=...
  - X-Webhook-Timestamp: {unix_timestamp}
- Processes response:
- 2xx Success: Mark as delivered
- Non-2xx or Error: Schedule retry with backoff
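Header assembly can be sketched as follows; the function name is hypothetical and the signature value is a placeholder for the real HMAC-SHA256 digest:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Build the three delivery headers. The caller supplies the hex-encoded
// HMAC-SHA256 digest of the payload (a placeholder is used in main).
fn delivery_headers(signature_hex: &str) -> Vec<(String, String)> {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_secs();
    vec![
        ("Content-Type".into(), "application/json".into()),
        ("X-Webhook-Signature".into(), format!("sha256={signature_hex}")),
        ("X-Webhook-Timestamp".into(), ts.to_string()),
    ]
}

fn main() {
    for (name, value) in delivery_headers("deadbeef") {
        println!("{name}: {value}");
    }
}
```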
Exponential Backoff
Retry delays follow exponential backoff with jitter:
| Attempt | Base Delay | Max Delay | Jitter |
|---|---|---|---|
| 1 | 5s | 30m | 0-9s |
| 2 | 10s | 30m | 0-9s |
| 3 | 20s | 30m | 0-9s |
| 4 | 40s | 30m | 0-9s |
| 5 | 80s | 30m | 0-9s |
Jitter (0-9 seconds) prevents thundering herd when multiple webhooks fail simultaneously.
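The delay calculation can be sketched as below; the constants match the tables above, while the jitter source (subsecond clock nanos in place of a real RNG) is an illustrative stand-in:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const INITIAL_DELAY_SECS: u64 = 5;
const MAX_DELAY_SECS: u64 = 30 * 60; // 30-minute cap

// Base delay: 5s on attempt 1, doubling each attempt, capped at 30m.
fn base_delay_secs(attempt: u32) -> u64 {
    (INITIAL_DELAY_SECS << attempt.saturating_sub(1)).min(MAX_DELAY_SECS)
}

// Full delay with 0-9 seconds of jitter to avoid a thundering herd.
fn delay_with_jitter(attempt: u32) -> Duration {
    let jitter = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos() as u64 % 10)
        .unwrap_or(0);
    Duration::from_secs(base_delay_secs(attempt) + jitter)
}

fn main() {
    for attempt in 1..=5 {
        println!(
            "attempt {attempt}: base {}s, with jitter {:?}",
            base_delay_secs(attempt),
            delay_with_jitter(attempt)
        );
    }
}
```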
Permanent Failures
After 5 failed attempts, deliveries are marked as permanently failed:
- Status set to failed
- No further retry attempts
- Error details stored in database for debugging
Security
Each webhook includes an HMAC-SHA256 signature generated with the webhook secret:
X-Webhook-Signature: sha256={hex_encoded_hmac}
Recipients should verify this signature to confirm authenticity.
Timeout
HTTP requests timeout after 30 seconds to prevent hanging connections.
Logging
Webhook delivery job started
Processing 3 pending webhook deliveries
Webhook delivery abc123 succeeded
Webhook delivery xyz789 failed with status 503
Webhook delivery xyz789 scheduled for retry at 2025-01-15T10:35:00Z
Webhook delivery failed123 permanently failed after 5 attempts
OAuth State Cleanup Job
Purpose
Removes expired OAuth authentication states from the database to maintain database health and prevent state table bloat.
Schedule
Runs every 10 minutes (600 seconds)
Behavior
- Queries the oauth_states table for expired entries
- Deletes all states past their expiration timestamp
- Logs count of deleted states
State Lifecycle
OAuth states are temporary tokens used during the OAuth flow:
- Created: When user initiates OAuth login
- Used: When OAuth callback is processed
- Expired: After configured TTL (typically 10-15 minutes)
- Deleted: By this cleanup job
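The cleanup predicate reduces to a timestamp comparison; this in-memory sketch uses illustrative names, whereas the real job issues a single DELETE against oauth_states:

```rust
use std::time::{Duration, SystemTime};

struct OAuthState {
    state: String,
    expires_at: SystemTime,
}

// Keep only states that have not yet expired; report how many were removed.
fn cleanup_expired(states: Vec<OAuthState>, now: SystemTime) -> (Vec<OAuthState>, usize) {
    let before = states.len();
    let kept: Vec<_> = states.into_iter().filter(|s| s.expires_at > now).collect();
    let deleted = before - kept.len();
    (kept, deleted)
}

fn main() {
    let now = SystemTime::now();
    let states = vec![
        OAuthState { state: "live".into(), expires_at: now + Duration::from_secs(600) },
        OAuthState { state: "stale".into(), expires_at: now - Duration::from_secs(60) },
    ];
    let (kept, deleted) = cleanup_expired(states, now);
    println!("Cleaned up {deleted} expired OAuth states; {} remain", kept.len());
}
```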
Database Impact
Prevents unbounded growth of the oauth_states table, maintaining:
- Query performance
- Disk space efficiency
- Index efficiency
Logging
OAuth state cleanup job started
Cleaned up 42 expired OAuth states
If no states are expired, the job runs silently without logging.
SAML State Cleanup Job
Purpose
Removes expired SAML authentication states from the database, similar to OAuth state cleanup.
Schedule
Runs every 10 minutes (600 seconds)
Behavior
- Queries the saml_states table for expired entries
- Deletes all states past their expiration timestamp
- Logs count of deleted states
State Lifecycle
SAML states track SAML SSO authentication flows:
- Created: When SAML authentication is initiated
- Used: When SAML response is processed
- Expired: After configured TTL
- Deleted: By this cleanup job
Logging
SAML state cleanup job started
Cleaned up 15 expired SAML states
Database WAL Checkpointing
Purpose
Aggressively checkpoints the SQLite Write-Ahead Log (WAL) to optimize database performance under heavy load.
Schedule
Runs every 10 seconds
Behavior
Periodically checkpoints the WAL, merging committed pages from the WAL back into the main database file. This keeps the WAL file from growing without bound and maintains efficient database file sizes and consistent read performance under heavy load.
Performance Impact
- Benefit: Maintains consistent read performance
- Cost: Additional background disk operations
- Frequency: Every 10 seconds
- Optimization: Designed for read-heavy workloads like AuthOS
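The checkpoint itself can be issued as a single SQLite statement; the mode below is illustrative, since the exact mode AuthOS uses is not documented here:

```sql
-- TRUNCATE merges all committed WAL frames back into the main database
-- file and resets the WAL to zero length.
PRAGMA wal_checkpoint(TRUNCATE);
```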
Job Lifecycle
Startup
All background jobs start automatically when the application launches:
Token refresh job started
OAuth state cleanup job started
SAML state cleanup job started
Webhook delivery job started
Runtime
Jobs run independently with these characteristics:
- Each job operates in its own execution context
- Failures in one job don’t affect other jobs
- Jobs continue running until application shutdown
- Graceful error handling ensures job stability
Shutdown
Jobs terminate gracefully when the application shuts down, ensuring clean completion of in-progress operations.
Monitoring Background Jobs
Health Indicators
Monitor these signals to ensure jobs are running correctly:
Token Refresh
- Check logs for "Refreshed token for identity" messages
- Monitor for "Failed to refresh token" errors
- Track count of tokens needing refresh
Webhook Delivery
- Watch for "Webhook delivery X succeeded" messages
- Alert on "permanently failed" deliveries
- Monitor retry counts and delays
State Cleanup
- Verify "Cleaned up X expired states" logs appear regularly
- Monitor table row counts to detect cleanup failures
Troubleshooting
Token Refresh Not Working
Symptom: Tokens expiring without refresh
Causes:
- Encryption service unavailable
- Invalid OAuth credentials
- Provider API down
Resolution:
- Check encryption key configuration
- Verify OAuth client ID/secret are correct
- Check provider API status
Webhook Deliveries Failing
Symptom: All webhooks timing out or failing
Causes:
- Recipient endpoint down
- Firewall blocking outbound requests
- Network connectivity issues
Resolution:
- Test webhook URL manually with curl
- Check firewall/security group rules
- Verify network connectivity
State Tables Growing
Symptom: oauth_states or saml_states table size increasing
Causes:
- Cleanup job not running
- Database errors preventing deletion
- Expiration logic misconfigured
Resolution:
- Check job logs for errors
- Verify database write permissions
- Inspect state expiration timestamps
Performance Characteristics
Resource Usage
| Job | CPU | Memory | I/O |
|---|---|---|---|
| Token Refresh | Low | Low | Low |
| Webhook Delivery | Medium | Low | Medium (network) |
| OAuth State Cleanup | Low | Low | Low |
| SAML State Cleanup | Low | Low | Low |
| WAL Checkpointing | Low | Low | High (writes) |
Database Load
Background jobs are designed to minimize database impact:
- Token refresh: Processes tokens efficiently in batches
- Webhook delivery: Processes up to 100 webhooks per cycle
- State cleanup: Efficient single-query cleanup operations
- Database optimization: Lightweight performance maintenance
Scalability
All jobs scale with platform usage:
- Token refresh: Scales with active users
- Webhook delivery: Scales with event volume
- State cleanup: Scales with authentication rate
- WAL checkpoint: Independent of usage
Configuration
Background jobs run with the following schedules:
Token Refresh
- Frequency: Every 5 minutes
- Refresh Window: Tokens expiring within 1 hour
- Providers Supported: Microsoft, Google (GitHub refresh tokens are not used)
Webhook Delivery
- Frequency: Every 30 seconds
- Max Retries: 5 attempts per webhook
- Retry Strategy: Exponential backoff from 5 seconds to 30 minutes
State Cleanup
- Frequency: Every 10 minutes
- Cleanup Target: Expired OAuth and SAML authentication states
Database Optimization
- Frequency: Every 10 seconds
- Purpose: Performance optimization for read-heavy workloads
Production Considerations
Logging
All jobs log to standard output/error. Configure your deployment to:
- Capture and aggregate logs (e.g., CloudWatch, Datadog)
- Set up alerts for error patterns
- Monitor job execution frequency
Encryption Service
Token refresh requires encryption service for BYOO tokens. Ensure:
- ENCRYPTION_KEY environment variable is set
- Encryption key is properly rotated
- Key ID matches encrypted token metadata
Database Backups
WAL checkpointing ensures database consistency:
- Safe to back up main database file
- WAL file can be backed up separately
- Point-in-time recovery supported
High Availability
For multi-instance deployments:
- Each worker has a unique worker_id for traceability
- Atomic job claiming with FOR UPDATE SKIP LOCKED prevents duplicate processing
- Multiple workers can safely process jobs concurrently
- Database-level locking prevents two workers from claiming the same job
- No leader election required; all workers are equal
Related Documentation
- Health Checks - Monitor service health
- Webhooks - Configure webhooks for events
- Authentication Concepts - OAuth and token management