LookOut Backend Architecture Deep Dive: Real-World Implementation
Monitoring Infrastructure at Scale
LookOut handles endpoint monitoring through a sophisticated scheduling system that balances precision, performance, and resource efficiency. Here's how the core components work together.
The Scheduler Engine
Event-Driven Cache Architecture
The scheduler operates on a zero-database-reads model during normal operation. All endpoint configurations are loaded once at startup and maintained in memory:
# One-time database read at startup
async def _load_endpoints_from_database(self) -> None:
    response = self.supabase.table('endpoints').select('*').eq('is_active', True).execute()
    current_time = time.time()
    for endpoint_data in response.data:
        endpoint_id = str(endpoint_data['id'])
        frequency_seconds = endpoint_data.get('frequency_minutes', 5) * 60
        next_check = current_time + frequency_seconds
        cache_entry = {
            **endpoint_data,
            'next_check_time': next_check
        }
        self.endpoint_cache[endpoint_id] = cache_entry
After initialization, all configuration changes propagate through event handlers:
def on_endpoint_created(self, endpoint_data: Dict[str, Any]) -> None:
    endpoint_id = str(endpoint_data['id'])
    next_check = time.time() + 10  # Start checking new endpoints within 10 seconds
    cache_entry = {**endpoint_data, 'next_check_time': next_check}
    self.endpoint_cache[endpoint_id] = cache_entry
This design eliminates polling overhead and ensures consistent performance regardless of database load.
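Update and delete events follow the same pattern. The article only shows the create handler; a minimal sketch of the other two, assuming handler names symmetric to on_endpoint_created:

def on_endpoint_updated(self, endpoint_data: Dict[str, Any]) -> None:
    # Sketch: re-derive the schedule from the (possibly changed) frequency
    endpoint_id = str(endpoint_data['id'])
    frequency_seconds = endpoint_data.get('frequency_minutes', 5) * 60
    cache_entry = {**endpoint_data, 'next_check_time': time.time() + frequency_seconds}
    self.endpoint_cache[endpoint_id] = cache_entry

def on_endpoint_deleted(self, endpoint_id: str) -> None:
    # Sketch: drop the entry so the scheduler never queues it again
    self.endpoint_cache.pop(str(endpoint_id), None)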
Precision Scheduling Algorithm
The scheduler runs every 30 seconds and uses precise timing calculations to determine which endpoints need checking:
async def _scheduler_loop(self) -> None:
    while self.is_running:
        # Check system health first
        if self.health_monitor and not await self.health_monitor.check_system_health():
            await asyncio.sleep(self.scheduler_interval)
            continue

        # Find due endpoints
        due_endpoints = self._find_due_endpoints()

        # Queue due endpoints for worker processing
        for endpoint_id, scheduled_time in due_endpoints:
            await self.check_queue.put((endpoint_id, scheduled_time))

        await asyncio.sleep(self.scheduler_interval)
def _find_due_endpoints(self) -> List[Tuple[str, float]]:
    current_time = time.time()
    due_endpoints = []
    for endpoint_id, cache_entry in self.endpoint_cache.items():
        if not cache_entry.get('is_active', True):
            continue
        next_check_time = cache_entry.get('next_check_time', 0)
        if current_time >= next_check_time:
            due_endpoints.append((endpoint_id, next_check_time))
            # Update next check time immediately
            frequency_seconds = cache_entry.get('frequency_minutes', 5) * 60
            cache_entry['next_check_time'] = current_time + frequency_seconds
    return due_endpoints
The algorithm ensures endpoints are checked as close to their scheduled time as possible, with a maximum deviation of 30 seconds (the scheduler interval).
Concurrent Worker Pool
HTTP Session Management
The system uses 12 concurrent workers sharing a single HTTP session pool optimized for monitoring workloads:
async def _initialize_http_session(self) -> None:
    connector = aiohttp.TCPConnector(
        limit=self.worker_count * 2,   # 24 total connections
        limit_per_host=10,             # Max 10 connections per host
        ttl_dns_cache=300,             # 5-minute DNS cache
        use_dns_cache=True,
        enable_cleanup_closed=True
    )
    timeout = aiohttp.ClientTimeout(total=self.http_timeout)  # 20 seconds
    self.http_session = aiohttp.ClientSession(
        connector=connector,
        timeout=timeout,
        headers={'User-Agent': 'LookOut-Monitor/1.0'}
    )
Worker Processing Logic
Each worker processes the shared queue independently, performing HTTP checks and database writes:
async def _worker(self, worker_id: int) -> None:
    while self.is_running:
        try:
            # Get next endpoint to check (with timeout for clean shutdown)
            endpoint_id, scheduled_time = await asyncio.wait_for(
                self.check_queue.get(), timeout=1.0
            )
            # Look up the endpoint's configuration in the in-memory cache
            endpoint_config = self.endpoint_cache.get(endpoint_id)
            if endpoint_config is None:
                continue  # Endpoint was deleted while queued
            # Perform the HTTP check
            result = await self._perform_check(endpoint_config)
            # Write result to database immediately
            await self._save_check_result(endpoint_id, result)
        except asyncio.TimeoutError:
            continue  # No work available, check again
        except Exception as e:
            self.logger.error("Worker error", worker_id=worker_id, error=str(e))
Workers operate independently without coordination overhead, using the shared queue for load distribution.
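A sketch of how these pieces could be wired together at startup (the method and attribute names beyond those shown above are assumptions, though worker_tasks reappears in the health endpoint later):

async def start(self) -> None:
    # Sketch of the startup wiring implied by the excerpts above
    self.is_running = True
    self.check_queue = asyncio.Queue()
    await self._initialize_http_session()
    await self._load_endpoints_from_database()
    self.scheduler_task = asyncio.create_task(self._scheduler_loop())
    self.worker_tasks = [
        asyncio.create_task(self._worker(worker_id))
        for worker_id in range(self.worker_count)  # 12 workers
    ]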
Database Write Strategy
Individual ACID Operations
Each monitoring result is persisted through individual ACID operations: the result insert and the endpoint update each commit atomically and independently:
from datetime import datetime, timezone

async def _save_check_result(self, endpoint_id: str, result: Dict[str, Any]) -> None:
    # Client-side timestamp: supabase-py sends literal values, so SQL NOW() isn't usable here
    checked_at = datetime.now(timezone.utc).isoformat()

    # Step 1: Insert check result
    check_data = {
        'endpoint_id': endpoint_id,
        'status_code': result.get('status_code'),
        'response_time_ms': result['response_time_ms'],
        'success': result['success'],
        'error_message': result.get('error'),
        'checked_at': checked_at
    }
    self.supabase.table('check_results').insert(check_data).execute()

    # Step 2: Update endpoint metadata
    consecutive_failures = 0 if result['success'] else (
        self.endpoint_cache.get(endpoint_id, {}).get('consecutive_failures', 0) + 1
    )
    self.supabase.table('endpoints').update({
        'last_check_at': checked_at,
        'consecutive_failures': consecutive_failures
    }).eq('id', endpoint_id).execute()

    # Step 3: Update in-memory cache
    if endpoint_id in self.endpoint_cache:
        self.endpoint_cache[endpoint_id]['consecutive_failures'] = consecutive_failures
This approach prioritizes data integrity over write performance, ensuring each result is permanently recorded even if subsequent operations fail.
Data Management Strategy
Intelligent Cleanup System
LookOut implements a probabilistic cleanup system that runs at the database level. On every check_results insert, a SQL function executes:
-- Simplified version of the cleanup logic
IF random() < [secret_threshold_value] THEN
    DELETE FROM check_results
    WHERE checked_at < NOW() - INTERVAL '7 days'
      AND success = true;
END IF;
Why 7 days? Monitoring dashboards focus on recent trends, and longer historical data provides diminishing value for operational decisions. Why keep failures but delete successes? Failure patterns are critical for incident analysis, while success records primarily contribute to uptime statistics that can be derived from aggregated data.
This probabilistic approach distributes cleanup load over time rather than requiring scheduled maintenance windows.
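The exact threshold is undisclosed, but the expected cadence is easy to estimate. For a hypothetical per-insert probability of 0.01:

# Illustrative arithmetic only -- the real threshold value is secret
p = 0.01                                # hypothetical per-insert probability
checks_per_day = 1000 * (24 * 60 // 5)  # 1000 endpoints at 5-minute frequency
expected_cleanups_per_day = p * checks_per_day
print(expected_cleanups_per_day)        # 2880.0 cleanup passes per day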
Caching Layer Architecture
The system implements multi-layer caching with different TTLs based on data volatility:
# High-frequency data - 5 minute cache
@redis_cache(ttl=300, key_prefix="workspace_stats")
async def get_workspace_stats(self, workspace_id: UUID, user_id: str):
    ...  # Expensive aggregation queries cached here

# User-level data - 10 minute cache
@redis_cache(ttl=600, key_prefix="user_stats")
async def get_user_dashboard_data(self, user_id: str):
    ...  # Dashboard overview data cached here
Redis LRU Eviction: With the 30MB Redis free tier, the system relies on Least Recently Used eviction. Frequently accessed workspace stats stay in cache while stale data is automatically removed.
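The redis_cache decorator itself isn't shown in this article. A minimal sketch of such a decorator, assuming a module-level async Redis client (e.g. redis.asyncio.Redis) and JSON-serializable results:

import json
import functools

def redis_cache(ttl: int, key_prefix: str):
    # Sketch only; the production decorator is not part of the published excerpts
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(self, *args, **kwargs):
            # Build a cache key from the prefix and positional args (kwargs omitted for brevity)
            key = f"{key_prefix}:{':'.join(str(a) for a in args)}"
            cached = await redis_client.get(key)  # assumed module-level client
            if cached is not None:
                return json.loads(cached)
            result = await func(self, *args, **kwargs)
            await redis_client.set(key, json.dumps(result), ex=ttl)
            return result
        return wrapper
    return decorator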
Circuit Breaker Implementation
System Health Monitoring
The health monitor prevents cascade failures through proactive system state management:
class SystemHealthMonitor:
    async def check_system_health(self) -> bool:
        current_time = time.time()
        # Throttle health checks to every 2 minutes
        if current_time - self.last_health_check < self.check_interval:
            return self.is_system_healthy
        self.last_health_check = current_time
        # Test database and internet connectivity
        try:
            await self._test_database_connection()
            await self._test_internet_connectivity()
            if not self.is_system_healthy:
                await self._handle_recovery()
            else:
                self.consecutive_successes += 1
            return self.is_system_healthy
        except Exception as e:
            await self._handle_failure(str(e))
            return False
Failure State Management
The circuit breaker uses configurable thresholds to determine when to trip:
- Failure Threshold: 3 consecutive failures trigger circuit opening
- Success Threshold: 3 consecutive successes enable recovery
- Queue Overwhelm: 1000+ queued items pause scheduling
When the circuit opens, the scheduler skips work cycles, allowing the system to recover without accumulating additional load.
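The _handle_failure and _handle_recovery internals aren't shown above; a sketch consistent with those thresholds (attribute names are assumptions):

async def _handle_failure(self, error: str) -> None:
    # Sketch: trip the breaker after 3 consecutive failures
    self.consecutive_successes = 0
    self.consecutive_failures += 1
    if self.consecutive_failures >= self.failure_threshold:  # 3
        self.is_system_healthy = False  # scheduler now skips work cycles

async def _handle_recovery(self) -> None:
    # Sketch: close the breaker after 3 consecutive successes
    self.consecutive_successes += 1
    if self.consecutive_successes >= self.success_threshold:  # 3
        self.is_system_healthy = True
        self.consecutive_failures = 0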
Performance Optimizations
Memory Efficiency
The in-memory cache scales efficiently with endpoint count. Each cache entry contains:
cache_entry = {
    'id': endpoint_id,                        # UUID string: 36 bytes
    'name': endpoint_name,                    # ~30 bytes average
    'url': endpoint_url,                      # ~100 bytes average
    'method': http_method,                    # 3-4 bytes
    'frequency_minutes': check_frequency,     # 4 bytes
    'next_check_time': calculated_timestamp,  # 8 bytes
    'consecutive_failures': failure_count,    # 4 bytes
    'headers': request_headers,               # 0-200 bytes typical
    'body': request_body,                     # 0-1KB typical
    # Additional endpoint configuration...
}
Storage Breakdown:
- Typical GET: ~463 bytes per endpoint
- GET with headers: ~663 bytes per endpoint
- POST with body: ~1.7KB per endpoint
- Weighted average: ~676 bytes per endpoint
Current memory usage is estimated at runtime:
cache_size_mb = sys.getsizeof(self.endpoint_cache) / (1024 * 1024)  # dict structure only; a lower bound
At the ~676-byte weighted average, the cache supports thousands of endpoints within typical VM memory limits (1000 endpoints = ~676KB).
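Since sys.getsizeof counts only the dictionary itself, a recursive estimator gives a more faithful number. A sketch (not part of the LookOut codebase):

import sys

def deep_sizeof(obj, seen=None) -> int:
    # Recursively sums container plus contents; illustrative helper only
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0  # Don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

cache_size_mb = deep_sizeof(self.endpoint_cache) / (1024 * 1024)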
Connection Pool Optimization
HTTP connection pooling reduces establishment overhead:
- Total Pool: 24 connections (worker_count * 2)
- Per-Host Limit: 10 connections maximum
- DNS Caching: 5-minute TTL reduces lookup latency
- Connection Reuse: Persistent connections for frequently checked endpoints
Integration Points
Event-Driven Updates
API operations trigger immediate cache updates without database polling:
# API endpoint creation
@router.post("/endpoints")
async def create_endpoint(endpoint_data: EndpointCreate):
    # Save to database
    new_endpoint = await endpoint_service.create_endpoint(endpoint_data)
    # Trigger scheduler cache update
    scheduler.on_endpoint_created(new_endpoint.dict())
    return new_endpoint
Notification Integration
The monitoring system integrates with notification triggers:
async def _save_check_result(self, endpoint_id: str, result: Dict[str, Any]):
    # ... database operations ...
    # Trigger notification system for failures
    await notification_trigger.handle_endpoint_check(endpoint_id, result)
Failure Threshold Logic: Notifications trigger when consecutive_failures >= failure_threshold (default: 2) and the user has notifications enabled.
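A sketch of what handle_endpoint_check might do under those rules (field and method names here are assumptions):

async def handle_endpoint_check(self, endpoint_id: str, result: Dict[str, Any]) -> None:
    # Hypothetical sketch; the real trigger implementation isn't shown above
    endpoint = self.endpoint_cache.get(endpoint_id, {})
    threshold = endpoint.get('failure_threshold', 2)
    if (not result['success']
            and endpoint.get('consecutive_failures', 0) >= threshold
            and endpoint.get('notifications_enabled', True)):
        await self._send_failure_notification(endpoint_id, result)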
Real-World Performance Characteristics
Throughput Metrics
- Scheduler Precision: ±30 seconds maximum deviation from scheduled time
- Worker Concurrency: 12 simultaneous HTTP checks
- Database Writes: ~2 writes per check (result insert plus endpoint metadata update)
- Cache Hit Rate: >90% for dashboard queries (5-minute TTL)
Resource Utilization
- Memory: Linear scaling with endpoint count (~676 bytes per endpoint, as calculated above)
- CPU: I/O bound workload, minimal computation
- Network: Outbound HTTP requests + database connections
- Database: Write-heavy workload, ~1MB per 1000 checks
Scaling Characteristics
The current architecture handles growth through:
- Vertical Scaling: More workers + larger connection pools on bigger VMs
- Intelligent Throttling: Circuit breaker prevents resource exhaustion
- Efficient Caching: Reduces database load as user base grows
- Event-Driven Design: Configuration changes don't impact monitoring performance
Operational Reliability
Error Handling Patterns
# Graceful degradation for endpoint deletions
if 'check_results_endpoint_id_fkey' in error_str:
    if endpoint_id in self.endpoint_cache:
        del self.endpoint_cache[endpoint_id]
        print(f"🗑️ Removed deleted endpoint {endpoint_id} from cache")
    return
The system handles common failure modes:
- Deleted Endpoints: Automatic cache cleanup on foreign key violations
- Network Timeouts: Per-request 20-second timeouts prevent worker blocking
- Database Failures: Circuit breaker pauses operations during outages
- Memory Pressure: Fixed-size queues prevent unbounded growth (see the sketch below)
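The last point can be made concrete with a bounded queue; a sketch reusing the 1000-item overwhelm threshold mentioned earlier:

# Cap the queue so a stalled worker pool can't grow memory without bound
self.check_queue = asyncio.Queue(maxsize=1000)

# In the scheduler loop, a full queue is treated as backpressure (sketch)
try:
    self.check_queue.put_nowait((endpoint_id, scheduled_time))
except asyncio.QueueFull:
    self.logger.warning("Check queue full, deferring endpoint", endpoint_id=endpoint_id)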
Monitoring Integration
Health status is exposed through API endpoints:
@router.get("/health")
async def get_scheduler_health():
return {
"enabled": True,
"endpoints_monitored": len(scheduler.endpoint_cache),
"queue_size": scheduler.check_queue.qsize(),
"is_healthy": scheduler.health_monitor.is_system_healthy,
"worker_count": len(scheduler.worker_tasks)
}
Architecture Benefits
Technical Advantages
- Predictable Performance: Event-driven cache eliminates database polling overhead
- Resource Efficiency: Shared HTTP sessions and connection pooling minimize resource usage
- Fault Tolerance: Circuit breaker pattern prevents cascade failures
- Data Consistency: Individual ACID transactions ensure reliable result storage
- Operational Simplicity: Single-process architecture reduces deployment complexity
Business Impact
- Cost Control: Free tier compatibility through intelligent resource management
- User Experience: Sub-100ms API responses through aggressive caching
- Reliability: Circuit breaker maintains service during infrastructure issues
- Scalability: Event-driven design supports growth without architectural changes
The implementation demonstrates how sophisticated monitoring infrastructure can be built within strict resource constraints while maintaining enterprise-grade reliability patterns.