LookOut Backend Architecture Deep Dive: Real-World Implementation
Monitoring Infrastructure at Scale
LookOut handles endpoint monitoring through a sophisticated scheduling system that balances precision, performance, and resource efficiency. Here's how the core components work together.
The Scheduler Engine
Event-Driven Cache Architecture
The scheduler operates on a zero-database-reads model during normal operation. All endpoint configurations are loaded once at startup and maintained in memory:
# One-time database read at startup
async def _load_endpoints_from_database(self) -> None:
    response = self.supabase.table('endpoints').select('*').eq('is_active', True).execute()
    current_time = time.time()
    for endpoint_data in response.data:
        endpoint_id = str(endpoint_data['id'])
        frequency_seconds = endpoint_data.get('frequency_minutes', 5) * 60
        next_check = current_time + frequency_seconds
        cache_entry = {
            **endpoint_data,
            'next_check_time': next_check
        }
        self.endpoint_cache[endpoint_id] = cache_entry
After initialization, all configuration changes propagate through event handlers:
def on_endpoint_created(self, endpoint_data: Dict[str, Any]) -> None:
    endpoint_id = str(endpoint_data['id'])
    next_check = time.time() + 10  # Start checking new endpoints within 10 seconds
    cache_entry = {**endpoint_data, 'next_check_time': next_check}
    self.endpoint_cache[endpoint_id] = cache_entry
This design eliminates polling overhead and ensures consistent performance regardless of database load.
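Update and delete events follow the same pattern. The article only shows the create handler; a minimal sketch of the other two, assuming handler names symmetric to on_endpoint_created:

def on_endpoint_updated(self, endpoint_data: Dict[str, Any]) -> None:
    # Sketch: re-derive the schedule from the (possibly changed) frequency
    endpoint_id = str(endpoint_data['id'])
    frequency_seconds = endpoint_data.get('frequency_minutes', 5) * 60
    cache_entry = {**endpoint_data, 'next_check_time': time.time() + frequency_seconds}
    self.endpoint_cache[endpoint_id] = cache_entry

def on_endpoint_deleted(self, endpoint_id: str) -> None:
    # Sketch: drop the entry so the scheduler never queues it again
    self.endpoint_cache.pop(str(endpoint_id), None)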
Precision Scheduling Algorithm
The scheduler runs every 30 seconds and uses precise timing calculations to determine which endpoints need checking:
async def _scheduler_loop(self) -> None:
    while self.is_running:
        # Check system health first
        if self.health_monitor and not await self.health_monitor.check_system_health():
            await asyncio.sleep(self.scheduler_interval)
            continue

        # Find due endpoints
        due_endpoints = self._find_due_endpoints()

        # Queue due endpoints for worker processing
        for endpoint_id, scheduled_time in due_endpoints:
            await self.check_queue.put((endpoint_id, scheduled_time))

        await asyncio.sleep(self.scheduler_interval)
def _find_due_endpoints(self) -> List[Tuple[str, float]]:
    current_time = time.time()
    due_endpoints = []
    for endpoint_id, cache_entry in self.endpoint_cache.items():
        if not cache_entry.get('is_active', True):
            continue
        next_check_time = cache_entry.get('next_check_time', 0)
        if current_time >= next_check_time:
            due_endpoints.append((endpoint_id, next_check_time))
            # Update next check time immediately
            frequency_seconds = cache_entry.get('frequency_minutes', 5) * 60
            cache_entry['next_check_time'] = current_time + frequency_seconds
    return due_endpoints
The algorithm ensures endpoints are checked as close to their scheduled time as possible, with a maximum deviation of 30 seconds (the scheduler interval).
Concurrent Worker Pool
HTTP Session Management
The system uses 12 concurrent workers sharing a single HTTP session pool optimized for monitoring workloads:
async def _initialize_http_session(self) -> None:
    connector = aiohttp.TCPConnector(
        limit=self.worker_count * 2,   # 24 total connections
        limit_per_host=10,             # Max 10 connections per host
        ttl_dns_cache=300,             # 5-minute DNS cache
        use_dns_cache=True,
        enable_cleanup_closed=True
    )
    timeout = aiohttp.ClientTimeout(total=self.http_timeout)  # 20 seconds
    self.http_session = aiohttp.ClientSession(
        connector=connector,
        timeout=timeout,
        headers={'User-Agent': 'LookOut-Monitor/1.0'}
    )
Worker Processing Logic
Each worker processes the shared queue independently, performing HTTP checks and database writes:
async def _worker(self, worker_id: int) -> None:
    while self.is_running:
        try:
            # Get next endpoint to check (with timeout for clean shutdown)
            endpoint_id, scheduled_time = await asyncio.wait_for(
                self.check_queue.get(), timeout=1.0
            )
            # Look up the endpoint's configuration in the in-memory cache
            endpoint_config = self.endpoint_cache.get(endpoint_id)
            if endpoint_config is None:
                continue  # Endpoint was deleted while queued
            # Perform the HTTP check
            result = await self._perform_check(endpoint_config)
            # Write result to database immediately
            await self._save_check_result(endpoint_id, result)
        except asyncio.TimeoutError:
            continue  # No work available, check again
        except Exception as e:
            self.logger.error("Worker error", worker_id=worker_id, error=str(e))
Workers operate independently without coordination overhead, using the shared queue for load distribution.
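A sketch of how these pieces could be wired together at startup (the method and attribute names beyond those shown above are assumptions, though worker_tasks reappears in the health endpoint later):

async def start(self) -> None:
    # Sketch of the startup wiring implied by the excerpts above
    self.is_running = True
    self.check_queue = asyncio.Queue()
    await self._initialize_http_session()
    await self._load_endpoints_from_database()
    self.scheduler_task = asyncio.create_task(self._scheduler_loop())
    self.worker_tasks = [
        asyncio.create_task(self._worker(worker_id))
        for worker_id in range(self.worker_count)  # 12 workers
    ]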
Database Write Strategy
Individual ACID Operations
Each monitoring result is persisted through individual ACID operations: the result insert and the endpoint update each commit atomically and independently:
from datetime import datetime, timezone

async def _save_check_result(self, endpoint_id: str, result: Dict[str, Any]) -> None:
    # Client-side timestamp: supabase-py sends literal values, so SQL NOW() isn't usable here
    checked_at = datetime.now(timezone.utc).isoformat()

    # Step 1: Insert check result
    check_data = {
        'endpoint_id': endpoint_id,
        'status_code': result.get('status_code'),
        'response_time_ms': result['response_time_ms'],
        'success': result['success'],
        'error_message': result.get('error'),
        'checked_at': checked_at
    }
    self.supabase.table('check_results').insert(check_data).execute()

    # Step 2: Update endpoint metadata
    consecutive_failures = 0 if result['success'] else (
        self.endpoint_cache.get(endpoint_id, {}).get('consecutive_failures', 0) + 1
    )
    self.supabase.table('endpoints').update({
        'last_check_at': checked_at,
        'consecutive_failures': consecutive_failures
    }).eq('id', endpoint_id).execute()

    # Step 3: Update in-memory cache
    if endpoint_id in self.endpoint_cache:
        self.endpoint_cache[endpoint_id]['consecutive_failures'] = consecutive_failures
This approach prioritizes data integrity over write performance, ensuring each result is permanently recorded even if subsequent operations fail.
Data Management Strategy
Intelligent Cleanup System
LookOut implements a probabilistic cleanup system that runs at the database level. On every check_results insert, a SQL function executes:
-- Simplified version of the cleanup logic
IF random() < [secret_threshold_value] THEN
    DELETE FROM check_results
    WHERE checked_at < NOW() - INTERVAL '7 days'
      AND success = true;
END IF;
Why 7 days? Monitoring dashboards focus on recent trends, and longer historical data provides diminishing value for operational decisions. Why keep failures but delete successes? Failure patterns are critical for incident analysis, while success records primarily contribute to uptime statistics that can be derived from aggregated data.
This probabilistic approach distributes cleanup load over time rather than requiring scheduled maintenance windows.
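The exact threshold is undisclosed, but the expected cadence is easy to estimate. For a hypothetical per-insert probability of 0.01:

# Illustrative arithmetic only -- the real threshold value is secret
p = 0.01                                # hypothetical per-insert probability
checks_per_day = 1000 * (24 * 60 // 5)  # 1000 endpoints at 5-minute frequency
expected_cleanups_per_day = p * checks_per_day
print(expected_cleanups_per_day)        # 2880.0 cleanup passes per day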
Caching Layer Architecture
The system implements multi-layer caching with different TTLs based on data volatility:
# High-frequency data - 5 minute cache
@redis_cache(ttl=300, key_prefix="workspace_stats")
async def get_workspace_stats(self, workspace_id: UUID, user_id: str):
    ...  # Expensive aggregation queries cached here

# User-level data - 10 minute cache
@redis_cache(ttl=600, key_prefix="user_stats")
async def get_user_dashboard_data(self, user_id: str):
    ...  # Dashboard overview data cached here
Redis LRU Eviction: With the 30MB Redis free tier, the system relies on Least Recently Used eviction. Frequently accessed workspace stats stay in cache while stale data is automatically removed.
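The redis_cache decorator itself isn't shown in this article. A minimal sketch of such a decorator, assuming a module-level async Redis client (e.g. redis.asyncio.Redis) and JSON-serializable results:

import json
import functools

def redis_cache(ttl: int, key_prefix: str):
    # Sketch only; the production decorator is not part of the published excerpts
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(self, *args, **kwargs):
            # Build a cache key from the prefix and positional args (kwargs omitted for brevity)
            key = f"{key_prefix}:{':'.join(str(a) for a in args)}"
            cached = await redis_client.get(key)  # assumed module-level client
            if cached is not None:
                return json.loads(cached)
            result = await func(self, *args, **kwargs)
            await redis_client.set(key, json.dumps(result), ex=ttl)
            return result
        return wrapper
    return decorator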
Circuit Breaker Implementation
System Health Monitoring
The health monitor prevents cascade failures through proactive system state management:
class SystemHealthMonitor:
    async def check_system_health(self) -> bool:
        current_time = time.time()
        # Throttle health checks to every 2 minutes
        if current_time - self.last_health_check < self.check_interval:
            return self.is_system_healthy
        self.last_health_check = current_time
        # Test database and internet connectivity
        try:
            await self._test_database_connection()
            await self._test_internet_connectivity()
            if not self.is_system_healthy:
                await self._handle_recovery()
            else:
                self.consecutive_successes += 1
            return self.is_system_healthy
        except Exception as e:
            await self._handle_failure(str(e))
            return False
Failure State Management
The circuit breaker uses configurable thresholds to determine when to trip:
- Failure Threshold: 3 consecutive failures trigger circuit opening
- Success Threshold: 3 consecutive successes enable recovery
- Queue Overwhelm: 1000+ queued items pause scheduling
When the circuit opens, the scheduler skips work cycles, allowing the system to recover without accumulating additional load.
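The _handle_failure and _handle_recovery internals aren't shown above; a sketch consistent with those thresholds (attribute names are assumptions):

async def _handle_failure(self, error: str) -> None:
    # Sketch: trip the breaker after 3 consecutive failures
    self.consecutive_successes = 0
    self.consecutive_failures += 1
    if self.consecutive_failures >= self.failure_threshold:  # 3
        self.is_system_healthy = False  # scheduler now skips work cycles

async def _handle_recovery(self) -> None:
    # Sketch: close the breaker after 3 consecutive successes
    self.consecutive_successes += 1
    if self.consecutive_successes >= self.success_threshold:  # 3
        self.is_system_healthy = True
        self.consecutive_failures = 0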
Performance Optimizations
Memory Efficiency
The in-memory cache scales efficiently with endpoint count. Each cache entry contains:
cache_entry = {
    'id': endpoint_id,                        # UUID string: 36 bytes
    'name': endpoint_name,                    # ~30 bytes average
    'url': endpoint_url,                      # ~100 bytes average
    'method': http_method,                    # 3-4 bytes
    'frequency_minutes': check_frequency,     # 4 bytes
    'next_check_time': calculated_timestamp,  # 8 bytes
    'consecutive_failures': failure_count,    # 4 bytes
    'headers': request_headers,               # 0-200 bytes typical
    'body': request_body,                     # 0-1KB typical
    # Additional endpoint configuration...
}
Storage Breakdown:
- Typical GET: ~463 bytes per endpoint
- GET with headers: ~663 bytes per endpoint
- POST with body: ~1.7KB per endpoint
- Weighted average: ~676 bytes per endpoint
Current memory usage is estimated at runtime:
cache_size_mb = sys.getsizeof(self.endpoint_cache) / (1024 * 1024)  # dict structure only; a lower bound
At the ~676-byte weighted average, the cache supports thousands of endpoints within typical VM memory limits (1000 endpoints = ~676KB).
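Since sys.getsizeof counts only the dictionary itself, a recursive estimator gives a more faithful number. A sketch (not part of the LookOut codebase):

import sys

def deep_sizeof(obj, seen=None) -> int:
    # Recursively sums container plus contents; illustrative helper only
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0  # Don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

cache_size_mb = deep_sizeof(self.endpoint_cache) / (1024 * 1024)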
Connection Pool Optimization
HTTP connection pooling reduces establishment overhead:
- Total Pool: 24 connections (worker_count * 2)
- Per-Host Limit: 10 connections maximum
- DNS Caching: 5-minute TTL reduces lookup latency
- Connection Reuse: Persistent connections for frequently checked endpoints
Integration Points
Event-Driven Updates
API operations trigger immediate cache updates without database polling:
# API endpoint creation
@router.post("/endpoints")
async def create_endpoint(endpoint_data: EndpointCreate):
    # Save to database
    new_endpoint = await endpoint_service.create_endpoint(endpoint_data)
    # Trigger scheduler cache update
    scheduler.on_endpoint_created(new_endpoint.dict())
    return new_endpoint
Notification Integration
The monitoring system integrates with notification triggers:
async def _save_check_result(self, endpoint_id: str, result: Dict[str, Any]):
    # ... database operations ...
    # Trigger notification system for failures
    await notification_trigger.handle_endpoint_check(endpoint_id, result)
Failure Threshold Logic: Notifications trigger when consecutive_failures >= failure_threshold (default: 2) and the user has notifications enabled.
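A sketch of what handle_endpoint_check might do under those rules (field and method names here are assumptions):

async def handle_endpoint_check(self, endpoint_id: str, result: Dict[str, Any]) -> None:
    # Hypothetical sketch; the real trigger implementation isn't shown above
    endpoint = self.endpoint_cache.get(endpoint_id, {})
    threshold = endpoint.get('failure_threshold', 2)
    if (not result['success']
            and endpoint.get('consecutive_failures', 0) >= threshold
            and endpoint.get('notifications_enabled', True)):
        await self._send_failure_notification(endpoint_id, result)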
Real-World Performance Characteristics
Throughput Metrics
- Scheduler Precision: ±30 seconds maximum deviation from scheduled time
- Worker Concurrency: 12 simultaneous HTTP checks
- Database Writes: ~2 writes per check (result insert plus endpoint metadata update)
- Cache Hit Rate: >90% for dashboard queries (5-minute TTL)
Resource Utilization
- Memory: Linear scaling with endpoint count (~676 bytes per endpoint, as calculated above)
- CPU: I/O bound workload, minimal computation
- Network: Outbound HTTP requests + database connections
- Database: Write-heavy workload, ~1MB per 1000 checks
Scaling Characteristics
The current architecture handles growth through:
- Vertical Scaling: More workers + larger connection pools on bigger VMs
- Intelligent Throttling: Circuit breaker prevents resource exhaustion
- Efficient Caching: Reduces database load as user base grows
- Event-Driven Design: Configuration changes don't impact monitoring performance
Operational Reliability
Error Handling Patterns
# Graceful degradation for endpoint deletions
if 'check_results_endpoint_id_fkey' in error_str:
    if endpoint_id in self.endpoint_cache:
        del self.endpoint_cache[endpoint_id]
        print(f"🗑️ Removed deleted endpoint {endpoint_id} from cache")
    return
The system handles common failure modes:
- Deleted Endpoints: Automatic cache cleanup on foreign key violations
- Network Timeouts: Per-request 20-second timeouts prevent worker blocking
- Database Failures: Circuit breaker pauses operations during outages
- Memory Pressure: Fixed-size queues prevent unbounded growth (see the sketch below)
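The last point can be made concrete with a bounded queue; a sketch reusing the 1000-item overwhelm threshold mentioned earlier:

# Cap the queue so a stalled worker pool can't grow memory without bound
self.check_queue = asyncio.Queue(maxsize=1000)

# In the scheduler loop, a full queue is treated as backpressure (sketch)
try:
    self.check_queue.put_nowait((endpoint_id, scheduled_time))
except asyncio.QueueFull:
    self.logger.warning("Check queue full, deferring endpoint", endpoint_id=endpoint_id)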
Monitoring Integration
Health status is exposed through API endpoints:
@router.get("/health")
async def get_scheduler_health():
return {
"enabled": True,
"endpoints_monitored": len(scheduler.endpoint_cache),
"queue_size": scheduler.check_queue.qsize(),
"is_healthy": scheduler.health_monitor.is_system_healthy,
"worker_count": len(scheduler.worker_tasks)
}
Architecture Benefits
Technical Advantages
- Predictable Performance: Event-driven cache eliminates database polling overhead
- Resource Efficiency: Shared HTTP sessions and connection pooling minimize resource usage
- Fault Tolerance: Circuit breaker pattern prevents cascade failures
- Data Consistency: Individual ACID transactions ensure reliable result storage
- Operational Simplicity: Single-process architecture reduces deployment complexity
Business Impact
- Cost Control: Free tier compatibility through intelligent resource management
- User Experience: Sub-100ms API responses through aggressive caching
- Reliability: Circuit breaker maintains service during infrastructure issues
- Scalability: Event-driven design supports growth without architectural changes
The implementation demonstrates how sophisticated monitoring infrastructure can be built within strict resource constraints while maintaining enterprise-grade reliability patterns.