Joan · 2025-11-07 15:27:13 +01:00
# Performance Improvement Recommendations for Echoes of the Ashes
## Current Performance Baseline
- **Throughput**: 212 req/s (with 8 workers)
- **Success Rate**: 94% (6% failures under load)
- **Bottleneck**: Database connection pool and complex queries
## Quick Wins (Immediate Implementation)
### 1. Increase Database Connection Pool ⚡ **HIGH IMPACT**
**Current**: SQLAlchemy's default pool (5 connections plus 10 overflow per worker)
**Problem**: 8 workers competing for limited connections
```python
# In api/database.py, update engine creation:
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    DATABASE_URL,
    echo=False,
    pool_size=20,        # Increased from the default of 5
    max_overflow=30,     # Allow bursts up to 50 total connections per worker
    pool_timeout=30,     # Wait up to 30s for a connection
    pool_recycle=3600,   # Recycle connections every hour
    pool_pre_ping=True,  # Verify connections before use
)
```
**Expected Impact**: +30-50% throughput, reduce failures to <2%
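One caveat worth checking before shipping this: with 8 workers each allowed `pool_size + max_overflow` connections, peak demand can exceed Postgres's default `max_connections` of 100. A quick sanity check (worker count taken from the baseline above):

```python
# Peak connection demand across all workers vs. the Postgres default limit
workers = 8                     # from the baseline above
pool_size, max_overflow = 20, 30
postgres_max_connections = 100  # Postgres default

peak = workers * (pool_size + max_overflow)
print(peak)  # 400: well above the default limit, so raise max_connections
             # in postgresql.conf (or front the DB with PgBouncer) first
```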
### 2. Add Database Indexes 🚀 **HIGH IMPACT**
**Current**: Missing indexes on frequently queried columns
```sql
-- Run these in PostgreSQL:
-- Player lookups (auth)
CREATE INDEX IF NOT EXISTS idx_players_username ON players(username);
CREATE INDEX IF NOT EXISTS idx_players_telegram_id ON players(telegram_id);
-- Location queries (most expensive operation)
CREATE INDEX IF NOT EXISTS idx_players_location_id ON players(location_id);
CREATE INDEX IF NOT EXISTS idx_dropped_items_location ON dropped_items(location_id);
CREATE INDEX IF NOT EXISTS idx_wandering_enemies_location ON wandering_enemies(location_id);
-- Combat queries
CREATE INDEX IF NOT EXISTS idx_active_combats_player_id ON active_combats(player_id);
-- Inventory queries
CREATE INDEX IF NOT EXISTS idx_inventory_player_id ON inventory(player_id);
CREATE INDEX IF NOT EXISTS idx_inventory_item_id ON inventory(item_id);
-- Compound index for item pickups
CREATE INDEX IF NOT EXISTS idx_inventory_player_item ON inventory(player_id, item_id);
```
**Expected Impact**: 50-70% faster location queries (730ms → 200-300ms)
### 3. Implement Redis Caching Layer 💾 **MEDIUM IMPACT**
Cache frequently accessed, rarely changing data:
```python
# Install: pip install redis
import json
import redis.asyncio as aioredis  # aioredis was merged into redis-py

redis = aioredis.from_url("redis://localhost")

# Cache item definitions (never change)
async def get_item_cached(item_id: str):
    cached = await redis.get(f"item:{item_id}")
    if cached:
        return json.loads(cached)
    item = ITEMS_MANAGER.get_item(item_id)
    await redis.setex(f"item:{item_id}", 3600, json.dumps(item))
    return item

# Cache location data (5 second TTL for NPCs/items)
async def get_location_cached(location_id: str):
    cached = await redis.get(f"location:{location_id}")
    if cached:
        return json.loads(cached)
    location = await get_location_from_db(location_id)
    await redis.setex(f"location:{location_id}", 5, json.dumps(location))
    return location
```
**Expected Impact**: +40-60% throughput for read-heavy operations
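The two functions above share one read-through pattern. A sketch of that pattern factored into a helper, shown against a small in-process stand-in with redis-like `get`/`setex` semantics so the logic is easy to test before wiring in the real client (`TTLCache` and `cached_fetch` are illustrative names, not part of the codebase):

```python
import json
import time

class TTLCache:
    """In-process stand-in exposing redis-like get/setex semantics."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires = self._store.get(key, (None, 0.0))
        return value if expires > time.monotonic() else None

    def setex(self, key, ttl, value):
        self._store[key] = (value, time.monotonic() + ttl)

def cached_fetch(cache, key, ttl, loader):
    """Read-through: return cached JSON if still fresh, else load and cache."""
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = loader()
    cache.setex(key, ttl, json.dumps(value))
    return value
```

The async Redis versions above follow the same shape: check, decode on hit, load and `setex` on miss.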
### 4. Optimize Location Query 📊 **HIGH IMPACT**
**Current Issue**: Location endpoint makes 5+ separate DB queries
**Solution**: Use a single JOIN query or batch operations
```python
from sqlalchemy import select

async def get_location_data(location_id: str, player_id: int):
    """Optimized location data fetch: batched reads in one session"""
    async with DatabaseSession() as session:
        # Three batched reads in one session instead of 5+ scattered queries.
        # (A single SELECT across all three tables joined with or_() would
        # produce a cross product, so each table gets its own filtered select.)
        items = await session.execute(
            select(dropped_items).where(dropped_items.c.location_id == location_id)
        )
        enemies = await session.execute(
            select(wandering_enemies).where(wandering_enemies.c.location_id == location_id)
        )
        others = await session.execute(
            select(players).where(players.c.location_id == location_id)
        )
        # Process all data in one go
```
**Expected Impact**: 60-70% faster location queries
## Medium-Term Improvements
### 5. Database Read Replicas 🔄
Set up PostgreSQL read replicas for location queries (read-heavy):
```yaml
# docker-compose.yml
# Note: the replication env vars below come from the bitnami/postgresql
# image; the stock postgres image has no built-in replication bootstrap.
echoes_db_replica:
  image: bitnami/postgresql:15
  environment:
    POSTGRESQL_REPLICATION_MODE: slave
    POSTGRESQL_MASTER_HOST: echoes_of_the_ashes_db
    # Matching POSTGRESQL_REPLICATION_USER/_PASSWORD vars are also required
```
Route read-only queries to replicas, writes to primary.
**Expected Impact**: 2x throughput for read operations
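Routing can be as simple as picking a session factory per request. A minimal sketch (engine setup elided; in the real app `primary_factory`/`replica_factory` would be `async_sessionmaker` instances bound to two separate `create_async_engine(...)` calls):

```python
class SessionRouter:
    """Pick the primary for writes, a replica for lag-tolerant reads."""
    def __init__(self, primary_factory, replica_factory):
        self.primary = primary_factory
        self.replica = replica_factory

    def session_for(self, readonly: bool):
        # Reads (location lookups, inventory views) can tolerate replica
        # lag; anything that writes must go to the primary.
        return self.replica() if readonly else self.primary()
```

Endpoints then ask the router instead of a global session factory, e.g. `router.session_for(readonly=True)` for the location endpoint.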
### 6. Batch Processing for Item Operations
Instead of individual item pickup/drop operations:
```python
# Current: N queries for N items
for item in items:
    await db.add_to_inventory(player_id, item)

# Optimized: 1 query for N items
await db.batch_add_to_inventory(player_id, items)
```
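A sketch of what `batch_add_to_inventory` could look like using SQLAlchemy's multi-row insert (the `inventory` table mirrors the columns indexed earlier; the table definition and helper names here are illustrative stand-ins):

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, insert

metadata = MetaData()
# Stand-in for the game's real inventory table
inventory = Table(
    "inventory", metadata,
    Column("player_id", Integer),
    Column("item_id", String),
)

def inventory_rows(player_id: int, item_ids):
    """Parameter list for one executemany-style multi-row INSERT."""
    return [{"player_id": player_id, "item_id": i} for i in item_ids]

async def batch_add_to_inventory(session, player_id: int, item_ids):
    if not item_ids:
        return
    # One round trip for N items instead of N round trips
    await session.execute(insert(inventory), inventory_rows(player_id, item_ids))
```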
### 7. Optimize Status Effects Query
Current player status effects might be queried inefficiently:
```python
from sqlalchemy.orm import selectinload

# Eager-load status effects with the player in one extra batched query
# instead of a lazy load per row. Note: selectinload requires an
# ORM-mapped relationship (an assumed Player model, not the Core table).
stmt = (
    select(Player)
    .options(selectinload(Player.status_effects))
    .where(Player.id == player_id)
)
```
### 8. Connection Pooling at Application Level
Use PgBouncer in transaction mode:
```yaml
pgbouncer:
  image: pgbouncer/pgbouncer
  environment:
    DATABASES: echoes_db=host=echoes_of_the_ashes_db port=5432 dbname=echoes
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 1000
    DEFAULT_POOL_SIZE: 25
```
**Expected Impact**: Better connection management, +20-30% throughput
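One known gotcha with transaction mode: asyncpg's server-side prepared statements break behind it, because consecutive queries may land on different backend connections. The fix, assuming the asyncpg-backed async engine used elsewhere in this doc, is to disable the prepared-statement cache:

```python
from sqlalchemy.ext.asyncio import create_async_engine

def make_pooled_engine(url: str):
    # statement_cache_size=0 turns off asyncpg's prepared-statement
    # cache, which is required behind transaction-mode pooling
    return create_async_engine(url, connect_args={"statement_cache_size": 0})
```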
## Long-Term / Infrastructure Improvements
### 9. Horizontal Scaling
- Load balancer in front of multiple API containers
- Shared Redis session store
- Database connection pooler (PgBouncer)
### 10. Database Query Optimization
Monitor slow queries:
```sql
-- Enable slow query logging
ALTER DATABASE echoes SET log_min_duration_statement = 100;
-- Find slow queries (requires the pg_stat_statements extension: add it to
-- shared_preload_libraries, then run CREATE EXTENSION pg_stat_statements;)
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```
### 11. Asynchronous Task Queue
Offload heavy operations to background workers:
Use Celery or RQ for:
- Combat damage calculations
- Loot generation
- Statistics updates
- Email notifications
### 12. CDN for Static Assets
Move images to a CDN (Cloudflare, AWS CloudFront)
## Implementation Priority
### Phase 1 (Today - 1 hour work)
1. **Add database indexes** (30 min)
2. **Increase connection pool** (5 min)
3. ⚠️ Test and verify improvements
**Expected Result**: 300-400 req/s, <2% failures
### Phase 2 (This Week)
1. Implement Redis caching for items/NPCs
2. Optimize location query to single JOIN
3. Add PgBouncer connection pooler
**Expected Result**: 500-700 req/s
### Phase 3 (Next Sprint)
1. Add database read replicas
2. Implement batch operations
3. Set up monitoring (Prometheus/Grafana)
**Expected Result**: 1000+ req/s
## Monitoring Recommendations
Add performance monitoring:
```python
# Add to api/main.py
import time

from prometheus_client import Counter, Histogram

request_duration = Histogram('http_request_duration_seconds', 'HTTP request latency')
request_count = Counter('http_requests_total', 'Total HTTP requests')

@app.middleware("http")
async def monitor_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    request_duration.observe(duration)
    request_count.inc()
    return response
```
## Quick Performance Test Commands
```bash
# Test current performance
cd /opt/dockers/echoes_of_the_ashes
timeout 300 .venv/bin/python load_test.py
# Monitor database connections
docker exec echoes_of_the_ashes_db psql -U your_user -d echoes -c \
"SELECT count(*) as connections FROM pg_stat_activity;"
# Check slow queries
docker exec echoes_of_the_ashes_db psql -U your_user -d echoes -c \
"SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
# Monitor API CPU/Memory
docker stats echoes_of_the_ashes_api
```
## Cost vs Benefit Analysis
| Improvement | Time to Implement | Performance Gain | Complexity |
|-------------|-------------------|------------------|------------|
| Database Indexes | 30 minutes | +50-70% | Low |
| Connection Pool | 5 minutes | +30-50% | Low |
| Optimize Location Query | 2 hours | +60-70% | Medium |
| Redis Caching | 4 hours | +40-60% | Medium |
| PgBouncer | 1 hour | +20-30% | Low |
| Read Replicas | 1 day | +100% reads | High |
| Batch Operations | 4 hours | +30-40% | Medium |
## Conclusion
**Most Impact for Least Effort**:
1. Add database indexes (30 min) → +50-70% faster queries
2. Increase connection pool (5 min) → +30-50% throughput
3. Add PgBouncer (1 hour) → +20-30% throughput
Combined: **Could reach 400-500 req/s with just a few hours of work**
The current bottleneck is the **database**, not the API workers anymore, so focus optimization there first.