Performance Improvement Recommendations for Echoes of the Ashes

Current Performance Baseline

  • Throughput: 212 req/s (with 8 workers)
  • Success Rate: 94% (6% failures under load)
  • Bottleneck: Database connection pool and complex queries

Quick Wins (Immediate Implementation)

1. Increase Database Connection Pool HIGH IMPACT

Current: Default pool size (SQLAlchemy's default of 5 connections plus 10 overflow per worker)

Problem: 8 workers competing for a limited number of connections

# In api/database.py, update engine creation:
engine = create_async_engine(
    DATABASE_URL, 
    echo=False,
    pool_size=20,          # Increased from default 5
    max_overflow=30,       # Allow bursts up to 50 total connections
    pool_timeout=30,       # Wait up to 30s for connection
    pool_recycle=3600,     # Recycle connections every hour
    pool_pre_ping=True     # Verify connections before use
)

Expected Impact: +30-50% throughput, reduce failures to <2%

Sizing note: with 8 workers this allows up to 8 × (20 + 30) = 400 server-side connections, while PostgreSQL's default max_connections is 100. Raise max_connections to match, or front the pool with PgBouncer (item 8).

2. Add Database Indexes 🚀 HIGH IMPACT

Current: Missing indexes on frequently queried columns

-- Run these in PostgreSQL:

-- Player lookups (auth)
CREATE INDEX IF NOT EXISTS idx_players_username ON players(username);
CREATE INDEX IF NOT EXISTS idx_players_telegram_id ON players(telegram_id);

-- Location queries (most expensive operation)
CREATE INDEX IF NOT EXISTS idx_players_location_id ON players(location_id);
CREATE INDEX IF NOT EXISTS idx_dropped_items_location ON dropped_items(location_id);
CREATE INDEX IF NOT EXISTS idx_wandering_enemies_location ON wandering_enemies(location_id);

-- Combat queries
CREATE INDEX IF NOT EXISTS idx_active_combats_player_id ON active_combats(player_id);

-- Inventory queries
CREATE INDEX IF NOT EXISTS idx_inventory_player_id ON inventory(player_id);
CREATE INDEX IF NOT EXISTS idx_inventory_item_id ON inventory(item_id);

-- Compound index for item pickups
CREATE INDEX IF NOT EXISTS idx_inventory_player_item ON inventory(player_id, item_id);

Expected Impact: 50-70% faster location queries (730ms → 200-300ms)
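
If the team prefers to ship schema changes with the app instead of running psql by hand, the same statements can be applied at startup. A minimal sketch, assuming the async engine from item 1 is importable from api/database.py (the ensure_indexes helper is hypothetical):

import asyncio

from sqlalchemy import text

from api.database import engine  # assumed: the async engine configured in item 1

# Same statements as the SQL above; IF NOT EXISTS makes re-runs harmless
INDEX_DDL = [
    "CREATE INDEX IF NOT EXISTS idx_players_username ON players(username)",
    "CREATE INDEX IF NOT EXISTS idx_players_location_id ON players(location_id)",
    # ... the remaining statements from the SQL block above
]

async def ensure_indexes() -> None:
    """Apply any missing indexes in a single transaction."""
    async with engine.begin() as conn:
        for ddl in INDEX_DDL:
            await conn.execute(text(ddl))

if __name__ == "__main__":
    asyncio.run(ensure_indexes())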

3. Implement Redis Caching Layer 💾 MEDIUM IMPACT

Cache frequently accessed, rarely changing data:

# Install: pip install redis
# (the standalone aioredis package is deprecated; its API now lives in redis-py)
import json

import redis.asyncio as aioredis

redis = aioredis.from_url("redis://localhost")

# Cache item definitions (never change)
async def get_item_cached(item_id: str):
    cached = await redis.get(f"item:{item_id}")
    if cached:
        return json.loads(cached)
    
    item = ITEMS_MANAGER.get_item(item_id)
    await redis.setex(f"item:{item_id}", 3600, json.dumps(item))
    return item

# Cache location data (5 second TTL for NPCs/items)
async def get_location_cached(location_id: str):
    cached = await redis.get(f"location:{location_id}")
    if cached:
        return json.loads(cached)
    
    location = await get_location_from_db(location_id)
    await redis.setex(f"location:{location_id}", 5, json.dumps(location))
    return location
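
The 5-second TTL bounds staleness, but writes can also invalidate eagerly so a pickup or kill is visible on the very next read. A minimal sketch reusing the redis client above (invalidate_location is a hypothetical helper):

async def invalidate_location(location_id: str) -> None:
    """Drop the cached payload after items/NPCs change at a location;
    the next read repopulates it from the database."""
    await redis.delete(f"location:{location_id}")

# e.g. at the end of a pickup handler:
# await invalidate_location(player.location_id)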

Expected Impact: +40-60% throughput for read-heavy operations

4. Optimize Location Query 📊 HIGH IMPACT

Current Issue: The location endpoint makes 5+ separate DB queries.

Solution: Batch the lookups in one session. (A single JOIN across dropped_items, wandering_enemies, and players is not an option here: with no join key between those tables it would produce a cross product.)

async def get_location_data(location_id: str, player_id: int):
    """Fetch everything visible at a location over one session/connection."""
    async with DatabaseSession() as session:
        # Three indexed lookups on a single connection instead of five
        # separate round trips; each uses a location_id index from item 2
        items = (await session.execute(
            select(dropped_items).where(dropped_items.c.location_id == location_id)
        )).all()
        enemies = (await session.execute(
            select(wandering_enemies).where(wandering_enemies.c.location_id == location_id)
        )).all()
        others = (await session.execute(
            select(players).where(players.c.location_id == location_id)
        )).all()
        return items, enemies, others

Expected Impact: 60-70% faster location queries

Medium-Term Improvements

5. Database Read Replicas 🔄

Set up PostgreSQL read replicas for location queries (read-heavy):

# docker-compose.yml — note: the official postgres image has no replication
# env vars; bitnami/postgresql configures streaming replication this way
echoes_db_replica:
  image: bitnami/postgresql:15
  environment:
    POSTGRESQL_REPLICATION_MODE: slave
    POSTGRESQL_MASTER_HOST: echoes_of_the_ashes_db
    POSTGRESQL_REPLICATION_USER: repl_user          # placeholder credentials
    POSTGRESQL_REPLICATION_PASSWORD: repl_password

Route read-only queries to replicas, writes to primary.

Expected Impact: 2x throughput for read operations
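
At the application level, routing can be as simple as two engines and a chooser. A minimal sketch, assuming a READ_REPLICA_URL setting alongside the existing DATABASE_URL (the setting name and the session_factory helper are illustrative):

import os

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

DATABASE_URL = os.environ["DATABASE_URL"]
READ_REPLICA_URL = os.environ["READ_REPLICA_URL"]  # hypothetical new setting

write_engine = create_async_engine(DATABASE_URL)      # primary: all writes
read_engine = create_async_engine(READ_REPLICA_URL)   # replica: read-only traffic

WriteSession = async_sessionmaker(write_engine, class_=AsyncSession)
ReadSession = async_sessionmaker(read_engine, class_=AsyncSession)

def session_factory(readonly: bool = False):
    """Location/inventory reads go to the replica; everything else to the primary."""
    return ReadSession if readonly else WriteSession

Because streaming replication is asynchronous, reads that must see a player's own just-committed write (e.g. inventory right after a pickup) should stay on the primary.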

6. Batch Processing for Item Operations

Instead of individual item pickup/drop operations:

# Current: N queries for N items
for item in items:
    await db.add_to_inventory(player_id, item)

# Optimized: 1 query for N items
await db.batch_add_to_inventory(player_id, items)
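
batch_add_to_inventory does not exist yet; a minimal sketch using SQLAlchemy's executemany-style insert, assuming the inventory table object matches the schema indexed in item 2:

from sqlalchemy import insert

async def batch_add_to_inventory(player_id: int, item_ids: list[str]) -> None:
    """One INSERT with N parameter sets instead of N round trips."""
    if not item_ids:
        return
    async with DatabaseSession() as session:
        await session.execute(
            insert(inventory),
            [{"player_id": player_id, "item_id": item_id} for item_id in item_ids],
        )
        await session.commit()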

7. Optimize Status Effects Query

Player status effects may currently be fetched one row at a time (the classic N+1 pattern); eager-load them instead:

# Eager loading; assumes an ORM Player model with a status_effects relationship
from sqlalchemy.orm import selectinload

stmt = select(Player).options(
    selectinload(Player.status_effects)
).where(Player.id == player_id)
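
Executed this way, the related rows arrive in one extra batched query instead of one query per access (a sketch continuing the snippet above; apply_effect stands in for real game logic):

result = await session.execute(stmt)
player = result.scalar_one()
for effect in player.status_effects:  # already loaded, triggers no further SQL
    apply_effect(effect)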

8. Connection Pooling at Application Level

Use PgBouncer in transaction mode, e.g. via the edoburu/pgbouncer image, which generates pgbouncer.ini from environment variables:

pgbouncer:
  image: edoburu/pgbouncer
  environment:
    DATABASE_URL: postgres://your_user:your_password@echoes_of_the_ashes_db:5432/echoes
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 1000
    DEFAULT_POOL_SIZE: 25

Expected Impact: Better connection management, +20-30% throughput
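
One caveat: asyncpg caches prepared statements per connection, which breaks under transaction pooling because consecutive queries can land on different server connections. A sketch of the matching engine change, assuming a hypothetical PGBOUNCER_URL setting that points the app at the pgbouncer service:

import os

from sqlalchemy.ext.asyncio import create_async_engine

PGBOUNCER_URL = os.environ["PGBOUNCER_URL"]  # hypothetical: same DSN, host/port set to pgbouncer

engine = create_async_engine(
    PGBOUNCER_URL,
    # asyncpg option: disable the statement cache so queries keep working
    # when each transaction may run on a different server connection
    connect_args={"statement_cache_size": 0},
)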

Long-Term / Infrastructure Improvements

9. Horizontal Scaling

  • Load balancer in front of multiple API containers
  • Shared Redis session store
  • Database connection pooler (PgBouncer)

10. Database Query Optimization

Monitor slow queries:

-- Enable slow query logging
ALTER DATABASE echoes SET log_min_duration_statement = 100;

-- Find slow queries (requires the pg_stat_statements extension,
-- which must also be listed in shared_preload_libraries)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

11. Asynchronous Task Queue

Offload heavy operations to background workers:

Use Celery or RQ for:

  • Combat damage calculations
  • Loot generation
  • Statistics updates
  • Email notifications
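
A minimal sketch with RQ (pip install rq), reusing the Redis instance from item 3; generate_loot stands in for any of the operations above:

import time

from redis import Redis
from rq import Queue

queue = Queue("background", connection=Redis(host="localhost"))

def generate_loot(enemy_id: str) -> dict:
    """Stand-in for a heavy calculation; executed by an `rq worker background` process."""
    time.sleep(1)  # placeholder for the expensive work
    return {"enemy_id": enemy_id, "items": []}

# In the request handler: enqueue and return immediately
# instead of blocking the event loop on the computation
job = queue.enqueue(generate_loot, "enemy-123")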

12. CDN for Static Assets

Move images to a CDN (Cloudflare, AWS CloudFront).

Implementation Priority

Phase 1 (Today, ~1 hour of work)

  1. Add database indexes (30 min)
  2. Increase connection pool (5 min)
  3. ⚠️ Test and verify improvements

Expected Result: 300-400 req/s, <2% failures

Phase 2 (This Week)

  1. Implement Redis caching for items/NPCs
  2. Optimize location query to single JOIN
  3. Add PgBouncer connection pooler

Expected Result: 500-700 req/s

Phase 3 (Next Sprint)

  1. Add database read replicas
  2. Implement batch operations
  3. Set up monitoring (Prometheus/Grafana)

Expected Result: 1000+ req/s

Monitoring Recommendations

Add performance monitoring:

# Add to api/main.py
import time

from prometheus_client import Counter, Histogram, make_asgi_app

request_duration = Histogram('http_request_duration_seconds', 'HTTP request latency')
request_count = Counter('http_requests_total', 'Total HTTP requests')

# Expose the collected metrics for Prometheus to scrape
app.mount("/metrics", make_asgi_app())

@app.middleware("http")
async def monitor_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    request_duration.observe(time.time() - start)
    request_count.inc()
    return response

Quick Performance Test Commands

# Test current performance
cd /opt/dockers/echoes_of_the_ashes
timeout 300 .venv/bin/python load_test.py

# Monitor database connections
docker exec echoes_of_the_ashes_db psql -U your_user -d echoes -c \
  "SELECT count(*) as connections FROM pg_stat_activity;"

# Check slow queries
docker exec echoes_of_the_ashes_db psql -U your_user -d echoes -c \
  "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"

# Monitor API CPU/Memory
docker stats echoes_of_the_ashes_api

Cost vs Benefit Analysis

Improvement               Time to Implement   Performance Gain   Complexity
Database Indexes          30 minutes          +50-70%            Low
Connection Pool           5 minutes           +30-50%            Low
Optimize Location Query   2 hours             +60-70%            Medium
Redis Caching             4 hours             +40-60%            Medium
PgBouncer                 1 hour              +20-30%            Low
Read Replicas             1 day               +100% (reads)      High
Batch Operations          4 hours             +30-40%            Medium

Conclusion

Most Impact for Least Effort:

  1. Add database indexes (30 min) → +50-70% faster queries
  2. Increase connection pool (5 min) → +30-50% throughput
  3. Add PgBouncer (1 hour) → +20-30% throughput

Combined, these could reach 400-500 req/s with just a few hours of work.

The current bottleneck is the database, not the API workers. Focus optimization there first.