# Performance Improvement Recommendations for Echoes of the Ashes

## Current Performance Baseline

- **Throughput**: 212 req/s (with 8 workers)
- **Success Rate**: 94% (6% failures under load)
- **Bottleneck**: Database connection pool and complex queries

## Quick Wins (Immediate Implementation)

### 1. Increase Database Connection Pool ⚡ **HIGH IMPACT**

**Current**: Default pool size (SQLAlchemy's defaults are 5 connections plus 10 overflow per worker)

**Problem**: 8 workers competing for a limited number of connections

```python
# In api/database.py, update engine creation:
engine = create_async_engine(
    DATABASE_URL,
    echo=False,
    pool_size=20,        # Increased from the default 5
    max_overflow=30,     # Allow bursts up to 50 total connections
    pool_timeout=30,     # Wait up to 30s for a connection
    pool_recycle=3600,   # Recycle connections every hour
    pool_pre_ping=True   # Verify connections before use
)
```

Note that `pool_size` and `max_overflow` apply per worker process: with 8 workers these settings allow up to 8 × (20 + 30) = 400 connections, so raise PostgreSQL's `max_connections` (default 100) to match.

**Expected Impact**: +30-50% throughput, failures reduced to <2%

### 2. Add Database Indexes 🚀 **HIGH IMPACT**

**Current**: Missing indexes on frequently queried columns

```sql
-- Run these in PostgreSQL:

-- Player lookups (auth)
CREATE INDEX IF NOT EXISTS idx_players_username ON players(username);
CREATE INDEX IF NOT EXISTS idx_players_telegram_id ON players(telegram_id);

-- Location queries (most expensive operation)
CREATE INDEX IF NOT EXISTS idx_players_location_id ON players(location_id);
CREATE INDEX IF NOT EXISTS idx_dropped_items_location ON dropped_items(location_id);
CREATE INDEX IF NOT EXISTS idx_wandering_enemies_location ON wandering_enemies(location_id);

-- Combat queries
CREATE INDEX IF NOT EXISTS idx_active_combats_player_id ON active_combats(player_id);

-- Inventory queries
CREATE INDEX IF NOT EXISTS idx_inventory_player_id ON inventory(player_id);
CREATE INDEX IF NOT EXISTS idx_inventory_item_id ON inventory(item_id);

-- Compound index for item pickups
CREATE INDEX IF NOT EXISTS idx_inventory_player_item ON inventory(player_id, item_id);
```
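
If the project manages its schema with Alembic (an assumption; the SQL above can just as well be run by hand), the same indexes can be created in a migration. A minimal sketch, with hypothetical revision identifiers:

```python
"""add indexes for hot query paths"""
from alembic import op

revision = "add_hot_path_indexes"   # hypothetical revision id
down_revision = None                # set to the current head revision

def upgrade():
    op.create_index("idx_players_username", "players", ["username"])
    op.create_index("idx_players_telegram_id", "players", ["telegram_id"])
    op.create_index("idx_players_location_id", "players", ["location_id"])
    op.create_index("idx_inventory_player_item", "inventory", ["player_id", "item_id"])
    # ...remaining indexes follow the same pattern

def downgrade():
    op.drop_index("idx_players_username", table_name="players")
    op.drop_index("idx_players_telegram_id", table_name="players")
    op.drop_index("idx_players_location_id", table_name="players")
    op.drop_index("idx_inventory_player_item", table_name="inventory")
```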

**Expected Impact**: 50-70% faster location queries (730ms → 200-300ms)

### 3. Implement Redis Caching Layer 💾 **MEDIUM IMPACT**

Cache frequently accessed, rarely changing data:

```python
# Install: pip install aioredis
# NOTE: create_redis_pool is the aioredis 1.x API; newer projects typically
# use redis.asyncio from redis-py instead.
import aioredis
import json

redis = await aioredis.create_redis_pool('redis://localhost')

# Cache item definitions (they never change)
async def get_item_cached(item_id: str):
    cached = await redis.get(f"item:{item_id}")
    if cached:
        return json.loads(cached)

    item = ITEMS_MANAGER.get_item(item_id)
    await redis.setex(f"item:{item_id}", 3600, json.dumps(item))
    return item

# Cache location data (5 second TTL for NPCs/items)
async def get_location_cached(location_id: str):
    cached = await redis.get(f"location:{location_id}")
    if cached:
        return json.loads(cached)

    location = await get_location_from_db(location_id)
    await redis.setex(f"location:{location_id}", 5, json.dumps(location))
    return location
```
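
One caveat with the 5-second location cache: writes that change a location (item pickups, enemy deaths) can serve stale data for up to the TTL. A minimal invalidation sketch, assuming the key scheme above and a hypothetical `pick_up_item_in_db` write helper:

```python
async def pick_up_item(player_id: int, location_id: str, item_id: str):
    # pick_up_item_in_db is a hypothetical helper that performs the DB write
    await pick_up_item_in_db(player_id, location_id, item_id)
    # Drop the cached location so the next read rebuilds it from the database
    await redis.delete(f"location:{location_id}")
```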

**Expected Impact**: +40-60% throughput for read-heavy operations

### 4. Optimize Location Query 📊 **HIGH IMPACT**

**Current Issue**: Location endpoint makes 5+ separate DB queries

**Solution**: Use a single combined query or batch the operations

```python
from sqlalchemy import select, or_

async def get_location_data(location_id: str, player_id: int):
    """Optimized single-query location data fetch"""
    async with DatabaseSession() as session:
        # One round trip instead of 5 separate queries
        stmt = select(
            dropped_items,
            wandering_enemies,
            players
        ).where(
            or_(
                dropped_items.c.location_id == location_id,
                wandering_enemies.c.location_id == location_id,
                players.c.location_id == location_id
            )
        )

        result = await session.execute(stmt)
        # Process all rows in one go
        return result.all()
```

**Expected Impact**: 60-70% faster location queries

## Medium-Term Improvements

### 5. Database Read Replicas 🔄

Set up PostgreSQL read replicas for location queries (the workload is read-heavy):

```yaml
# docker-compose.yml
# NOTE: the stock postgres image does not read these variables; streaming
# replication has to be configured manually or via an image that supports
# it out of the box (e.g. bitnami/postgresql).
echoes_db_replica:
  image: postgres:15
  environment:
    POSTGRES_REPLICATION_MODE: slave
    POSTGRES_MASTER_HOST: echoes_of_the_ashes_db
```

Route read-only queries to the replica and writes to the primary, as in the sketch below.
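
A minimal routing sketch, assuming SQLAlchemy 2.0's `async_sessionmaker` and that the replica is reachable as `echoes_db_replica` (both connection URLs are placeholders):

```python
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

# Placeholder URLs -- substitute the real credentials and hostnames.
primary_engine = create_async_engine("postgresql+asyncpg://user:pass@echoes_of_the_ashes_db/echoes")
replica_engine = create_async_engine("postgresql+asyncpg://user:pass@echoes_db_replica/echoes")

PrimarySession = async_sessionmaker(primary_engine, expire_on_commit=False)
ReplicaSession = async_sessionmaker(replica_engine, expire_on_commit=False)

def session_for(readonly: bool):
    """Pick the replica for read-only work and the primary for writes."""
    return ReplicaSession() if readonly else PrimarySession()

# Usage:
# async with session_for(readonly=True) as session:
#     result = await session.execute(stmt)
```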

**Expected Impact**: 2x throughput for read operations

### 6. Batch Processing for Item Operations

Instead of issuing one query per item for pickup/drop operations, batch them (see the sketch after this block):

```python
# Current: N queries for N items
for item in items:
    await db.add_to_inventory(player_id, item)

# Optimized: 1 query for N items
await db.batch_add_to_inventory(player_id, items)
```
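
`batch_add_to_inventory` does not exist yet; one way to implement it is a single multi-row INSERT. A sketch assuming the Core `inventory` table used elsewhere in this document (the `quantity` column is an assumption about the schema):

```python
from sqlalchemy import insert

async def batch_add_to_inventory(session, player_id: int, items: list[dict]):
    """Insert N inventory rows with one statement instead of N round trips."""
    if not items:
        return
    stmt = insert(inventory).values([
        {
            "player_id": player_id,
            "item_id": item["item_id"],
            "quantity": item.get("quantity", 1),  # assumed column
        }
        for item in items
    ])
    await session.execute(stmt)
    await session.commit()
```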

### 7. Optimize Status Effects Query

Player status effects may currently be fetched with separate queries; eager loading pulls them in alongside the player rows:

```python
# Use eager loading (assumes an ORM-mapped players model with a
# status_effects relationship)
from sqlalchemy import select
from sqlalchemy.orm import selectinload

stmt = (
    select(players)
    .options(selectinload(players.status_effects))
    .where(players.id == player_id)
)
```

### 8. Connection Pooling at Application Level

Use PgBouncer in transaction mode:

```yaml
pgbouncer:
  image: pgbouncer/pgbouncer
  environment:
    DATABASES: echoes_db=host=echoes_of_the_ashes_db port=5432 dbname=echoes
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 1000
    DEFAULT_POOL_SIZE: 25
```
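
When the API connects through PgBouncer in transaction mode, SQLAlchemy's own pool becomes redundant and asyncpg's prepared-statement caching can cause errors. A hedged sketch of the adjusted engine (the URL is a placeholder, and the cache parameter should be verified against the SQLAlchemy asyncpg dialect docs):

```python
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import NullPool

engine = create_async_engine(
    # Placeholder URL pointing at PgBouncer instead of Postgres directly;
    # disabling the prepared-statement cache avoids clashes with transaction pooling.
    "postgresql+asyncpg://user:pass@pgbouncer:6432/echoes?prepared_statement_cache_size=0",
    poolclass=NullPool,  # let PgBouncer do the pooling
    echo=False,
)
```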

**Expected Impact**: Better connection management, +20-30% throughput

## Long-Term / Infrastructure Improvements

### 9. Horizontal Scaling

- Load balancer in front of multiple API containers
- Shared Redis session store
- Database connection pooler (PgBouncer)

### 10. Database Query Optimization

Monitor slow queries. Note that the `pg_stat_statements` view requires the extension to be enabled (add it to `shared_preload_libraries`, then run `CREATE EXTENSION pg_stat_statements;`):

```sql
-- Enable slow query logging (statements slower than 100 ms)
ALTER DATABASE echoes SET log_min_duration_statement = 100;

-- Find slow queries
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### 11. Asynchronous Task Queue

Offload heavy operations to background workers with Celery or RQ (a task sketch follows the list), for example:

- Combat damage calculations
- Loot generation
- Statistics updates
- Email notifications
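
A minimal Celery sketch, assuming a Redis broker at `redis://localhost:6379/0`; the module path and the `generate_loot` helper are hypothetical:

```python
# api/tasks.py (hypothetical module)
from celery import Celery

from game.loot import generate_loot  # hypothetical helper

celery_app = Celery("echoes", broker="redis://localhost:6379/0")

@celery_app.task
def generate_loot_task(enemy_id: str, player_id: int) -> dict:
    """Run loot generation outside the request/response cycle."""
    return generate_loot(enemy_id, player_id)

# In the API handler, enqueue instead of computing inline:
# generate_loot_task.delay(enemy_id, player_id)
```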

### 12. CDN for Static Assets

Move images to a CDN (Cloudflare, AWS CloudFront).

## Implementation Priority

### Phase 1 (Today - about 1 hour of work)

1. ✅ **Add database indexes** (30 min)
2. ✅ **Increase connection pool** (5 min)
3. ⚠️ Test and verify the improvements

**Expected Result**: 300-400 req/s, <2% failures

### Phase 2 (This Week)

1. Implement Redis caching for items/NPCs
2. Optimize the location query into a single combined query
3. Add the PgBouncer connection pooler

**Expected Result**: 500-700 req/s

### Phase 3 (Next Sprint)

1. Add database read replicas
2. Implement batch operations
3. Set up monitoring (Prometheus/Grafana)

**Expected Result**: 1000+ req/s
## Monitoring Recommendations

Add performance monitoring:

```python
# Add to api/main.py
from prometheus_client import Counter, Histogram
import time

request_duration = Histogram('http_request_duration_seconds', 'HTTP request latency')
request_count = Counter('http_requests_total', 'Total HTTP requests')

@app.middleware("http")
async def monitor_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    request_duration.observe(duration)
    request_count.inc()
    return response
```
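
For Prometheus to scrape these metrics the app also needs to expose them; a small sketch using prometheus_client's ASGI app (the `/metrics` mount path is an assumption):

```python
from prometheus_client import make_asgi_app

# Expose collected metrics at /metrics for Prometheus to scrape.
app.mount("/metrics", make_asgi_app())
```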

## Quick Performance Test Commands

```bash
# Test current performance
cd /opt/dockers/echoes_of_the_ashes
timeout 300 .venv/bin/python load_test.py

# Monitor database connections
docker exec echoes_of_the_ashes_db psql -U your_user -d echoes -c \
  "SELECT count(*) as connections FROM pg_stat_activity;"

# Check slow queries
docker exec echoes_of_the_ashes_db psql -U your_user -d echoes -c \
  "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"

# Monitor API CPU/memory
docker stats echoes_of_the_ashes_api
```

## Cost vs Benefit Analysis

| Improvement | Time to Implement | Performance Gain | Complexity |
|-------------|-------------------|------------------|------------|
| Database Indexes | 30 minutes | +50-70% | Low |
| Connection Pool | 5 minutes | +30-50% | Low |
| Optimize Location Query | 2 hours | +60-70% | Medium |
| Redis Caching | 4 hours | +40-60% | Medium |
| PgBouncer | 1 hour | +20-30% | Low |
| Read Replicas | 1 day | +100% (reads) | High |
| Batch Operations | 4 hours | +30-40% | Medium |

## Conclusion

**Most Impact for Least Effort**:

1. Add database indexes (30 min) → +50-70% faster queries
2. Increase connection pool (5 min) → +30-50% throughput
3. Add PgBouncer (1 hour) → +20-30% throughput

Combined, these changes **could reach 400-500 req/s with just a few hours of work**.

The current bottleneck is the **database**, not the API workers. Focus optimization there first.