echoes-of-the-ash/docs/development/SCALABILITY_ANALYSIS.md

# Scalability Analysis - Background Tasks

**Date:** October 21, 2025
**Scope:** Performance analysis for 10,000+ concurrent players

## Executive Summary

⚠️ **Current implementation has SEVERE scalability issues** at 10,000 players:

| Function | Current | 10K Players Impact | Risk Level |
|----------|---------|-------------------|------------|
| `regenerate_stamina()` | **O(n)** fetch-all + loop | ~10K DB queries every 5min | 🔴 **CRITICAL** |
| `check_combat_timers()` | **O(n)** fetch-all + loop | Fetch all combats every 30s | 🟡 **HIGH** |
| `decay_dropped_items()` | **O(1)** single DELETE | ~1 query every 5min | 🟢 **LOW** |

## Detailed Analysis

---

### 1. `regenerate_stamina()` - 🔴 CRITICAL ISSUE

**Current Implementation:**
```python
async def regenerate_all_players_stamina() -> int:
    # 1. SELECT ALL players below max stamina
    result = await conn.execute(
        players.select().where(
            (players.c.is_dead == False) &
            (players.c.stamina < players.c.max_stamina)
        )
    )
    players_to_update = result.fetchall()  # Load ALL into memory

    # 2. Loop through EACH player (O(n))
    for player in players_to_update:
        # Calculate recovery per player
        base_recovery = 1
        endurance_bonus = player.endurance // 10
        total_recovery = base_recovery + endurance_bonus
        new_stamina = min(player.stamina + total_recovery, player.max_stamina)

        # 3. Individual UPDATE query per player (O(n) queries!)
        await conn.execute(
            players.update()
            .where(players.c.telegram_id == player.telegram_id)
            .values(stamina=new_stamina)
        )
```

**Performance at Scale:**
- **10,000 active players** with stamina < max
- Runs every **5 minutes** (288 times per day)
- **Operations per cycle:**
  - 1 SELECT query → 10K rows loaded into memory
  - 10K individual UPDATE queries
  - **Total: 10,001 queries per cycle**
- **Daily load:** 2,880,000+ queries just for stamina regeneration!

**Memory Impact:**
- Loading 10K player objects into Python: ~5-10 MB per cycle
- Holding them during UPDATE loop: memory spike every 5 minutes

**Database Impact:**
- 10K sequential UPDATE queries = **MASSIVE lock contention**
- Each UPDATE acquires row locks
- Other queries (player actions) get blocked
- **Potential cascading failures** under load

**Network Latency:**
- If DB has 5ms latency: 10K × 5ms = **50 seconds** per cycle
- Blocks the async loop for 50+ seconds
- Other background tasks starve

---

### 2. `check_combat_timers()` - 🟡 HIGH RISK

**Current Implementation:**
```python
async def check_combat_timers():
    # Every 30 seconds:
    idle_combats = await database.get_all_idle_combats(idle_threshold)

    # In database.py:
    stmt = active_combats.select().where(
        active_combats.c.turn_started_at < idle_threshold
    )
    result = await conn.execute(stmt)
    return [row._asdict() for row in result.fetchall()]  # Load ALL

    # Loop through each combat
    for combat in idle_combats:
        await combat_logic.npc_attack(combat['player_id'])
```

**Performance at Scale:**
- Assume 5% of players in combat at any time: **500 combats**
- Runs every **30 seconds** (2,880 times per day)
- **Operations per cycle:**
  - 1 SELECT query → 500 rows
  - 500 × `npc_attack()` calls (each does multiple DB queries)
  - **Estimate: 500-1000 queries per cycle**

**Problems:**
- If combat rate increases (10% in combat): **1000 combats**
- `npc_attack()` itself does multiple DB operations:
  - Update combat state
  - Update player HP
  - Check for death
  - Potential inventory operations
- **Cascading load** during peak hours

**Edge Case Risk:**
- If many players go AFK simultaneously (server maintenance, network issue)
- Could have 1000+ idle combats to process at once
- 30-second cycle time becomes 5+ minutes
- Combats pile up, system collapses

---

### 3. `decay_dropped_items()` - 🟢 LOW RISK (Optimal)

**Current Implementation:**
```python
async def remove_expired_dropped_items(timestamp_limit: float) -> int:
    stmt = dropped_items.delete().where(
        dropped_items.c.drop_timestamp < timestamp_limit
    )
    result = await conn.execute(stmt)
    await conn.commit()
    return result.rowcount
```

**Performance at Scale:**
- **Single DELETE query** with WHERE clause
- Database handles filtering efficiently (indexed timestamp)
- **O(1) in terms of queries** (regardless of player count)
- Only cleanup work scales with number of expired items (which is constant per time window)

**Why This Works:**
- ✅ Single query, database-side filtering
- ✅ Indexed timestamp column
- ✅ No data loaded into Python memory
- ✅ Scales to millions of items

---

## Scalability Comparison Table

| Metric | `regenerate_stamina()` | `check_combat_timers()` | `decay_dropped_items()` |
|--------|------------------------|-------------------------|------------------------|
| **Queries/cycle** | 10,001 (10K players) | 500-1000 (500 combats) | 1 |
| **Memory usage** | 5-10 MB | 1-2 MB | <1 KB |
| **Cycle time** | 50+ seconds | 5-10 seconds | <100ms |
| **Lock contention** | **SEVERE** | Moderate | Minimal |
| **Network overhead** | **MASSIVE** | High | Low |
| **Scalability** | **O(n) queries** | O(m) queries | **O(1) queries** |
| **10K players** | 🔴 Breaks | 🟡 Struggles | 🟢 Fine |
| **100K players** | 💀 Dead | 💀 Dead | 🟢 Fine |

---

## Recommended Solutions

### 🔴 CRITICAL: Fix `regenerate_stamina()`

**Option 1: Single UPDATE Query (Best)**
```sql
-- PostgreSQL supports calculated updates
UPDATE players
SET stamina = LEAST(
    stamina + 1 + (endurance / 10),  -- base + endurance bonus
    max_stamina
)
WHERE is_dead = FALSE
  AND stamina < max_stamina
RETURNING telegram_id;
```

**Benefits:**
- **1 query instead of 10,001**
- Database calculates per-row (no Python loop)
- Atomic operation (no race conditions)
- **~1000x faster**

**Implementation:**
```python
async def regenerate_all_players_stamina() -> int:
    async with engine.connect() as conn:
        stmt = text("""
            UPDATE players
            SET stamina = LEAST(
                stamina + 1 + (endurance / 10),
                max_stamina
            )
            WHERE is_dead = FALSE
              AND stamina < max_stamina
        """)
        result = await conn.execute(stmt)
        await conn.commit()
        return result.rowcount
```

**Performance Gain:**
- 10K queries → **1 query**
- 50 seconds → **<1 second**
- No memory bloat
- No lock contention

---

**Option 2: Batch Updates (Good)**
If you need custom Python logic per player:
```python
async def regenerate_all_players_stamina() -> int:
    async with engine.connect() as conn:
        # Still fetch all (1 query)
        result = await conn.execute(
            players.select().where(
                (players.c.is_dead == False) &
                (players.c.stamina < players.c.max_stamina)
            )
        )
        players_to_update = result.fetchall()

        # Build batch update
        updates = []
        for player in players_to_update:
            base_recovery = 1
            endurance_bonus = player.endurance // 10
            total_recovery = base_recovery + endurance_bonus
            new_stamina = min(player.stamina + total_recovery, player.max_stamina)

            if new_stamina > player.stamina:
                updates.append({
                    'telegram_id': player.telegram_id,
                    'stamina': new_stamina
                })

        # Single bulk update (PostgreSQL specific)
        if updates:
            await conn.execute(
                players.update(),
                updates
            )

        await conn.commit()
        return len(updates)
```

**Performance Gain:**
- 10K queries → **2 queries** (1 SELECT + 1 bulk UPDATE)
- 50 seconds → **1-2 seconds**
- Still loads data into memory (not ideal)

---

### 🟡 HIGH: Optimize `check_combat_timers()`

**Option 1: Limit + Pagination**
```python
async def check_combat_timers():
    BATCH_SIZE = 100
    while not shutdown_event.is_set():
        try:
            await asyncio.wait_for(shutdown_event.wait(), timeout=30)
        except asyncio.TimeoutError:
            idle_threshold = time.time() - 300
            offset = 0

            while True:
                # Process in batches
                idle_combats = await database.get_idle_combats_paginated(
                    idle_threshold,
                    limit=BATCH_SIZE,
                    offset=offset
                )

                if not idle_combats:
                    break

                for combat in idle_combats:
                    try:
                        from bot import combat as combat_logic
                        if combat['turn'] == 'player':
                            await database.update_combat(combat['player_id'], {
                                'turn': 'npc',
                                'turn_started_at': time.time()
                            })
                            await combat_logic.npc_attack(combat['player_id'])
                    except Exception as e:
                        logger.error(f"Error processing idle combat: {e}")

                offset += BATCH_SIZE
```

**Benefits:**
- Processes 100 at a time instead of all
- Prevents memory spikes
- Other tasks can interleave

---

**Option 2: Database-Side Auto-Timeout**
```sql
-- Add trigger to auto-switch turns
CREATE OR REPLACE FUNCTION auto_timeout_combat()
RETURNS trigger AS $$
BEGIN
    IF NEW.turn_started_at < (EXTRACT(EPOCH FROM NOW()) - 300) THEN
        NEW.turn := CASE
            WHEN NEW.turn = 'player' THEN 'npc'
            ELSE 'player'
        END;
        NEW.turn_started_at := EXTRACT(EPOCH FROM NOW());
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```

**Benefits:**
- No Python loop needed
- Database handles it automatically
- Zero application load

---

### 🟢 `decay_dropped_items()` - Already Optimal

No changes needed. This is the **gold standard** for background tasks.

---

## Performance Projections

### Current System (Before Optimization)

| Players | Stamina Regen Time | Combat Check Time | Total Background Load |
|---------|-------------------|-------------------|---------------------|
| 100     | 0.5s              | 0.1s              | Negligible          |
| 1,000   | 5s                | 1s                | Manageable          |
| 10,000  | **50s+**          | **10s+**          | 🔴 **Breaking**     |
| 100,000 | **500s+**         | **100s+**         | 💀 **Dead**         |

### After Optimization (Single-Query Approach)

| Players | Stamina Regen Time | Combat Check Time | Total Background Load |
|---------|-------------------|-------------------|---------------------|
| 100     | 0.1s              | 0.1s              | Negligible          |
| 1,000   | 0.2s              | 0.5s              | Low                 |
| 10,000  | **0.5s**          | **2s**            | 🟢 **Good**         |
| 100,000 | **2s**            | **10s**           | 🟡 **Acceptable**   |

---

## Additional Recommendations

### 1. Add Database Indexes
```sql
-- Speed up stamina regeneration query
CREATE INDEX idx_players_stamina_regen
ON players(is_dead, stamina)
WHERE is_dead = FALSE AND stamina < max_stamina;

-- Speed up idle combat check
CREATE INDEX idx_combat_turn_time
ON active_combats(turn_started_at);

-- Already optimal for dropped items
CREATE INDEX idx_dropped_items_timestamp
ON dropped_items(drop_timestamp);
```

### 2. Add Monitoring
```python
import time

async def regenerate_stamina():
    while not shutdown_event.is_set():
        try:
            await asyncio.wait_for(shutdown_event.wait(), timeout=300)
        except asyncio.TimeoutError:
            start_time = time.time()
            logger.info("Running stamina regeneration...")

            players_updated = await database.regenerate_all_players_stamina()

            elapsed = time.time() - start_time
            logger.info(
                f"Regenerated stamina for {players_updated} players "
                f"in {elapsed:.2f}s"
            )

            # Alert if slow
            if elapsed > 5.0:
                logger.warning(
                    f"⚠️ Stamina regeneration took {elapsed:.2f}s "
                    f"(threshold: 5s)"
                )
```

### 3. Add Connection Pooling
```python
# In database.py
from sqlalchemy.pool import NullPool, QueuePool

engine = create_async_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,          # Max 20 connections
    max_overflow=10,       # Allow 10 more if needed
    pool_pre_ping=True,    # Test connections before use
)
```

### 4. Consider Redis for Hot Data
For frequently accessed data (player stats, combat state):
```python
import redis.asyncio as redis

# Cache player stamina in Redis
async def get_player_cached(player_id: int):
    cached = await redis_client.get(f"player:{player_id}")
    if cached:
        return json.loads(cached)

    # Fetch from DB, cache for 1 minute
    player = await database.get_player(player_id)
    await redis_client.setex(
        f"player:{player_id}",
        60,
        json.dumps(player)
    )
    return player
```

---

## Implementation Priority

1. **🔴 IMMEDIATE:** Fix `regenerate_stamina()` with single-query approach
2. **🟡 HIGH:** Add batching to `check_combat_timers()`
3. **🟢 MEDIUM:** Add database indexes
4. **🟢 MEDIUM:** Add performance monitoring
5. **🔵 LOW:** Consider Redis caching (only if needed)

---

## Conclusion

**Current state at 10,000 players:**
- ❌ `regenerate_stamina()`: **WILL BREAK** (50+ seconds per cycle, 10K queries)
- ⚠️ `check_combat_timers()`: **WILL STRUGGLE** (500-1000 queries per cycle)
- ✅ `decay_dropped_items()`: **WORKS PERFECTLY** (1 query, optimal design)

**After optimization:**
- ✅ All tasks complete in **<5 seconds** total
- ✅ Scales to **100,000+ players**
- ✅ Minimal database load
- ✅ No memory bloat

**Bottom line:** The single-query approach for `regenerate_stamina()` is **CRITICAL** for any production deployment beyond 1000 players.