firmware-base/docs/seq-heating.md

# AmperageBudgetManager Sequencing Analysis

## Problem Statements

1. **Premature Abortion**: Devices stop heating earlier than expected.
2. **Reduced Concurrency**: "Sometimes heating only 1 instead of set max".
3. **Erratic Behavior**: General instability in sequencing modes.

## Analysis of `E_AM_CYCLE_ALL`

### 1. Transient Error Sensitivity

The `_checkHeatup` function returns `false` immediately if `device->hasError()` is true. This can be triggered by a single failed Modbus transaction (e.g., packet loss).

- **Consequence**: If a running device encounters a transient error, it is immediately stopped (`device->stop()`), and the sequencer advances to the next index (`advanceCurrentIndex()`).
- **Impact**: The device loses its slot in the current window. It must wait for the window to cycle all the way around before it can heat again.

### 2. "Heatup" vs "Oscillating" Mode Instability

The logic distinguishes between "Heatup" (aggressive, long duration) and "Oscillating" (maintenance, short duration) using `_deviceInHeatup`.

```cpp
bool isPastHeatup = !_deviceInHeatup[deviceIndex];
uint32_t maxDuration = isPastHeatup ? m_maxHeatingDurationOscillatingS.getValue() : m_maxHeatingDurationS.getValue();
```

- **Scenario**: A device is heating in "Heatup" mode (e.g., running for 100s, limit 300s).
- **Trigger**: A transient condition causes `_checkHeatup` or `isHeatup()` to return `false` momentarily (e.g., PV spikes near SP, or momentary error).
- **Effect**: `isPastHeatup` becomes `true`. The `maxDuration` switches to `Oscillating` limit (default 15s).
- **Consequence**: Since 100s > 15s, the check `now_s - start > maxDuration` immediately passes. The device is stopped, and the index advances.
- **Result**: "Premature abortion" of the heatup phase.

### 3. Concurrency Reduction (Sliding Window Race)

When a device is stopped (due to error or timeout), `advanceCurrentIndex()` is called immediately.

- If we have devices [A, B, C] and MaxSim=2. Window=[A, B].
- A aborts (transient error). Index moves to B. Window=[B, C].
- A is now outside the window.
- If A recovers immediately, it cannot restart because it is outside the window.
- It leaves B (and potentially C) heating.
- If this happens frequently, devices effectively take turns crashing out of the window, reducing effective concurrency.

## Proposed Fixes

### 1. Add Debounce / Grace Period for Errors and Completion

Do not stop a device immediately on a single cycle of `!_checkHeatup`. Requires a persistence counter or time-based hysteresis.

**Implementation Plan**:

- Introduce `_deviceStateInSyncCount[MAX_MANAGED_DEVICES]` or similar.
- Require `N` consecutive cycles of "Not Heatup" or "Error" before taking stop action.
- Alternatively, check `lastUpdate` timestamp of the device to ignore stale errors.

### 2. Sticky "Heatup" State

Once a device is considered "In Heatup", it should probably remain in that logical state for the duration of its heating slot, or require significant evidence to drop to "Oscillating".
However, `_deviceInHeatup` is refreshed every `loop()`.

**Better Approach**: Use the `maxDuration` based on the state *at the start of the heating cycle* or maintain a latched state that is only cleared after a stop.

### 3. Decouple "Stop" from "Advance" on Error

If a device errors out, we might want to Stop it for safety, but *NOT* advance the index immediately, giving it a chance to recover within its slot?

- Risk: If it's dead, we block the slot.
- Compromise: Advance only after `N` seconds of error.

## Recommended Immediate Changes (Low Risk)

1. **Modify `_checkHeatup`** to ignore `hasError()` if the error is transient/recent (or simply log and return true if data is stale but not ancient). *Actually, safer to just rely on debounce.*
2. **Debounce State Changes**: In `updateDevice`, do not act on `!_checkHeatup` unless it persists for X cycles (e.g., 3 cycles = 1.5 seconds).

### 4. Fix Latching Error Logic (High Priority)

The use of `device->hasError()` (which checks `errorCount > 0`) is incorrect for real-time control.

- **Fix**: Change `_checkHeatup` to ignore `device->hasError()` (cumulative). Instead, check `device->getLastErrorCode()`.
- **Refinement**: Only treat **Timeout (224/0xE0)** as a critical failure. Ignore CRC/Collision errors.
- **Proposed Logic**: `bool isDead = device->getLastErrorCode() == (uint16_t)MB_Error::Timeout;`
- This ensures we only Abort/Stop when the device is truly unreachable, not just experiencing noise.

```cpp
// Pseudo-code
if (_deviceHeating[i]) {
    bool currentNeed = _checkHeatup(device);
    if (!currentNeed) {
        _stopConfirmationCount[i]++;
        if (_stopConfirmationCount[i] < DEBOUNCE_THRESHOLD) return false; // Ignore for now
    } else {
        _stopConfirmationCount[i] = 0;
    }
    // ... proceed to time checks ...
}
```