Problem
Our LLM-based agent (ReAct-style reasoning + short-term memory of last 10 actions + binary progress feedback) shows a high action repetition rate (~60% over 250 episodes). The agent frequently selects the same action with similar parameters, even when it does not lead to state-space growth. This suggests policy collapse, insufficient exploration, and/or weak credit assignment from the current progress signal.
Top 3 Approaches
Stagnation-Aware Repetition Penalty / Action Masking
Add an explicit penalty or temporary mask for repeating the same action when no progress (Δstate ≤ 0) occurs. This discourages local loops while preserving exploitation when the action is effective.
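A minimal sketch of the stagnation-aware mask, assuming hypothetical names: `score` stands in for whatever preference the policy reports per action, and `last_delta` is the change in explored-state count after the previous action.

```python
def select_action(actions, score, last_action=None, last_delta=0):
    """Pick the highest-scoring action, masking the previous action
    when it produced no progress (Δstate <= 0).

    `score` and `last_delta` are hypothetical stand-ins for the
    policy's preference function and the progress signal.
    """
    candidates = list(actions)
    # Stagnation-aware mask: drop the repeated action only when it
    # made no progress, and only if an alternative exists.
    if last_action in candidates and last_delta <= 0 and len(candidates) > 1:
        candidates.remove(last_action)
    return max(candidates, key=score)
```

Because the mask fires only on Δstate ≤ 0, a productive action can still be repeated freely, which preserves exploitation.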
Exploration & Novelty Incentives
Introduce entropy regularization, ε-greedy sampling, or a lightweight novelty bonus (e.g., state- or (state, action)-count–based reward). This promotes policy diversity and reduces collapse to a single overused action.
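A sketch combining ε-greedy sampling with a (state, action)-count novelty bonus. All names are illustrative: `base_scores` stands in for whatever per-action preferences the LLM reports, and the bonus form 1/√(1+N) is one common choice, not a fixed requirement.

```python
import math
import random
from collections import Counter

class NoveltyPolicy:
    """ε-greedy selection with a count-based novelty bonus (sketch)."""

    def __init__(self, epsilon=0.1, beta=1.0, seed=0):
        self.epsilon = epsilon   # probability of a uniformly random action
        self.beta = beta         # weight of the novelty bonus
        self.counts = Counter()  # (state, action) visit counts
        self.rng = random.Random(seed)

    def select(self, state, actions, base_scores):
        if self.rng.random() < self.epsilon:
            choice = self.rng.choice(actions)
        else:
            # Bonus beta / sqrt(1 + N(s, a)): rarely tried actions
            # score higher, so an overused action loses its edge.
            def total(a):
                n = self.counts[(state, a)]
                return base_scores[a] + self.beta / math.sqrt(1 + n)
            choice = max(actions, key=total)
        self.counts[(state, choice)] += 1
        return choice
```

With equal base scores, the bonus alone forces rotation across actions, which directly attacks the ~60% repetition rate.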
Markov-Chain Transition Guidance
Instead of letting the LLM select actions freely, use a learned action-transition probability matrix to guide (or bias) selection, e.g., down-weighting self-transitions that historically led to no progress. Explore how this matrix can be used to reduce repetition.
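A sketch of the transition-matrix idea, under stated assumptions: transition counts are learned online from observed action pairs, rows are Laplace-smoothed, and the self-transition penalty factor (0.25 here) is an arbitrary value to tune empirically.

```python
import random
from collections import defaultdict

class TransitionGuide:
    """Bias action selection with a learned action-transition matrix.

    counts[prev][next] records how often `next` followed `prev`;
    sampling from the (penalized) row probabilities discourages the
    agent from repeating the same action step after step.
    """

    def __init__(self, self_penalty=0.25, seed=0):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.self_penalty = self_penalty  # assumed factor, tune empirically
        self.rng = random.Random(seed)

    def update(self, prev_action, next_action):
        self.counts[prev_action][next_action] += 1

    def weights(self, prev_action, actions):
        row = self.counts[prev_action]
        total = sum(row[a] for a in actions)
        w = []
        for a in actions:
            # Laplace-smoothed transition probability
            p = (row[a] + 1) / (total + len(actions))
            if a == prev_action:
                p *= self.self_penalty  # down-weight self-transitions
            w.append(p)
        return w

    def sample(self, prev_action, actions):
        return self.rng.choices(
            actions, weights=self.weights(prev_action, actions)
        )[0]
```

Even when "a → a" dominates the learned counts, the penalized weights shift probability mass toward alternative actions, so the matrix curbs rather than reinforces the loop.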