Our method can be considered an instance of Natural Language Reinforcement Learning, in which state-action values are expressed in language space. The key novelties of our approach are:
In mathematical reasoning, our learned critic acts as a process reward model, identifying incorrect steps in long reasoning traces. This enables the refinement policy to correct hard-to-find mistakes in a single refinement; a schematic sketch of this critique-and-refine loop is given after these examples.
In 20 Questions, the base policy often resorts to a linear search over one specific characteristic of an object, when it would likely be more effective to explore other discriminating attributes. In the example, the refinement policy explicitly corrects this linear strategy when it occurs.
Finally, in Tau-Bench, one of the most common failure modes is partial resolution of complex requests, especially when the policy must also follow complicated dynamics and rules. In the example, the policy is told that the user wants to make ``a couple of exchanges,'' but according to the policy guidelines, database modifications can only be made via a single tool call per rollout. This leads to a nuanced error in which the policy prematurely modifies the database before resolving all components of the request. Our learned critic identifies exactly which policy guideline would be violated, allowing the refinement policy to easily understand and correct the error.
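To make the mechanism in these examples concrete, the sketch below shows one hypothetical way a language-space critic and a refinement policy can be composed: the critic returns a textual assessment of a trajectory rather than a scalar value (e.g., the first incorrect step, a wasted question, or the guideline being violated), and the refinement policy conditions on that critique to produce a corrected trajectory. All prompts, function names, and the generate wrapper are illustrative assumptions, not our implementation.

```python
# Minimal sketch of a critique-and-refine loop with a language-space critic.
# The critic's output is free-form text (a "value" in language space), which the
# refinement policy then conditions on. The `generate` callable is assumed to wrap
# any LLM text-completion API; prompts and names here are illustrative only.

from typing import Callable

Generate = Callable[[str], str]  # prompt -> completion


def critique(generate: Generate, task: str, trajectory: str) -> str:
    """Language-space critic: returns a textual assessment instead of a scalar,
    e.g. pointing at the first incorrect step or the rule being violated."""
    prompt = (
        f"Task:\n{task}\n\nTrajectory:\n{trajectory}\n\n"
        "Assess this trajectory step by step. Identify the first step that is "
        "incorrect or that violates the task's rules, and explain why."
    )
    return generate(prompt)


def refine(generate: Generate, task: str, trajectory: str, feedback: str) -> str:
    """Refinement policy: rewrites the trajectory conditioned on the critique."""
    prompt = (
        f"Task:\n{task}\n\nPrevious trajectory:\n{trajectory}\n\n"
        f"Critique:\n{feedback}\n\n"
        "Produce a corrected trajectory that addresses the critique."
    )
    return generate(prompt)


def critique_and_refine(generate: Generate, task: str, trajectory: str) -> str:
    """One round of critique followed by a single refinement pass."""
    feedback = critique(generate, task, trajectory)
    return refine(generate, task, trajectory, feedback)
```

Because the critique is free-form text, it can point to a specific reasoning step, a suboptimal questioning strategy, or a particular guideline, which is what allows a single refinement pass to suffice in the examples above.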