Our method can be considered an instance of Natural Language Reinforcement Learning, in which state-action values are expressed in language space. The key novelties of our approach are:
In mathematical reasoning, our learned critic acts as a process reward model, identifying incorrect steps in long reasoning traces. This enables the refinement policy to correct hard-to-find mistakes in a single refinement; a schematic sketch of this critique-and-refine loop is given after these examples.
In 20 Questions, the base policy often resorts to a linear search over one specific characteristic of an object, when it would likely be more effective to explore other discriminating attributes. In the example, the refinement policy explicitly corrects this linear strategy when it occurs.
Finally, in Tau-Bench, one of the most common failure modes is partial resolution of complex requests, especially when the policy must also follow complicated dynamics and rules. In the example, the policy is told that the user wants to make ``a couple of exchanges,'' but according to the policy guidelines, database modifications can only be made via a single tool call per rollout. This leads to a nuanced error in which the policy prematurely modifies the database before resolving all components of the request. Our learned critic identifies exactly which policy guideline would be violated, allowing the refinement policy to easily understand and correct the error.
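To make the mechanism in these examples concrete, the sketch below shows one hypothetical way a language-space critic and a refinement policy can be composed: the critic returns a textual assessment of a trajectory rather than a scalar value (e.g., the first incorrect step, a wasted question, or the guideline being violated), and the refinement policy conditions on that critique to produce a corrected trajectory. All prompts, function names, and the generate wrapper are illustrative assumptions, not our implementation.

```python
# Minimal sketch of a critique-and-refine loop with a language-space critic.
# The critic's output is free-form text (a "value" in language space), which the
# refinement policy then conditions on. The `generate` callable is assumed to wrap
# any LLM text-completion API; prompts and names here are illustrative only.

from typing import Callable

Generate = Callable[[str], str]  # prompt -> completion


def critique(generate: Generate, task: str, trajectory: str) -> str:
    """Language-space critic: returns a textual assessment instead of a scalar,
    e.g. pointing at the first incorrect step or the rule being violated."""
    prompt = (
        f"Task:\n{task}\n\nTrajectory:\n{trajectory}\n\n"
        "Assess this trajectory step by step. Identify the first step that is "
        "incorrect or that violates the task's rules, and explain why."
    )
    return generate(prompt)


def refine(generate: Generate, task: str, trajectory: str, feedback: str) -> str:
    """Refinement policy: rewrites the trajectory conditioned on the critique."""
    prompt = (
        f"Task:\n{task}\n\nPrevious trajectory:\n{trajectory}\n\n"
        f"Critique:\n{feedback}\n\n"
        "Produce a corrected trajectory that addresses the critique."
    )
    return generate(prompt)


def critique_and_refine(generate: Generate, task: str, trajectory: str) -> str:
    """One round of critique followed by a single refinement pass."""
    feedback = critique(generate, task, trajectory)
    return refine(generate, task, trajectory, feedback)
```

Because the critique is free-form text, it can point to a specific reasoning step, a suboptimal questioning strategy, or a particular guideline, which is what allows a single refinement pass to suffice in the examples above.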