Key Account Solutions Manager

rriiffaatt77
Posts: 5
Joined: Mon Dec 23, 2024 3:47 pm


Post by rriiffaatt77 »

It seems that the effect increases as computing power increases, which is the so-called RL scaling law. But that is really just the original meaning of tree search, so I think calling it an RL scaling law is a bit of a misnomer.

Tencent Technology's Zhou Xiaoyan and Hao Boyang's hypothesis: the PRM triggers an MCTS-style search only when a response is unacceptable, and otherwise uses the more economical beam search. The argument rests on response time and token consumption: according to calculations by developers using the API on Hacker News, the tokens o1 spends on thinking are several times larger than the tokens in the response, which is roughly what GPT-4o mini uses without thinking.
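
As a rough illustration of the "more economical beam search" part of this hypothesis, here is a minimal sketch of PRM-scored beam search over reasoning steps. The propose_steps and prm_score callables and all parameters are hypothetical stand-ins for the example, not a known o1 implementation.

```python
from typing import Callable, List, Tuple

def beam_search_reasoning(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],  # proposes candidate next steps
    prm_score: Callable[[str, List[str]], float],           # scores a partial reasoning chain
    beam_width: int = 5,
    max_depth: int = 10,
) -> List[str]:
    # Each beam entry is (PRM score of the partial chain, the chain itself).
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        candidates: List[Tuple[float, List[str]]] = []
        for _, chain in beams:
            for step in propose_steps(question, chain):
                new_chain = chain + [step]
                candidates.append((prm_score(question, new_chain), new_chain))
        if not candidates:
            break
        # Keep only the top-scoring partial chains instead of expanding a full
        # search tree; this pruning is what makes beam search cheaper than MCTS.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # Return the highest-scoring reasoning chain found.
    return beams[0][1]
```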



The thinking chain's consumption is roughly 6-fold. If you use a lookahead that looks three steps ahead and forms 5 candidates at each step, a single-layer lookahead search will consume 5 times more tokens, and if the thinking chain requires a lookahead at every step, it will far exceed that token count. In addition, given the large amount of computation MCTS requires, o1's current response time is far from sufficient for it. But if only a thinking chain were used, even for very complex problems the token consumption would be at most 5-fold, so a consumption of 6 times is too large for that.
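
To make this back-of-the-envelope arithmetic concrete, here is a small sketch; the step count and tokens-per-step figures are illustrative assumptions, and only the "three steps ahead, 5 candidates" shape comes from the reasoning above.

```python
# Back-of-the-envelope comparison of token cost for a plain chain of thought
# versus chain of thought with lookahead search. All numbers are assumptions
# chosen to mirror the "3 steps ahead, 5 candidates per step" example.
STEPS = 20               # reasoning steps in the final chain (assumed)
TOKENS_PER_STEP = 50     # tokens generated per reasoning step (assumed)
LOOKAHEAD_DEPTH = 3      # how many steps ahead the search peeks
CANDIDATES = 5           # candidate continuations generated per lookahead

plain_cot = STEPS * TOKENS_PER_STEP

# One lookahead from a single position expands CANDIDATES branches,
# each LOOKAHEAD_DEPTH steps deep (branching only at the first level here).
one_lookahead = CANDIDATES * LOOKAHEAD_DEPTH * TOKENS_PER_STEP

# Doing a lookahead at every step of the chain multiplies the cost accordingly.
cot_with_lookahead = STEPS * one_lookahead + plain_cot

print(f"plain CoT tokens:            {plain_cot}")
print(f"single lookahead tokens:     {one_lookahead}")
print(f"CoT + lookahead every step:  {cot_with_lookahead}")
print(f"multiplier over plain CoT:   {cot_with_lookahead / plain_cot:.1f}x")
```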



The Peking University alignment team's hypothesis: the key technology behind o1 is a search-and-learning engine for reinforcement learning. Let the model learn to reason, then use sufficiently powerful computation to achieve scaling in the post-training phase, similar to an extended version of -a. The model learns the process of generating reasonable reasoning, and the role of MCTS is to induce the generation of reasonable reasoning processes, or to construct appropriate partial-order pairs that serve as fine-grained reward signals, rather than to directly search for the process and the final response. To optimize this process, a number of methods have been developed, including providing reward signals at the token or clause level to help the model adjust its generated responses.
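
As an illustration of the partial-order idea, here is a minimal sketch of how value estimates from a search could be turned into (preferred, dispreferred) step pairs that serve as fine-grained training signals. The StepCandidate structure, the value field, and the margin threshold are assumptions for the example, not details from the hypothesis.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

@dataclass
class StepCandidate:
    prefix: str   # reasoning chain so far (shared context)
    step: str     # one candidate continuation of that chain
    value: float  # value estimate for this step from the search (e.g. MCTS backups)

def build_partial_order_pairs(
    candidates: List[StepCandidate],
    margin: float = 0.1,
) -> List[Tuple[StepCandidate, StepCandidate]]:
    """Return (preferred, dispreferred) step pairs sharing the same prefix.

    Pairs whose value estimates differ by less than `margin` are skipped,
    so only clear preferences become reward-style training signals.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        if a.prefix != b.prefix:
            continue  # only compare steps branching from the same partial chain
        if abs(a.value - b.value) < margin:
            continue
        better, worse = (a, b) if a.value > b.value else (b, a)
        pairs.append((better, worse))
    return pairs

# Example usage with made-up search values:
cands = [
    StepCandidate("Q: ... Step 1: ...", "Step 2: try factoring", 0.72),
    StepCandidate("Q: ... Step 1: ...", "Step 2: guess and check", 0.31),
]
print(build_partial_order_pairs(cands))
```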