Table of Contents
- Horizon Generalization in Reinforcement Learning
- HIQL: Offline Goal-Conditioned RL with Latent States as Actions
- Contrastive Preference Learning: Learning from Human Feedback without RL
- Controlled Diversity with Preference: Towards Learning a Diverse Set of Desired Skills
- Human-Aligned Skill Discovery Balancing Behaviour Exploration and Alignment
- Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
- SMAC-R1: The Emergence of Intelligence in Decision-Making Tasks
- Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
- VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
- Rethinking Reward Modeling in Preference-based Large Language Model Alignment
- DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
- Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
- Data Center Cooling System Optimization Using Offline Reinforcement Learning
- SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
- Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
- Thinkless: LLM Learns When to Think
- Learning to Reason without External Rewards
Horizon Generalization in Reinforcement Learning
- arxiv:https://arxiv.org/abs/2501.02709
- website:https://horizon-generalization.github.io/
- Source: A new arXiv paper by Benjamin Eysenbach; a classmate said it looks interesting.
- Main content:
HIQL: Offline Goal-Conditioned RL with Latent States as Actions
- arxiv:https://arxiv.org/abs/2307.11949
- website:https://seohong.me/projects/hiql/
- Source: Recommended by a collaborator; it also appears to be by Benjamin Eysenbach.
Contrastive Preference Learning: Learning from Human Feedback without RL
- arxiv:https://arxiv.org/abs/2310.13639
- GitHub:https://github.com/jhejna/cpl
- Source: Found by chance while searching; ICLR 2024. I think I have read it before.
- Main content:
Controlled Diversity with Preference: Towards Learning a Diverse Set of Desired Skills
- arxiv:https://arxiv.org/abs/2303.04592
- Source: [mask]
Human-Aligned Skill Discovery Balancing Behaviour Exploration and Alignment
- arxiv:https://arxiv.org/abs/2501.17431
- Source: [mask]
Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
- arxiv:https://arxiv.org/abs/2502.08985
- Source: A classmate's latest work.
- Main content:
- The setting here is offline multi-task MARL; in particular, the agents are trained only on (say) three-agent cooperation scenarios and can then generalize to cooperation among an arbitrary number of agents. The story my classmate told is that a transformer acts as a translator, translating three-agent cooperative actions into many-agent ones, which sounds like a very nice story (a rough sketch of what such a translator might look like is below).
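A minimal sketch of how such a translator might work, purely my own guess from the description above (all module names, shapes, and hyperparameters are hypothetical, not from the paper): a transformer over per-agent tokens is indifferent to the number of agents, so the same weights trained with a 3-agent team can be run with any team size.

```python
# Hypothetical sketch (my own guess, not the paper's code): a transformer over
# per-agent tokens accepts any number of agents, so a policy trained with a
# 3-agent team can be evaluated with a larger team unchanged.
import torch
import torch.nn as nn

class AgentTranslator(nn.Module):
    def __init__(self, obs_dim, act_dim, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)             # per-agent observation -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, act_dim)               # token -> per-agent action logits

    def forward(self, obs):                                    # obs: (batch, n_agents, obs_dim)
        tokens = self.encoder(self.embed(obs))                 # attention mixes info across agents
        return self.head(tokens)                               # (batch, n_agents, act_dim)

policy = AgentTranslator(obs_dim=10, act_dim=5)
logits_3 = policy(torch.randn(1, 3, 10))   # training-time team size
logits_8 = policy(torch.randn(1, 8, 10))   # same weights, larger team at test time
```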
SMAC-R1: The Emergence of Intelligence in Decision-Making Tasks
- arxiv:https://arxiv.org/abs/2410.16024
- Source: Saw it on Zhihu, but I can no longer find the original post.
- Main content:
- Uses an LLM to generate Python decision-tree code for playing SMAC (a toy illustration of such generated code is below).
- Specific method:
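Since the method bullet above is still empty, here is only a toy illustration of the kind of Python decision-tree policy an LLM might emit for a SMAC-style micromanagement task. All state fields and action names below are hypothetical and do not correspond to SMAC's real interface or the paper's prompts.

```python
# Toy illustration only: a hand-readable decision-tree policy of the kind an LLM
# could generate for a SMAC-style unit. Field/action names are hypothetical.
def choose_action(unit_state):
    if unit_state["own_health"] < 0.3:
        # Low health: kite away from the closest enemy.
        return {"type": "move", "direction": unit_state["away_from_nearest_enemy"]}
    if unit_state["enemy_in_range"]:
        # Focus fire on the weakest enemy currently in range.
        return {"type": "attack", "target": unit_state["weakest_enemy_in_range"]}
    # Otherwise advance toward the nearest enemy.
    return {"type": "move", "direction": unit_state["toward_nearest_enemy"]}

example_state = {
    "own_health": 0.8,
    "enemy_in_range": True,
    "weakest_enemy_in_range": 2,
    "away_from_nearest_enemy": "north",
    "toward_nearest_enemy": "south",
}
print(choose_action(example_state))  # {'type': 'attack', 'target': 2}
```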
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
- arxiv:https://arxiv.org/abs/1903.08254
- Source: [mask]
- Main content:
- This paper proposes the PEARL method (a rough sketch of the idea is below).
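From what I remember of PEARL (a hedged re-sketch, not the authors' code): an encoder infers a Gaussian latent context z from a batch of recent transitions, a z is sampled, and the policy and critics are conditioned on it, which enables off-policy adaptation via posterior sampling. The averaging of Gaussian statistics below is a simplification of PEARL's product-of-factors encoder.

```python
# Hedged, simplified sketch of PEARL's core idea (not the authors' implementation):
# infer a Gaussian latent context z from recent transitions, sample z, and
# condition the policy on it.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, transition_dim, latent_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # per-transition mean and log-variance
        )

    def forward(self, context):                  # context: (n_transitions, transition_dim)
        mu, logvar = self.net(context).chunk(2, dim=-1)
        # Simplification: average per-transition statistics instead of PEARL's
        # product of Gaussian factors.
        mu, logvar = mu.mean(0), logvar.mean(0)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample

class ContextConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))

encoder = ContextEncoder(transition_dim=12, latent_dim=5)
policy = ContextConditionedPolicy(obs_dim=8, act_dim=2, latent_dim=5)
z = encoder(torch.randn(16, 12))      # task belief inferred from 16 recent transitions
action = policy(torch.randn(8), z)    # act conditioned on the sampled context
```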
VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
- arxiv:https://arxiv.org/abs/1910.08348
- Source: [mask]
- Main content:
- This paper proposes the VariBAD method (a rough sketch of the idea is below).
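As I understand VariBAD (again a hedged sketch, not the paper's code): an RNN-based VAE encodes the trajectory so far into a Gaussian posterior over a task embedding, and the policy conditions on that posterior (mean and variance) as an approximate belief state; the VAE decoder and ELBO training are omitted here.

```python
# Hedged, simplified sketch of VariBAD's core idea (decoder and ELBO training omitted):
# an RNN encodes the trajectory so far into a Gaussian posterior over a task
# embedding, and the policy conditions on the belief (mu, logvar) directly.
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    def __init__(self, step_dim, latent_dim, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(step_dim, hidden, batch_first=True)
        self.to_stats = nn.Linear(hidden, 2 * latent_dim)

    def forward(self, traj):                     # traj: (1, t, step_dim), steps are (s, a, r)
        _, h = self.rnn(traj)
        mu, logvar = self.to_stats(h[-1]).chunk(2, dim=-1)
        return mu, logvar                        # posterior over the task embedding

class BeliefConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, mu, logvar):          # condition on the full belief, not a sample
        return self.net(torch.cat([obs, mu, logvar], dim=-1))

encoder = TrajectoryEncoder(step_dim=12, latent_dim=5)
policy = BeliefConditionedPolicy(obs_dim=8, act_dim=2, latent_dim=5)
mu, logvar = encoder(torch.randn(1, 7, 12))      # belief after 7 steps in the episode
action = policy(torch.randn(1, 8), mu, logvar)
```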
Rethinking Reward Modeling in Preference-based Large Language Model Alignment
- arxiv:https://arxiv.org/abs/2411.04991
- OpenReview:https://openreview.net/forum?id=rfdblE10qm
- Source: ICLR 2025 oral.
- Main content:
- This paper is about RLHF for LLMs. Reportedly, instead of modeling the reward with a Bradley-Terry model, it directly trains a classifier to predict whether a pair (x, y) is good or bad, and then uses the classifier's probability logit as the reward for RLHF (a sketch contrasting the two losses is below).
- Does it use unpaired comparisons \((x_1, y_1^+, x_2, y_2^-)\), rather than shuffling paired comparisons \((x, y^+, y^-)\)? (?)
- Are the experiments too toy? (?) What does the theory roughly say? (?)
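A sketch of the contrast I mean above (my reading of the note, not the paper's code): the Bradley-Terry reward model is trained on paired comparisons \((x, y^+, y^-)\), while the classifier variant is trained on individually labeled samples and its logit is reused as the RLHF reward.

```python
# Hedged sketch contrasting the two reward-modeling losses (my reading, not the paper's code).
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_pos, r_neg):
    """Pairwise Bradley-Terry loss on rewards for (x, y+) and (x, y-)."""
    return -F.logsigmoid(r_pos - r_neg).mean()

def classifier_loss(logits, labels):
    """Binary classification: is this single (x, y) good (1) or bad (0)?"""
    return F.binary_cross_entropy_with_logits(logits, labels)

def classifier_reward(logits):
    """Use the classifier's logit directly as the RLHF reward signal."""
    return logits

# Toy usage with fake reward-model outputs:
r_pos, r_neg = torch.randn(4), torch.randn(4)                    # paired comparisons
print(bradley_terry_loss(r_pos, r_neg))
logits, labels = torch.randn(4), torch.tensor([1., 0., 1., 1.])  # unpaired labeled samples
print(classifier_loss(logits, labels))
print(classifier_reward(logits))
```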
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
- arxiv:https://arxiv.org/abs/2410.05527
- OpenReview:https://openreview.net/forum?id=2iYVBqRHK4
- Source: Recommended by a collaborator.
- Main content:
- preference-based index policy(?)
Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
- Source: A senior labmate's paper.
Data Center Cooling System Optimization Using Offline Reinforcement Learning
- arxiv:https://arxiv.org/pdf/2501.15085
- Source: A new paper from Xianyuan Zhan's group.
- Main content:
- T-symmetry.
SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
- arxiv:https://arxiv.org/abs/2407.04752
- Source: A mysterious paper recommended by a senior labmate; ICLR 2025 poster.
Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
- arxiv:https://arxiv.org/abs/2410.23680
- Source: A paper I came across by chance.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Source: Mentioned in passing by a senior labmate; a paper by others in the department.
Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
- arxiv:https://arxiv.org/abs/2505.21067
- Source: A paper I came across by chance.
Thinkless: LLM Learns When to Think
- arxiv:https://arxiv.org/abs/2505.13379
- Source: A paper I came across by chance.
Learning to Reason without External Rewards
- arxiv:https://arxiv.org/abs/2505.19590
- Source: A paper I came across by chance.