Today's papers don't feel particularly strong.
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
https://arxiv.org/abs/2509.24006
SLA, a trainable attention method combining sparse and linear attention, accelerates Diffusion Transformer models for video generation with minimal quality loss.
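The paper observes that attention weights in diffusion transformers can be decomposed into two parts: a small fraction of large, high-rank weights, and the remaining weights, which are low-rank.
It then splits the weights into three classes, critical, marginal, and negligible: critical ones are computed directly with full attention, marginal ones go through linear attention, and negligible ones are dropped.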
6+/10
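To make the critical / marginal / negligible split in SLA concrete, here is a rough NumPy sketch. It is not the paper's implementation: the block size, the mean-pooled block scores, the top/bottom fractions, the `elu(x) + 1` feature map, and the plain sum of the two branches are all assumptions chosen for illustration, and any learned or fine-tuned components are left out. The function names `classify_blocks` and `sla_sketch` are hypothetical.

```python
import numpy as np

def classify_blocks(q, k, block=4, top_frac=0.25, bottom_frac=0.25):
    """Label each (query-block, key-block) pair:
    1 = critical, 0 = marginal, -1 = negligible."""
    n, d = q.shape
    assert n % block == 0, "sketch assumes the sequence length is a multiple of block"
    nb = n // block
    # Mean-pool tokens inside each block to get cheap block-level scores (assumption).
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    scores = qb @ kb.T / np.sqrt(d)
    n_top = max(1, int(round(top_frac * nb)))
    n_bot = min(int(round(bottom_frac * nb)), nb - n_top)
    order = np.argsort(-scores, axis=1)  # key blocks sorted by descending score, per row
    rows = np.arange(nb)[:, None]
    labels = np.zeros((nb, nb), dtype=int)
    labels[rows, order[:, :n_top]] = 1
    if n_bot > 0:
        labels[rows, order[:, nb - n_bot:]] = -1
    return labels

def feature(x):
    # elu(x) + 1: a common positive feature map for linear attention (assumption).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def sla_sketch(q, k, v, block=4, top_frac=0.25, bottom_frac=0.25):
    n, d = q.shape
    labels = classify_blocks(q, k, block, top_frac, bottom_frac)
    # Expand block labels to a token-level label matrix of shape (n, n).
    tok = np.kron(labels, np.ones((block, block), dtype=int))

    # Critical pairs: exact softmax attention restricted to those pairs.
    logits = np.where(tok == 1, q @ k.T / np.sqrt(d), -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    o_sparse = (w / w.sum(axis=1, keepdims=True)) @ v

    # Marginal pairs: linear attention. Written as a dense masked product for
    # clarity; an efficient version would accumulate phi(K)^T V per block instead.
    a = (feature(q) @ feature(k).T) * (tok == 0)
    o_linear = (a @ v) / np.maximum(a.sum(axis=1, keepdims=True), 1e-9)

    # Negligible pairs contribute nothing. The plain sum is a simplification.
    return o_sparse + o_linear, labels

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out, labels = sla_sketch(q, k, v, block=2)
print(labels)     # block-level labels, one row per query block
print(out.shape)
```

The dense masks here cost as much as full attention; the point is only to show which pairs go to which branch. Any speedup would come from skipping negligible blocks and using the linear-attention recurrence for marginal ones.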
GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts
https://arxiv.org/abs/2509.25160
GSM8K-V is a new visual multi-image mathematical reasoning benchmark that highlights the limitations of current vision language models in handling visual mathematical problems.
GSM8K + AI-generated images.
5-/10
GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training
https://arxiv.org/abs/2509.24494
GRPO-MA improves the training of Chain-of-Thought reasoning in LLMs and VLMs by addressing gradient coupling, sparse rewards, and unstable advantage estimation through multi-answer generation.
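Instead of computing the advantage from a single thought-answer pair as in the standard GRPO pipeline, each thought is paired with multiple answers, which provides richer supervision.
As the abstract puts it, simple yet theoretically grounded.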
6/10
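A minimal sketch of the multi-answer idea in GRPO-MA, assuming a reward matrix with one row per sampled thought and one column per answer sampled from that thought. The exact normalization used in the paper may differ: here a thought's value is taken as the mean reward of its answers and normalized across the group of thoughts, and the answer-level advantage is normalized across all answers. Both choices, and the function name `grpo_ma_advantages`, are assumptions for illustration.

```python
import numpy as np

def grpo_ma_advantages(rewards, eps=1e-8):
    """rewards: array of shape (K thoughts, M answers per thought)."""
    rewards = np.asarray(rewards, dtype=float)
    # Thought value: average reward over the M answers sampled from that thought.
    # Averaging over several answers is what reduces the variance of the estimate.
    thought_value = rewards.mean(axis=1)
    # Group-normalized thought advantage, in the style of GRPO (assumption).
    thought_adv = (thought_value - thought_value.mean()) / (thought_value.std() + eps)
    # Answer advantage, normalized over all K * M answers (assumption).
    answer_adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    return thought_adv, answer_adv

# Hypothetical binary rewards: 3 thoughts, 4 answers each.
rewards = [[1, 1, 0, 1],
           [0, 0, 0, 1],
           [1, 1, 1, 1]]
thought_adv, answer_adv = grpo_ma_advantages(rewards)
print(thought_adv)
print(answer_adv.shape)
```

With a single answer per thought (M = 1), the thought advantage reduces to the usual GRPO group-normalized reward, so the sketch can be read as a generalization of that baseline.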
