Generative Skill Composition for LLM Agents

Zhao, Xinyu; Tan, Zhen; Tadiparthi, Vaishnav; Agarwal, Nakul; Lee, Kwonjoon; Moradi Pari, Ehsan; Nourkhiz Mahjoub, Hossein; Chen, Tianlong

Generative Skill Composition for LLM Agents

Xinyu Zhao¹ Zhen Tan² Vaishnav Tadiparthi³ Nakul Agarwal³ Kwonjoon Lee³ Ehsan Moradi Pari³ Hossein Nourkhiz Mahjoub³ Tianlong Chen¹^✉

¹The University of North Carolina at Chapel Hill ²Arizona State University ³Honda Research Institute USA

^✉ Corresponding author

Paper Code

TL;DR

As skill libraries grow, the bottleneck shifts from obtaining skills to composing them—deciding which skills to load, how many, and in what order. SkillComposer casts this as task-conditioned skill sequence prediction: a small constrained autoregressive decoder resolves all three jointly in a single pass, matching the gold-skill upper bound at lower prompt-token cost.

**Structured skill composition with SkillComposer.** **(A)** Large skill libraries create a composition bottleneck—an agent must decide not only *which* skills to use, but also their exact *count* and execution *order*. **(B)** Existing paradigms leave composition implicit: direct reasoning produces an unstructured trace, and retrieval returns an unordered subset. SkillComposer instead predicts an ordered, executable skill plan. **(C)** Structuring composition improves both plan Set F1 and downstream task success on SkillsBench.

Abstract

Recent LLM agents benefit from skills for solving complex tasks. Skills encapsulate modular packages of procedural knowledge and instructions for performing specialized tasks, such as setting up a sandboxed environment, running a test suite, or refactoring a function across multiple files. As skill libraries grow and become reusable across tasks and domains, selecting an appropriate skill composition has emerged as a central bottleneck. Existing approaches fall into two categories: one exposes the agent's reasoning to the entire skill collection; the other performs skill retrieval via embeddings or LLM-based rerankers. Both provide useful insights; however, they miss the structural nature of skill composition, which is a joint decision over which skills, how many, and in what order—three dimensions that cannot be decoupled. We formalize this as structured skill composition: given a task and a skill library, predict an executable skill plan that jointly specifies the activated subset, count, and execution order. We propose SkillComposer, which instantiates structured skill composition as task-conditioned skill sequence prediction. SkillComposer uses a constrained autoregressive decoder over skill identifiers, so subset, count, and order emerge jointly from a single decoding pass, and dependencies between successive skills are captured naturally. We build a training set of task–composition pairs from a real, human-curated skill library, and evaluate SkillComposer along two axes: composition quality on a held-out test set, and downstream task success on SkillsBench across two production-grade coding agents. On {GPT-5.2-Codex, Gemini-3-Pro-Preview}, SkillComposer raises the pass rate by {+23.1, +18.2} pp over the no-skill baseline, surpassing top-3 retrieval and matching the gold-skill retrieval upper bound at lower prompt-token cost.

Contributions

SkillComposer reframes inference-time skill use as a structured prediction problem over a fixed, reusable skill library.

Structured Composition

We formalize inference-time skill use as a structured prediction problem: the output plan jointly determines which skills to activate, how many are needed, and in what order they execute—three coupled dimensions that retrieval and direct reasoning leave implicit.

Library-Grounded Data

9,872 task–composition records over a real, human-curated 196-skill library. Starting from real task-composition seeds, we build a skill dependency graph and use layered synthesis with quality filtering for both single-skill grounding and dependency-aware multi-skill chains.

Generative Composer

A task-conditioned constrained autoregressive decoder over skill identifiers. Auxiliary cardinality and set-membership heads, plus a retrieval-augmented decoding prior, unify subset, count, and ordering while guaranteeing every output is an executable library skill.

Method

SkillComposer encodes the task–library context with a frozen retrieval-tuned encoder, then decodes a variable-length, ordered sequence of skill indices—fusing retrieval and set-membership priors into the decoder logits at inference time.

Key Results

Across two production-grade coding agents, better composition predictions translate into better agent execution—at a smaller prompt budget than retrieval or flooding the context.

+23.1

pp pass-rate gain on GPT-5.2-Codex over the no-skill baseline

+18.2

pp pass-rate gain on Gemini-3-Pro-Preview over the no-skill baseline

3.9M

trainable parameters — ~154× fewer than the 600M SFT baseline

+19.3

pp Set F1 over SFT on held-out real tasks under distribution shift

Composition is structural, not just selection. A ranked list does not say how many skills to use or in what order. SkillComposer resolves subset, count, and order jointly in one decoding pass, capturing dependencies between successive skills.

A small specialist beats a big generalist under shift. With only 3.9M trainable parameters, SkillComposer matches a 600M SFT model in-distribution and beats it by +19.3 pp Set F1 on held-out real software-engineering tasks—SFT memorizes the synthetic template and collapses.

Better plans translate to better agents. Upstream composition gains carry through to execution: +23.1 / +18.2 pp pass rate on {Codex, Gemini}, closing roughly 80% of the headroom to the gold-skill upper bound.

Accuracy at a lower token budget. SkillComposer matches or exceeds oracle retrieval while using the smallest prompt budget among skill-loaded conditions (1.03M Codex tokens). Flooding the context with all 196 skills recovers little and inflates the prompt to 1.27M tokens.

Downstream pass rate on SkillsBench (75 tasks, 3 attempts each) across two production-grade coding agents. **Tok.** is the average input prompt tokens per non-errored trial. *Gold Skills* is the oracle upper bound; best non-oracle pass rate in **bold**.
Skill condition	GPT-5.2-Codex		Gemini-3-Pro
Skill condition	Pass (%) ↑	Tok. ↓	Pass (%) ↑	Tok. ↓
Gold Skills (oracle)	51.1	1.12M	48.4	1.18M
No Skills	22.2	0.94M	25.8	0.99M
All Skills	29.3	1.27M	38.7	1.33M
Retrieval (top-3)	44.0	1.09M	41.8	1.14M
SkillComposer (ours)	45.3	1.03M	44.0	1.08M

Analysis

SkillComposer is Pareto-optimal among predicted-k methods and is most robust exactly where over-emitting skills is most costly.

Compute-accuracy frontier — **Compute–accuracy frontier.** On the in-distribution test split, SkillComposer is Pareto-optimal: it matches or beats the 600M SFT model with ~154× fewer trainable parameters and ~25× less training compute, and clearly dominates retrieval and the LLM-judge.

**Cardinality robustness.** Set F1 stratified by gold cardinality k. SkillComposer's advantage concentrates at k = 1, where over-emitting skills is most punished, yielding the best macro-averaged Set F1.

BibTeX

@misc{zhao2026skillcomposer,
  title  = {Generative Skill Composition for LLM Agents},
  author = {Zhao, Xinyu and Tan, Zhen and Tadiparthi, Vaishnav and Agarwal, Nakul
            and Lee, Kwonjoon and Moradi Pari, Ehsan and Nourkhiz Mahjoub, Hossein
            and Chen, Tianlong},
  year   = {2026},
  note   = {Preprint},
  url    = {https://skill-composer.github.io/}
}