

The SKILL.md file has become one of the most consequential new artifacts in software development. Anthropic published the Agent Skills specification as an open standard in December 2025, and within 90 days, 32 tools had adopted it, including Codex, Copilot, Gemini CLI, JetBrains, and Spring AI. Separately, the AGENTS.md format, which provides persistent project context rather than on-demand skills, has been adopted by over 60,000 open source repositories and was donated to the Linux Foundation's Agentic AI Foundation. Together, these Markdown-based instruction formats reflect a practical consensus: AI agents need persistent, composable instruction sets to produce consistent work.
The scale of adoption makes this more than a niche concern. The JetBrains AI Pulse survey of over 10,000 developers found that 74% had adopted specialized AI coding tools by January 2026. The SonarSource State of Code survey of over 1,100 developers reported that 64% had started using AI agents specifically. These agents are reading skills and context files in production, not in experiments.
But the conversation about skills has been almost entirely about creation. How to write a good SKILL.md. How to structure the YAML frontmatter. How to design descriptions that trigger at the right moment. The harder problem, and the one that deserves more attention, is what happens after you ship the skill. The maintenance. The drift. The gradual degradation of agent behavior when the instructions underneath the agent go stale.
This is not a theoretical concern. At Forte Group, we maintain custom skills for document generation, podcast preparation, CV creation, meeting follow-up drafting, and repository analysis. Every one of them has required revision, sometimes because the underlying tooling changed, sometimes because we learned through usage that the instructions produced subtly wrong outputs, and sometimes because a downstream process shifted and the skill simply stopped matching reality. If you are building agentic workflows for production use, skills maintenance is already one of your operating costs. You should treat it that way.
The reason skill maintenance is harder than it looks comes down to a property unique to this new artifact type: skills are not code, but they behave like code. A SKILL.md file executes inside a language model's context window, not a runtime, which means its failure modes are probabilistic rather than deterministic. Anthropic's own context engineering documentation describes "context rot," the phenomenon where a model's ability to accurately recall information decreases as token count increases, even before the hard context limit is reached. When a skill degrades, the agent does not throw an error. It produces output that is slightly worse, follows a path that is almost right, or applies an instruction that was accurate six weeks ago but no longer reflects the current state of the codebase, the API, or the team's conventions.
This is the core challenge. Traditional software maintenance has the advantage of compilation errors, test failures, and stack traces. Skill maintenance has none of these. A stale skill is the context engineering equivalent of silent data corruption in a database: the system keeps running, the outputs look plausible, and nobody raises an alarm. Lightrun's 2026 State of AI-Powered Engineering Report, surveying 200 senior SRE and DevOps leaders at large enterprises, found that 43% of AI-generated code changes required manual debugging in production even after passing QA. The Faros AI Engineering Report found that pull requests merged without any review, human or agentic, increased 31.3% year over year. These are not skill-specific findings, but they describe the environment in which stale skills operate: a system where plausible-looking output passes through review unchallenged. A stale skill can run for weeks before anyone notices the output quality has slipped, because the agent will do its best to follow whatever instructions it has, and its best is often good enough to pass a casual review.
The problem compounds at scale. A single project might have skills for frontend component generation, API testing, deployment workflows, database migrations, documentation standards, and code review checklists. Each of these skills references assumptions about the project's current state. When any of those assumptions change and the skill does not, the agent is working from a model of the project that no longer exists. The Digital Applied developer survey found that engineers now spend 11.4 hours per week reviewing AI-generated code versus 9.8 hours writing new code, a reversal of the pattern from two years ago. Review time is already the bottleneck. Adding stale skills to the mix means that review effort is being spent on output shaped by outdated instructions.
The practitioners who are doing this well have converged on several patterns, even if the tooling has not yet caught up.
Treat skills like code, not documentation. This is the most important mental model shift. Skills belong in version control. They should go through code review. Changes to project architecture, testing frameworks, deployment pipelines, or API contracts should trigger a review of every skill that references those systems. The Vercel team's experience with a Next.js docs skill illustrates a related point: in their eval suite (documented in their engineering blog, "AGENTS.md Outperforms Skills in Our Agent Evals"), the skill was never invoked by the agent in 56% of cases and actually degraded test performance compared to baseline on some metrics. The skill existed, the agent could use it, and the agent chose not to. This is the kind of failure you only catch through systematic evaluation, not through casual observation. And if you are not tracking skills in version control with the same rigor as your application code, you will not know when to run that evaluation.
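One way to make "changes should trigger a skill review" concrete is a small pre-merge check. The sketch below is a heuristic, not a standard tool: it assumes skills live under a `skills/` directory and flags any SKILL.md that mentions the bare filename of a file changed in the commit, so a human knows to re-review it.

```python
from pathlib import Path

# Heuristic review trigger: given the files changed in a commit, list skills
# whose text mentions any of those filenames. The skills/ layout and the
# bare-filename matching are assumptions, not part of any specification.
def skills_to_review(changed_files: list[str], skills_root: str = "skills") -> list[str]:
    names = [Path(f).name for f in changed_files]
    flagged = []
    for skill in Path(skills_root).rglob("SKILL.md"):
        text = skill.read_text()
        if any(name in text for name in names):
            flagged.append(str(skill))
    return sorted(flagged)
```

A check like this produces false positives, which is acceptable here: the cost of re-reading a skill is minutes, while the cost of a silently stale one is weeks of degraded output.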
Keep always-on context minimal and push depth into skills. The ETH Zurich AGENTbench study (Gloaguen et al., February 2026) found that AGENTS.md files often decreased agent pass rates compared to agents that simply explored the repository themselves. The researchers evaluated four coding agents across both SWE-bench Lite and a new benchmark of repositories with developer-committed context files, and attributed the performance decrease to attention dilution: filling the context window with a large block of project description diluted the model's focus on the actual task. Separate research from Dometrain and HumanLayer converged on a practical guideline: context files under 60 to 100 lines are more reliably followed than longer ones. The implication for skills is straightforward. Keep your CLAUDE.md or AGENTS.md file short, covering only the information that applies to every task, and push specialized knowledge into skills that load on demand through progressive disclosure. This is a quality concern, not just an efficiency one. A focused, task-relevant skill loaded at the right moment will outperform a monolithic instruction file that tries to cover everything.
There is a genuine tension in the current evidence that is worth acknowledging. Vercel found that passive context in AGENTS.md outperformed on-demand skills for framework documentation, because agents frequently failed to invoke the skill when they should have. ETH Zurich found that passive context in AGENTS.md itself could degrade performance through attention dilution. These findings are not contradictory; they point to the same underlying problem from different angles: the agent needs the right information at the right time, and neither "always loaded" nor "loaded on demand" guarantees that outcome today. Skill maintenance is not just about keeping files current but about the ongoing strategic question of where context should live and how reliably the agent will use it.
Write descriptions as if they are routing logic, because they are. The description field in a skill's YAML frontmatter is not a human-readable summary but the primary mechanism by which the language model decides whether to load the skill. A vague description means the skill fires when it should not or, worse, never fires at all. The description should specify what the skill does, when to use it, and when not to use it. Negative examples are as important as positive ones. If your skill handles Word document generation but not PDF creation, say so explicitly.
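Because the description is routing logic, it can be linted like routing logic. The sketch below applies two crude heuristics to a SKILL.md's frontmatter: does the description say when to trigger, and does it state a negative scope? The marker phrases are assumptions chosen for illustration; a real lint would use a vocabulary tuned to your team's writing style.

```python
import re

# Rough lint for the description field, following the guidance above: a
# routing-quality description states when to use the skill and when not to.
# These marker phrases are heuristic assumptions, not part of any spec.
TRIGGER_MARKERS = ("use when", "use this when", "when the user", "when asked")
NEGATIVE_MARKERS = ("do not use", "not for", "does not", "excluding")

def lint_description(skill_md: str) -> list[str]:
    match = re.search(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    desc_match = re.search(r"^description:\s*(.+)$", match.group(1), re.MULTILINE)
    if not desc_match:
        return ["frontmatter has no description field"]
    desc = desc_match.group(1).lower()
    problems = []
    if not any(m in desc for m in TRIGGER_MARKERS):
        problems.append("description never says when to trigger")
    if not any(m in desc for m in NEGATIVE_MARKERS):
        problems.append("description has no negative scope (when NOT to use)")
    return problems
```

A description like "Generates Word documents from templates. Use when the user asks for a .docx deliverable. Not for PDF creation." passes both checks; "Makes documents." fails both.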
Run evals against your skills, not just your agents. The Vercel results should be a warning to anyone who writes a skill and assumes it works. If you are not testing invocation precision, description accuracy, and output correctness against current project state, you are guessing. Skills that are not evaluated are skills you do not understand.
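Invocation precision is straightforward to score once you have a labeled prompt set. The harness that actually runs the agent and records whether the skill fired is out of scope here and assumed; the sketch below only shows the scoring step, which is where the Vercel-style finding ("never invoked in 56% of cases") would surface as low recall.

```python
# Score skill invocation against a labeled eval set. "expected" maps each
# prompt to whether the skill SHOULD fire; "observed" records whether it DID
# fire when the agent ran (producing "observed" is assumed, not shown).
def invocation_metrics(expected: dict[str, bool],
                       observed: dict[str, bool]) -> dict[str, float]:
    tp = sum(1 for p, e in expected.items() if e and observed.get(p, False))
    fp = sum(1 for p, e in expected.items() if not e and observed.get(p, False))
    fn = sum(1 for p, e in expected.items() if e and not observed.get(p, False))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}
```

Low recall means the skill exists but the agent ignores it, which is usually a description problem; low precision means the skill fires on prompts it should not, which wastes context and couples unrelated workflows.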
Compress aggressively when embedding reference material. The Vercel team reduced a 40KB documentation injection to 8KB, an 80% reduction, using a pipe-delimited index structure, and maintained a 100% pass rate on their eval suite. The agent needed an index that told it where to find specific information and the ability to read those files on demand, not the full documentation in context. This is the same principle behind progressive disclosure, but applied to the reference material within skills rather than to the skills themselves.
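The exact index format Vercel used is not reproduced here, so the sketch below shows the general shape under stated assumptions: one pipe-delimited row per documentation file, carrying a title, a path the agent can read on demand, and a few section headings as lookup keywords.

```python
from pathlib import Path

# Pipe-delimited index in the spirit of the approach described above:
# "title|path|heading;heading;..." per file. The agent reads the full file
# only when the index points it there. The field layout is an assumption.
def build_index(docs_dir: str) -> str:
    rows = []
    for path in sorted(Path(docs_dir).rglob("*.md")):
        lines = path.read_text().splitlines()
        # First "# " heading is the title; fall back to the filename.
        title = next((l[2:] for l in lines if l.startswith("# ")), path.stem)
        headings = [l.lstrip("# ") for l in lines if l.startswith("##")]
        rows.append(f"{title}|{path}|{';'.join(headings[:5])}")
    return "\n".join(rows)
```

An index like this stays a few kilobytes even for a large documentation set, because it grows with the number of files rather than their contents.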
Not everything belongs in a SKILL.md file. This is a mistake I see teams make repeatedly, and it is worth stating plainly: skills are for workflows. Documents are for knowledge.
A skill should encode a procedure: a series of steps the agent follows to produce an output. "When the user asks to create a React component, follow these steps to generate it according to our design system." That is a skill: it has a trigger condition, a sequence of actions, and a defined output.
A document, by contrast, encodes facts that the agent might need to reference during any number of different workflows. A database schema. An API specification. A style guide. A list of approved vendor integrations. A glossary of domain terms. These are reference material, not workflows. Putting them into SKILL.md files forces the agent to load an entire skill just to look up a fact, wasting context window capacity and creating unnecessary coupling between the reference material and a specific workflow.
The practical rule is this:
If the content describes how to do something, it is a skill. If the content describes what something is, it is a document.
If you find yourself writing a SKILL.md that is mostly reference material with a thin wrapper of instructions around it, split it. Put the reference material in a document file that multiple skills can access, and keep the skill focused on the procedure.
This matters because of how progressive disclosure works. Skills load their full SKILL.md body only when triggered. Documents referenced by a skill load only when a specific step calls for them. This two-stage loading keeps the context window lean. But if you pack documents into skills, you lose the second stage. The agent loads everything at once, and you are back to the bloated context problem.
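The two stages can be made explicit in code. This toy sketch assumes a conventional layout in which each skill is a directory holding its SKILL.md plus the reference documents it cites; the directory convention and function names are illustrative, not part of the specification.

```python
from pathlib import Path

# Toy illustration of two-stage loading: the skill body loads on trigger,
# a referenced document loads only when a step calls for it. The
# skill-directory layout assumed here is a convention, not a requirement.
def load_skill(skill_dir: str) -> str:
    # Stage 1: full SKILL.md body, loaded only when the skill triggers.
    return (Path(skill_dir) / "SKILL.md").read_text()

def load_reference(skill_dir: str, doc_name: str) -> str:
    # Stage 2: a specific reference document, loaded only on demand.
    return (Path(skill_dir) / doc_name).read_text()
```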
There is also a maintenance argument. Reference material changes on a different cadence than workflows. Your API specification might change weekly. Your component generation workflow might change quarterly. If they live in the same file, every API change triggers a review of the workflow instructions, and every workflow change risks disturbing reference material you never intended to modify. Separation of concerns is not just a software architecture principle. It applies to agent instruction design with equal force.
All of these best practices require discipline because the industry has built the runtime for skills but not the static analysis layer. Thirty-two tools can read and execute a SKILL.md. None of them can audit one. There is no way to run a type check against a SKILL.md to verify that every file path it references still exists. There is no linter that warns you when a skill's instructions reference a deprecated API. There is no automated way to detect that a skill's description has drifted out of alignment with what the skill actually does.
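Until that static analysis layer exists, the simplest check is worth scripting yourself. The sketch below finds relative file paths mentioned in a SKILL.md and reports any that no longer exist on disk. The path-matching regex is a heuristic assumption and will miss some reference styles, but it catches the most common form of drift: a skill pointing at a file that has been moved or deleted.

```python
import re
from pathlib import Path

# No off-the-shelf linter does this today, so this sketch performs the
# simplest useful audit: find relative paths mentioned in a SKILL.md and
# report any that no longer exist. The path regex is a heuristic assumption.
PATH_PATTERN = re.compile(r"(?<![\w/])((?:[\w.-]+/)+[\w.-]+\.\w+)")

def find_dead_paths(skill_file: str) -> list[str]:
    skill = Path(skill_file)
    missing = []
    for ref in set(PATH_PATTERN.findall(skill.read_text())):
        # Resolve references relative to the skill's own directory.
        if not (skill.parent / ref).exists():
            missing.append(ref)
    return sorted(missing)
```

Run on every SKILL.md in CI, a check like this turns one class of silent staleness into a loud build failure.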
This will likely improve. The Agent Skills specification is developing security models and signing mechanisms. The major platform vendors are investing in skills as a first-class primitive, and the Agentic AI Foundation provides a governance structure for the surrounding standards. But the tooling gap is real today, and teams building production agentic workflows need to bridge it with process until better tooling arrives.
The teams that will do this well are the ones that recognize skills as a hybrid of code and documentation that inherits maintenance obligations from both. The organizations that treat skill maintenance as an afterthought will wonder why their agents produce inconsistent output. The ones that build maintenance into their development workflow from the beginning will get progressively more value from agentic development as their skill libraries mature.
A SKILL.md file is a small Markdown file with some YAML at the top. But the gap between a well-maintained skill and a stale one is the gap between an agent that makes your team faster and an agent that makes your team less confident in what it will produce. Managing that gap is not optional if you are serious about agentic workflows in production.
Gloaguen et al. "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" ETH Zurich, February 2026. arxiv.org/abs/2602.11988
Vercel Engineering. "AGENTS.md Outperforms Skills in Our Agent Evals." vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals
Lightrun. "2026 State of AI-Powered Engineering Report." Survey of 200 senior SRE and DevOps leaders, January-February 2026. Reported by VentureBeat, April 2026.
Faros AI. "The AI Engineering Report 2026: The AI Acceleration Whiplash." Telemetry from 22,000 developers across 4,000 teams. faros.ai/blog/ai-acceleration-whiplash-takeaways
JetBrains Research. "Which AI Coding Tools Do Developers Actually Use at Work?" AI Pulse survey, 10,000+ developers, January 2026. blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work
SonarSource. "State of Code Developer Survey Report 2026." Survey of 1,100+ developers. sonarsource.com/resources/developer-survey-report
Digital Applied. "AI Coding Tool Adoption 2026: Developer Survey Results." Survey of 2,847 respondents across 320 agencies and teams. digitalapplied.com/blog/ai-coding-tool-adoption-2026-developer-survey
Anthropic. "Context Engineering: Memory, Compaction, and Tool Clearing." Claude Cookbook, March 2026. platform.claude.com/cookbook/tool-use-context-engineering-context-engineering-tools
Anthropic. "Agent Skills Specification." agentskills.io