
Minions - Part II: When AI Workflows Start Talking to Each Other
Mar 5, 2026
Jayanth Krishnaprakash

The Ceiling
After a few weeks of running skills, we hit a wall. Each skill was good at its job. The commit skill formatted messages consistently. The implementation skill enforced TDD. The review skill caught real bugs. Individually, they worked. But they didn't know about each other.
An engineer would plan a feature, implement it, open a PR, and request a review. Four workflows, four conversations, four contexts. The review skill didn't know what the original plan said. The PR description didn't reference the implementation decisions. We had automated islands. We needed them connected.
Composition: Skills That Call Skills
The fix was straightforward: let skills call other skills. The PR review skill doesn't just read the diff. It pulls in automated code review findings from a separate tool. It checks if there's a plan document. It cross-references its own analysis with the external one. Then it presents everything together.
The planning skill doesn't just write a document. It creates tracked issues, sets up the branch, and hands off context to the implementation skill. A senior engineer reviews code the same way. They check it against the original requirements. They remember that similar code broke last month. They carry context between stages. Skills that compose do the same thing.
No stage starts from zero.
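A minimal sketch of one composed skill, in Python. `fetch_tool_findings`, `find_plan_doc`, and `analyze_diff` are hypothetical stand-ins for the real integrations; the point is only the shape — the review skill calls other capabilities and merges their output into one report.

```python
# Hypothetical sketch of skill composition. The three helpers below are
# stand-ins, not real integrations.

def fetch_tool_findings(pr):
    # Stand-in for the external automated code-review tool.
    return [f"tool finding for {pr}"]

def find_plan_doc(pr):
    # Stand-in for the plan-document lookup.
    return f"plan for {pr}"

def analyze_diff(pr):
    # Stand-in for the skill's own diff analysis.
    return [f"skill finding for {pr}"]

def review_skill(pr):
    # The composed skill presents everything together, not just the diff.
    return {
        "own_analysis": analyze_diff(pr),
        "tool_findings": fetch_tool_findings(pr),
        "plan": find_plan_doc(pr),
    }

report = review_skill("PR-101")
```

Each sub-call is just another function, so swapping a stand-in for a real tool integration doesn't change the composed skill's structure.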
Cross-Referencing
One AI analysis is an opinion. Two independent analyses compared against each other is signal. We run automated code review tools alongside our own skill-based analysis. They look at the same PR independently. Then we compare. When both flag the same issue, confidence is high. When they disagree, that's where the interesting findings are.
Sometimes our analysis catches something the tool missed because of domain knowledge. The tool doesn't know that a specific API always returns that field because of an upstream contract. The skill knows, because it learned that three PRs ago. Sometimes the tool catches something our analysis missed. Different heuristics, different perspective. We bucket every finding into three categories:
| Outcome | Interpretation |
|---|---|
| Both agree | High confidence. Almost always a real issue. |
| Only one source flagged it | Needs human judgment. |
| Sources contradict | Usually reveals a nuance worth documenting. |
Don't trust any single AI analysis as ground truth. Layer them. Compare them.
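The three-way bucketing above can be sketched in a few lines. Assume findings are keyed by location (here a `(file, line)` tuple) with a verdict as the value — real matching would be fuzzier than exact key equality.

```python
# Hypothetical sketch of three-way finding comparison. Keys and verdicts
# are illustrative; real matching logic would be looser.

def bucket_findings(skill_findings, tool_findings):
    buckets = {"both_agree": [], "single_source": [], "contradict": []}
    for key in skill_findings.keys() | tool_findings.keys():
        ours = skill_findings.get(key)
        theirs = tool_findings.get(key)
        if ours and theirs:
            # Same location flagged by both sources: do the verdicts match?
            if ours == theirs:
                buckets["both_agree"].append(key)
            else:
                buckets["contradict"].append(key)
        else:
            # Only one source flagged this location.
            buckets["single_source"].append(key)
    return buckets

skill = {("api.py", 42): "missing-null-check", ("db.py", 10): "n-plus-one"}
tool = {("api.py", 42): "missing-null-check", ("ui.py", 7): "unused-var"}
buckets = bucket_findings(skill, tool)
```

In this example the shared `("api.py", 42)` finding lands in `both_agree`, and the two unmatched findings land in `single_source` for human judgment.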
The False Positive Problem
Automated code review tools are noisy. After enough false positives, you start ignoring everything, including the real issues. We had the same problem. The tool would flag something. Our cross-reference would disagree. The reviewer would dismiss it. Next PR, same flag. Same dismissal. So we built a feedback loop.
Every time a finding gets dismissed, the system records the pattern. Not the specific instance. The pattern. "This tool flags X in situation Y, and it's been wrong about it three times across three PRs." Next time that pattern appears, the system annotates it: "This was dismissed before. Here's why." After three dismissals of the same pattern, the system asks: "Want to suppress this?"
The system gets quieter over time, not louder. No one maintains a suppression list manually. If you told a junior engineer "that's not a bug" three times, they'd stop flagging it. Your automated tools should do the same.
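The dismissal loop above reduces to a pattern counter plus a threshold. A minimal sketch, assuming patterns are identified by a string key (the real system would normalize tool + rule + situation into that key):

```python
from collections import Counter

# Hypothetical sketch of the dismissal feedback loop: count patterns, not
# instances, and offer suppression after a threshold. The threshold value
# and pattern key format are assumptions.

SUPPRESS_AFTER = 3

class FeedbackLoop:
    def __init__(self):
        self.dismissals = Counter()  # pattern -> times dismissed
        self.suppressed = set()

    def record_dismissal(self, pattern):
        self.dismissals[pattern] += 1
        if self.dismissals[pattern] >= SUPPRESS_AFTER:
            # Ask, don't auto-suppress: the human stays in the loop.
            return f"'{pattern}' dismissed {self.dismissals[pattern]}x. Suppress?"
        return None

    def annotate(self, pattern):
        if pattern in self.suppressed:
            return None  # silenced entirely
        if self.dismissals[pattern]:
            return f"Dismissed {self.dismissals[pattern]} time(s) before."
        return None

loop = FeedbackLoop()
loop.record_dismissal("flags-X-in-situation-Y")
loop.record_dismissal("flags-X-in-situation-Y")
prompt = loop.record_dismissal("flags-X-in-situation-Y")  # third time: ask
```

Nothing is suppressed without the human saying yes; the loop only surfaces the history and asks the question.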
Human Approval as Architecture
This lesson cost us the most. Early on, the review skill would analyze a PR and post findings directly to GitHub. Fast. And completely wrong as an approach. The skill posted a comment that was technically correct but contextually wrong. The code was intentionally written that way. The team had agreed on it in a meeting two days prior. The skill didn't know. It posted publicly on a PR the author had already discussed with the team lead.
Trust eroded. "The AI bot doesn't understand our codebase" became a thing people said. The fix wasn't better analysis. It was a gate. Nothing gets posted without human approval. The skill runs its analysis, presents findings to the reviewer privately, and waits. The reviewer edits, removes, or approves. Only then does anything become visible.
The review still takes minutes. But now the reviewer is curating, not damage-controlling. The cost of asking "should I post this?" is near zero. The cost of posting something wrong is trust.
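The gate itself is a small piece of control flow. A sketch, where `ask_reviewer` and `post_to_github` are hypothetical callables standing in for the private review UI and the real GitHub integration:

```python
# Hypothetical sketch of the approval gate: nothing reaches GitHub until a
# human has curated the draft. Both callables are stand-ins.

def review_with_gate(findings, ask_reviewer, post_to_github):
    # The reviewer edits, removes, or approves; only the curated
    # list ever becomes publicly visible.
    curated = ask_reviewer(findings)
    for finding in curated:
        post_to_github(finding)
    return curated

posted = []
draft = [
    "error handling looks off in upload()",
    "missing test for retry path",
]
# The reviewer drops the first finding: the team already agreed on that design.
review_with_gate(draft, lambda findings: findings[1:], posted.append)
```

The structural point is that the posting function is only ever called on the reviewer's output, so a contextually wrong finding dies privately instead of publicly.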
Skills Beyond Code
The pattern works outside of code too. We built a skill for cloud operations. Same structure: a workflow with phases and gates, a learnings file for edge cases. But instead of "write test, then implement," it's "assess risk, check guardrails, then execute." Every operation gets classified:
| Level | Example | Gate |
|---|---|---|
| Low | List instances, check logs | Auto-approved |
| Medium | Modify security group, update config | Confirmation required |
| High | Terminate instances, modify IAM | Explicit approval + audit log |
| Forbidden | Delete production databases | Blocked |
The learnings file accumulated things like: "This instance type takes 4 minutes to stop, not 30 seconds." "This region doesn't support that service." "Three other services reference this security group." Same self-improvement loop. Same team sync. Different domain. If your team has a repeatable process with consequences when done wrong, it's a skill waiting to be written. Incident response. Database migrations. Release management.
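The risk table above maps naturally to a small classifier: match the operation against ordered rules, most dangerous first, and default anything unmatched to the lowest tier. Operation names and rule patterns here are illustrative, not the real rule set:

```python
# Hypothetical sketch of the risk-gating table. Rule patterns and operation
# names are illustrative; the real rules would be richer.

GATES = {
    "low": "auto-approve",
    "medium": "require-confirmation",
    "high": "require-approval-and-audit",
    "forbidden": "block",
}

# Ordered most-dangerous-first so the strictest matching rule wins.
RISK_RULES = [
    ("delete_production_db", "forbidden"),
    ("terminate_instance", "high"),
    ("modify_iam", "high"),
    ("modify_security_group", "medium"),
    ("update_config", "medium"),
]

def classify(operation):
    for prefix, level in RISK_RULES:
        if operation.startswith(prefix):
            return level
    return "low"  # read-only operations default to the lowest tier

def gate_for(operation):
    return GATES[classify(operation)]

gate = gate_for("terminate_instance i-123")  # -> "require-approval-and-audit"
```

One design choice worth noting: defaulting unknown operations to "low" is the permissive option; a stricter system would default to "medium" or refuse to classify.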
The Full Lifecycle
A single feature goes through five stages: plan, implement, open PR, review, ship.
Each stage reads the output of the previous one. The implementation skill reads the plan. The PR skill reads the plan and the implementation decisions. The review skill checks code against the original plan. The release skill audits everything before merging. No stage starts from zero. No engineer re-explains context.
By the time a feature ships, the system has the full history: what was planned, what was built, what was flagged, what was resolved. One chain from idea to production. Most teams lose context at every handoff. The reviewer hasn't seen the plan. The person shipping hasn't seen the review. Composed skills fix that.
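The chain can be sketched as a shared history that accumulates through the five stages. This is a toy model of the handoff, not the real implementation; the stage outputs here are placeholders:

```python
# Hypothetical sketch of the five-stage chain: one shared record accumulates,
# so each stage reads everything the previous stages produced.

STAGES = ["plan", "implement", "open_pr", "review", "ship"]

def run_stage(name, history):
    # Each stage sees the full history so far and appends its own output.
    prior = len(history)  # no stage starts from zero
    history[name] = f"{name} output (saw {prior} prior stages)"
    return history

def ship_feature():
    history = {}
    for stage in STAGES:
        run_stage(stage, history)
    return history  # the full chain from idea to production

record = ship_feature()
```

By the last stage, `record` holds every prior output — the toy equivalent of the reviewer having seen the plan and the shipper having seen the review.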
What Broke
The comment that shouldn't have been posted. The review skill flagged error handling on a PR. Technically correct. But the team had agreed to handle it differently two days prior. The skill posted anyway. That's when we added the gate. A reviewer would have caught it in five seconds.
The tool that cried wolf. An automated tool flagged the same pattern on twelve consecutive PRs. Wrong every time. After the third dismissal, the feedback loop offered to suppress it. Noise dropped roughly 40% overnight.
The integration that wasn't. We spent hours debugging a protocol mismatch with an external service. Dozens of other teams had hit the same issue. Known bug. We rebuilt the skill with a fallback. Primary method: simple API calls. Fancy protocol: optional when it works.
The schema merge that broke everything (again). An engineer renamed a database field. Three services broke. We'd documented this problem before, but in a different skill. Now schema learnings propagate to every skill that touches the database.
What's Next
Skill evaluation. Testing the skills themselves. Does the review skill catch bugs? We're building benchmarks that measure quality and track regressions when models update.
Cross-repository knowledge. Skills share knowledge within a team. Next step: sharing across projects. "Every team using queue processors hits this edge case" shouldn't need rediscovering.
Institutional memory beyond skills. Engineering knowledge lives in debugging sessions, architecture discussions, and design decisions too. We want any engineer to ask "has anyone debugged this before?" and get an answer.
Key Takeaways
Compose, don't isolate. Context should flow between stages, not restart at each one.
Cross-reference everything. Two independent analyses compared are better than one trusted blindly.
Build feedback loops. Record dismissals. Let the system learn from corrections. It should get quieter over time.
Gate everything visible. AI that acts without human approval is a liability.
The pattern is universal. Code review, infrastructure, releases, incident response. Anything repeatable with consequences.
Every failure is a learning. Each one made the system better for the next engineer.
The skills are better today than last month. Not because we're rewriting them. Because they're rewriting themselves.
