
TL;DR
Generative systems can assist anywhere, but agentic autonomy requires verifiability.
This is the design principle behind the most successful agentic products today (e.g., Copilot, Agentforce, LangGraph workflows).
Background
- “We want to use AI!”
- “I have a toilsome task I want to automate with AI!”
- “My boss says USE AI!”
- “I need to use AI or get left behind!”
We hear these things all the time. The key consideration in responding to them is problem selection.
Which tasks are suitable for “traditional software” (e.g., an Apex function, a Python script) and which are suitable for a GPT or an agent? My job involves these kinds of assessments, and I’ve spent some time looking into the question.
Interesting research (see references) tells us that:
Verifiability == Good Match for Generative AI
Highly Verifiable Tasks
A task is highly verifiable when:
- The output can be checked mechanically, automatically, or by inspection
- Any given answer can be judged right or wrong, even when there is a small set of acceptable answers rather than a single correct one
- Ground truth exists outside the model
- Precision is required and testable
Examples:
- Classify an email into the correct queue
- Propose next-best actions from structured opportunity data
- Generate clean JSON for an API
- Write a draft SOQL/SQL query and validate by executing it
- Summarize and/or score a call transcript against a rubric
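To make the “validate by executing it” example concrete, here is a minimal sketch of a mechanical checker for a drafted SQL query. It assumes a scratch in-memory SQLite database; the `opportunity` table and its columns are hypothetical and stand in for whatever schema the real task targets. Note that it only confirms the query runs against the schema, not that it answers the intended question.

```python
import sqlite3
from typing import Tuple

# Hypothetical schema, used only for illustration.
SCHEMA = "CREATE TABLE opportunity (id INTEGER PRIMARY KEY, stage TEXT, amount REAL)"


def validate_query(sql: str) -> Tuple[bool, str]:
    """Run a drafted query against an empty scratch database.

    Returns (True, "ok") if the query executes, otherwise
    (False, <database error message>).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(SCHEMA)
        conn.execute(sql).fetchall()
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


good = validate_query("SELECT stage, SUM(amount) FROM opportunity GROUP BY stage")
bad = validate_query("SELECT stagee FROM opportunity")  # misspelled column
print(good)
print(bad)
```

The error message from a failed run is exactly the kind of signal a model can be given when asked to revise its draft.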
Why it’s ideal for agentic tech:
- The agent can execute → check itself → retry
- Generative reasoning is used to propose answers, not guarantee correctness
- Verification ensures safety and correctness
- Humans no longer have to “trust the magic” — they only trust the checker
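The execute, check, retry loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular product implements it: `propose` stands in for a model call (here a stub that replays canned drafts), and `REQUIRED_KEYS` is a hypothetical schema for an email-routing payload.

```python
import json
from typing import Callable, Optional, Tuple

REQUIRED_KEYS = {"queue", "priority"}  # hypothetical schema for illustration


def verify_json(candidate: str) -> Tuple[bool, str]:
    """Mechanical checker: the text must parse and contain the required keys."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value must be an object"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"


def run_with_verification(
    propose: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, check it, and retry with the checker's feedback."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, attempt
    return None, max_attempts  # give up and hand off to a human


def make_stub_proposer() -> Callable[[Optional[str]], str]:
    """Stand-in for a model call: replays canned drafts, ignoring feedback."""
    drafts = iter([
        '{"queue": "billing"',                     # truncated JSON
        '{"queue": "billing"}',                    # missing "priority"
        '{"queue": "billing", "priority": "high"}',
    ])
    return lambda feedback: next(drafts)


result, attempts = run_with_verification(make_stub_proposer(), verify_json)
print(result, attempts)
```

The design choice worth noting is that the loop never trusts the proposer: only the checker decides whether a candidate is accepted, and exhausting the attempts returns `None` so the task can be escalated.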
Moderate Verifiability
A task is moderately verifiable when:
- There is no single right answer, but human judgment can assess quality quickly
- There are heuristics, rubrics, or constraints that aid evaluation
- Correctness is subjective but improvable through iteration
Examples:
- Drafting an AI agent charter
- Creating a discovery plan or issue tree
- Drafting SDR sequences
- Summaries that need nuance
- First-draft slides or narrative documents
Why it works:
- The model accelerates creation
- Humans evaluate and nudge
- Output can be improved with structured rubrics (your ARC-IP idea lives here)
This is “copilot territory” rather than fully autonomous agent territory.
Low Verifiability
A task exhibits low verifiability when:
- There is no objective ground truth
- The output cannot be checked for correctness
- Success is inherently subjective or emergent
- Stakes are high and errors are costly
Examples:
- Strategic decisions with incomplete data
- Opinions, values, philosophical positions
- “Tell me what my activation org should become long-term”
- “Design a data model no one has built before with no constraints”
- Open-ended creative work with no rubric
These tasks benefit from LLMs for idea generation, but not autonomy.
Why agents struggle:
- The model cannot know when it’s wrong
- No reliable automated evaluation
- Risk compounds as reasoning chains get longer
This is where you stay in a human-driven loop, using the LLM as a partner.
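The point about risk compounding over longer reasoning chains can be made concrete with some simple arithmetic. The sketch below assumes each step succeeds independently with the same probability, which is a simplification, and the 95% figure is purely illustrative. The second function assumes a perfect checker that allows retries, which is only available when the task is verifiable.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


def step_success_with_retries(per_step_success: float, attempts: int) -> float:
    """Per-step success when a perfect checker allows up to `attempts` tries."""
    return 1 - (1 - per_step_success) ** attempts


for steps in (1, 5, 10, 20):
    unchecked = chain_success(0.95, steps)
    checked = chain_success(step_success_with_retries(0.95, 3), steps)
    print(f"{steps:>2} steps: unchecked {unchecked:.3f}, checked {checked:.3f}")
```

Under these assumptions, an unchecked 20-step chain succeeds only about 36% of the time, while the same chain with a reliable checker and up to three attempts per step stays above 99%. Without a checker there is no way to get from the first number to the second.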
Critically, this framework also shapes customer conversations about the level of assistance GPTs can provide on a given task. You can say:
“We agentify the tasks you can verify.
We accelerate the tasks you can evaluate.
And we amplify the tasks that require judgment.”
That’s an incredibly clean framework for enterprise customers.
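As a rough sketch, the three tiers can be expressed as a small triage function. The `Task` fields and the example tasks are hypothetical simplifications of the criteria listed in each section; a real assessment would weigh more factors, such as stakes and cost of error.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    AGENTIFY = "autonomous agent with an automated checker"
    ACCELERATE = "copilot with human review"
    AMPLIFY = "human-driven, LLM as idea partner"


@dataclass
class Task:
    name: str
    machine_checkable: bool   # output can be verified mechanically
    quick_human_review: bool  # a person can judge quality quickly, e.g. with a rubric


def recommend_mode(task: Task) -> Mode:
    """Map a task's verifiability to the level of assistance."""
    if task.machine_checkable:
        return Mode.AGENTIFY
    if task.quick_human_review:
        return Mode.ACCELERATE
    return Mode.AMPLIFY


tasks = [
    Task("Generate clean JSON for an API", True, True),
    Task("Draft SDR sequences", False, True),
    Task("Long-term org strategy", False, False),
]
for task in tasks:
    print(f"{task.name}: {recommend_mode(task).value}")
```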
References
- Myllyaho, L. S., Raatikainen, M., Männistö, T., Mikkonen, T., & Nurminen, J. K. (2021). Systematic literature review of validation methods for AI systems. The Journal of Systems and Software, 181, Article 111050. https://doi.org/10.1016/j.jss.2021.111050
- METR. (2025, March 19). Measuring AI ability to complete long tasks. METR. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
- Huang, K., & Huang, J. (2025). Audited skill-graph self-improvement for agentic LLMs via verifiable rewards, experience synthesis, and continual memory (arXiv:2512.23760). arXiv. https://arxiv.org/abs/2512.23760


