Verifiability as the Central Filter for Agentic/GPT Work

TL;DR

Generative systems can assist anywhere, but agentic autonomy requires verifiability.

This is the design principle behind the most successful agentic products today (e.g., Copilot, Agentforce, LangGraph workflows).

Background

“We want to use AI!”
“I have a toilsome task I want to automate with AI!”
“My boss says USE AI!”
“I need to use AI or get left behind!”

We hear these things all the time. The key consideration in responding to them is problem selection.

Which tasks are suitable for “traditional software” (e.g., an Apex function, a Python script) and which are suitable for a GPT or an agent? My job involves these kinds of assessments, and I’ve spent some time looking into the question.

Interesting research (see references) tells us that:

Verifiability == Good Match for Generative AI

Highly Verifiable Tasks

A task is highly verifiable when:

The output can be checked mechanically, automatically, or by inspection
Any given answer can be judged right or wrong, even when a small set of acceptable answers exists rather than a single correct one
Ground truth exists outside the model
Precision is required and testable

Examples:

Classify an email into the correct queue
Propose next-best actions from structured opportunity data
Generate clean JSON for an API
Write a draft SOQL/SQL query and validate by executing it
Summarize and/or score a call transcript against a rubric
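
As one illustration of what "checked mechanically" can look like, here is a minimal sketch of a checker for the "generate clean JSON" and "classify into a queue" examples. The field names, the queue names, and the `check_payload` function are all hypothetical; a real contract would come from the API being called.

```python
import json

# Hypothetical contract: required fields and their exact types.
REQUIRED_FIELDS = {"queue": str, "priority": int}
# Hypothetical set of valid queues (the ground truth lives outside the model).
ALLOWED_QUEUES = {"billing", "support", "sales"}


def check_payload(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif type(data[name]) is not expected:
            problems.append(f"wrong type for {name}")

    queue = data.get("queue")
    if isinstance(queue, str) and queue not in ALLOWED_QUEUES:
        problems.append(f"unknown queue: {queue}")
    return problems


print(check_payload('{"queue": "billing", "priority": 2}'))  # prints []
print(check_payload('{"queue": "legal", "priority": 2}'))
```

The checker knows nothing about how the output was produced, which is the point: trust rests on these few lines rather than on the model.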

Why it’s ideal for agentic tech:

The agent can execute → check itself → retry
Generative reasoning is used to propose answers, not guarantee correctness
Verification ensures safety and correctness
Humans no longer have to “trust the magic”; they only have to trust the checker
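
The execute, check, retry loop can be sketched in a few lines. This is an illustration under stated assumptions, not a production agent: `fake_model` is a stand-in for a real LLM call, the `account` schema is invented, and SQLite's in-memory database stands in for whatever system would actually run the query (the SOQL/SQL example above).

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical schema, used only to give the checker something to run against.
SCHEMA = "CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT);"


def check_sql(query: str) -> list[str]:
    """Execute the query on a throwaway in-memory database; return any errors."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.execute(query)
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return [str(exc)]
    finally:
        conn.close()
    return []


def run_with_verification(
    propose: Callable[[str, list[str]], str],
    check: Callable[[str], list[str]],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Propose, check, and retry with feedback; None means escalate to a human."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = propose(task, feedback)
        feedback = check(candidate)
        if not feedback:
            return candidate
    return None


def fake_model(task: str, feedback: list[str]) -> str:
    # Stand-in for an LLM: wrong on the first try, corrected once it sees errors.
    if feedback:
        return "SELECT name FROM account"
    return "SELECT nme FROM account"


result = run_with_verification(fake_model, check_sql, "list account names")
print(result)  # prints: SELECT name FROM account
```

The model only proposes; the checker decides. Capping the attempts and returning `None` keeps the failure mode explicit, so an unverified answer is never passed along silently.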

Moderate Verifiability

A task is moderately verifiable when:

There is no single right answer, but human judgment can assess quality quickly
There are heuristics, rubrics, or constraints that aid evaluation
Correctness is subjective but improvable through iteration

Examples:

Drafting an AI agent charter
Creating a discovery plan or issue tree
Drafting SDR sequences
Summaries that need nuance
First-draft slides or narrative documents

Why it works:

The model accelerates creation
Humans evaluate and nudge
Output can be improved with structured rubrics (the ARC-IP idea lives here)

This is “copilot territory” rather than fully autonomous agent territory.
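
A structured rubric can make the "humans evaluate and nudge" step more repeatable. The sketch below is one possible shape, with a hypothetical three-criterion rubric for an agent charter and a 1 to 5 rating scale. The ratings come from a human reviewer; the code only aggregates them and points at the weakest area to revise next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance within the rubric


# Hypothetical rubric for a draft agent charter.
RUBRIC = [
    Criterion("states the goal", 2.0),
    Criterion("names an owner", 1.0),
    Criterion("lists constraints", 1.0),
]


def weighted_score(ratings: dict[str, int], scale: int = 5) -> float:
    """Combine human ratings (1..scale) into a score between 0 and 1."""
    total_weight = sum(c.weight for c in RUBRIC)
    earned = sum(c.weight * ratings[c.name] for c in RUBRIC)
    return earned / (total_weight * scale)


def weakest_criterion(ratings: dict[str, int]) -> str:
    """Name the lowest-rated criterion, i.e. where the next revision should focus."""
    return min(RUBRIC, key=lambda c: ratings[c.name]).name


# Ratings supplied by a human reviewer after reading the draft.
ratings = {"states the goal": 5, "names an owner": 3, "lists constraints": 4}
print(round(weighted_score(ratings), 2))  # prints 0.85
print(weakest_criterion(ratings))         # prints: names an owner
```

The human stays in the loop as the evaluator; the rubric just makes successive drafts comparable.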

Low Verifiability

A task exhibits low verifiability when:

There is no objective ground truth
The output cannot be checked for correctness
Success is inherently subjective or emergent
Stakes are high and errors are costly

Examples:

Strategic decisions with incomplete data
Opinions, values, philosophical positions
“Tell me what my activation org should become long-term”
“Design a data model no one has built before with no constraints”
Open-ended creative work with no rubric

These tasks benefit from LLMs for idea generation, but not autonomy.

Why agents struggle:

The model cannot know when it’s wrong
No reliable automated evaluation
Risk compounds as reasoning chains get longer

This is where you stay in a human-driven loop, using the LLM as a partner.

Critically, this framing also shapes customer conversations about the level of assistance GPTs can provide on a given task. You can say:

“We agentify the tasks you can verify.

We accelerate the tasks you can evaluate.

And we amplify the tasks that require judgment.”

That’s an incredibly clean framework for enterprise customers.
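
For triage conversations, the three tiers can be reduced to a small decision function. This is a deliberate simplification with hypothetical names (`Task`, `Mode`, `triage`): it reduces each task to two yes/no questions and ignores other factors the tiers mention, such as the stakes involved.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    AGENTIFY = "agentify"      # highly verifiable: autonomous with a checker
    ACCELERATE = "accelerate"  # moderately verifiable: copilot, human evaluates
    AMPLIFY = "amplify"        # low verifiability: human-driven, LLM as partner


@dataclass(frozen=True)
class Task:
    name: str
    has_automated_check: bool      # output can be checked mechanically
    human_can_judge_quickly: bool  # a rubric or quick review can assess quality


def triage(task: Task) -> Mode:
    """Map a task to an assistance mode based on how its output can be checked."""
    if task.has_automated_check:
        return Mode.AGENTIFY
    if task.human_can_judge_quickly:
        return Mode.ACCELERATE
    return Mode.AMPLIFY


tasks = [
    Task("Classify an email into the correct queue", True, True),
    Task("Draft an SDR sequence", False, True),
    Task("Set long-term org strategy", False, False),
]
for task in tasks:
    print(f"{task.name}: {triage(task).value}")
```

The order of the checks matters: an automated check wins whenever one exists, and the human-driven mode is the default when nothing else applies.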

References

Myllyaho, L. S., Raatikainen, M., Männistö, T., Mikkonen, T., & Nurminen, J. K. (2021). Systematic literature review of validation methods for AI systems. The Journal of Systems and Software, 181, Article 111050. https://doi.org/10.1016/j.jss.2021.111050
METR. (2025, March 19). Measuring AI ability to complete long tasks. METR. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ 
Huang, K., & Huang, J. (2025). Audited skill-graph self-improvement for agentic LLMs via verifiable rewards, experience synthesis, and continual memory (arXiv:2512.23760). arXiv. https://arxiv.org/abs/2512.23760

Thinking Out Loud