The Hard Part of Building an Autonomous AI Teammate

Building a useful autonomous AI teammate is less about the chat interface and more about routing, trust, memory, governance, and workflow reliability. We built a marketing operator to understand this better.

Jun 03, 2026

The chat box is just the doorway. The product is everything that happens after the request walks in.

Everyone wants to build autonomous teammates. The AI agent that picks up a vague request, figures out what you actually meant, does the work, and hands back something useful without you babysitting it. I spent a while designing and building one, and the lesson I keep coming back to is that the chat box is the least interesting part of it.

The thing we built happened to be a marketing operator, so that's the case I'll lean on throughout. But almost nothing in here is specific to marketing. Swap in support, sales, finance ops, or internal data and the hard parts barely move. They're properties of putting an autonomous agent in front of real work, not of any one domain.

The chat box is just where the request walks in. The actual product is everything that happens after: figuring out what kind of work the request is, gathering enough context (and the right tools or skills) to act, checking whether the person is even allowed to ask, doing the data work, iterating on the output, remembering what worked last time, scheduling the thing to recur, and quietly logging what happened when the agent got confused. None of that demos well.

Most teams I've watched start at the wrong end. They begin with the agent's personality. Should it be friendly? Clever? Does it use emojis? Those questions matter eventually, but they have nothing to do with whether anyone trusts the agent. Trust shows up when someone fires off a half-formed question into an already chaotic Slack thread and the system still does the right thing.

What the job actually is

In my opinion, "answer questions" is too vague to build against, whatever the domain. The real job is closer to this: take a messy, natural-language request from wherever work is already happening, work out which kind of work it represents, gather the minimum useful context, run the right specialist workflow, show progress along the way, respect who's allowed to see what, return something useful, and leave enough of a trace that the next run is a little better than the last. That is not just a chat assistant.

The incoming requests look deceptively simple. Write campaign copy. Explain why a metric moved. Pull a channel or social insight. Turn a recurring ask into a scheduled update. Remember a preference. Ask for access when the work touches something sensitive. Each one is its own small product. Stitch them together and you have an operating layer, not a chatbot. The list of verbs, and the specialist, deterministic flows behind them, changes if you point this at a support team or a finance team, but the shape doesn't.

Slack is not the easy mode you think it is

It's tempting to assume Slack (or any enterprise communication tool) saves you work because you skip building a web app. In practice Slack removes one kind of surface area and hands you a worse one.

A normal app owns its screens. Slack owns nothing. Threads are messy, context arrives out of order, and people talk over each other. Someone @-mentions the agent, someone else replies, the original person drops a file, and then a fourth person says "ignore that, use the latest plan." Before the agent can do anything it has to figure out whether it's being asked to act, being handed context, or just being politely cc'd into the noise.

For us, the first version that actually felt good was not the one with the smartest model. It was the one with the best manners. The agent should be present without being needy. It should carry a thread forward when it owns the work, and stay out of the way when it doesn't. It should know the difference between a question worth asking and an assumption that's safe to make. And it should show progress on long-running work without turning the channel into an AI call center.

"Running the query" beats silence. "Waiting on access approval" beats a vague failure. "I couldn't find enough evidence to say that" beats a confident hallucination every single time. Most of the UX work here isn't about delight. It's about lowering the reader's uncertainty about what the agent is doing.

Routing is a feature, not plumbing

The second hard problem is routing, and it's where a lot of our attempts quietly fell apart.

A user should never have to know whether their request is a copywriting job, a data job, a social-listening job, a scheduler, or just a quick answer. The moment they have to name the internal tool, the agent has leaked its own architecture.

Take a request like: "Can you check why yesterday underperformed and write a short update for the launch channel?" That's not one task. It's a data investigation, a narrative summary, probably a copywriting pass, and maybe a recurring follow-up. The system has to decide what the primary job is, what context is missing, which workflows this particular user is allowed to run, what can happen now, and what needs a human to look at it first. A generic assistant can answer that request. It usually can't operate on it.

What worked for us was a supervisor sitting in front of specialist workflows. The supervisor isn't a genius brain (not anytime soon); it's a disciplined router. Its only job is deciding whether the request needs copy, data, social insight, scheduling, memory, or no response at all. The specialist behind it does the real work under its own constraints.

One voice on the surface. Several operating loops underneath, each with its own success criteria.

That separation matters, because copy and data have nothing in common as far as success criteria go. A copy workflow should optimize for usable variants, brand fit, and a fast feedback loop. A data workflow should optimize for safe queries, verified claims, honest caveats, and traceability. One voice on the surface, several different loops underneath.
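To make the supervisor idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the route names, the keyword rules (a stand-in for whatever classifier you would really use, most likely a model call), and the `allowed` set representing the workflows a given user may run. It illustrates the shape of a disciplined router, not our implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    COPY = "copy"
    DATA = "data"
    SOCIAL = "social"
    SCHEDULE = "schedule"
    MEMORY = "memory"
    NONE = "none"  # staying quiet is a valid decision


# Hypothetical keyword rules standing in for a real classifier.
KEYWORDS = {
    Route.DATA: ("why", "metric", "underperform", "cohort"),
    Route.COPY: ("write", "draft", "copy", "headline"),
    Route.SOCIAL: ("social", "mentions", "sentiment"),
    Route.SCHEDULE: ("every", "weekly", "daily", "recurring"),
    Route.MEMORY: ("remember", "prefer"),
}


@dataclass
class Plan:
    primary: Route
    follow_ups: list[Route]
    needs_approval: list[Route]


def route(text: str, addressed_to_agent: bool, allowed: set[Route]) -> Plan:
    """Pick a primary job, queue the rest, and flag what the user can't run yet."""
    if not addressed_to_agent:
        return Plan(Route.NONE, [], [])
    lowered = text.lower()
    # Dict order doubles as priority: data work comes before the copy built on it.
    matched = [r for r, words in KEYWORDS.items() if any(w in lowered for w in words)]
    runnable = [r for r in matched if r in allowed]
    blocked = [r for r in matched if r not in allowed]
    primary = runnable[0] if runnable else Route.NONE
    return Plan(primary, runnable[1:], blocked)
```

Fed the launch-channel request from earlier, this returns a data job as the primary route with a copy pass queued behind it. The router never does the work itself; it only decides which specialist loop gets the request. A real version would probably also hold dependent work (the copy pass) when the job it depends on is blocked on approval.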

Data answers need a seatbelt

Marketing teams ask data questions that sound trivial. Why did this metric move? Which cohort changed? Was the campaign actually responsible? What do we say in the review?

The risky part was never the SQL. Models write fine SQL most of the time. The risk is the story they tell after the query comes back. A data agent can be syntactically perfect and still be confidently, productively wrong: querying the wrong grain, comparing mismatched windows, ignoring seasonality, narrating noise as if it were signal, or turning a sample of twelve into a board-ready conclusion.

So the workflow needs a seatbelt. The version I trust doesn't go straight from question to answer. It plans the investigation, retrieves the schema, writes bounded SQL, runs the query async, analyzes the result, and only then audits every claim against what the data actually showed before publishing. All of this happens while maintaining org context and staying inside the provided guardrails.

The audit step is the whole point. Every concrete claim in the final answer has to survive a check against what the data actually showed. If a claim doesn't hold up, the system rewrites it, caveats it, or drops it. This is slower than a one-shot answer, and that's fine, because the slowness is the difference between a demo and a tool someone is willing to make a decision on. Making the agent sound analytical is easy. Making it earn the right to is the hard part.
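A rough sketch of what that audit could look like, assuming the draft's claims have already been extracted into structured form. The `Claim` fields, the `MIN_SAMPLE` floor, and the tolerance are all illustrative assumptions, and the "rewrite" branch is left out to keep it short; this version only keeps, caveats, or drops.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    metric: str              # key into the query result
    claimed: float           # the number the draft asserts
    tolerance: float = 0.01  # relative error we accept


MIN_SAMPLE = 30  # hypothetical floor; below it, surviving claims get a caveat


def audit(claims: list[Claim], result: dict[str, float], sample_size: int) -> list[str]:
    """Keep, caveat, or drop each claim based on what the query returned."""
    published = []
    for claim in claims:
        observed = result.get(claim.metric)
        if observed is None:
            continue  # no evidence at all: drop the claim
        scale = max(abs(observed), 1e-9)
        if abs(observed - claim.claimed) / scale > claim.tolerance:
            continue  # the draft's number disagrees with the data: drop it
        if sample_size < MIN_SAMPLE:
            published.append(f"{claim.text} (caveat: based on only {sample_size} rows)")
        else:
            published.append(claim.text)
    return published
```

The design choice that matters is that nothing reaches the user by default. A claim has to be matched to a number in the result to be published, which is the opposite of letting the model narrate and then spot-checking.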

Async work quietly rewrites your architecture

The moment work takes time, the request-response fantasy falls apart. Queries run long. PDFs run long. External tools fail and recover. Scheduled jobs fire long after the original conversation has gone cold. Approval flows freeze the agent until someone else makes a call.

Once any of that is true, the agent needs durable state. It has to remember the thread, the user, the workflow, the step it's on, the query it's waiting on, the approval it requested, and where the final answer is supposed to land. And it has to resume cleanly without bleeding old state into a fresh turn from a different user.

Three lines on the happy path. A dozen on the real one, plus the log nobody ever sees.

This is where prototypes die. The happy path is three lines: receive message, call model, reply. The real path is: receive, route, start the workflow, checkpoint, dispatch the async query, send progress, wait for the callback, resume from the checkpoint, analyze, audit, maybe ask for approval, resume again, send the final answer, log the outcome. None of that is glamorous. If the user only ever sees one clean reply, it's because a lot of unglamorous machinery didn't leak. That's the goal.
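Here is a small sketch of the checkpoint-and-resume part of that path. The `RunState` fields mirror the list of things the agent has to remember; the in-memory `CheckpointStore` is a hypothetical stand-in for a real database, and the step names are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunState:
    run_id: str
    user: str
    thread: str
    workflow: str
    step: str
    pending_query: Optional[str] = None


class CheckpointStore:
    """In-memory stand-in for a durable store; state round-trips through JSON."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, state: RunState) -> None:
        self._rows[state.run_id] = json.dumps(asdict(state))

    def load(self, run_id: str) -> RunState:
        return RunState(**json.loads(self._rows[run_id]))


def start_data_run(store: CheckpointStore, run_id: str, user: str,
                   thread: str, query_id: str) -> None:
    # Checkpoint before going async so nothing lives only in process memory.
    store.save(RunState(run_id, user, thread, "data", "awaiting_query", query_id))


def on_query_done(store: CheckpointStore, run_id: str, query_id: str) -> RunState:
    """Resume from the checkpoint when the async callback arrives."""
    state = store.load(run_id)
    if state.step != "awaiting_query" or state.pending_query != query_id:
        raise ValueError("callback does not match the checkpointed run")
    state.step, state.pending_query = "analyzing", None
    store.save(state)
    return state
```

Keying every load by `run_id` and refusing callbacks that don't match the checkpoint is what keeps old state from bleeding into someone else's turn. A duplicate or stale callback fails loudly instead of resuming the wrong work.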

Governance is the product, not the speed bump

There's a school of thought that treats governance as the thing slowing the product down. For internal agents I think that's backwards. Governance is one of the main reasons anyone trusts the product at all.

Marketing work brushes up against sensitive things constantly: customer segments, pricing, partner information, campaign claims, external messaging. "Let the model decide" is the wrong default for any of that. The right default is explicit boundaries. Some workflows are open, some require approval, some are admin-only, and some should never be delegated at all.

The approval flow has to be usable rather than ceremonial. An approver should get a compact ask: who wants it, what capability they need, why, and what kind of grant they're requesting. The person who made the original request should be able to see that the work is paused on approval, not left wondering whether the agent died.

And denial has to be a good experience too. "You don't have access" isn't enough. A better answer names the boundary, says whether approval has already been requested, and gives the person a next step. A capable agent should also know, cleanly, when the answer is no.
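As a sketch of explicit boundaries plus a useful denial, assume a policy table that maps capabilities to the four levels described above. The capability names, the message wording, and the 'request access' reply are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    OPEN = "open"
    APPROVAL = "approval"
    ADMIN_ONLY = "admin_only"
    NEVER = "never"


# Hypothetical policy table; capability names are made up for illustration.
POLICY = {
    "draft_copy": Access.OPEN,
    "query_customer_segments": Access.APPROVAL,
    "change_permissions": Access.ADMIN_ONLY,
    "publish_external_claims": Access.NEVER,
}


@dataclass
class Decision:
    allowed: bool
    message: str


def check(capability: str, is_admin: bool, grants: set[str], pending: set[str]) -> Decision:
    # Unknown capabilities fall through to NEVER: deny by default.
    level = POLICY.get(capability, Access.NEVER)
    if level is Access.OPEN or (level is Access.ADMIN_ONLY and is_admin):
        return Decision(True, "ok")
    if level is Access.APPROVAL and capability in grants:
        return Decision(True, "ok")
    if level is Access.APPROVAL:
        if capability in pending:
            status = "Approval is already requested; this work is paused until it lands."
        else:
            status = "No approval is requested yet. Reply 'request access' to ask an approver."
        return Decision(False, f"'{capability}' requires approval. {status}")
    if level is Access.ADMIN_ONLY:
        return Decision(False, f"'{capability}' is admin-only. Ask an admin to run it.")
    return Decision(False, f"'{capability}' is never delegated to the agent. A person has to do it.")
```

Every denial names the boundary, says where the approval stands, and offers a next step. The model never gets a vote on whether the boundary applies.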

Memory is only useful when it's scoped

Every agent eventually discovers memory, and the pitch writes itself: remember past campaigns, preferences, approved copy, recurring feedback, and old decisions, and the thing gets better over time. That's genuinely true. It's also where a lot of trust gets quietly burned.

Memory that's too eager gets creepy. Memory that's too broad becomes a privacy problem. Memory that's stale becomes actively harmful. Memory you can't inspect becomes a trust problem on its own. The useful version is scoped. Personal memory should behave differently from channel memory. A private preference shouldn't silently graduate into a shared team fact. A campaign learning should stay tied to the surface it came from. Feedback should be stored as a signal, not as gospel.

Approved copy is the example I'd point to. If the system can pull up past approved variants, its first drafts get noticeably better. Feedback is another: if people keep rejecting a particular tone, the agent should stop reaching for it. But memory has to earn its slot in the workflow. Remember the things that improve the next run, keep the boundary legible to the user, and resist the urge to remember everything just because you can. Otherwise it's just another hidden source of surprising behavior.
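One way to picture scoped memory, sketched under a few assumptions: scopes are plain strings like `user:<id>` and `channel:<id>`, staleness is a simple age cutoff, and sharing a personal memory into a channel is an explicit action by its owner. None of these names come from a real library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    scope: str  # "user:<id>" or "channel:<id>"
    text: str
    created: datetime


class ScopedMemory:
    def __init__(self, max_age: timedelta) -> None:
        self._items: list[Memory] = []
        self._max_age = max_age

    def remember(self, scope: str, text: str, now: datetime) -> None:
        self._items.append(Memory(scope, text, now))

    def recall(self, user: str, channel: str, now: datetime) -> list[str]:
        """Return only fresh memories from this user's scope or this channel's scope."""
        visible = {f"user:{user}", f"channel:{channel}"}
        return [
            m.text for m in self._items
            if m.scope in visible and now - m.created <= self._max_age
        ]

    def share(self, user: str, text: str, channel: str, now: datetime) -> bool:
        """Copy a personal memory into a channel, only when its owner asks."""
        owned = any(m.scope == f"user:{user}" and m.text == text for m in self._items)
        if owned:
            self.remember(f"channel:{channel}", text, now)
        return owned
```

Recall is filtered by scope and age on every read, so a private preference can't silently graduate into a team fact, and stale entries stop influencing drafts without anyone having to clean them up.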

Scheduling turns usage into operations

One-off usage is nice. Recurring usage is when the agent starts becoming infrastructure.

Marketing is full of repetition: weekly metric checks, campaign summaries, daily monitoring, performance readouts, social listening, launch reminders, the same stakeholder update every Friday. The moment an agent can turn a repeated ask into a schedule, it stops being purely reactive.

But scheduling isn't cron with a friendlier sentence wrapped around it. The agent has to understand the task, the cadence, where the output goes, and how it'll be validated. Risky or ambiguous tasks should dry-run before they're allowed to recur. Anything posting into a shared channel raises the correctness bar. Anything depending on data access needs that access settled up front, or the schedule just fails silently three weeks later.

The UX matters as much as the mechanics. Nobody should see a raw cron expression or an internal job ID. They should see what will run, when, where it'll post, what assumption it made, and how to change or kill it. Scheduling looks like a small feature, but it's really a promise: I'll come back later and do this without making you re-explain it. A promise like that only works if it's reliable. Every mistake tanks trust.
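A sketch of the user-facing side of a schedule, with the field names and message wording as assumptions. The point is that the record carries the task, cadence, destination, and assumptions in plain language, and that it refuses to start recurring until access and a dry run are settled.

```python
from dataclasses import dataclass, field


@dataclass
class Schedule:
    task: str
    cadence: str      # human-readable, e.g. "every Friday at 09:00"
    destination: str  # where the output posts
    assumptions: list[str] = field(default_factory=list)
    posts_to_shared_channel: bool = False
    access_confirmed: bool = False
    dry_run_passed: bool = False

    def blockers(self) -> list[str]:
        """Reasons this schedule must not start recurring yet."""
        problems = []
        if not self.access_confirmed:
            problems.append("data access is not settled")
        if self.posts_to_shared_channel and not self.dry_run_passed:
            problems.append("needs a successful dry run before posting to a shared channel")
        return problems

    def describe(self) -> str:
        """What the user sees: no cron expression, no internal job ID."""
        lines = [
            f"What: {self.task}",
            f"When: {self.cadence}",
            f"Where: {self.destination}",
        ]
        lines += [f"Assumed: {a}" for a in self.assumptions]
        lines.append("Reply 'edit' or 'cancel' to change or stop this.")
        return "\n".join(lines)
```

Checking `blockers()` at creation time is what moves the failure from "silently, three weeks later" to "immediately, while the person who asked is still in the thread."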

The hardest call is what not to automate

The more capable the system gets, the more tempting it is to let it just act. That's exactly where restraint earns its keep. The progression shouldn't jump from summarizing to full automation. Progress must be incremental: each step has to prove its worth before earning the right to advance to the next.

Read, rank, draft, recommend, then act inside tight boundaries. Only the most autonomous step approaches the ceiling, and that ceiling is where the human stays.
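The ladder can be sketched as a small state machine. This is an illustration under assumed rules, not something we shipped: the `REQUIRED_STREAK` threshold and the idea of measuring "proving its worth" as consecutive accepted outputs are both hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    READ = 1
    RANK = 2
    DRAFT = 3
    RECOMMEND = 4
    ACT = 5  # acting inside tight boundaries; judgment stays with the human


REQUIRED_STREAK = 20  # hypothetical: consecutive accepted runs needed to advance


class Workflow:
    def __init__(self, name: str, ceiling: Level = Level.ACT) -> None:
        self.name = name
        self.level = Level.READ
        self.ceiling = ceiling
        self.streak = 0

    def record(self, accepted: bool) -> None:
        """A rejected output resets the streak; trust has to be earned again."""
        self.streak = self.streak + 1 if accepted else 0

    def try_promote(self) -> bool:
        """Advance exactly one level, and only after the streak is earned."""
        if self.streak < REQUIRED_STREAK or self.level >= self.ceiling:
            return False
        self.level = Level(self.level + 1)
        self.streak = 0
        return True
```

The per-workflow `ceiling` is the part worth copying. Some workflows should top out at drafting no matter how well they perform, and that limit is set by a person, not earned by the agent.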

In the marketing case the human stays on the hook for taste, judgment, claims, positioning, and accountability. Point this at another domain and the nouns change, but the principle holds: there's a judgment layer the agent shouldn't own. Its job is to lower the activation energy around that layer, gather the context, write the draft, run the bounded analysis, surface the caveats, prepare the artifact, and then get out of the way.

Which is also why "human in the loop" has to be designed in. Which actions need approval? Which claims need evidence behind them? Which workflows get a blast-radius limit? Which domains are zero-tolerance? Which failures page a human instead of retrying into the void? If you don't have answers to those, the agent isn't ready for more autonomy. It's just optimistic software.

What I'd do again

If I built this from scratch again, I'd keep the first version narrower than my instincts wanted. One surface. A handful of genuinely high-value workflows. Strong boundaries, good progress states, real logs, clear approvals, and a feedback loop on every output. And I'd fight the urge to make it look magical.

The magic was never the agent producing a tidy paragraph. The magic is the whole system taking a messy request, routing it right, gathering context, running the correct workflow, showing its work where it counts, refusing when it should, and getting a bit better from the result. That's much harder to put in a demo, and much more useful to the people actually doing the work.

The internal tools that win aren't the theatrical ones. They're the ones that quietly take the drag out of work people already care about. They make the blank page cheaper and context easier to rebuild. They make recurring work less manual and make the data questions safer to ask. They turn informal approvals into explicit ones.

That's what an autonomous teammate should do, in marketing or anywhere else. Not replace the person, but remove the parts of the work that never should have demanded so much human attention in the first place. Small surface area, strong guardrails, honest memory, boring reliability. That's the version I'd bet on.