I can’t be the only one who thinks that today’s AI products are pretty bad. Every Twitter demo that introduces a “novel” AI product ends up being a user experience full of hallucinations and unhelpful outputs. Every incumbent that steps into AI tends to be overly conservative, applying so many guardrails and constraints that its models end up making decisions with the cognitive ability of a 12-year-old. There’s currently not much middle ground between building a chaotic AI product and an unintelligent one.
“AI products will get better when models improve.”
Models improve every day on universal benchmarks, and hopefully we will keep trending toward generally intelligent models. But generic intelligence doesn’t mean much without proper tuning for specific users and applications. Put it this way: what does “correct” even mean when every user has different preferences and expectations from every other user?
Just as OpenAI’s o1 unlocked increased model capability at test-time with better chain-of-thought, I believe there’s a lot of low-hanging fruit for improving the capability of AI products simply through better model orchestration and product design. I’ve written on the topic of AI reliability and human-agent interaction before, but this is my simplified take on how AI products can evolve to be both helpful and safe.
1. Personalization (context & memory)
AI can only be so intelligent when it doesn’t know anything about the user.
First, AI products must have context on the user. Who is the user and what are their goals? What decisions and actions do they need to make to achieve those goals? How can we automate more of the user’s workloads to help them achieve their goals faster? To answer these questions, we need to gather context about the user both within the product experience and outside of it.
When it comes to gathering context, Ramp does a great job. For every Ramp AI feature, multiple pieces of user context are taken into account. When we generate a transaction memo (AI suggested memos) or route a specific transaction to a card (AI transaction routing), we track in-product context (merchant, transaction amount, location, etc.) as well as external context (relevant calendar events from Google Calendar) to create holistic context during inference. With these pieces of context, our LLMs have an understanding of the transaction that the user made and what the user was doing at the time of the transaction itself.
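To make the idea concrete, here is a minimal sketch of merging in-product and external context into a single prompt fragment for inference. All names (`TransactionContext`, the fields, the formatting) are hypothetical illustrations, not Ramp’s actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TransactionContext:
    """Hypothetical container merging in-product and external signals."""
    merchant: str
    amount: float
    location: str
    calendar_events: list = field(default_factory=list)  # external context

    def to_prompt(self) -> str:
        """Flatten the holistic context into a prompt fragment for the LLM."""
        events = "; ".join(self.calendar_events) or "none"
        return (
            f"Merchant: {self.merchant}\n"
            f"Amount: ${self.amount:.2f}\n"
            f"Location: {self.location}\n"
            f"Concurrent calendar events: {events}"
        )

ctx = TransactionContext(
    merchant="Delta Airlines",
    amount=420.00,
    location="SFO",
    calendar_events=["Offsite in New York, Mar 3-5"],
)
print(ctx.to_prompt())
```

The point is less the data structure than the pattern: context from multiple sources is gathered once, normalized, and injected at inference time so the model sees the transaction and what the user was doing around it.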
However, when there are large amounts of context, model outputs can be distorted: garbage in, garbage out. Context needs to be properly distilled and filtered so that only relevant information reaches the model at inference time.
Second, AI products must actually be personalized with this context. Retrieving better context at inference time can generally improve model outputs, but is there a way to capture granular user preferences, patterns, and details over time to improve a product’s overall context on the user? We can achieve this by building memory systems that distill historical user context into actionable information.
This is something I worked on building at NOX: what does a memory system look like for an AI personal assistant? We built a structured memory system that passively captured details about the user, their relationships, and their preferences throughout conversations between the user and the assistant. This information was consistently updated in a compact schema that could be referenced in every generation, so the LLM always had an up-to-date picture of the user. This memory system enabled the NOX assistant to make decisions aligned with user preferences on the fly, avoid non-preferred behavior, and “grow” a data mind map of the user.
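The mechanics of such a system can be sketched in a few lines. This is an illustrative toy, not the NOX schema: facts are distilled into (category, key, value) entries, newer observations overwrite stale ones, and the compact rendering is prepended to every generation.

```python
# Toy sketch of a structured memory system: conversation details are
# distilled into (category, key, value) facts; the newest observation
# overwrites stale entries, keeping the schema compact and current.
class UserMemory:
    def __init__(self):
        # category -> key -> value, e.g. preferences -> "airline" -> "Delta"
        self.store = {}

    def update(self, category: str, key: str, value: str) -> None:
        """Insert or overwrite a fact; the newest observation wins."""
        self.store.setdefault(category, {})[key] = value

    def render(self) -> str:
        """Compact schema to reference in every generation."""
        lines = []
        for category, facts in self.store.items():
            for key, value in facts.items():
                lines.append(f"{category}.{key}: {value}")
        return "\n".join(lines)

memory = UserMemory()
memory.update("preferences", "airline", "United")
memory.update("preferences", "airline", "Delta")   # newer fact replaces older
memory.update("relationships", "manager", "Alex")
print(memory.render())
```

Because updates are key-addressed rather than append-only, the memory never balloons into an unbounded transcript; the model always reads a small, current snapshot of the user.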
2. Control (steering & takeover)
Model error will exist as long as human error exists. Intelligent systems can’t avoid mistakes if the human instruction is faulty. When these systems make mistakes, users should have the control to either (1) nudge the system in the preferred direction or (2) take over themselves.
Steering, a technique drawn from mechanistic interpretability, can let users semantically steer models and apply the intelligence of AI in alignment with their preferences. For example, if I am iteratively improving a piece of long-form generated content, I would much prefer to steer the model’s thought process and specific parts of the generation rather than re-generating the entire thing. Steering enables more granular control over model outputs, preserving the good aspects of an output while modifying the bad ones.
The design question of steering still stands: how do you build steering into products in a way that is both effective and intuitive? When inference goes wrong, how can we design interfaces where non-technical users can dissect a system’s thought process, pinpoint the mistakes, and steer the system in the right direction? Steering opens up completely new paradigms in human-computer interaction, product personalization, and even generative UI.
Though work on steering is still early, seeing companies like Anthropic and Tilde bring production-ready steering to models is a huge step forward.
Personalization + Control
Personalization empowers AI to be more intelligent, and control empowers users to be more decisive. But, an interesting product direction is what these two facets look like together as a flywheel.
Context can enable personalized steering and control mechanisms. Steering provides user preference datapoints that can be stored in memory for better personalization. Context & memory can improve the steering process, and vice versa. As applications start to adopt more of these practices, flywheels like these can even create moats in UX.
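The flywheel can be made concrete with a small sketch: every steering correction is logged as a preference datapoint in memory, and future prompts are conditioned on those accumulated corrections. All names here are illustrative assumptions.

```python
# Toy flywheel: steering corrections become durable preference
# datapoints, so future generations start closer to what the user wants.
preferences = {}  # memory: topic -> accumulated steering nudges

def record_steer(topic: str, nudge: str) -> None:
    """Store a steering correction as a durable preference datapoint."""
    preferences.setdefault(topic, []).append(nudge)

def personalized_prompt(topic: str, task: str) -> str:
    """Condition future prompts on past corrections for this topic."""
    learned = "; ".join(preferences.get(topic, []))
    return f"{task}\nKnown preferences: {learned or 'none yet'}"

record_steer("writing", "shorter sentences")
record_steer("writing", "avoid passive voice")
print(personalized_prompt("writing", "Draft the launch memo"))
```

Each loop around the cycle means fewer corrections next time, which is exactly the kind of compounding UX advantage that is hard for a competitor to copy.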
----
There aren’t many well-built AI products out there right now, and I don’t think that’s a coincidence. Building a well-designed non-deterministic system is hard, and most developers choose to ignore the problem. Yet it will likely be the differentiating factor between products that succeed and products that don’t.
At the end of the day, users just want to achieve their goals faster, cheaper, and better. If we can automate more of a user’s workload, while still giving them control over these automations, we’ll be taking the clearest next step towards a more productive, intelligent software ecosystem.