Measuring AI Search Visibility: A Practical Framework

Most brands cannot answer a simple question: when someone asks an AI assistant about our category, do we show up?

Search used to hand you a ranked list of links, and a decade of tooling grew up to measure your place on it. Rankings, impressions, clicks. AI answers work differently. The assistant reads across sources and returns one synthesized response, and your link may never appear at all. The old metrics still run, but they describe a surface that fewer people look at every quarter. You can be doing fine on Google and be invisible inside the answer a buyer actually reads.

This is the gap that Generative Engine Optimization (GEO) and Answer Engine Optimization (AEO) try to close. Most of the conversation about them is about tactics. This piece is about the part that comes first and gets skipped: measurement. You cannot improve what you cannot see, and right now most teams cannot see their AI visibility at all.

Here is a framework I use to measure it, and how to start without buying anything.

The short version

Four things matter in an AI answer: whether you are present, whether you are cited, where you sit in the response, and how you are framed. You collect the data by running a fixed set of buyer-style prompts across the major engines on a regular cadence, segmented by topic and region. The single most common mistake is sloppy tag logic that inflates your numbers. Once you can see the picture, the gaps tell you exactly what to build.

Why classic SEO metrics miss this

A ranking measures your position on a page of ten blue links. An AI answer has no page of ten blue links. It has a paragraph, sometimes with a handful of sources attached, sometimes with none. So three things break at once.

Impressions stop meaning much, because the user may never see a list at all. Clicks stop telling the whole story, because the assistant may use your information and never send the click. And rank, the metric the entire industry is built on, has no clean equivalent. You are not ranked. You are either woven into the answer or you are not.

That does not make measurement impossible. It means you measure the answer itself, not the list that used to sit in front of it.

The four dimensions

1. Presence

Across a set of prompts that matter to your category, how often does the assistant mention you at all? This is the closest thing to impressions, but it is binary and answer-level. You are in the response or you are not. Track it as a rate: out of your prompt set, what share of answers include you. That share is your floor metric, and it is the first thing that moves when your work is landing.
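As a rough sketch of the arithmetic, presence is a count of answers that mention you divided by the number of answers collected. The brand "Acme", its aliases, and the sample answers below are invented for illustration, and the whole-word, case-insensitive match is an assumption; a real brand may need a longer alias list or a manual review pass.

```python
import re

def mentions_brand(answer: str, aliases: list[str]) -> bool:
    """True if any alias appears in the answer as a whole word, ignoring case."""
    return any(
        re.search(rf"\b{re.escape(alias)}\b", answer, flags=re.IGNORECASE)
        for alias in aliases
    )

def presence_rate(answers: list[str], aliases: list[str]) -> float:
    """Share of answers that mention the brand at least once."""
    if not answers:
        return 0.0
    hits = sum(mentions_brand(a, aliases) for a in answers)
    return hits / len(answers)

# Hypothetical data: "Acme" is a made-up brand and these answers are invented.
answers = [
    "For live video, Acme and Zenith are common choices.",
    "Most teams start with Zenith.",
    "ACME CDN is the usual pick for low latency.",
    "It depends on your traffic profile.",
]
rate = presence_rate(answers, ["Acme", "Acme CDN"])
print(f"presence rate: {rate:.0%}")  # 2 of the 4 answers mention the brand
```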

2. Citation

When you are mentioned, are you named as a source, with a link or attribution the user could follow? Presence without citation means the model absorbed your information and sent the click somewhere else. Citation is the nearest thing AI search has to a backlink: the trust signal and the traffic path in one. It is the dimension that matters most.
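If the engine shows its sources, one way to check citation is to compare each cited URL's host against your own domain. This is a minimal sketch that assumes you have already copied the source URLs out of the answer; `example.com` stands in for your site, and `is_cited` is a name I made up for the illustration.

```python
from urllib.parse import urlparse

def is_cited(source_urls: list[str], brand_domain: str) -> bool:
    """True if any cited URL is on the brand's domain or one of its subdomains."""
    brand_domain = brand_domain.lower()
    for url in source_urls:
        host = (urlparse(url).hostname or "").lower()
        if host == brand_domain or host.endswith("." + brand_domain):
            return True
    return False

# Hypothetical example: example.com stands in for the brand's own site.
sources = ["https://docs.example.com/guide", "https://other.test/review"]
cited = is_cited(sources, "example.com")
```

Matching on the host rather than on the raw URL string avoids counting a lookalike such as `notexample.com`, or a third-party page that merely has your name in its path.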

3. Position

Where in the answer do you land? Being the first option named, or the one the assistant recommends, is a different outcome from being the fourth item in a list. Position maps loosely to old rank, but it is about prominence inside a paragraph, not a row on a page. A simple version: note whether you are mentioned first, mentioned among others, or mentioned only as an afterthought.
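The three-way label can be approximated from the order in which brands are first named. The sketch below is one possible rule, not a standard: it assumes that being named last of three or more brands counts as an afterthought, and the brand names are invented. It cannot tell a recommendation from a passing mention, so treat the label as a first pass.

```python
import re

def first_mention_order(answer: str, brands: list[str]) -> list[str]:
    """Brands that appear in the answer, ordered by where each first appears."""
    found = []
    for brand in brands:
        match = re.search(rf"\b{re.escape(brand)}\b", answer, flags=re.IGNORECASE)
        if match:
            found.append((match.start(), brand))
    return [brand for _, brand in sorted(found)]

def position_label(answer: str, brand: str, competitors: list[str]) -> str:
    """Coarse label: 'first', 'among others', 'afterthought', or 'absent'."""
    order = first_mention_order(answer, [brand, *competitors])
    if brand not in order:
        return "absent"
    if order[0] == brand:
        return "first"
    # Assumption: named last of three or more brands counts as an afterthought.
    if len(order) >= 3 and order[-1] == brand:
        return "afterthought"
    return "among others"

# Hypothetical answer and brands.
answer = "Zenith is the usual pick. Orbit and Acme are also worth a look."
label = position_label(answer, "Acme", ["Zenith", "Orbit"])
```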

4. Framing

How are you described? An assistant might mention you accurately, or hedge, or attach a quiet caveat. “X is popular, though some find it expensive” is a very different sentence from “X is the standard choice for teams that need this.” Framing is invisible to classic SEO, and it shapes what a buyer believes before they ever reach your site. Of the four dimensions it is the least measured, and the one with the most room to move.
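Framing needs a human reader, but a crude filter can decide which sentences that reader looks at first. The sketch below pulls out sentences that name the brand and contain a hedging cue. The cue list is a hand-picked assumption, the match is a plain substring check, and "Acme" is a made-up brand; expect false positives and misses.

```python
import re

# Assumption: a short, hand-picked cue list. Extend it for your own category.
HEDGE_CUES = ("though", "however", "expensive", "limited", "some find")

def framing_flags(answer: str, brand: str) -> list[str]:
    """Sentences that mention the brand and contain a hedging cue, for review."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    flagged = []
    for sentence in sentences:
        lower = sentence.lower()
        if brand.lower() in lower and any(cue in lower for cue in HEDGE_CUES):
            flagged.append(sentence)
    return flagged

# Hypothetical answer.
answer = (
    "Acme is popular, though some find it expensive. "
    "Zenith is the standard choice for teams that need this."
)
flags = framing_flags(answer, "Acme")
```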

How to collect the data

Build a prompt set

Write the questions real buyers ask, in natural language, not keywords. “What is the best CDN for live video” rather than “CDN live video.” Cover the funnel: broad category questions, head-to-head comparisons, problem-shaped questions, and branded questions about you specifically. Then freeze the set so you can watch it change over time. Thirty to fifty prompts is a sensible starting range.
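One way to make "freeze the set" concrete is to keep the prompts in an immutable structure and store a fingerprint of the set with every run, so a silent edit shows up as a changed version. The prompt texts, the `stage` labels, and the `fingerprint` helper below are illustrative assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Prompt:
    id: str
    text: str
    stage: str  # "category", "comparison", "problem", or "branded"

# Hypothetical prompts; "Acme" and "Zenith" are made-up brands.
PROMPTS = (
    Prompt("p001", "What is the best CDN for live video?", "category"),
    Prompt("p002", "Acme vs Zenith for live streaming: which is faster?", "comparison"),
    Prompt("p003", "Why does my live stream buffer for viewers abroad?", "problem"),
    Prompt("p004", "Is Acme a good CDN for live video?", "branded"),
)

def fingerprint(prompts) -> str:
    """Stable short hash of the set; store it with every run to detect edits."""
    payload = json.dumps([asdict(p) for p in prompts], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

SET_VERSION = fingerprint(PROMPTS)
```

If you do need to change a prompt, add it under a new id and retire the old one, so earlier weeks stay comparable.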

Choose your engines

ChatGPT, Gemini, Perplexity, Google’s AI Overviews, and Claude all retrieve and synthesize differently. A brand can be strong in one and absent in another. Measure them separately. An average across engines hides the exact thing you need to know, which is where you are winning and where you are missing.
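A small worked example shows how an average can mislead. The numbers are invented, and `engine_a` and `engine_b` are placeholders rather than real engines.

```python
from collections import defaultdict

def presence_by_engine(rows: list[dict]) -> dict[str, float]:
    """Presence rate per engine. Each row needs 'engine' and 'present' keys."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["engine"]] += 1
        hits[row["engine"]] += int(row["present"])
    return {engine: hits[engine] / totals[engine] for engine in totals}

# Invented results: ten answers from each of two placeholder engines.
rows = (
    [{"engine": "engine_a", "present": True}] * 9
    + [{"engine": "engine_a", "present": False}] * 1
    + [{"engine": "engine_b", "present": True}] * 1
    + [{"engine": "engine_b", "present": False}] * 9
)
per_engine = presence_by_engine(rows)                    # 0.9 and 0.1
overall = sum(r["present"] for r in rows) / len(rows)    # 0.5
```

The pooled figure of 50% describes neither engine. The per-engine split is the number you can act on.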

Sample over time

One snapshot is noise. The same prompt can return different answers from one run to the next, and from one week to the next, as models update and sources shift. Run your set on a fixed cadence, weekly or monthly, so you are watching a trend rather than a single moment. The cadence matters more than the volume.
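To watch a trend rather than a moment, pool every run from the same period and compare periods. The sketch below groups runs by ISO week; the dates and outcomes are invented, and weekly bucketing is just one reasonable choice.

```python
from collections import defaultdict
from datetime import date

def weekly_presence(runs: list[tuple[date, bool]]) -> dict[str, float]:
    """Presence rate per ISO week, pooling every run made in that week."""
    buckets = defaultdict(list)
    for run_date, present in runs:
        iso = run_date.isocalendar()  # (ISO year, ISO week, ISO weekday)
        buckets[f"{iso[0]}-W{iso[1]:02d}"].append(present)
    return {week: sum(v) / len(v) for week, v in sorted(buckets.items())}

# Invented runs of one prompt on one engine across two weeks.
runs = [
    (date(2024, 1, 1), True), (date(2024, 1, 2), False),
    (date(2024, 1, 3), False), (date(2024, 1, 4), False),
    (date(2024, 1, 8), True), (date(2024, 1, 9), True),
    (date(2024, 1, 10), False), (date(2024, 1, 11), True),
]
trend = weekly_presence(runs)
```

Any single run here could have gone either way. The move from one in four to three in four across the two weeks is the part worth reading.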

Segment

Break the results down by topic cluster and by region. AI answers vary by geography, and strength in one corner of your category can hide weakness in another. Segmentation is what turns a vague “we are doing okay” into a specific “we own the comparison questions and we are absent from the fundamentals.”
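Segmentation is the same rate calculation, grouped by more than one key at once. In this sketch the `topic` and `region` field names and the sample rows are assumptions, and the output is sorted so the weakest segments come first.

```python
from collections import defaultdict

def presence_by_segment(rows, keys=("topic", "region")):
    """(segment, rate) pairs for each key combination, weakest segment first."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in keys)].append(row["present"])
    rates = {segment: sum(v) / len(v) for segment, v in buckets.items()}
    return sorted(rates.items(), key=lambda item: item[1])

# Invented rows.
rows = [
    {"topic": "comparison", "region": "US", "present": True},
    {"topic": "comparison", "region": "US", "present": True},
    {"topic": "fundamentals", "region": "US", "present": False},
    {"topic": "fundamentals", "region": "DE", "present": False},
]
segments = presence_by_segment(rows)
```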

The mistake that quietly corrupts the data

The most common data mistake has nothing to do with the models. It is in how you tag and filter.

When you group prompts or brands with tags and then filter on them, the logic you use matters more than it looks. Say your prompts carry tags such as “cloud” and “security,” and you filter for either one. You have just pulled every prompt about cloud plus every prompt about security, a much larger, looser set than the prompts that are actually about cloud security. OR logic inflates the count and flatters the result. Your dashboard looks healthy because it is answering an easier question than the one you meant to ask.

The fix is to decide the logic at the tagging stage, not the filtering stage. Build the combined tag you actually mean, “cloud security,” so the AND relationship is baked in and cannot be loosened later by a careless filter. It sounds like a small bit of hygiene. In practice it is the difference between a metric you can trust and one that drifts upward on its own and tells you a comfortable lie.
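The difference is easy to see with sets. This is a minimal sketch with invented prompt ids and tags; the `cloud-security` tag name is my own choice for the combined tag.

```python
# Invented prompts, each with the set of tags attached to it.
prompts = {
    "p1": {"cloud", "security"},
    "p2": {"cloud"},
    "p3": {"security"},
    "p4": {"cloud", "pricing"},
}
wanted = {"cloud", "security"}

# OR: any overlap with the wanted tags is enough.
or_match = sorted(pid for pid, tags in prompts.items() if tags & wanted)

# AND: the prompt must carry every wanted tag.
and_match = sorted(pid for pid, tags in prompts.items() if wanted <= tags)

# Baking the AND into one tag at tagging time leaves nothing to loosen later.
combined = {
    "p1": {"cloud-security"},
    "p2": {"cloud"},
    "p3": {"security"},
    "p4": {"cloud", "pricing"},
}
baked = sorted(pid for pid, tags in combined.items() if "cloud-security" in tags)

print(or_match, and_match, baked)
```

The OR filter returns all four prompts; the AND filter and the combined tag both return only `p1`. A dashboard built on the first would report four times the coverage of the question you meant to ask.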

From measurement to action

Measurement only matters if it changes what you do. Three moves come straight out of the data.

Find the whitespace: Look for prompts where the whole category is thin, or where one competitor owns the answer and you are absent. That gap is a content brief: the question nobody answers well, or the one a rival answers alone, tells you what to write.

Study the dominant source: When one brand owns the answer to a question that matters to you, find out why. The reason is usually a single well-structured, frequently cited page, not a marketing campaign. That page is a template.

Decide what to build, and how: The measurement tells you which pages to write and, just as important, how to structure them so they are easy to retrieve and quote. Clear definitions, clean headings, a short summary up top, schema that names the entity. The same hygiene that makes a page citable for an assistant makes it useful for a person.

Tooling, honestly

A category of tools now exists to track brand presence across AI engines against a prompt set, sometimes sold as AI visibility or brand radar. Some established SEO platforms are bolting AI-answer tracking onto what they already do. All of them save time once you scale.

You do not need any of them to start. You can send your prompt set to each engine on a schedule, by hand or with a short script, and store the answers in a spreadsheet. The discipline of a fixed prompt set and a fixed cadence is what produces a signal. The tool only makes that discipline cheaper to maintain.
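The "short script" can be very short. The sketch below appends one row per answer to a CSV file that opens in any spreadsheet. It deliberately does not call a real engine: `fake_engine` is a placeholder you would swap for a real call or a manual paste, and the column names are my own assumptions. The demo writes to a temporary directory so it leaves nothing behind.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

FIELDS = ["run_at", "engine", "prompt_id", "prompt", "answer"]

def record_run(path: Path, engine: str, prompts: dict[str, str],
               ask: Callable[[str], str]) -> int:
    """Send each prompt through `ask` and append one CSV row per answer."""
    is_new = not path.exists()
    run_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        for prompt_id, prompt in prompts.items():
            writer.writerow({
                "run_at": run_at, "engine": engine,
                "prompt_id": prompt_id, "prompt": prompt, "answer": ask(prompt),
            })
    return len(prompts)

def fake_engine(prompt: str) -> str:
    """Placeholder; replace with a real call to the engine you are measuring."""
    return f"(stub answer to: {prompt})"

# Demo: two runs of a one-prompt set, appended to the same file.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "answers.csv"
    prompt_set = {"p001": "What is the best CDN for live video?"}
    record_run(out, "engine_a", prompt_set, fake_engine)
    record_run(out, "engine_a", prompt_set, fake_engine)
    with out.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
```

Appending rather than overwriting is the design choice that matters: every run stays on file, so the trend is there when you want it.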

How to start this week

1. Write twenty prompts a real buyer in your category would ask.
2. Run them through two engines today and read the answers properly.
3. For each answer, note three things: were you mentioned, were you cited, and how were you framed.
4. Do it again next week.

That is a measurement program. Everything above is refinement on top of those four steps.
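If you would rather keep those notes in code than in a spreadsheet, they fit in a very small structure. This is a sketch under my own naming: `Note` holds the three things you record per answer, the framing label is whatever you assign by hand, and `changes` lists the prompts whose notes moved between two weeks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    mentioned: bool
    cited: bool
    framing: str  # label assigned by hand, e.g. "recommended" or "hedged"

def changes(last_week: dict[str, Note],
            this_week: dict[str, Note]) -> dict[str, tuple[Note, Note]]:
    """Prompts whose notes differ between two weeks, as (before, after) pairs."""
    return {
        prompt: (last_week[prompt], this_week[prompt])
        for prompt in last_week.keys() & this_week.keys()
        if last_week[prompt] != this_week[prompt]
    }

# Invented notes for two prompts on one engine.
week_1 = {
    "best CDN for live video": Note(True, False, "hedged"),
    "how to cut stream latency": Note(False, False, "absent"),
}
week_2 = {
    "best CDN for live video": Note(True, True, "recommended"),
    "how to cut stream latency": Note(False, False, "absent"),
}
moved = changes(week_1, week_2)
```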

FAQ

Is GEO the same as SEO? They are related but not the same. SEO optimizes for ranked links. GEO and AEO optimize for being retrieved and cited inside a synthesized answer. The content quality that helps both overlaps a lot. The measurement and the targets do not.

How often should I measure? Monthly is a sensible floor. Move to weekly if your category shifts quickly or you are running active experiments and need to see them land.

Do I need a paid tool to do this? No, not to start. A fixed prompt set and a spreadsheet will carry you a long way. Tools earn their place when you outgrow what is comfortable to run by hand.

Teams that start measuring now will keep finding gaps their competitors cannot see yet. The framework is simple on purpose. The hard part is starting, and the cost of starting is one afternoon and twenty good questions.