Almost Timely News: 🗞️ How to Test AI Models (2025-08-10) :: View in Browser
The Big Plug
👁️ Catch my opening keynote from the American Marketing Association called Never Think Alone: How AI Has Changed Marketing Forever.
👉 My new book, Almost Timeless: 48 Foundation Principles of Generative AI is now available!
Content Authenticity Statement
100% of this week's newsletter was generated by me, the human. You will see bountiful AI outputs in the video. Learn why this kind of disclosure is a good idea and might be required for anyone doing business in any capacity with the EU in the near future.
Watch This Newsletter On YouTube 📺
Click here for the video 📺 version of this newsletter on YouTube »
Click here for an MP3 audio 🎧 only version »
What's On My Mind: How I Do AI Model Testing
This week, let's talk about the rollout of OpenAI's GPT-5 model and how I test AI models. Hopefully by the end of this newsletter you'll have a sense of what AI model testing is, why you should do it, and how to get started.
Part 1: Why Test AI Models?
To start, all generative AI models are probabilistic in nature. That means that the way they function is never 100% guaranteed. There's always the potential for an element of randomness in them. You can give the same prompt to a large language model or to an image generation model and get similarly themed responses, but you'll almost never get the same response twice.
To ensure that we're getting at least the correctly themed responses, we need to test. We need to make sure that the models are going to deliver results that are within the boundaries of what we deem acceptable output. That can mean responses using certain language, using a certain format, etc.
Beyond that, one of the things we need to understand is whether or not a model is well suited for our purposes. To do so, we should develop tests that will help us understand a model's capabilities. All the major AI models are subjected to a set of standard benchmarks. You'll see crazy names like MMLU-Pro, GPQA Diamond, Humanity's Last Exam, SciCode, IFBench, AIME 2025, AA-LCR, and others.
These benchmarks are good and they are important for doing apples to apples comparisons of models for very common use cases or for specific domains. However, these tests and benchmarks aren't good at testing our specific use cases for generative AI.
For example, a fiction writer will care a lot more about adherence to tone and style than a Python coder. A Python coder will care far more about code syntax and proper coding methods than a journalist. None of the major benchmarks could possibly test all of the use cases that we use generative AI for in our work.
In turn, that means that if a model performs poorly on some benchmarks, it doesn't necessarily mean that that model is a poor choice for our work. It simply means it didn't do well on that test. As anyone who has ever hired employees knows, just because someone has a great grade point average doesn't mean they're going to be a good employee.
Generally, if a model does poorly across all the benchmarks it's given compared to other models, that's a pretty good indicator that it's not a great AI model. But there may still be use cases where it has above-average capability in a very specific, narrow domain.
That's why we test. That's why we build our own tests so that we can evaluate models against the things that matter to us and the tasks that we perform with generative AI. When we have our own tests, we can evaluate models apples to apples against our own use cases and our own data. This has the added advantage of ensuring that as long as we don't publish our tests in public, AI model making companies will not be able to cheat on our specific tests.
This has been something of a problem in the last couple of years as companies have built models specifically to pass the benchmarks, but once they are deployed in real life, it turns out those models kind of suck. (Meta Llama 4, looking at you)
Part 2: What Am I Testing?
Let's move into the specific tests that I conduct. As above, the tests that I use are specific to my use cases and to the way that I want AI to work for me. My tests probably aren't applicable to anyone but me.
I have two sets of tests. The first set is a short competency test that I will often post publicly, looking at things like current knowledge availability, reasoning, problem solving, mathematical understanding at a basic level, and coding.
For Trust Insights clients who order it, I also have a longer battery of tests that pushes models on the seven major use case categories of generative AI, as well as testing for biases, so that we can evaluate whether a model is well suited to a wide variety of tasks.
For my short competency test, these are the basic challenges:
Current knowledge of a broad topic. I'll ask a question about some relatively current topic that's happened in the last couple of years, such as the current status of the illegal Russian invasion of Ukraine. This is an exceptionally useful test because there's a lot of public knowledge about it, and I can judge where the model's knowledge cutoff actually is.
Current knowledge of a specific topic. I'll often ask a question about the most recent issue of this newsletter that a model knows about and what the topic is. This tests its ability to use tools, if they're available, and if not, to at least understand whether or not it even knows who I am and my domain.
Mathematical exercise. I'll often ask a combined mathematical exercise that requires two or more stages of mathematical computation in tandem to see if a model can keep the numbers straight. For example, I might ask it to adjust the sale price of an item and convert it from dollars to Euros, or rescale a recipe and convert it from imperial to metric. Whatever the mathematical problem is, it's something that a basic calculator could do, but a transformers-based model is going to have a hard time doing if it doesn't invoke tools properly.
Reasoning exercise. I'll ask a question that requires subject matter knowledge and the ability to reason through it and think through specific details, such as making a food substitution in a recipe, or choosing a different marketing channel when a marketing channel is not performing well. To do this exercise a model has to understand the domain, and has to understand the substitutions or changes it needs to make to get things working again.
Coding exercise. I'll give a model a deliberately short prompt that is insufficient in detail but has specific keywords that should invoke a complete response: build a game, build a calculator, or build something that runs in the browser and has both internal and external dependencies. To do this well, the model has to have knowledge of those external dependencies and invoke them properly.
Writing exercise. I'll give a model a specific writing style and a prompt plus some data, have it generate text based on that writing style, and judge how well it adheres to the writing style. Writing style itself can condition how well a model writes, so by providing my writing style in a very specific format (YAML), I can see whether or not the model has solid writing fluency to turn insufficient data and a robust writing style into a solid product.
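To make that last exercise concrete, here's a minimal sketch of what a writing style spec might look like and how it gets composed into a test prompt. The YAML field names and values below are hypothetical placeholders for illustration, not my actual style file; the snippet is plain Python with no extra dependencies.

# Minimal sketch: a hypothetical YAML writing-style spec composed into a test prompt.
# The field names and values are illustrative placeholders, not my actual style file.

STYLE_SPEC = """
voice:
  person: first person singular
  tone: conversational, direct, occasionally wry
structure:
  paragraphs: short, two to four sentences
  openers: lead with the point, not a preamble
vocabulary:
  jargon: explain on first use
  banned_phrases:
    - "in today's fast-paced world"
    - "unlock the power of"
"""

def build_writing_test_prompt(task: str, data: str) -> str:
    """Combine the style spec, the task, and the source data into one reusable prompt."""
    return (
        "Write in the following style, described in YAML:\n"
        f"{STYLE_SPEC}\n"
        f"Task: {task}\n\n"
        f"Source data:\n{data}\n"
    )

print(build_writing_test_prompt(
    task="Draft a 300-word newsletter section on AI model testing.",
    data="Notes: models are probabilistic; public benchmarks don't cover my use cases.",
))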
These are not comprehensive tests, nor are they appropriate for everyone to use. They're the tests I created for myself because these are the things I care about. There's one additional set of tests that's confidential, but this is a good place to start.
There are also two different ways to test models. The first is to use the web interface that the average consumer uses, and the second is to use the API, which allows you to use the model directly.
The consumer web interface is like a complete car. The engine is inside, but so are a lot of other amenities, like seats, a steering wheel, and a radio. In generative AI, the web interface has a lot of additional rules and conditions that help the average user get more out of the model, but that can also sometimes influence results and create unexpected outputs. I test in the web interface to replicate the experience the average non-technical user is going to have.
The API is like getting just the engine without the rest of the car. You have to provide your own interface, but you can test the model purely by itself, without any hidden system instructions or other niceties the model maker might have put in their web interface. I test via the API to replicate the experience that an app, a programmatic system, or a coding tool will have when using that model.
It's important to test both so that you understand what the raw model itself is capable of and what additional niceties the manufacturer might have layered into the web interface, typically to make things easier for non-technical users.
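To show what the API side of testing looks like, here's a minimal sketch using the OpenAI Python SDK; other providers' SDKs follow a similar pattern. Treat the model name and prompts as placeholders rather than my actual test suite, and note that you supply the entire context yourself, so no vendor web-interface instructions get layered in.

# Minimal sketch of testing a model directly via the API (OpenAI Python SDK shown).
# Assumes the OPENAI_API_KEY environment variable is set; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",  # substitute whichever model you're evaluating
    messages=[
        # You control the entire context; nothing from the consumer web interface is added.
        {"role": "system", "content": "You are being evaluated. Answer directly and completely."},
        {"role": "user", "content": "Rescale this recipe from 4 servings to 6 and convert cups to milliliters: 2 cups flour, 0.5 cup sugar."},
    ],
)

print(response.choices[0].message.content)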
Part 3: Testing Protocols
In terms of actually running the tests, this is relatively straightforward. You make sure that you have all of your prompts and all of the background data stored locally on disk so that you can copy and paste easily or drop files in as needed.
Like laboratory testing, it's important that all the prompts be identical so that when you test from model to model, it is an apples to apples comparison. If you change a prompt in your testing suite, you have to rerun the test on all the models to make sure the results are fair.
It's a good idea to make your prompts robust and difficult enough that you don't have to change them from testing session to testing session. A prompt you designed a year ago should still be usable today.
With every test, I start a new chat to make sure that the results of the previous test aren't influencing the current test. A lot of people make the mistake of doing everything in one chat, and that is generally a bad idea. (this is also a bad idea when you're not testing AI models and just using them - the rule I adhere to is one task, one chat)
Some tests are pass-fail and others are scored on a one to five scale, where five is the best.
I generally keep the results in either a notebook or a spreadsheet so that I can refer to them when I need to. You might want to publish your results, in which case you'd probably want to store them in something a bit easier to read or more robust for larger-scale analysis.
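If it helps to see the bones of that protocol in code, here's a rough sketch of a harness: identical prompts loaded from disk, one fresh conversation per test, and scores appended to a CSV. The file layout, the placeholder run_model function, and the toy scoring rubric are illustrative assumptions, not my exact setup.

# Rough sketch of a testing harness: identical prompts loaded from disk, a fresh
# context per test (one task, one chat), and results appended to a CSV for comparison.
# File names, run_model, and the scoring rubric are illustrative placeholders.
import csv
from datetime import date
from pathlib import Path

PROMPT_DIR = Path("prompts")        # one .txt file per test, reused verbatim across models
RESULTS_FILE = Path("results.csv")  # running log: date, model, test, score

def run_model(model_name: str, prompt: str) -> str:
    """Placeholder: swap in a real API call here (see the SDK sketch above)."""
    return f"[{model_name} response to: {prompt[:40]}...]"

def score_response(test_name: str, response: str) -> int:
    """Toy rubric: pass/fail scored as 0 or 5. Replace with your own 1-5 grading."""
    # Example for a math test: check the response against a reference answer you compute
    # yourself, e.g. a 15% discount on a $40 item converted to euros at an assumed rate
    # of 0.92 per dollar: round(40 * 0.85 * 0.92, 2) == 31.28
    return 5 if "31.28" in response else 0

def run_suite(model_name: str) -> None:
    with RESULTS_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        for prompt_file in sorted(PROMPT_DIR.glob("*.txt")):
            prompt = prompt_file.read_text()          # same prompt, every model, every run
            response = run_model(model_name, prompt)  # fresh call: no prior-test carryover
            score = score_response(prompt_file.stem, response)
            writer.writerow([date.today().isoformat(), model_name, prompt_file.stem, score])

run_suite("model-under-test")  # placeholder model name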
In the video edition of this newsletter, you can check out how I evaluate OpenAI GPT-5 with the short competency test. The verdict: on par with Gemini 2.5, not substantially better or worse.
Part 4: Using Test Results
The big question is, what do you do with the test results once you're done? For me personally, it's just knowing which model to go to for specific tasks or which model is best suited towards my workflow and the things that I want to do with generative AI.
When you spot models that have substantially different capabilities than previous generations, or new models that have substantially greater capabilities than the models you're currently using, that's a sign that it's time to upgrade.
Sometimes, if it's a hot topic like OpenAI's GPT-5 model, I will publish the results or a subset of the results so that other people can benefit from them. But given how often I test models (and the fact that there's almost 2 million models publicly available), most of my results I just keep for myself because... no one's really asking for them. Does anyone really care, for example, what the difference is between Mistral Small 3.2 Dynamic Quant Q6 GGUF and Mistral Small 3.2 Dynamic Quant Q8 GGUF? Probably not.
Inside your organization, publishing your test results is vitally important so that as you're orchestrating your AI deployments, you know which models to recommend to people. You know, given a choice, which models are best suited to specific tasks. You might have a model that you know does particularly well with tool handling and mathematics, so you'd use that for analytics. You might have another model that's a fantastic writer but can't reason for squat, so you'd use that for creative writing and brainstorming, but not for analysis and strategy.
As our selection of AI tools and models increases, having a solid testing strategy and an understanding of which models are best for your specific tasks is going to make your life easier. Yes, you'll have to invest more time in testing up front, but after that you won't have to perpetually wonder or burn mental cycles on what tool to use and when. You'll be able to build your own cheat sheet of which tools to use and when, and share it inside your organization so people don't have to make that decision for every AI task.
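As a trivial illustration, that cheat sheet can be as simple as a shared lookup table; the task categories and model names below are made-up placeholders you'd fill in from your own test results.

# A cheat sheet can be as simple as a shared lookup from task type to recommended model.
# The categories and model names are made-up placeholders drawn from your own test results.
MODEL_CHEAT_SHEET = {
    "data analysis": "model-a",     # scored best on tool handling and math in testing
    "creative writing": "model-b",  # best writing fluency, weaker reasoning
    "coding": "model-c",            # strongest on the coding exercise
}

def recommend(task_type: str) -> str:
    """Return the recommended model for a task, with a sensible default."""
    return MODEL_CHEAT_SHEET.get(task_type, "default-model")

print(recommend("creative writing"))  # -> model-b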
Finally, a critically important use case of test results is to understand if a model's capabilities have changed dramatically. For example, with OpenAI, if GPT-5 generates substantially different results than previous versions of their models, then anyone who's made Custom GPTs needs to go back and take their old system instructions and use the new model to upgrade their instructions so that it performs best with the new model. Every model is best at optimizing prompts for itself. (By the way, this is actually the case. You should be going back and using the OpenAI GPT-5 prompt optimizer on all your old GPTs.)
Part 5: Wrapping Up
Model testing isn't for everyone, nor does everyone need to do it. If you are responsible for AI deployment or recommendation of tools within your organization, or you deeply care about what tools to use in which circumstances for your own use, then model testing is a good idea.
If you are responsible for AI governance in your organization, then testing is pretty much mandatory to make sure that you're evaluating models specifically for the use cases of your organization. And don't forget, if you have specific requirements for compliance, as an example, checking models for biases, that's mandatory. You gotta do it.
The main thing with model testing is to ensure that you're doing it in a fair, statistically valid, and replicable way, so that you're not introducing your own biases into the testing process itself, and you're getting results that someone else could replicate with your tests.
If you do model testing of your own, stop on over to our free Slack group, Analytics for Marketers, and let me know how you do it. I'd love to know.
How Was This Issue?
Rate this week's newsletter issue with a single click/tap. Your feedback over time helps me figure out what content to create for you.
Here's The Unsubscribe
It took me a while to find a convenient way to link it up, but here's how to get to the unsubscribe.

If you don't see anything, here's the text link to copy and paste:
https://almosttimely.substack.com/action/disable_email
Share With a Friend or Colleague
If you enjoy this newsletter and want to share it with a friend/colleague, please do. Send this URL to your friend/colleague:
https://www.christopherspenn.com/newsletter
For enrolled subscribers on Substack, there are referral rewards if you refer 100, 200, or 300 other readers. Visit the Leaderboard here.
Advertisement: Bring Me In To Speak At Your Event
Elevate your next conference or corporate retreat with a customized keynote on the practical applications of AI. I deliver fresh insights tailored to your audience's industry and challenges, equipping your attendees with actionable resources and real-world knowledge to navigate the evolving AI landscape.
👉 If this sounds good to you, click/tap here to grab 15 minutes with the team to talk over your event's specific needs.
ICYMI: In Case You Missed It
This week, Katie and I talked through how AI news is drowning out all other kinds of news, to our detriment.
In-Ear Insights: AI News is Drowning Out Important Other News
The Secret to Smarter AI: What You Need to Ask (That It Won’t Tell You)
Will AI Make You Shine Brighter? How to Stand Out in a World of “Average-ification”
How to Identify and Mitigate the 3 Types of Bias in AI for Fairer Outcomes
How to Measure AI ROI: 3 Key Truths Every Business Leader Must Know
How to Build Authentic Conduits to Outrace AI-Generated Lies Before They Spread
Almost Timely News: 🗞️ Revisiting the Basics of AI (2025-08-03)
Skill Up With Classes
These are just a few of the classes I have available over at the Trust Insights website that you can take.
Premium
Free
Powering Up Your LinkedIn Profile (For Job Hunters) 2023 Edition
Building the Data-Driven, AI-Powered Customer Journey for Retail and Ecommerce, 2024 Edition
The Marketing Singularity: How Generative AI Means the End of Marketing As We Knew It
Advertisement: New AI Book!
In Almost Timeless, generative AI expert Christopher Penn provides the definitive playbook. Drawing on 18 months of in-the-trenches work and insights from thousands of real-world questions, Penn distills the noise into 48 foundational principles—durable mental models that give you a more permanent, strategic understanding of this transformative technology.
In this book, you will learn to:
Master the Machine: Finally understand why AI acts like a "brilliant but forgetful intern" and turn its quirks into your greatest strength.
Deploy the Playbook: Move from theory to practice with frameworks for driving real, measurable business value with AI.
Secure Your Human Advantage: Discover why your creativity, judgment, and ethics are more valuable than ever—and how to leverage them to win.
Stop feeling overwhelmed. Start leading with confidence. By the time you finish Almost Timeless, you won’t just know what to do; you will understand why you are doing it. And in an age of constant change, that understanding is the only real competitive advantage.
👉 Order your copy of Almost Timeless: 48 Foundation Principles of Generative AI today!
Get Back to Work
Folks who post jobs in the free Analytics for Marketers Slack community may have those jobs shared here, too. If you're looking for work, check out these recent open positions, and check out the Slack group for the comprehensive list.
Manager - Marketing Analytics, Channel Performance at Hard Rock Digital
Senior Analyst, Research And Marketing at Clean Power Alliance
Advertisement: Free AI Strategy Kit
Grab the Trust Insights AI-Ready Marketing Strategy Kit! It's the culmination of almost a decade of experience deploying AI (yes, classical AI pre-ChatGPT is still AI), and the lessons we've earned and learned along the way.
In the kit, you'll find:
TRIPS AI Use Case Identifier
AI Marketing Goal Alignment Worksheet
AI Readiness Self-Assessment (5P & 6Cs)
12-Month AI Marketing Roadmap Template
Basic AI ROI Projection Calculator
AI Initiative Performance Tracker
If you want to earn a black belt, the first step is mastering the basics as a white belt, and that's what this kit is. Get your house in order, master the basics of preparing for AI, and you'll be better positioned than 99% of the folks chasing buzzwords.
👉 Grab your kit for free at TrustInsights.ai/aikit today.
How to Stay in Touch
Let's make sure we're connected in the places it suits you best. Here's where you can find different content:
My blog - daily videos, blog posts, and podcast episodes
My YouTube channel - daily videos, conference talks, and all things video
My company, Trust Insights - marketing analytics help
My podcast, Marketing over Coffee - weekly episodes of what's worth noting in marketing
My second podcast, In-Ear Insights - the Trust Insights weekly podcast focused on data and analytics
On Bluesky - random personal stuff and chaos
On LinkedIn - daily videos and news
On Instagram - personal photos and travels
My free Slack discussion forum, Analytics for Marketers - open conversations about marketing and analytics
Listen to my theme song as a new single:
Advertisement: Ukraine 🇺🇦 Humanitarian Fund
The war to free Ukraine continues. If you'd like to support humanitarian efforts in Ukraine, the Ukrainian government has set up a special portal, United24, to help make contributing easy. The effort to free Ukraine from Russia's illegal invasion needs your ongoing support.
👉 Donate today to the Ukraine Humanitarian Relief Fund »
Events I'll Be At
Here are the public events where I'm speaking and attending. Say hi if you're at an event also:
MarketingProfs Working Webinar Series, September 2025
SMPS, Denver, October 2025
Marketing AI Conference, Cleveland, October 2025
MarketingProfs B2B Forum, Boston, November 2025
There are also private events that aren't open to the public.
If you're an event organizer, let me help your event shine. Visit my speaking page for more details.
Can't be at an event? Stop by my private Slack group instead, Analytics for Marketers.
Required Disclosures
Events with links have purchased sponsorships in this newsletter and as a result, I receive direct financial compensation for promoting them.
Advertisements in this newsletter have paid to be promoted, and as a result, I receive direct financial compensation for promoting them.
My company, Trust Insights, maintains business partnerships with companies including, but not limited to, IBM, Cisco Systems, Amazon, Talkwalker, MarketingProfs, MarketMuse, Agorapulse, Hubspot, Informa, Demandbase, The Marketing AI Institute, and others. While links shared from partners are not explicit endorsements, nor do they directly financially benefit Trust Insights, a commercial relationship exists for which Trust Insights may receive indirect financial benefit, and thus I may receive indirect financial benefit from them as well.
Thank You
Thanks for subscribing and reading this far. I appreciate it. As always, thank you for your support, your attention, and your kindness.
See you next week,
Christopher S. Penn




One challenge I see happening is “tool lock” at companies where they’ve decided there is “one tool to rule them all,” regardless of progress in other tools. The results of testing a new model might fall on deaf ears if a company is ride-or-die with a single tool. While I find it generally lagging in a lot of ways, Copilot at least tries to deal with this with its internal model switching.
On the other hand, as you’ve said before, all the popular models/tools are pretty darn good, and for most people it’s okay to get familiar with a single tool (while still keeping an eye on the horizon).