Why Your Workout App Can't Tell You Why: The 5-Question Explainability Test for AI Coaches
Most workout apps adapt. Few explain. A 5-question test to tell whether an AI coach actually reasons about your training — or just hands you a workout.
SensAI Team
12 min read
Get a training plan that adapts to your recovery — free on iOS
“Why Does My Workout App Tell Me to Do This?”
It’s Monday morning. Your app opens to a screen that says Push day — incline dumbbell press, 4 × 8, RPE 7. Cable fly, 3 × 12. Tricep pushdown, 3 × 15. You slept badly. Your knee twinged on Saturday. You ran twelve miles yesterday because the weather was finally good. You stare at the workout, and a small, reasonable question forms.
Why this?
Tap the prescription. There’s no answer. Maybe a tooltip about “auto-progression.” Maybe a percentage claiming your chest is 87% recovered. Nothing that connects last week’s data to today’s plan in a sentence you can read.
This is the gap. In 2026, the question that separates a real AI coach from a dressed-up logger isn’t does it adapt? — almost everyone claims that. It’s can it explain? Because adaptation without explanation is just a slot machine that happens to give you barbells.
The category-defining feature of an AI fitness coach this year is rationale on demand: the app can tell you, in plain English, why today’s session looks the way it does, what data point drove the decision, and what would change its mind. Most apps can’t do this. The rest of this post is about why, what’s architecturally different about the ones that can, and a 10-minute test you can run on any app to find out.
Adaptive ≠ Explained: The Distinction Most “AI” Apps Blur
“Adaptive” and “explained” are two different properties, and most workout apps marketing themselves as AI conflate them. An app can adapt your training without ever telling you why. It can also explain a fixed plan without truly adapting. The distinction matters, and it’s architectural.
Three architectures dominate today’s “AI” workout apps, and only one of them produces rationale as a native output.
Rule engines. A pile of if/then statements written by a coach or product manager. If user logged RPE 9 last session, drop next session by 5%. These can adapt — sometimes very well — but explanations are limited to whatever rule fired, and the explanation is hard-coded by whoever wrote the rule. You can’t ask follow-ups in plain English. Most “smart” workout apps in 2026 are still here.
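To make the limitation concrete, here is a minimal sketch of what a rule-engine adjustment looks like in code. The thresholds, field names, and canned strings are illustrative, not taken from any shipping app:

```python
# A toy rule-engine adjustment. Every "explanation" it can ever give is a
# string someone wrote in advance, attached to the rule that fired.
def adjust_load(planned_load: float, last_session: dict) -> tuple[float, str]:
    if last_session.get("rpe", 0) >= 9:
        return planned_load * 0.95, "Last session hit RPE 9, so load drops 5%."
    if last_session.get("missed_reps", 0) > 0:
        return planned_load, "You missed reps last time, so load holds steady."
    return planned_load * 1.025, "All reps completed, so load goes up 2.5%."

load, reason = adjust_load(100.0, {"rpe": 9, "missed_reps": 0})
print(load, reason)  # 95.0, plus whatever sentence the rule's author hard-coded
```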
ML rankers. Classical machine-learning models that score and rank exercise candidates based on your history. Fitbod’s recovery model is a well-known example. These can adapt to surprisingly subtle patterns. But the model itself is a black box — you’d need post-hoc explanation methods like SHAP or LIME to surface the why, and almost no consumer app exposes those to end users.[1][2] Cynthia Rudin at Duke has argued this is exactly the wrong direction for high-stakes decisions: bolt-on explanations approximate the model rather than describing it, which means the explanation can be wrong even when the prediction is right.[3]
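The same kind of sketch for a ranker: a toy classifier trained on synthetic data with invented feature names, purely to show that its native output is a score rather than a reason:

```python
# A toy recovery-style ranker: train a classifier on synthetic data, then
# score exercise candidates. The native output is a probability; explaining
# it requires a second system (SHAP, LIME) layered on afterwards.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))               # invented features: days_since_muscle, recent_volume, hrv_delta
y = (X[:, 0] + rng.normal(size=400)) > 0.5  # synthetic "productive session" label
model = LogisticRegression().fit(X, y)

candidates = {"bench press": [4.0, -0.2, 0.1], "deadlift": [1.0, 1.5, -0.8]}
for name, feats in candidates.items():
    score = model.predict_proba([feats])[0, 1]
    print(f"{name}: {score:.2f}")           # a score, not a reason
```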
LLMs (large language models). ChatGPT, Claude, and the other systems most people now mean when they say “AI.” These don’t just produce a recommendation — they produce language about the recommendation as native output. You can ask “why?” and get a paragraph that traces the reasoning back to the inputs. You can challenge the answer in plain English. You can give new information mid-conversation and watch the plan revise.
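A minimal sketch of that conversational loop, using the OpenAI Python SDK as one example backend. The model name, system prompt, and data fields are illustrative, and none of this is SensAI’s actual implementation:

```python
# The coach's verified data summary goes into the conversation; the plan and
# its rationale come back as language, and "why?" is simply the next message.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

data_summary = (
    "Yesterday: 12-mile run. Sleep last night: 5.5 h. "
    "Morning HRV: 6% below 14-day baseline. "
    "Week 4 of an upper-body hypertrophy block."
)
messages = [
    {"role": "system", "content": "You are a strength coach. Ground every recommendation in the athlete's data."},
    {"role": "user", "content": f"My data: {data_summary}\nWhat should today's session be, and why?"},
]

reply = client.chat.completions.create(model="gpt-4o", messages=messages)
print(reply.choices[0].message.content)

# Challenging the plan is just another turn in the same conversation.
messages.append({"role": "assistant", "content": reply.choices[0].message.content})
messages.append({"role": "user", "content": "I actually feel great today. Can we go heavier?"})
revision = client.chat.completions.create(model="gpt-4o", messages=messages)
print(revision.choices[0].message.content)
```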
| Architecture | Adapts? | Native rationale? | Plain-English challenge? | Examples |
|---|---|---|---|---|
| Rule engine | Yes, narrowly | No (canned strings only) | No | Most “smart” template apps |
| ML ranker | Yes | No (black box) | No | Fitbod-style recovery scoring |
| LLM-coach | Yes | Yes (native output) | Yes | SensAI, custom GPTs over health data |
That third row is the new category. It didn’t exist in any meaningful consumer form three years ago. The reason isn’t that someone finally bolted explanations onto a fitness app — it’s that LLMs gave us, for the first time, a system whose output is already an explanation. Reasoning is the artifact. (For a deeper look at how recovery-aware adaptation works in practice, the workout app fatigue adjustment guide walks through Tier 0 to Tier 2 systems. For the architectural side, see how logger-class apps like Hevy, Strong, Fitbod, and Jefit compare on actual decision-making.)
The 5-Question Explainability Test
Five questions will tell you, in about ten minutes, whether the app you’re using is a real coach or a confident-looking template engine. Type each one into your app and read the answer literally. Vague is a fail.
1. Can it cite the reason for today’s workout?
Ask: “Why did you give me this workout today?”
Pass: A specific narrative that ties today’s session to a goal, a recent input, and a forward-looking plan. “You’re four weeks into a hypertrophy block targeting upper-body volume. Last week you hit your prescribed sets with RPE 7-8 across the board, so I’m bumping volume by 10% this week. Today is the second push session of the week — Friday will be heavier, lower-rep.”
Fail: “This is your scheduled chest day” or no answer at all. Most rule-based apps fail here because the “reason” is just the template position, and the app has no language layer to verbalize it.
2. Can it cite the specific data point that drove the decision?
Ask: “What in my data made you choose this volume / intensity / exercise?”
Pass: It names the input. “Your 7-day average HRV is 8% below your 30-day baseline, and you reported poor sleep two of the last three nights, so I dropped the heavy compound and kept the volume work.” If it can cite a number you can verify in your wearable app, that’s the gold standard.
Fail: A generic “based on your recent activity” with no specific signal. This is the most common failure mode in 2026. The app did use your data, but it can’t tell you which slice mattered.
3. Can it be challenged in plain English?
This is the LLM tell. Ask: “I disagree — I actually feel great today and want to go heavier. Should I?”
Pass: A coach-like response that engages with your input, acknowledges the conflict between subjective feel and objective signals, and gives you a defensible recommendation. It might agree, might push back, might compromise. Crucially, it reasons with you. SensAI’s coach is built around this exact loop — the user can challenge the prescription mid-workout via the chat interface and the program adjusts on the fly.
Fail: A blank wall. The app has no way to receive a natural-language challenge because the prescription was generated by a ranker or rule with no conversational layer. You can change the workout — you just can’t talk about the change.
This question matters more than the others because it tests whether explanation is dialogue or monologue. Lakkaraju and colleagues at Harvard interviewed clinicians and policymakers about what they actually wanted from explainable AI, and the answer was unambiguous: not static feature importances, but interactive natural-language dialogue about the model’s reasoning.[4] Doctors wanted to treat the model as another colleague — one they could ask follow-ups. Strength athletes want the same thing.
4. Does it remember last week — and the week before?
Ask: “What did we change in my program over the last month, and why?”
Pass: A multi-week narrative. “Three weeks ago we shortened your sessions because you flagged time pressure. Two weeks ago we deloaded the lower body after your knee soreness. Last week we re-introduced full-volume legs at 80% and you tolerated it well, so this week I’m back to your normal load.”
Fail: Nothing, or a recap of last session only. Memory is the foundation of coaching, and most apps don’t have it. They have logs, which is a different thing.
LLM-class context windows are now large enough to hold months of training history in a single conversation. Anthropic’s Claude family ships with context windows up to 200K tokens (and a 1M-token tier for some models),[5] and OpenAI’s GPT-4o supports a 128K-token window.[6] That’s more than enough to give the model a working memory of your entire training season. The constraint is no longer technical — it’s whether the product team chose to wire that memory into the coaching loop.
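Some rough arithmetic makes the point. The per-session token count below is an assumption, not a measurement:

```python
# Back-of-envelope: how much context a season of training logs consumes.
sessions_per_week = 4
weeks = 26                  # roughly one training season
tokens_per_session = 200    # assumed: exercises, sets, loads, RPE, a short note

season_tokens = sessions_per_week * weeks * tokens_per_session
print(season_tokens)        # 20,800 tokens: a small fraction of a 128K or 200K window
```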
5. Will it change its mind when you give it new information?
Ask: “I just rolled my ankle on the warm-up. Replan the next 20 minutes.”
Pass: The app produces a revised session inside the same conversation. New exercises that load only the unaffected joints. A note about what to monitor. A flag for tomorrow’s plan. The change is grounded in the new input.
Fail: A canned “consult a medical professional” deflection, or a non-response. (A safety disclaimer is fine — what you’re testing for is whether the app also produces a usable adaptation.)
This test is borderline diagnostic for whether you’re talking to an LLM-based system at all. Static apps cannot improvise around an injury you didn’t pre-declare in a settings menu. A real AI coach can.
Why Most Apps Fail the Test (The Architectural Reason)
This isn’t a marketing problem — it’s a build problem. Most workout apps fail the explainability test because of how they were architected, not because their teams forgot to add a “why” button.
A rule engine can only produce explanations that someone wrote in advance. If a user’s question doesn’t match a pre-written template, the engine has nothing to say. You can scale the rule library to thousands of branches and still hit dead ends in the first five minutes of an actual conversation. (See the science of AI workout personalization for how systems vary in adaptive depth.)
ML rankers fail differently. The model produces a number — next exercise: deadlift, score 0.84 — but the number isn’t an explanation of itself. To get an explanation you need a second system layered on top: SHAP values, LIME approximations, attention visualizations.[1][2] These methods exist and they work for engineers debugging models. They are almost never exposed to end users, because the output is a chart of feature contributions, not a sentence. Eric Topol has been blunt about the gap between model outputs and clinical communication, arguing that the next generation of AI in healthcare will succeed or fail on whether it can carry on a multimodal conversation with the human in front of it, not on benchmark numbers.[7] The same point applies to fitness.
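Here is roughly what that bolt-on layer looks like: a toy model on synthetic data, explained post hoc with the SHAP library. The feature names are invented for illustration:

```python
# Post-hoc explanation of a black-box score: SHAP attributes one prediction
# to each feature as a signed number. Useful for an engineer; not a sentence.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
features = ["days_since_chest", "sleep_hours", "hrv_delta", "last_session_rpe"]
X = rng.normal(size=(500, 4))
y = 0.6 * X[:, 0] - 0.3 * X[:, 3] + rng.normal(scale=0.1, size=500)  # synthetic readiness target

model = GradientBoostingRegressor().fit(X, y)
explainer = shap.TreeExplainer(model)
contributions = explainer.shap_values(np.array([[2.0, -1.0, -0.5, 1.5]]))

for name, value in zip(features, contributions[0]):
    print(f"{name}: {value:+.3f}")  # per-feature contributions in score units, not a "why"
```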
Lundberg and colleagues, who built SHAP, showed in a Nature Biomedical Engineering paper that giving anesthesiologists explanations alongside a hypoxemia prediction model meaningfully improved trust and decision quality.[8] But that took a custom-built UI, hours of clinician onboarding, and a research team. That is not how shipping consumer apps get built.
LLM coaches don’t need a second explanation system because the explanation is the output. The model produces language all the way down. That’s the architectural unlock.
The flip side deserves honest framing: LLMs can also confabulate, which is why a real LLM coach has guardrails — verified data inputs from your wearable, retrieval over your actual training history, and explicit constraints on what it will and won’t recommend.[9] Bickmore and colleagues showed in a now-classic JMIR study that conversational assistants without medical guardrails produced unsafe medical advice in a meaningful fraction of cases.[10] The lesson is not that conversational AI is unsafe — it’s that the system has to be designed for the domain, not bolted together from a generic chat layer.
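One concrete, deliberately crude example of such a guardrail, sketched with invented field names: before the rationale is shown, check that every number it quotes appears in the verified data that was actually fed to the model:

```python
# Reject a rationale that quotes numbers absent from the user's verified data.
# Real systems check far more (units, ranges, paraphrase); this is only the idea.
import re

def cites_only_verified_numbers(rationale: str, verified: dict[str, float]) -> bool:
    cited = {float(n) for n in re.findall(r"\d+(?:\.\d+)?", rationale)}
    return cited.issubset(set(verified.values()))

verified = {"hrv_delta_pct": 8.0, "poor_sleep_nights": 2.0, "lookback_nights": 3.0}
print(cites_only_verified_numbers(
    "Your HRV is 8% below baseline and you slept poorly 2 of the last 3 nights.",
    verified,
))  # True; a confabulated "15% drop" would fail the same check
```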
What an Explained Coach Actually Looks Like in Practice
Here’s a Monday morning, annotated.
7:02 AM. You open the app. The home screen says: “Today: 35 min, upper body, push focus, RPE target 7. Reasoning: you logged a 12-mile run yesterday and your morning HRV came in 6% below your 14-day baseline, so I’m keeping today shorter and avoiding heavy compound legs. Tuesday’s lower body session is unchanged.”
That’s a real explanation. Two sentences, three data points (yesterday’s run, HRV, the 14-day baseline), and one forward-looking commitment (Tuesday is unchanged). You don’t need a chart. You don’t need to interpret a recovery percentage. You can read it.
7:04 AM. You tap the workout, see the first exercise, and type into chat: “Can I swap incline DB press for landmine press? Shoulder is a little cranky.”
The reply: “Yes — landmine press loads the shoulder more conservatively at the same volume target. I’ll swap that in for sets one through four. If the shoulder still bothers you on set two, message me and we’ll move to chest-supported rows for the remainder.”
That’s a coach. Not because the response is fancy — it’s actually quite plain — but because every part of the response is grounded. The swap rationale is real (landmine press genuinely changes the shoulder loading angle). The fallback is contingent. The whole thing is one English exchange, not a settings menu.
This is what SensAI is built for: the data layer (HealthKit, HRV, sleep, training history) feeds an LLM coach that produces both the program and the explanation as a single output. The user gets the workout and the why, in the same place, in the same language they’d use with a human coach.
When Explanation Actually Matters (And When It Doesn’t)
Explainability is high-value for some users and approximately zero for others. It depends on where you are in the training journey.
The beginner. High value. A lifter in their first six months has no internal reference for what’s normal — soreness, fatigue, plateaus, deloads. Without explanation, they can’t tell whether the app is being smart or being random. The rationale is the entire educational layer. They learn how training works by reading why their program changes.
The intermediate adapting around real life. High value. The 35-year-old who travels for work, has two kids, and trains around chronic shoulder impingement needs the program to bend to their constraints constantly. Explanation lets them trust the bends. Without it, every modification feels like the app is just guessing. The workout-results diagnostic guide is essentially built for this reader.
The advanced lifter on a fixed program. Low value, sometimes negative. Someone running a structured 12-week peaking block with a known coach doesn’t want their app to negotiate. They want to log the lift and move on. For this user, an explanation-heavy app can feel like noise. This is the smallest population by user count, but it’s a real one — and the right answer is to give them a “quiet mode” rather than to under-build for everyone else.
The rule: the value of explainability scales with the rate of change in the program. If your training never changes, you don’t need explanations. If it changes every day based on a dozen inputs, explanations are the difference between a coach and a coin flip.
How to Test an App in 10 Minutes
Run this test on any app you’re evaluating, including SensAI. Pass = at least four of five.
1. Open the app and find today’s workout. Tap the “why” affordance, if there is one. Read what it says. Pass if the reason cites a specific recent input. Fail if it says “scheduled” or “auto-progression” with no further detail.
2. Ask, in chat or whatever input the app provides: “What data point drove today’s volume?” Pass if it names a specific signal (HRV, sleep, last session RPE, recent volume). Fail if it gives a generic answer.
3. Disagree with the prescription. Type: “I feel great, can we go heavier?” Pass if it engages with the conflict and gives a reasoned response. Fail if it ignores you or just changes the weight without comment.
4. Ask for a 4-week program retrospective. “What changes did you make to my plan over the last month, and why?” Pass if it produces a multi-week narrative. Fail if it gives last session only or nothing.
5. Throw in a curveball. “I rolled my ankle in warm-up. Replan.” Pass if it produces a revised session. Fail if it deflects entirely.
If the app passes 4-5: you have an LLM-class coach. Trust the explanations and use them. If it passes 2-3: you have a hybrid. Useful for adaptation, weak for reasoning. If it passes 0-1: you have a logger or a rule engine. Fine for tracking, not a coach. (For a 2026 buyer’s-guide view of which apps fall where, see the best AI personal trainer apps for 2026 roundup.)
A note on what we’re testing for. We’re not testing whether the app can adapt — almost all of them claim to, and many do at some level. We’re testing whether the app can defend its adaptations in plain language. Susan Michie at UCL, whose behavior change technique taxonomy is one of the most-cited frameworks in digital health, has shown that providing rationale and feedback is among the most reliable predictors of behavior change in coaching contexts.[11] Random adjustment without rationale doesn’t change behavior. Explained adjustment does. A 2019 systematic review and meta-analysis of smartphone-based physical activity apps found that the apps producing meaningful behavior change were the ones that combined real personalization with feedback the user could act on — and that the effect was modest in apps that didn’t.[12] Explanation isn’t a luxury feature. It’s part of the active ingredient.
The Bigger Picture: Coaching Is a Conversation
A human coach explains. An algorithm executes. The new category — LLM-based fitness coaches like SensAI — is the first software that does both in the same breath. Not because someone bolted “explanations” onto an algorithm, but because the underlying technology speaks language as its native output.
The right test for an AI coach in 2026 isn’t “does it know what to prescribe?” — that’s table stakes. The right test is “can it tell me why, and will it argue with me when I push back?” If yes, you have a coach. If no, you have a workout that someone else picked.
Ask the next workout you get: why?
If the app can answer, you’re in the right place. If it can’t, you already know what to do.
References
1. Lundberg, Scott M.; Lee, Su-In. “A Unified Approach to Interpreting Model Predictions.” NeurIPS 2017 / arXiv:1705.07874, 2017. https://arxiv.org/abs/1705.07874
2. Doshi-Velez, Finale; Kim, Been. “Towards A Rigorous Science of Interpretable Machine Learning.” arXiv:1702.08608, 2017. https://arxiv.org/abs/1702.08608
3. Rudin, Cynthia. “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.” Nature Machine Intelligence 1(5):206-215, 2019. https://www.nature.com/articles/s42256-019-0048-x
4. Lakkaraju, Himabindu; Slack, Dylan; Chen, Yuxin; Tan, Chenhao; Singh, Sameer. “Rethinking Explainability as a Dialogue: A Practitioner’s Perspective.” arXiv:2202.01875, 2022. https://arxiv.org/abs/2202.01875
5. Anthropic. “Claude — Models and pricing.” Anthropic product documentation, 2026. https://www.anthropic.com/claude
6. OpenAI. “Hello GPT-4o.” OpenAI product release page, 2024. https://openai.com/index/hello-gpt-4o/
7. Topol, Eric J. “As artificial intelligence goes multimodal, medical applications multiply.” Science 381(6663):eadk6139, 2023. https://www.science.org/doi/10.1126/science.adk6139
8. Lundberg, Scott M.; Nair, Bala; Vavilala, Monica S.; et al. “Explainable machine-learning predictions for the prevention of hypoxaemia during surgery.” Nature Biomedical Engineering 2:749-760, 2018. https://www.nature.com/articles/s41551-018-0304-0
9. Eysenbach, Gunther. “The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers.” JMIR Medical Education 9:e46885, 2023. https://mededu.jmir.org/2023/1/e46885
10. Bickmore, Timothy W.; Trinh, Ha; Olafsson, Stefan; et al. “Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant.” Journal of Medical Internet Research 20(9):e11510, 2018. https://www.jmir.org/2018/9/e11510
11. Michie, Susan; Richardson, Michelle; Johnston, Marie; et al. “The behavior change technique taxonomy (v1) of 93 hierarchically clustered techniques: building an international consensus for the reporting of behavior change interventions.” Annals of Behavioral Medicine 46(1):81-95, 2013. https://pubmed.ncbi.nlm.nih.gov/23512568/
12. Romeo, Aaron; Edney, Sarah; Plotnikoff, Ronald; et al. “Can Smartphone Apps Increase Physical Activity? Systematic Review and Meta-Analysis.” Journal of Medical Internet Research 21(3):e12053, 2019. https://www.jmir.org/2019/3/e12053