voice · nlp · prompt engineering

Conversational AI Coach

A voice-first coaching assistant that simulates reflective, ICF-style conversations across three languages — built on Whisper, GPT, and OpenAI TTS, deployed on Hugging Face Spaces.

3 · languages supported
ICF · coaching framework
6 · pipeline layers
HF · hosted on Spaces

Why This Project

Most AI assistants are optimised for answers. Professional coaching works the other way — it uses open questions, reflection, and deliberate restraint to surface what the person already knows.

This project explores whether conversational AI can simulate a more reflective coaching experience through multilingual voice interaction, conversational continuity, and coaching-oriented prompt engineering.

End-to-end conversation flow

Modular six-stage pipeline with separable components.

01
Speak
Microphone input via Gradio
User speaks in English, Hindi, or Kannada. Whisper handles the multilingual transcription; the coaching response language is selected by the user.
02
Transcribe
OpenAI Whisper — multilingual ASR
Whisper converts audio to text in a single multilingual inference call. Outputs the detected language alongside the transcript for downstream prompt construction.
03
Validate transcript
Editable text field — user-controlled submit
The transcript is shown before submission. Nothing reaches GPT until the user confirms. Accent errors, mixed-script confusion, or mispronunciations can be corrected in under two seconds — without breaking the conversation.
04
Generate coaching response
OpenAI GPT + ICF persona system prompt + session history
Full session history is passed with every turn so the model can reference earlier statements. The system prompt enforces ICF coaching behaviour: one question at a time, no unsolicited advice, reflect before probing deeper.
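One plausible shape for that turn assembly; the prompt wording here is a shortened illustration, not the project's actual system prompt:

```python
ICF_SYSTEM_PROMPT = (
    "You are a professional coach following ICF principles. "
    "Ask exactly one open question per turn. Never give unsolicited advice. "
    "Acknowledge emotion before exploring action. Reflect before probing deeper."
)

def build_messages(history: list[tuple[str, str]], user_text: str) -> list[dict]:
    """Assemble the chat payload: system prompt, full session history, new turn."""
    messages = [{"role": "system", "content": ICF_SYSTEM_PROMPT}]
    for user_turn, coach_turn in history:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": coach_turn})
    messages.append({"role": "user", "content": user_text})
    return messages
```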
05
Synthesise voice
OpenAI TTS
The coaching response is converted to spoken audio using OpenAI TTS voice synthesis.
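A minimal sketch of this stage using the OpenAI SDK's streaming speech endpoint. The per-language voice mapping is an invented illustration (though the voice names are real OpenAI TTS voices):

```python
VOICE_FOR_LANGUAGE = {"english": "alloy", "hindi": "nova", "kannada": "shimmer"}  # illustrative mapping

def pick_voice(language: str) -> str:
    return VOICE_FOR_LANGUAGE.get(language.lower(), "alloy")

def synthesise(text: str, language: str, out_path: str = "reply.mp3") -> str:
    """Convert the coaching response to spoken audio and stream it to a file."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK
    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice=pick_voice(language),
        input=text,
    ) as response:
        response.stream_to_file(out_path)
    return out_path
```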
06
Play response
Gradio audio component + session state
Audio is played back through the same Gradio interface. Session state is maintained across turns for conversational continuity.
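The continuity mechanics reduce to a small state object. In the app this history lives in Gradio's session state; the sketch below is framework-agnostic and illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Session-scoped history: held in memory, discarded when the session ends."""
    turns: list[tuple[str, str]] = field(default_factory=list)

    def add_turn(self, user_text: str, coach_text: str) -> None:
        self.turns.append((user_text, coach_text))

    def history(self) -> list[tuple[str, str]]:
        """Everything said so far, passed to the model on every turn."""
        return list(self.turns)
```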

Layer decisions and rationale

The architectural choices that aren't obvious from the flow diagram.

Layer 02 · Why Whisper over other ASR providers
OpenAI Whisper · Indic language support

Other ASR providers were evaluated and failed on Kannada and Hindi code-switching without fine-tuning. Whisper's multilingual model handles all three languages in one inference call with no language detection pre-step. It can also be replaced with a local faster-whisper instance with no interface change if API cost becomes a constraint.

Layer 03 · Why the transcript isn't auto-submitted
User trust · ASR error recovery

Coaching conversations are personal. A model response to a garbled transcript — replying to something the user never actually said — breaks trust immediately and is hard to recover from mid-session. One deliberate confirmation tap is a worthwhile UX trade-off for that reliability.

Layer 04 · ICF behavioural constraints in the system prompt
Prompt engineering · Inspired by ICF-style coaching principles

The system prompt overrides the model's default "helpful assistant" behaviour with hard rules derived from ICF coaching principles — not soft stylistic instructions. Explicit constraints: ask one question per turn, never give unsolicited advice, acknowledge emotion before exploring action, reflect before probing deeper.

Prompt-adjustable coaching style

Empathy level, directness, warmth, questioning style, and coaching framework are independently adjustable through the prompt — no code changes needed.

empathy
How much emotion is validated before moving forward
high / moderate / low
directness
Whether the coach reflects or gently challenges
reflective / challenging
warmth
Language register and relational tone
warm / neutral / formal
question style
Coaching question pattern to follow
Socratic / scaling / miracle
framework
Coaching model the prompt is anchored to
ICF / GROW / Co-Active
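Rendering those knobs into the system prompt can be as simple as string templating. The wording below is illustrative, not the project's actual prompt:

```python
def style_prompt(
    empathy: str = "high",
    directness: str = "reflective",
    warmth: str = "warm",
    question_style: str = "Socratic",
    framework: str = "ICF",
) -> str:
    """Render the five adjustable knobs into a system-prompt fragment."""
    return (
        f"Follow the {framework} coaching framework. "
        f"Empathy: {empathy}. Directness: {directness}. Warmth: {warmth}. "
        f"Use {question_style} questioning, one question per turn."
    )
```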
Layer 06 · Why Gradio and Hugging Face Spaces
Gradio · HF Spaces

A Streamlit or FastAPI frontend would have required building microphone capture and audio playback from scratch. Gradio ships both natively alongside session state management — the right scope for a conversational AI prototype.

HF Spaces provides a free persistent deployment with a shareable URL from day one, automatic restarts, and no server provisioning.

External services and modularity

All OpenAI services. Each has a documented replacement path at the same interface boundary.

Service | Role | Drop-in replacements
OpenAI Whisper | Multilingual speech-to-text | faster-whisper (local), Groq Whisper
OpenAI GPT | Coaching intelligence + session memory | Claude, Gemini, LLaMA via Groq/Ollama
OpenAI TTS | Text-to-speech response delivery | ElevenLabs, Coqui, gTTS (free, lower quality)
Gradio | UI: mic, audio playback, session state | Streamlit + custom audio components
HF Spaces | Hosting and deployment | Streamlit Cloud, Render, Railway, local

Extend it for your use case

The pipeline is modular by design. The session architecture, transcript layer, and prompt scaffolding are reusable as-is — only the content changes.

Revoice the persona

Change the coaching personality

Adjust empathy, directness, warmth, question style, or switch frameworks — GROW, Co-Active, solution-focused. The prompt file is the only thing that changes; the rest of the pipeline is untouched.

Extend languages

Add more languages

Whisper supports 90+ languages natively. Add a language selector to the UI, pass the language= parameter to Whisper, and update the system prompt language instruction. No model changes needed.
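As a sketch, adding a language means one dictionary entry plus Whisper's `language` parameter (ISO-639-1 codes), assuming the OpenAI Python SDK:

```python
LANGUAGE_CODES = {"English": "en", "Hindi": "hi", "Kannada": "kn"}  # ISO-639-1; extend here

def transcribe_in(audio_path: str, ui_language: str) -> str:
    """Pin Whisper to the UI-selected language instead of auto-detecting."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK
    client = OpenAI()
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=LANGUAGE_CODES[ui_language],  # e.g. "kn" for Kannada
        )
    return result.text
```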

Swap the models

Replace any AI service

GPT → Claude or LLaMA, Whisper → Groq or faster-whisper, TTS → ElevenLabs or Coqui. Each layer has a single function boundary — replacing one doesn't touch the others.
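The single function boundary can be made concrete with a structural protocol. The ASR layer is shown here; the LLM and TTS layers would follow the same pattern (names are illustrative, not the project's):

```python
from typing import Protocol

class Transcriber(Protocol):
    """The one-method contract the rest of the pipeline depends on."""
    def transcribe(self, audio_path: str) -> str: ...

class FakeTranscriber:
    """Stand-in backend; a faster-whisper or Groq-backed class would
    satisfy the same contract without touching the pipeline."""
    def transcribe(self, audio_path: str) -> str:
        return f"transcript of {audio_path}"

def run_asr(asr: Transcriber, audio_path: str) -> str:
    # The pipeline sees only the protocol, never a vendor SDK.
    return asr.transcribe(audio_path)
```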

Repurpose the workflow

Different conversation, same architecture

Interview prep coach, journaling companion, language learning tutor, leadership simulation. The voice pipeline and session memory are format-agnostic — any reflective conversation structure fits.

Add persistence

Extend with memory and personalisation

The current architecture is session-scoped by design. Persistent memory, vector database retrieval, session summaries, or personalised coaching profiles can be layered in without restructuring the core pipeline.

# clone and run locally
git clone <repository-url>
cd <repository-directory>
pip install -r requirements.txt
# add your API keys to .env
python app.py

# or push to Hugging Face Spaces for a shareable URL

Designed for a sensitive context

Coaching conversations are personal. The defaults reflect that.

Session-scoped only

Conversation history lives in memory for the duration of the session and is not stored in a database.

No logging by default

Conversations are not stored after the session ends. Persistent storage would require an explicit architectural addition.

Credentials server-side

All API keys are stored as HF Spaces secrets — never sent to the client or visible in browser network traffic.

Not a clinical tool

This is a reflective conversational prototype, not a therapist or mental health service. Not designed for crisis support or clinical use.