<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TTS |</title><link>https://nicolasfbportfolio.netlify.app/tags/tts/</link><atom:link href="https://nicolasfbportfolio.netlify.app/tags/tts/index.xml" rel="self" type="application/rss+xml"/><description>TTS</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 17 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://nicolasfbportfolio.netlify.app/media/icon_hu_3795d420522f6b97.png</url><title>TTS</title><link>https://nicolasfbportfolio.netlify.app/tags/tts/</link></image><item><title>AI Japanese Teacher - Niki sensei</title><link>https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/</link><pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate><guid>https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/</guid><description>&lt;p&gt;This is Niki sensei, a conversational Japanese teacher. It combines multilingual audio transcription with Faster-Whisper, a Japanese-proficient LLM with Qwen3-Swallow-8B, and natural-sounding speech synthesis with Qwen3-TTS-0.6B featuring Ono Anna&amp;rsquo;s voice. On the front end, an expressive Live2D avatar of Hiyori Momose. Behind the scenes, microservice status observability with Prometheus and a Grafana dashboard for energy, latency, and cost monitoring.&lt;/p&gt;
&lt;video width="100%" controls poster="thumbnail.png"&gt;
&lt;source src="niki_demo_clean_audio.mp4" type="video/mp4"&gt;
&lt;/video&gt;
&lt;h2 id="project-architecture"&gt;Project Architecture&lt;/h2&gt;
&lt;p&gt;You can see the structure of the project below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;nihongo_sensei/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── prompts/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ └── tutor_system.txt ← Niki&amp;#39;s system prompt (in Japanese)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── services/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├── shared/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ └── gpu_utils.py ← shared VRAM utility
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├── brain/ ← LLM wrapper (FastAPI + vLLM)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ └── main.py
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├── ears/ ← STT service (Faster-Whisper)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ └── main.py
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├── mouth/ ← TTS service (Qwen3-TTS)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ └── main.py
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├── core/ ← WebSocket orchestrator (FastAPI)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ └── main.py
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ └── monitor/ ← Observability (Prometheus + Grafana)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├── docker-compose.yml
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ └── prometheus.yml
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;└── sensei-ui/ ← React + Vite frontend
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── src/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── App.jsx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── hooks/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ │ ├── useWebSocket.js
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ │ ├── useAudioCapture.js
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ │ └── useAudioPlayback.js
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ └── components/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── Avatar.jsx ← Live2D Hiyori + expressions
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── Subtitles.jsx ← furigana + karaoke effect
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── ChatHistory.jsx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── Controls.jsx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── MicButton.jsx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ ├── LoadingScreen.jsx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ └── StatusBar.jsx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── public/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── live2d/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── Hiyori/ ← Live2D model files
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Each microservice can be deployed on different hardware and accessed via IP and port. Individual services can also be tested with dedicated scripts to diagnose or validate their functionality.&lt;/p&gt;
&lt;h2 id="back-end"&gt;Back End&lt;/h2&gt;
&lt;p&gt;The back end is composed of four Python microservices, each with its own virtual environment managed by &lt;code&gt;uv&lt;/code&gt;, and a shared GPU utility module.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;sensei-ears&lt;/strong&gt; receives raw audio bytes from the client and transcribes them using &lt;em&gt;Faster-Whisper&lt;/em&gt;, a CTranslate2-optimized reimplementation of OpenAI&amp;rsquo;s Whisper by SYSTRAN. It runs the &lt;code&gt;large-v3-turbo&lt;/code&gt; variant in INT8 quantization on the GPU, with VAD (voice activity detection) filtering to ignore silence. It automatically detects the spoken language, supporting Japanese, Spanish, English, and more.&lt;/p&gt;
&lt;p style="color: #ff00ff;"&gt;
&lt;strong&gt;Cool Fact:&lt;/strong&gt; Faster-Whisper indirectly helps me improve my pronunciation. If I speak "carelessly", my audio is transcribed in &lt;em&gt;romaji (konnichiwa)&lt;/em&gt;, meaning the threshold for Japanese language detection wasn't met. If I pay closer attention to my pronunciation, my audio is transcribed in &lt;em&gt;hiragana/katakana/kanji (こんにちは)&lt;/em&gt;, meaning the STT recognized the input as Japanese.
&lt;/p&gt;
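&lt;p&gt;That check can be expressed as a tiny helper (hypothetical, not part of the project): Japanese script occupies well-known Unicode ranges, so a transcription that stayed in Latin letters means Whisper never switched to Japanese.&lt;/p&gt;

```python
import re

# Hypothetical helper, not part of the project. Hiragana (U+3040-309F),
# katakana (U+30A0-30FF), and common kanji (U+4E00-9FFF) cover the
# Japanese scripts Whisper emits when it detects Japanese.
_JAPANESE_SCRIPT = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def detected_japanese(transcript: str) -> bool:
    """True if the transcription contains Japanese script, i.e. the STT
    recognized the audio as Japanese instead of Latin-script romaji."""
    return bool(_JAPANESE_SCRIPT.search(transcript))
```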
&lt;p&gt;&lt;strong&gt;sensei-brain&lt;/strong&gt; wraps a vLLM instance serving &lt;em&gt;Qwen3-Swallow-8B-AWQ-INT4&lt;/em&gt;, a model developed by the Okazaki and Yokota Laboratories at the Institute of Science Tokyo and AIST, built on top of Alibaba&amp;rsquo;s Qwen3 base model and fine-tuned extensively on Japanese. vLLM&amp;rsquo;s prefix caching reaches a ~93% hit rate within a session since the system prompt stays fixed, which significantly reduces latency on consecutive turns. The service extracts an &lt;em&gt;emotion tag&lt;/em&gt; from the model&amp;rsquo;s response before forwarding the clean text downstream. The system prompt defines the following emotion tags (among other instructions that improve the user experience), which drive the avatar&amp;rsquo;s expressions:&lt;/p&gt;
&lt;p&gt;&lt;span style="color:#f6c86e"&gt;[EMOTION:happy]&lt;/span&gt;&lt;br&gt;
&lt;span style="color:#44da37"&gt;[EMOTION:encouraging]&lt;/span&gt;&lt;br&gt;
&lt;span style="color:#3eacac"&gt;[EMOTION:neutral]&lt;/span&gt;&lt;br&gt;
&lt;span style="color:#6957df"&gt;[EMOTION:sad]&lt;/span&gt;&lt;br&gt;
&lt;span style="color:#ff7bbb"&gt;[EMOTION:surprised]&lt;/span&gt;&lt;/p&gt;
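&lt;p&gt;The project&amp;rsquo;s actual parsing code isn&amp;rsquo;t shown here, but splitting such a tag from the model output can be sketched as a minimal illustration:&lt;/p&gt;

```python
import re

# Illustrative sketch, not the project's actual implementation.
EMOTIONS = {"happy", "encouraging", "neutral", "sad", "surprised"}
_TAG = re.compile(r"\[EMOTION:(\w+)\]\s*")

def split_emotion(text: str) -> tuple[str, str]:
    """Extract a leading [EMOTION:...] tag and return (emotion, clean_text).
    Falls back to 'neutral' when the tag is missing or unknown."""
    match = _TAG.match(text)
    if match and match.group(1) in EMOTIONS:
        return match.group(1), text[match.end():]
    return "neutral", text
```

The clean text continues downstream to the TTS, while the emotion travels to the front end to pick the avatar expression.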
&lt;p&gt;&lt;strong&gt;sensei-mouth&lt;/strong&gt; synthesizes speech using &lt;em&gt;Qwen3-TTS-12Hz-0.6B-CustomVoice&lt;/em&gt;, also by Alibaba, with the Ono Anna voice profile. It uses Flash Attention 2 and &lt;code&gt;torch.compile&lt;/code&gt; to reduce synthesis time, and performs a three-language warm-up at startup to pre-compile the model&amp;rsquo;s execution graph. However, it&amp;rsquo;s still the bottleneck of the system, with an RTF of ~1.5 (an RTF of 1.0 or less is needed to keep up with real-time playback).&lt;/p&gt;
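&lt;p&gt;For reference, RTF is simply synthesis time divided by the duration of the audio produced, so values above 1.0 mean synthesis falls behind playback:&lt;/p&gt;

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.
    Below 1.0 the system keeps up with playback; above it, it lags."""
    return synthesis_seconds / audio_seconds

# At an RTF of ~1.5, a 4-second reply takes about 6 seconds to synthesize.
lag = real_time_factor(6.0, 4.0)  # 1.5
```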
&lt;p&gt;The audio is returned as raw WAV bytes, with synthesis time, audio duration, and RTF (real-time factor) reported in the response headers. &lt;strong&gt;IMPORTANT:&lt;/strong&gt; very short texts, isolated symbols, and stray characters can induce audio hallucinations or speech artifacts (babbling, gibberish), so prompt rules and regex cleanup were applied to minimize them.&lt;/p&gt;
&lt;h2 id="front-end"&gt;Front End&lt;/h2&gt;
&lt;p&gt;The front end is a React + Vite single-page application. It communicates with sensei-core exclusively over WebSocket, sending audio as raw binary frames and receiving JSON control messages and WAV audio in return.&lt;/p&gt;
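&lt;p&gt;On the server side of that protocol, the orchestrator only needs to tell the two frame types apart: binary frames are audio, text frames are JSON control messages. A hypothetical sketch of the dispatch (not the project&amp;rsquo;s actual code):&lt;/p&gt;

```python
import json

def route_frame(frame):
    """Dispatch an incoming WebSocket frame the way an orchestrator
    might: binary frames carry raw audio, text frames carry JSON
    control messages. Illustrative sketch only."""
    if isinstance(frame, (bytes, bytearray)):
        return ("audio", bytes(frame))
    return ("control", json.loads(frame))
```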
&lt;p&gt;The UI is organized around five components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LoadingScreen&lt;/strong&gt; shows the warm-up status of each service as they come online.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Loading Screen"
srcset="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/loadingscreen_hu_b2d964b6105394c.webp 320w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/loadingscreen_hu_d793df4358e97530.webp 480w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/loadingscreen_hu_ccada2dd82ee76a7.webp 633w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/loadingscreen_hu_b2d964b6105394c.webp"
width="633"
height="600"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Avatar&lt;/strong&gt; renders Hiyori Momose, the Live2D model, using PixiJS as the WebGL renderer and pixi-live2d-display as the bridge to the Live2D Cubism SDK 5. Hiyori plays idle motions in a loop and switches between the five expressions mentioned earlier — happy, encouraging, neutral, sad, and surprised — driven by the emotion tag from the LLM response.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Subtitles&lt;/strong&gt; renders furigana (the small hiragana reading guides above kanji) as HTML, with a karaoke-style progress bar synchronized to the audio duration.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Subtitles"
srcset="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/subtitles_hu_3acef7ab53164cfc.webp 320w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/subtitles_hu_abd7f045e3a958d5.webp 480w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/subtitles_hu_1d4c6ee517871c6b.webp 756w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/subtitles_hu_3acef7ab53164cfc.webp"
width="756"
height="760"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ChatHistory&lt;/strong&gt; maintains a scrollable transcript of the conversation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MicButton&lt;/strong&gt; is a push-to-talk button that captures audio with the MediaRecorder API and sends it as a single blob when released.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The audio pipeline uses the Web Audio API for playback, decoding the WAV buffer once to extract both the duration for subtitle sync and the audio data for playback.&lt;/p&gt;
&lt;h2 id="monitoring"&gt;Monitoring&lt;/h2&gt;
&lt;p&gt;The observability stack runs in Docker Compose and consists of Prometheus for metric collection and Grafana for visualization. Prometheus scrapes all four services — including the vLLM engine itself — every 15 seconds via &lt;code&gt;host.docker.internal&lt;/code&gt;.&lt;/p&gt;
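&lt;p&gt;A minimal &lt;code&gt;prometheus.yml&lt;/code&gt; consistent with that setup might look like the sketch below. The job name and ports are illustrative assumptions; only the 15-second interval and &lt;code&gt;host.docker.internal&lt;/code&gt; come from the actual deployment:&lt;/p&gt;

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: sensei-services
    static_configs:
      # Ports are illustrative; each service exposes a /metrics endpoint.
      - targets:
          - host.docker.internal:8001   # ears (STT)
          - host.docker.internal:8002   # brain (LLM wrapper)
          - host.docker.internal:8003   # mouth (TTS)
          - host.docker.internal:8000   # vLLM engine metrics
```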
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Prometheus"
srcset="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/prometheus_hu_238ed840893fa810.webp 320w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/prometheus_hu_facde40064ae7a10.webp 480w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/prometheus_hu_a4fbd828e87a92e1.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/prometheus_hu_238ed840893fa810.webp"
width="760"
height="573"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;The Grafana dashboard has 18 panels organized in five rows covering completed and failed turns, error rate, pipeline and per-stage latencies (STT, LLM, TTS), CPU and GPU power consumption, VRAM usage broken down by CUDA process, LLM throughput in tokens per second, TTS real-time factor, and energy cost in COP (Colombian pesos) per hour and per turn.&lt;/p&gt;
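&lt;p&gt;The cost panels reduce to simple arithmetic: average power draw integrated over time gives kWh, multiplied by the electricity tariff. The numbers below are illustrative assumptions, not the dashboard&amp;rsquo;s measured values:&lt;/p&gt;

```python
def energy_cost_cop(avg_power_watts: float, hours: float,
                    tariff_cop_per_kwh: float) -> float:
    """Energy cost in Colombian pesos: watts -> kWh -> COP."""
    kwh = avg_power_watts / 1000.0 * hours
    return kwh * tariff_cop_per_kwh

# e.g. a 300 W average draw for one hour at an assumed 800 COP/kWh:
cost_per_hour = energy_cost_cop(300.0, 1.0, 800.0)  # 240.0 COP
```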
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Grafana"
srcset="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/grafana_hu_e879678f31be3136.webp 320w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/grafana_hu_491a0f7f039faf18.webp 480w, https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/grafana_hu_71ba9e3b8d5c2534.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nicolasfbportfolio.netlify.app/projects/nikisensei-project/grafana_hu_e879678f31be3136.webp"
width="760"
height="725"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;On the RTX 5070 Ti, Qwen3-Swallow-8B AWQ-INT4 sustains &amp;gt;100 tokens per second in a single session, even with VRAM shared among all three models.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Niki sensei demonstrates that a fully local, real-time conversational AI system with voice, language understanding, speech synthesis, and an animated avatar is achievable on consumer hardware — without cloud APIs, subscriptions, or data leaving your machine. The RTX 5070 Ti runs all three models simultaneously within its 16 GB VRAM budget, with roughly 1.3 GB to spare. Despite the obvious limitations of my consumer hardware, I&amp;rsquo;ve had a great time practicing my N5 Japanese with Niki!&lt;/p&gt;
&lt;h2 id="future-plans"&gt;Future Plans&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Anger expression&lt;/strong&gt;: if the student is rude or curses, Niki gets annoyed. This requires a new &lt;code&gt;.exp3.json&lt;/code&gt; expression and a system prompt rule. Although&amp;hellip; cursing in a foreign language is fun, isn&amp;rsquo;t it?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SQL logging&lt;/strong&gt; for memory and traceability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Computer vision&lt;/strong&gt; via a sensei-eyes microservice: gesture detection, object detection&amp;hellip; there are many possibilities.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bubble translation on hover&lt;/strong&gt; — mousing over a dialogue bubble shows a translation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Making my own avatars&lt;/strong&gt; with VROID Studio (available on Steam).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="resources"&gt;Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;If you have suggestions for future improvement, please reach out!&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Project Status&lt;/strong&gt;: ✅ Fully functional prototype. Watch the video below if you missed it!&lt;/p&gt;
&lt;video width="100%" controls poster="thumbnail.png"&gt;
&lt;source src="niki_demo_clean_audio.mp4" type="video/mp4"&gt;
&lt;/video&gt;</description></item></channel></rss>