The Future of Welfare Changed by "Voice": AI Model Comparison and Accuracy Requirements
AI voice recognition changes caregiving. Revealing benchmark results showing the high accuracy of OpenAI Whisper and the challenges of Kotoba-Whisper.
The evolution of artificial intelligence (AI) has dramatically improved the speed and accuracy of converting our "voices" into digital data. In particular, the emergence of speech recognition models led by OpenAI's "Whisper" is bringing a quiet but steady revolution to the welfare and nursing care fields, which have historically been burdened by heavy administrative tasks like "records, records, records" for everything.
Dramatic Evolution of AI Speech Recognition and Its Ripple Effects on Welfare
In the past, voice input might have had a strong impression of being difficult to use due to low recognition rates and the hassle of corrections. However, current state-of-the-art AI models can pick up words with accuracy equal to or greater than humans, even in environments with masks or background noise. This technological advancement is drawing attention as a trump card to dramatically reduce the administrative burden on staff in welfare fields facing severe labor shortages.
Capabilities of the Latest Whisper and Derivative Models by AI Leader OpenAI: Benchmark Comparison
Welfare and care environments demand more than just speed. Since they handle information concerning human lives, accuracy where errors are unacceptable is also crucial. We strictly verified the transcription accuracy of major AI models using audio files simulating actual care and medical settings. The results highlighted a surprising difference in capability between the models. Please note that the audio used was read aloud by Kurousagi, who is an around-sixties male, so results might differ for females, who make up a large portion of the welfare field. We hope you find this useful as a reference document.
Model Name | Model Size | Speed (176 chars) | Character Accuracy (Char Acc) | Rating & Evaluation |
OpenAI Whisper Large-v3 | 3.0GB | 14.04 s | 81.2% | 【Highest Accuracy】 Deep contextual understanding, maintaining high accuracy even for technical terms. |
OpenAI Whisper Large-v3-turbo | 1.5GB | 6.66 s | 80.1% | 【Practical】 Maintains high accuracy while processing in about half the time of large-v3. |
ReazonSpeech NeMo v2 | 2.4GB | 10.63 s | 79.0% | Outputs connected text, but character-level accuracy is close to Whisper. However, because it outputs continuously, humans need to insert punctuation. |
OpenAI Whisper medium | 1.5GB | 7.99 s | 67.4% | Large に比べると文字一致率が低下する。 |
kotoba-whisper-v2.2 | 1.5GB | 5.90 s 一番早い | 36.9% | 文章の後半が脱落し、内容が大幅に不足。実用には遠い。 |
Qwen3-ASR-1.7B | 4.5GB | 617.52 s | 80.7% | ともかく遅い。認識率は悪くないが、利用者が多くなると致命的か。 |
According to the benchmark results, OpenAI's Whisper large-v3 recorded the highest level of accuracy. On the other hand, the highly anticipated "kotoba-whisper-v2.2" had a significantly low character accuracy of 36.9%, with a critical issue where the transcription cut off mid-sentence. Lacking accuracy as an AI, it seems difficult to put it into practical use in welfare settings in its current state. *Note that the operating environment was a Macbook Pro M5 using MLX; results may differ in an NVIDIA CUDA environment.
Risks Posed by "Significant Misconversions"
This verification also highlighted issues that cannot be measured by numerical accuracy (like CER) used in benchmarks. For instance, even with high-performance models, we saw cases where "renal failure" was misidentified as "anemia," or "dialysis" was converted to "stone throwing." These are not simple spelling mistakes, but serious errors that could distort the content of care itself.
Therefore, when introducing the latest AI, simply selecting a model is not enough; advanced customization to correctly recognize specific terms, user names, and drug names is essential. The development of a "welfare-dedicated LoRA" designed to learn the voices of elderly women and the background noise of care settings is driven by the urgent voices of staff who do not underestimate the weight of a single character.
Time with Users Created by Productivity Improvements
If accurate AI voice input is achieved, its effects will be immense. At one facility, introducing a system that records voice on the spot while providing care successfully reduced administrative tasks by more than 40 hours per month per staff member. This "40 hours" is not just a number. The saved time is redirected from staring at screens to "genuine care" where staff can look at users' faces and listen to their words.
AI is by no means a replacement for humans. It is the most powerful tool to free humans from the chains of heavy "records" and restore warm communication that only humans can provide. As the accuracy of AI improves further, welfare sites will accelerate toward true "human-centered care."
【Sources】
- CareNews (Caregiving News Site): Case Studies of Productivity Improvements
- OpenAI: Whisper Model Research Report
- Ministry of Health, Labour and Welfare: Promoting Productivity Improvement in Nursing Care