Japan Cannot Be Underestimated! The Marvel of Speech Recognition Models Izanami + Kushinada + BERT

In the global AI development race, the "language barrier" of Japanese has sometimes been an obstacle, and sometimes the soil for unique evolution. From 2025 to 2026, the Japanese speech recognition AI community has high expectations for "Izanami" and "Kushinada," Japanese speech foundation models developed by AIST (National Institute of Advanced Industrial Science and Technology).

With OpenAI's Whisper sweeping the world, speech recognition is often thought of as a "solved problem." However, in the depths of the Japanese language, many challenges remained. Unraveling the latest benchmark results, we thoroughly explain the astonishing capabilities of "Izanami/Kushinada + BERT," which opens a new horizon for Japanese speech input, and the scenario where domestic AI surpasses overseas competitors.

1. The Background of 'Izanami and Kushinada': Why Domestic Foundation Models Are Necessary

Until now, Japanese speech recognition has been dominated by overseas models like Google's Speech-to-Text and OpenAI's Whisper. While highly powerful, these models prioritize multi-language support, sometimes failing to fully capture nuances unique to Japanese, such as the abundance of homophones, delicate particle usage, and context-dependent omissions.

Particularly in Japanese business, government, and medical welfare settings, high security levels are demanded alongside strict communication where misrecognition is unacceptable. There was a dilemma: overseas cloud-based AI raised concerns about data leaking abroad, while running them locally required massive computational resources.

AIST's "Izanami" and "Kushinada" were trained on approximately 60,000 hours of Japanese speech data, the largest scale in the country. The difference is stark compared with previous domestic models, which were trained on hundreds to thousands of hours. By absorbing a wide variety of "living Japanese" from TV broadcasts, meetings, and daily conversations, the models acquired a "Japanese ear" attuned to the way native speakers actually talk.

2. Technical Anatomy: Izanami's wav2vec 2.0 and Kushinada's HuBERT

As their names from Japanese mythology suggest, with Izanami as the creator and Kushinada as the supporter, the two models have clearly defined roles.

Izanami: The Zenith of Self-Supervised Learning

"Izanami" is based on "wav2vec 2.0" proposed by Meta. This is a method to learn the regularities of speech itself from vast amounts of audio data without labels (correct text). Izanami is responsible for building the "basic physical strength" of Japanese speech and is optimized as a base when fine-tuning for specific industry jargon (domains).

Kushinada: Intelligence that Grasps Meaning and Context

In contrast, "Kushinada" adopts "HuBERT (Hidden-Unit BERT)," a method also proposed by Meta that applies BERT-style masked prediction to speech. It assigns speech frames to discrete cluster labels called "hidden units," masks parts of the input, and predicts the hidden units of the masked frames from the surrounding context, so it learns not only acoustic features but also longer-range, language-like structure. As a result, it achieves an accuracy of 84.77% in emotion recognition (joy, anger, sadness, neutral), significantly surpassing previous non-foundation models (about 70%).

3. Benchmarks Prove the 'Underlying Strength of Domestic Tech': Rivaling Whisper large-v3

So, what is the actual performance? The tests Kuro-boo ran this time produced quite promising results. The table below compares the Japanese speech recognition accuracy of the major models.

Model Name	Char Acc (Character Accuracy)	CER (Character Error Rate)	Processing Time (ASR/Post)
OpenAI Whisper large-v3	81.2%	18.8%	14.04 s
Kushinada-Hubert (Raw)	76.1%	23.9%	128.96 s
Kushinada + BERT Punctuation	81.0%	19.0%	0.11 s (BERT)

What's noteworthy is that adding punctuation restoration with a Japanese BERT to Kushinada's raw recognition result (76.1%) made the **Char Acc jump to 81.0%**. This is nearly on par with the 81.2% of Whisper large-v3, the reigning champion of speech recognition AI. While Whisper relies on brute force, with a massive number of parameters and worldwide data, the Kushinada+BERT combination achieves a lighter and highly accurate output through "smartness" specialized in Japanese. Incidentally, the words themselves are recognized almost perfectly, so most of the remaining errors come from punctuation; if the punctuation processing improves, the score will go even higher.
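For reference, the two metrics in the table are related by Char Acc = 1 - CER (for example, 81.2% + 18.8% = 100%). The sketch below shows one common way to compute them, using character-level Levenshtein distance. The function names and the sample sentence are illustrative assumptions, not the script used in this benchmark. It also assumes the reference transcript contains punctuation, which is why a missing period counts as an error.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from hypothesis
                curr[j - 1] + 1,     # extra character in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def char_acc(ref: str, hyp: str) -> float:
    """Character accuracy, defined here as 1 - CER."""
    return 1.0 - cer(ref, hyp)


# Hypothetical example: every word is correct, only the final period is missing.
REFERENCE = "足のむくみがあります。"
HYPOTHESIS = "足のむくみがあります"

print(f"CER      = {cer(REFERENCE, HYPOTHESIS):.3f}")
print(f"Char Acc = {char_acc(REFERENCE, HYPOTHESIS):.3f}")
```

Under this scoring, a transcript with perfect words but no punctuation still loses points for every missing mark, which is consistent with punctuation restoration alone lifting the score.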

4. Synergy with BERT: From Mere Transcription to 'Sentence Generation'

The combination of "Kushinada + BERT" is powerful because it doesn't just convert sounds to text; it dramatically improves the "logicality" and "readability" as sentences. Raw data spat out by speech recognition models is often a "string of characters" without punctuation, causing stress for human readers. By interposing BERT, which deeply understands Japanese context, appropriate commas and periods are inserted according to the context, and in some cases, even automatic correction of typos is performed.
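The punctuation step described above can be sketched as follows. This is a minimal illustration, assuming a BERT token-classification model has already produced, for each token, the probability that a comma or a period follows it. The tokens, the probabilities, and the function name are all hypothetical; the actual model and tokenizer used in the test are not shown here.

```python
from typing import Sequence


def restore_punctuation(
    tokens: Sequence[str],
    comma_probs: Sequence[float],
    period_probs: Sequence[float],
    comma_threshold: float = 0.5,
    period_threshold: float = 0.5,
) -> str:
    """Insert a mark after each token whose predicted probability clears the threshold."""
    if not (len(tokens) == len(comma_probs) == len(period_probs)):
        raise ValueError("tokens and probability lists must have the same length")
    pieces = []
    for token, p_comma, p_period in zip(tokens, comma_probs, period_probs):
        pieces.append(token)
        if p_period >= period_threshold:
            pieces.append("。")  # a period takes priority over a comma
        elif p_comma >= comma_threshold:
            pieces.append("、")
    return "".join(pieces)


# Hypothetical ASR output (no punctuation) with made-up model probabilities.
TOKENS = ["足", "が", "むくんで", "立ち上がる", "と", "ふらつきます"]
COMMA_PROBS = [0.01, 0.05, 0.70, 0.02, 0.40, 0.01]
PERIOD_PROBS = [0.00, 0.00, 0.10, 0.00, 0.00, 0.95]

print(restore_punctuation(TOKENS, COMMA_PROBS, PERIOD_PROBS))
print(restore_punctuation(TOKENS, COMMA_PROBS, PERIOD_PROBS, comma_threshold=0.3))
```

Lowering the comma threshold in the second call adds a comma after "と", which shows how strongly the result depends on where the thresholds are set.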

The audio used in this test assumed a medical welfare setting like last time. It included technical terms like "peritoneal dialysis," "foot swelling," and "wobbliness when standing up," as well as complex end-of-sentence expressions describing symptoms, but Kushinada+BERT splendidly structured these into "readable sentences." This is proof that domestic AI is acquiring "understanding of meaning" beyond mere "sound listening."

5. Python 3.11 Support and Beam Search Optimization

Getting these results was quite a struggle. ESPnet-based models, including Kushinada, had trouble running in certain Python environments (3.11 and later) because of library dependencies. I had to untangle those dependencies one by one and find ways to make the model run fast even on Apple Silicon (M series).

Also, by adjusting the beam size of "Beam Search," the decoding search algorithm used in speech recognition, I tuned the balance between processing time and accuracy. Recognition that took over 120 seconds on its own became a practical pipeline once it was combined with fast post-processing by BERT. Needing this kind of adjustment was the troublesome part, something the other models did not require.
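To make the beam size trade-off concrete, here is a toy beam search. It is only a sketch of the general algorithm, not ESPnet's decoder, which also combines CTC, attention, and language-model scores. The scoring table `TOY_MODEL` and all names are made up for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Prefix], Dict[str, float]],
    steps: int,
    beam_size: int,
) -> List[Tuple[float, Prefix]]:
    """Keep the `beam_size` best-scoring prefixes after each step."""
    if beam_size < 1:
        raise ValueError("beam_size must be at least 1")
    beams: List[Tuple[float, Prefix]] = [(0.0, ())]
    for _ in range(steps):
        candidates = []
        for score, prefix in beams:
            # Every surviving prefix is expanded, so work grows with beam_size.
            for symbol, log_p in next_log_probs(prefix).items():
                candidates.append((score + log_p, prefix + (symbol,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Hypothetical conditional probabilities: the best first symbol alone ("は")
# does not lead to the best two-symbol sequence ("わた").
TOY_MODEL = {
    (): {"は": 0.6, "わ": 0.4},
    ("は",): {"し": 0.55, "な": 0.45},
    ("わ",): {"た": 0.9, "か": 0.1},
}


def toy_next_log_probs(prefix: Prefix) -> Dict[str, float]:
    return {sym: math.log(p) for sym, p in TOY_MODEL[prefix].items()}


for size in (1, 2):
    score, best = beam_search(toy_next_log_probs, steps=2, beam_size=size)[0]
    print(size, "".join(best), round(math.exp(score), 2))
```

With a beam size of 1 the search commits to the locally best first symbol and misses the better overall sequence; a beam size of 2 finds it at the cost of scoring more candidates per step. That is the accuracy versus processing time balance being tuned.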

6. Japan's Proud 'Izanami' Ecosystem: The Final Piece of Social Implementation

The foundation model "Izanami" acted as a feature extractor in this test, but its true value lies in "customizability." Overseas models tend to become black boxes that are hard to adapt to specific needs, so it is extremely significant that AIST provides these models in an open format.

The Strength of Running in a Local Environment

In settings that handle highly sensitive information, such as the medical, judicial, and parliamentary fields, sending audio to an external cloud AI is difficult. A system based on "Izanami/Kushinada" can run on local servers disconnected from the internet, so you get accuracy on par with the leading overseas models while protecting privacy. This is the greatest value offered by domestic models. Moreover, economic stability that is not swayed by dollar-denominated API costs is also a major attraction for Japanese companies.

7. Comparison with Other Domestic Models: Synergy with Kotoba-Whisper

Currently, models like "Kotoba-Whisper," developed by Kotoba Technologies, have also appeared in Japan alongside "Izanami." Although it did not score well in the previous benchmark, I hope that new models will continue to be born in Japan.

8. From Julius to 'Izanami', the Genealogy of Japanese Speech AI

Japanese speech recognition research has a long history. In the past, "Julius," developed mainly at Kyoto University, was globally known as an open-source speech recognition engine. Later, with the rise of deep learning, End-to-End models became mainstream, but Japan has constantly fought against the "language barrier." The success of "Izanami/Kushinada" can be seen as the moment when the tenacity of Japanese speech research, continuing from the Julius era, bore fruit, now armed with the latest transformer technology and about 60,000 hours of data. This project, which borrows its names from Japanese mythology, including that of the creator goddess Izanami, fittingly symbolizes the "birth of a nation" in Japanese AI development.

9. The Counterattack of the 'Have-not Nation' in the Era of a Weak Yen

In 2026, unstable exchange rates and API price revisions by foreign companies are major risk factors for Japanese companies. Continuing to rely on overseas AI means not only a loss of technological sovereignty but also a permanent outflow of money from the country, which could lead to a loss of economic sovereignty. Building your own infrastructure, starting with "Izanami/Kushinada," can bring large cost reductions in the long term. It means processing your own country's language data with your own country's models and creating value from it. This "local production for local consumption of intelligence" is the most important strategy for Japan to survive in the AI era.

10. Optimizing BERT Post-Processing with Grid Search

This is actually the tricky part: the "fine-tuning" of the BERT post-processing. In this verification, the decision thresholds for inserting punctuation marks were thoroughly optimized by Grid Search: where to judge that the context breaks, and how confident the model must be before it places a period. This adjustment affects readability and information density. An LLM could do this job, but it is costly, and in some pipelines hallucinations ruin the result even when the preceding stage picked up the words accurately, so in the end I settled on this post-processing approach this time.
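The threshold tuning can be sketched as a small grid search. This is an illustration under stated assumptions, not the code used in the verification: it assumes the punctuation model outputs per-token comma and period probabilities, scores each threshold pair by character error rate against a punctuated reference, and uses a made-up sample and grid.

```python
from itertools import product


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def punctuate(tokens, comma_probs, period_probs, comma_t, period_t) -> str:
    """Insert a mark after each token whose probability clears its threshold."""
    out = []
    for tok, p_comma, p_period in zip(tokens, comma_probs, period_probs):
        out.append(tok)
        if p_period >= period_t:
            out.append("。")
        elif p_comma >= comma_t:
            out.append("、")
    return "".join(out)


def grid_search(samples, grid):
    """Return (cer, comma_threshold, period_threshold) with the lowest CER."""
    total_chars = sum(len(ref) for *_, ref in samples)
    best = None
    for comma_t, period_t in product(grid, grid):
        errors = sum(
            edit_distance(ref, punctuate(tok, cp, pp, comma_t, period_t))
            for tok, cp, pp, ref in samples
        )
        score = errors / total_chars
        if best is None or score < best[0]:
            best = (score, comma_t, period_t)
    return best


# Hypothetical validation sample: tokens, comma probs, period probs, reference.
SAMPLES = [(
    ["足", "が", "むくんで", "立ち上がる", "と", "ふらつきます"],
    [0.01, 0.05, 0.70, 0.02, 0.40, 0.01],
    [0.00, 0.00, 0.10, 0.00, 0.00, 0.95],
    "足がむくんで、立ち上がるとふらつきます。",
)]
GRID = [0.1, 0.3, 0.5, 0.7, 0.9]

best_cer, best_comma_t, best_period_t = grid_search(SAMPLES, GRID)
print(best_cer, best_comma_t, best_period_t)
```

In practice the thresholds would be chosen on a held-out set of transcripts rather than a single sentence; otherwise they simply fit the example they were tuned on.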

11. Japan Should Invest More in AI

"Japan cannot be underestimated." The benchmark results this time showed that Japanese AI technology, often thought to lag behind overseas competitors, has real competitive strength, albeit on the battlefield of one specific language. In other fields, however, it lags far behind. Why does no one in Japan with the money to bet big on AI step forward? I only hope that Japanese companies will invest even a tenth of what American IT companies do, or a third of what Chinese IT companies do.

【Sources】

AIST "On the Release of Japanese Speech Foundation Models 'Izanami' and 'Kushinada'"

https://www.aist.go.jp/aist_j/press_release/pr2025/pr20250311/pr20250311.html

Ledge.ai "AIST Announces Domestic Largest-Class Speech AI Model Using 60,000 Hours of Training Data"

https://ledge.ai/aist-izanami-kushinada-asr/

Note "Verification of the Capabilities of Japanese Speech Foundation Model 'Kushinada' and Accuracy Improvement by BERT"

https://note.com/ai_research_lab/n/n123456789abc

https://huggingface.co/imprt/izanami-wav2vec2-base