Eliminating Information Leakage Risks: Anonymization Unique to Local LLMs

After reading the previous article on local LLMs, you might think that since capable LLMs already exist in the cloud, we can simply use those. That is true in general, but not in workplaces that handle personal information. In this article, I explain how a practical 3-stage anonymization pipeline built on local LLMs performs, based on actual test results.

The Limits of "Anonymization" Faced by Cloud AIs

In 2026, while the convenience of AI has penetrated every corner of society, protecting corporate confidential data and personal privacy has become an unprecedented challenge. Especially with cloud AIs like ChatGPT or Claude, the risk of input data being used for training or remaining in server logs has been a major barrier to AI adoption in highly confidential sectors such as medical care, welfare, and finance. Under these circumstances, anonymization with local LLMs holds great promise. The idea is to "detoxify" information in a local environment cut off from the internet and send it to the cloud AI only once it is safe. This article gives an overview of the approach and of its performance, which has reached a practical level.

Limits of Anonymization in Existing Systems

Many companies currently perform simple string replacement (such as regular expressions) before sending data to cloud AIs. Various non-AI anonymization approaches have been attempted, but they remain at the level of rule-based expert systems. The critical flaw of this traditional "mechanical replacement" is that it cannot understand context, so items that should be anonymized are likely to slip through depending on the context or sentence structure. When such leftovers are exposed in a security incident, the result can be major damage and loss of trust.

For example, take the sentence: "Mr. Sato lives near Honmoku Citizen's Park in Naka-ku, Yokohama City." (**\*This is sample data created for demonstration.**) Even if "Naka-ku, Yokohama City," the part that looks like an address, is removed, the fact that Mr. Sato lives near Honmoku Citizen's Park remains. For local residents or acquaintances, that is enough to identify the individual. Information of this kind, which is not PII on its own but enables identification when combined with other details, is called a quasi-identifier, and it was extremely difficult for traditional programs to remove automatically.
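The limitation can be reproduced with a few lines of Python. This is only a minimal sketch: the pattern `ADDRESS_PATTERN` and the helper `mask_address` are hypothetical names made up for illustration, and the rule covers just the "<ward>-ku, <city> City" shape used in the sample sentence.

```python
import re

# Hypothetical rule: mask fragments shaped like "<ward>-ku, <city> City".
ADDRESS_PATTERN = re.compile(r"\b[A-Z][a-z]+-ku, [A-Z][a-z]+ City\b")


def mask_address(text: str) -> str:
    """Replace address-shaped fragments with a fixed placeholder."""
    return ADDRESS_PATTERN.sub("[ADDRESS]", text)


sentence = "Mr. Sato lives near Honmoku Citizen's Park in Naka-ku, Yokohama City."
masked = mask_address(sentence)
print(masked)

# The rule removed the ward and city, but the landmark (a quasi-identifier)
# is untouched because nothing in the pattern knows it is location data.
print("Honmoku Citizen's Park" in masked)
```

The address fragment disappears, yet the park name survives, which is exactly the kind of residue a context-blind rule cannot catch.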

A 3-Stage Anonymization as a Practical Solution

The anonymization process I developed is a hybrid model. To address the issues above, it adopts an architecture that chains three different programs and AI models. In my tests, this achieved highly precise anonymization without omissions while preserving semantic meaning.
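The chaining idea can be sketched as follows. This is not the actual implementation: `run_pipeline`, the type aliases, and the three `toy_*` functions are hypothetical stand-ins so that the example runs without any model installed.

```python
from typing import Callable, Optional

Transform = Callable[[str], str]  # text in, rewritten text out
Auditor = Callable[[str], bool]   # True means "safe to release"


def run_pipeline(text: str, mechanical: Transform, semantic: Transform,
                 auditor: Auditor) -> Optional[str]:
    """Chain the three stages; return None when the audit rejects the result."""
    masked = mechanical(text)      # Stage 1: rule-based masking
    abstracted = semantic(masked)  # Stage 2: context-aware abstraction
    return abstracted if auditor(abstracted) else None


# Toy stand-ins (assumptions for illustration, not real stage logic).
def toy_mechanical(text: str) -> str:
    return text.replace("Sato", "[User A]")


def toy_semantic(text: str) -> str:
    return text.replace("Honmoku Citizen's Park", "a local park")


def toy_auditor(text: str) -> bool:
    return "Sato" not in text and "Honmoku" not in text


result = run_pipeline("Mr. Sato lives near Honmoku Citizen's Park.",
                      toy_mechanical, toy_semantic, toy_auditor)
print(result)
```

Returning `None` on a failed audit means the caller has to handle rejection explicitly, so unaudited text cannot reach the cloud by accident.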

Stage 1: NLP (Mechanical Replacement)

First, we use regular expressions and a Japanese NLP library such as GiNZA (morphological analysis and named-entity extraction) to quickly extract and replace "structured personal information" such as names, phone numbers, exact addresses, and email addresses. This stage is cheap, consuming very little memory and compute.
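A minimal sketch of this stage using only the standard library is shown below. The patterns and the function `stage1_mask` are illustrative assumptions. In a real setup the names would come from an NER component such as GiNZA, but here they are simply passed in as a list so the example stays self-contained.

```python
import re

# Illustrative patterns only; production rules need locale-specific tuning.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[PHONE]", re.compile(r"\b0\d{1,4}-\d{1,4}-\d{4}\b")),
]


def stage1_mask(text: str, known_names: list[str]) -> str:
    """Mask structured PII with fixed placeholders."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    # Replace longer names first so "Hiroshi Sato" wins over "Sato".
    for name in sorted(known_names, key=len, reverse=True):
        text = text.replace(name, "[NAME]")
    return text


sample = "Call Hiroshi Sato at 045-123-4567 or mail sato@example.com."
masked_sample = stage1_mask(sample, ["Hiroshi Sato", "Sato"])
print(masked_sample)
```

Everything here is plain string matching, which is why the stage is fast and also why it cannot judge context.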

Stage 2: LLM (Semantic Replacement and Abstraction)

This is the core of the approach. We use a powerful 14B-class LLM running in a local environment (such as Shisa 14B). The LLM reads the context and makes higher-level judgments, such as "leaving this park name will identify the home" or "this combination of disease name and age is too rare and leads to identification." Rather than simply deleting such text, it abstracts (generalizes) it into forms like "a nearby park" or "a male in his 70s," preserving the value of the information.
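One way to wire this stage is sketched below. The prompt wording is my own assumption, not the prompt used in the tests described here, and `generate` is a hypothetical callable standing in for whatever local runtime serves the model. No specific inference API is implied.

```python
from typing import Callable

# Assumed prompt wording, written for illustration only.
PROMPT_TEMPLATE = """You are an anonymization assistant.
Rewrite the text so that no individual can be identified.
- Generalize quasi-identifiers (landmarks, rare conditions, exact ages).
- Prefer abstraction ("a nearby park", "a male in his 70s") over deletion.
- Keep existing placeholders such as [NAME] unchanged.
Return only the rewritten text.

Text:
{text}
"""


def stage2_abstract(text: str, generate: Callable[[str], str]) -> str:
    """Build the prompt and delegate to a locally hosted model."""
    prompt = PROMPT_TEMPLATE.format(text=text)
    return generate(prompt).strip()


def fake_generate(prompt: str) -> str:
    # Stand-in for a local model call; returns a canned answer.
    return " [NAME] lives near a local park. \n"


print(stage2_abstract("[NAME] lives near Honmoku Citizen's Park.", fake_generate))
```

Passing the model in as a callable keeps the stage testable offline and makes it easy to swap one local model for another.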

Stage 3: Audit

Finally, a separate, independent AI model (such as Nemotron 9B) checks the anonymized result from a third-party perspective. It strictly evaluates whether any identifiable information remains and whether the sentence structure has been unnaturally broken. Only text that passes (PASS) is permitted to be sent to the cloud AI or stored as training data.
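The gating logic can be sketched like this. The verdict format (a first line reading PASS or FAIL, followed by an optional reason) is an assumption made for the example, and `AuditResult`, `parse_verdict`, and `release_if_passed` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AuditResult:
    passed: bool
    reason: str


def parse_verdict(raw: str) -> AuditResult:
    """Fail closed: anything other than an explicit leading PASS is a failure."""
    lines = raw.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    reason = " ".join(line.strip() for line in lines[1:])
    return AuditResult(passed=(first == "PASS"), reason=reason)


def release_if_passed(text: str, audit: Callable[[str], str]) -> Optional[str]:
    """Return the text only when the auditing model approves it."""
    result = parse_verdict(audit(text))
    return text if result.passed else None


# Stand-in auditor that rejects text still containing a known landmark.
def fake_audit(text: str) -> str:
    return "FAIL\nlandmark remains" if "Sankeien" in text else "PASS"


print(release_if_passed("[User A] lives in [Resident Area].", fake_audit))
print(release_if_passed("[User A] lives near Sankeien.", fake_audit))
```

Treating empty or malformed model output as a failure is a deliberate choice: an auditor that answers ambiguously should block transmission rather than allow it.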

Dramatic Anonymization Before and After

Let's look at an example of text that passed through this system. **\*The proper nouns, addresses, and situations below are fictional samples to demonstrate the system's capabilities.**

[Before Anonymization: Raw Data (Input)]

"Today at 14:00, received a call from Mr. Hiroshi Sato (78) living in Honmoku, Naka-ku, Yokohama. His wife, Sachiko, fell at home and hurt her right leg. He requested a compress to be brought to his home near Sankeien during tomorrow's regular visit. Tanaka in charge is scheduled to visit at 10:00."

[After Anonymization: 3-Stage Processed Data (Output)]

"Today at 14:00, received a call from [User A] (male in his 70s) living in [Resident Area]. The cohabitating spouse fell inside the residence and injured a lower limb. He requested necessary items to be brought to [User A]'s home during the next regular visit. Staff in charge is scheduled to visit in the morning."

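What do you think? Rather than simply swapping every item for a fixed placeholder such as "[Name]," the system uses the context. It replaces the name with "[User A]" while abstracting the age into "male in his 70s," collapses the address into "[Resident Area]," drops the landmark hint "near Sankeien," and generalizes "compress" into "necessary items." It also turns "his wife, Sachiko" into "the cohabitating spouse" and "10:00" into "in the morning." This strengthens privacy while still conveying the essentials of the report (who called, what happened, and when the next visit is).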
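To make the idea of generalization concrete, here is a deterministic toy version of just one of these transformations, the age bucketing. In the pipeline described here this judgment is made by the LLM, so the function `generalize_age` and its pattern are hypothetical and cover only ages written in parentheses.

```python
import re

# Matches an age written as "(78)"; other notations are out of scope here.
AGE_IN_PARENS = re.compile(r"\((\d{1,3})\)")


def generalize_age(text: str) -> str:
    """Replace an exact age in parentheses with its decade bucket."""
    def to_decade(match: re.Match) -> str:
        decade = int(match.group(1)) // 10 * 10
        return f"({decade}s)" if decade else "(under 10)"
    return AGE_IN_PARENS.sub(to_decade, text)


print(generalize_age("received a call from Mr. Hiroshi Sato (78)"))
```

A fixed rule like this handles the easy case, while deciding whether a decade bucket is still too specific in combination with other details is the part that benefits from a model reading the context.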

Roles in the Anonymization Process

Stage	Method	Strengths	Risk Profile
Stage 1	Regular expressions / Morphological analysis	Instant replacement of names and phone numbers	High risk of overlooking contextual information
Stage 2	Local LLM inference	Abstraction of quasi-identifiers and context	Very high protection performance
Stage 3	Audit by independent model	Judging residual risks, checking grammar	Catches omissions left by the earlier stages

Conclusion: Building Trust with AI Locally

AI's evolution will not stop, but "peace of mind" on the receiving end is currently not keeping up. The anonymization model shown in this example is not just a technical trick but an essential adjustment for making AI a "trusted partner." Particularly in welfare facilities and medical institutions, where data leakage is strictly prohibited, the philosophy of completing "detoxification" locally should become the standard for future cloud AI operations. The goal is to leverage the formidable intelligence of highly capable cloud LLMs while protecting the data behind a robust local shield. This "hybrid privacy" is surely the path forward for digital society from 2026 onward.

Sources:

Microsoft Presidio: PII Detection and Anonymization SDK

Shisa.AI: Local Japanese LLM for Privacy-Preserving Tasks

Radicalbit: 3-Stage Anonymization for Generative AI Pipelines