Eliminating Information Leakage Risks: Anonymization Unique to Local LLMs
Eliminate the risk of sending PII to the cloud. We explain practical methods for utilizing AI while protecting privacy via a local LLM '3-stage anonymization pipeline'.
After reading the previous article on local LLMs, you might think that since capable LLMs already exist in the cloud, we can simply use those. That is often true, but not in workplaces that handle personal information. In this article, I explain the effect of a more practical 3-stage anonymization using local LLMs, based on actual test results.
The Limits of "Anonymization" Faced by Cloud AIs
In 2026, as the convenience of AI reaches every corner of society, protecting corporate confidential data and personal privacy has become a pressing challenge. With cloud AIs such as ChatGPT or Claude in particular, the risk that input data is used for training or remains in server logs has been a major barrier to AI adoption in highly confidential sectors such as medicine, welfare, and finance. Under these circumstances, anonymization with local LLMs holds great promise: the information is "detoxified" in a local environment cut off from the internet and sent to the cloud AI only once it is safe. This article gives an overview of this anonymization approach and of its performance, which in my tests reached a practical level.
Limits of Anonymization in Existing Systems
Many companies currently apply simple string replacement, such as regular expressions, before sending data to cloud AIs. Various non-AI anonymization approaches have been tried, but they remain at the level of rule-based expert systems. The critical flaw of this traditional "mechanical replacement" is that it cannot understand context, so items that should be anonymized are likely to slip through depending on the context or sentence structure. When data leaked this way is exposed in an actual security incident, the result can be major damage and loss of trust.
For example, take the sentence: "Mr. Sato lives near Honmoku Citizen's Park in Naka-ku, Yokohama City." (**\*This is sample data created for demonstration.**) Even if "Naka-ku, Yokohama City," which is part of the address, is removed, the information "Mr. Sato living near Honmoku Citizen's Park" remains. To local residents or acquaintances, this is enough to identify the individual. Information of this kind, which is not PII on its own but leads to identification when combined with other details, is called a quasi-identifier, and it was extremely difficult for traditional programs to eliminate it automatically.
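The leak described above can be reproduced in a few lines. This is a minimal sketch, not the rule set of any real system: the single pattern `ADDRESS_PATTERN` and the helper `mechanical_redact` are hypothetical names introduced only to show how a fixed rule removes the ward-level address while the landmark passes through untouched.

```python
import re

# Hypothetical rule: the only thing this "mechanical" redactor knows about
# is the ward-level address that appears in the sample sentence.
ADDRESS_PATTERN = re.compile(r"Naka-ku, Yokohama City")


def mechanical_redact(text: str) -> str:
    """Replace only the strings that the fixed pattern matches."""
    return ADDRESS_PATTERN.sub("[ADDRESS]", text)


sample = "Mr. Sato lives near Honmoku Citizen's Park in Naka-ku, Yokohama City."
redacted = mechanical_redact(sample)
print(redacted)

# The landmark is a quasi-identifier, but no rule mentions it, so it survives.
leaked = "Honmoku Citizen's Park" in redacted
print("landmark still present:", leaked)
```

Adding one more pattern for this park would fix this sentence only; the next note will mention a different landmark, which is why a context-aware stage is needed.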
A 3-Stage Anonymization as a Practical Solution
The anonymization process I developed is a hybrid model. To solve the issues above, I adopted an architecture that chains three components: a rule-based NLP program followed by two different AI models. In my tests, this achieved highly precise anonymization without omissions while preserving the meaning of the text.
Stage 1: NLP (Mechanical Replacement)
First, a morphological analysis engine such as GiNZA, together with regular expressions, quickly extracts and replaces "structured personal information" such as names, phone numbers, exact addresses, and email addresses. This stage is lightweight and consumes very little memory and compute.
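As a rough illustration of Stage 1, the sketch below uses only Python's standard `re` module. The patterns, the placeholder labels, and the function `stage1_redact` are assumptions made for this example, not the article's actual rules. The `known_names` argument stands in for the person entities that an NER engine such as GiNZA would return, so the sketch runs without any NLP model installed. The phone number and email address are fictional.

```python
import re

# Illustrative patterns only; real data needs broader, locale-specific rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b0\d{1,4}-\d{1,4}-\d{4}\b")  # e.g. 045-123-4567


def stage1_redact(text: str, known_names: list[str]) -> str:
    """Replace structured PII with placeholders.

    known_names is a stand-in for the person names an NER engine would detect.
    """
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Replace longer names first so "Hiroshi Sato" is not split by "Sato".
    for name in sorted(known_names, key=len, reverse=True):
        text = text.replace(name, "[NAME]")
    return text


note = "Call Hiroshi Sato at 045-123-4567 or sato@example.com. Sato prefers mornings."
print(stage1_redact(note, ["Sato", "Hiroshi Sato"]))
```

Everything here is plain string matching, which is why this stage is cheap to run and also why it cannot judge whether a landmark or a rare combination of attributes is identifying.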
Stage 2: LLM (Semantic Replacement and Abstraction)
This is the core of the approach. It uses a capable 14B-class LLM running in a local environment (such as Shisa 14B). The LLM reads the context and makes higher-level judgments, such as "leaving this park name will reveal where the person lives" or "this combination of disease name and age is too rare and leads to identification." Rather than simply deleting such details, it abstracts (generalizes) them into forms like "a nearby park" or "a male in his 70s," which preserves the value of the information.
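One way to wire up Stage 2 is to keep the prompt and the model call separate, as in the sketch below. The article does not publish its prompt, so the wording of `ABSTRACTION_PROMPT` is purely an assumption, as are the names `stage2_abstract` and `fake_generate`. The `generate` parameter represents whatever function sends a prompt to the locally hosted model and returns its reply; a canned stand-in is used here so the example runs without a model.

```python
from typing import Callable

# Hypothetical instruction text; the actual prompt used in the article is not given.
ABSTRACTION_PROMPT = (
    "You are an anonymization assistant. Rewrite the note so that no person "
    "can be identified. Generalize landmarks, exact ages, exact times, and rare "
    "combinations of attributes instead of deleting them. Keep placeholders "
    "such as [NAME] unchanged. Return only the rewritten note.\n\n"
    "Note:\n{note}"
)


def stage2_abstract(note: str, generate: Callable[[str], str]) -> str:
    """Ask a locally hosted model to generalize quasi-identifiers in the note."""
    prompt = ABSTRACTION_PROMPT.format(note=note)
    return generate(prompt).strip()


def fake_generate(prompt: str) -> str:
    """Stand-in for a local model call; returns a canned reply for the demo."""
    return " [NAME] (male in his 70s) lives close to a local park. \n"


print(stage2_abstract("[NAME] (78) lives near Sankeien.", fake_generate))
```

Passing the model call in as a function keeps the raw note on the local machine by construction: nothing in this code knows about any network endpoint.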
Stage 3: Audit
Finally, a separate, independent AI model (such as Nemotron 9B) checks the anonymized result from a third-party perspective. It strictly evaluates whether identifiable information remains and whether the sentence structure has been unnaturally broken. Only text that passes (PASS) may be sent to a cloud AI or stored as training data.
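The PASS gate can be sketched as a small fail-closed check. The reply format assumed here (a verdict word on the first line, optionally followed by an explanation) is a convention invented for this example, and `parse_verdict` and `release_if_safe` are hypothetical names. The `audit` parameter stands for a function that sends the anonymized text to the independent auditing model and returns its reply.

```python
from typing import Callable, Optional


def parse_verdict(audit_reply: str) -> bool:
    """Return True only when the first line of the reply is exactly PASS."""
    lines = audit_reply.strip().splitlines()
    return bool(lines) and lines[0].strip().upper() == "PASS"


def release_if_safe(anonymized: str, audit: Callable[[str], str]) -> Optional[str]:
    """Return the text when the auditor passes it, otherwise None (fail closed)."""
    return anonymized if parse_verdict(audit(anonymized)) else None


# Canned auditor replies so the sketch runs without a model.
print(release_if_safe("[User A] called today.", lambda text: "PASS\nNo identifiers found."))
print(release_if_safe("Lives near Sankeien.", lambda text: "FAIL\nLandmark remains."))
```

Treating an empty or malformed reply as a failure means that a misbehaving auditor blocks transmission rather than silently allowing it.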
Dramatic Anonymization Before and After
Let's look at an example of text that passed through this system. **\*The proper nouns, addresses, and situations below are fictional samples to demonstrate the system's capabilities.**
[Before Anonymization: Raw Data (Input)]
"Today at 14:00, received a call from Mr. Hiroshi Sato (78) living in Honmoku, Naka-ku, Yokohama. His wife, Sachiko, fell at home and hurt her right leg. He requested a compress to be brought to his home near Sankeien during tomorrow's regular visit. Tanaka in charge is scheduled to visit at 10:00."
[After Anonymization: 3-Stage Processed Data (Output)]
"Today at 14:00, received a call from [User A] (male in his 70s) living in [Resident Area]. The cohabitating spouse fell inside the residence and injured a lower limb. He requested necessary items to be brought to [User A]'s home during the next regular visit. Staff in charge is scheduled to visit in the morning."
What do you think? Rather than simply swapping "Sato" for a "[Name]" placeholder, the pipeline uses the context: it replaces the name with "[User A]," abstracts the age "78" into "male in his 70s," replaces the address with "[Resident Area]," drops the specific hint "near Sankeien," and generalizes "compress" into "necessary items." The name of the staff member and the exact visit time are generalized as well. This maximizes privacy protection while still communicating the business requirements (who needs something, and when).
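A cheap deterministic check can complement this kind of before/after comparison: confirm that none of the identifiers known from the raw note appear verbatim in the output. This check is not part of the pipeline described in the article; it is an assumption added for illustration, and `residual_identifiers` is a hypothetical helper. The term list is taken from the fictional raw note above.

```python
def residual_identifiers(output_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear verbatim in the output."""
    lowered = output_text.lower()
    return [term for term in sensitive_terms if term.lower() in lowered]


# Identifiers from the fictional raw note.
terms = ["Sato", "Hiroshi", "Sachiko", "Honmoku", "Naka-ku",
         "Yokohama", "Sankeien", "Tanaka", "78"]

anonymized = (
    "Today at 14:00, received a call from [User A] (male in his 70s) living in "
    "[Resident Area]. The cohabitating spouse fell inside the residence and "
    "injured a lower limb."
)

print(residual_identifiers(anonymized, terms))              # nothing survives
print(residual_identifiers("Visit near Sankeien.", terms))  # the landmark is caught
```

Because this is plain substring matching, it only catches verbatim survivors and may flag harmless words that happen to contain a term; it cannot replace the semantic audit.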
Roles in the Anonymization Process
| Stage | Method | Excels in | Risk management |
| --- | --- | --- | --- |
| Stage 1 | Regular expressions / morphological analysis | Instant replacement of names and phone numbers | High risk of overlooking contextual information |
| Stage 2 | Local LLM inference | Abstraction of quasi-identifiers and context | Very high protection performance |
| Stage 3 | Audit by an independent model | Judging residual risk, grammar check | Reduces human error by automating the final check |
Conclusion: Building Trust with AI Locally
AI's evolution will not stop, but for now, "peace of mind" on the receiving end is not keeping up. The anonymization model shown in this example is not just a technical trick but an essential adjustment for making AI a "trusted partner." Particularly in welfare facilities and medical institutions, where data leakage is strictly prohibited, the philosophy of completing "detoxification" locally should become the standard for future cloud AI operations: leverage the formidable intelligence of highly capable cloud LLMs while protecting the data with a robust local shield. This "hybrid privacy" is surely the path forward for digital society from 2026 onward.
Sources:
Microsoft Presidio: PII Detection and Anonymization SDK
Shisa.AI: Local Japanese LLM for Privacy-Preserving Tasks
Radicalbit: 3-Stage Anonymization for Generative AI Pipelines