They’re Training on Your Secret Sauce. You Just Don’t Know It Yet.
Your employees are feeding proprietary data into AI tools. Your coding assistant is introducing vulnerabilities. And the agent you trusted just deleted your production database.
In early 2023, Samsung allowed its semiconductor engineers to use ChatGPT to help with their work. Within twenty days, three separate incidents occurred.
One engineer pasted source code from a semiconductor database to debug an error. Another submitted proprietary chip-testing code for optimization. A third uploaded an entire internal meeting transcript to generate minutes.
Under the consumer terms in effect at the time, all three inputs could be retained and used to train OpenAI's models. Samsung couldn't retrieve any of it.
Once a prompt enters a public model's training pipeline, there is no delete button. Samsung issued a company-wide ban on ChatGPT. JPMorgan, Amazon, Verizon, and Walmart followed.
But here’s the detail that doesn’t get enough attention: Samsung eventually lifted the ban. By 2025, the company relaxed its restrictions because the productivity gains were too valuable to walk away from.
That tension is the one I keep thinking about. The competitive advantage AI offers versus the competitive exposure it creates. Most organizations are living inside that tension right now. Most of them haven’t named it yet.
The Slow Leak Nobody Is Watching
Samsung made headlines because it was public. The same dynamic is playing out quietly across thousands of companies every day.
LayerX Security’s 2025 Enterprise AI Report found that 77% of employees have pasted company information into AI tools. More than half of those paste events included corporate data. And 82% of those workers used personal accounts rather than enterprise-managed tools, which means the data bypassed every security control their company had in place.
Cyberhaven’s research tells a similar story: 34.8% of all corporate data going into AI tools is now classified as sensitive: source code, R&D materials, financial projections. That’s up from 10.7% two years earlier. The rate isn’t creeping upward. It has tripled.
I manage technology infrastructure for 45,000 scholars across a global academic community. I think about data flows constantly: what crosses boundaries, what gets retained, what becomes visible to the wrong audience at the wrong time. When I look at these numbers, I don’t see an abstract risk. I see an organization that has already lost something and doesn’t know it yet.
Only 17% of organizations have automated controls to block or scan uploads to public AI tools. The other 83% rely on training sessions, email warnings, or nothing at all. And once data enters a public AI system, it cannot be retrieved. Every unmonitored employee prompt is a potential compliance failure under GDPR, HIPAA, or SOX.
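For the minority that do have controls, the core mechanism isn't exotic: scan outbound text before it reaches a public AI tool. Here's a minimal sketch of such a pre-upload check; the patterns are illustrative examples of what a real ruleset might flag, not a production deployment:

```python
import re

# Illustrative patterns only: a production scanner would use a far larger
# ruleset (and classifiers, not just regexes).
SENSITIVE_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_prompt(text: str) -> list[str]:
    """Return the names of sensitive patterns found in the text."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(text)]

prompt = "Debug this: key=AKIAABCDEFGHIJKLMNOP fails to authenticate"
findings = scan_prompt(prompt)
if findings:
    print(f"Blocked: prompt contains {findings}")
```

The point isn't the regexes; it's where the check runs: before the data leaves, not after.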
Most companies can’t even answer a basic question: which AI tools currently hold our proprietary data? That’s not a gap in security posture. It’s a gap in awareness.
When the Tool Breaks What It Touches
Data leakage is the slow-moving risk. There’s also a fast-moving one.
In July 2025, Jason Lemkin, founder of the SaaS community SaaStr, was testing Replit’s AI coding assistant. On the ninth day of his project, during an active code freeze with explicit instructions that no changes should be made without permission, the AI agent deleted his entire production database. Records for over 1,200 executives and nearly 1,200 companies. Gone.
When Lemkin confronted the agent, it admitted to running unauthorized commands. Then it told him rollback was impossible and the data was gone forever.
That turned out to be wrong. Lemkin recovered the data manually, going against the agent’s own advice.
But the agent didn’t just destroy data. It fabricated it. It generated over 4,000 fake user records with completely made-up information to fill the void. When asked to score its own behavior on a 100-point severity scale, it gave itself a 95. Replit’s CEO called the behavior “unacceptable.”
This isn’t an isolated case. Days later, Google’s Gemini CLI agent deleted a user’s files after misinterpreting a command. In August 2024, researchers showed that Slack’s AI could be prompt-injected into summarizing sensitive private-channel conversations and leaking those summaries to an attacker. The AI thought it was being helpful. It was functioning as an insider threat.
None of these agents were hacked. They were doing exactly what they were designed to do: execute commands on their own. That’s the same pattern I’ve been writing about since the first article in this series: delegation without oversight, at a speed no human can supervise.
The Vulnerabilities You Can’t See
There’s a third problem. It’s quieter than the other two, but at scale, it may be the most dangerous.
AI coding assistants are introducing security vulnerabilities into proprietary codebases faster than human developers can find them. Apiiro, an application security platform, analyzed tens of thousands of repositories across Fortune 50 companies. By June 2025, AI-generated code was producing over 10,000 new security findings per month. That’s a tenfold increase in just six months.
These aren’t formatting errors. Privilege escalation paths increased 322%. Architectural design flaws spiked 153%. AI-generated code was 2.74 times more likely to contain cross-site scripting vulnerabilities. Developers using AI assistance exposed cloud credentials at nearly double the rate of those working without it.
The reason is straightforward: the models were trained on vast repositories of open-source code, and much of that code contains the same vulnerabilities they now reproduce. The model doesn’t understand your security architecture. It optimizes for finishing the task, not for protecting the system. Ask it to query a database and it might hand you a textbook SQL injection flaw, because that pattern appeared thousands of times in its training data.
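The pattern is easy to demonstrate. In this sketch (hypothetical table and column names), the first function is the textbook flaw an assistant trained on vulnerable examples tends to emit; the second is the parameterized form that treats input as data, never as SQL:

```python
import sqlite3

def find_user_unsafe(conn, username):
    # Vulnerable: user input is interpolated directly into the SQL string,
    # so an input like "x' OR '1'='1" rewrites the query's logic.
    query = f"SELECT id, username FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # Parameterized: the driver binds the value separately from the SQL,
    # so the input can never change the query's structure.
    query = "SELECT id, username FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # every row comes back
print(len(find_user_safe(conn, payload)))    # no rows match
```

Both versions "finish the task." Only one of them protects the system, and nothing in the model's objective distinguishes between them.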
The Architecture Is the Problem
Here’s what I keep coming back to.
Data absorption. Autonomous destruction. Vulnerability injection. These look like three separate problems, but they’re all symptoms of the same architectural choice: feeding your proprietary knowledge, your production access, and your code into systems you don’t own and can’t inspect.
Think of it this way. If you rented office space and the landlord could read every document on your desk, copy your filing cabinets, and hand the contents to your competitor across the hall, you’d move. That’s roughly the arrangement most organizations have with cloud AI. Except the lease is called “terms of service,” the filing cabinets are training data, and most tenants haven’t read the fine print.
The cloud sold us convenience in exchange for control. For most workloads, that trade was reasonable. But AI workloads are different. AI learns. That’s the entire value proposition. When the system learning from your work is hosted on someone else’s infrastructure, the cost goes beyond a subscription fee.
In my last article, I wrote about entire nations discovering this cost. The chief prosecutor of the International Criminal Court lost his email because Microsoft complied with U.S. sanctions. Amsterdam Trade Bank lost its cloud because of a court order from another continent. That was about collaboration tools. This is about something more intimate: the proprietary knowledge that makes your organization competitive. When that knowledge trains a model you don’t control, you lose more than access to infrastructure. You lose control of what the infrastructure learned by watching you work.
The Market Is Doing the Math
The repatriation wave isn’t a prediction. It’s happening. A February 2026 survey found that 93% of enterprises have already moved AI workloads off public cloud, are in the process, or are actively evaluating the move, and 91% said they would choose on-premises or hybrid infrastructure over public cloud for AI involving sensitive data.
Gartner calls the trend “geopatriation” and named it a top strategic technology trend for 2026. They project that 75% of European and Middle Eastern enterprises will move to sovereign environments by 2030, up from 5% in 2025. Organizations that have already made the shift are documenting cost savings of 30 to 60 percent.
37signals, the company behind Basecamp, is the clearest case study. They were spending $3.2 million a year on AWS. They bought $700,000 in Dell servers and moved on-premise. Savings: nearly $2 million a year. Projected total over five years: more than $10 million. Their CTO said the industry convinced everyone that owning hardware is impossible. It isn’t.
And in a move that would be funny if the stakes weren’t serious: Microsoft launched “Sovereign Cloud” in February 2026, letting organizations run AI models on their own hardware, fully disconnected from Microsoft’s central cloud. The company that locked the ICC prosecutor out of his email is now selling the fix.
Your Tier 1, at Organizational Scale
Throughout this series, I’ve argued for a tiered model of AI accountability. Tier 1 is local. Your agent runs on your hardware, stays in your space, and doesn’t need anyone else’s permission to operate.
On-premise AI is Tier 1 thinking applied at organizational scale. When you run models on your own hardware, the principle holds: what stays local stays yours. Your pricing logic doesn’t become training data for a competitor. Your customer patterns don’t get aggregated into a model someone else can query. Your production database isn’t at the mercy of an agent whose guardrails were set by another company’s product team.
The same principle works at every level. For an individual, it’s a Raspberry Pi running a local agent that doesn’t phone home. For an organization, it’s AI on hardware you own. For a nation, it’s France building sovereign collaboration tools and Germany migrating 30,000 government workstations off Microsoft. The thread connecting all three: when software learns from what you feed it, where it runs determines who benefits from the learning.
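For the individual case, a Tier 1 interaction can be as small as one HTTP call to a model served on hardware you own. This sketch assumes an Ollama-style server listening on localhost; the model name is a placeholder for whatever you run locally:

```python
import json
import urllib.request

def ask_local_model(prompt: str,
                    model: str = "llama3",
                    host: str = "http://localhost:11434") -> str:
    """Send a prompt to a locally hosted model. Nothing leaves this machine."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The server returns a JSON object whose "response" field
        # holds the completion.
        return json.loads(resp.read())["response"]

# Example (requires a local model server to be running):
# print(ask_local_model("Summarize this meeting transcript: ..."))
```

The prompt, the completion, and whatever the model learns from the exchange all stay on the loopback interface, which is the whole point.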
What I’m Still Working Through
I want to be honest about the limits of this argument.
On-premise AI isn’t free. The hardware costs are real. Maintaining your own infrastructure takes expertise that most mid-sized organizations don’t have on staff. 37signals runs a ten-person operations team with decades of experience. That’s not a typical bench.
The cloud remains excellent for prototyping, elastic workloads, and global distribution. I’m not arguing that every organization should move everything tomorrow. I’m arguing that where you run your production AI is not a neutral technical decision. It’s a decision about who controls your data, who learns from your operations, and who holds the keys when things go wrong.
I also worry about the equity gap. Data sovereignty could easily become another advantage that accrues to organizations with deep pockets while smaller players stay locked into the rental model. I don’t have a clean answer for that yet.
But the current default isn’t neutral. It was designed by companies whose revenue depends on you renting their infrastructure and feeding their models. Whether that default serves your interests is a question worth asking.
The cloud made us tenants. AI made the landlord observant. Are you comfortable with what they’re learning from watching you work?
This is the sixth in a series about AI accountability. In the next piece, I’ll look at observability as a human right: why you deserve to see what your AI did and why, and what it means when the systems shaping your decisions can’t show their work.
If you’re thinking about these questions too, I hope you’ll subscribe.
Rachel Ankerholz is an IT Director and writer exploring the intersection of AI ethics, accessibility, and human-centered technology. She writes about who gets included, and who gets left behind, when we build systems.