Did DeepSeek Use Google Gemini to Train Its Latest AI Model?

DeepSeek’s recent release of its updated R1-0528 reasoning model has sparked debate among AI enthusiasts and developers. Many are asking: did DeepSeek use outputs from Google’s Gemini AI to train the new model, and what would that mean for the competitive AI landscape? R1-0528 excels at complex math and coding problems, but DeepSeek has not disclosed its training data sources. Speculation is mounting, however, that the team leveraged Gemini’s outputs as part of the training process, raising important questions about AI data usage, model training ethics, and the impact on innovation.


Several AI researchers, including Melbourne-based developer Sam Paech, have pointed to strong linguistic similarities between DeepSeek’s R1-0528 and Google’s Gemini 2.5 Pro outputs. Paech noted that the choice of words and expressions in DeepSeek’s model “echoes” those favored by Gemini, suggesting a possible link. Similarly, the creator of the SpeechMap “free speech eval” observed that the reasoning “traces” or intermediate thoughts generated by DeepSeek’s model closely resemble those from Gemini. While these findings don’t conclusively prove DeepSeek used Gemini data, they fuel ongoing speculation about cross-model training.
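The kind of lexical comparison behind such claims can be sketched simply: build word-frequency profiles of outputs from two models and measure how much they overlap. The snippet below is a rough illustration only; the sample strings are invented, and real analyses like Paech’s run over large corpora of model outputs.

```python
# Illustrative sketch: cosine similarity between word-frequency
# profiles of text samples. Sample texts are invented for the demo.
import math
from collections import Counter

def word_profile(text):
    """Lowercased word-frequency vector for a text sample."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

model_a = word_profile("let us delve into the nuances of this intricate problem")
model_b = word_profile("we shall delve into the nuances of a truly intricate puzzle")
model_c = word_profile("ok here is a quick answer no fluff")

# Models with similar stylistic habits score higher against each other.
print(cosine_similarity(model_a, model_b) > cosine_similarity(model_a, model_c))
```

A high similarity score between two models’ outputs is suggestive, not conclusive, which is exactly the caveat researchers attach to these observations.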

This is not the first time DeepSeek has faced accusations regarding its data sources. Last December, observers noticed that DeepSeek’s V3 model sometimes identified itself as ChatGPT, hinting it might have trained on OpenAI’s chatbot logs. Earlier in 2025, OpenAI disclosed evidence pointing to DeepSeek’s use of “distillation,” a controversial training method that extracts knowledge from larger, more advanced models. Microsoft, a major OpenAI partner, reportedly detected large-scale data transfers from OpenAI developer accounts linked to DeepSeek toward the end of 2024. These developments highlight the tension between innovation and intellectual property in AI development.

Distillation itself is a common AI training technique. It allows smaller models to learn from larger, well-trained ones by using their outputs as training data. However, OpenAI’s terms of service explicitly forbid clients from using its model outputs to build rival AI systems. This adds a layer of complexity to evaluating whether DeepSeek’s approach crosses ethical or legal boundaries. Complicating matters, many AI models tend to use similar phrases and reasoning patterns because much training data comes from the open web — increasingly saturated with AI-generated content, clickbait, and bot activity on platforms like Reddit and X (formerly Twitter). This “contamination” blurs the lines between original data and synthesized AI outputs, making it harder to track training sources.
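Mechanically, distillation trains a small “student” model to match a larger “teacher” model’s output distributions rather than hard labels. The sketch below shows the core quantity involved, the KL divergence between temperature-softened teacher and student distributions; the names and toy logits are illustrative and not a claim about any specific lab’s pipeline.

```python
# Minimal sketch of the soft-target loss used in knowledge distillation.
# Toy logits are invented; in practice these come from model forward passes.
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the
    student's: the quantity the student minimizes during training."""
    p = softmax(teacher_logits, temperature)  # teacher "soft targets"
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher  = [4.0, 1.0, 0.5]   # teacher is confident in token 0
aligned  = [3.8, 1.1, 0.4]   # student that mimics the teacher
diverged = [0.5, 4.0, 1.0]   # student that disagrees

# The mimicking student incurs a much lower loss.
print(distillation_loss(teacher, aligned) < distillation_loss(teacher, diverged))
```

The temperature parameter softens both distributions so the student also learns the teacher’s relative preferences among unlikely tokens, which is part of why distillation transfers capability so cheaply, and why API providers treat their raw outputs as sensitive.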

Experts like Nathan Lambert from the nonprofit AI research institute AI2 believe that if DeepSeek did use Google Gemini data, it wouldn’t be surprising. According to Lambert, DeepSeek has ample funding but limited GPU resources, so generating synthetic training data from a top-tier API like Gemini’s would be a cost-effective strategy to boost model performance. This approach leverages the computational strength of an existing model to train a competing AI without investing heavily in new hardware.

In response to these concerns, AI companies are tightening security around their models to prevent unauthorized distillation. OpenAI now requires rigorous ID verification to access certain advanced models, excluding countries such as China from API access. Google has started “summarizing” the reasoning traces generated by its Gemini models on the AI Studio platform, making it more difficult for rivals to replicate or learn from Gemini’s outputs. Similarly, Anthropic recently announced plans to summarize its own model traces to protect competitive advantages.

The DeepSeek-Gemini connection remains a hot topic in AI circles, raising crucial questions about data ethics, innovation, and competition in the fast-evolving AI space. As these developments unfold, clearer policies and stronger safeguards will be essential to balance open AI progress with intellectual property rights. We will continue monitoring this story and update as more information from Google and DeepSeek becomes available.
