DeepSeek R1: Inside the AI Model Disrupting the Industry - A Deep Dive into Training Costs, Privacy Concerns, and Alternatives
Explore DeepSeek R1's innovative AI technology, featuring efficient training costs and key privacy considerations.
Instead of our usual "What am I Reading" edition, this week focuses entirely on DeepSeek. You've likely encountered this AI tool in recent headlines—whether about:
Its impact on the stock market
Its rise to #1 in app stores
The privacy concerns it has raised
In this piece, I'll examine DeepSeek in detail, compare it with competitors, and evaluate whether it truly lives up to the hype.
DeepSeek R1
DeepSeek R1 has stormed the AI ranks. While DeepSeek's AI models have been around for some time—particularly known for their fast code autocompletion—their latest release is groundbreaking. The new DeepSeek R1 is a reasoning model that directly competes with OpenAI's o1, which costs $200 per month for the Pro version. Notably, the accompanying research paper suggests the model required significantly fewer GPU resources for training. Despite China's restricted access to advanced semiconductor technology from US and European companies—which led many to assume the US would maintain a competitive advantage—DeepSeek R1 was trained on:
2,048 Nvidia H800 GPUs
These chips are less advanced than those available to OpenAI or Anthropic
This achievement has raised some skepticism in the US; only time will tell if DeepSeek's claims hold up.
I reviewed two key papers to understand the fundamentals. The research paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning demonstrates the model's approach and performance compared to competitors, while the technical report provides extensive details about its training process, architecture, and infrastructure.
The technical report caused significant buzz by revealing that the model was trained with far fewer resources than major competitors.
Let’s dive into some other interesting details these papers reveal:
DeepSeek R1 is a Mixture of Experts (MoE) language model. Instead of activating the full model of 671 billion parameters for every task, it routes each token to a small set of specialized experts, so only 37 billion parameters are active per token. This makes the model very efficient (see the sketch after this list).
It incorporates expert load balancing, which its base model, DeepSeek-V3, already used. The router learns to distribute tokens across all the experts so that no expert is overloaded or idle, which again saves compute.
Efficient Training Framework: The model is split into pipeline stages, and experts are distributed across nodes so that all computing resources are used effectively. DeepSeek calls this the DualPipe algorithm; it reduces GPU idle time and optimizes memory usage, which saves costs.
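To make the routing idea concrete, here is a minimal top-k gating sketch in PyTorch. The function name, dimensions, and workload are invented for illustration, and the balancing term shown is the classic auxiliary-loss formulation; DeepSeek-V3 actually uses an auxiliary-loss-free balancing strategy, so treat this as a sketch of the general concept rather than their exact method.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weights, num_experts=8, top_k=2):
    """Toy top-k MoE routing: each token is sent to only top_k of
    num_experts, so most parameters stay inactive for that token."""
    logits = hidden @ gate_weights                     # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)   # chosen experts per token

    # Load-balancing signal: compare each expert's share of routed tokens
    # with its average gate probability; a well-balanced router drives
    # both toward 1 / num_experts.
    num_tokens = hidden.shape[0]
    load = torch.zeros(num_experts)
    load.scatter_add_(0, topk_idx.flatten(), torch.ones(num_tokens * top_k))
    load_frac = load / (num_tokens * top_k)            # actual assignment share
    prob_frac = probs.mean(dim=0)                      # average gate probability
    balance_loss = num_experts * (load_frac * prob_frac).sum()
    return topk_idx, topk_probs, balance_loss

tokens = torch.randn(16, 64)   # 16 tokens with hidden size 64
gate = torch.randn(64, 8)      # router weights for 8 experts
idx, weights, aux = route_tokens(tokens, gate)
print(idx.shape, aux.item())   # each token is assigned top_k expert indices
```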
Despite its large size, they managed to train the model using just:
2.788 million GPU-hours on H800 chips
While exact comparisons are difficult since many models are closed source, estimates suggest:
GPT-4 required 54 million GPU hours on A100 chips
This translates to a dramatic cost difference:
$5.58 million for DeepSeek
$70-$100 million for GPT-4
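The DeepSeek figure follows directly from the GPU-hour count: the technical report assumes a rental price of roughly $2 per H800 GPU-hour.

```python
# Reproduce the quoted training cost from the technical report's
# assumed rental rate of ~$2 per H800 GPU-hour.
gpu_hours = 2_788_000
print(f"${gpu_hours * 2.0:,.0f}")  # $5,576,000, i.e. the ~$5.58M quoted above
```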
If these figures prove accurate, they would explain the significant tech stock losses seen at the start of this week.
This was particularly significant for NVIDIA, as investors now believe fewer GPUs are needed to train massive models. While modern OpenAI and Gemini models are also cost-efficient, excessive hype around R1 and older GPUs led to oversimplified market assumptions and created price volatility.
Let's look at how inference works: the phase where the trained model processes your inputs and generates output. It involves two stages:
Prefilling
Decoding
Prefilling processes the input prompt and context in one parallel pass, converting them into numerical representations the model can work with. Running the full model with 671 billion parameters requires 4 nodes with 32 GPUs for this stage.
The decoding stage focuses on generating the output text one token at a time. This process requires at least 40 nodes with 320 GPUs.
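To illustrate why the two stages differ, here is a toy sketch of prefill versus token-by-token decoding with a key/value cache. It is a deliberately simplified stand-in (single attention head, no real projections or sampling), not DeepSeek's actual serving stack.

```python
import torch

def attention_step(q, K, V):
    """Attend a single query against all cached keys/values."""
    scores = (K @ q) / (q.shape[0] ** 0.5)
    return torch.softmax(scores, dim=0) @ V

d_model = 64
prompt = torch.randn(10, d_model)             # 10 prompt tokens

# Prefill: process the whole prompt in one parallel pass and cache
# its key/value representations (the "KV cache").
K_cache, V_cache = prompt.clone(), prompt.clone()  # stand-ins for k/v projections

# Decode: generate one token at a time; each step attends over the
# cache and then appends its own key/value, so the cache grows.
token = prompt[-1]
for _ in range(5):
    token = attention_step(token, K_cache, V_cache)  # stand-in for next token
    K_cache = torch.cat([K_cache, token[None, :]])
    V_cache = torch.cat([V_cache, token[None, :]])

print(K_cache.shape)  # torch.Size([15, 64]): 10 prefilled + 5 decoded
```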
Such a deployment requires significant investment, though these calculations are based on H800 chips, currently the only ones available in China. While newer chips might reduce these requirements, one thing remains clear: we still need substantial GPU resources. History shows that as we become more efficient and capable, we pursue even more ambitious goals, in this case larger models and additional features. While NVIDIA's stock hasn't fully recovered, it's trending upward, and if GPU demand was indeed the main factor in the drop, the stock is likely to rebound as demand for NVIDIA GPUs remains strong.
What is so special about reasoning models?
You might wonder why you still need general-purpose models like Claude 3.5 or GPT when reasoning models exist.
While reasoning models excel at complex tasks, they come with trade-offs:
They are more expensive and slower than general models
This slower speed can be problematic for quick iterations
In coding tasks, for example, you might encounter complex problems requiring reasoning, followed by refinements where a general model would be faster and easier to work with.
In my experience, the best approach is using both types of models together: Start with a reasoning model to understand and solve complex problems, then transition to a general model for iterations and refinements.
How to use DeepSeek R1?
The answer could be as simple as visiting their main page. However, I wouldn't recommend this approach. Their Privacy Policy raises several concerning issues.
Their policy states that data is retained "as long as necessary" and is shared with third parties within China.
The policy lacks clear data deletion timelines and does not provide users with a way to access their stored data, which violates GDPR requirements.
Since the servers are located in China, your data could potentially be accessed by the Chinese government.
While some of these privacy concerns also apply to U.S.-based service providers, those companies must comply with GDPR and other regulations. Italy has gone as far as banning deepseek.com, though this measure may prove ineffective since users can easily circumvent it using a VPN.
Another issue is that multiple users have reported that when asking about the 1989 Tiananmen Square protests and massacre, the model on deepseek.com refuses to provide any response.
Safer ways to use DeepSeek R1
Two recommended ways to use DeepSeek:
Option 1: Local Deployment with Ollama
You can install a distilled version of DeepSeek-R1 of your choice, but keep in mind you'll need a powerful PC. I run the 7-billion-parameter model without issues on my MacBook with M1 Pro, but this smaller model isn't as capable as the full version, and you'll notice the difference. So either invest in hardware or consider the alternative route below.
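Once Ollama is installed and you've pulled a model (for example with `ollama pull deepseek-r1:7b`), you can query it entirely locally. Here is a minimal sketch against Ollama's local HTTP API, assuming the default port 11434:

```python
import requests

# Everything stays on your machine: Ollama serves the model locally.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:7b",
        "messages": [{"role": "user", "content": "Explain MoE models briefly."}],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```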
Option 2: Using Perplexity.ai
They provide an LLM-powered search experience and have become my go-to search engine in recent months. While I still use Google, I turn to perplexity.ai when I need quick, direct answers. Always verify the references and check for validity, but it's excellent for getting an overview and continuing your research. They've integrated DeepSeek-R1 into their Pro model for about $20 per month. This is much cheaper than buying all the hardware, and the model runs on U.S.-hosted servers with reasonable privacy conditions. While Perplexity still has room for privacy improvements, it's significantly better than the China-hosted alternative.
Alternatives to DeepSeek R1
OpenAI recently released o3-mini, which outperforms o1 and surpasses R1 in certain areas.
Fundamental Differences:
While R1 uses a Mixture of Experts approach, o3-mini employs a traditional dense transformer architecture. In o3-mini, every token passes through the entire model rather than a subset of experts as in R1, delivering consistent performance across different tasks, though it doesn't scale as effectively.
Model Parameters:
o3-mini: 200 billion parameters (dense, so the full model is always active)
R1: 671 billion parameters in total (37 billion active per token)
Performance and Pricing:
o3-mini offers a larger context window and maximum output length, allowing it to handle more data per request than R1. However, the pricing differs:
Token Pricing:
o3-mini: $1.10/million (input), $4.40/million (output)
R1: $0.55/million (input), $2.19/million (output)
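For a feel of what these prices mean in practice, here is a quick back-of-the-envelope comparison; the workload numbers are made up purely for illustration.

```python
# Hypothetical monthly workload: 5M input tokens, 1M output tokens.
input_tokens, output_tokens = 5_000_000, 1_000_000

def api_cost(in_price, out_price):
    """Cost in dollars given per-million-token prices."""
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

print(f"o3-mini: ${api_cost(1.10, 4.40):.2f}")  # o3-mini: $9.90
print(f"R1:      ${api_cost(0.55, 2.19):.2f}")  # R1:      $4.94
```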
Despite the higher token prices, o3-mini's faster token generation and lower memory requirements can make it more cost-effective to operate.
Interestingly, according to the paper O3-MINI VS DEEPSEEK-R1: WHICH ONE IS SAFER?, R1 is also considered less safe than o3 when it comes to ensuring that LLM outputs remain free from harmful content. Testing showed that DeepSeek R1 produced unsafe responses in 11.98% of cases, while o3 did so in only 1.19% of cases.
Summary and Takeaway
DeepSeek R1 represents a significant advancement in AI technology, offering competitive performance to OpenAI's o1 at a fraction of the training cost. Key takeaways include:
The model's efficiency stems from its Mixture of Experts approach, using only 37B parameters per token out of 671B total parameters
Training costs were dramatically lower ($5.58M vs $70-100M for GPT-4) despite using less advanced GPU technology
Privacy concerns exist due to Chinese data policies and server locations, making alternatives like Perplexity.ai ($20/month) or self-hosted solutions more attractive
OpenAI's o3-mini provides a compelling alternative with better safety metrics (1.19% unsafe responses vs 11.98% for R1) and larger context windows, though at higher token costs
While DeepSeek R1 demonstrates that efficient AI training is possible with fewer resources, the trade-offs between performance, privacy, and safety suggest that users should carefully consider their specific needs when choosing between available options.
The news storm has already hit, and DeepSeek has surged to the top of the app stores. As a company, you should be aware that your employees might use DeepSeek with internal information, and given what you now know about the privacy situation, you should not allow this. That means you should provide them with an alternative; if you don't, people will simply seek out tools that help them anyway, and the cost of not providing such tools internally is that your company's internal information ends up spread across these websites.
Thank you for reading my newsletter!
Have any suggestions or want to connect? Feel free to message me on Bluesky or Threads.

