Art Debono Hotel, Γουβιά, Κέρκυρα 49100

Vocational school with modern teaching methods

I.E.K. Κέρκυρας

26610 90030

iekker@mintour.gr

Art Debono Hotel

Γουβιά, Κέρκυρα 49100

08:30 - 15:30

Monday - Friday

I.E.K. Κέρκυρας

26610 90030

info@iek-kerkyras.edu.gr

Art Debono Hotel

Γουβιά, Κέρκυρα 49100

08:30 - 19:00

Monday - Friday

Overview

  • Founded Date: April 17, 1928
  • Sectors: Tourism
  • Posted Jobs: 0
  • Viewed: 7

Company Description

DeepSeek R1 Model Overview and How It Ranks Against OpenAI’s o1

DeepSeek is a Chinese AI company “dedicated to making AGI a reality” that open-sources all of its models. They started in 2023, but have been making waves over the past month or two, and especially this past week, with the release of their two latest reasoning models: DeepSeek-R1-Zero and the more advanced DeepSeek-R1, also known as DeepSeek Reasoner.

They’ve released not only the models but also the code and evaluation prompts for public use, along with a detailed paper outlining their approach.

Aside from producing two highly performant models that are on par with OpenAI’s o1 model, the paper contains a lot of valuable information about reinforcement learning, chain of thought reasoning, prompt engineering with reasoning models, and more.

We’ll start by focusing on the training process of DeepSeek-R1-Zero, which uniquely relied entirely on reinforcement learning instead of traditional supervised learning. We’ll then move on to DeepSeek-R1, how its reasoning works, and some prompt engineering best practices for reasoning models.

Hey everyone, Dan here, co-founder of PromptHub. Today, we’re diving into DeepSeek’s latest model release and comparing it with OpenAI’s models, specifically o1 and o1-mini. We’ll explore their training process, reasoning abilities, and some key insights into prompt engineering for reasoning models.

DeepSeek is a Chinese AI company committed to open-source development. Their recent release, the R1 reasoning model, is groundbreaking due to its open-source nature and innovative training methods. This includes open access to the models, prompts, and research papers.

Released on January 20th, DeepSeek’s R1 achieved impressive performance on various benchmarks, matching OpenAI’s o1 models. Notably, they also released a precursor model, R1-Zero, which serves as the foundation for R1.

Training Process: R1-Zero to R1

R1-Zero: This model was trained exclusively using reinforcement learning without supervised fine-tuning, making it the first open-source model to achieve high performance through this approach. Training involved:

– Rewarding correct answers in deterministic tasks (e.g., math problems).
– Encouraging structured reasoning outputs using templates with “<think>” and “<answer>” tags.

Through thousands of iterations, R1-Zero developed longer reasoning chains, self-verification, and even reflective behaviors. For example, during training, the model exhibited “aha” moments and self-correction behaviors, which are rare in standard LLMs.

R1: Building on R1-Zero, R1 added several improvements:

– Curated datasets with long chain of thought examples.
– Incorporation of R1-Zero-generated reasoning chains.
– Human preference alignment for more refined responses.
– Distillation into smaller models (Llama 3.1 and 3.3 at various sizes).

Performance Benchmarks

DeepSeek’s R1 model performs on par with OpenAI’s o1 models across several reasoning benchmarks:

Reasoning and math tasks: R1 rivals or surpasses o1 models in accuracy and depth of reasoning.
Coding tasks: o1 models generally perform better on LiveCodeBench and CodeForces tasks.
Simple QA: R1 often surpasses o1 in structured QA tasks (e.g., 47% accuracy vs. 30%).

One notable finding is that longer reasoning chains generally improve performance. This aligns with insights from Microsoft’s MedPrompt framework and OpenAI’s observations on test-time compute and reasoning depth.

Challenges and Observations

Despite its strengths, R1 has some limitations:

– Mixing English and Chinese in responses, due to a lack of supervised fine-tuning.
– Less polished responses compared to chat models like OpenAI’s GPT.

These issues were addressed during R1’s refinement process, which included supervised fine-tuning and human feedback.

Prompt Engineering Insights

A notable takeaway from DeepSeek’s research is how few-shot prompting degraded R1’s performance compared to zero-shot or concise, tailored prompts. This aligns with findings from the MedPrompt paper and OpenAI’s recommendation to limit context for reasoning models. Overcomplicating the input can overwhelm the model and reduce accuracy.

DeepSeek’s R1 is a significant step forward for open-source reasoning models, demonstrating capabilities that rival OpenAI’s o1. It’s an exciting time to experiment with these models and their chat interface, which is free to use.

If you have questions or want to learn more, check out the resources linked below. See you next time!

Training DeepSeek-R1-Zero: A reinforcement learning-only method

DeepSeek-R1-Zero stands apart from most other state-of-the-art models because it was trained using only reinforcement learning (RL), with no supervised fine-tuning (SFT). This challenges the current conventional approach and opens up new opportunities to train reasoning models with less human intervention and effort.

DeepSeek-R1-Zero is the first open-source model to demonstrate that advanced reasoning capabilities can be developed purely through RL.

Without pre-labeled datasets, the model learns through trial and error, refining its behavior, parameters, and weights based solely on feedback from the solutions it generates.

DeepSeek-R1-Zero is the base model for DeepSeek-R1.

The RL process for DeepSeek-R1-Zero

The training process for DeepSeek-R1-Zero involved presenting the model with various reasoning tasks, ranging from math problems to abstract reasoning challenges. The model generated outputs and was evaluated based on its performance.

DeepSeek-R1-Zero received feedback through a reward system that helped guide its learning process:

Accuracy rewards: Evaluate whether the output is correct. Used when there are deterministic results (e.g., math problems).

Format rewards: Encouraged the model to structure its reasoning within <think> and </think> tags.
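
To make this concrete, here is a minimal sketch of what a rule-based reward along these lines could look like. This is not DeepSeek’s actual implementation; the answer-extraction logic and the reward weighting are assumptions for illustration.

```python
import re

def format_reward(output: str) -> float:
    """Reward outputs that follow the <think>...</think><answer>...</answer> structure."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, output.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(output: str, gold_answer: str) -> float:
    """Reward correct final answers on deterministic tasks (e.g., math problems)."""
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0

def total_reward(output: str, gold_answer: str) -> float:
    # The relative weighting is an assumption; the paper does not prescribe these values.
    return accuracy_reward(output, gold_answer) + 0.5 * format_reward(output)

sample = "<think>2 + 2 = 4, and double-checking: 4 - 2 = 2.</think> <answer>4</answer>"
print(total_reward(sample, "4"))  # -> 1.5
```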

Training prompt template

To train DeepSeek-R1-Zero to produce structured chain of thought sequences, the researchers used the following training prompt template, replacing prompt with the reasoning question. You can access it in PromptHub here.

This template prompted the model to explicitly outline its thought process within <think> tags before delivering the final answer within <answer> tags.
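
As a rough sketch of how a template like this can be assembled in code (this is a paraphrase of the structure described above, not the verbatim template from the paper or PromptHub):

```python
# Paraphrased sketch of an R1-Zero-style training template; the exact wording in the
# paper / PromptHub may differ.
TRAINING_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question and the "
    "Assistant solves it. The Assistant first reasons through the problem, then gives "
    "the final answer. The reasoning is enclosed in <think> </think> tags and the "
    "answer in <answer> </answer> tags.\n"
    "User: {prompt}\n"
    "Assistant:"
)

question = "If 3x + 5 = 20, what is x?"
print(TRAINING_TEMPLATE.format(prompt=question))
```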

The power of RL in reasoning

With this training process, DeepSeek-R1-Zero began to produce sophisticated reasoning chains.

Through thousands of training steps, DeepSeek-R1-Zero progressed to solve increasingly complex problems. It learned to:

– Generate long reasoning chains that enabled deeper and more structured problem-solving.

– Perform self-verification to cross-check its own responses (more on this later).

– Correct its own mistakes, showcasing emerging self-reflective behaviors.

DeepSeek-R1-Zero performance

While DeepSeek-R1-Zero is mainly a precursor to DeepSeek-R1, it still achieved strong performance on several benchmarks. Let’s dive into some of the experiments that were run.

Accuracy improvements during training

– Pass@1 accuracy started at 15.6% and, by the end of training, improved to 71.0%, comparable to OpenAI’s o1-0912 model.

– The solid red line represents performance with majority voting (comparable to ensembling and self-consistency techniques), which increased accuracy further to 86.7%, surpassing o1-0912.
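
For intuition, majority voting (the cons@64 setting reported for these models) simply samples many answers per question and keeps the most common one. A minimal sketch, with a hypothetical sample_answer function standing in for a model call:

```python
from collections import Counter
import random

def sample_answer(question: str) -> str:
    # Hypothetical stand-in for sampling one completion from the model and
    # extracting its final answer; here it is just noisy but usually correct.
    return random.choice(["42", "42", "42", "41"])

def majority_vote(question: str, k: int = 64) -> str:
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))  # almost always "42"
```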

Next, we’ll look at a table comparing DeepSeek-R1-Zero’s performance across several reasoning datasets against OpenAI’s reasoning models.

AIME 2024: 71.0% Pass@1, slightly below o1-0912 but above o1-mini; 86.7% cons@64, beating both o1-0912 and o1-mini.

MATH-500: Achieved 95.9%, beating both o1-0912 and o1-mini.

GPQA Diamond: Outperformed o1-mini with a score of 73.3%.

Coding tasks (CodeForces and LiveCodeBench): Performed much worse than the o1 models.

Next, we’ll look at how response length increased throughout the RL training process.

This chart shows the length of the model’s responses as training progresses. Each “step” represents one cycle of the model’s learning process, where feedback is provided based on the output’s performance, evaluated using the prompt template discussed earlier.

For each question (corresponding to one step), 16 responses were sampled, and the average accuracy was computed to ensure a stable evaluation.

As training progresses, the model generates longer reasoning chains, allowing it to solve increasingly complex reasoning tasks by leveraging more test-time compute.

While longer chains don’t always guarantee better results, they generally correlate with improved performance, a pattern also observed in the MedPrompt paper (learn more about it here) and in the original o1 paper from OpenAI.

Aha moments and self-verification

One of the coolest aspects of DeepSeek-R1-Zero’s development (which also applies to the flagship R1 model) is just how good the model became at reasoning. Advanced reasoning behaviors that were not explicitly programmed emerged through its reinforcement learning process.

Over thousands of training steps, the model began to self-correct, revisit flawed reasoning, and verify its own solutions, all within its chain of thought.

An example of this, noted in the paper and described as the “aha moment,” is shown below in red text.

In this instance, the model literally said, “That’s an aha moment.” Through DeepSeek’s chat function (their version of ChatGPT), this kind of reasoning typically emerges with phrases like “Wait a minute” or “Wait, but …”.

Limitations and challenges of DeepSeek-R1-Zero

While DeepSeek-R1-Zero was able to perform at a high level, there were some drawbacks to the model.

Language mixing and coherence issues: The model occasionally produced responses that mixed languages (Chinese and English).

Reinforcement learning trade-offs: The absence of supervised fine-tuning (SFT) meant the model lacked the refinement needed for fully polished, human-aligned outputs.

DeepSeek-R1 was developed to address these issues!

What is DeepSeek-R1?

DeepSeek-R1 is an open-source reasoning model from the Chinese AI lab DeepSeek. It builds on DeepSeek-R1-Zero, which was trained entirely with reinforcement learning. Unlike its predecessor, DeepSeek-R1 incorporates supervised fine-tuning, making it more refined. Notably, it outperforms OpenAI’s o1 model on several benchmarks, more on that later.

What are the main differences between DeepSeek-R1 and DeepSeek-R1-Zero?

DeepSeek-R1 builds on the foundation of DeepSeek-R1-Zero, which serves as the base design. The two differ in their training methods and overall performance.

1. Training technique

DeepSeek-R1-Zero: Trained entirely with reinforcement learning (RL) and no supervised fine-tuning (SFT).

DeepSeek-R1: Uses a multi-stage training pipeline that starts with supervised fine-tuning (SFT), followed by the same reinforcement learning process that DeepSeek-R1-Zero went through. SFT helps improve coherence and readability.

2. Readability & Coherence

DeepSeek-R1-Zero: Struggled with language mixing (English and Chinese) and readability issues. Its reasoning was strong, but its outputs were less polished.

DeepSeek-R1: Addressed these issues with cold-start fine-tuning, making responses clearer and more structured.

3. Performance

DeepSeek-R1-Zero: Still a very strong reasoning model, often beating OpenAI’s o1, but the language mixing issues reduced its usability significantly.

DeepSeek-R1: Outperforms R1-Zero and OpenAI’s o1 on most reasoning benchmarks, and its responses are far more polished.

In short, DeepSeek-R1-Zero was a proof of concept, while DeepSeek-R1 is the fully refined version.

How DeepSeek-R1 was trained

To tackle the readability and coherence issues of R1-Zero, the researchers incorporated a cold-start fine-tuning phase and a multi-stage training pipeline when building DeepSeek-R1:

Cold-Start Fine-Tuning:

– Researchers prepared a high-quality dataset of long chain of thought examples for initial supervised fine-tuning (SFT). This data was gathered using:

– Few-shot prompting with detailed CoT examples.

– Post-processed outputs from DeepSeek-R1-Zero, refined by human annotators.
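
Purely for illustration, a cold-start SFT record could be stored as a simple prompt/completion pair, with the curated long chain of thought wrapped in the same tags used during RL. The field names and content below are assumptions, not DeepSeek’s actual data format:

```python
import json

# Hypothetical cold-start SFT example; field names and content are illustrative only.
record = {
    "prompt": "If a train travels 120 km in 1.5 hours, what is its average speed?",
    "completion": (
        "<think>Average speed is distance divided by time: 120 km / 1.5 h = 80 km/h. "
        "Double-checking: 80 * 1.5 = 120, so that is consistent.</think> "
        "<answer>80 km/h</answer>"
    ),
    "source": "r1-zero-output-human-edited",  # or "few-shot-cot-demonstration"
}

with open("cold_start_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```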

Reinforcement Learning:

DeepSeek-R1 went through the same RL process as DeepSeek-R1-Zero to further refine its reasoning capabilities.

Human Preference Alignment:

– A secondary RL stage improved the model’s helpfulness and harmlessness, ensuring better alignment with user needs.

Distillation to Smaller Models:

– DeepSeek-R1’s reasoning capabilities were distilled into smaller, more efficient models, such as Qwen variants, Llama-3.1-8B, and Llama-3.3-70B-Instruct.
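
Putting the four stages together, the overall recipe can be summarized as a high-level sketch. Every function below is a placeholder for one of the stages described above, not a real training API:

```python
# Placeholder stages: each stub just returns its input. A real pipeline would implement
# SFT, the R1-Zero-style RL loop, preference RL, and distillation here.
def supervised_finetune(model, data): return model
def reinforcement_learning(model, tasks): return model
def preference_alignment(model, preferences): return model
def distill(teacher, student): return student

def train_deepseek_r1(base_model, cold_start_data, reasoning_tasks, preferences, students):
    """High-level sketch of the multi-stage recipe described above; not a real API."""
    model = supervised_finetune(base_model, cold_start_data)   # 1. cold-start SFT
    model = reinforcement_learning(model, reasoning_tasks)     # 2. R1-Zero-style RL
    model = preference_alignment(model, preferences)           # 3. helpfulness/harmlessness RL
    distilled = [distill(model, s) for s in students]          # 4. distill into smaller models
    return model, distilled

model, small_models = train_deepseek_r1("base", [], [], [], ["qwen-7b", "llama-3.1-8b"])
```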

DeepSeek-R1 benchmark performance

The researchers tested DeepSeek-R1 across a variety of benchmarks and against leading models: o1, GPT-4o, Claude 3.5 Sonnet, and o1-mini.

The benchmarks were broken down into several categories, shown below in the table: English, Code, Math, and Chinese.

Setup

The following settings were applied across all models:

– Maximum generation length: 32,768 tokens.

– Sampling temperature: 0.6.

– Top-p value: 0.95.
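
If you want to reproduce this decoding setup against an OpenAI-compatible endpoint, a request might look like the sketch below. The base URL and model name are assumptions for this sketch; check your provider’s documentation:

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model name; adjust for your provider.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="deepseek-reasoner",   # assumed model identifier
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
    temperature=0.6,             # benchmark setting reported above
    top_p=0.95,                  # benchmark setting reported above
    max_tokens=32768,            # maximum generation length used in the evals
)
print(response.choices[0].message.content)
```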

– DeepSeek-R1 surpassed o1, Claude 3.5 Sonnet, and other models in the majority of reasoning benchmarks.

– o1 was the best-performing model in four out of the five coding-related benchmarks.

– DeepSeek-R1 performed well on creative and long-context tasks, like AlpacaEval 2.0 and ArenaHard, outperforming all other models.

Prompt engineering with reasoning models

My favorite part of the post was the researchers’ observation about DeepSeek-R1’s sensitivity to prompts:

This is another data point that aligns with insights from our Prompt Engineering with Reasoning Models Guide, which references Microsoft’s research on their MedPrompt framework. In their study with OpenAI’s o1-preview model, they found that overwhelming reasoning models with few-shot context degraded performance, a sharp contrast to non-reasoning models.

The key takeaway? Zero-shot prompting with clear and concise instructions seems to work best with reasoning models.
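
In practice, that means keeping prompts to reasoning models short and direct rather than padding them with examples. A minimal illustration (both prompts are made up for this sketch):

```python
# Concise zero-shot prompt: state the task, constraints, and output format, nothing else.
zero_shot = (
    "Classify the sentiment of the review below as positive, negative, or neutral. "
    "Reply with a single word.\n\n"
    "Review: The battery lasts two days, but the screen scratches easily."
)

# Few-shot version of the same task: the extra examples add context that, per the
# findings above, tends to hurt reasoning models rather than help them.
few_shot = (
    "Review: I loved it. -> positive\n"
    "Review: Broke after a week. -> negative\n"
    "Review: It is a phone. -> neutral\n\n"
    "Review: The battery lasts two days, but the screen scratches easily. ->"
)

print(zero_shot)  # prefer this style when prompting reasoning models like R1 or o1
```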