by homarp on 12/30/2024, 8:04:00 AM
>Evaluation: Results were assessed by Claude 3.5 Sonnet V2 for consistency
why not do a human assessment on top... to ensure that Claude's assessment is correct?
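Even something lightweight would help calibrate the judge: sample a slice of the Claude-scored reviews and have a human re-grade them. A rough sketch of what I mean (hypothetical names, assuming the Anthropic Python SDK and its Claude 3.5 Sonnet V2 model id; not your actual framework):

    import json
    import random
    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

    JUDGE_PROMPT = """You are grading an automated pull-request review.

    Diff:
    {diff}

    Review to grade:
    {review}

    Reply with JSON only: {{"score": <1-10>, "justification": "<one sentence>"}}"""

    def judge_review(diff: str, review: str) -> dict:
        """Ask Claude 3.5 Sonnet V2 (model id assumed) to grade a single review."""
        msg = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # assumed id for Sonnet V2
            max_tokens=512,
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(diff=diff, review=review)}],
        )
        return json.loads(msg.content[0].text)  # relies on the model returning pure JSON

    def sample_for_human_check(judged: list[dict], fraction: float = 0.2) -> list[dict]:
        """Randomly sample judged reviews so a human can re-grade them and
        estimate how well the Claude judge agrees with human assessment."""
        k = max(1, int(len(judged) * fraction))
        return random.sample(judged, k)

Reporting the judge-vs-human agreement on that sample would make the Claude-only scores much more convincing.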
>conducted a detailed benchmark
I suggest you post a sample so others can try to reproduce it.
I recently conducted a detailed benchmark of various LLMs for AI code review, specifically focusing on Pull Request analysis. The results were quite surprising and contradict some recent marketing claims.
Test setup:
Results (in order of performance):
Key findings:
For transparency: I developed LlamaPReview (https://jetxu-llm.github.io/LlamaPReview-site/), a GitHub App for automated PR reviews, whose core code I used as the testing framework. The app is free and can help you reproduce these PR reviews.
Questions for the community:
Would love to hear your experiences and thoughts, especially from those who've tested multiple models in production environments.
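For concreteness, here is a minimal sketch of the kind of harness described above; the names (CandidateModel, run_benchmark, REVIEW_PROMPT) are hypothetical illustrations, not the actual LlamaPReview internals:

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    REVIEW_PROMPT = (
        "You are reviewing a pull request. Point out bugs, risky changes, and "
        "missing tests in the following diff:\n\n{diff}"
    )

    @dataclass
    class CandidateModel:
        name: str
        generate: Callable[[str], str]  # prompt -> review text
        reviews: Dict[str, str] = field(default_factory=dict)  # pr_id -> review

    def run_benchmark(models: List[CandidateModel], prs: Dict[str, str]) -> None:
        """Send the same review prompt for every PR diff to every candidate model."""
        for model in models:
            for pr_id, diff in prs.items():
                model.reviews[pr_id] = model.generate(REVIEW_PROMPT.format(diff=diff))

    # Any provider fits behind the simple callable, e.g. (OpenAI SDK shown):
    # def gpt4o(prompt: str) -> str:
    #     from openai import OpenAI
    #     resp = OpenAI().chat.completions.create(
    #         model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    #     return resp.choices[0].message.content

Keeping each model behind a plain prompt-to-text callable makes it easy to swap providers and to feed all collected reviews to the same judge afterwards.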