by homarp on 12/30/2024, 8:04:00 AM
>Evaluation: Results were assessed by Claude 3.5 Sonnet V2 for consistency
why not do a human assessment on top... to ensure that Claude's assessment is correct?
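Even something lightweight would help calibrate the judge: sample a slice of the Claude-scored reviews and have a human re-grade them. A rough sketch of what I mean (hypothetical names, assuming the Anthropic Python SDK and its Claude 3.5 Sonnet V2 model id; not your actual framework):

    import json
    import random
    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

    JUDGE_PROMPT = """You are grading an automated pull-request review.

    Diff:
    {diff}

    Review to grade:
    {review}

    Reply with JSON only: {{"score": <1-10>, "justification": "<one sentence>"}}"""

    def judge_review(diff: str, review: str) -> dict:
        """Ask Claude 3.5 Sonnet V2 (model id assumed) to grade a single review."""
        msg = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # assumed id for Sonnet V2
            max_tokens=512,
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(diff=diff, review=review)}],
        )
        return json.loads(msg.content[0].text)  # relies on the model returning pure JSON

    def sample_for_human_check(judged: list[dict], fraction: float = 0.2) -> list[dict]:
        """Randomly sample judged reviews so a human can re-grade them and
        estimate how well the Claude judge agrees with human assessment."""
        k = max(1, int(len(judged) * fraction))
        return random.sample(judged, k)

Reporting the judge-vs-human agreement on that sample would make the Claude-only scores much more convincing.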
>conducted a detailed benchmark
I suggest you post a sample so others can try to reproduce it.
I recently conducted a detailed benchmark of various LLMs for AI code review, specifically focusing on Pull Request analysis. The results were quite surprising and contradict some recent marketing claims.
Test setup:
Results (in order of performance):
Key findings:
For transparency: I developed LlamaPReview (https://jetxu-llm.github.io/LlamaPReview-site/), a GitHub App for automated PR reviews, whose core code I used as the testing framework. The app is free and can help you reproduce these PR reviews.
Questions for the community:
Would love to hear your experiences and thoughts, especially from those who've tested multiple models in production environments.
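For concreteness, here is a minimal sketch of the kind of harness described above; the names (CandidateModel, run_benchmark, REVIEW_PROMPT) are hypothetical illustrations, not the actual LlamaPReview internals:

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    REVIEW_PROMPT = (
        "You are reviewing a pull request. Point out bugs, risky changes, and "
        "missing tests in the following diff:\n\n{diff}"
    )

    @dataclass
    class CandidateModel:
        name: str
        generate: Callable[[str], str]  # prompt -> review text
        reviews: Dict[str, str] = field(default_factory=dict)  # pr_id -> review

    def run_benchmark(models: List[CandidateModel], prs: Dict[str, str]) -> None:
        """Send the same review prompt for every PR diff to every candidate model."""
        for model in models:
            for pr_id, diff in prs.items():
                model.reviews[pr_id] = model.generate(REVIEW_PROMPT.format(diff=diff))

    # Any provider fits behind the simple callable, e.g. (OpenAI SDK shown):
    # def gpt4o(prompt: str) -> str:
    #     from openai import OpenAI
    #     resp = OpenAI().chat.completions.create(
    #         model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    #     return resp.choices[0].message.content

Keeping each model behind a plain prompt-to-text callable makes it easy to swap providers and to feed all collected reviews to the same judge afterwards.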