• by hatefulmoron on 5/4/2025, 7:34:56 AM

    I had assumed that the Y axis corresponded to some measurement of the LLM's ability to actually work/mull over a task in a loop while making progress. In other words, I thought it meant something like "you can leave Sonnet 3.7 for a whole hour and it will meaningfully progress on a problem", but the reality is less impressive. Serves me right for not looking at the fine print.

  • by sandspar on 5/5/2025, 4:28:39 AM

    Gary Marcus could save himself a lot of time. He just has to write a post called "Here's today's opinion" and leave the body text blank; he's so predictable that everyone already knows his conclusions. That way he'd save his readers a lot of time too.

  • by aoeusnth1 on 5/5/2025, 2:41:00 AM

    This post is a very weak and incoherent criticism of a well-formulated benchmark: the task-length bucket for which a model succeeds 50% of the time.

    Gary says: This is just the task length that the models were able to solve in THIS dataset. What about other tasks?

    Yeah, obviously. The point is that models are improving on these tasks in a predictable fashion. If you care about software, you should care how good AI is at software.

    Gary says: Task length is a bad metric. What about a bunch of other factors of difficulty that might not factor into task length?

    Task length is a pretty good proxy for difficulty; that's why people estimate bugs in days. Of course many factors contribute to such an estimate, but averaged over many tasks, time is a great metric for difficulty.

    Finally, Gary just ignores that, despite his view that the metric makes no sense and is meaningless, it has extremely strong predictive value. That should give you pause: how can an arbitrary metric with no connection to the true difficulty of a task, and no real way of comparing its validity across tasks or across task-takers, produce such a retrospectively smooth curve and so closely predict the recent data points from Sonnet and o3? Something IS going on there, and it cannot fit into Gary's ~spin~ narrative that nothing ever happens.
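
    Roughly, the metric can be sketched like this (a simplified illustration with invented numbers, not METR's actual code): fit a logistic curve of success rate against log task length and read off where it crosses 50%.

        # Minimal sketch (assumptions: made-up data, per-bucket success rates
        # rather than per-rollout binary outcomes). Estimates the task length
        # at which a model's success rate crosses 50%.
        import numpy as np
        from scipy.optimize import curve_fit

        def logistic(log_minutes, midpoint, slope):
            # P(success) as a function of log task length
            return 1.0 / (1.0 + np.exp(slope * (log_minutes - midpoint)))

        # Hypothetical average success rates per task-length bucket (minutes)
        task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240])
        success_rate = np.array([0.95, 0.9, 0.85, 0.7, 0.55, 0.4, 0.3, 0.15, 0.05])

        params, _ = curve_fit(logistic, np.log(task_minutes), success_rate,
                              p0=[np.log(30), 1.0])
        midpoint, slope = params

        # The "50% task length": where the fitted curve crosses 0.5
        print(f"50%-success horizon: ~{np.exp(midpoint):.0f} minutes")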

  • by yorwba on 5/4/2025, 8:10:13 AM

    > you could probably put together one reasonable collection of word counting and question answering tasks with average human time of 30 seconds and another collection with an average human time of 20 minutes where GPT-4 would hit 50% accuracy on each.

    So do this and pick the one where humans do best. I doubt that doing so would show all progress to be illusory.

    But it would certainly be interesting to know what the easiest thing is that a human can do but current AIs struggle with.

  • by Nivge on 5/4/2025, 7:30:04 AM

    TL;DR - the benchmark depends on its specific dataset, and it isn't a perfect instrument for evaluating AI progress. That doesn't mean it makes no sense or has no value.

  • by Sharlin on 5/4/2025, 8:19:24 AM

    > Unfortunately, literally none of the tweets we saw even considered the possibility that a problematic graph specific to software tasks might not generalize to literally all other aspects of cognition.

    How am I not surprised?

  • by dist-epoch on 5/4/2025, 7:59:39 AM

    > Abject failure on a task that many adults could solve in a minute

    Maybe the author should check, before pressing "Publish", that the info in the post isn't already outdated.

    ChatGPT passed the image generation test mentioned: https://chatgpt.com/share/68171e2a-5334-8006-8d6e-dd693f2cec...