by davidkunz on 11/21/2023, 12:20:04 PM
For smaller models, I'm impressed by Mistral-7B and fine-tuned variants like Zephyr. I use them regularly in Neovim[1] for mundane tasks (grammar correction, summaries, ...). I'm curious how Orca 2 performs; downloading it right now.
[1]: with https://github.com/David-Kunz/gen.nvim
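gen.nvim talks to a local Ollama server under the hood, so the same kind of mundane-task call can be made outside the editor too. A minimal sketch in Python, assuming Ollama is running locally and the mistral model has already been pulled:

    import json
    import urllib.request

    # Sketch: send a grammar-correction prompt straight to a local Ollama server,
    # roughly the kind of request gen.nvim makes from inside Neovim.
    # Assumes `ollama pull mistral` has been run and the server is on the default port.
    def fix_grammar(text: str) -> str:
        payload = json.dumps({
            "model": "mistral",
            "prompt": "Fix the grammar in the following text and reply with only "
                      f"the corrected text:\n\n{text}",
            "stream": False,
        }).encode()
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"].strip()

    print(fix_grammar("Their is many reason to prefer small models."))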
by kromem on 11/21/2023, 1:35:40 PM
A really important nuance here is that they are building on top of Llama-2, the pretrained model, and not Llama-2-chat.
I really think the entire field is doing a degree of damage with chat fine tuning beyond what might be expected, because part of that chat instruction is regularly an emphasis on identifying as an LLM.
The problem with this is that nearly all of the training data it's performing next token prediction on is text generated by humans.
So most of the fine tuning I've seen inherently narrows the model's scope. While pretrained models are harder to use, I regularly prefer them over chat models when both are available: even at similar temperatures, the quality and variety of language is much better in the pretrained model than in the chat model.
This fine tuning only introduced a bias towards logical, step-by-step analysis and problem solving techniques, and the results are great. But I'm willing to bet that an identical fine tuning on top of the chat model would have done much worse on the evaluations - not just the compounding of a typical fine tuning loss of a few percent, but more like a double digit relative difference.
It's quite frustrating that the anxiety over model safety is likely throwing away tens of millions of dollars worth of data in the pretrained model when only chat models are available for the SotA. I hope that in the future a lighter touch is taken when fine tuning the pretrained model: instead of making safety inherent to the model, set it behind a safety-oriented discriminator or 'editor' that filters or modifies responses accordingly.
I'd happily take a 2-3x increased API cost for a much more broadly capable and performant model with similar safety characteristics but without the handicaps that come with chat tuning.
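A minimal sketch of that discriminator/'editor' setup, assuming a Hugging Face text-generation pipeline for the base model and a hypothetical safety classifier (the classifier repo id and its label scheme below are placeholders):

    from transformers import pipeline

    # Sketch: leave the pretrained (non-chat) model untouched and bolt safety on
    # as a separate filtering stage instead of baking it in via fine tuning.
    # Both model ids are illustrative; "some-org/safety-classifier" is made up.
    generator = pipeline("text-generation", model="meta-llama/Llama-2-13b-hf")
    safety = pipeline("text-classification", model="some-org/safety-classifier")

    def safe_generate(prompt: str, max_tries: int = 3) -> str:
        for _ in range(max_tries):
            draft = generator(prompt, max_new_tokens=256, do_sample=True)[0]["generated_text"]
            verdict = safety(draft)[0]          # e.g. {"label": "safe", "score": 0.98}
            if verdict["label"] != "unsafe":    # label names depend on the classifier
                return draft
        return "Sorry, I can't help with that."  # refuse after repeated unsafe drafts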
So while a lot of the gains here might be due to the fine tuning, I expect at least part comes from shrugging off the baggage of the chat/safety fine tuning as well. Even in the first detailed example, we can see that while Llama-2 goes off rambling later on, its statement of John's relative knowledge is much more clearly connected from initial conditions to result than Llama-2-chat's, particularly regarding theory of mind (i.e. "he assumed" vs the latter's "it must be in").
by intended on 11/21/2023, 12:56:24 PM
I really really want this to work.
However, at this point benchmark success is about as meaningful as results from someone who has been “taught the test”.
If, say, Merck wanted to use this same model to reason through a logistics issue, or apply it to some business problem at scale, you’d have to deal with hallucinations all over the place.
The best analogy I have right now is that improved results on benchmarks are like better acting from Hugh Laurie as House.
If you want to watch a show - great (generative work)
If you want to get a prescription - then not so much.
by fgfm on 11/21/2023, 10:19:42 AM
Orca 2-13B consistently beats Llama 2-70B on most benchmarks in the 0-shot setting. Hopefully, research papers will start to include Mistral/Zephyr 7B and OpenChat 3.5. Even though they're smaller, they're getting competitive with much larger models and they're much cheaper to orchestrate.
by ple13 on 11/21/2023, 10:59:55 AM
It fails other benchmarks vs Mistral-7b. https://twitter.com/Teknium1/status/1726846755344634020
(There are some doubts about the validity of the comparison in the comments.)
by btbuildem on 11/21/2023, 1:15:06 PM
Are we beginning to see "specialized SLMs"? We've already seen some pretend-agent based solutions (where the same model is given several different roles and made to act as, e.g., CEO / architect / dev / sales in a startup).
I wonder if the way forward is to train smaller models with different sets of "skills" or "neural affinities" - one for reasoning, one for summarization, one for math, one for code, etc. - and then combine them into full-fledged solutions. Perhaps smaller models can be "better" at their specific domains/tasks than the giant generalist models can be at any of them.
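Purely as an illustration of that idea, here's a sketch of a thin router sitting in front of a handful of specialist models; every repo id and the keyword routing below are made-up placeholders, not real checkpoints:

    from transformers import pipeline

    # Hypothetical specialist SLMs behind a crude router. The repo ids are
    # placeholders; a real router would likely be a small classifier rather
    # than keyword matching.
    SPECIALISTS = {
        "code":      pipeline("text-generation", model="example/slm-code-3b"),
        "math":      pipeline("text-generation", model="example/slm-math-3b"),
        "summarize": pipeline("text-generation", model="example/slm-summarize-3b"),
        "general":   pipeline("text-generation", model="example/slm-reasoning-7b"),
    }

    def route(prompt: str) -> str:
        lowered = prompt.lower()
        if "def " in prompt or "```" in prompt:
            return "code"
        if any(word in lowered for word in ("solve", "equation", "integral")):
            return "math"
        if lowered.startswith(("summarize", "tl;dr")):
            return "summarize"
        return "general"

    def answer(prompt: str) -> str:
        specialist = SPECIALISTS[route(prompt)]
        return specialist(prompt, max_new_tokens=256)[0]["generated_text"]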
by amelius on 11/21/2023, 3:59:12 PM
This is why, imho, Microsoft is way cooler than Apple. They have tons of published research. At Apple, even discussing your research with a friend may result in severe punishment.
by yujian on 11/21/2023, 2:24:30 PM
I'm not sure if I'm missing something from the paper, but are multi-billion parameter models getting called "small" language models now? And when did this paradigm shift happen?
by iandanforth on 11/21/2023, 3:22:24 PM
Released under the MS Research License, so non-commercial and not OSI-approved, for the curious.
https://huggingface.co/microsoft/Orca-2-13b/blob/main/LICENS...
by jug on 11/21/2023, 9:32:15 PM
This sounds quite exciting! Like Mistral all over again, only more transparent and open, and with major backing, probably because Microsoft is looking to significantly reduce costs now that it's rolling AI out across its platforms. The approach truly feels like a next step in LLM design.
by Yuvrajs on 11/22/2023, 12:38:32 PM
The official Orca-2 demo is now available on Hugging Face Spaces: https://huggingface.co/spaces/ari9dam/Orca-2-13B
by alecco on 11/21/2023, 1:41:40 PM
> Progressive Learning: We start with LLaMA-2-7B or LLaMA-2-13B checkpoint and finetune it on the train split of FLAN-v2 dataset for one epoch. Note that FLAN-v2 dataset contains both zero-shot and few-shot problems. We then train on 5 million ChatGPT data from Orca 1 for 3 epochs. Then we train on the combination of 1 million GPT-4 data from Orca 1 and Orca 2’s 817K data for 4 epochs.
I think people are missing why they are comparing against Llama-2 13B/70B: they improved Llama-2 7B/13B to reach the level of a 5-10x larger model built on the same base.
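A rough sketch of that staged schedule, just to make the curriculum concrete; finetune_one_stage is a placeholder for a real training run, not Microsoft's actual code:

    # Placeholder for a real training job (e.g. a Hugging Face Trainer run);
    # it only prints each stage instead of actually fine-tuning anything.
    def finetune_one_stage(checkpoint: str, dataset_name: str, epochs: int) -> str:
        print(f"fine-tune {checkpoint} on {dataset_name} for {epochs} epoch(s)")
        return f"{checkpoint} + {dataset_name}"   # pretend this is a new checkpoint

    checkpoint = "meta-llama/Llama-2-13b-hf"      # or the 7B base model
    for dataset_name, epochs in [
        ("FLAN-v2 train split (zero-shot and few-shot)", 1),
        ("5M ChatGPT data from Orca 1",                  3),
        ("1M GPT-4 data from Orca 1 + Orca 2's 817K",    4),
    ]:
        checkpoint = finetune_one_stage(checkpoint, dataset_name, epochs)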
This is huge. Models on HF.
https://huggingface.co/papers/2311.11045
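For anyone wanting to try the released weights directly rather than the Space, a minimal loading sketch with transformers; the ChatML-style prompt template here is an assumption, so check the model card for the exact format and loading flags:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Minimal sketch of running the released 13B checkpoint locally.
    model_id = "microsoft/Orca-2-13b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Assumed ChatML-style template; verify against the model card.
    prompt = (
        "<|im_start|>system\nYou are a careful assistant that reasons step by step.<|im_end|>\n"
        "<|im_start|>user\nHow many weeks are there in three years?<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))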