• by TheJCDenton on 10/3/2023, 8:10:46 PM

    The demo [3] seems very promising.

    "Their method cleverly exploits the LLMs' tendency to use initial tokens as "attention sinks" to anchor the distribution of attention scores. By caching initial tokens alongside recent ones, StreamingLLM restored perplexity and achieved up to 22x faster decoding than prior techniques." [1]

    "We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more." [2]

    "we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment." [2]

    "StreamingLLM achieves an impressive speedup, reaching up to 22.2× per token. Despite its reduced latency, StreamingLLM sustains a memory footprint consistent with the re-computation baseline." [2]

    [1] https://notes.aimodels.fyi/llm-infinite-context-window-strea...

    [2] https://arxiv.org/pdf/2309.17453.pdf

    [3] https://github.com/mit-han-lab/streaming-llm
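
    Roughly, the caching policy they describe amounts to keeping the first few "sink" tokens plus a rolling window of the most recent tokens, and evicting everything in between. A minimal sketch (not the authors' code; the cache layout, parameter names, and defaults here are assumptions, and the real method also reassigns positions within the rolling cache):

       # Rough sketch of the "attention sink" eviction policy described in [2].
       # Assumes a per-layer (key, value) cache shaped [batch, heads, seq, dim].
       import torch

       def evict(past_key_values, n_sink=4, window=1020):
           new_past = []
           for k, v in past_key_values:           # one (key, value) pair per layer
               seq_len = k.size(2)
               if seq_len <= n_sink + window:     # nothing to drop yet
                   new_past.append((k, v))
                   continue
               # keep the initial sink tokens plus the most recent window
               k = torch.cat([k[:, :, :n_sink], k[:, :, -window:]], dim=2)
               v = torch.cat([v[:, :, :n_sink], v[:, :, -window:]], dim=2)
               new_past.append((k, v))
           return new_past
    
    In practice you would run something like this after each decoding step, so the cache size stays bounded no matter how many tokens have been generated.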

  • by Palmik on 10/4/2023, 9:48:24 AM

    I see a lot of people misunderstanding what this is about. What this allows is incrementally updating the attention cache. It does not allow the model to see beyond its usual attention window. As they explain in the README, only the tokens that fit into the usual window are considered -- so if you ask a question about a long book, it will only consider the last pages.

    But it can still be useful. Imagine this use case, where you have a chat conversation between Assistant and User. Assume that the inputs to get the next assistant response are just the past conversation turns (cut off to fit the context window).

    So for turn 1 the input is:

       User: (user turn 1)
    
    For turn 2 the input is:

       User: (user turn 1)
       Assistant: (assistant turn 1)
       User: (user turn 2)
    
    Etc.

    Now, what this allows you to do is reuse the attention computed from the previous turns (since the prefix is the same).
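
    Concretely, with a Hugging Face-style decoder you could feed only the new turn's tokens and pass the previous cache back in. A rough sketch (hypothetical helper, not code from the repo; the model name is just an example):

       # Rough sketch of reusing the attention cache across chat turns when the
       # new prompt strictly extends the previous one. Hypothetical helper, not
       # code from the StreamingLLM repo.
       import torch
       from transformers import AutoModelForCausalLM, AutoTokenizer

       tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
       model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
       cached_ids, cached_kv = None, None

       def extend_cache(full_prompt):
           """Run the model only on the part of the prompt we have not seen yet."""
           global cached_ids, cached_kv
           ids = tok(full_prompt, return_tensors="pt").input_ids
           prefix_unchanged = cached_ids is not None and torch.equal(
               ids[:, : cached_ids.size(1)], cached_ids)
           if prefix_unchanged:
               new_ids = ids[:, cached_ids.size(1):]   # only the new turn's tokens
           else:
               new_ids, cached_kv = ids, None          # prefix changed: recompute
           assert new_ids.size(1) > 0, "prompt must add new tokens"
           out = model(new_ids, past_key_values=cached_kv, use_cache=True)
           cached_ids, cached_kv = ids, out.past_key_values
           return out
    
    Each call then only pays for the tokens added since the last turn, instead of re-encoding the whole conversation.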

    In practice, people often have a system prompt before the conversation history, which (as far as I can tell) makes this technique not applicable: the input prefix will change as soon as the conversation history is long enough that we need to start dropping the oldest turns; otherwise the system prompt would get ignored.

    In that case, what you could do is cache at least the system prompt. This is also possible with https://github.com/OpenNMT/CTranslate2/blob/2203ad5c8baf878a...

  • by kirill5pol on 10/4/2023, 4:47:30 AM

    Figure 3 shows that Falcon and Pythia are much less susceptible to the lack of an attention sink than Llama or MPT… it seems they work almost as well with just naïve windowed attention.

  • by CapsAdmin on 10/4/2023, 9:57:06 AM

    As far as I understand from their FAQ on GitHub, it's more about reducing latency for a more instant response.

    Would be interesting to see an application of this where you can have a more fluid conversation, with the ability to interrupt each other mid-sentence. I suppose this would require retraining or fine-tuning on transcribed natural vocal conversations between two people. It would probably also require a different structure than the current chat-based methods.

  • by avereveard on 10/3/2023, 11:05:30 PM

    Yeah "by conveniently dropping token out of attention we can have infinite tokens" not exactly a breakthrough