by KuriousCat on 12/15/2024, 8:02:47 AM with 2 comments
How do LLM servers handle contextual data? Is the context passed as a prefix to a stateless machine? (That would mean a lot of tokens have to be reprocessed during a session.) Or is a separate LLM instance created and maintained for an active session? (Expensive and inefficient.)
the session is tied to a GPU cluster. It would actually be quite inefficient to switch from one GPU cluster to another mid-session, but it's needed in a failure scenario
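In practice, many serving stacks split the difference: requests are stateless (the full context is resent), but the server keeps a KV cache keyed by token prefix, so a follow-up request sharing a prefix skips recomputing those tokens. Below is a toy sketch of that prefix-caching idea; the dict-based cache and the `("kv", token)` placeholder state are illustrative stand-ins, not any real server's implementation:

```python
import hashlib

class PrefixCache:
    """Toy illustration of prefix (KV) caching: store the attention
    state computed for a token prefix so a later request that shares
    the prefix only processes its new suffix."""

    def __init__(self):
        self._cache = {}  # prefix hash -> simulated KV state

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Return (kv_state, tokens_computed): find the longest cached
        prefix of `tokens` and 'compute' only the remaining suffix."""
        hit_len, state = 0, []
        for i in range(len(tokens), 0, -1):
            key = self._key(tokens[:i])
            if key in self._cache:
                hit_len = i
                state = list(self._cache[key])
                break
        # Stand-in for the real per-token attention computation.
        for tok in tokens[hit_len:]:
            state.append(("kv", tok))
        self._cache[self._key(tokens)] = list(state)
        return state, len(tokens) - hit_len

cache = PrefixCache()
_, n1 = cache.process(["system", "hello"])           # cold: computes 2 tokens
_, n2 = cache.process(["system", "hello", "world"])  # warm: computes only 1
```

This is also why session-to-cluster affinity matters: the cached KV state lives in GPU memory on a particular cluster, so routing the next turn elsewhere forces a full recompute of the prefix.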