LLM with Largest Context Window?

Best LLM with large context window?

Q: I've been using GPT-4-32k for a project, and it is extremely expensive, to the point of being unaffordable. I'm looking for an open-source alternative for this use case:

I'm using GPT-4-32k as a "designer": its input is the API/guide docs for an entire custom-built code library (essentially trying to teach GPT-4 my custom JavaScript lib), and its output should be a technical design plus pseudo-code instructions for how to code the user's request using the lib's components.

The "designers" output and technical instruction is input to a "coder" LLM, using GPT4 (normal context window) for the actual coding task.

Unfortunately, I don't think I can use RAG for this (Retrieval-Augmented Generation, which pairs the LLM with a vector database and/or feature store to pull relevant context into prompts), because the user's input may share no close wording with the actual API docs it should be looking up.
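For reference, this is roughly what the standard RAG lookup being ruled out would look like; the embedding model, doc chunks, and query below are illustrative. When the user's phrasing shares little vocabulary or semantics with the relevant API section, the top-scoring chunk can simply be the wrong one:

```python
# Sketch of a plain RAG-style lookup: embed doc chunks, embed the user query, retrieve by
# cosine similarity. Assumes sentence-transformers; chunks and query are made up for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

doc_chunks = [
    "GridLayout.addPanel(panel, row, col) attaches a Panel to the layout grid.",
    "DataFeed.subscribe(symbol, callback) streams price ticks to the callback.",
    "Theme.apply(widget, palette) recolors a widget with the given palette.",
]
chunk_vecs = model.encode(doc_chunks, convert_to_tensor=True)

query = "I want live stock numbers to show up on the screen"  # little overlap with API wording
query_vec = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vec, chunk_vecs)[0]
best = int(scores.argmax())
print(f"best match (score {float(scores[best]):.2f}): {doc_chunks[best]}")
# If the top score is low or the wrong chunk wins, the retrieved context misleads the coder,
# which is why the model itself would need to pick the relevant components.
```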

The model itself would need to decide which components to describe to the coder.

tl;dr: Is there a good "extraction" open-source LLM I can use today that has a large context window (the lib's docs are hitting about 18k tokens right now)? For this task, it would need to comprehend all of the information in the prompt and extract only the info needed to create a good design.

Q: Has anyone tried CodeLlama's large context window? How are the hallucinations?

A: CodeLlama has a base context of 16k but can go up to 100k. I tested it on my M1 Ultra, sending it 53,000 tokens of JavaScript code and asking it to summarize. It took about 10 minutes, but it actually summarized the code FANTASTICALLY, with no hallucinations that I could see. I was extremely impressed, and I was actually happy with that time to respond; I'm not sure I would have read 5,000 lines of JavaScript and summarized it much faster.
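A rough sketch of how a long-context run like that can be reproduced locally with llama-cpp-python on Apple Silicon; the GGUF filename, context size, and prompt format are assumptions, so adapt them to your own files:

```python
# Rough sketch of the long-context summarization run described above, using llama-cpp-python
# with full Metal offload. The GGUF path and n_ctx are illustrative; recent llama.cpp builds
# pick up CodeLlama's rope scaling from the GGUF metadata.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-34b-instruct.Q5_K_M.gguf",  # assumed local file
    n_ctx=60_000,      # room for ~53k tokens of code plus the summary
    n_gpu_layers=-1,   # offload every layer to the GPU / unified memory
    verbose=False,
)

js_source = open("bundle.js").read()

prompt = (
    "[INST] Summarize what the following JavaScript code does, module by module.\n\n"
    f"{js_source}\n[/INST]"
)

out = llm(prompt, max_tokens=1024, temperature=0.2)
print(out["choices"][0]["text"])
```

Most of the wall-clock time in a run like this is prompt ingestion rather than generation.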

Q: Which size of CodeLlama did you use for this?

A: I have been testing Phind v2, producing some parts of an application based on existing code. With 48 GB of VRAM I got to a 30k context at less than 1 token/s. I think it's good enough to be worth setting up in the cloud, but I don't think it will be cheaper than GPT-4. (Note: Phind is a code-generation model.)

Q: Do these open-source LLMs work on cloud VMs with multiple GPUs that add up to more than 48 GB of VRAM, or only on a single GPU?

Q: How did you get access to GPT-4-32k?

A: You pay for it on your OpenAI account.

I've tried CodeLlama 34B up to about 20k tokens. It remained coherent, although prompt ingestion took a few minutes and token generation speed dropped about 5x. I haven't noticed hallucinations getting any worse.

I'm not sure I understand the rationale behind the two-stage LLM pipeline. You'd probably be better off using GPT-4 to prepare a CoT (chain-of-thought) fine-tuning dataset based on your lib docs and then using it to fine-tune CodeLlama.
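One hedged way to read that suggestion: have GPT-4 expand each doc section into a few (request, reasoning, pseudo-code) records and fine-tune CodeLlama on the resulting JSONL. The prompt wording, chunking, and record schema below are assumptions, not a fixed recipe:

```python
# Hedged sketch of generating a chain-of-thought fine-tuning dataset from library docs.
import json
from openai import OpenAI

client = OpenAI()

def make_cot_examples(doc_section: str, n: int = 3) -> list[dict]:
    prompt = (
        "Here is one section of the documentation for a custom JavaScript library:\n\n"
        f"{doc_section}\n\n"
        f"Invent {n} realistic user requests this section can satisfy. Answer with a JSON list of "
        'objects having keys "request", "reasoning" (step by step: which components to use and why) '
        'and "pseudo_code". Return the JSON list only, no prose.'
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# Write instruction-tuning records; the reasoning goes into the target so the fine-tuned
# CodeLlama learns to "think out loud" about the library before emitting pseudo-code.
with open("cot_dataset.jsonl", "w") as f:
    for section in open("my_lib_docs.md").read().split("\n## "):  # naive per-heading chunking
        for ex in make_cot_examples(section):
            record = {
                "instruction": ex["request"],
                "output": ex["reasoning"] + "\n\n" + ex["pseudo_code"],
            }
            f.write(json.dumps(record) + "\n")
```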

> CoT fine-tuning dataset based on your lib docs and then use it to fine-tune CodeLlama.

So just use one LLM to do everything? I agree; I think the two-stage pipeline idea came from me trying to find a way to save on tokens output by GPT-4-32k, but the coder would need all the context the first LLM had on the documentation/usage examples, so there isn't much improvement.

I've been doing some research, and from what I see, fine-tuning doesn't work for introducing new knowledge, only for changing the style of the output. It might also be that I'm not sure exactly how chain-of-thought prompting would work for code docs.

Q: Which size of CodeLlama did you use for this?

A: If I remember correctly, I used the CodeLlama-34b-Instruct q5_K_M GGUF for this. My Mac Studio has 128 GB of RAM, with 98 GB of that allocated as VRAM working space. The q5_K_M weights took about 27-30 GB of it, leaving the other ~68 GB to hold context. My eval speed was not terrible, so I don't think I spilled into system RAM at all, and I believe the whole 53k tokens fit into that 68 GB of VRAM.
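For a rough sense of why that much context fits: assuming CodeLlama-34B's published architecture (48 layers, grouped-query attention with 8 KV heads, head dimension 128) and the fp16 KV cache that llama.cpp uses by default, the cache for 53k tokens works out to only about 10 GiB:

```python
# Back-of-the-envelope KV-cache estimate; the architecture numbers are assumptions taken
# from the CodeLlama-34B config, and the cache is assumed to be fp16 (llama.cpp default).
n_layers     = 48
n_kv_heads   = 8
head_dim     = 128
bytes_per_el = 2          # fp16
n_tokens     = 53_000

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * n_tokens  # K and V
print(f"KV cache ~= {kv_bytes / 2**30:.1f} GiB for {n_tokens} tokens")
# -> roughly 10 GiB, so weights (~30 GB) plus cache (~10 GB) sit well inside the 98 GB
#    of unified memory made available to the GPU.
```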

Sorry, what do you mean by 98 GB of RAM allocated to VRAM? How does that work, and what does it do? Please elaborate.
