
llamafile has an HTTP server mode with an OpenAI API compatible completions endpoint. But Emacs Copilot doesn't use it. The issue with using the API server is that it currently can't stream the output tokens as they're generated. That prevents you from pressing ctrl-g to interactively interrupt it if it goes off the rails or you don't like the output. It's much better to just be able to run it as a subcommand. Then all you have to do is pay a few grand for a better PC. No network sysadmin toil required. Seriously, do this. Even with a $1500 three-year-old no-GPU HP desktop pro, WizardCoder 13b (or especially Phi-2) is surprisingly quick off the mark.
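The subcommand approach is easy to sketch: spawn the model binary, read its stdout incrementally, and kill the process the moment the user interrupts. Here's a minimal Python version of that pattern (the helper name is mine, and the llamafile flags in the usage note are assumptions; check `--help` for your build):

```python
import subprocess

def stream_tokens(argv):
    """Run argv as a subprocess and yield its stdout incrementally.

    The caller can stop iterating at any point (the ctrl-g case);
    the finally clause then kills the child, which is the whole
    interruption story -- no HTTP connection to tear down.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(1)  # one byte at a time, so nothing sits in a buffer
            if not chunk:
                break
            yield chunk.decode("utf-8", errors="replace")
    finally:
        proc.kill()
        proc.wait()
```

Usage would look something like `stream_tokens(["./wizardcoder.llamafile", "-p", prompt, "--silent-prompt"])`, consuming tokens from the generator as they arrive.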


Hi, I haven't tried this myself, but it seems there's a way? https://github.com/ggerganov/llama.cpp/blob/master/examples/...

The call takes a `stream` boolean: "It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to true."

And the response includes a `stop` field: "Boolean for use with stream to check whether the generation has stopped (Note: This is not related to stopping words array stop from input options)."

In any case, the local web interface has a stop button, and I'm pretty sure that one worked.

But maybe I'm misunderstanding the challenge here?


You're right: the llama-cpp-python OpenAI-compatible endpoint works with `stream:true`, and you can interrupt generation at any time simply by closing the connection.

I use this in a private fork of Chatbot-UI, and it just works.
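The "close the connection to interrupt" idea is easy to express as code. The sketch below accumulates text from OpenAI-style streaming lines (`data: {"choices": [{"text": ...}]}` chunks, terminated by `data: [DONE]`) and bails out early on a caller-supplied condition; in a real client, returning early is what closes the HTTP connection and halts generation server-side. The helper name and stop-condition callback are mine:

```python
import json

def collect_until(lines, should_stop):
    """Accumulate streamed completion text from OpenAI-style SSE lines.

    Stops early once should_stop(text) returns True -- in a live
    client, breaking out of the read loop closes the connection,
    which is what actually interrupts generation on the server.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        text += chunk["choices"][0]["text"]
        if should_stop(text):
            break  # caller closes the connection here
    return text
```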



