Artificial Intelligence (AI) has exploded since the public release of ChatGPT. Bringing the power of AI to the masses at a low cost has changed the way many of us work and interact with the world. However, there are concerns around the privacy and security of using online, privately owned large language models (LLMs): data privacy and security risks, the potential for LLMs to memorize sensitive information, ethical considerations in data usage, regulatory compliance challenges, and the need for informed consent and transparency.
The alternative to using online services like OpenAI’s ChatGPT and Google’s Gemini is to run a large language model locally, but for the average person, this is a non-starter as there is a considerable technical barrier to entry.
Well, not anymore.
Today we’ll explore how to easily set up a highly capable, locally running, “talk to anything with any LLM” application with Retrieval Augmented Generation (RAG) capability using just your laptop or desktop. By leveraging two powerful tools, LMStudio and AnythingLLM, you can create a comprehensive LLM experience without writing any code.
Disclaimer: I am not affiliated with either LMStudio or AnythingLLM, and the information provided in this blog post is for informational purposes only. Please assess whether this guide suits your needs before proceeding.
LMStudio
LMStudio is a user-friendly desktop application for experimenting with local, open-source large language models (LLMs), letting individuals leverage the power of LLMs on their own machines without relying on cloud-based services.
One of LMStudio’s key advantages is its simplicity, which makes it accessible to users with varying levels of technical expertise: it’s a one-stop shop for downloading, configuring, and running a wide range of ggml-compatible models from Hugging Face.
Because your data remains local and under your control, LMStudio is an ideal choice for users who value data privacy and want to build and experiment with AI models without exposing sensitive information to third-party services.
To install LMStudio, simply head over to lmstudio.ai and download the right version for your operating system and hardware. If your system has an RX 6000 series or newer AMD GPU and you want to utilise GPU acceleration, at the time of writing you will need to download the technology preview version of LMStudio that supports AMD ROCm.
LMStudio doesn’t come pre-packaged with a language model, but once LMStudio is installed, the home screen presents you with an interface to download a language model from Hugging Face. You can choose one of the well-known models such as Llama 2 or the newer Llama 3, Mistral, Google Gemma, Microsoft Phi-2, Alibaba’s Qwen, and more. Which model you download depends on what you want to achieve, but for basic chat, any of the popular chat models that can run on your local hardware will suit your needs.
Once you have a model downloaded, click on the chat icon in the menu bar on the left and load the model. In this example, I’m running the Meta Llama 3 Instruct model with 8 billion parameters and 8-bit quantization; I’ll explain all of this later.
To load the model, simply click the big blue Select a model to load button at the top of the window, and you’re good to go. Performance will depend on your hardware, but it’s exciting having a local chat at your fingertips all the same.
But what if you want to chat with documents? That’s where AnythingLLM comes in.
AnythingLLM
AnythingLLM is a desktop application that allows you to chat with anything. It can connect to various LLM providers such as OpenAI, Anthropic, and Mistral and, most importantly for our local setup, to local LLM platforms such as LMStudio.
When you first run AnythingLLM, you will be asked to select your LLM, your embedding model and your vector database. To keep things incredibly simple, all you need to do here is select LMStudio as your LLM and leave the embedding model and vector database at their defaults.
When selecting LMStudio as your LLM, you will be prompted to input the URL and the token context window for LMStudio.
You can find this information in LMStudio: first, click on the folder icon in the menu bar to browse your downloaded models and open the model inspector, which shows information about the model you want to use. In this case, our Llama 3 model has a context window of 8192 tokens.
Next, to find the server URL, start the local server within LMStudio. Once the server loads, it will display the URL in the server log, which by default is http://localhost:1234
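If you do want to go beyond the point-and-click workflow, LMStudio’s local server speaks an OpenAI-compatible API, so any OpenAI-style client can talk to it at that URL. Here is a minimal sketch using only the Python standard library; the helper names (`build_request`, `ask`) and the default server URL are assumptions for illustration — adjust them to whatever your own server log shows.

```python
import json
import urllib.request

# Default LMStudio server endpoint (OpenAI-compatible chat completions).
# Change this if your server log shows a different address or port.
SERVER_URL = "http://localhost:1234/v1/chat/completions"

def build_request(prompt, temperature=0.7):
    """Build an OpenAI-style chat completion payload."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }

def ask(prompt):
    """Send the prompt to the local LMStudio server and return the reply text."""
    data = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

With the server running, calling `ask("What is RAG?")` returns the loaded model’s reply, which is exactly what AnythingLLM does behind the scenes.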
Once you have those values configured in AnythingLLM, you’ll be asked to configure a workspace. The workspace is where you can upload documents or connect to webpages and GitHub repositories. It’s also where you will chat with your documents while LMStudio is in server mode.
From here it’s time to start chatting. I have found that the quality of responses is highly dependent on the quality and clarity of the document that you’re referencing. There are also minor adjustments you can make to how AnythingLLM ingests the dataset to fine-tune text matching capabilities.
The way document retrieval works is that the text in the document is split into smaller chunks, each of which is converted into a numerical vector (an embedding) and stored in a vector database. When you ask the LLM a question, the question is embedded in the same way, the vector database is searched for chunks similar to it, and the matching text is supplied as context for the LLM’s response.
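The mechanics above can be sketched in a few lines of Python. Real systems use neural embedding models; here plain word-count vectors stand in for embeddings, which is enough to show how similarity search picks the relevant chunk. All function names (`embed`, `cosine`, `retrieve`) are illustrative, not part of AnythingLLM.

```python
import math
from collections import Counter

def embed(text):
    """Turn text into a word-count vector (a toy stand-in for a real embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "Kaizen means continuous improvement of processes.",
    "The server URL is shown in the LMStudio log.",
    "Chunk overlap keeps context across chunk boundaries.",
]
print(retrieve("What does kaizen mean?", chunks))  # the kaizen chunk ranks first
```

The retrieved chunk, not the whole document, is what gets pasted into the model’s context window, which is why chunking quality matters so much.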
As an example, if I load The Toyota Way by Jeffrey Liker into AnythingLLM and ask questions about lean management concepts, the responses generated as a result are comprehensive and accurate.
On the other hand, if we load a document comprised of short clauses and a large number of cross-references, the retrieval ability and resulting accuracy of the responses with the default settings are hit and miss.
I have also found that an increasing number of documents are oddly formatted when printed to PDF, such as this example from the Plumbing and Drainage standard, AS3500.1:2021.
I’m not sure of the cause of these formatting anomalies, but while we humans can make sense of the jumbled letters extracted from the PDF, for our vector database and language model the PDF content becomes an irrelevant mess.
To improve the responses, consider fine-tuning the following:
- Remove superfluous information from the document and re-upload it. Strip contents pages, bibliographies and other text that does not contain relevant content, and improve the PDF formatting and compatibility where possible.
- Adjust the text chunking and chunk overlap. Text chunking is a method used in natural language processing that divides text into smaller segments, usually based on grammatical meaning and phrases. Adjusting these values can improve your responses.
- Consider adjusting the number of context snippets and the similarity thresholds of the vector database in your workspace.
- Change your prompt. As simple as it sounds, adjusting your prompt and being more direct can result in more accurate responses.
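To make the chunking and overlap settings concrete, here is a minimal word-based chunker. The function name `chunk_text` and the word-based sizing are illustrative; AnythingLLM’s actual settings operate on characters/tokens, but the overlap principle is the same: the tail of each chunk is repeated at the start of the next so that context is not lost at chunk boundaries.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks, repeating `overlap` words
    from the end of each chunk at the start of the next."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks
```

Larger overlaps cost storage and retrieval time but help when a sentence or clause straddles a chunk boundary, which is exactly the failure mode seen with heavily cross-referenced documents.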
This is of course not the only method for configuring a local LLM, but it is arguably the quickest. Including the download of a language model, you can create a powerful, private, and customizable LLM experience in under an hour. Local LLM usage is becoming more accessible thanks to tools like LMStudio, Ollama, Jan and LocalAI. Paired with a user-friendly interface like AnythingLLM, you can build out powerful local LLM setups without relying on paid services. The model you choose will ultimately determine your chatting experience, so be sure to select one that aligns with your needs.