When I started experimenting with AI integrations, I wanted to create a chat assistant on my website, something that could talk like GPT-4, reason like Claude, and even joke like Grok.
But OpenAI, Anthropic, Google, and xAI all require API keys. That meant setting up an account on each platform and upgrading to a paid plan before I could start coding, because most of these LLM providers require a paid plan for API access. On top of that, I would need to cover API usage billing for each LLM platform.
What if I told you there's an easier way to start integrating AI into your websites and mobile applications, without requiring any API keys at all? Sounds exciting? Let me share how I did exactly that.
It's all thanks to Puter.js, an open source JavaScript library that lets you use cloud features like AI models, storage, databases, and user auth, all from the client side. No servers, no API keys, no backend setup needed. What else can you ask for as a developer?
Puter.js is built around Puter’s decentralized cloud platform, which handles all the stuff like key management, routing, usage limits, and billing. Everything’s abstracted away so cleanly that, from your side, it feels like authentication, AI, and LLM just live in your browser.
Enough talking, let’s see how you can add GPT-5 integration to your web application in fewer than 10 lines.
<html>
<body>
<script src="https://js.puter.com/v2/"></script>
<script>
puter.ai.chat(`What is puter js?`, {
model: 'gpt-5-nano',
}).then(puter.print);
</script>
</body>
</html>
Yes, that’s it. Unbelievable, right? Save the HTML code into an index.html file and place it in a new, empty directory. Open a terminal, switch to the directory where the index.html file is located, and serve it on localhost with this Python command:
python -m http.server
Then open http://localhost:8000 in your web browser. Click the Puter.js “Continue” button when prompted.

🚧 It may take some time before you see a response from ChatGPT. Until then, you'll see a blank page.

You can explore a lot of examples and get an idea of what Puter.js does for you on its playground.
Let’s modify the code to make it more interesting this time. It takes a user query and streams responses from three different LLMs so that users can decide which of the three gives the best result.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Model Comparison</title>
<script src="https://cdn.twind.style"></script>
<script src="https://js.puter.com/v2/"></script>
</head>
<body class="bg-gray-900 min-h-screen p-6">
<div class="max-w-7xl mx-auto">
<h1 class="text-3xl font-bold text-white mb-6 text-center">AI Model Comparison</h1>
<div class="mb-6">
<label for="queryInput" class="block text-white mb-2 font-medium">Enter your query:</label>
<div class="flex gap-2">
<input
type="text"
id="queryInput"
class="flex-1 px-4 py-3 rounded-lg bg-gray-800 text-white border border-gray-700 focus:outline-none focus:border-blue-500"
placeholder="Write a detailed essay on the impact of artificial intelligence on society"
value="Write a detailed essay on the impact of artificial intelligence on society"
/>
<button
id="submitBtn"
class="px-6 py-3 bg-blue-600 hover:bg-blue-700 text-white rounded-lg font-medium transition-colors"
>
Generate
</button>
</div>
</div>
<div class="grid grid-cols-1 md:grid-cols-3 gap-4">
<div class="bg-gray-800 rounded-lg p-4">
<h2 class="text-xl font-semibold text-blue-400 mb-3">Claude Opus 4</h2>
<div id="output1" class="text-gray-300 text-sm leading-relaxed h-96 overflow-y-auto whitespace-pre-wrap"></div>
</div>
<div class="bg-gray-800 rounded-lg p-4">
<h2 class="text-xl font-semibold text-green-400 mb-3">Claude Sonnet 4</h2>
<div id="output2" class="text-gray-300 text-sm leading-relaxed h-96 overflow-y-auto whitespace-pre-wrap"></div>
</div>
<div class="bg-gray-800 rounded-lg p-4">
<h2 class="text-xl font-semibold text-purple-400 mb-3">Gemini 2.0 Flash-Lite</h2>
<div id="output3" class="text-gray-300 text-sm leading-relaxed h-96 overflow-y-auto whitespace-pre-wrap"></div>
</div>
</div>
</div>
<script>
const queryInput = document.getElementById('queryInput');
const submitBtn = document.getElementById('submitBtn');
const output1 = document.getElementById('output1');
const output2 = document.getElementById('output2');
const output3 = document.getElementById('output3');
async function generateResponse(query, model, outputElement) {
outputElement.textContent = 'Loading...';
try {
const response = await puter.ai.chat(query, {
model: model,
stream: true
});
outputElement.textContent = '';
for await (const part of response) {
if (part?.text) {
outputElement.textContent += part.text;
outputElement.scrollTop = outputElement.scrollHeight;
}
}
} catch (error) {
outputElement.textContent = `Error: ${error.message}`;
}
}
async function handleSubmit() {
const query = queryInput.value.trim();
if (!query) {
alert('Please enter a query');
return;
}
submitBtn.disabled = true;
submitBtn.textContent = 'Generating...';
submitBtn.classList.add('opacity-50', 'cursor-not-allowed');
await Promise.all([
generateResponse(query, 'claude-opus-4', output1),
generateResponse(query, 'claude-sonnet-4', output2),
generateResponse(query, 'google/gemini-2.0-flash-lite-001', output3)
]);
submitBtn.disabled = false;
submitBtn.textContent = 'Generate';
submitBtn.classList.remove('opacity-50', 'cursor-not-allowed');
}
submitBtn.addEventListener('click', handleSubmit);
queryInput.addEventListener('keypress', (e) => {
if (e.key === 'Enter' && !submitBtn.disabled) {
handleSubmit();
}
});
</script>
</body>
</html>
Save the above code in the index.html file as we did in the previous example, and then run the server with Python. This is what it looks like now on localhost.

And here is a sample response from all three models for the query "What is It's FOSS".

Looks like It's FOSS is well trusted by humans as well as AI 😉
That’s not bad! Without requiring any API keys, you can do this crazy stuff.
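The streaming loop in the comparison page can be tried without a Puter account by swapping the real stream for a stand-in. The sketch below is only an illustration, assuming each streamed chunk is an object with an optional `text` field, as the page's `part?.text` check implies. `fakeStream` and `collectStream` are hypothetical names and not part of Puter.js.

```javascript
// Hypothetical stand-in for a streamed response: yields chunks shaped like { text }.
async function* fakeStream(chunks) {
  for (const text of chunks) {
    yield { text };
  }
}

// Consume any async-iterable stream, calling onChunk with the text received so far.
async function collectStream(stream, onChunk = () => {}) {
  let full = '';
  for await (const part of stream) {
    if (part?.text) { // skip chunks that carry no text
      full += part.text;
      onChunk(full);
    }
  }
  return full;
}

// Usage: log the growing answer, much like the page appends to textContent.
collectStream(fakeStream(['Puter.js ', 'streams ', 'text.']), (soFar) => {
  console.log(soFar);
});
```

Because `collectStream` accepts any async iterable, the same function should work with a real streamed response wherever the chunks have that shape.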
Puter.js follows a “user pays” model, which means it’s completely free for developers: your application’s users spend credits from their own Puter accounts for the cloud features they use, such as storage and LLMs. I reached out to the team to understand their pricing structure, but at the moment they are still working on a pricing plan.
This new Puter.js library is superbly underrated. I’m still amazed by how easy it has made LLM integration. Besides that, you can use the Puter.js SDK for authentication and storage, much like Firebase.
Do check out this wonderful open source JavaScript library and explore what else you can build with it.
My interest in running AI models locally started as a side project, driven partly by curiosity and partly by irritation with cloud limits. There’s something satisfying about running everything on your own box. No API quotas, no censorship, no signups. That’s what pulled me toward local inference.
My setup, an AMD GPU on Windows, turned out to be the worst combination for most local AI stacks.
The majority of AI stacks assume NVIDIA + CUDA, and if you don’t have that, you’re basically on your own. ROCm, AMD’s so-called CUDA alternative, doesn’t even work on Windows, and even on Linux, it’s not straightforward. You end up stuck with CPU-only inference or inconsistent OpenCL backends that feel like a decade behind.
I started with the usual tools, i.e., Ollama and LM Studio. Both deserve credit for making local AI look plug-and-play. I tried LM Studio first. But soon after, I discovered how LM Studio hijacks my taskbar. I frequently jump from one application window to another using the mouse, and it was getting annoying for me. Another thing that annoyed me is its installer size of 528 MB.
I’m a big advocate for keeping things minimal yet functional. I’m a big admirer of a functional text editor that fits under 1 MB (Dred), a reactive JavaScript library and React alternative that fits under 1KB (Van JS), and a game engine that fits under 100 MB (Godot).
Then I tried Ollama. Being a CLI user (even on Windows), I was impressed with Ollama. I don’t need to spin up an Electron JS application (LM Studio) to run an AI model locally.
With just two commands, you can run any AI model locally with Ollama.
ollama pull tinyllama
ollama run tinyllama
But once I started testing different AI models, I needed to reclaim disk space. My initial approach was to delete the models manually from File Explorer. I was a bit paranoid! But soon, I discovered these Ollama commands:
ollama rm tinyllama #remove the model
ollama ls #lists all models
When I checked how lightweight Ollama really is, it came close to 4.6 GB on my Windows system, although you can delete unnecessary files to slim it down (it comes bundled with libraries like rocm, cuda_v13, and cuda_v12).
After trying Ollama, I was curious: does LM Studio even provide a CLI? After some research, I found that it does offer a command line interface. I investigated further and found out that LM Studio uses Llama.cpp under the hood.
With these two commands, I can run LM Studio via CLI and chat to an AI model while staying in the terminal:
lms load <model name> #Load the model
lms chat #starts the interactive chat
I was generally satisfied with the LM Studio CLI at this point. I also noticed that it came with Vulkan support out of the box. I had been looking to add Vulkan support to Ollama, and the approach I discovered was to compile Ollama from source and enable Vulkan support manually. That’s a real hassle!

I just had three additional complaints at this point. First, every time I needed to use the LM Studio CLI (lms), it took some time to wake up its Windows service. Second, the CLI is not feature-rich; it does not even provide a way to delete a model. Third, it takes two steps: load the model first, and then chat.
After the chat is over, you need to manually unload the model. This mental model doesn’t make sense to me.
That’s where I started looking for something more open, something that actually respected the hardware I had. That’s when I stumbled onto Llama.cpp, with its Vulkan backend and refreshingly simple approach.
Head over to its GitHub releases page and download the latest release for your platform.
Extract the downloaded zip file and, optionally, move the directory to where you usually keep your binaries, like /usr/local/bin on macOS and Linux. On Windows 10, I usually keep it under %USERPROFILE%\.local\bin.
Now, you need to add its directory location to the PATH environment variable.
On Linux and macOS (replace path-to-llama-cpp-directory with your exact directory location):
export PATH="$PATH:<path-to-llama-cpp-directory>"
On Windows 10 and Windows 11:
setx PATH "%PATH%;<path-to-llama-cpp-directory>"
Now, Llama.cpp is ready to use.
Just grab a .gguf file, point to it, and run. It reminded me why I love tinkering on Linux in the first place: fewer black boxes, more freedom to make things work your way.
With just one command, you can start a chat session with Llama.cpp:
llama-cli.exe -m e:\models\Qwen3-8B-Q4_K_M.gguf --interactive
If you carefully read its verbose message, it clearly shows signs of GPU being utilized:

With llama-server, you can even download AI models from Hugging Face, like:
llama-server -hf itlwas/Phi-4-mini-instruct-Q4_K_M-GGUF:Q4_K_M
The -hf flag tells llama-server to download the model from the Hugging Face repository.
You even get a web UI with Llama.cpp. For example, run the model with this command:
llama-server -m e:\models\Qwen3-8B-Q4_K_M.gguf --port 8080 --host 127.0.0.1
This starts a web UI on http://127.0.0.1:8080 and also lets other applications send API requests to Llama.cpp.

Let’s send an API request via curl:
curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d "{\"prompt\":\"Explain the difference between OpenCL and SYCL in short.\",\"temperature\":0.7,\"max_tokens\":128}"
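The same request can also be sent from JavaScript. This is a minimal sketch, assuming Node.js 18+ or a browser (both provide a global `fetch`) and assuming the server's JSON response carries the generated text in a `content` field. `buildCompletionBody` and `complete` are illustrative names, not part of Llama.cpp.

```javascript
// Build the same JSON body that the curl command sends.
function buildCompletionBody(prompt, temperature = 0.7, maxTokens = 128) {
  return JSON.stringify({ prompt, temperature, max_tokens: maxTokens });
}

// POST the prompt to a running llama-server and return the generated text.
async function complete(prompt, baseUrl = 'http://127.0.0.1:8080') {
  const response = await fetch(`${baseUrl}/completion`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: buildCompletionBody(prompt),
  });
  if (!response.ok) {
    throw new Error(`llama-server returned HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.content; // assumption: the generated text is in "content"
}

// Usage (needs llama-server running on port 8080):
// complete('Explain the difference between OpenCL and SYCL in short.')
//   .then(console.log);
```

Keeping the body builder separate from the network call makes it easy to check the payload without a server running.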
What am I losing by using Llama.cpp? Nothing. Like Ollama, it gives me a feature-rich CLI, plus Vulkan support. It all comes in under 90 MB on my Windows 10 system.
Now, I don’t see the point of using Ollama or LM Studio. I can download any model with llama-server, run it directly with llama-cli, and even interact with it through the web UI and API requests.
I’m hoping to benchmark how AI inference on Vulkan performs compared to pure CPU and SYCL implementations in a future post. Until then, keep exploring AI tools and the ecosystem to make your life easier. Use AI to your advantage rather than getting stuck in endless debates over questions like, will AI take our jobs?
Like it or not, AI is here to stay. For those who are concerned about data privacy, there are several local AI options available. Tools like Ollama and LM Studio make things easier.
Now those options are for the desktop user and require significant computing power.
What if you want to use the local AI on your smartphone? Sure, one way would be to deploy Ollama with a web GUI on your server and access it from your phone.
But there is another way and that is to use an application that lets you install and use LLMs (or should I say SLMs, Small Language Models) on your phone directly instead of relying on your local AI server on another computer.
Allow me to share my experience with experimenting with LLMs on a phone.
Here's what you'll need:
After researching, I decided to explore the following applications for this purpose. Let me share their features and details.
MLC Chat supports top models like Llama 3.2, Gemma 2, Phi 3.5, and Qwen 2.5, offering offline chat, translation, and multimodal tasks through a sleek interface. Its plug-and-play setup with pre-configured models, NPU optimization (e.g., Snapdragon 8 Gen 2+), and beginner-friendly features make it a good choice for on-device AI.
You can download the MLC Chat APK from their GitHub release page.
Android is looking to forbid sideloading of APK files. I don't know what would happen then, but you can use APK files for now.
Put the APK file on your Android device, go into Files and tap the APK file to begin installation. Enable “Install from Unknown Sources” in your device settings if prompted. Follow on-screen instructions to complete the installation.

Once installed, open the MLC Chat app and select a model from the list, such as Phi-2, Gemma 2B, Llama-3 8B, or Mistral 7B. Tap the download icon to install the model. I recommend opting for smaller models like Phi-2. Models are downloaded on first use and cached locally for offline use.

Tap the Chat icon next to the downloaded model. Start typing prompts to interact with the LLM offline. Use the reset icon to start a new conversation if needed.

SmolChat is an open-source Android app that runs any GGUF-format model (like Llama 3.2, Gemma 3n, or TinyLlama) directly on your device, offering a clean, ChatGPT-like interface for fully offline chatting, summarization, rewriting, and more.
Install SmolChat from Google's Play Store. Open the app, choose a GGUF model from the app’s model list or manually download one from Hugging Face. If manually downloading, place the model file in the app’s designated storage directory (check app settings for the path).



Google AI Edge Gallery is an experimental open-source Android app (iOS soon) that brings Google's on-device AI power to your phone, letting you run powerful models like Gemma 3n and other Hugging Face models fully offline after download. This application makes use of Google’s LiteRT framework.
You can download it from Google Play Store. Open the app and browse the list of provided models or manually download a compatible model from Hugging Face.
Select the downloaded model and start a chat session. Enter text prompts or upload images (if supported by the model) to interact locally. Explore features like prompt discovery or vision-based queries if available.



Here are the best ones I’ve used:
| Model | My Experience | Best For |
|---|---|---|
| Google’s Gemma 3n (2B) | Blazing-fast for multimodal tasks including image captions, translations, even solving math problems from photos. | Quick, visual-based AI assistance |
| Meta’s Llama 3.2 (1B/3B) | Strikes the perfect balance between size and smarts. It’s great for coding help and private chats. The 1B version runs smoothly even on mid-range phones. | Developers & privacy-conscious users |
| Microsoft’s Phi-3 Mini (3.8B) | Shockingly good at summarizing long documents despite its small size. | Students, researchers, or anyone drowning in PDFs |
| Alibaba’s Qwen-2.5 (1.8B) | Surprisingly strong at visual question answering—ask it about an image, and it actually understands! | Multimodal experiments |
| TinyLlama-1.1B | The lightweight champ runs on almost any device without breaking a sweat. | Older phones or users who just need a simple chatbot |
All these models are available in aggressively quantized builds (in formats such as GGUF or safetensors), so they’re tiny but still powerful. You can grab them from Hugging Face: just download, load into an app, and you’re set.
Getting large language models (LLMs) to run smoothly on my phone has been equally exhilarating and frustrating.
On my Snapdragon 8 Gen 2 phone, models like Llama 3-4B run at a decent 8-10 tokens per second, which is usable for quick queries. But when I tried the same on my backup Galaxy A54 (6 GB RAM), it choked. Loading even a 2B model pushed the device to its limits. I quickly learned that Phi-3-mini (3.8B) or Gemma 2B are far more practical for mid-range hardware.
The first time I ran a local AI session, I was shocked to see 50% of the battery gone in under 90 minutes. MLC Chat offers a power-saving mode for this purpose. Closing background apps to free up RAM also helps.
I also experimented with 4-bit quantized models (like Qwen-1.5-2B-Q4) to save storage but noticed they struggle with complex reasoning. For medical or legal queries, I had to switch back to 8-bit versions. It was slower but far more reliable.
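The storage saving from quantization follows from simple arithmetic: the weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below is a rough estimate only. It ignores file metadata and any layers kept at higher precision, and `estimateWeightSizeGB` is a hypothetical helper, not part of any app mentioned here.

```javascript
// Rough size of a model's weights in gigabytes (1 GB = 1e9 bytes).
function estimateWeightSizeGB(parameterCount, bitsPerWeight) {
  const bytes = (parameterCount * bitsPerWeight) / 8;
  return bytes / 1e9;
}

const params = 2e9; // a 2B-parameter model

console.log(estimateWeightSizeGB(params, 4)); // 4-bit weights → 1
console.log(estimateWeightSizeGB(params, 8)); // 8-bit weights → 2
```

By this estimate, a 2B model needs about 1 GB at 4 bits and about 2 GB at 8 bits, which is why the 4-bit versions are attractive on phones with limited storage and RAM.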
I love the idea of having an AI assistant that works exclusively for me, with no monthly fees and no data leaks. Need a translator in a remote village? A virtual assistant on a long flight? A private brainstorming partner for sensitive ideas? Your phone becomes all of these while staying offline and untraceable.
I won’t lie, it’s not perfect. Your phone isn’t a data center, so you’ll face challenges like battery drain and occasional overheating. But in exchange, you get total privacy, zero costs, and offline access.
The future of AI isn’t just in the cloud, it’s also on your device.

Bhuwan Mishra is a Fullstack developer, with Python and Go as his tools of choice. He takes pride in building and securing web applications, APIs, and CI/CD pipelines, as well as tuning servers for optimal performance. He also has passion for working with Kubernetes.
We’ve covered a lot of local LLMs on It's FOSS. You can use them as coding assistants or run them on your tiny Raspberry Pi setups.
But recently, I’ve noticed many comments asking about local AI tools to interact with PDFs and documents.
Now, during my research, I stumbled upon countless AI-powered websites that promise to summarize, query, or analyze PDFs.
Some were sleek and polished but unsurprisingly, most were paid or had limited “free tier” options. And let’s be honest, when you’re uploading documents to a cloud service, there’s no real guarantee of privacy.
That’s why I’ve put together this list of open-source AI projects that let you interact with PDFs locally. These tools keep your data on your machine, offline, and under your control.
Whether you’re summarizing long research papers, extracting key insights, or just searching for specific details, these tools will have your back.
Let’s dive in!
chatd is a desktop application that allows you to chat with your documents locally using a large language model.
Unlike other tools, chatd comes with a built-in LLM runner, so you don’t need to install anything extra, just download, unzip, and run the executable.

Key features:
LocalGPT is an open-source solution that enables you to securely interact with your documents locally.
Built for ultimate privacy, LocalGPT ensures that no data ever leaves your computer, making it a perfect fit for privacy-conscious users.

Key features:
PrivateGPT is a production-ready, privacy-focused AI project that enables you to interact with your documents using Large Language Models (LLMs), completely offline.
No data ever leaves your local environment, making it ideal for privacy-sensitive industries like healthcare, legal, or finance.
Having personally used this project, I highly recommend it for its privacy and performance once set up.

Key features:
It's FOSS · Abhishek Kumar
GPT4All is another open-source project that enables you to run large language models (LLMs) offline on everyday desktops or laptops, no internet, API calls, or GPUs required.
The application is designed to run smoothly on a variety of systems. It's perfect for privacy-conscious users who want local AI capabilities to interact with documents or chat seamlessly.

Key features:
LM Studio has become my go-to tool for daily use, and it’s easily my favorite project in this space.
With the latest release (version 0.3), it introduced the ability to chat with your documents, a beta feature that has worked exceptionally well for me so far.

Key features:
Personally, I use LM Studio daily. As a university student, reading through PDFs day in and day out can be quite tiresome. That's why I like to fiddle around with such projects and look for what best suits my workflow.
I started with PrivateGPT, but once I tried LM Studio, I instantly fell in love with its clean UI and the ease of downloading models.
While I’ve also experimented with Ollama paired with Open WebUI, which worked well, LM Studio has truly become my go-to tool for handling documents efficiently.
These are some of the projects I recommend for interacting with or chatting with PDF documents. However, if you know of more tools that offer similar functionality, feel free to comment below and share them with the community!