An AI assistant in the terminal can guide you through the process and help you move faster with your tasks. I tested Qwen Code and am sharing my findings with you.
When I started experimenting with AI integrations, I wanted to create a chat assistant on my website, something that could talk like GPT-4, reason like Claude, and even joke like Grok.
But OpenAI, Anthropic, Google, and xAI all require API keys. That means I needed to set up an account for each of the platforms and upgrade to one of their paid plans before I could start coding. Why? Because most of these LLM providers require a paid plan for API access. Not to mention, I would need to cover API usage billing for each LLM platform.
What if I told you there's an easier way to start integrating AI into your websites and mobile applications, without requiring any API keys at all? Sounds exciting? Let me share how I did exactly that.
It's all thanks to Puter.js, an open source JavaScript library that lets you use cloud features like AI models, storage, databases, and user auth, all from the client side. No servers, no API keys, no backend setup needed. What more can you ask for as a developer?
Puter.js is built around Puter’s decentralized cloud platform, which handles key management, routing, usage limits, and billing. Everything’s abstracted away so cleanly that, from your side, it feels like authentication, AI, and LLMs just live in your browser.
Enough talking, let’s see how you can add GPT-5 integration to your web application in fewer than 10 lines.
<html>
<body>
<script src="https://js.puter.com/v2/"></script>
<script>
puter.ai.chat(`What is puter js?`, {
model: 'gpt-5-nano',
}).then(puter.print);
</script>
</body>
</html>
Yes, that’s it. Unbelievable, right? Save the HTML code as index.html in a new, empty directory. Open a terminal, switch to the directory where index.html is located, and serve it on localhost with this Python command:
python -m http.server
Then open http://localhost:8000 in your web browser. Click the Puter.js “Continue” button when prompted.

🚧 It can take some time before you see a response from ChatGPT. Until then, you'll see a blank page.
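One way to avoid that blank page is to show a placeholder until the reply arrives. The sketch below is only an illustration and not part of Puter.js: `renderWithPlaceholder` and the `answer` element id are hypothetical names, and the target can be any object with a `textContent` property, so the function also runs outside the browser.

```javascript
// Sketch: show a placeholder while a slow async task is pending.
// `target` is anything with a `textContent` property (a DOM element in the browser).
// `task` is a function returning a promise; its result is converted to a string.
async function renderWithPlaceholder(target, task, placeholder = 'Thinking...') {
  target.textContent = placeholder; // visible immediately, before the task resolves
  try {
    const result = await task();
    target.textContent = String(result);
  } catch (error) {
    target.textContent = `Error: ${error.message}`;
  }
  return target.textContent;
}

// In the page above, usage might look like this. It assumes Puter.js is loaded,
// that a <div id="answer"> exists (hypothetical), and that the chat response
// converts to its text via String(), as the puter.print example suggests:
// renderWithPlaceholder(document.getElementById('answer'), () =>
//   puter.ai.chat('What is puter js?', { model: 'gpt-5-nano' }));
```

Passing the task as a function keeps the helper independent of Puter.js, so the same pattern works for any slow promise.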

You can explore a lot of examples and get an idea of what Puter.js does for you on its playground.
Let’s modify the code to make it more interesting this time. It will take a user query and stream responses from three different LLMs so that users can decide which of the three gives the best result.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Model Comparison</title>
<script src="https://cdn.twind.style"></script>
<script src="https://js.puter.com/v2/"></script>
</head>
<body class="bg-gray-900 min-h-screen p-6">
<div class="max-w-7xl mx-auto">
<h1 class="text-3xl font-bold text-white mb-6 text-center">AI Model Comparison</h1>
<div class="mb-6">
<label for="queryInput" class="block text-white mb-2 font-medium">Enter your query:</label>
<div class="flex gap-2">
<input
type="text"
id="queryInput"
class="flex-1 px-4 py-3 rounded-lg bg-gray-800 text-white border border-gray-700 focus:outline-none focus:border-blue-500"
placeholder="Write a detailed essay on the impact of artificial intelligence on society"
value="Write a detailed essay on the impact of artificial intelligence on society"
/>
<button
id="submitBtn"
class="px-6 py-3 bg-blue-600 hover:bg-blue-700 text-white rounded-lg font-medium transition-colors"
>
Generate
</button>
</div>
</div>
<div class="grid grid-cols-1 md:grid-cols-3 gap-4">
<div class="bg-gray-800 rounded-lg p-4">
<h2 class="text-xl font-semibold text-blue-400 mb-3">Claude Opus 4</h2>
<div id="output1" class="text-gray-300 text-sm leading-relaxed h-96 overflow-y-auto whitespace-pre-wrap"></div>
</div>
<div class="bg-gray-800 rounded-lg p-4">
<h2 class="text-xl font-semibold text-green-400 mb-3">Claude Sonnet 4</h2>
<div id="output2" class="text-gray-300 text-sm leading-relaxed h-96 overflow-y-auto whitespace-pre-wrap"></div>
</div>
<div class="bg-gray-800 rounded-lg p-4">
<h2 class="text-xl font-semibold text-purple-400 mb-3">Gemini 2.0 Flash-Lite</h2>
<div id="output3" class="text-gray-300 text-sm leading-relaxed h-96 overflow-y-auto whitespace-pre-wrap"></div>
</div>
</div>
</div>
<script>
const queryInput = document.getElementById('queryInput');
const submitBtn = document.getElementById('submitBtn');
const output1 = document.getElementById('output1');
const output2 = document.getElementById('output2');
const output3 = document.getElementById('output3');
async function generateResponse(query, model, outputElement) {
outputElement.textContent = 'Loading...';
try {
const response = await puter.ai.chat(query, {
model: model,
stream: true
});
outputElement.textContent = '';
for await (const part of response) {
if (part?.text) {
outputElement.textContent += part.text;
outputElement.scrollTop = outputElement.scrollHeight;
}
}
} catch (error) {
outputElement.textContent = `Error: ${error.message}`;
}
}
async function handleSubmit() {
const query = queryInput.value.trim();
if (!query) {
alert('Please enter a query');
return;
}
submitBtn.disabled = true;
submitBtn.textContent = 'Generating...';
submitBtn.classList.add('opacity-50', 'cursor-not-allowed');
await Promise.all([
generateResponse(query, 'claude-opus-4', output1),
generateResponse(query, 'claude-sonnet-4', output2),
generateResponse(query, 'google/gemini-2.0-flash-lite-001', output3)
]);
submitBtn.disabled = false;
submitBtn.textContent = 'Generate';
submitBtn.classList.remove('opacity-50', 'cursor-not-allowed');
}
submitBtn.addEventListener('click', handleSubmit);
queryInput.addEventListener('keydown', (e) => {
if (e.key === 'Enter' && !submitBtn.disabled) {
handleSubmit();
}
});
</script>
</body>
</html>
Save the above code as index.html, as we did in the previous example, and then run the server with Python. This is what it looks like now on localhost.

And here is a sample response from all three models to the query "What is It's FOSS".

Looks like It's FOSS is well trusted by humans as well as AI 😉
That’s not bad! Without requiring any API keys, you can do this crazy stuff.
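The streaming loop inside generateResponse can also be tried without a browser. The following is a rough sketch under a few assumptions: `collectStream`, `compareModels`, and `fakeChat` are hypothetical names, and the chat function is passed in as a parameter so that a stand-in can replace `puter.ai.chat` when Puter.js is not loaded. It uses `Promise.allSettled`, so one failing model does not hide the results of the others.

```javascript
// Concatenate the `text` field of each streamed part, calling onChunk as text arrives.
async function collectStream(stream, onChunk = () => {}) {
  let full = '';
  for await (const part of stream) {
    if (part?.text) {
      full += part.text;
      onChunk(part.text, full);
    }
  }
  return full;
}

// Query several models in parallel and map each model name to its full answer.
// `chat` is injected; in the page it could be (q, opts) => puter.ai.chat(q, opts).
async function compareModels(chat, query, models) {
  const settled = await Promise.allSettled(
    models.map(async (model) =>
      collectStream(await chat(query, { model, stream: true }))
    )
  );
  return Object.fromEntries(
    settled.map((result, i) => [
      models[i],
      result.status === 'fulfilled'
        ? result.value
        : `Error: ${result.reason?.message ?? result.reason}`,
    ])
  );
}

// Hypothetical stand-in for puter.ai.chat, used only to exercise the sketch.
async function fakeChat(query, { model }) {
  if (model === 'broken-model') throw new Error('model unavailable');
  async function* parts() {
    yield { text: `[${model}] ` };
    yield {}; // parts without text are skipped
    yield { text: query.toUpperCase() };
  }
  return parts();
}

// Example call (prints one entry per model):
// compareModels(fakeChat, 'hello', ['model-a', 'broken-model']).then(console.log);
```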
Puter.js uses the “User-Pays” model, which means it’s completely free for developers, and your application's users spend credits from their own Puter account on the cloud features they use, such as storage and LLMs. I reached out to the team to understand their pricing structure, but at the moment they are still working on a pricing plan.
The new Puter.js library is seriously underrated. I’m still amazed by how easy it makes LLM integration. Beyond that, you can use the Puter.js SDK for authentication and storage, much like Firebase.
Do check out this wonderful open source JavaScript library and explore what else you can build with it.
My interest in running AI models locally started as a side project, part curiosity and part irritation with cloud limits. There’s something satisfying about running everything on your own box. No API quotas, no censorship, no signups. That’s what pulled me toward local inference.
My setup, an AMD GPU on Windows, turned out to be the worst combination for most local AI stacks.
The majority of AI stacks assume NVIDIA + CUDA, and if you don’t have that, you’re basically on your own. ROCm, AMD’s so-called CUDA alternative, doesn’t even work on Windows, and even on Linux, it’s not straightforward. You end up stuck with CPU-only inference or inconsistent OpenCL backends that feel a decade behind.
I started with the usual tools, Ollama and LM Studio. Both deserve credit for making local AI look plug-and-play. I tried LM Studio first, but I soon discovered how it hijacks my taskbar. I frequently jump from one application window to another using the mouse, so this got annoying. Another thing that annoyed me was its installer size of 528 MB.
I’m a big advocate for keeping things minimal yet functional. I admire a functional text editor that fits under 1 MB (Dred), a reactive JavaScript library and React alternative that fits under 1 KB (Van JS), and a game engine that fits under 100 MB (Godot).
Then I tried Ollama. Being a CLI user (even on Windows), I was impressed with it. I don’t need to spin up an Electron application (LM Studio) to run an AI model locally.
With just two commands, you can run an AI model locally with Ollama:
ollama pull tinyllama
ollama run tinyllama
Once I started testing different AI models, I needed to reclaim disk space. My initial approach was to delete the models manually from File Explorer. I was a bit paranoid! But soon, I discovered these Ollama commands:
ollama rm tinyllama # removes the model
ollama ls # lists all models
As for how lightweight Ollama is, it comes close to 4.6 GB on my Windows system, although you can delete unnecessary files to slim it down (it comes bundled with libraries like rocm, cuda_v13, and cuda_v12).
After trying Ollama, I was curious: does LM Studio even provide a CLI? After some research, I learned that it does offer a command-line interface. I investigated further and found out that LM Studio uses Llama.cpp under the hood.
With these two commands, I can run LM Studio via the CLI and chat with an AI model while staying in the terminal:
lms load <model name> # loads the model
lms chat # starts the interactive chat
I was generally satisfied with the LM Studio CLI at this point. I also noticed that it came with Vulkan support out of the box. I had been looking to add Vulkan support to Ollama, and the approach I found was to compile Ollama from source and enable Vulkan support manually. That’s a real hassle!

I had just three complaints at this point. Every time I needed to use the LM Studio CLI (lms), it would take some time to wake up its Windows service. The lms CLI is also not feature-rich; it does not even provide a way to delete a model. And the last one was that it takes two steps: load the model first, and then chat.
After the chat is over, you need to manually unload the model. This mental model doesn’t make sense to me.
That’s where I started looking for something more open, something that actually respected the hardware I had. That’s when I stumbled onto Llama.cpp, with its Vulkan backend and refreshingly simple approach.
Head over to its GitHub releases page and download the latest release for your platform.
Extract the downloaded zip file and, optionally, move the directory to where you usually keep your binaries, like /usr/local/bin on macOS and Linux. On Windows 10, I usually keep it under %USERPROFILE%\.local\bin.
Now, you need to add its directory location to the PATH environment variable.
On Linux and macOS (replace <path-to-llama-cpp-directory> with your exact directory location):
export PATH="$PATH:<path-to-llama-cpp-directory>"
On Windows 10 and Windows 11:
setx PATH "%PATH%;<path-to-llama-cpp-directory>"
Open a new terminal so the setx change takes effect. Now, Llama.cpp is ready to use.
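Before moving on, it can help to confirm that the directory really ended up on PATH. Here is a small sketch for Node.js; `isOnPath` is a hypothetical helper name, and the comparison is a simple exact match that ignores only trailing slashes, so on Windows (where paths are case-insensitive) it may report false for a differently cased entry.

```javascript
// Return true if `dir` is one of the entries of a PATH-style string.
// Defaults to the current process PATH and the platform's separator
// (";" on Windows, ":" elsewhere).
function isOnPath(
  dir,
  pathVar = process.env.PATH ?? '',
  delimiter = process.platform === 'win32' ? ';' : ':'
) {
  // Ignore trailing slashes or backslashes so "/opt/llama/" matches "/opt/llama".
  const normalize = (p) => p.replace(/[\\/]+$/, '');
  return pathVar
    .split(delimiter)
    .filter(Boolean) // skip empty entries such as a trailing separator
    .some((entry) => normalize(entry) === normalize(dir));
}

// Example with explicit arguments (the directories are placeholders):
console.log(isOnPath('/opt/llama.cpp', '/usr/bin:/opt/llama.cpp/', ':'));
```

Run it in a freshly opened terminal, since an already open shell keeps its old PATH.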
Just grab a .gguf file, point to it, and run. It reminded me why I love tinkering on Linux in the first place: fewer black boxes, more freedom to make things work your way.
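If a downloaded model refuses to load, a quick sanity check is to look at its first bytes: GGUF files begin with the four ASCII characters "GGUF". The sketch below checks only that magic value and nothing else about the file; `looksLikeGguf` is a hypothetical helper name, and the path in the usage comment is simply the example path used in this article.

```javascript
// GGUF files start with the four ASCII bytes "GGUF".
const GGUF_MAGIC = [0x47, 0x47, 0x55, 0x46];

// `bytes` is a Uint8Array (or Node Buffer) holding at least the first 4 bytes of the file.
function looksLikeGguf(bytes) {
  if (!bytes || bytes.length < GGUF_MAGIC.length) return false;
  return GGUF_MAGIC.every((value, i) => bytes[i] === value);
}

// Usage in Node.js, reading only the first 4 bytes of the file:
// const fs = require('node:fs');
// const fd = fs.openSync('e:\\models\\Qwen3-8B-Q4_K_M.gguf', 'r');
// const head = Buffer.alloc(4);
// fs.readSync(fd, head, 0, 4, 0);
// fs.closeSync(fd);
// console.log(looksLikeGguf(head));
```

A false result usually points to an interrupted download or an error page saved under a .gguf name.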
With just one command, you can start a chat session with Llama.cpp:
llama-cli.exe -m e:\models\Qwen3-8B-Q4_K_M.gguf --interactive
If you read its verbose output carefully, it clearly shows that the GPU is being used:

With llama-server, you can even download AI models from Hugging Face, like:
llama-server -hf itlwas/Phi-4-mini-instruct-Q4_K_M-GGUF:Q4_K_M
The -hf flag tells llama-server to download the model from the Hugging Face repository.
You even get a web UI with Llama.cpp. For example, run the model with this command:
llama-server -m e:\models\Qwen3-8B-Q4_K_M.gguf --port 8080 --host 127.0.0.1
This starts a web UI on http://127.0.0.1:8080 and lets other applications send API requests to Llama.cpp.

Let’s send an API request via curl:
curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d "{\"prompt\":\"Explain the difference between OpenCL and SYCL in short.\",\"temperature\":0.7,\"max_tokens\":128}"
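The same request can be sent from JavaScript, which is handy if another application needs to talk to the server. This is a minimal sketch, assuming llama-server is running on 127.0.0.1:8080 as in the command above and that a global `fetch` is available (a browser or Node.js 18+). `buildCompletionRequest` and `complete` are hypothetical helper names, and reading the generated text from a `content` field is an assumption about the response shape that you should verify against your llama.cpp version.

```javascript
// Build the same JSON request the curl command sends.
function buildCompletionRequest(prompt, { temperature = 0.7, maxTokens = 128 } = {}) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, temperature, max_tokens: maxTokens }),
  };
}

// Send the prompt to a running llama-server and return the generated text.
async function complete(prompt, baseUrl = 'http://127.0.0.1:8080') {
  const response = await fetch(`${baseUrl}/completion`, buildCompletionRequest(prompt));
  if (!response.ok) {
    throw new Error(`llama-server returned HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.content; // assumption: the generated text is in `content`
}

// Example call (needs the server from the previous step to be running):
// complete('Explain the difference between OpenCL and SYCL in short.')
//   .then(console.log)
//   .catch(console.error);
```

Building the request in its own function avoids the quote escaping that the curl command needs on Windows.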
What am I losing by using Llama.cpp? Nothing. Like Ollama, it gives me a feature-rich CLI, plus Vulkan support. And it all comes in under 90 MB on my Windows 10 system.
Now, I don’t see the point of using Ollama or LM Studio. I can download any model with llama-server, run it directly with llama-cli, and even interact with it through the web UI and API requests.
I’m hoping to benchmark how AI inference on Vulkan performs compared to pure CPU and SYCL implementations in a future post. Until then, keep exploring AI tools and the ecosystem to make your life easier. Use AI to your advantage rather than getting stuck in endless debates over questions like, will AI take our jobs?