Software · AI

Mocca

A local AI chat app you run on your own machine. It loads GGUF models in-process, keeps every conversation on disk, and never depends on a cloud or a separate model server.

Year: 2026 →
Role: Solo · End-to-end
Status: In development

GitHub ↗

What it is

Mocca is a self-hosted web UI for chatting with local language models. You point it at a GGUF file and it runs the model in-process through llama-cpp-python, so there is no Ollama, no external model server, and nothing leaving your machine to answer a question.

It is built around four goals. Easy to use, down to downloading new models from inside the app. Standalone and local-first. Cross-platform across Windows and Arch Linux. And heavily logged, so problems are quick to trace. The interface is a deliberately monochrome dark theme.

How it is built

A FastAPI backend on uvicorn, split into one router per area. The frontend is plain HTML, CSS, and native ES modules with no build step and no framework. Storage is SQLite through the standard library. The whole thing stays dependency-light on purpose.

The inference engine is imported lazily, so the UI, model browsing, downloads, and settings all work before llama-cpp-python is even installed. Generation runs on a worker thread that streams tokens back to the browser over Server-Sent Events, and the loaded model unloads itself after a stretch of inactivity to give the memory back.

Models come from the public Hugging Face API rather than a hand-maintained list. Mocca resolves the most-downloaded GGUF repos to a single right-sized file, reads each one's architecture from its header, and skips anything too new for the bundled llama.cpp to load. Native hardware detection rates whether a given model will fit your RAM or VRAM before you download it.

What the model can do

The chat model can call tools, small self-contained capabilities like a calculator, unit conversion, file access, web search, weather, and reading a YouTube transcript. Adding one is dropping a file; the registry discovers it at startup. A model-based router offers only the tool categories a given message needs, which matters a lot on CPU, where verbose tool schemas dominate prompt-processing time.

Mocca learns durable facts about you across conversations. After each turn a separate, focused pass extracts what is worth keeping (name, location, work, standing preferences) and stores it, then injects those facts into later chats. It is the closest local, transparent stand-in for updating the model itself, and every fact is listed in Settings with a delete button.

A chat can also have documents attached. The model never gets their contents in the prompt. It reads them on demand through a session-scoped tool, so one chat can never see another's files, and writes edits back to the right document.

Local-first, by construction

The one sanctioned outbound capability is web search, on by default but a single toggle away from fully offline. Everything else, every chat, every setting, every downloaded model, lives on your machine. There is no account, no telemetry, no cloud.

Graceful degradation is the rule throughout. If hardware detection fails, the app hides the fit hints and keeps working. If Hugging Face is unreachable, the catalog says so instead of shipping stale entries. Mocca should never fail to boot or chat because something optional was missing.

Highlights

Runs models in-process. GGUF files loaded directly via llama-cpp-python. No Ollama, no external model server.
Local-first by default. No account, no telemetry, no cloud. One toggle takes it fully offline.
Tools the model can call. Calculator, units, files, web, weather, YouTube. Drop a file to add one.
Long-term memory. A focused pass captures durable facts and recalls them in later chats, each one deletable.
Hugging Face catalog. Live, right-sized GGUF picks with hardware-fit ratings before you download.
No build step. FastAPI backend, vanilla ES-module frontend, SQLite storage. Dependency-light by design.

Stack

Python · FastAPI · LLama · SQLite