Running AI Models in Your Browser: Exploring WebAssembly-Based Solutions

  • tangledgroup/llama-cpp-wasm: Compiles llama.cpp to WebAssembly, enabling small GGUF models (e.g., TinyLlama, Phi-2) to run in browsers without external APIs. Offers single-thread and multi-thread builds with an HTML interface for user prompts and real-time inference. Ideal for lightweight, local AI execution; requires a local server for testing (see the first sketch after this list). Link: https://github.com/tangledgroup/llama-cpp-wasm

  • ngxson/wllama: A WebAssembly binding for llama.cpp, supporting SIMD and multi-threaded inference of GGUF models in browsers. Features a JavaScript API for loading models from Hugging Face or URLs, with tools to split large models for efficient memory use. Runs inference in a Web Worker to keep the UI responsive (see the second sketch after this list). Link: https://github.com/ngxson/wllama

  • mlc-ai/web-llm: Engine for running language models (e.g., Llama 3, Phi-3) in browsers using WebAssembly and WebGPU for acceleration. Supports custom models compiled with MLC LLM (not GGUF) and offers plug-and-play integration via NPM/CDN. Not based on llama.cpp but robust for client-side inference; the WebGPU requirement limits compatibility to modern browsers (see the third sketch after this list). Link: https://github.com/mlc-ai/web-llm
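
First, a minimal llama-cpp-wasm sketch. The `LlamaCpp` wrapper class, its callback-style constructor, and the import path follow the repo's bundled demo as of writing; treat the exact names, options, and the model URL as assumptions and check the current examples before relying on them. Note that the multi-threaded build needs the page to be cross-origin isolated (COOP/COEP headers) for SharedArrayBuffer, which is one reason a plain file:// open won't work and a local server is required.

```typescript
// Usage sketch for llama-cpp-wasm — names follow the repo's demo code
// and are assumptions, not a stable API; verify against current examples.
import { LlamaCpp } from "./llama-mt/llama.js"; // multi-thread build; ./llama-st for single-thread

// Example model URL (assumption — substitute any small GGUF file).
const modelUrl =
  "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf";

const out = document.getElementById("output")!;

const app = new LlamaCpp(
  modelUrl,
  // onModelLoaded: the GGUF has been fetched and the wasm runtime is ready
  () =>
    app.run({
      prompt: "Explain WebAssembly in one sentence.",
      ctx_size: 2048, // options mirror llama.cpp CLI flags (assumption)
    }),
  // onMessageChunk: streamed generated text, appended as it arrives
  (text: string) => out.append(text),
  // onComplete: generation finished
  () => console.log("done"),
);
```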
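
Second, a wllama sketch following the pattern in its README: point the library at its wasm binaries, load a GGUF straight from Hugging Face, and request a completion. The exact config keys, hosted paths, and sampling options vary by package version, so treat them as assumptions.

```typescript
import { Wllama } from "@wllama/wllama";

// Map logical names to wherever you host the package's wasm binaries
// (exact keys/paths depend on the installed version — check its README).
const CONFIG_PATHS = {
  "single-thread/wllama.wasm": "/esm/single-thread/wllama.wasm",
  "multi-thread/wllama.wasm": "/esm/multi-thread/wllama.wasm",
};

const wllama = new Wllama(CONFIG_PATHS);

// Load a tiny GGUF model directly from a Hugging Face repo (repo id + file path).
await wllama.loadModelFromHF("ggml-org/models", "tinyllamas/stories260K.gguf");

// Inference runs in a Web Worker under the hood, so the page stays responsive.
const completion = await wllama.createCompletion("Once upon a time,", {
  nPredict: 50,
  sampling: { temp: 0.7, top_p: 0.9 },
});
console.log(completion);
```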
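
Third, a WebLLM sketch. Once an engine is created, WebLLM exposes an OpenAI-style chat completions API. The model id must come from WebLLM's prebuilt model list; the one used here is an assumption, so substitute any id from the current list.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads and compiles the model on first use (WebGPU required).
// Model id is an assumption — pick one from WebLLM's prebuilt list.
const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (report) => console.log(report.text),
});

// OpenAI-style chat completion served entirely in the browser.
const reply = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "Explain WebGPU in one sentence." },
  ],
});

console.log(reply.choices[0]?.message.content);
```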
