The C/C++ engine powering local AI: lightning-fast inference that tools like Ollama and LM Studio build on.
+ One of the fastest local inference engines
+ Runs on virtually any hardware (CPU-only machines, NVIDIA/AMD GPUs, Apple Silicon)
+ Foundation of the local AI ecosystem
- Command-line interface only (usage sketch below)
- Requires compiling from source for best performance (build sketch below)
- Steep learning curve for beginners
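A minimal build-and-run sketch, assuming a recent llama.cpp checkout and a CMake toolchain; the backend flag shown (GGML_CUDA=ON for NVIDIA GPUs) and the model path are illustrative and vary by version and hardware:

    # Fetch the source and build in Release mode for best performance
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release

    # Optional: re-run the first cmake step with a GPU backend enabled,
    # e.g. cmake -B build -DGGML_CUDA=ON, then rebuild as above

    # Run a prompt against a local GGUF model (path is a placeholder)
    ./build/bin/llama-cli -m ./models/your-model.gguf -p "Hello" -n 64

Prebuilt binaries are published for common platforms, but compiling on the target machine lets the build use that machine's exact CPU and GPU features, which is what the compilation caveat above refers to.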
Free and open-source. MIT license.