AI / INFERENCE

Run 70B models on metal that costs less than the GPU bill.

High-memory EPYC platforms with 96–768GB DDR5, NVMe storage, and 3–10 Gbps networking. CPU inference with llama.cpp and vLLM handles production LLM workloads at a fraction of the GPU price — and there's no per-token egress trap.

BAREMETALSERVER.AI · 42U
01
BMS-01
AMD RYZEN 9950X
192GB · 10GbE · ONLINE
02
BMS-02
AMD EPYC 9354P
192GB · 10GbE · ONLINE
03
BMS-03
2x EPYC 7443
256GB · 10GbE · ONLINE
04
BMS-04
THREADRIPPER 7965WX
512GB · 10GbE · ONLINE

What is CPU LLM inference?

CPU LLM inference is the practice of running large language model generation on standard server processors instead of dedicated GPU accelerators, using runtimes like llama.cpp, vLLM, ExLlamaV2 with CPU offload, or Hugging Face Transformers with bitsandbytes. It became viable in 2024–2026 because modern AMD EPYC and Intel Xeon platforms ship with 12-channel DDR5 memory delivering 400–600 GB/s of aggregate bandwidth, which is the bottleneck that actually governs token throughput for memory-bound transformer decoding. With 4-bit GGUF quantization, a 70B-parameter model fits in roughly 40 GB of RAM, leaving headroom for KV-cache and concurrent requests on a 384 GB server. Real-world throughput on a 64-core EPYC is 5–15 tokens per second per stream — production-grade for chat assistants, copilots, RAG pipelines, and batch summarization. Compared to renting GPU instances, the cost per million tokens is typically 5–20 times lower and there is no per-token egress charge.

llama.cppvLLMGGUFQ4_K_M quantizationAMD EPYCDDR5 bandwidthKV-cacheRAG
Jump to pricing
Why bare metal

Inference economics, without the cloud markup.

01

High-memory CPU inference

A 384GB plan loads a 70B model in 4-bit quant entirely in RAM. llama.cpp and vLLM handle the rest. 5–15 tok/s on EPYC — production-grade for chat, copilot, and RAG.

02

Vector DB ready

The same high-RAM platforms run Qdrant, Weaviate, Milvus, or pgvector at billion-scale embedding indexes. NVMe storage keeps recall latency in the single-digit-millisecond range.

03

Save 50–70% vs cloud

Always-on inference is the worst possible workload for hourly cloud pricing. Fixed monthly bare metal flips the math — and there are no egress fees on inference traffic.

Cloud GPU is built for spiky training. Inference lives 24/7.

Different workload, different machine
Hardware requirements

Pick the tier for your model class.

13B models
96–128 GB
Llama 3, Mistral
70B models
256–384 GB
4-bit GGUF
MoE / 100B+
512–768 GB
Mixtral, DeepSeek
Vector DB
128+ GB
Billion-scale recall
Common questions

Answers before you ask.

Can I run large language models without a GPU?
Yes. CPU-based inference with llama.cpp, vLLM, and ggml works well on high-memory servers. With 384–768GB RAM you can load 70B+ parameter models entirely in memory at reasonable throughput — no GPU required.
How does pricing compare to AWS or GCP?
For always-on inference, bare metal is typically 50–70% less than equivalent cloud instances — and there are no per-token egress fees stacked on top. A 192GB plan saves you thousands per year over comparable cloud SKUs.
Which ML frameworks are supported?
You get full root access, so anything that runs on Linux. PyTorch, TensorFlow, vLLM, llama.cpp, Ollama, Triton Inference Server, ONNX Runtime — install whatever fits your stack.
How fast is CPU inference for 70B models?
On 64+ core EPYC platforms, expect 5–15 tokens/second for 70B models in 4-bit quantization. That's production-grade for chat, copilot, and RAG workloads — and it costs a fraction of GPU inference for the same throughput floor.
Can I host vector databases on these servers?
Yes. The same high-RAM platforms run Qdrant, Weaviate, Milvus, or pgvector beautifully. NVMe storage handles billion-scale embedding indexes, and dedicated networking keeps query latency low.
Do you offer GPU servers?
Not yet — our current fleet is CPU-class. CPU inference covers a surprising amount of production workloads (LLMs in 4-bit, classical ML, vector search, embedding generation). GPU SKUs are on the roadmap.
Compared

CPU LLM inference vs GPU cloud, compared

What it actually costs to run a 70B-class model in production. CPU inference on high-RAM EPYC platforms is cheaper than rented GPUs for almost every workload that doesn't need real-time batch throughput.

ProviderHardwareToken cost (rough)Crypto paymentEgress feesStarts at
BareMetalServer.ai (CPU)EPYC, 96–768 GB DDR5, llama.cpp / vLLM$0.05–0.20 / 1M tokensUSDT, USDC, BTCIncluded (up to 100 TB/mo)$449/mo
Hetzner / OVH (CPU)AMD / Intel dedicated, 64–256 GB$0.05–0.30 / 1M tokensNoIncluded~€50–500/mo
AWS p4d / p5 (GPU)A100 / H100 instances$1–10 / 1M tokens (provisioned)NoPer-GB egress~$3,000–30,000/mo
Together / Replicate / OpenAIHosted GPU APIs$0.20–15 / 1M tokensNoN/A (per-call)Pay per token

Indicative pricing as of Q1 2026. Token cost is highly model- and quantization-dependent — figures assume 4-bit GGUF for CPU rows and FP16 for GPU rows. CPU inference is best for chat, RAG, copilots, and batch workloads where 5–15 tokens/sec/stream is enough; GPU is still the right answer for heavy concurrent serving.

Ready when you are

Stop renting GPUs you don't need 24/7.

Deploy in 5–20 minutes. Pay in USDC. Run vLLM, llama.cpp, Ollama, or your own stack with full root.

See recommended hardware
No setup fees · No contracts · USDC accepted · Deploy in 5–20 min