High-memory CPU inference
A 384GB plan loads a 70B model in 4-bit quant entirely in RAM. llama.cpp and vLLM handle the rest. 5–15 tok/s on EPYC — production-grade for chat, copilot, and RAG.
High-memory EPYC platforms with 96–768 GB DDR5, NVMe storage, and 3–10 Gbps networking. CPU inference with llama.cpp and vLLM handles production LLM workloads at a fraction of the GPU price, with no per-GB egress fees on top.
CPU LLM inference is the practice of running large language model generation on standard server processors instead of dedicated GPU accelerators, using runtimes such as llama.cpp, vLLM's CPU backend, or Hugging Face Transformers. It became viable in 2024–2026 because modern AMD EPYC and Intel Xeon platforms ship with 12-channel DDR5 memory delivering 400–600 GB/s of aggregate bandwidth, and memory bandwidth, not compute, is the bottleneck that governs token throughput in transformer decoding. With 4-bit GGUF quantization, a 70B-parameter model fits in roughly 40 GB of RAM, leaving headroom for KV-cache and concurrent requests on a 384 GB server. Real-world throughput on a 64-core EPYC is 5–15 tokens per second per stream, which is production-grade for chat assistants, copilots, RAG pipelines, and batch summarization. Compared to renting GPU instances, the cost per million tokens is typically 5–20 times lower, and there is no per-GB egress charge.
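The "roughly 40 GB" and "5–15 tokens per second" figures can be sanity-checked with back-of-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a benchmark: it assumes about 4.5 bits per weight (4-bit values plus per-block scales), a Llama-style 70B layout with 80 layers and 8 KV heads of dimension 128, an FP16 KV-cache, and a decoder that reads every weight once per generated token. The function names are illustrative.

```python
# Back-of-envelope sizing for CPU decoding. All constants are assumptions.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV-cache size in GB for one stream: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed.

    Each generated token reads all weights from RAM once, so throughput
    cannot exceed bandwidth / model size. Measured numbers come in lower.
    """
    return bandwidth_gb_s / model_gb


# Assumed: ~4.5 bits per weight for a 4-bit quantized 70B model.
model_gb = weights_gb(70, 4.5)

# Assumed architecture: 80 layers, 8 KV heads of dimension 128
# (grouped-query attention), FP16 cache, 8192-token context.
kv_gb = kv_cache_gb(8192, layers=80, kv_heads=8, head_dim=128)

print(f"weights: {model_gb:.1f} GB")
print(f"KV cache per 8k-token stream: {kv_gb:.2f} GB")
for bw in (400, 600):  # GB/s, the bandwidth range quoted above
    ceiling = decode_ceiling_tok_s(bw, model_gb)
    print(f"{bw} GB/s -> at most {ceiling:.1f} tok/s per stream")
```

Under these assumptions the weights come to about 39 GB and the bandwidth ceiling lands at roughly 10 to 15 tokens per second. Measured throughput sits below that ceiling because of KV-cache reads, NUMA effects, and compute overhead.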
The same high-RAM platforms run Qdrant, Weaviate, Milvus, or pgvector with billion-scale embedding indexes. NVMe storage keeps retrieval latency in the single-digit-millisecond range.
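How much RAM a billion-scale index needs depends mostly on vector dimension and component width. The sketch below estimates the raw vector payload only. It assumes 768-dimensional embeddings (a hypothetical but common size) and ignores index overhead such as graph links, which each engine adds on top.

```python
def raw_vectors_gb(num_vectors: int, dim: int, bytes_per_component: float) -> float:
    """Raw embedding payload in GB (1 GB = 1e9 bytes), excluding index overhead."""
    return num_vectors * dim * bytes_per_component / 1e9


NUM_VECTORS = 1_000_000_000  # one billion embeddings
DIM = 768                    # assumed embedding dimension

for label, width in (("float32", 4), ("float16", 2), ("int8", 1)):
    size = raw_vectors_gb(NUM_VECTORS, DIM, width)
    print(f"{label}: {size:,.0f} GB")
```

At this dimension, full-precision vectors exceed the largest RAM configuration, which is where an engine's quantization or on-disk storage options come in.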
Always-on inference is the worst possible workload for hourly cloud pricing. Fixed monthly bare metal flips the math — and there are no egress fees on inference traffic.
Cloud GPU is built for spiky training. Inference lives 24/7.
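The hourly-versus-monthly argument comes down to a break-even utilization. Here is a minimal sketch, using the $449/mo starting price quoted on this page and a hypothetical $4.00/hour on-demand rate (not a quote from any provider). It compares bills only and does not account for throughput differences between the two machines.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def hourly_bill(rate_per_hour: float, utilization: float) -> float:
    """Monthly bill for an hourly-priced instance running a fraction of the month."""
    return rate_per_hour * HOURS_PER_MONTH * utilization


def break_even_utilization(fixed_monthly: float, rate_per_hour: float) -> float:
    """Fraction of the month above which the fixed-price server is cheaper.

    A result above 1.0 means the hourly instance is cheaper even when always on.
    """
    return fixed_monthly / (rate_per_hour * HOURS_PER_MONTH)


FIXED_MONTHLY = 449.0  # starting price quoted on this page
HOURLY_RATE = 4.00     # hypothetical on-demand rate, for illustration only

threshold = break_even_utilization(FIXED_MONTHLY, HOURLY_RATE)
print(f"break-even utilization: {threshold:.0%}")
print(f"always-on hourly bill: ${hourly_bill(HOURLY_RATE, 1.0):,.0f}/mo")
```

An inference endpoint that stays up around the clock sits at 100% utilization, far above the break-even point under these assumed rates.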
What it actually costs to run a 70B-class model in production. CPU inference on high-RAM EPYC platforms is cheaper than rented GPUs for almost every workload that doesn't need high-concurrency serving throughput.
| Provider | Hardware | Token cost (rough) | Crypto payment | Egress fees | Starts at |
|---|---|---|---|---|---|
| BareMetalServer.ai (CPU) | EPYC, 96–768 GB DDR5, llama.cpp / vLLM | $0.05–0.20 / 1M tokens | USDT, USDC, BTC | Included (up to 100 TB/mo) | $449/mo |
| Hetzner / OVH (CPU) | AMD / Intel dedicated, 64–256 GB | $0.05–0.30 / 1M tokens | No | Included | ~€50–500/mo |
| AWS p4d / p5 (GPU) | A100 / H100 instances | $1–10 / 1M tokens (provisioned) | No | Per-GB egress | ~$3,000–30,000/mo |
| Together / Replicate / OpenAI | Hosted GPU APIs | $0.20–15 / 1M tokens | No | N/A (per-call) | Pay per token |
Indicative pricing as of Q1 2026. Token cost is highly model- and quantization-dependent — figures assume 4-bit GGUF for CPU rows and FP16 for GPU rows. CPU inference is best for chat, RAG, copilots, and batch workloads where 5–15 tokens/sec/stream is enough; GPU is still the right answer for heavy concurrent serving.
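Per-token cost on a fixed-price server follows from the monthly price and the aggregate throughput the machine sustains. The sketch below shows that relationship. The throughput values in the loop are hypothetical inputs, not measurements, and the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month


def cost_per_million_tokens(monthly_price: float, aggregate_tok_s: float,
                            utilization: float = 1.0) -> float:
    """Dollars per 1M generated tokens on a fixed-price server."""
    tokens_per_month = aggregate_tok_s * utilization * SECONDS_PER_MONTH
    return monthly_price / tokens_per_month * 1_000_000


def required_aggregate_tok_s(monthly_price: float, target_cost_per_million: float,
                             utilization: float = 1.0) -> float:
    """Aggregate tokens/sec needed to reach a target cost per 1M tokens."""
    tokens_needed = monthly_price / target_cost_per_million * 1_000_000
    return tokens_needed / (SECONDS_PER_MONTH * utilization)


MONTHLY_PRICE = 449.0  # starting price from the table above

# Hypothetical aggregate throughputs (sum over all concurrent streams).
for tok_s in (10, 100, 1000):
    cost = cost_per_million_tokens(MONTHLY_PRICE, tok_s)
    print(f"{tok_s:>5} tok/s aggregate -> ${cost:.2f} per 1M tokens")
```

Aggregate throughput is the sum across all concurrent streams, so the result moves with model size, batching, and how busy the server stays.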
Deploy in 5–20 minutes. Pay in USDC. Run vLLM, llama.cpp, Ollama, or your own stack with full root.