Technical

TernaryPhysics-7B: Our Quantized LLM

10 min read · April 2026

TernaryPhysics-7B is the brain behind our agents' conversational capabilities: a 4-bit quantized model that runs entirely on CPU, with no GPU required. This post explains what it is, how we built it, and why we made the choices we did.

Model Specifications

Model Size         7 billion parameters (quantized)
Disk Space         4.7 GB
Context Window     4,096 tokens
Inference Speed    ~17 tok/s on Apple Silicon (M-series); ~10 tok/s on commodity x86 CPU (no GPU)
RAM Required       8 GB minimum
GPU Required       No
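The disk figure follows from simple arithmetic. A sketch of the back-of-the-envelope math (the numbers are illustrative; real quantized files also carry per-block scales and higher-precision layers, which is where the gap between raw weights and the shipped 4.7 GB comes from):

```python
# Back-of-the-envelope memory math for a 4-bit quantized 7B model.
PARAMS = 7_000_000_000
BITS_PER_WEIGHT = 4

raw_bytes = PARAMS * BITS_PER_WEIGHT / 8   # weights alone, no metadata
raw_gb = raw_bytes / 1024**3               # ~3.26 GiB

# Quantized formats also store a scale (and often a zero-point) per block
# of weights, and typically keep embeddings/output layers at higher
# precision -- hence a shipped file larger than the raw weight count.
print(f"raw 4-bit weights: {raw_gb:.2f} GiB")
```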

Choosing the Right Model

We evaluated dozens of models for infrastructure investigation tasks. Our criteria:

  • Instruction following. The model needs to understand complex multi-step queries about infrastructure.
  • Technical knowledge. It needs to understand Kubernetes, databases, networking, Linux internals.
  • Reasoning ability. Root cause analysis requires multi-hop reasoning.
  • Efficiency. Must run well on CPU without excessive resource usage.
  • Permissive license. Must be deployable commercially without restrictions.

TernaryPhysics-7B is the result of extensive evaluation and optimization. It provides strong infrastructure reasoning capabilities while running efficiently on standard hardware.

What is Quantization?

Neural networks typically store weights as 32-bit or 16-bit floating-point numbers. Quantization maps those weights to lower-precision values, here 4 bits per weight, which shrinks the model and makes efficient CPU inference possible.

Why Quantize?

  • Smaller size. Reduces model size dramatically, from tens of GB to just a few GB.
  • Faster inference. Smaller models load faster and process more efficiently.
  • CPU-friendly. Enables real-time inference without GPU acceleration.

TernaryPhysics-7B uses 4-bit quantization tuned to balance size reduction against quality loss, so infrastructure reasoning stays accurate after compression.
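To make the idea concrete, here is a minimal sketch of symmetric block-wise 4-bit quantization, a common scheme in CPU inference engines. This is an illustration of the technique, not the exact method TernaryPhysics-7B ships with:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Symmetric block-wise 4-bit quantization (illustrative sketch).

    Each block of 32 weights shares one float scale; the weights
    themselves are rounded to integers in the signed 4-bit range.
    """
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map block max to 7
    scale[scale == 0] = 1.0                             # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reverse the mapping: integer code times its block's scale.
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_4bit(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max reconstruction error: {err:.4f}")
```

The per-block scale is what preserves quality: rounding error is bounded by half a quantization step within each block, rather than across the whole tensor.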

Optimized Inference

TernaryPhysics-7B uses an optimized inference engine designed for CPU execution. This enables real-time conversational responses without GPU acceleration.

Fast CPU Inference

Real-time conversational responses on modern hardware.

Memory Efficient

Optimized to run on systems with 8GB+ RAM.

Cross-Platform

Linux, macOS, Windows. x86_64, ARM64. Works everywhere.

No Dependencies

No GPU, no special drivers. Just standard hardware.

How It Fits the Architecture

TernaryPhysics-7B is the "Tier 2" brain in our two-tier architecture. It works alongside the TNN™ (Tier 1):

Normal Operation
────────────────
TNN™ runs continuously → minimal resource usage
TernaryPhysics-7B sleeps → 0 CPU usage

Anomaly Detected / Human Query
──────────────────────────────
TNN™ detects anomaly → wakes TernaryPhysics-7B
TernaryPhysics-7B analyzes logs/metrics
Returns findings → goes back to sleep

This pattern minimizes resource usage. During normal operation, only the tiny TNN consumes resources. The heavyweight LLM only activates when needed.
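The wake-on-anomaly pattern above can be sketched in a few lines. The class and scoring heuristic here are stand-ins for illustration, not the actual product code:

```python
class TieredAgent:
    """Sketch of the two-tier wake/sleep pattern described above."""

    def __init__(self):
        self.llm_loaded = False           # Tier 2 is asleep by default

    def tnn_score(self, metrics):
        # Tier 1: cheap, always-on anomaly score (placeholder heuristic).
        return max(metrics) if metrics else 0.0

    def handle(self, metrics, threshold=0.9):
        if self.tnn_score(metrics) < threshold:
            return None                   # normal operation: LLM stays idle
        self.llm_loaded = True            # anomaly: wake the heavyweight LLM
        findings = f"anomaly at score {self.tnn_score(metrics):.2f}"
        self.llm_loaded = False           # analysis done: back to sleep
        return findings

agent = TieredAgent()
print(agent.handle([0.2, 0.3]))   # normal traffic
print(agent.handle([0.2, 0.95]))  # anomaly triggers the LLM
```

The key design point is that the expensive resource (LLM memory and compute) is held only for the duration of a single investigation.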

Hardware Requirements

TernaryPhysics-7B is designed to run on commodity hardware:

Component   Minimum               Recommended
RAM         8 GB                  16 GB (multi-agent on one host)
Disk        5 GB                  10 GB (logs + multi-agent)
CPU         Any x86_64 or ARM64   Modern multi-core
GPU         Not required          Not required

On modern hardware, you'll get real-time conversational responses. Older hardware still works, just with slightly longer response times.
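A quick preflight against the table above can be done with the standard library alone. `check_requirements` is a hypothetical helper written for this post; the RAM probe uses POSIX `os.sysconf`, so it reports "unknown" on Windows:

```python
import os
import platform
import shutil

def check_requirements(min_ram_gb=8, min_disk_gb=5):
    # Thresholds mirror the documented minimums above.
    report = {"cpu": platform.machine(), "cores": os.cpu_count()}
    try:
        # POSIX-only: total physical memory = page size * page count.
        ram_gb = (os.sysconf("SC_PAGE_SIZE")
                  * os.sysconf("SC_PHYS_PAGES")) / 1024**3
        report["ram_ok"] = ram_gb >= min_ram_gb
    except (ValueError, OSError, AttributeError):
        report["ram_ok"] = None           # unknown on this platform
    free_gb = shutil.disk_usage(".").free / 1024**3
    report["disk_ok"] = free_gb >= min_disk_gb
    return report

print(check_requirements())
```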

Future Improvements

We're actively working on:

  • TNN™ integration. Using the TNN™ to accelerate LLM inference.
  • Infrastructure fine-tuning. Training on infrastructure-specific data for better technical understanding.
  • Smaller models. Exploring smaller models for resource-constrained environments.
  • Efficiency improvements. Continuous optimization for faster, leaner inference.

For more details on how the model fits into the broader architecture, see our Architecture documentation.