What does “safeguard” mean for chat models?

Safeguard-oriented weights are configured for stronger refusals and closer alignment with policy, which makes them useful for customer-facing or sensitive workflows.

Why is Groq mentioned in the throughput chart?

We route some models through Groq’s inference stack for very low latency. Throughput (tok/s) is an indicative comparison; your workload may differ.

Does WebVoice train on my chats?

We do not use your conversations to fine-tune a global consumer model. See the privacy and AI policy pages for provider retention and your controls.

How should I read the green vs orange vs blue bars?

Green bars highlight Groq-hosted models from our table; orange (ChatGPT/OpenAI) and blue (Gemini) bars are rounded illustrative baselines for other APIs; grey bars show unnamed generic or high-latency stacks for contrast.

Where can I read more about data locality?

Start with the privacy policy and the technical security article linked from this section of the site.

Secure AI models, Groq speed & retention

This article explains the security posture of our AI chat layer as architecture rather than marketing: which classes of models we expose, what “no training on your chats” means in practice, and how the Groq-powered routes deliver token generation up to roughly an order of magnitude faster than many general-purpose cloud APIs.

Retention: WebVoice does not use your conversations to fine-tune proprietary models. Third-party inference hosts apply their own short technical retention for abuse monitoring; we route through providers that meet our compliance requirements and document the rest in our Privacy Policy and AI Policy.

Internally curated models, not a random model zoo

Every chat model visible in WebVoice is registered in our control plane (display name, provider, credit cost). Administrators enable or retire endpoints deliberately. That means you are not exposed to arbitrary third-party models that were never reviewed: the catalogue is a closed list, versioned with migrations, aligned with billing and rate limits.
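As a rough illustration of what a closed, reviewed catalogue can look like, here is a minimal Python sketch. It is not WebVoice's actual schema: the class names, field names, model IDs, and credit values are hypothetical and chosen only to mirror the fields named above (display name, provider, credit cost) and the enable/retire step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One reviewed chat model in the catalogue (hypothetical schema)."""
    model_id: str
    display_name: str
    provider: str
    credit_cost: float
    enabled: bool = True  # administrators flip this to retire an endpoint


class ModelCatalogue:
    """Closed list: only registered and enabled models can be resolved."""

    def __init__(self, entries):
        self._entries = {entry.model_id: entry for entry in entries}

    def resolve(self, model_id: str) -> ModelEntry:
        entry = self._entries.get(model_id)
        if entry is None:
            # Unknown IDs are rejected instead of being forwarded to a provider.
            raise KeyError(f"model not in catalogue: {model_id}")
        if not entry.enabled:
            raise PermissionError(f"model retired: {model_id}")
        return entry

    def visible(self):
        """Display names of the models users are allowed to pick."""
        return sorted(e.display_name for e in self._entries.values() if e.enabled)


# Illustrative entries only; IDs and costs are assumptions.
catalogue = ModelCatalogue([
    ModelEntry("safeguard-20b", "GPT-OSS Safeguard 20B", "groq", 0.5),
    ModelEntry("general-assistant", "General Assistant", "other", 1.0),
    ModelEntry("old-model", "Old Model", "other", 1.0, enabled=False),
])

print(catalogue.visible())
```

The point of the sketch is the failure mode: a request for a model that was never registered, or one that has been retired, fails at lookup time rather than reaching a third-party endpoint.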

Safeguard-oriented weights (for example OpenAI GPT-OSS variants configured for stronger alignment) can be offered side by side with general assistants. You choose the risk profile per thread, for example customer-facing answers versus internal brainstorming, without mixing policies accidentally.

Retention: what we do not do

We do not operate a consumer-style product that silently trains a single global model on all user content. Your prompts are processed to return a reply and to enforce quotas; they are not sold as training fodder.

Inference providers may keep ephemeral logs for a limited window to detect misuse; that is different from long-term retention for product analytics. Always read the latest provider terms; Groq publishes its data handling alongside its performance numbers.

Why Groq is part of the story

Groq hosts large language models on its LPU™ inference hardware. For workloads that fit the context window, published throughput reaches hundreds to a thousand output tokens per second on several checkpoints—far above what many conventional GPU clouds achieve for the same class of open-weights models.

In WebVoice we map those endpoints into the same credit system as other chat providers: Groq-backed chats often cost fewer credits per message (see your live dashboard) while returning answers faster—ideal for interactive assistants and high-volume triage.

The Groq numbers below come from our internal reference table, compiled from public materials at integration time. The bars for ChatGPT/OpenAI, Google Gemini, and generic APIs are illustrative order-of-magnitude comparisons drawn from typical public benchmark ranges; WebVoice did not measure them on your tenant. All values are throughput indicators, not latency SLAs for your specific prompt size.

Output tokens per second (higher is faster generation)

Groq (WebVoice integration table)
GPT-OSS 120B on Groq LPU™: 500 tok/s
GPT-OSS Safeguard 20B on Groq LPU™: 1000 tok/s
GPT-OSS 20B on Groq LPU™: 1000 tok/s
Llama 3.1 8B Instant on Groq: 560 tok/s
Llama 3.3 70B Versatile on Groq: 280 tok/s
Qwen3 Fast on Groq: 662 tok/s
Kimi K2 instruct (Groq route, published TPS in stack): 200 tok/s

ChatGPT / OpenAI API (indicative)
GPT-4o class, indicative output tok/s: 78 tok/s

Gemini API (indicative)
Flash-tier, indicative output tok/s: 185 tok/s

Other indicative baselines
Typical third-party chat API (indicative throughput): 95 tok/s
High-latency multi-hop cloud stack (indicative): 42 tok/s

ChatGPT and Gemini values are rounded illustrative output tok/s from common public benchmark bands (model tier, region, and load change real figures). They are shown for orientation only.

How to read the chart

Token throughput (tok/s) measures how quickly the model can stream output tokens under reference conditions; real chats vary with prompt length, batching, and network.
Green (Groq) bars use the same model IDs we store for pricing and docs (e.g. Safeguard 20B, Llama 3.1 8B Instant, Qwen3 Fast).
Orange (ChatGPT/OpenAI) and blue (Gemini) bars use indicative tok/s in the ballpark of GPT-4o-class and Flash-tier streaming reports—not live measurements against your OpenAI or Google account.
Grey bars illustrate unnamed generic or high-latency stacks for contrast.
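To make the tok/s figures more concrete, the sketch below converts a throughput value into an approximate streaming time for one reply. It is only back-of-the-envelope arithmetic: the 500-token reply length is an assumption, the function name is ours, and the throughput values are the indicative chart figures rather than measurements.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate time to stream a reply, ignoring time to first token,
    queuing, and network overhead."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Indicative figures taken from the chart; not live measurements.
INDICATIVE_TOK_PER_S = {
    "GPT-OSS Safeguard 20B on Groq": 1000,
    "Llama 3.3 70B Versatile on Groq": 280,
    "Gemini API (Flash-tier)": 185,
    "ChatGPT / OpenAI API (GPT-4o class)": 78,
}

REPLY_TOKENS = 500  # assumed reply length, for illustration only

for label, tok_per_s in INDICATIVE_TOK_PER_S.items():
    seconds = generation_seconds(REPLY_TOKENS, tok_per_s)
    print(f"{label}: {seconds:.2f} s")
```

With these indicative figures, the assumed 500-token reply streams in about 0.5 s at 1000 tok/s and about 6.4 s at 78 tok/s, before any time to first token, queuing, or network overhead is added.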

Credits: faster does not mean “free”

Even with Groq-level speed, each completion still consumes credits. As typical WebVoice defaults, Groq-class routes often use 0.5 credits per message and other providers often use 1; the exact numbers appear in your dashboard.
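A quick way to budget is to multiply expected message counts by the per-message rate of each route. The sketch below assumes the typical defaults quoted above (0.5 and 1 credit per message); the route labels "groq" and "other" and the message volumes are hypothetical, and your dashboard remains the source of truth for real rates.

```python
from decimal import Decimal

# Assumed typical defaults; the real per-model rates are in your dashboard.
CREDITS_PER_MESSAGE = {
    "groq": Decimal("0.5"),
    "other": Decimal("1"),
}


def estimate_credits(messages_by_route: dict) -> Decimal:
    """Sum credits for a mix of routes, given message counts per route."""
    total = Decimal("0")
    for route, count in messages_by_route.items():
        if count < 0:
            raise ValueError(f"negative message count for route: {route}")
        total += CREDITS_PER_MESSAGE[route] * count
    return total


# Hypothetical monthly volume: 1200 Groq-routed and 300 other messages.
estimate = estimate_credits({"groq": 1200, "other": 300})
print(f"Estimated credits: {estimate}")
```

Decimal is used instead of float so that half-credit rates add up exactly over many messages.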
