This article explains the security posture of our AI chat layer as architecture rather than marketing: which classes of models we expose, what “no training on your chats” means in practice, and how the Groq-powered routes deliver order-of-magnitude faster token generation than many general-purpose cloud APIs.
Internally curated models, not a random model zoo
Every chat model visible in WebVoice is registered in our control plane (display name, provider, credit cost). Administrators enable or retire endpoints deliberately. That means you are not exposed to arbitrary third-party models that were never reviewed: the catalogue is a closed list, versioned with migrations, aligned with billing and rate limits.
Safeguard-oriented weights (for example, OpenAI GPT-OSS variants configured for stronger alignment) can be offered side by side with general assistants. You choose the risk profile per thread (customer-facing answers versus internal brainstorming) without accidentally mixing policies.
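The closed catalogue described in this section can be pictured as a lookup that rejects anything not explicitly registered and enabled. The following is a minimal sketch, not the WebVoice control plane itself: the `ChatModel` type, the `resolve_model` helper, the model IDs, and the credit costs are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatModel:
    """One catalogue entry: the fields the article says are registered."""
    model_id: str        # hypothetical identifier, not a real WebVoice ID
    display_name: str
    provider: str
    credit_cost: float
    enabled: bool = True  # administrators enable or retire entries


# Hypothetical catalogue contents, for illustration only.
CATALOGUE = {
    m.model_id: m
    for m in [
        ChatModel("safeguard-20b", "Safeguard 20B", "groq", 0.5),
        ChatModel("llama-3.1-8b-instant", "Llama 3.1 8B Instant", "groq", 0.5),
        ChatModel("general-assistant", "General Assistant", "other", 1.0),
        ChatModel("old-assistant", "Old Assistant", "other", 1.0, enabled=False),
    ]
}


def resolve_model(model_id: str) -> ChatModel:
    """Return the catalogue entry, or fail if it is unknown or retired."""
    model = CATALOGUE.get(model_id)
    if model is None or not model.enabled:
        raise LookupError(f"model {model_id!r} is not in the enabled catalogue")
    return model
```

Because every request resolves through one allow-list, a thread can only be pinned to a model that was deliberately registered, and retiring an entry takes effect for every caller at once.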
Retention: what we do not do
We do not operate a consumer-style product that silently trains a single global model on all user content. Your prompts are processed to return a reply and to enforce quotas; they are not sold as training fodder.
Inference providers may keep ephemeral logs for a limited window to detect misuse; that is different from long-term retention for product analytics. Always read the latest provider terms—Groq publishes its data handling alongside performance numbers.
Why Groq is part of the story
Groq hosts large language models on its LPU™ inference hardware. For workloads that fit the context window, published throughput reaches hundreds to a thousand output tokens per second on several checkpoints—far above what many conventional GPU clouds achieve for the same class of open-weights models.
In WebVoice we map those endpoints into the same credit system as other chat providers: Groq-backed chats often cost fewer credits per message (see your live dashboard) while returning answers faster—ideal for interactive assistants and high-volume triage.
The Groq numbers below come from our internal reference table, compiled from public materials at integration time. The bars for ChatGPT/OpenAI, Google Gemini, and generic APIs are illustrative: order-of-magnitude comparisons drawn from typical public benchmark ranges, not measured by WebVoice on your tenant. All values are throughput indicators, not latency SLAs for your specific prompt size.
Chart: Output tokens per second (higher is faster generation)
Legend: Groq (WebVoice integration table); ChatGPT / OpenAI API (indicative); Gemini API (indicative); other indicative baselines
ChatGPT and Gemini values are rounded illustrative output tok/s from common public benchmark bands (model tier, region, and load change real figures). They are shown for orientation only.
How to read the chart
- Token throughput (tok/s) measures how quickly the model can stream output tokens under reference conditions; real chats vary with prompt length, batching, and network.
- Green (Groq) bars use the same model IDs we store for pricing and docs (e.g. Safeguard 20B, Llama 3.1 8B Instant, Qwen3 Fast).
- Orange (ChatGPT/OpenAI) and blue (Gemini) bars use indicative tok/s in the ballpark of GPT-4o-class and Flash-tier streaming reports—not live measurements against your OpenAI or Google account.
- Grey bars illustrate unnamed generic or high-latency stacks for contrast.
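To make the tok/s figures concrete, throughput can be turned into an approximate streaming time for a reply of a given length. This is a rough sketch under stated assumptions: the function name and the two example rates are hypothetical round numbers, not values from the chart. It also ignores time to first token, prompt processing, and network delay.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate time to stream a reply at a steady output rate.

    Covers token generation only; time to first token and network
    latency are deliberately left out.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical round numbers for a 500-token reply.
fast = generation_seconds(500, 1000.0)  # 0.5 seconds
slow = generation_seconds(500, 100.0)   # 5.0 seconds
print(f"fast route: {fast:.1f}s, slow route: {slow:.1f}s")
```

A tenfold difference in throughput becomes a tenfold difference in streaming time for the same reply, which is why the gap matters most for long answers and high-volume workloads.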
Credits: faster does not mean “free”
Even at Groq-level speed, each completion still consumes credits. Typical WebVoice defaults: Groq-class routes often use 0.5 credits per message, while other providers often use 1. Exact numbers appear in your dashboard.
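As a worked example of the credit arithmetic, the sketch below totals credits for a batch of messages using the typical defaults quoted above. The function name and the message count are hypothetical, and your dashboard remains the source of truth for real rates.

```python
# Typical defaults quoted in the article; actual rates may differ per tenant.
TYPICAL_CREDITS_PER_MESSAGE = {
    "groq": 0.5,
    "other": 1.0,
}


def credits_used(messages: int, provider_class: str) -> float:
    """Total credits for a number of messages at the typical default rate."""
    if messages < 0:
        raise ValueError("messages must be non-negative")
    return messages * TYPICAL_CREDITS_PER_MESSAGE[provider_class]


# Hypothetical workload: 200 messages on each class of route.
print(credits_used(200, "groq"))   # 100.0
print(credits_used(200, "other"))  # 200.0
```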