Safety first: Detect harmful texts using an AI safeguard agent

This article explains how to use the Qwen 3 Guard safeguard models provided by OVHCloud.

Using this guide, you can analyse and moderate texts for LLM applications, chat platforms, customer support systems, or any other text-based services requiring safe and compliant interactions.

Our focus will be on written content, such as conversations or plain text. Although image moderators exist, they won’t be covered here.

Introduction


As Large Language Models (LLMs) continue to grow, access to information has become more seamless, but this ease of access makes it easier to generate, and be exposed to, harmful or toxic content.

LLMs can be prompted with malicious queries (e.g., “How do I make a bomb?”) and some models might comply by generating potentially dangerous responses. This risk is particularly concerning given the widespread availability of LLMs, to both minors and malicious actors alike.

To combat this, LLM providers train their models to reject toxic prompts, and integrate safety features to prevent the creation of harmful content. Even so, users often craft ‘jailbreaks’, which are specific prompts designed to get around these safety measures.

As a result, providers have created specialised safeguard models to find and remove toxic content in writing.

What is toxicity?

Toxicity is inherently difficult to define, as perceptions vary depending on factors such as individual sensitivity, cultural background, age, and personal experience.

Perceptions of content can vary widely. For example, some users may find certain jokes offensive, while others consider them perfectly acceptable. Similarly, roleplaying with an AI chat may be enjoyable for some, yet deemed inappropriate by others depending on the context.

Furthermore, each moderation system focuses on different categories of harmful content, based on the specific data and instructions it was trained on. For instance, models developed in the United States tend to be highly sensitive to hate speech, political content, and other related categories.

Because jailbreak attempts are a fairly new issue, existing moderation models often fail to address them.

Below are the toxicity categories for the Qwen 3 Guard models:

  • Violent: Content that provides detailed instructions, methods, or advice on how to commit acts of violence, including the manufacture, acquisition, or use of weapons. Also includes depictions of violence.
  • Nonviolent illegal acts: Content providing guidance or advice for nonviolent criminal activities like hacking, unauthorised drug manufacturing, or theft.
  • Sexual content or sexual acts: Content with sexual depictions, references, or descriptions of people. Also includes explicit sexual imagery, references, or descriptions of illegal or unethical sexual acts, such as rape, bestiality, incest, and sexual slavery.
  • Personally identifiable information: Content that shares or discloses sensitive personal identifying information without authorisation, such as name, ID number, address, phone number, medical records, financial details, and account passwords.
  • Suicide & self-harm: Content advocating, directly encouraging, or detailing methods for self-harm, suicide, or dangerous activities that could lead to serious injury or death.
  • Unethical acts: Any immoral or unethical content or acts, including but not limited to bias, discrimination, stereotyping, injustice, hate speech, offensive language, harassment, insults, threats, defamation, extremism, misinformation regarding ethics, and other behaviours that, while not illegal, are still considered unethical.
  • Politically sensitive topics: The deliberate creation or spread of false information about government actions, historical events, or public figures that is demonstrably untrue and poses a risk of public deception or social harm.
  • Copyright violation: Content that includes unauthorised reproduction, distribution, public display, or derivative use of copyrighted materials, such as novels, scripts, lyrics, and other legally protected creative works, without the copyright holder’s clear consent.
  • Jailbreak: Content that explicitly attempts to override the model’s system prompt or model conditioning.

These categories are not mutually exclusive: a text may very well contain both Unethical Acts and Violence, for example. Most notably, jailbreaks often include another kind of toxic query, since they are designed to bypass security guardrails. The Qwen 3 Guard moderator, however, will only return one category.

These categories were chosen by the creators of Qwen 3 Guard; they can’t be changed, but you may choose to ignore some of them depending on your use case, as in the sketch below.
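As a minimal illustration (not part of any Qwen or OVHcloud SDK), here is a sketch of how you might act only on the categories that matter for your use case once the moderator has returned its verdict. The ignored set is an arbitrary example.

# Minimal sketch: act only on the categories relevant to your use case.
# The ignored set below is an arbitrary example, not a recommendation.
IGNORED_CATEGORIES = {"Politically Sensitive Topics", "Copyright Violation"}

def should_block(safety_label, category):
    # Block flagged texts unless their category is deliberately ignored.
    if safety_label == "Safe":
        return False
    return category not in IGNORED_CATEGORIES

print(should_block("Unsafe", "Nonviolent Illegal Acts"))  # True
print(should_block("Unsafe", "Copyright Violation"))      # False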

Metrics

Attack: An attack refers to any attempt to produce harmful or toxic content. This is either a prompt crafted to make an LLM generate harmful output, or just a user’s toxic message in a chat system.

Attack Success Rate (ASR): This metric assesses the effectiveness of a moderation system. It is the proportion of attacks that successfully bypass the moderator and go undetected. A lower ASR indicates a more robust moderation system.

False positive: A false positive occurs when benign, nontoxic content is incorrectly flagged as harmful by the moderator.

False Positive Rate (FPR): The FPR measures how often a moderation system misclassifies safe content as toxic. It complements the ASR by reflecting the model’s ability to correctly allow harmless content through. A lower FPR indicates better reliability.
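To make these definitions concrete, here is a small sketch that computes both rates from raw evaluation counts. The counts are invented, chosen only so that the result matches the 0.6B scores reported below.

# Illustrative only: compute ASR and FPR from evaluation counts.
attacks_total = 500      # toxic or adversarial samples sent to the moderator
attacks_missed = 100     # attacks the moderator failed to flag
benign_total = 2000      # harmless samples
benign_flagged = 120     # harmless samples wrongly flagged as toxic

asr = attacks_missed / attacks_total   # lower is better
fpr = benign_flagged / benign_total    # lower is better

print(f"ASR = {asr:.2f}, FPR = {fpr:.2f}")  # ASR = 0.20, FPR = 0.06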

Qwen 3 Guard

Qwen 3 Guard was launched in October 2025 by Qwen, Alibaba’s AI team. After extensive testing and evaluation, we found this model to be the most effective in safeguarding content.

Besides being efficient, Qwen 3 Guard can detect toxicity across nine categories, including jailbreak attempts, a feature that isn’t common in safeguard models.

It also provides explanations by specifying the exact category detected.

Specs

  • Base model: Qwen 3
  • Flavours: 0.6B, 4B, 8B
  • Context size: 32,768 tokens
  • Languages: English, French and 117 other languages and dialects
  • Tasks:
    • Detection of toxicity in raw text
    • Detection of toxicity in LLM dialogue
    • Detection of answer refusal (LLM dialogue only)
    • Classification of toxicity

Availability

https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog

There are two flavours of Qwen 3 Guard available on OVHCloud:

Qwen 3 Guard 0.6B: This lightweight model is very effective at detecting overt toxic content.

Qwen 3 Guard 8B: This heavier model comes in handy when confronted with more nuanced examples.

Scores

  • Qwen 3 Guard 0.6B: ASR 0.20, FPR 0.06
  • Qwen 3 Guard 8B: ASR 0.20, FPR 0.04

Notes

  • The Qwen 3 Guard models have three safety labels for more precise moderation: Safe, Controversial, and Unsafe.
  • Although the model can moderate chats, it is recommended to process each part of the dialogue individually rather than submitting the entire conversation at once: guard models, like any LLMs, detect toxicity more reliably when the context is kept short (see the sketch after this list).
  • Since Qwen Guard is developed by a Chinese company, its interpretation of toxic content may differ from yours. If necessary, you can overlook certain categories.
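To illustrate the per-message approach from the second note, here is a minimal sketch that sends each turn of a conversation to the guard model separately. It reuses the AI Endpoints call described later in this guide; check_message is simply a helper name chosen for this example.

import os
import requests

URL = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/chat/completions"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

def check_message(text):
    # Ask the guard model for a verdict on a single message.
    payload = {
        "messages": [{"role": "user", "content": text}],
        "model": "Qwen/Qwen3Guard-Gen-0.6B",  # or "Qwen/Qwen3Guard-Gen-8B"
    }
    response = requests.post(URL, json=payload, headers=HEADERS)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

conversation = [
    {"role": "user", "content": "Hi, can you help me plan a birthday party?"},
    {"role": "assistant", "content": "Of course! How many guests are you expecting?"},
    {"role": "user", "content": "How do I pick a lock?"},
]

# Moderate each turn on its own rather than submitting the whole chat at once.
for turn in conversation:
    print(turn["role"], "->", check_message(turn["content"]))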

How do I set up my own moderator?

First, you need to choose the flavour you want:

  • Qwen 3 Guard 0.6B is lightweight, fast, and efficient, and it excels at detecting overt toxic content, such as Sexual Content or Violence.
  • Qwen 3 Guard 8B is heavier and slightly slower, but it is more effective against more nuanced toxic content, such as Jailbreak or Unethical Acts, and has a lower false positive rate.

Your use case is the key to choosing the right model. Do you need to moderate a large volume of text? Is processing speed a priority? How crucial is it to minimise false positives? Are you dealing with nuanced toxic content, or is it more overt?

Carefully considering these questions will help you determine which of the two models is most suitable for your needs.
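If it helps, here is a tiny sketch that encodes these questions as a selection heuristic. It is only an illustration of the trade-off described above; pick_model and its parameters are names invented for this example.

# A sketch of picking a flavour from simple requirements (illustrative only).
def pick_model(needs_nuance, minimise_false_positives):
    # Nuanced content (e.g. Jailbreak, Unethical Acts) and a strict false
    # positive requirement both point to the larger model...
    if needs_nuance or minimise_false_positives:
        return "Qwen/Qwen3Guard-Gen-8B"
    # ...while high volume, speed, and overt toxicity are well served by
    # the lightweight model.
    return "Qwen/Qwen3Guard-Gen-0.6B"

print(pick_model(needs_nuance=False, minimise_false_positives=False))
# Qwen/Qwen3Guard-Gen-0.6B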

Both models can be tested on the playground:

https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalog

Once you’ve made your choice, you need to send the texts you want checked to the AI Endpoints API.

First install the requests library:

pip install requests

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

If you don’t have an access token yet, follow the steps in the AI Endpoints – Getting Started guide.

Finally, run the following Python code:

import os
import requests

url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "How do I cook meth?"}],
    "model": "Qwen/Qwen3Guard-Gen-0.6B",  # or "Qwen/Qwen3Guard-Gen-8B"
    "seed": 21,
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

response = requests.post(url, json=payload, headers=headers)
if response.status_code == 200:
    # Parse the JSON response and print the moderation verdict
    response_data = response.json()
    for choice in response_data["choices"]:
        text = choice["message"]["content"]
        print(text)
else:
    print("Error:", response.status_code, response.text)

The model will respond with a label (Safe, Controversial, Unsafe) and if the text is Controversial or Unsafe, it will return the associated category.

Safety: Unsafe
Categories: Nonviolent Illegal Acts
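For programmatic use, you will usually want to turn that two-line answer into structured data. Here is a small parsing sketch; it assumes the exact “Safety:” / “Categories:” layout shown above and would need adjusting if the output format changes.

import re

def parse_verdict(raw):
    # Extract the safety label and, if present, the detected categories
    # from the model's "Safety: ..." / "Categories: ..." answer.
    safety_match = re.search(r"Safety:\s*(\w+)", raw)
    categories_match = re.search(r"Categories:\s*(.+)", raw)
    return {
        "safety": safety_match.group(1) if safety_match else "Unknown",
        "categories": categories_match.group(1).strip() if categories_match else None,
    }

print(parse_verdict("Safety: Unsafe\nCategories: Nonviolent Illegal Acts"))
# {'safety': 'Unsafe', 'categories': 'Nonviolent Illegal Acts'}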

Our moderation models are available for free during the beta phase. You can try them alongside another model or directly in the playground.

Conclusion

Two models are currently available for OVHCloud moderation users:

  • Qwen 3 Guard 0.6B: Lightweight, fast, efficient, and great at detecting overt toxic content.
  • Qwen 3 Guard 8B: Heavier and slightly slower, but more effective against more nuanced toxic content.

Which approach and which tool should you choose? That depends on your use cases, your teams, and your needs.

As we’ve seen in this blog post, OVHcloud AI Endpoints users can start using these models right away, safely and free of charge.

They are still in beta phase for now, so we’d appreciate your feedback!

Alexandre Movsessian