How Knowledge Distillation Turns LLMs into Smarter Transformers

September 24, 2025


At ActiveFence, we’re focused on one thing: making AI safe in the real world. As generative models power more user-facing experiences, keeping them aligned with evolving safety, abuse, and compliance requirements becomes a high-stakes challenge, especially at scale. From misinformation and prompt injection to sexual solicitation and hate speech, the ways people exploit generative systems are getting more subtle, creative, and fast-moving. It’s not just about detecting harmful content — it’s about staying accurate, resilient, and fast enough to respond in real time.

That’s exactly what our Real-time Guardrails and Observability platform is built for: tailored to high-risk abuse areas and capable of running in production without sacrificing accuracy, latency, or cost. Because in GenAI safety, accuracy is everything, and that’s what enables trusted, scalable deployment.

TL;DR

To support real-time GenAI safety at scale, we needed a way to deliver both accuracy and efficiency, without compromise. By leveraging the knowledge of Shield-Gemma, an open-source LLM, and applying a dual distillation framework (label-based + feature-based), we transferred its intelligence into a smaller, faster transformer optimized for production.

The result? A model that’s not just lightweight and cost-effective, but also highly accurate in detecting abuse, even in complex, evasive domains. This technique now powers the real-time protections at the core of ActiveFence’s Guardrails, enabling scalable moderation with no trade-off between safety and speed.

The Big Problem: Generalization Fail

Today’s online environments are dynamic, fast-moving, and often adversarial, and content safety systems need to keep up. Simple rule-based approaches or static classifiers can’t handle the growing complexity of coded language, cultural nuance, and constantly shifting abuse patterns. That’s where LLMs shine.

Trained on massive, diverse datasets, large language models are exceptionally good at generalization, spotting patterns in messy, real-world data, interpreting intent, and decoding subtle signals like leetspeak (“s3lling r@r3 it3ms”), emojis, or obfuscated abuse.

LLMs capture extensive world knowledge by being trained on massive datasets, embedding that knowledge within billions of parameters to understand and generate human-like text. Their ability to identify patterns, understand complex language nuances, and generalize across varied contexts makes them invaluable for numerous applications, including content moderation. So why not replace our Transformers with an LLM? It’s not that easy. Even open-source LLMs are substantial in scale, and their size and computational demands make them impractical for direct deployment at the throughput and latency we provide to our partners.

So, now what?

We had to ask ourselves: can we take the smarts of LLMs and infuse them into our own models? The answer is yes, using a method known as knowledge distillation.

There are multiple ways to go about this. We could have fine-tuned our models on data labeled by an LLM. However, that alone is not enough: it doesn’t capture the rich “brainwaves”, aka hidden states, in which an LLM encodes its knowledge at a much finer granularity. Instead, we went with a teacher-student framework. In this method, our student model doesn’t just mimic the LLM’s outputs; it learns to align its hidden states with those of its teacher. This gives it a shot at inheriting the chosen LLM’s nuanced understanding and, potentially, its ability to generalize to unseen data. An added benefit of this approach is that it reduces the need for collecting and annotating more data. By transferring the teacher’s expertise directly, we bypass the intensive process of expanding labeled datasets, saving both time and resources.

Among the many LLMs available, Shield-Gemma, an open-source LLM developed by Google, is specifically designed for content moderation. Based on Gemma 2, it combines world knowledge with sharp expertise in identifying harmful behaviors. When tested on our internal evaluation sets, it was better at catching issues our models missed and more effective at filtering out content wrongly flagged as harmful, such as sexually explicit content or harassing behavior. This made Shield-Gemma the natural choice as our teacher model for the distillation process.

How We Pulled It Off

The plan was:

  • Align the last hidden states of Shield-Gemma (teacher) with those of our Transformer-based model (student).
  • Add human annotations for accuracy.
  • Let Shield-Gemma’s world knowledge improve the student.

To illustrate the process, we focused on a test set drawn from one of our trickiest abuse areas: sex solicitation. This involves spotting content that offers, advertises, or negotiates for sexual services, a clear breach of policies. Like many violations, it’s a minefield for false positives, and we turned to Shield-Gemma to help us navigate this issue. Why sex solicitation? Because it’s notorious for being context-dependent, requiring models to differentiate between harmful solicitation and innocuous discussions of sensitive topics. Without strong generalization, it’s a disaster waiting to happen.

Here’s a breakdown of our methodology:

1. Inputs

Our baseline Transformer got the raw text, while Shield-Gemma received the text wrapped in its formatted prompt, adapted for detecting sex-solicitation violations, as follows:

def prompt_format(text):
    # Safety policy describing the sex-solicitation abuse area.
    safety_policy = """* "No sexual solicitation": no content that obtains, offers, advertises, or negotiates for sexual
services. It should not describe sexual solicitation, or sexual relationships/services in exchange for money, in a positive
or neutral manner, nor offer payment for private/personal videos, chats, or time."""

    # Wrap the raw text in Shield-Gemma's expected prompt template.
    prompt = f"""You are a policy expert trying to help determine whether chatbot-generated content is in violation of the
defined safety policies.

<start_of_turn>
Chatbot Response: {text.strip()}
<end_of_turn>

Our safety principle is defined in the below:

{safety_policy.strip()}

Does the chatbot response violate the above principle? Your answer must start with 'Yes' or 'No'. And then walk through
step by step to be sure we answer correctly.
"""
    return prompt

2. Tokenization

Each model uses its own tokenizer: the teacher tokenizes the full formatted prompt, while the student tokenizes the raw text.

from transformers import AutoTokenizer

teacher_tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
teacher_inputs = teacher_tokenizer(prompt_format(text), return_tensors="pt")

student_tokenizer = AutoTokenizer.from_pretrained("activefence/baseline-checkpoint")
student_inputs = student_tokenizer(text, return_tensors="pt")

3. Hidden State Alignment

Extract the last hidden states of both teacher and student models:

import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

teacher_model = AutoModelForCausalLM.from_pretrained("google/shieldgemma-2b").to('cuda')
teacher_inputs = {key: value.to('cuda') for key, value in teacher_inputs.items()}
with torch.no_grad():
    teacher_outputs = teacher_model(**teacher_inputs, output_hidden_states=True)
# Last hidden layer, last token position: the teacher's "summary" of the prompt
teacher_hidden_states = teacher_outputs.hidden_states[-1][0, -1, :]

student_model = AutoModelForSequenceClassification.from_pretrained("activefence/baseline-checkpoint").to('cuda')
student_inputs = {key: value.to('cuda') for key, value in student_inputs.items()}
student_outputs = student_model(**student_inputs, output_hidden_states=True)
# Last hidden layer, first ([CLS]-style) token position for the student
student_hidden_states = student_outputs.hidden_states[-1][0, 0, :]

4. Loss Functions

Mean Squared Error (MSE) was used to measure how closely the student mimics the teacher. Why MSE? It works on continuous vectors, so it’s a natural fit for aligning hidden states without unnecessary guesswork. It also penalizes large errors more heavily than small ones, ensuring that noticeable discrepancies between the teacher and student are effectively resolved, paving the way for better generalization.

distillation_loss = mse_loss(student_hidden_states, projected_teacher_hidden_states)
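
Two objects used above aren’t defined in the snippet: the MSE criterion and the projection layer that maps the teacher’s hidden size into the student’s. A minimal sketch, reading the sizes from each model’s config rather than hard-coding them:

import torch.nn as nn

# The teacher and student hidden sizes generally differ, so the teacher's vector
# is mapped into the student's space with a small learned linear projection.
teacher_dim = teacher_model.config.hidden_size   # 2304 for shieldgemma-2b
student_dim = student_model.config.hidden_size   # depends on the student checkpoint

projection_layer = nn.Linear(teacher_dim, student_dim).to('cuda')
mse_loss = nn.MSELoss()

projected_teacher_hidden_states = projection_layer(teacher_hidden_states)

Because the projection has learnable weights of its own, its parameters are optimized alongside the student’s during training (see the setup sketch before the training loop).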

A classification loss, based on the student’s logits and the ground-truth (human-annotated) labels, was used to keep the student focused on the task:

student_logits = student_outputs.logits
classification_loss = nn.CrossEntropyLoss()(student_logits, labels.long())

We combined these losses using a weighted approach controlled by the α parameter. Based on trial-and-error experimentation, we set α = 0.8, giving more weight to Shield-Gemma’s hidden-state loss. This value struck a balance, ensuring the student effectively absorbed key insights from the teacher while maintaining task-specific accuracy.

loss = alpha * distillation_loss + (1 - alpha) * classification_loss

5. Training

With just 5K samples and 3 epochs, the student started looking a lot like its teacher. We could certainly have tuned the hyper-parameters more carefully, but evaluating the tuning’s full impact would require more rigorous testing. For now, this was a proof-of-concept aimed at verifying whether the distillation process was actually happening.
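
The loop below also assumes an optimizer, the loss weight α, and a dataloader over the labeled samples. Those details aren’t spelled out in this post, so here is a minimal setup sketch with assumed hyper-parameters (learning rate, batch size):

import torch
from torch.utils.data import DataLoader

alpha = 0.8          # weight of the distillation (hidden-state) loss
num_epochs = 3

# train_dataset is assumed to yield dicts with 'text' and 'labels' (~5K samples)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# The projection layer is trained jointly with the student
optimizer = torch.optim.AdamW(
    list(student_model.parameters()) + list(projection_layer.parameters()),
    lr=2e-5,  # assumed learning rate
)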

Here is what the entire distillation training loop looked like:

for epoch in range(num_epochs):
    teacher_model.eval()
    student_model.train()
    total_loss = 0

    for batch in train_dataloader:
        # Tokenize the input texts
        teacher_texts = [prompt_format(text) for text in batch['text']]
        teacher_inputs = teacher_tokenizer(teacher_texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
        teacher_inputs = {key: value.to('cuda') for key, value in teacher_inputs.items()}
        student_inputs = student_tokenizer(batch['text'], return_tensors="pt", padding=True, truncation=True, max_length=512)
        student_inputs = {key: value.to('cuda') for key, value in student_inputs.items()}
        labels = batch['labels'].to('cuda')

        # Teacher hidden states (no gradient calculation for teacher)
        with torch.no_grad():
            teacher_outputs = teacher_model(**teacher_inputs, output_hidden_states=True)
            # Last hidden layer, last token position, for every example in the batch
            teacher_hidden_states = teacher_outputs.hidden_states[-1][:, -1, :]

        # Student hidden states: last hidden layer, first token position, for every example in the batch
        student_outputs = student_model(**student_inputs, output_hidden_states=True)
        student_hidden_states = student_outputs.hidden_states[-1][:, 0, :]

        # Apply projection to match teacher to student hidden state dimensions
        projected_teacher_hidden_states = projection_layer(teacher_hidden_states)

        # Calculate MSE loss between the projected teacher hidden states and the student hidden states
        distillation_loss = mse_loss(student_hidden_states, projected_teacher_hidden_states)

        # Classification loss using the student's logits and the human labels
        student_logits = student_outputs.logits
        classification_loss = nn.CrossEntropyLoss()(student_logits, labels.long())

        # Combine the distillation loss (hidden states) and the classification loss
        loss = alpha * distillation_loss + (1 - alpha) * classification_loss

        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
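
Once trained, the distilled student is used like any other sequence classifier, with no teacher, prompt, or projection layer in the loop at inference time. A minimal scoring sketch (assuming class index 1 corresponds to a policy violation):

student_model.eval()

def violation_score(text):
    inputs = student_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    with torch.no_grad():
        logits = student_model(**inputs).logits
    # Probability of the "violation" class (assumed to be index 1)
    return torch.softmax(logits, dim=-1)[0, 1].item()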

Did It Work?

The distillation process noticeably boosted the model’s performance on the test set. As you can see in the diagram below, the score distribution shifted closer to Shield-Gemma’s: Shield-Gemma’s lower scores aligned more closely with the human annotations, and so did the scores of the trained student model after distillation.

This outcome could have been attributed mostly to the classification loss (meaning the student learned mostly from the human annotations). However, visualizing what the student learned reflected the enhancements made by aligning the student model’s hidden states with the teacher’s.

To see what the student learned, we visualized the hidden states of 10 text samples:

  • Positives (violate the sex-solicitation policy)
    ‘trade feetpics for cookies’
    ‘Any lesbians 4 sale’
    ‘bros broke, needs a loan for a hooker’
    ‘looking for a sugar mama so i can play new world all day’
    ‘I paid a dick to suck my homeless lady’

 

  • Negatives (do not violate this policy):
    ‘Yes, 20$’
    ‘ANYONE SELLING T4 GEMS?’
    ‘Anyone stocking motes by chance? Happy to buy them off ya’
    ‘Selling mythril tools cheaper then market’
    ‘selling t5 gems under market price’

The hidden-state visualizations of these text samples are presented below for:

  1. The Student before distillation
  2. The Teacher after projection
  3. The Student after distillation

(Only the 50 most distinctive hidden-state dimensions are presented.)

As can be seen from the illustrations above, the baseline student’s hidden states were all over the place, barely resembling the teacher’s. But after distillation? The student’s hidden states looked very similar to the teacher’s. The model didn’t just copy Shield-Gemma’s answers; it learned how to think like Shield-Gemma. This, together with the tangible reduction in false positives, shows that the gains came not only from the human annotations via the classification loss, but also from aligning the student model’s hidden states with the teacher’s.
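
For readers who want to recreate this kind of view, here is a minimal sketch of how such a heatmap can be built, assuming “most distinctive” means the dimensions with the highest variance across the samples:

import numpy as np
import matplotlib.pyplot as plt

def plot_hidden_states(hidden_states, title, k=50):
    # hidden_states: array of shape (num_samples, hidden_dim),
    # one last-hidden-state vector per text sample
    top_dims = np.argsort(hidden_states.var(axis=0))[-k:]
    plt.imshow(hidden_states[:, top_dims], aspect='auto', cmap='viridis')
    plt.xlabel(f'top {k} hidden-state dimensions')
    plt.ylabel('sample')
    plt.title(title)
    plt.colorbar()
    plt.show()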

This transformation seems to have bridged the generalization gap, showcasing how we can let our Transformer-based models tackle tasks they used to fail at. And the cherry on top? New models distilled this way will not break a sweat meeting our computational constraints. They will remain fast, efficient, and ready to moderate at scale.

The Takeaway

Distilling Shield-Gemma’s world knowledge into our models has been a game-changer. It enabled us to significantly improve our models’ performance while still meeting our business requirements for cost and latency. By applying knowledge distillation in a teacher-student framework, we’ve built a solution that is faster, lighter, and nearly as smart as the original, and in some cases arguably better.

​​Despite its advantages, this method comes with certain limitations. The success of distillation heavily relies on the quality of the teacher model and the availability of well-annotated data for classification tasks. If the teacher model has biases or inaccuracies, these can be propagated to the student. Additionally, distillation requires careful tuning of hyper-parameters like the weight of the distillation loss (α), which can be time-consuming and resource-intensive. Finally, while the distilled models are smaller, they may still lack the full generalization ability of the teacher, especially in highly novel or nuanced contexts.

Although we haven’t tried it yet, we believe this method isn’t limited to Shield-Gemma; it should scale to any large classifier being distilled into a smaller one. And that’s just the beginning. We’ve demonstrated how we use Shield-Gemma, but we’re not stopping there: our approach allows us to distill knowledge from other LLMs as well, leading to a cumulative intelligence boost. Innovations like this keep us ahead in the never-ending fight for safer online spaces. The process wasn’t about cutting corners; it was about striking the right balance between computational efficiency and performance. If moderation is a battlefield, this distilled model is our secret sapper.

Want to Go Deeper?

This post introduced the core idea behind our dual distillation approach, but if you’re curious about how we actually implement it in practice, we cover the full method in our on-demand webinar.

It walks through:

  • How we combine automated and manual labeling with LLM-generated annotations
  • How knowledge from large models is transferred into lightweight transformers
  • And how this technique powers high-accuracy, low-latency safety models at scale

Watch the full webinar here – available anytime.

 
