Introduction

Modern search platforms rarely rely on a single retrieval strategy.
In practice, a production system often supports multiple search modes, such as:

  • Keyword search for exact identifiers and error codes
  • Semantic search for conceptual or explanatory queries
  • Structured search where natural language is translated into a query DSL (e.g., Lucene)

The challenge is not implementing these search engines individually — it is deciding, at query time, which engine to route a user query to.

This post describes a practical, ML-driven approach to intent-aware query routing, and dives deep into an important architectural trade-off:

Linear probing over frozen embeddings vs. end-to-end neural networks with partial or full fine-tuning

The focus is on engineering correctness, latency, and reliability, not model novelty.


Problem Statement

Consider a backend system that supports three search paths:

Example Query                               Intended Route
INC-88321                                   Keyword search
how to increase worker threads in helidon   Semantic (embedding) search
find all incidents where status is closed   Structured (NL → Lucene) search

The system must classify query intent before executing the search.

A purely rule-based solution quickly becomes brittle:

  • Regex rules explode in number
  • Query phrasing varies significantly
  • Maintenance cost increases over time

This makes machine learning a good fit — provided it is applied conservatively.


Framing the Problem as Intent Classification

At its core, this is a low-entropy, multi-class text classification problem.

We define routing-based labels:

Label            Meaning
ROUTE_KEYWORD    Exact / lexical match
ROUTE_EMBEDDING  Semantic similarity search
ROUTE_LUCENE     Structured query parsing

The classifier output is used only for routing, not for answering the query itself.

This distinction matters — misclassification cost is asymmetric:

  • Routing a structured query to keyword search often yields zero results
  • Routing an ambiguous query to semantic search usually degrades gracefully

We tackle this problem with SBERT, a widely used transformer model for sentence embeddings. SBERT is a generic encoder that can be adapted to intent classification in several ways. Below we discuss three approaches and the trade-offs between them.

Why SBERT?

SBERT converts a sentence into a dense numerical vector (embedding) that captures semantic meaning.

For example:

“enable log rotation in helidon”
“how to configure logging rollover in helidon”

These two sentences produce very similar vectors, even though the words differ.
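
Similarity between embeddings is usually measured with cosine similarity. A minimal sketch, using small toy vectors in place of real 384-dimensional SBERT embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the two logging queries above
v1 = [0.8, 0.1, 0.5]
v2 = [0.7, 0.2, 0.6]
print(cosine_similarity(v1, v2))  # close to 1.0 → semantically similar
```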

That makes SBERT ideal for:

  • Intent classification
  • Semantic search
  • Clustering


Base Architecture (Common to All Approaches)

All three strategies use the same neural architecture:

  1. A pretrained Transformer encoder

  2. A small classification head on top

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class IntentClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super().__init__()

        # Pretrained Transformer (SBERT-style encoder)
        self.encoder = AutoModel.from_pretrained(model_name)

        # Size of embeddings produced by the encoder
        hidden_size = self.encoder.config.hidden_size

        # Simple linear classifier
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Run input through the Transformer
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Use the [CLS] token embedding as the sentence representation
        # (SBERT models typically mean-pool; [CLS] keeps the example simple)
        cls_embedding = outputs.last_hidden_state[:, 0]

        # Map the embedding to intent logits
        return self.classifier(cls_embedding)

The only difference between the three strategies is which parts of this model are allowed to learn.

Baseline: Linear Probe over Frozen SBERT

Architecture

The simplest effective solution uses pretrained sentence embeddings:

User Query -> SBERT Encoder (frozen) -> Linear Layer + Softmax -> Search Router

This approach is commonly called a linear probe.

What Is Linear Probing?

Linear probing means:

“Do NOT change the pretrained language model. Only train a lightweight classifier on top.”

We treat SBERT as a fixed feature extractor.

Why this works well

SBERT embeddings already encode:

  • Query length and interrogative structure
  • Token shape (IDs vs. natural-language sentences)

And the task itself is forgiving:

  • Intent routing is a coarse decision boundary
  • Small datasets (1–2k samples) are sufficient

Benefits

  • Extremely stable
  • Fast to train
  • Minimal overfitting risk
  • Excellent baseline for production systems

Limitations

The decision boundary is strictly linear in embedding space.

This leads to ambiguous cases such as:

  • helidon threadpool configuration
  • tickets status closed

These sit near class boundaries and are often misclassified.

How to Implement Linear Probing

Step 1: Freeze the Encoder

model = IntentClassifier(MODEL_NAME, num_labels=3)

# Freeze all Transformer weights
for param in model.encoder.parameters():
    param.requires_grad = False

This tells PyTorch:

“Do not update these weights during training.”

Step 2: Train Only the Classifier

from torch.optim import AdamW

optimizer = AdamW(
    model.classifier.parameters(),
    lr=1e-3
)
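
Because the encoder is frozen, its embeddings never change, so they can be computed once and cached; training then only touches the linear layer. A toy sketch of that loop, with random tensors standing in for precomputed SBERT embeddings (the dimensions and sample counts are illustrative):

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

torch.manual_seed(0)

# Random stand-ins for precomputed, frozen SBERT embeddings (384-dim)
embeddings = torch.randn(120, 384)
labels = torch.randint(0, 3, (120,))  # 3 routes

classifier = nn.Linear(384, 3)
optimizer = AdamW(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

first_loss = None
for epoch in range(20):
    optimizer.zero_grad()
    logits = classifier(embeddings)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    if first_loss is None:
        first_loss = loss.item()

# A linear probe fits even random data somewhat; with real
# embeddings the classes are far more separable.
print(first_loss, loss.item())
```

Precomputing embeddings also means training cost is independent of encoder size, which is what makes this approach so fast to iterate on.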

When to Use Linear Probing

Situation                  Recommendation
Very small dataset         ✅ Yes
Fast iteration             ✅ Yes
Low overfitting tolerance  ✅ Yes
Domain-specific language   ❌ Limited

Why This Is Not End-to-End Training

Although a neural network is used, the encoder never adapts to the task.

Why?

  • Embeddings remain generic
  • Task-specific cues are not reinforced
  • Boundary refinement is limited

This motivates moving toward end-to-end learning, but doing so carefully.


Introducing a True Neural Network

To improve boundary separation without sacrificing stability, we move to a partially fine-tuned transformer.

Architecture

MiniLM Transformer
  ├─ Frozen lower layers
  └─ Trainable top layers
        ↓
  Non-linear classifier head
        ↓
  Softmax

Key ideas:

  • Lower layers encode general syntax and semantics
  • Upper layers adapt to intent-specific signals
  • Non-linear layers improve separability

Partial Unfreezing vs Full Fine-Tuning

This is the most important design decision.

Partial Unfreezing

What Is Partial Unfreezing?

Partial unfreezing means:

“Keep most of the model frozen, but allow the top Transformer layers to adapt.”

Only the top N transformer layers are trainable.

Why only the top layers?

Lower layers learn generic language features

Upper layers learn task-specific meaning

How to Implement Partial Unfreezing

Step 1: Freeze Everything

model = IntentClassifier(MODEL_NAME, num_labels=3)

for param in model.encoder.parameters():
    param.requires_grad = False

Step 2: Unfreeze the Top N Layers

def unfreeze_top_layers(model, n_layers=2):
    total_layers = model.encoder.config.num_hidden_layers

    for name, param in model.encoder.named_parameters():
        # Parameter names look like "encoder.layer.11.attention.self.query.weight"
        if any(f"encoder.layer.{layer_id}." in name
               for layer_id in range(total_layers - n_layers, total_layers)):
            param.requires_grad = True

This allows controlled adaptation without destabilizing the model.

Pros

  • Works well with 1k–3k samples
  • Low overfitting risk
  • Stable training
  • Preserves pretrained knowledge

Cons

  • Limited representational shift
  • Smaller gains beyond a point

When to Use Partial Unfreezing

Situation                 Recommendation
Small–medium dataset      ✅ Yes
Domain adaptation needed  ✅ Yes
Production stability      ✅ Yes
Very limited compute      ⚠️ Maybe

Full Fine-Tuning

Here, all transformer layers are trainable. Full fine-tuning means:

“Allow every weight in the model to update.”

This gives maximum flexibility—but also maximum risk.

How to Implement Full Fine-Tuning

Step 1: Do Not Freeze Anything

model = IntentClassifier(MODEL_NAME, num_labels=3)

Step 2: Use a Very Small Learning Rate

optimizer = AdamW(
    model.parameters(),
    lr=2e-5
)
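
Full fine-tuning is usually stabilized with learning-rate warmup and gradient clipping. A minimal sketch using a plain PyTorch LambdaLR scheduler (the transformers library ships equivalent helpers); the step counts are illustrative, and a single parameter stands in for a real model:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

param = torch.nn.Parameter(torch.zeros(1))  # stand-in for model.parameters()
optimizer = AdamW([param], lr=2e-5)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    # Linear warmup, then linear decay to zero
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
#   scheduler.step()
```

Warmup protects the pretrained weights from large early updates, which is where catastrophic forgetting tends to start.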

Pros

  • Maximum task adaptation
  • Best performance at scale

Cons

  • Requires large datasets (5k–10k+)
  • Risk of catastrophic forgetting
  • Unstable with small or noisy data
  • Harder to debug misclassifications

When to Use Full Fine-Tuning

Situation                  Recommendation
Large labeled dataset      ✅ Yes
Highly specialized domain  ✅ Yes
Research experiments       ✅ Yes
Small datasets             ❌ No

Practical Comparison

Dimension           Partial Unfreezing   Full Fine-Tuning
Dataset size        1k–3k                5k–10k+
Overfitting risk    Low                  High
Training stability  High                 Medium
Latency             Same                 Same
Maintenance cost    Low                  High
Production safety   High                 Medium

For intent routing, partial unfreezing captures most of the benefit with far less risk.


Production Guardrails (Critical)

ML alone should never be the only line of defense.

  1. Regex pre-routing
    • IDs, error codes → keyword search
  2. Confidence thresholds
    • Low confidence → semantic fallback
  3. Hybrid execution
    • Run keyword + semantic for ambiguous queries
  4. Logging misroutes
    • Use real queries for retraining

In practice, heuristics + ML outperform ML alone.
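
The first two guardrails can be sketched together: a regex pre-route for obvious identifiers, then a confidence threshold over the classifier's softmax output. The ID pattern and the 0.6 threshold are illustrative placeholders, not values from a real deployment:

```python
import math
import re

ROUTES = ["ROUTE_KEYWORD", "ROUTE_EMBEDDING", "ROUTE_LUCENE"]
ID_PATTERN = re.compile(r"^[A-Z]{2,5}-\d+$")  # matches e.g. INC-88321

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(query, logits, threshold=0.6):
    # Guardrail 1: obvious identifiers bypass the model entirely
    if ID_PATTERN.match(query.strip()):
        return "ROUTE_KEYWORD"

    # Guardrail 2: low-confidence predictions fall back to semantic
    # search, which degrades most gracefully
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "ROUTE_EMBEDDING"
    return ROUTES[best]
```

Ambiguous queries like “helidon threadpool configuration” produce flat softmax distributions and land in the semantic fallback instead of a hard misroute.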


Why Not Use an LLM for Routing?

Large language models are powerful, but poor routers.

Concern        Impact
Latency        Too high
Cost           Unnecessary
Determinism    Low
Debuggability  Poor

LLMs are better used after routing, for tasks like:

  • NL → Lucene translation
  • Query explanation
  • Result summarization

Lessons Learned

  • Data quality matters more than model choice
  • Linear probes are strong baselines
  • Partial unfreezing gives the best risk-reward ratio
  • Confidence-aware routing improves UX dramatically
  • Intent routing is a systems problem, not a pure ML problem

Conclusion

Intent-aware query routing is not about building the most sophisticated model —
it is about building the most reliable decision boundary under real-world constraints.

For most hybrid search systems:

  • Start with a linear probe
  • Add heuristics and confidence thresholds
  • Move to partial fine-tuning when data grows
  • Avoid full fine-tuning unless scale demands it

This approach keeps the system fast, interpretable, and production-safe.


If you’re building a hybrid search platform, intent classification is the quiet component that determines whether everything else works.