Introduction
Modern search platforms rarely rely on a single retrieval strategy.
In practice, a production system often supports multiple search modes, such as:
- Keyword search for exact identifiers and error codes
- Semantic search for conceptual or explanatory queries
- Structured search where natural language is translated into a query DSL (e.g., Lucene)
The challenge is not implementing these search engines individually — it is deciding, at query time, which engine to route a user query to.
This post describes a practical, ML-driven approach to intent-aware query routing, and dives deep into an important architectural trade-off:
Linear probing over frozen embeddings vs. end-to-end neural networks with partial or full fine-tuning
The focus is on engineering correctness, latency, and reliability, not model novelty.
Problem Statement
Consider a backend system that supports three search paths:
| Query Example | Intended Route |
|---|---|
| `INC-88321` | Keyword search |
| `how to increase worker threads in helidon` | Semantic (embedding) search |
| `find all incidents where status is closed` | Structured (NL → Lucene) search |
The system must classify query intent before executing the search.
A purely rule-based solution quickly becomes brittle:
- Regex rules explode in number
- Query phrasing varies significantly
- Maintenance cost increases over time
This makes machine learning a good fit — provided it is applied conservatively.
Framing the Problem as Intent Classification
At its core, this is a low-entropy, multi-class text classification problem.
We define routing-based labels:
| Label | Meaning |
|---|---|
| `ROUTE_KEYWORD` | Exact / lexical match |
| `ROUTE_EMBEDDING` | Semantic similarity search |
| `ROUTE_LUCENE` | Structured query parsing |
The classifier output is used only for routing, not for answering the query itself.
This distinction matters — misclassification cost is asymmetric:
- Routing a structured query to keyword search often yields zero results
- Routing an ambiguous query to semantic search usually degrades gracefully
We can solve this problem with SBERT, a widely used transformer architecture for sentence embeddings. SBERT itself is a generic encoder that can be adapted in several ways to intent classification. Below we discuss three approaches and the trade-offs between them.
Why SBERT?
SBERT converts a sentence into a dense numerical vector (embedding) that captures semantic meaning.
For example:
“enable log rotation in helidon”
“how to configure logging rollover in helidon”
These two sentences produce very similar vectors, even though the words differ.
That makes SBERT ideal for:
- Intent classification
- Semantic search
- Clustering
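The intuition behind "very similar vectors" is cosine similarity. The sketch below uses small hypothetical vectors standing in for real 384-dimensional SBERT embeddings of the two example sentences; the numbers are illustrative, not actual model output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings standing in for real SBERT vectors
emb_a = np.array([0.71, 0.20, 0.05, 0.68])  # "enable log rotation in helidon"
emb_b = np.array([0.69, 0.25, 0.02, 0.67])  # "how to configure logging rollover in helidon"

print(cosine_similarity(emb_a, emb_b))  # close to 1.0 for near-paraphrases
```

A score near 1.0 means the encoder sees the two queries as near-paraphrases even though they share few words.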
Base Architecture (Common to All Approaches)
All three strategies use the same neural architecture:
- A pretrained Transformer encoder
- A small classification head on top
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# SBERT-style MiniLM encoder (384-dimensional embeddings)
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class IntentClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super().__init__()
        # Pretrained Transformer (SBERT-style encoder)
        self.encoder = AutoModel.from_pretrained(model_name)
        # Size of embeddings produced by the encoder
        hidden_size = self.encoder.config.hidden_size
        # Simple linear classifier
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Run input through the Transformer
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Use the [CLS] token embedding as the sentence representation
        cls_embedding = outputs.last_hidden_state[:, 0]
        # Map the embedding to intent logits
        return self.classifier(cls_embedding)
```
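At inference time, the logits from the classifier head are turned into a route plus a confidence score. A minimal sketch in numpy (the label order and the example logits are assumptions for illustration):

```python
import numpy as np

ROUTES = ["ROUTE_KEYWORD", "ROUTE_EMBEDDING", "ROUTE_LUCENE"]

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def route_from_logits(logits: np.ndarray) -> tuple[str, float]:
    """Return the chosen route and the classifier's confidence in it."""
    probs = softmax(logits)
    idx = int(probs.argmax())
    return ROUTES[idx], float(probs[idx])

# Hypothetical logits for a query like "INC-88321"
route, confidence = route_from_logits(np.array([4.1, 0.3, -1.2]))
print(route)  # ROUTE_KEYWORD
```

The confidence value is what the production guardrails later in this post threshold against.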
The only difference between the three strategies is: Which parts of this model are allowed to learn.
Baseline: Linear Probe over Frozen SBERT
Architecture
The simplest effective solution uses pretrained sentence embeddings:
User Query -> SBERT Encoder (frozen) -> Linear Layer + Softmax -> Search Router
This approach is commonly called a linear probe.
What Is Linear Probing?
Linear probing means:
“Do NOT change the pretrained language model. Only train a lightweight classifier on top.”
In other words, we treat SBERT as a fixed feature extractor.
Why this works well
SBERT embeddings already encode useful routing signals:
- Query length
- Interrogative structure
- Token shape (IDs vs. natural sentences)

In addition:
- Intent routing is a coarse decision boundary
- Small datasets (1–2k samples) are sufficient
Benefits
- Extremely stable
- Fast to train
- Minimal overfitting risk
- Excellent baseline for production systems
Limitations
The decision boundary is strictly linear in embedding space.
This leads to ambiguous cases such as:
- "helidon threadpool configuration"
- "tickets status closed"
These sit near class boundaries and are often misclassified.
How to Implement Linear Probing
Step 1: Freeze the Encoder
```python
model = IntentClassifier(MODEL_NAME, num_labels=3)

# Freeze all Transformer weights
for param in model.encoder.parameters():
    param.requires_grad = False
```
This tells PyTorch:
“Do not update these weights during training.”
Step 2: Train Only the Classifier
```python
from torch.optim import AdamW

# Optimize only the classifier head; the encoder stays frozen
optimizer = AdamW(
    model.classifier.parameters(),
    lr=1e-3,
)
```
When to Use Linear Probing
| Situation | Recommendation |
|---|---|
| Very small dataset | ✅ Yes |
| Fast iteration | ✅ Yes |
| Low overfitting tolerance | ✅ Yes |
| Domain-specific language | ❌ Limited |
Why This Is Not End-to-End Training
Although a neural network is used, the encoder never adapts to the task.
Why?
- Embeddings remain generic
- Task-specific cues are not reinforced
- Boundary refinement is limited
This motivates moving toward end-to-end learning, but doing so carefully.
Introducing a True Neural Network
To improve boundary separation without sacrificing stability, we move to a partially fine-tuned transformer.
Architecture
```
MiniLM Transformer
├─ Frozen lower layers
├─ Trainable top layers
→ Non-linear classifier head
→ Softmax
```
Key ideas:
- Lower layers encode general syntax and semantics
- Upper layers adapt to intent-specific signals
- Non-linear layers improve separability
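A non-linear head could look like the sketch below. It is self-contained; the hidden size, GELU activation, and dropout rate are illustrative choices, not tuned values from the original system.

```python
import torch
import torch.nn as nn

class NonLinearHead(nn.Module):
    """MLP classification head: hidden layer + GELU + dropout.

    Sizes and dropout are illustrative, not tuned values.
    """
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size // 2, num_labels),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.net(embedding)

# Shape check with a random batch of 384-d sentence embeddings
head = NonLinearHead(hidden_size=384, num_labels=3)
logits = head(torch.randn(4, 384))
print(logits.shape)  # torch.Size([4, 3])
```

Dropping this in place of the single `nn.Linear` in the base architecture is the only structural change; the encoder and training loop are unaffected.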
Partial Unfreezing vs Full Fine-Tuning
This is the most important design decision.
Partial Unfreezing
What Is Partial Unfreezing?
Partial unfreezing means:
“Keep most of the model frozen, but allow the top Transformer layers to adapt.”
Only the top N transformer layers are trainable.
Why only the top layers?
- Lower layers learn generic language features
- Upper layers learn task-specific meaning
How to Implement Partial Unfreezing
Step 1: Freeze Everything
```python
model = IntentClassifier(MODEL_NAME, num_labels=3)

for param in model.encoder.parameters():
    param.requires_grad = False
```
Step 2: Unfreeze the Top N Layers
```python
def unfreeze_top_layers(model, n_layers=2):
    total_layers = model.encoder.config.num_hidden_layers
    for name, param in model.encoder.named_parameters():
        for layer_id in range(total_layers - n_layers, total_layers):
            if f"encoder.layer.{layer_id}." in name:
                param.requires_grad = True
```
This allows controlled adaptation without destabilizing the model.

Pros
- Works well with 1k–3k samples
- Low overfitting risk
- Stable training
- Preserves pretrained knowledge
Cons
- Limited representational shift
- Smaller gains beyond a point
When to Use Partial Unfreezing
| Situation | Recommendation |
|---|---|
| Small–medium dataset | ✅ Yes |
| Domain adaptation needed | ✅ Yes |
| Production stability | ✅ Yes |
| Very limited compute | ⚠️ Maybe |
Full Fine-Tuning
Full fine-tuning means:
“Allow every weight in the model to update.”
All transformer layers are trainable.
This gives maximum flexibility—but also maximum risk.
How to Implement Full Fine-Tuning
Step 1: Do Not Freeze Anything
```python
model = IntentClassifier(MODEL_NAME, num_labels=3)
# Nothing is frozen: every parameter will receive gradient updates
```
Step 2: Use a Very Small Learning Rate
```python
from torch.optim import AdamW

# A small learning rate protects pretrained weights from large destructive updates
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
)
```
Pros
- Maximum task adaptation
- Best performance at scale
Cons
- Requires large datasets (5k–10k+)
- Risk of catastrophic forgetting
- Unstable with small or noisy data
- Harder to debug misclassifications
When to Use Full Fine-Tuning
| Situation | Recommendation |
|---|---|
| Large labeled dataset | ✅ Yes |
| Highly specialized domain | ✅ Yes |
| Research experiments | ✅ Yes |
| Small datasets | ❌ No |
Practical Comparison
| Dimension | Partial Unfreezing | Full Fine-Tuning |
|---|---|---|
| Dataset size | 1k–3k | 5k–10k+ |
| Overfitting risk | Low | High |
| Training stability | High | Medium |
| Latency | Same | Same |
| Maintenance cost | Low | High |
| Production safety | High | Medium |
For intent routing, partial unfreezing captures most of the benefit with far less risk.
Production Guardrails (Critical)
ML alone should never be the only line of defense.
Recommended safeguards
- Regex pre-routing: IDs and error codes go straight to keyword search
- Confidence thresholds: low-confidence predictions fall back to semantic search
- Hybrid execution: run keyword + semantic for ambiguous queries
- Logging misroutes: use real queries for retraining
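The first two safeguards combine into a thin routing wrapper. The sketch below is illustrative: the ID pattern and the 0.7 threshold are placeholder choices, and `classify` stands in for the trained classifier.

```python
import re

ID_PATTERN = re.compile(r"^[A-Z]{2,5}-\d+$")  # e.g. INC-88321; placeholder rule
CONFIDENCE_THRESHOLD = 0.7                    # illustrative, tune on real traffic

def route_query(query: str, classify) -> str:
    """Route a query: regex pre-routing, then ML, then semantic fallback.

    `classify` is assumed to return (label, confidence) from the trained model.
    """
    # 1. Regex pre-routing: exact identifiers skip the classifier entirely
    if ID_PATTERN.match(query.strip()):
        return "ROUTE_KEYWORD"
    # 2. ML routing, gated by a confidence threshold
    label, confidence = classify(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    # 3. Low confidence: fall back to semantic search, which degrades gracefully
    return "ROUTE_EMBEDDING"

# Stub classifier standing in for the trained model
def fake_classify(query):
    return ("ROUTE_LUCENE", 0.55)  # an uncertain prediction

print(route_query("INC-88321", fake_classify))         # ROUTE_KEYWORD
print(route_query("closed incidents", fake_classify))  # ROUTE_EMBEDDING
```

Keeping the heuristics outside the model means they can be changed without retraining, and misroute logs from step 3 feed directly into the retraining loop.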
In practice, heuristics + ML outperform ML alone.
Why Not Use an LLM for Routing?
Large language models are powerful, but poor routers.
| Concern | Impact |
|---|---|
| Latency | Too high |
| Cost | Unnecessary |
| Determinism | Low |
| Debuggability | Poor |
LLMs are better used after routing, for tasks like:
- NL → Lucene translation
- Query explanation
- Result summarization
Lessons Learned
- Data quality matters more than model choice
- Linear probes are strong baselines
- Partial unfreezing gives the best risk-reward ratio
- Confidence-aware routing improves UX dramatically
- Intent routing is a systems problem, not a pure ML problem
Conclusion
Intent-aware query routing is not about building the most sophisticated model —
it is about building the most reliable decision boundary under real-world constraints.
For most hybrid search systems:
- Start with a linear probe
- Add heuristics and confidence thresholds
- Move to partial fine-tuning when data grows
- Avoid full fine-tuning unless scale demands it
This approach keeps the system fast, interpretable, and production-safe.
If you’re building a hybrid search platform, intent classification is the quiet component that determines whether everything else works.