Introduction
Modern search platforms rarely rely on a single retrieval strategy.
In practice, a production system often supports multiple search modes, such as:
- Keyword search for exact identifiers and error codes
- Semantic search for conceptual or explanatory queries
- Structured search where natural language is translated into a query DSL (e.g., Lucene)
The challenge is not implementing these search engines individually — it is deciding, at query time, which engine to route a user query to.
This post describes a practical, ML-driven approach to intent-aware query routing, and dives deep into an important architectural trade-off:
Linear probing over frozen embeddings vs. end-to-end neural networks with partial or full fine-tuning
The focus is on engineering correctness, latency, and reliability, not model novelty.
Problem Statement
Consider a backend system that supports three search paths:
| Query Example | Intended Route |
|---|---|
| `INC-88321` | Keyword search |
| `how to increase worker threads in helidon` | Semantic (embedding) search |
| `find all incidents where status is closed` | Structured (NL → Lucene) search |
The system must classify query intent before executing the search.
A purely rule-based solution quickly becomes brittle:
- Regex rules explode in number
- Query phrasing varies significantly
- Maintenance cost increases over time
This makes machine learning a good fit — provided it is applied conservatively.
Framing the Problem as Intent Classification
At its core, this is a low-entropy, multi-class text classification problem.
We define routing-based labels:
| Label | Meaning |
|---|---|
| `ROUTE_KEYWORD` | Exact / lexical match |
| `ROUTE_EMBEDDING` | Semantic similarity search |
| `ROUTE_LUCENE` | Structured query parsing |
The classifier output is used only for routing, not for answering the query itself.
This distinction matters — misclassification cost is asymmetric:
- Routing a structured query to keyword search often yields zero results
- Routing an ambiguous query to semantic search usually degrades gracefully
We can solve this problem with SBERT, a widely used transformer architecture for sentence embeddings. SBERT itself is a generic encoder that can be adapted in several ways to intent classification. Below we discuss three approaches and the trade-offs between them.
Why SBERT?
SBERT converts a sentence into a dense numerical vector (embedding) that captures semantic meaning.
For example:
“enable log rotation in helidon”
“how to configure logging rollover in helidon”
These two sentences produce very similar vectors, even though the words differ.
That makes SBERT ideal for:
- Intent classification
- Semantic search
- Clustering
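The intuition behind "very similar vectors" is cosine similarity. The sketch below uses small hypothetical vectors standing in for real 384-dimensional SBERT embeddings of the two example sentences; the numbers are illustrative, not actual model output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings standing in for real SBERT vectors
emb_a = np.array([0.71, 0.20, 0.05, 0.68])  # "enable log rotation in helidon"
emb_b = np.array([0.69, 0.25, 0.02, 0.67])  # "how to configure logging rollover in helidon"

print(cosine_similarity(emb_a, emb_b))  # close to 1.0 for near-paraphrases
```

A score near 1.0 means the encoder sees the two queries as near-paraphrases even though they share few words.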
Base Architecture (Common to All Approaches)
All three strategies use the same neural architecture:
- A pretrained Transformer encoder
- A small classification head on top
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# SBERT-style MiniLM encoder (384-dimensional embeddings)
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class IntentClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super().__init__()
        # Pretrained Transformer (SBERT-style encoder)
        self.encoder = AutoModel.from_pretrained(model_name)
        # Size of embeddings produced by the encoder
        hidden_size = self.encoder.config.hidden_size
        # Simple linear classifier
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Run input through the Transformer
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Use the [CLS] token embedding as the sentence representation
        cls_embedding = outputs.last_hidden_state[:, 0]
        # Map the embedding to intent logits
        return self.classifier(cls_embedding)
```
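At inference time, the logits from the classifier head are turned into a route plus a confidence score. A minimal sketch in numpy (the label order and the example logits are assumptions for illustration):

```python
import numpy as np

ROUTES = ["ROUTE_KEYWORD", "ROUTE_EMBEDDING", "ROUTE_LUCENE"]

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def route_from_logits(logits: np.ndarray) -> tuple[str, float]:
    """Return the chosen route and the classifier's confidence in it."""
    probs = softmax(logits)
    idx = int(probs.argmax())
    return ROUTES[idx], float(probs[idx])

# Hypothetical logits for a query like "INC-88321"
route, confidence = route_from_logits(np.array([4.1, 0.3, -1.2]))
print(route)  # ROUTE_KEYWORD
```

The confidence value is what the production guardrails later in this post threshold against.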
The only difference between the three strategies is: Which parts of this model are allowed to learn.
Baseline: Linear Probe over Frozen SBERT
Architecture
The simplest effective solution uses pretrained sentence embeddings:
User Query -> SBERT Encoder (frozen) -> Linear Layer + Softmax -> Search Router
This approach is commonly called a linear probe.
What Is Linear Probing?
Linear probing means:
“Do NOT change the pretrained language model. Only train a lightweight classifier on top.”
In other words, we treat SBERT as a fixed feature extractor.
Why this works well
SBERT embeddings already encode useful routing signals:
- Query length
- Interrogative structure
- Token shape (IDs vs. natural sentences)

In addition:
- Intent routing is a coarse decision boundary
- Small datasets (1–2k samples) are sufficient
Benefits
- Extremely stable
- Fast to train
- Minimal overfitting risk
- Excellent baseline for production systems
Limitations
The decision boundary is strictly linear in embedding space.
This leads to ambiguous cases such as:
- "helidon threadpool configuration"
- "tickets status closed"
These sit near class boundaries and are often misclassified.
How to Implement Linear Probing
Step 1: Freeze the Encoder
```python
model = IntentClassifier(MODEL_NAME, num_labels=3)

# Freeze all Transformer weights
for param in model.encoder.parameters():
    param.requires_grad = False
```
This tells PyTorch:
“Do not update these weights during training.”
Step 2: Train Only the Classifier
```python
from torch.optim import AdamW

# Optimize only the classifier head; the encoder stays frozen
optimizer = AdamW(
    model.classifier.parameters(),
    lr=1e-3,
)
```
When to Use Linear Probing
| Situation | Recommendation |
|---|---|
| Very small dataset | ✅ Yes |
| Fast iteration | ✅ Yes |
| Low overfitting tolerance | ✅ Yes |
| Domain-specific language | ❌ Limited |
Why This Is Not End-to-End Training
Although a neural network is used, the encoder never adapts to the task.
Why?
- Embeddings remain generic
- Task-specific cues are not reinforced
- Boundary refinement is limited
This motivates moving toward end-to-end learning, but doing so carefully.
Introducing a True Neural Network
To improve boundary separation without sacrificing stability, we move to a partially fine-tuned transformer.
Architecture
```
MiniLM Transformer
├─ Frozen lower layers
├─ Trainable top layers
→ Non-linear classifier head
→ Softmax
```
Key ideas:
- Lower layers encode general syntax and semantics
- Upper layers adapt to intent-specific signals
- Non-linear layers improve separability
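A non-linear head could look like the sketch below. It is self-contained; the hidden size, GELU activation, and dropout rate are illustrative choices, not tuned values from the original system.

```python
import torch
import torch.nn as nn

class NonLinearHead(nn.Module):
    """MLP classification head: hidden layer + GELU + dropout.

    Sizes and dropout are illustrative, not tuned values.
    """
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size // 2, num_labels),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.net(embedding)

# Shape check with a random batch of 384-d sentence embeddings
head = NonLinearHead(hidden_size=384, num_labels=3)
logits = head(torch.randn(4, 384))
print(logits.shape)  # torch.Size([4, 3])
```

Dropping this in place of the single `nn.Linear` in the base architecture is the only structural change; the encoder and training loop are unaffected.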
Partial Unfreezing vs Full Fine-Tuning
This is the most important design decision.
Partial Unfreezing
What Is Partial Unfreezing?
Partial unfreezing means:
“Keep most of the model frozen, but allow the top Transformer layers to adapt.”
Only the top N transformer layers are trainable.
Why only the top layers?
- Lower layers learn generic language features
- Upper layers learn task-specific meaning
How to Implement Partial Unfreezing
Step 1: Freeze Everything
```python
model = IntentClassifier(MODEL_NAME, num_labels=3)

for param in model.encoder.parameters():
    param.requires_grad = False
```
Step 2: Unfreeze the Top N Layers
```python
def unfreeze_top_layers(model, n_layers=2):
    total_layers = model.encoder.config.num_hidden_layers
    for name, param in model.encoder.named_parameters():
        for layer_id in range(total_layers - n_layers, total_layers):
            if f"encoder.layer.{layer_id}." in name:
                param.requires_grad = True
```
This allows controlled adaptation without destabilizing the model.

Pros
- Works well with 1k–3k samples
- Low overfitting risk
- Stable training
- Preserves pretrained knowledge
Cons
- Limited representational shift
- Smaller gains beyond a point
When to Use Partial Unfreezing
| Situation | Recommendation |
|---|---|
| Small–medium dataset | ✅ Yes |
| Domain adaptation needed | ✅ Yes |
| Production stability | ✅ Yes |
| Very limited compute | ⚠️ Maybe |
Full Fine-Tuning
Full fine-tuning means:
“Allow every weight in the model to update.”
All transformer layers are trainable.
This gives maximum flexibility—but also maximum risk.
How to Implement Full Fine-Tuning
Step 1: Do Not Freeze Anything
```python
model = IntentClassifier(MODEL_NAME, num_labels=3)
# Nothing is frozen: every parameter will receive gradient updates
```
Step 2: Use a Very Small Learning Rate
```python
from torch.optim import AdamW

# A small learning rate protects pretrained weights from large destructive updates
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
)
```
Pros
- Maximum task adaptation
- Best performance at scale
Cons
- Requires large datasets (5k–10k+)
- Risk of catastrophic forgetting
- Unstable with small or noisy data
- Harder to debug misclassifications
When to Use Full Fine-Tuning
| Situation | Recommendation |
|---|---|
| Large labeled dataset | ✅ Yes |
| Highly specialized domain | ✅ Yes |
| Research experiments | ✅ Yes |
| Small datasets | ❌ No |
Practical Comparison
| Dimension | Partial Unfreezing | Full Fine-Tuning |
|---|---|---|
| Dataset size | 1k–3k | 5k–10k+ |
| Overfitting risk | Low | High |
| Training stability | High | Medium |
| Latency | Same | Same |
| Maintenance cost | Low | High |
| Production safety | High | Medium |
For intent routing, partial unfreezing captures most of the benefit with far less risk.
Production Guardrails (Critical)
ML alone should never be the only line of defense.
Recommended safeguards
- Regex pre-routing: IDs and error codes go straight to keyword search
- Confidence thresholds: low-confidence predictions fall back to semantic search
- Hybrid execution: run keyword + semantic for ambiguous queries
- Logging misroutes: use real queries for retraining
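The first two safeguards combine into a thin routing wrapper. The sketch below is illustrative: the ID pattern and the 0.7 threshold are placeholder choices, and `classify` stands in for the trained classifier.

```python
import re

ID_PATTERN = re.compile(r"^[A-Z]{2,5}-\d+$")  # e.g. INC-88321; placeholder rule
CONFIDENCE_THRESHOLD = 0.7                    # illustrative, tune on real traffic

def route_query(query: str, classify) -> str:
    """Route a query: regex pre-routing, then ML, then semantic fallback.

    `classify` is assumed to return (label, confidence) from the trained model.
    """
    # 1. Regex pre-routing: exact identifiers skip the classifier entirely
    if ID_PATTERN.match(query.strip()):
        return "ROUTE_KEYWORD"
    # 2. ML routing, gated by a confidence threshold
    label, confidence = classify(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    # 3. Low confidence: fall back to semantic search, which degrades gracefully
    return "ROUTE_EMBEDDING"

# Stub classifier standing in for the trained model
def fake_classify(query):
    return ("ROUTE_LUCENE", 0.55)  # an uncertain prediction

print(route_query("INC-88321", fake_classify))         # ROUTE_KEYWORD
print(route_query("closed incidents", fake_classify))  # ROUTE_EMBEDDING
```

Keeping the heuristics outside the model means they can be changed without retraining, and misroute logs from step 3 feed directly into the retraining loop.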
In practice, heuristics + ML outperform ML alone.
Why Not Use an LLM for Routing?
Large language models are powerful, but poor routers.
| Concern | Impact |
|---|---|
| Latency | Too high |
| Cost | Unnecessary |
| Determinism | Low |
| Debuggability | Poor |
LLMs are better used after routing, for tasks like:
- NL → Lucene translation
- Query explanation
- Result summarization
Lessons Learned
- Data quality matters more than model choice
- Linear probes are strong baselines
- Partial unfreezing gives the best risk-reward ratio
- Confidence-aware routing improves UX dramatically
- Intent routing is a systems problem, not a pure ML problem
Conclusion
Intent-aware query routing is not about building the most sophisticated model —
it is about building the most reliable decision boundary under real-world constraints.
For most hybrid search systems:
- Start with a linear probe
- Add heuristics and confidence thresholds
- Move to partial fine-tuning when data grows
- Avoid full fine-tuning unless scale demands it
This approach keeps the system fast, interpretable, and production-safe.
If you’re building a hybrid search platform, intent classification is the quiet component that determines whether everything else works.