Why I Built This: When “Good Search” Still Feels Wrong

Most enterprise search stacks do a decent job at matching words. OpenSearch (and similar engines) commonly rely on lexical ranking families like TF‑IDF/BM25. These are strong baselines, but they tend to miss something important: user-specific intent.

Three failure modes show up repeatedly in real systems:

  1. Ambiguity: the same term can mean different things to different users (or even the same user on different days).
  2. Intent drift: users’ interests change over time; yesterday’s context matters.
  3. One-size-fits-all ranking: a uniform ranking is applied across all users even though patterns differ by role, team, or experience level.

A typical reaction is: “Let’s use embeddings” (dense retrieval, sentence transformers, vector databases, LTR models). That can work—but it also brings cost, latency, new infra, and sometimes reduced explainability.

This case study explores a different approach:

A lightweight, explainable personalization reranker that learns from a user’s own query history using hash-based word correlation vectors—no pretraining, no embeddings, and no external vector store.


Design Goals & Constraints

This architecture was driven by very practical constraints:

  • Low memory per user: must scale to many users without storing full query logs.
  • Low operational complexity: avoid adding heavy ML services or vector DBs.
  • Explainability: keep a clear “why did this rank higher?” story.
  • Compatibility: should work as a reranking layer on top of standard OpenSearch results.
  • Incremental learning: adapt in real time as users search, not through offline training.

High-Level Architecture

At a high level, the system wraps around OpenSearch like this:

  1. User submits a query
  2. The system updates a per-user word map (a compact model of word-to-word correlations)
  3. OpenSearch returns top‑N results using its standard relevance score
  4. We compute a similarity score between:
    • the user’s current query context, and
    • each candidate result title (or snippet)
  5. We blend the similarity with the OpenSearch score and re-rank.

Visually:

  • Learning path (write): Query → tokenize → hash → update sparse vectors
  • Serving path (read): Query + Top‑N results → vectorize → cosine similarity → blend scores → re-rank
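
To make the two paths concrete, here is an illustrative skeleton of the wrapper (the interface and method names are my own, not from the disclosure); the sections below fill in each piece.

import java.util.List;
import java.util.Map;

// The reranker wraps the existing OpenSearch call and exposes the two paths above.
interface PersonalizedReranker {
  // Learning path (write): update the per-user word map from the raw query text.
  void learn(String userId, String query);

  // Serving path (read): re-rank the top-N titles OpenSearch returned,
  // given each title's original OpenSearch score.
  List<String> rerank(String userId, String query, Map<String, Double> topNTitleScores);
}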

Data Model: The Per-User Word Store

The core state is a per-user map:

  • Map<String, float[]> userWordMap
  • Key: word (or hashed representation of the word)
  • Value: a fixed-size float vector representing correlations to other words.

Each vector is intentionally:

  • Fixed-size (bounded memory)
  • Sparse (most entries stay 0)
  • Incrementally updated on every query.

Vector Interpretation

Think of each word as having a “memory” of other words it has appeared near in the user’s searches.

  • Vector index = hash slot of a co-occurring word
  • Vector value = strength of correlation (based on proximity within queries)

This is not an embedding in the neural sense. It’s closer to a user-personalized, hashed co-occurrence graph stored as sparse arrays.
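
As a minimal sketch (class and method names are assumptions, not the disclosure's), the store can be as small as this:

import java.util.HashMap;
import java.util.Map;

class UserWordStore {
  static final int VECTOR_SIZE = 100;  // bounded vector size, matching the disclosure's example

  // word -> fixed-size, mostly-zero correlation vector for this user
  private final Map<String, float[]> userWordMap = new HashMap<>();

  // New word: allocate a zero vector. Existing word: return it for in-place updates.
  float[] getOrCreate(String word) {
    return userWordMap.computeIfAbsent(word, w -> new float[VECTOR_SIZE]);
  }

  // May return null if this user has never searched the word.
  float[] get(String word) {
    return userWordMap.get(word);
  }

  int distinctWords() {
    return userWordMap.size();
  }
}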


Step 1 — Tokenization and Hashing

When a query arrives, we tokenize it into words, then hash each token into a bounded range:

int hash(String word) {
  // MurmurHash3 is a stand-in for any stable string hash; Math.abs keeps the slot index non-negative.
  return Math.abs(MurmurHash3(word) % VECTOR_SIZE);
}

The disclosure’s example uses a vector size of 100 (and notes that its hash values are illustrative).

Why hashing?

  • Keeps memory bounded regardless of vocabulary growth
  • Avoids maintaining a global dictionary
  • Enables fast $O(1)$ updates into a fixed vector slot
  • Trade-off: Hashing can cause collisions. In practice, you manage this by tuning vector size and observing impact.
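
One piece Step 1 leaves open is the tokenizer itself. The following is an assumed, minimal implementation (lowercase, split on anything that is not a letter or digit); the later sketches reuse it.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class Tokenizer {
  static List<String> tokenize(String query) {
    return Arrays.stream(query.toLowerCase().split("[^a-z0-9]+"))
        .filter(t -> !t.isEmpty())
        .collect(Collectors.toList());
  }
}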

Step 2 — Learning: Updating Word Correlation Vectors

Here’s the learning rule:
For each query, for each pair of words, update the vector of the source word at the target word’s hash slot.

The update weight is inverse distance within the query:

float weight = 1.0f / distance;   // closer words (smaller distance) get a larger weight
vector[targetHash] += weight;     // accumulate into the target word's hash slot

So, closer words get stronger correlation updates.
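
Putting the rule into code, here is a sketch of the learning pass (it reuses the UserWordStore and Tokenizer sketches above; the private hash() is a stand-in for the Step 1 MurmurHash3 function):

import java.util.List;

class CorrelationLearner {
  private final UserWordStore store;

  CorrelationLearner(UserWordStore store) {
    this.store = store;
  }

  void learn(String query) {
    List<String> tokens = Tokenizer.tokenize(query);
    for (int i = 0; i < tokens.size(); i++) {
      float[] vector = store.getOrCreate(tokens.get(i));   // source word's vector
      for (int j = 0; j < tokens.size(); j++) {
        if (i == j) continue;
        int distance = Math.abs(i - j);                    // positional distance within the query
        vector[hash(tokens.get(j))] += 1.0f / distance;    // inverse-distance update at the target's slot
      }
    }
  }

  // Stand-in for the bounded hash from Step 1.
  private static int hash(String word) {
    return Math.abs(word.hashCode() % UserWordStore.VECTOR_SIZE);
  }
}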

Concrete Example (Intuition)

Query: “create oci object storage bucket”

  • If “create” is 1 position away from “oci”, distance = 1 implies weight = 1.0
  • If “create” is 4 positions away from “bucket”, distance = 4 implies weight = 0.25
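
Carrying the rule through the remaining pairs from “create” (my arithmetic, extending the disclosure’s two examples): oci += 1.0, object += 0.5, storage += 1/3 ≈ 0.33, and bucket += 0.25, each added at the target word’s hash slot in “create”’s vector.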

This builds a correlation profile that captures the structure of the query, not just the bag-of-words.

Why Inverse Distance?

Inverse distance is simple and explainable:

  1. It encodes that adjacency matters (“object storage” is a meaningful phrase).
  2. It is stable and fast.
  3. It avoids needing any ML model training.

Step 3 — Storage Behavior Over Time (Incremental Learning)

Each time a user searches:

  • New word -> create a new zero vector and start learning.
  • Existing word -> update its vector “in-place”.

Over time, the userWordMap becomes a compact summary of the user’s search behavior: which words tend to appear together (and how tightly). This is what enables personalization without storing full query history.


Serving: Re-Ranking Search Results Using User Context

Learning builds the user context. Serving uses it.

Serving Step 1 — Get Candidate Results

OpenSearch returns the top‑N results with its native relevance score (BM25, or whatever your stack uses). We do not replace OpenSearch; we treat it as the retrieval stage and add personalization afterward.

Serving Step 2 — Vectorize the Current Query and Result Titles

We tokenize:

  • The user’s current query
  • Each result’s title (or another short text field)

Then for each word token, we look up its vector from the userWordMap. At this point, we have vector sets representing:

  • Current query context
  • Each candidate result’s title context

Serving Step 3 — Cosine Similarity

We compute similarity between the query context and candidate result context using cosine similarity.

Cosine similarity is a natural fit because:

  • Vectors are sparse and high-dimensional.
  • Direction (co-occurrence pattern) matters more than raw magnitude.
  • It’s easy to implement and optimize.
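
A sketch of this serving-side scoring (reusing the UserWordStore and Tokenizer sketches above) follows. The disclosure leaves the exact aggregation open; this version sums each text’s word vectors into one context vector and then compares contexts by cosine.

class ContextScorer {
  private final UserWordStore store;

  ContextScorer(UserWordStore store) {
    this.store = store;
  }

  // Sum the learned vectors of every token the user has a vector for.
  float[] contextVector(String text) {
    float[] context = new float[UserWordStore.VECTOR_SIZE];
    for (String token : Tokenizer.tokenize(text)) {
      float[] v = store.get(token);
      if (v == null) continue;                     // unseen word: contributes nothing
      for (int i = 0; i < context.length; i++) {
        context[i] += v[i];
      }
    }
    return context;
  }

  static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
      return 0;                                    // e.g., cold start or all-unseen words
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}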

Serving Step 4 — Blend with OpenSearch Score

Final ranking score is a linear blend:

finalScore = α * openSearchScore + β * similarityScore

Example values:

  • α = 0.6
  • β = 0.4

Then we sort by finalScore and return the re-ranked results. This blending is important:

  • OpenSearch score preserves global relevance (query/document match).
  • Similarity score injects user-specific context.
  • α and β let you tune how strong personalization should be (a sketch of the blend follows).
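
Here is a sketch of the blend-and-sort step (the ScoredResult type and field names are my own; the constants match the example values above):

import java.util.Comparator;
import java.util.List;

class Blender {
  static final double ALPHA = 0.6;  // weight on the OpenSearch relevance score
  static final double BETA = 0.4;   // weight on the personalization similarity

  record ScoredResult(String title, double openSearchScore, double similarity) {
    // In practice you may want to normalize openSearchScore (e.g., min-max over
    // the top-N) before blending, since BM25 scores are unbounded.
    double finalScore() {
      return ALPHA * openSearchScore + BETA * similarity;
    }
  }

  // Sorts a mutable list of results by the blended score, highest first.
  static void rerank(List<ScoredResult> results) {
    results.sort(Comparator.comparingDouble(ScoredResult::finalScore).reversed());
  }
}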

Walkthrough Example: Why “oci object storage” Rises to the Top

The disclosure provides a scenario like:

  • Query history:
    • “put data to oci object storage”
    • “get data from oci object storage bucket”
  • Later query: “storage”

OpenSearch might return generic storage results:

  1. “File storage encryption”
  2. “persistent volume storage”
  3. “oci object storage”

After reranking, “oci object storage” rises because the user’s history strongly correlates oci -> object -> storage.

The “Aha” Moment: Even when the query is short and ambiguous (“storage”), the system can infer that this user often means “OCI object storage”.


Memory & Performance Characteristics

This design is intentionally lightweight.

Memory Optimization

Key properties from the implementation notes:

  • Only distinct words stored per user.
  • No redundant storage of duplicate queries.
  • Vector size is bounded and fixed.
  • Updates are sparse (most entries are zeros).
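
As an illustrative back-of-envelope figure (mine, not from the disclosure): with a vector size of 100, each stored word costs about 100 × 4 bytes = 400 bytes of vector data plus map overhead, so a user with 1,000 distinct search words stays on the order of half a megabyte.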

This is why the approach can be attractive where embeddings + external vector stores are too heavy.

Runtime Overhead

There is runtime overhead: one cosine similarity computation per candidate result. The disclosure notes this is mitigated by limiting reranking to the top‑N results.

In practice, you can also:

  • Only rerank top‑K (e.g., 20–50).
  • Cache vectorized titles.
  • Prune low-activity users.
  • Periodically decay old correlations (sketched below).
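
A decay pass can be as simple as the following sketch (the schedule and factor are my assumptions, not the disclosure’s):

import java.util.Map;

class Decay {
  // E.g., run nightly with factor = 0.9 so stale correlations fade over time.
  static void apply(Map<String, float[]> userWordMap, float factor) {
    for (float[] vector : userWordMap.values()) {
      for (int i = 0; i < vector.length; i++) {
        vector[i] *= factor;
      }
    }
    // Optionally drop words whose vectors have decayed to near zero to reclaim memory.
  }
}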

Strengths: Why This Works in Real Systems

  • No training pipeline: The model updates in real time from queries—no offline training loop, no model registry, no feature store.
  • No external vector store: Everything is stored as small per-user sparse vectors.
  • Explainable personalization: If a result rises, you can often explain it: “This user has historically searched oci with storage frequently.”
  • Works with existing OpenSearch: It’s a reranker, not a replacement. That reduces adoption friction.

Limitations & Practical Considerations

No architecture is free. The disclosure highlights several realistic constraints:

  • Cold start: If the user has no history, the similarity score will be weak. In that case, rely mostly on the OpenSearch score (α close to 1) and gradually increase β as history accumulates (see the adaptive-blend sketch at the end of this section).
  • Query quality dependency: If user queries are inconsistent or noisy, learned correlations may be noisy.
  • Runtime cost: Cosine similarity for each candidate adds computation. Limiting to top‑N helps.
  • Hash collisions: Hashing keeps memory bounded, but distinct words can collide into the same slot.

Mitigations:

  • Increase vector size (within memory budget).
  • Use multiple hashes (like a Bloom-style approach).
  • Monitor impact with offline evaluation.
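
For the cold-start case in particular, one option (an assumption layered on the α/β blend, not something the disclosure prescribes) is to ramp β up with the amount of learned vocabulary:

class AdaptiveBlend {
  // Returns β in [0, maxBeta]; use α = 1 - β, or keep α fixed and scale β only.
  static double beta(int distinctWordsLearned, double maxBeta, int rampWords) {
    double ramp = Math.min(1.0, distinctWordsLearned / (double) rampWords);
    return maxBeta * ramp;
  }
}

With maxBeta = 0.4 and rampWords = 50, for example, the blend only reaches the example values above once the user has roughly 50 distinct words in their word map (e.g., UserWordStore.distinctWords() from the earlier sketch).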