Conversation

@Ryzhtus Ryzhtus (Contributor) commented Jul 3, 2025

Related Issues

SentenceTransformers introduced support for sparse embedding models via the SparseEncoder class in v5.0.0. I thought it would be cool to support these in Haystack as well, since sparse models were previously available only through the FastEmbed integration (e.g. FastembedSparseTextEmbedder).

Proposed Changes:

Introduced two new embedder classes, plus a backend class that manages the underlying embedding models:

  • SentenceTransformersSparseTextEmbedder
  • SentenceTransformersSparseDocumentEmbedder
  • SentenceTransformersSparseEncoderEmbeddingBackend

How did you test it?

I added unit tests for both embedders.

Notes for the reviewer

Some tests are currently failing; I'd appreciate your support in resolving them. We'll also likely need to add documentation.

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@Ryzhtus Ryzhtus requested a review from a team as a code owner July 3, 2025 17:52
@Ryzhtus Ryzhtus requested review from vblagoje and removed request for a team July 3, 2025 17:52
@anakin87 anakin87 self-requested a review July 4, 2025 05:55
@anakin87 anakin87 (Member) commented Jul 4, 2025

Hello and thanks for this idea!

I think it's a big topic and will probably require some work.

Some high-level notes:

  1. I would create a completely separate _SentenceTransformersSparseEmbeddingBackendFactory as we do for FastEmbed.
  2. Let's try to fit the returned sparse embedding into the existing Haystack SparseEmbedding dataclass (see the sketch after this list).
  3. Let's add tests for the backend and integration tests for the two embedders.
  4. If you share a script or a raw Colab notebook with an end-to-end example, it would help validate and review the implementation.
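For point 2, here is the shape of that dataclass (a minimal sketch; SparseEmbedding lives in haystack.dataclasses and just holds parallel indices/values lists):

from haystack.dataclasses import SparseEmbedding

# A sparse vector with non-zero weights at token ids 11 and 4096
embedding = SparseEmbedding(indices=[11, 4096], values=[0.83, 0.27])

# Round-trips through plain dicts, which is handy for serialization
assert SparseEmbedding.from_dict(embedding.to_dict()).indices == [11, 4096]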

@anakin87 anakin87 (Member) commented:

Hey, ping me when you need another review.

In the meantime, feel free to:

  • fix failing tests
  • put the computed embedding into the Document.sparse_embedding attribute

💙

@Ryzhtus Ryzhtus (Contributor, Author) commented Aug 11, 2025

Hey @anakin87, sure. Thank you, I'll ping you when this PR is ready for review. I'll probably manage to finish it this week if there are no urgent tasks at work.

@Ryzhtus Ryzhtus (Contributor, Author) commented Aug 22, 2025

@anakin87 Hey, I think it's finished. I ran the tests locally and they passed. However, could you please help me with the formatting? Something strange is happening on my side: the format checks in CI found many errors, though when I ran hatch run fmt locally, all checks passed.

And, just in case, here's a code snippet to check that the new sparse models work:

from haystack.components.embedders import (
    SentenceTransformersSparseDocumentEmbedder,
    SentenceTransformersSparseTextEmbedder,
)
from haystack.dataclasses import Document
from haystack.utils.device import ComponentDevice

document_list = [
    Document(
        content="Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint destruction. Radical species with oxidative activity, including reactive nitrogen species, represent mediators of inflammation and cartilage damage.",
        meta={
            "pubid": "25,445,628",
            "long_answer": "yes",
        },
    ),
    Document(
        content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion and actions are still poorly understood.",
        meta={
            "pubid": "25,445,712",
            "long_answer": "yes",
        },
    ),
]

document_embedder = SentenceTransformersSparseDocumentEmbedder(device=ComponentDevice.from_str("cpu"))
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(document_list)["documents"]

for doc in documents_with_embeddings:
    print(f"Document Text: {doc.content}")
    print(f"Document Sparse Embedding: {doc.embedding.to_dict()}")

@anakin87 anakin87 (Member) commented:

I'll take a look in the next few days. @Ryzhtus please ping me if I forget to do that.

@coveralls coveralls (Collaborator) commented:

Pull Request Test Coverage Report for Build 17290431721

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 92.158%

Totals Coverage Status

  • Change from base Build 17289305803: 0.05%
  • Covered Lines: 13044
  • Relevant Lines: 14154

💛 - Coveralls

@anakin87 anakin87 (Member) left a comment

I fixed the formatting and left a few comments.

Please also adjust types. You can run mypy locally with hatch run test:types.

(Reminder to myself: if we add integration tests, follow the process for slow/unstable)

Comment on lines +68 to +70
    def get_embedding_backend(  # pylint: disable=too-many-positional-arguments
        model: str,
        device: Optional[str] = None,

Suggested change
-    def get_embedding_backend(  # pylint: disable=too-many-positional-arguments
-        model: str,
-        device: Optional[str] = None,
+    def get_embedding_backend(
+        *,
+        model: str,
+        device: Optional[str] = None,

could you explore using only keyword args? This would probably imply updating some other code.
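For illustration, a bare * makes every argument keyword-only (a sketch, not code from this PR; the model name is just an example):

from typing import Optional

def get_embedding_backend(*, model: str, device: Optional[str] = None) -> None:
    # Stand-in body; the real method builds and caches a backend.
    print(model, device)

get_embedding_backend(model="naver/splade-cocondenser-ensembledistil", device="cpu")  # OK
# get_embedding_backend("naver/splade-cocondenser-ensembledistil")  # TypeError: takes 0 positional arguments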

    Class to manage Sparse embeddings from Sentence Transformers.
    """

    def __init__(  # pylint: disable=too-many-positional-arguments

let's use keyword args if possible

Comment on lines +171 to +183
    def embed(self, data: list[str], **kwargs) -> list[SparseEmbedding]:
        embeddings = self.model.encode(data, **kwargs).coalesce()

        rows, columns = embeddings.indices()
        values = embeddings.values()
        batch_size = embeddings.size(0)

        sparse_embeddings: list[SparseEmbedding] = []
        for embedding in range(batch_size):
            mask = rows == embedding
            embedding_columns = columns[mask].tolist()
            embedding_values = values[mask].tolist()
            sparse_embeddings.append(SparseEmbedding(indices=embedding_columns, values=embedding_values))

  1. Please add some high-level comments to explain this code (see the sketch below for my reading of it)
  2. We need to test it
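To double-check my reading, here is the same conversion on a toy dense matrix (a self-contained sketch; the numbers and vocabulary size are made up):

import torch

from haystack.dataclasses import SparseEmbedding

# Pretend encode() returned this batch: doc 0 has weights at token ids 1 and 3,
# doc 1 only at token id 4 (vocabulary size 5).
dense = torch.tensor([
    [0.0, 0.5, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9],
])
embeddings = dense.to_sparse().coalesce()

# indices() is a (2, nnz) tensor: row 0 holds the document index of each
# non-zero entry, row 1 the token id; values() holds the matching weights.
rows, columns = embeddings.indices()
values = embeddings.values()

# Group the non-zero entries per document with a boolean mask on the row ids.
for i in range(embeddings.size(0)):
    mask = rows == i
    print(SparseEmbedding(indices=columns[mask].tolist(), values=values[mask].tolist()).to_dict())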

@@ -50,10 +52,51 @@ def get_embedding_backend(  # pylint: disable=too-many-positional-arguments
            config_kwargs=config_kwargs,
            backend=backend,
        )

        _SentenceTransformersEmbeddingBackendFactory._instances[embedding_backend_id] = embedding_backend
        return embedding_backend


I would prefer to put the new classes and logic in a new module: sentence_transformers_sparse_backend.py

Comment on lines +227 to +228
        for doc, emb in zip(documents, embeddings):
            doc.embedding = emb

Suggested change
-        for doc, emb in zip(documents, embeddings):
-            doc.embedding = emb
+        for doc, emb in zip(documents, embeddings):
+            doc.sparse_embedding = emb

Let's put the sparse embedding in the corresponding field and update docstrings as needed.

        if self.tokenizer_kwargs and self.tokenizer_kwargs.get("model_max_length"):
            self.embedding_backend.model.max_seq_length = self.tokenizer_kwargs["model_max_length"]

    @component.output_types(embedding=list[float])

Suggested change
-    @component.output_types(embedding=list[float])
+    @component.output_types(sparse_embedding=SparseEmbedding)

            show_progress_bar=self.progress_bar,
            **(self.encode_kwargs if self.encode_kwargs else {}),
        )[0]
        return {"embedding": embedding}

Suggested change
-        return {"embedding": embedding}
+        return {"sparse_embedding": embedding}

@@ -5,9 +5,11 @@
from unittest.mock import patch

import pytest
import torch


Let's create a new module: test_sentence_transformers_sparse_embedding_backend.py
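It could start from something like this (a sketch; the module path and patch target assume SparseEncoder is imported inside the new sentence_transformers_sparse_backend.py):

from unittest.mock import patch

from haystack.components.embedders.backends.sentence_transformers_sparse_backend import (
    _SentenceTransformersSparseEmbeddingBackendFactory,
)

@patch("haystack.components.embedders.backends.sentence_transformers_sparse_backend.SparseEncoder")
def test_factory_returns_same_backend_for_same_parameters(mock_sparse_encoder):
    # The factory caches backends, so the model should only be loaded once.
    backend_1 = _SentenceTransformersSparseEmbeddingBackendFactory.get_embedding_backend(model="my-model")
    backend_2 = _SentenceTransformersSparseEmbeddingBackendFactory.get_embedding_backend(model="my-model")

    assert backend_1 is backend_2
    mock_sparse_encoder.assert_called_once()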

        tokenizer_kwargs=None,
        config_kwargs=None,
        backend="torch",
    )

let's add a single integration test, e.g.:
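(a sketch; the model is just an example SPLADE checkpoint, and the assertions assume the sparse_embedding output discussed above)

import pytest

from haystack.components.embedders import SentenceTransformersSparseTextEmbedder

@pytest.mark.integration
def test_sparse_text_embedder_run():
    embedder = SentenceTransformersSparseTextEmbedder(model="naver/splade-cocondenser-ensembledistil")
    embedder.warm_up()

    result = embedder.run(text="sparse neural retrieval")
    sparse_embedding = result["sparse_embedding"]

    # A SPLADE vector should produce a handful of non-zero term weights.
    assert len(sparse_embedding.indices) == len(sparse_embedding.values) > 0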

        tokenizer_kwargs=None,
        config_kwargs=None,
        backend="torch",
    )

let's add a single integration test here as well (same as above)
