Conversation

@Ryzhtus Ryzhtus (Contributor) commented Jul 3, 2025

Related Issues

SentenceTransformers introduced support for sparse embedding models via the SparseEncoder class in v5.0.0. I thought it would be cool to support these in Haystack as well, since sparse models were previously available only through the FastEmbed integration (e.g. FastembedSparseTextEmbedder).

Proposed Changes:

Introduced two new embedder classes, plus a backend class that manages the underlying embedding models:

  • SentenceTransformersSparseTextEmbedder
  • SentenceTransformersSparseDocumentEmbedder
  • SentenceTransformersSparseEncoderEmbeddingBackend

How did you test it?

I added unit tests for both embedders.

Notes for the reviewer

Some tests are currently failing; I'd appreciate your support in resolving them. We'll also likely need to add documentation.

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@Ryzhtus Ryzhtus requested a review from a team as a code owner July 3, 2025 17:52
@Ryzhtus Ryzhtus requested review from vblagoje and removed request for a team July 3, 2025 17:52
@anakin87 anakin87 self-requested a review July 4, 2025 05:55
@anakin87 anakin87 (Member) commented Jul 4, 2025

Hello and thanks for this idea!

I think it's a big topic and will probably require some work.

Some high-level notes:

  1. I would create a completely separate _SentenceTransformersSparseEmbeddingBackendFactory as we do for FastEmbed.
  2. Let's try to fit the returned sparse embedding into the existing Haystack SparseEmbedding dataclass (see the sketch after this list).
  3. Let's add tests for the backend and integration tests for the two embedders.
  4. If you share a script or a raw Colab notebook with an end-to-end example, it would help validate and review the implementation.
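For point 2, here is the shape of that dataclass (a minimal sketch; SparseEmbedding lives in haystack.dataclasses and just holds parallel indices/values lists):

from haystack.dataclasses import SparseEmbedding

# A sparse vector with non-zero weights at token ids 11 and 4096
embedding = SparseEmbedding(indices=[11, 4096], values=[0.83, 0.27])

# Round-trips through plain dicts, which is handy for serialization
assert SparseEmbedding.from_dict(embedding.to_dict()).indices == [11, 4096]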

@anakin87 anakin87 (Member) commented:

Hey, ping me when you need another review.

In the meantime, feel free to:

  • fix failing tests
  • put the computed embedding into the Document.sparse_embedding attribute

💙

@Ryzhtus Ryzhtus (Contributor, Author) commented Aug 11, 2025

Hey @anakin87, sure. Thank you, I'll ping you when this PR is ready for review. I'll probably manage to finish it this week if there are no urgent tasks at work.

@Ryzhtus Ryzhtus (Contributor, Author) commented Aug 22, 2025

@anakin87 Hey, I think it's finished. I ran the tests locally and they passed. However, could you please help me with the formatting? Something strange is happening on my side: the format checks in CI found many errors, though when I ran hatch run fmt locally, all checks passed.

And, just in case, here's a code snippet to check that the new sparse models work:

from haystack.components.embedders import (
    SentenceTransformersSparseDocumentEmbedder,
    SentenceTransformersSparseTextEmbedder,
)
from haystack.dataclasses import Document
from haystack.utils.device import ComponentDevice

document_list = [
    Document(
        content="Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint destruction. Radical species with oxidative activity, including reactive nitrogen species, represent mediators of inflammation and cartilage damage.",
        meta={
            "pubid": "25,445,628",
            "long_answer": "yes",
        },
    ),
    Document(
        content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion and actions are still poorly understood.",
        meta={
            "pubid": "25,445,712",
            "long_answer": "yes",
        },
    ),
]

document_embedder = SentenceTransformersSparseDocumentEmbedder(device=ComponentDevice.from_str("cpu"))
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(document_list)["documents"]

for doc in documents_with_embeddings:
    print(f"Document Text: {doc.content}")
    print(f"Document Sparse Embedding: {doc.embedding.to_dict()}")

@anakin87 anakin87 (Member) commented:

I'll take a look in the next few days. @Ryzhtus please ping me if I forget to do that.

@coveralls coveralls (Collaborator) commented:

Pull Request Test Coverage Report for Build 17290431721

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 92.158%

Totals Coverage Status

  • Change from base Build 17289305803: 0.05%
  • Covered Lines: 13044
  • Relevant Lines: 14154

💛 - Coveralls

@anakin87 anakin87 (Member) left a comment

I fixed the formatting and left a few comments.

Please also adjust types. You can run mypy locally with hatch run test:types.

(Reminder to myself: if we add integration tests, follow the process for slow/unstable)

Comment on lines +68 to +70
    def get_embedding_backend(  # pylint: disable=too-many-positional-arguments
        model: str,
        device: Optional[str] = None,

Suggested change
-    def get_embedding_backend(  # pylint: disable=too-many-positional-arguments
-        model: str,
-        device: Optional[str] = None,
+    def get_embedding_backend(
+        *,
+        model: str,
+        device: Optional[str] = None,

could you explore using only keyword args? This would probably imply updating some other code.
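For illustration, a bare * makes every argument keyword-only (a sketch, not code from this PR; the model name is just an example):

from typing import Optional

def get_embedding_backend(*, model: str, device: Optional[str] = None) -> None:
    # Stand-in body; the real method builds and caches a backend.
    print(model, device)

get_embedding_backend(model="naver/splade-cocondenser-ensembledistil", device="cpu")  # OK
# get_embedding_backend("naver/splade-cocondenser-ensembledistil")  # TypeError: takes 0 positional arguments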

    Class to manage Sparse embeddings from Sentence Transformers.
    """

    def __init__(  # pylint: disable=too-many-positional-arguments

let's use keyword args if possible

Comment on lines +171 to +183
    def embed(self, data: list[str], **kwargs) -> list[SparseEmbedding]:
        embeddings = self.model.encode(data, **kwargs).coalesce()

        rows, columns = embeddings.indices()
        values = embeddings.values()
        batch_size = embeddings.size(0)

        sparse_embeddings: list[SparseEmbedding] = []
        for embedding in range(batch_size):
            mask = rows == embedding
            embedding_columns = columns[mask].tolist()
            embedding_values = values[mask].tolist()
            sparse_embeddings.append(SparseEmbedding(indices=embedding_columns, values=embedding_values))

  1. Please add some high-level comments to explain this code (see the sketch below for my reading of it)
  2. We need to test it
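To double-check my reading, here is the same conversion on a toy dense matrix (a self-contained sketch; the numbers and vocabulary size are made up):

import torch

from haystack.dataclasses import SparseEmbedding

# Pretend encode() returned this batch: doc 0 has weights at token ids 1 and 3,
# doc 1 only at token id 4 (vocabulary size 5).
dense = torch.tensor([
    [0.0, 0.5, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9],
])
embeddings = dense.to_sparse().coalesce()

# indices() is a (2, nnz) tensor: row 0 holds the document index of each
# non-zero entry, row 1 the token id; values() holds the matching weights.
rows, columns = embeddings.indices()
values = embeddings.values()

# Group the non-zero entries per document with a boolean mask on the row ids.
for i in range(embeddings.size(0)):
    mask = rows == i
    print(SparseEmbedding(indices=columns[mask].tolist(), values=values[mask].tolist()).to_dict())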

@@ -50,10 +52,51 @@ def get_embedding_backend(  # pylint: disable=too-many-positional-arguments
            config_kwargs=config_kwargs,
            backend=backend,
        )

        _SentenceTransformersEmbeddingBackendFactory._instances[embedding_backend_id] = embedding_backend
        return embedding_backend


I would prefer to put the new classes and logic in a new module: sentence_transformers_sparse_backend.py

Comment on lines +227 to +228
        for doc, emb in zip(documents, embeddings):
            doc.embedding = emb

Suggested change
-        for doc, emb in zip(documents, embeddings):
-            doc.embedding = emb
+        for doc, emb in zip(documents, embeddings):
+            doc.sparse_embedding = emb

Let's put the sparse embedding in the corresponding field and update docstrings as needed.

        if self.tokenizer_kwargs and self.tokenizer_kwargs.get("model_max_length"):
            self.embedding_backend.model.max_seq_length = self.tokenizer_kwargs["model_max_length"]

    @component.output_types(embedding=list[float])

Suggested change
-    @component.output_types(embedding=list[float])
+    @component.output_types(sparse_embedding=SparseEmbedding)

            show_progress_bar=self.progress_bar,
            **(self.encode_kwargs if self.encode_kwargs else {}),
        )[0]
        return {"embedding": embedding}

Suggested change
-        return {"embedding": embedding}
+        return {"sparse_embedding": embedding}

@@ -5,9 +5,11 @@
from unittest.mock import patch

import pytest
import torch


Let's create a new module: test_sentence_transformers_sparse_embedding_backend.py
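It could start from something like this (a sketch; the module path and patch target assume SparseEncoder is imported inside the new sentence_transformers_sparse_backend.py):

from unittest.mock import patch

from haystack.components.embedders.backends.sentence_transformers_sparse_backend import (
    _SentenceTransformersSparseEmbeddingBackendFactory,
)

@patch("haystack.components.embedders.backends.sentence_transformers_sparse_backend.SparseEncoder")
def test_factory_returns_same_backend_for_same_parameters(mock_sparse_encoder):
    # The factory caches backends, so the model should only be loaded once.
    backend_1 = _SentenceTransformersSparseEmbeddingBackendFactory.get_embedding_backend(model="my-model")
    backend_2 = _SentenceTransformersSparseEmbeddingBackendFactory.get_embedding_backend(model="my-model")

    assert backend_1 is backend_2
    mock_sparse_encoder.assert_called_once()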

        tokenizer_kwargs=None,
        config_kwargs=None,
        backend="torch",
    )

let's add a single integration test, e.g.:
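(a sketch; the model is just an example SPLADE checkpoint, and the assertions assume the sparse_embedding output discussed above)

import pytest

from haystack.components.embedders import SentenceTransformersSparseTextEmbedder

@pytest.mark.integration
def test_sparse_text_embedder_run():
    embedder = SentenceTransformersSparseTextEmbedder(model="naver/splade-cocondenser-ensembledistil")
    embedder.warm_up()

    result = embedder.run(text="sparse neural retrieval")
    sparse_embedding = result["sparse_embedding"]

    # A SPLADE vector should produce a handful of non-zero term weights.
    assert len(sparse_embedding.indices) == len(sparse_embedding.values) > 0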

        tokenizer_kwargs=None,
        config_kwargs=None,
        backend="torch",
    )

let's add a single integration test here as well (same as above)
