LCORE-178 Token counting #215
base: main
Conversation
* Returns token usage data in query and streaming query responses
* llama-stack currently doesn't include usage in the Agent API
* We estimate locally with an embedding model (see the sketch below)
* Accuracy will vary, but it should be nearly 100% for GPT models.
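For context, a minimal sketch of this kind of local estimation using tiktoken's cl100k_base encoding is shown below; the function name and example strings are illustrative rather than the PR's actual implementation.

```python
import tiktoken


def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Roughly estimate how many tokens `text` would use for a GPT-style model."""
    if not text:
        return 0
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


# Example: estimated input tokens for a system prompt plus a user query.
system_prompt = "You are a helpful assistant."
query = "How do I check token usage?"
print(estimate_tokens(system_prompt) + estimate_tokens(query))
```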
Review skipped: review was skipped due to path filters. Files ignored due to path filters (1).
Walkthrough
Token usage tracking was integrated into the query and streaming query endpoints, backed by new token counting utilities.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant API
participant TokenCounter
participant LLM
Client->>API: Send query request
API->>TokenCounter: Estimate input tokens
API->>LLM: Forward query
LLM-->>API: Return response (may not include usage)
API->>TokenCounter: Estimate output tokens (if needed)
API-->>Client: Return response with input_tokens, output_tokens
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/utils/token_counter.py (3)
56-63: Consider more robust configuration access. The configuration access pattern configuration.configuration.default_estimation_tokenizer creates a tight coupling and could fail if the configuration structure changes. Consider using a more defensive approach:

- fallback_encoding = configuration.configuration.default_estimation_tokenizer
+ try:
+     fallback_encoding = configuration.configuration.default_estimation_tokenizer
+ except AttributeError:
+     fallback_encoding = "cl100k_base"  # Safe default
138-139: Hardcoded overhead values may not be universally accurate. The role formatting overhead of 4 tokens is a reasonable estimate for OpenAI models, but may not be accurate for all model types. Consider making this configurable or model-specific.

- # Add role overhead
- role_formatting_overhead = 4
+ # Add role overhead (varies by model, 4 is typical for OpenAI models)
+ role_formatting_overhead = 4  # TODO: Make this model-specific
159-161: Document the overhead calculation rationale. The conversation overhead calculation logic is reasonable, but the specific values (3 for base overhead, 1 for separators) should be documented or referenced to their source.

- base_overhead = 3  # Start of conversation
- separator_overhead = max(0, (message_count - 1) * 1)  # Between messages
+ base_overhead = 3  # Start of conversation tokens (based on OpenAI chat format)
+ separator_overhead = max(0, (message_count - 1) * 1)  # Message separator tokens
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
- docs/openapi.json (1 hunks)
- lightspeed-stack.yaml (1 hunks)
- pyproject.toml (1 hunks)
- src/app/endpoints/config.py (1 hunks)
- src/app/endpoints/query.py (5 hunks)
- src/app/endpoints/streaming_query.py (7 hunks)
- src/models/config.py (1 hunks)
- src/models/responses.py (1 hunks)
- src/utils/token_counter.py (1 hunks)
- tests/unit/app/endpoints/test_config.py (1 hunks)
- tests/unit/app/endpoints/test_info.py (1 hunks)
- tests/unit/app/endpoints/test_models.py (3 hunks)
- tests/unit/app/endpoints/test_query.py (19 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (19 hunks)
- tests/unit/models/test_config.py (4 hunks)
- tests/unit/test_configuration.py (4 hunks)
- tests/unit/utils/test_token_counter.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
tests/unit/app/endpoints/test_query.py (1)
- src/app/endpoints/query.py (1): retrieve_response (204-281)

tests/unit/app/endpoints/test_streaming_query.py (2)
- src/app/endpoints/streaming_query.py (1): retrieve_response (282-364)
- tests/unit/app/endpoints/test_query.py (2): MockShield (380-382), MockShield (423-425)

src/utils/token_counter.py (1)
- src/configuration.py (1): configuration (47-52)
🔇 Additional comments (35)
src/models/config.py (1)
157-157: LGTM: Configuration field addition is well-implemented. The default_estimation_tokenizer field addition is consistent with the token counting feature. The default value "cl100k_base" is the standard tiktoken encoding for GPT-3.5-turbo and GPT-4 models.

lightspeed-stack.yaml (1)
25-25: LGTM: Configuration key addition is consistent. The default_estimation_tokenizer configuration key addition aligns with the Configuration model and maintains consistency across the codebase.

tests/unit/app/endpoints/test_info.py (1)
26-26: LGTM: Test configuration update is necessary. The test configuration dictionary update is required to match the updated Configuration model that now includes the default_estimation_tokenizer field.

src/app/endpoints/config.py (1)
49-49: LGTM: Response example update maintains consistency. The addition of default_estimation_tokenizer to the response example maintains consistency with the updated Configuration model and ensures accurate API documentation.

tests/unit/app/endpoints/test_config.py (1)
47-47: LGTM! Test configuration updated for new tokenizer field. The addition of default_estimation_tokenizer to the test configuration aligns with the new token counting functionality and maintains test consistency.

tests/unit/test_configuration.py (1)
53-53: LGTM! Consistent test configuration updates. The addition of default_estimation_tokenizer to all test configuration dictionaries ensures consistent test coverage for the new token counting configuration field. Also applies to: 116-116, 224-224, 261-261
src/models/responses.py (1)
29-30: LGTM! Token counting fields properly added to response model. The addition of input_tokens and output_tokens as optional integer fields correctly implements the token counting feature described in the PR objectives. The field types and documentation are appropriate. Also applies to: 35-36

tests/unit/app/endpoints/test_models.py (1)
49-49: LGTM! Test configurations updated for token counting feature. The addition of default_estimation_tokenizer to all test configuration dictionaries maintains consistency with the new configuration schema for token counting functionality. Also applies to: 88-88, 122-122

tests/unit/models/test_config.py (1)
324-324: LGTM! Configuration model tests updated for new tokenizer field. The addition of default_estimation_tokenizer to both the Configuration constructor calls and the expected JSON assertions ensures proper test coverage for the new token counting configuration field. Also applies to: 379-379, 454-454, 547-547

docs/openapi.json (1)
846-875: Schema changes look correct and well-structured. The new optional input_tokens and output_tokens fields are properly defined with appropriate union types allowing integer or null values. The updated description clearly explains what these fields represent.

tests/unit/utils/test_token_counter.py (1)
1-68: Comprehensive test coverage for TokenCounter utility. The test suite provides excellent coverage of the TokenCounter functionality:
- Tests edge cases (empty strings, empty message lists)
- Validates both string and message token counting
- Uses realistic configuration and message types
- Includes proper token overhead calculations
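For readers unfamiliar with the underlying library, the edge cases above boil down to assertions of roughly this shape (a tiktoken-level sketch, not the actual contents of tests/unit/utils/test_token_counter.py):

```python
import tiktoken


def test_empty_string_counts_zero_tokens():
    # Empty input should contribute no content tokens.
    encoding = tiktoken.get_encoding("cl100k_base")
    assert len(encoding.encode("")) == 0


def test_simple_string_counts_positive_tokens():
    # Any non-empty string should produce at least one token.
    encoding = tiktoken.get_encoding("cl100k_base")
    assert len(encoding.encode("Hello, world!")) > 0
```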
tests/unit/app/endpoints/test_query.py (4)
47-47: Configuration updated to include default tokenizer. The addition of "default_estimation_tokenizer": "cl100k_base" to the test configuration aligns with the new token counting feature requirements.
122-127: Mock return value updated to include token usage. The mock is correctly updated to return a tuple with token usage counts, matching the new retrieve_response signature.
322-329: Test properly validates token usage response. The test correctly unpacks the token usage dictionary and validates that both input and output tokens are greater than zero, ensuring the token counting functionality works as expected.
303-304: Mock agent setup includes proper output_message specification. The mock properly specifies the output_message attribute to ensure compatibility with the updated agent interface used in the token counting logic.

src/app/endpoints/query.py (7)
34-34: Import added for token counting utility. The import of get_token_counter provides access to the token estimation functionality required for the new feature.
110-116: Query handler updated to include token usage. The endpoint handler correctly unpacks the token usage information from retrieve_response and passes it to the response constructor, maintaining clean separation of concerns.
133-138: QueryResponse updated with token usage fields. The response object now includes the token usage metrics, providing clients with visibility into LLM token consumption for cost tracking and optimization.
196-201: Good refactoring with _build_toolgroups helper function. The extraction of toolgroup building logic into a separate helper function improves code organization and readability while maintaining the same functionality.
210-210: Return type signature updated to include token usage. The function signature correctly reflects the new return value structure, improving type safety and documentation.
262-280: Robust token usage estimation with proper error handling. The implementation includes:
- Clear comment explaining the current limitation of the API
- Proper exception handling with broad catch to prevent service disruption
- Fallback to zero counts when estimation fails
- Informative warning logging for debugging
This approach ensures service reliability while providing useful token usage data when possible.
281-281: Updated return statement includes token usage. The return statement correctly includes the token usage dictionary, completing the implementation of the new feature.
tests/unit/app/endpoints/test_streaming_query.py (4)
144-145: Good simulation of API behavior. The explicit deletion of the usage attribute effectively simulates the current API behavior where usage data is not returned, preventing pytest from returning a Mock object. This is a clean way to test the fallback token estimation logic.
186-191: Correct mock update for token usage. The mock is properly updated to return token usage data alongside the response and conversation ID, matching the new function signature. The hardcoded values (10 input tokens, 20 output tokens) are appropriate for testing.
273-280: Consistent mock setup pattern. The addition of spec=["output_message"] to constrain the mock interface and the mock_client.models.list setup is consistent across all test functions. This ensures proper mocking of the model selection functionality.
298-306: Proper handling of updated function signature. The test correctly unpacks the third return value token_usage from retrieve_response and asserts that input tokens are greater than zero, which validates the token counting functionality.

src/utils/token_counter.py (1)
164-177: Appropriate use of caching for performance. The LRU cache with maxsize=8 is well-suited for this use case, as it avoids repeated tokenizer initialization while keeping memory usage reasonable. The cache size should handle typical model variety in production.
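For reference, a cached tokenizer factory along the lines being praised here might look like the sketch below; the class layout and fallback behavior are assumptions, not the exact contents of src/utils/token_counter.py.

```python
from functools import lru_cache

import tiktoken


class TokenCounter:
    """Estimate token counts for a given model using tiktoken."""

    def __init__(self, model_id: str, fallback_encoding: str = "cl100k_base") -> None:
        try:
            # Use the model-specific encoding when tiktoken recognizes the model.
            self.encoder = tiktoken.encoding_for_model(model_id)
        except KeyError:
            # Unknown model: fall back to a sensible default encoding.
            self.encoder = tiktoken.get_encoding(fallback_encoding)

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text)) if text else 0


@lru_cache(maxsize=8)
def get_token_counter(model_id: str) -> TokenCounter:
    """Return a cached TokenCounter so tokenizers are initialized once per model."""
    return TokenCounter(model_id)
```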
src/app/endpoints/streaming_query.py (8)
27-27: Proper import of token counting utility. The import is correctly placed and follows the established import organization pattern in the file.
98-105: Improved function signature with actual metrics. The updated stream_end_event function signature now accepts metrics_map and uses actual token counts instead of hardcoded zeros. This is a significant improvement for providing accurate usage data to clients.
120-121: Correct usage of metrics data. The function now properly extracts input_tokens and output_tokens from the metrics map with appropriate fallback values, replacing the previous hardcoded zeros.
208-214: Proper unpacking of token usage data. The code correctly unpacks the third return value token_usage from retrieve_response, maintaining compatibility with the updated function signature.
238-245: Robust error handling for token estimation. The try-catch block properly handles potential failures in token counting while logging appropriate messages. The fallback to 0 tokens ensures the system continues to function even when estimation fails.
247-251: Clean metrics construction and usage. The metrics map construction clearly separates input tokens (from initial estimation) and output tokens (from streaming calculation), then passes both to the end event. This maintains a clean separation of concerns.
288-288: Updated function signature reflects new functionality. The retrieve_response function signature is correctly updated to return the token usage dictionary as the third tuple element, maintaining backward compatibility for the first two elements.
353-362: Appropriate error handling for input token estimation. The token counting logic properly handles the case where the API doesn't provide usage data by estimating input tokens using the system prompt and query. The error handling ensures graceful degradation with appropriate logging.
I guess this is an approximation in the absence of Token counting in llama-stack.
The problem is that we enable llama-stack session persistence in order to have a conversation. Therefore the prompt sent to the LLM by llama-stack will include system_prompt, query_request.query and the previous queries relating to the session/conversation. This is further compounded if, for example, RAG or MCP Tool Calls are included in the prompt sent to the LLM by llama-stack.
I suspect the more User messages in a conversation, and the use of RAG and MCP, the more inaccurate the approach in this PR becomes.
Do you know if llama-stack has plans to support Token counting? Their InferenceApi.completion returns a CompletionResponse that includes Token metrics, however it seems this currently gets lost when using an Agent 😢
It might be worth finding out their plans and keeping this as a temporary solution.
WDYT?
Ah yes, I expect I will see this when I look at it using a front-end. :)
I could not find any public open issues related to this, but our plan was to propose it.
Yes; I initially thought this would be trivial because I'd seen the usage states in the Inference API.
This is 100% expected to be a temporary solution to get us closer to parity with road-core. You can review the internal discussion linked from OSPRH-17603 where this is framed more clearly as a stop-gap.
AFAICT Michael is right: to get an accurate count I actually need to include the entire message history as input on each turn, not just the new inputs. We don't currently cache this anywhere in lightspeed-stack. There is a llama-stack session API which looks like it would return them to us, but that is a bit absurd, I think; perhaps less so if we were to use llama-stack-as-library?

We briefly discussed the ultimate limitations of this (like the context size of the LLM), persistence of the session, etc., and it's not clear that there are any limitations on the llama-stack side, or how they are dealing with the context limit. We do ask for persistence of the sessions, so at least it's not strictly memory bound, but some more reading may be required in order to figure out how this is ultimately limited.

Some links to relevant parts (thanks to Michael for the digging):
- Adding previous messages to the current turn: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L223
- Passing all messages to the turn loop: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L277
- Passing all messages to the InferenceApi: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L513

I'd considered a local cache of the Message history, but could probably get by with just keeping a running count of input tokens by conversation_id and adding to it on each turn. Any other suggestions on this topic?
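A minimal sketch of that running-count idea, with an illustrative store and function name; it deliberately ignores eviction and concurrency, which come up later in this thread:

```python
# conversation_id -> cumulative input tokens estimated so far
_input_token_totals: dict[str, int] = {}


def add_turn_input_tokens(conversation_id: str, turn_input_tokens: int) -> int:
    """Add this turn's estimated input tokens to the running total and return it."""
    total = _input_token_totals.get(conversation_id, 0) + turn_input_tokens
    _input_token_totals[conversation_id] = total
    return total
```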
Overall it looks good. As @manstis wrote, the actual TokenCounter implementation will be a bit more complicated, and will need to be updated when conversation history, RAG, and MCP are in place. Just some nitpicks from my side.
lightspeed-stack.yaml (Outdated)
@@ -22,3 +22,4 @@ user_data_collection:
  transcripts_storage: "/tmp/data/transcripts"
authentication:
  module: "noop"
default_estimation_tokenizer: "cl100k_base"
Probably worth setting it up in constants.py and allowing it to be overridden by anything else in the customization block? WDYT?
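One possible shape for this suggestion (a sketch only; the constant and helper names are assumptions, not the final code):

```python
# constants.py
# Default tiktoken encoding used for token estimation when no override is configured.
DEFAULT_ESTIMATION_TOKENIZER = "cl100k_base"


# configuration lookup (illustrative): prefer the customization block, else the constant
def resolve_estimation_tokenizer(customization) -> str:
    return getattr(customization, "default_estimation_tokenizer", None) or DEFAULT_ESTIMATION_TOKENIZER
```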
@@ -187,13 +193,21 @@ def select_model_id(models: ModelListResponse, query_request: QueryRequest) -> s
    return model_id


def _build_toolgroups(client: LlamaStackClient) -> list[Toolgroup] | None:
is it part of this PR?
It was to alleviate a "too many local variables" error that appeared while I was fixing merge conflicts.
Probably we can just # pylint: disable=too-many-locals for this
src/models/config.py (Outdated)
@@ -154,6 +154,7 @@ class Configuration(BaseModel):
        AuthenticationConfiguration()
    )
    customization: Optional[Customization] = None
    default_estimation_tokenizer: str = "cl100k_base"
Ditto, the constant might be stored in constants.py with some comments about what it means.
src/models/responses.py (Outdated)
    """

    conversation_id: Optional[str] = None
    response: str
    input_tokens: Optional[int] = None
AFAIK it is not optional: you set 0 for any exceptions, so it's still int?
Thank you
I changed the status to Draft pending additions of history counting and at least some comments about where to add MCP and RAG related counting.
* Token counts consider full conversation history
* Token counts not defined as optional
* Added default_estimation_tokenizer to constants and customization
I've implemented the estimation for message history as well as attachments, and addressed the other feedback. AFAICT so far, I think it would only really be possible to count tokens in RAG chunks and MCP queries from the llama-stack side; please clue me in if I've misunderstood. Ultimately I'm not sure how useful this all is, but I'm opening it up for review so we can decide whether it's worth including or not, and so y'all can continue to point out what I've missed. :)
Actionable comments posted: 3
🧹 Nitpick comments (4)
src/utils/token_counter.py (3)
87-126: Fix parameter documentation mismatch. The docstring parameter description doesn't match the actual method signature.

  Args:
      system_prompt: The system prompt message content.
-     query: The user's query message content.
+     query_request: The user's query request object.
      response: The assistant's response message content (optional).
204-206: Use hardcoded constants appropriately. The role formatting overhead is hardcoded to 4 tokens, which is documented as typical for OpenAI models. Consider making this configurable per model type for better accuracy.

- # Add role overhead (varies by model, 4 is typical for OpenAI models)
- role_formatting_overhead = 4
+ # Add role overhead (varies by model, 4 is typical for OpenAI models)
+ role_formatting_overhead = self._get_role_formatting_overhead()

Then add a method to calculate model-specific overhead:

def _get_role_formatting_overhead(self) -> int:
    """Get role formatting overhead based on model type."""
    # Could be made configurable based on model_id in the future
    return 4
226-228: Review hardcoded conversation overhead values. The conversation overhead calculation uses hardcoded values that may not be accurate for all models or conversation formats. Consider making these values configurable or model-specific:

- base_overhead = 3  # Start of conversation tokens (based on OpenAI chat format)
- separator_overhead = max(0, (message_count - 1) * 1)  # Message separator tokens
+ base_overhead = self._get_base_conversation_overhead()
+ separator_overhead = max(0, (message_count - 1) * self._get_separator_overhead())

src/app/endpoints/streaming_query.py (1)
167-167: Verify TODO comment relevance. The TODO comment for truncation implementation should be addressed or tracked properly.
The TODO comment indicates missing truncation functionality. Would you like me to create a tracking issue for this feature or help implement the truncation logic?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
- docs/openapi.json (2 hunks)
- pyproject.toml (1 hunks)
- src/app/endpoints/query.py (5 hunks)
- src/app/endpoints/streaming_query.py (8 hunks)
- src/constants.py (1 hunks)
- src/models/config.py (1 hunks)
- src/models/responses.py (1 hunks)
- src/utils/token_counter.py (1 hunks)
- tests/unit/app/endpoints/test_query.py (19 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (20 hunks)
- tests/unit/utils/test_token_counter.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/constants.py
🚧 Files skipped from review as they are similar to previous changes (8)
- pyproject.toml
- src/models/config.py
- src/models/responses.py
- tests/unit/app/endpoints/test_query.py
- tests/unit/utils/test_token_counter.py
- tests/unit/app/endpoints/test_streaming_query.py
- src/app/endpoints/query.py
- docs/openapi.json
🔇 Additional comments (10)
src/utils/token_counter.py (5)
1-32: Well-structured module with comprehensive imports and documentation. The module setup is clean with proper imports, type annotations, and a clear module docstring. The global TTL cache for conversation tracking is appropriately scoped and configured.
46-73: Robust tokenizer initialization with proper fallback handling. The initialization logic correctly handles model-specific tokenization with fallback to a default tokenizer. The error handling and logging are appropriate.
74-85: Simple and correct token counting implementation. The method handles edge cases (empty/None text) appropriately and uses the tiktoken encoder correctly.
231-244: Appropriate use of LRU cache for performance optimization. The cached factory function is well-implemented and will prevent redundant tokenizer initialization.
247-256: Clean configuration access pattern. The function correctly handles nested configuration access with proper fallback to the default constant.

src/app/endpoints/streaming_query.py (5)
29-29: Proper import of token counter utility. The import is correctly placed and follows the established import organization.
103-130: Good API improvement with token metrics integration. The function signature update properly integrates token usage metrics while maintaining backward compatibility through default values in the metrics_map.
215-221: Correct unpacking of token usage from retrieve_response. The function call correctly handles the new third return value containing token usage data.
288-296: Well-designed helper function for toolgroup construction. The extraction of toolgroup building logic into a separate async function improves code organization and reusability.
304-304: Verify return type annotation matches implementation. The function signature indicates it returns a tuple with token usage data, which aligns with the implementation.
def count_conversation_turn_tokens(
    self,
    conversation_id: str,
    system_prompt: str,
    query_request: QueryRequest,
    response: str = "",
) -> dict[str, int]:
    """Count tokens for a conversation turn with cumulative tracking.

    This method estimates token usage for a conversation turn and tracks
    cumulative input tokens across the conversation. It accounts for the
    fact that Agent conversations include the entire message history in
    each turn.

    Args:
        conversation_id: The conversation ID to track tokens for.
        system_prompt: The system prompt message content.
        query: The user's query message content.
        response: The assistant's response message content (optional).

    Returns:
        A dictionary containing:
        - 'input_tokens': Cumulative input tokens for the conversation
        - 'output_tokens': Total tokens in the response message
    """
    # Get the current turn's token usage
    turn_token_usage = self.count_turn_tokens(
        system_prompt, query_request, response
    )

    # Get cumulative input tokens for this conversation
    cumulative_input_tokens = _conversation_cache.get(conversation_id, 0)

    # Add this turn's input tokens to the cumulative total
    new_cumulative_input_tokens = (
        cumulative_input_tokens + turn_token_usage["input_tokens"]
    )
    _conversation_cache[conversation_id] = new_cumulative_input_tokens

    # TODO(csibbitt) - Add counting for MCP and RAG content

    logger.debug(
        "Token usage for conversation %s: turn input=%d, cumulative input=%d, output=%d",
        conversation_id,
        turn_token_usage["input_tokens"],
        new_cumulative_input_tokens,
        turn_token_usage["output_tokens"],
    )

    return {
        "input_tokens": new_cumulative_input_tokens,
        "output_tokens": turn_token_usage["output_tokens"],
    }
💡 Verification agent
🧩 Analysis chain
Fix parameter documentation and consider thread safety.
The docstring has a parameter mismatch, and the global cache usage may have thread safety implications.
Args:
conversation_id: The conversation ID to track tokens for.
system_prompt: The system prompt message content.
- query: The user's query message content.
+ query_request: The user's query request object.
response: The assistant's response message content (optional).
Consider thread safety for the global cache: the global _conversation_cache is accessed without synchronization, which could lead to race conditions in concurrent environments. Consider using thread-safe operations or documenting the thread safety assumptions.
🏁 Script executed:
#!/bin/bash
# Check if the application uses async/concurrent processing that might affect cache safety
rg -A 10 -B 5 "async def|asyncio|threading|concurrent" --type py
Length of output: 66393
Update parameter docs and synchronize conversation cache access
The docstring's Args section still refers to query instead of query_request, and accessing the global _conversation_cache without coordination can lead to lost updates under concurrent requests.
• File src/utils/token_counter.py, method count_conversation_turn_tokens:
  – Fix the docstring parameter name
  – Wrap cache reads/writes in an asyncio.Lock or similar to prevent race conditions
def count_conversation_turn_tokens(
self,
conversation_id: str,
system_prompt: str,
query_request: QueryRequest,
response: str = "",
) -> dict[str, int]:
"""
- Args:
- conversation_id: The conversation ID to track tokens for.
- system_prompt: The system prompt message content.
- query: The user's query message content.
- response: The assistant's response message content (optional).
+ Args:
+ conversation_id: The conversation ID to track tokens for.
+ system_prompt: The system prompt message content.
+ query_request: The user's query request object.
+ response: The assistant's response message content (optional).
"""
Suggested thread-safety refactor:
- Add at module scope: _cache_lock = asyncio.Lock()
- In count_conversation_turn_tokens, guard cache operations:
    async with _cache_lock:
        cumulative = _conversation_cache.get(conversation_id, 0)
        new_total = cumulative + turn_token_usage["input_tokens"]
        _conversation_cache[conversation_id] = new_total
Or document that this cache is only safe under a single-threaded event loop.
🤖 Prompt for AI Agents
In src/utils/token_counter.py lines 128 to 180, update the docstring in
count_conversation_turn_tokens to replace the parameter name 'query' with
'query_request' for accuracy. Additionally, to prevent race conditions when
accessing the global _conversation_cache, introduce an asyncio.Lock at the
module level (e.g., _cache_lock = asyncio.Lock()) and wrap all cache read and
write operations inside an 'async with _cache_lock' block within the method.
This ensures thread-safe updates to the conversation token counts under
concurrent requests.
# Currently (2025-07-08) the usage is not returned by the API, so we need to estimate
# try:
#     output_tokens = response.usage.get("completion_tokens", 0)
# except AttributeError:
# Estimate output tokens from complete response
try:
    token_counter = get_token_counter(model_id)
    output_tokens = token_counter.count_tokens(complete_response)
    logger.debug("Estimated output tokens: %s", output_tokens)
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate output tokens: %s", e)
    output_tokens = 0

metrics_map = {
    "input_tokens": token_usage["input_tokens"],
    "output_tokens": output_tokens,
}
yield stream_end_event(metadata_map, metrics_map)
🛠️ Refactor suggestion
Improve error handling and remove commented code.
The token counting implementation has broad exception handling and contains commented-out code that should be cleaned up.
- # Currently (2025-07-08) the usage is not returned by the API, so we need to estimate
- # try:
- # output_tokens = response.usage.get("completion_tokens", 0)
- # except AttributeError:
- # Estimate output tokens from complete response
try:
token_counter = get_token_counter(model_id)
output_tokens = token_counter.count_tokens(complete_response)
logger.debug("Estimated output tokens: %s", output_tokens)
- except Exception as e: # pylint: disable=broad-exception-caught
+ except (KeyError, ValueError, AttributeError) as e:
logger.warning("Failed to estimate output tokens: %s", e)
output_tokens = 0
Rationale: Remove outdated commented code and use more specific exception handling instead of catching all exceptions.
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query.py around lines 240 to 257, remove the
commented-out code related to usage retrieval since it is outdated. Replace the
broad except Exception clause with more specific exception handling that targets
only the expected errors from get_token_counter or count_tokens, such as
AttributeError or a custom token counting error if applicable. This will improve
error clarity and maintain cleaner code.
# Currently (2025-07-08) the usage is not returned by the API, so we need to estimate it
# try:
#     token_usage = {
#         "input_tokens": response.usage.get("prompt_tokens", 0),
#         "output_tokens": 0,  # Will be calculated during streaming
#     }
# except AttributeError:
#     # Estimate input tokens (Output will be calculated during streaming)
try:
    token_counter = get_token_counter(model_id)
    token_usage = token_counter.count_conversation_turn_tokens(
        conversation_id, system_prompt, query_request
    )
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate token usage: %s", e)
    token_usage = {
        "input_tokens": 0,
        "output_tokens": 0,
    }

return response, conversation_id, token_usage
🛠️ Refactor suggestion
Improve error handling and remove commented code.
Similar to the earlier issue, this section has broad exception handling and outdated commented code.
- # Currently (2025-07-08) the usage is not returned by the API, so we need to estimate it
- # try:
- # token_usage = {
- # "input_tokens": response.usage.get("prompt_tokens", 0),
- # "output_tokens": 0, # Will be calculated during streaming
- # }
- # except AttributeError:
- # # Estimate input tokens (Output will be calculated during streaming)
try:
token_counter = get_token_counter(model_id)
token_usage = token_counter.count_conversation_turn_tokens(
conversation_id, system_prompt, query_request
)
- except Exception as e: # pylint: disable=broad-exception-caught
+ except (KeyError, ValueError, AttributeError) as e:
logger.warning("Failed to estimate token usage: %s", e)
token_usage = {
"input_tokens": 0,
"output_tokens": 0,
}
Rationale: Remove outdated commented code and use more specific exception handling.
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query.py around lines 359 to 379, remove the
outdated commented code related to token usage estimation and replace the broad
exception handling with more specific exceptions that can be raised by
get_token_counter or count_conversation_turn_tokens. This will clean up the code
and improve error handling by catching only expected errors instead of all
exceptions.
This patch is adding the /metrics endpoint to the project.
Differences with road-core/service:
* The metrics are prefixed with "ls_" instead of "ols_"
* The "provider_model_configuration" does not set the non-default model/providers to 0 because we currently do not have a way to set a default model/provider in the configuration. A TODO was left in the code.
Supported metrics:
* rest_api_calls_total: Counter to track REST API calls
* response_duration_seconds: Histogram to measure how long it takes to handle requests
* provider_model_configuration: Indicates what provider + model customers are using
* llm_calls_total: How many LLM calls were made for each provider + model
* llm_calls_failures_total: How many LLM calls failed
* llm_calls_validation_errors_total: How many LLM calls had validation errors
Missing metrics:
* llm_token_sent_total: How many tokens were sent
* llm_token_received_total: How many tokens were received
The above metrics are missing because token counting PR (lightspeed-core#215) is not merged yet.
Signed-off-by: Lucas Alvares Gomes <lucasagomes@gmail.com>
logger = logging.getLogger(__name__)

# Class-level cache to track cumulative input tokens for each conversation
_conversation_cache: TTLCache[str, int] = TTLCache(maxsize=1000, ttl=3600)
+1, there's a ticket for it at https://issues.redhat.com/browse/LCORE-298
Nothing much to do as part of this patch until it's properly implemented.
Thanks Chris! Overall it looks pretty good, I would say. It does need to be rebased and have the merge conflicts resolved.
        conversation_id, system_prompt, query_request, response_content
    )
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate token usage: %s", e)
nit: This probably should be an error instead of a warning
    output_tokens = token_counter.count_tokens(complete_response)
    logger.debug("Estimated output tokens: %s", output_tokens)
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate output tokens: %s", e)
ditto: s/warning/error/g
Description
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
I have not tested the streaming api yet, but am working on that now. I wanted to get this published for early feedback.
Summary by CodeRabbit
New Features
- input_tokens and output_tokens are now included in API responses for both standard and streaming queries, providing visibility into tokens sent to and received from language models.
Bug Fixes
Documentation
Tests