LCORE-178 Token counting #215
base: main
Conversation
* Returns token usage data in query and streaming query responses
* llama-stack currently doesn't include usage in the Agent API
* We estimate locally with an embedding model (see the sketch below)
* Accuracy will vary, but it should be nearly 100% for GPT models.
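For context, a minimal sketch of this kind of local estimation using tiktoken's cl100k_base encoding is shown below; the function name and example strings are illustrative rather than the PR's actual implementation.

```python
import tiktoken


def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Roughly estimate how many tokens `text` would use for a GPT-style model."""
    if not text:
        return 0
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


# Example: estimated input tokens for a system prompt plus a user query.
system_prompt = "You are a helpful assistant."
query = "How do I check token usage?"
print(estimate_tokens(system_prompt) + estimate_tokens(query))
```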
Review skipped: review was skipped due to path filters. Files ignored due to path filters (1).
Walkthrough
Token usage tracking was integrated into the query and streaming query endpoints, backed by new token counting utilities.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant API
participant TokenCounter
participant LLM
Client->>API: Send query request
API->>TokenCounter: Estimate input tokens
API->>LLM: Forward query
LLM-->>API: Return response (may not include usage)
API->>TokenCounter: Estimate output tokens (if needed)
API-->>Client: Return response with input_tokens, output_tokens
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/utils/token_counter.py (3)
56-63: Consider more robust configuration access. The configuration access pattern configuration.configuration.default_estimation_tokenizer creates a tight coupling and could fail if the configuration structure changes. Consider using a more defensive approach:

- fallback_encoding = configuration.configuration.default_estimation_tokenizer
+ try:
+     fallback_encoding = configuration.configuration.default_estimation_tokenizer
+ except AttributeError:
+     fallback_encoding = "cl100k_base"  # Safe default
138-139: Hardcoded overhead values may not be universally accurate. The role formatting overhead of 4 tokens is a reasonable estimate for OpenAI models, but may not be accurate for all model types. Consider making this configurable or model-specific.

- # Add role overhead
- role_formatting_overhead = 4
+ # Add role overhead (varies by model, 4 is typical for OpenAI models)
+ role_formatting_overhead = 4  # TODO: Make this model-specific
159-161: Document the overhead calculation rationale. The conversation overhead calculation logic is reasonable, but the specific values (3 for base overhead, 1 for separators) should be documented or referenced to their source.

- base_overhead = 3  # Start of conversation
- separator_overhead = max(0, (message_count - 1) * 1)  # Between messages
+ base_overhead = 3  # Start of conversation tokens (based on OpenAI chat format)
+ separator_overhead = max(0, (message_count - 1) * 1)  # Message separator tokens
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
- docs/openapi.json (1 hunks)
- lightspeed-stack.yaml (1 hunks)
- pyproject.toml (1 hunks)
- src/app/endpoints/config.py (1 hunks)
- src/app/endpoints/query.py (5 hunks)
- src/app/endpoints/streaming_query.py (7 hunks)
- src/models/config.py (1 hunks)
- src/models/responses.py (1 hunks)
- src/utils/token_counter.py (1 hunks)
- tests/unit/app/endpoints/test_config.py (1 hunks)
- tests/unit/app/endpoints/test_info.py (1 hunks)
- tests/unit/app/endpoints/test_models.py (3 hunks)
- tests/unit/app/endpoints/test_query.py (19 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (19 hunks)
- tests/unit/models/test_config.py (4 hunks)
- tests/unit/test_configuration.py (4 hunks)
- tests/unit/utils/test_token_counter.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
tests/unit/app/endpoints/test_query.py (1)
- src/app/endpoints/query.py (1): retrieve_response (204-281)

tests/unit/app/endpoints/test_streaming_query.py (2)
- src/app/endpoints/streaming_query.py (1): retrieve_response (282-364)
- tests/unit/app/endpoints/test_query.py (2): MockShield (380-382), MockShield (423-425)

src/utils/token_counter.py (1)
- src/configuration.py (1): configuration (47-52)
🔇 Additional comments (35)
src/models/config.py (1)
157-157: LGTM: Configuration field addition is well-implemented. The default_estimation_tokenizer field addition is consistent with the token counting feature. The default value "cl100k_base" is the standard tiktoken encoding for GPT-3.5-turbo and GPT-4 models.

lightspeed-stack.yaml (1)
25-25: LGTM: Configuration key addition is consistent. The default_estimation_tokenizer configuration key addition aligns with the Configuration model and maintains consistency across the codebase.

tests/unit/app/endpoints/test_info.py (1)
26-26: LGTM: Test configuration update is necessary. The test configuration dictionary update is required to match the updated Configuration model that now includes the default_estimation_tokenizer field.

src/app/endpoints/config.py (1)
49-49: LGTM: Response example update maintains consistency. The addition of default_estimation_tokenizer to the response example maintains consistency with the updated Configuration model and ensures accurate API documentation.

tests/unit/app/endpoints/test_config.py (1)
47-47: LGTM! Test configuration updated for new tokenizer field. The addition of default_estimation_tokenizer to the test configuration aligns with the new token counting functionality and maintains test consistency.

tests/unit/test_configuration.py (1)
53-53: LGTM! Consistent test configuration updates. The addition of default_estimation_tokenizer to all test configuration dictionaries ensures consistent test coverage for the new token counting configuration field. Also applies to: 116-116, 224-224, 261-261
src/models/responses.py (1)
29-30: LGTM! Token counting fields properly added to response model. The addition of input_tokens and output_tokens as optional integer fields correctly implements the token counting feature described in the PR objectives. The field types and documentation are appropriate. Also applies to: 35-36

tests/unit/app/endpoints/test_models.py (1)
49-49: LGTM! Test configurations updated for token counting feature. The addition of default_estimation_tokenizer to all test configuration dictionaries maintains consistency with the new configuration schema for token counting functionality. Also applies to: 88-88, 122-122

tests/unit/models/test_config.py (1)
324-324: LGTM! Configuration model tests updated for new tokenizer field. The addition of default_estimation_tokenizer to both the Configuration constructor calls and the expected JSON assertions ensures proper test coverage for the new token counting configuration field. Also applies to: 379-379, 454-454, 547-547

docs/openapi.json (1)
846-875: Schema changes look correct and well-structured. The new optional input_tokens and output_tokens fields are properly defined with appropriate union types allowing integer or null values. The updated description clearly explains what these fields represent.

tests/unit/utils/test_token_counter.py (1)
1-68: Comprehensive test coverage for TokenCounter utility. The test suite provides excellent coverage of the TokenCounter functionality:
- Tests edge cases (empty strings, empty message lists)
- Validates both string and message token counting
- Uses realistic configuration and message types
- Includes proper token overhead calculations
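For readers unfamiliar with the underlying library, the edge cases above boil down to assertions of roughly this shape (a tiktoken-level sketch, not the actual contents of tests/unit/utils/test_token_counter.py):

```python
import tiktoken


def test_empty_string_counts_zero_tokens():
    # Empty input should contribute no content tokens.
    encoding = tiktoken.get_encoding("cl100k_base")
    assert len(encoding.encode("")) == 0


def test_simple_string_counts_positive_tokens():
    # Any non-empty string should produce at least one token.
    encoding = tiktoken.get_encoding("cl100k_base")
    assert len(encoding.encode("Hello, world!")) > 0
```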
tests/unit/app/endpoints/test_query.py (4)
47-47: Configuration updated to include default tokenizer. The addition of "default_estimation_tokenizer": "cl100k_base" to the test configuration aligns with the new token counting feature requirements.
122-127: Mock return value updated to include token usage. The mock is correctly updated to return a tuple with token usage counts, matching the new retrieve_response signature.
322-329: Test properly validates token usage response. The test correctly unpacks the token usage dictionary and validates that both input and output tokens are greater than zero, ensuring the token counting functionality works as expected.
303-304: Mock agent setup includes proper output_message specification. The mock properly specifies the output_message attribute to ensure compatibility with the updated agent interface used in the token counting logic.

src/app/endpoints/query.py (7)
34-34: Import added for token counting utility. The import of get_token_counter provides access to the token estimation functionality required for the new feature.
110-116: Query handler updated to include token usage. The endpoint handler correctly unpacks the token usage information from retrieve_response and passes it to the response constructor, maintaining clean separation of concerns.
133-138: QueryResponse updated with token usage fields. The response object now includes the token usage metrics, providing clients with visibility into LLM token consumption for cost tracking and optimization.
196-201: Good refactoring with _build_toolgroups helper function. The extraction of toolgroup building logic into a separate helper function improves code organization and readability while maintaining the same functionality.
210-210: Return type signature updated to include token usage. The function signature correctly reflects the new return value structure, improving type safety and documentation.
262-280: Robust token usage estimation with proper error handling. The implementation includes:
- Clear comment explaining the current limitation of the API
- Proper exception handling with broad catch to prevent service disruption
- Fallback to zero counts when estimation fails
- Informative warning logging for debugging
This approach ensures service reliability while providing useful token usage data when possible.
281-281: Updated return statement includes token usage. The return statement correctly includes the token usage dictionary, completing the implementation of the new feature.
tests/unit/app/endpoints/test_streaming_query.py (4)
144-145: Good simulation of API behavior. The explicit deletion of the usage attribute effectively simulates the current API behavior where usage data is not returned, preventing pytest from returning a Mock object. This is a clean way to test the fallback token estimation logic.
186-191: Correct mock update for token usage. The mock is properly updated to return token usage data alongside the response and conversation ID, matching the new function signature. The hardcoded values (10 input tokens, 20 output tokens) are appropriate for testing.
273-280: Consistent mock setup pattern. The addition of spec=["output_message"] to constrain the mock interface and the mock_client.models.list setup is consistent across all test functions. This ensures proper mocking of the model selection functionality.
298-306: Proper handling of updated function signature. The test correctly unpacks the third return value token_usage from retrieve_response and asserts that input tokens are greater than zero, which validates the token counting functionality.

src/utils/token_counter.py (1)
164-177: Appropriate use of caching for performance. The LRU cache with maxsize=8 is well-suited for this use case, as it avoids repeated tokenizer initialization while keeping memory usage reasonable. The cache size should handle typical model variety in production.
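For reference, a cached tokenizer factory along the lines being praised here might look like the sketch below; the class layout and fallback behavior are assumptions, not the exact contents of src/utils/token_counter.py.

```python
from functools import lru_cache

import tiktoken


class TokenCounter:
    """Estimate token counts for a given model using tiktoken."""

    def __init__(self, model_id: str, fallback_encoding: str = "cl100k_base") -> None:
        try:
            # Use the model-specific encoding when tiktoken recognizes the model.
            self.encoder = tiktoken.encoding_for_model(model_id)
        except KeyError:
            # Unknown model: fall back to a sensible default encoding.
            self.encoder = tiktoken.get_encoding(fallback_encoding)

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text)) if text else 0


@lru_cache(maxsize=8)
def get_token_counter(model_id: str) -> TokenCounter:
    """Return a cached TokenCounter so tokenizers are initialized once per model."""
    return TokenCounter(model_id)
```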
src/app/endpoints/streaming_query.py (8)
27-27: Proper import of token counting utility. The import is correctly placed and follows the established import organization pattern in the file.
98-105: Improved function signature with actual metrics. The updated stream_end_event function signature now accepts metrics_map and uses actual token counts instead of hardcoded zeros. This is a significant improvement for providing accurate usage data to clients.
120-121: Correct usage of metrics data. The function now properly extracts input_tokens and output_tokens from the metrics map with appropriate fallback values, replacing the previous hardcoded zeros.
208-214: Proper unpacking of token usage data. The code correctly unpacks the third return value token_usage from retrieve_response, maintaining compatibility with the updated function signature.
238-245: Robust error handling for token estimation. The try-catch block properly handles potential failures in token counting while logging appropriate messages. The fallback to 0 tokens ensures the system continues to function even when estimation fails.
247-251: Clean metrics construction and usage. The metrics map construction clearly separates input tokens (from initial estimation) and output tokens (from streaming calculation), then passes both to the end event. This maintains a clean separation of concerns.
288-288: Updated function signature reflects new functionality. The retrieve_response function signature is correctly updated to return the token usage dictionary as the third tuple element, maintaining backward compatibility for the first two elements.
353-362: Appropriate error handling for input token estimation. The token counting logic properly handles the case where the API doesn't provide usage data by estimating input tokens using the system prompt and query. The error handling ensures graceful degradation with appropriate logging.
I guess this is an approximation in the absence of Token counting in llama-stack.
The problem is that we enable llama-stack session persistence in order to have a conversation. Therefore the prompt sent to the LLM by llama-stack will include system_prompt, query_request.query and the previous queries relating to the session/conversation. This is further compounded if, for example, RAG or MCP Tool Calls are included in the prompt sent to the LLM by llama-stack.
I suspect the more User messages in a conversation, and the use of RAG and MCP, the more inaccurate the approach in this PR becomes.
Do you know if llama-stack has plans to support Token counting? Their InferenceApi.completion returns a CompletionResponse that includes Token metrics, however it seems this currently gets lost when using an Agent 😢
It might be worth finding out their plans and keeping this as a temporary solution.
WDYT?
Ah yes, I expect I will see this when I look at it using a front-end. :)
I could not find any public open issues related to this, but our plan was to propose it.
Yes; I initially thought this would be trivial because I'd seen the usage states in the Inference API.
This is 100% expected to be a temporary solution to get us closer to parity with road-core. You can review the internal discussion linked from OSPRH-17603 where this is framed more clearly as a stop-gap.
AFAICT Michael is right: to get an accurate count I actually need to include the entire message history as input on each turn, not just the new inputs. We don't currently cache this anywhere in lightspeed-stack. There is a llama-stack session API which looks like it would return them to us, but that is a bit absurd, I think; perhaps less so if we were to use llama-stack-as-library?

We briefly discussed the ultimate limitations of this (like the context size of the LLM), persistence of the session, etc., and it's not clear that there are any limitations on the llama-stack side, or how they are dealing with the context limit. We do ask for persistence of the sessions, so at least it's not strictly memory bound, but some more reading may be required in order to figure out how this is ultimately limited.

Some links to relevant parts (thanks to Michael for the digging):
- Adding previous messages to the current turn: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L223
- Passing all messages to the turn loop: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L277
- Passing all messages to the InferenceApi: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L513

I'd considered a local cache of the Message history, but could probably get by with just keeping a running count of input tokens by conversation_id and adding to it on each turn. Any other suggestions on this topic?
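A minimal sketch of that running-count idea, with an illustrative store and function name; it deliberately ignores eviction and concurrency, which come up later in this thread:

```python
# conversation_id -> cumulative input tokens estimated so far
_input_token_totals: dict[str, int] = {}


def add_turn_input_tokens(conversation_id: str, turn_input_tokens: int) -> int:
    """Add this turn's estimated input tokens to the running total and return it."""
    total = _input_token_totals.get(conversation_id, 0) + turn_input_tokens
    _input_token_totals[conversation_id] = total
    return total
```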
Overall it looks good. As @manstis wrote, the actual TokenCounter implementation will be a bit more complicated, and will need to be updated when conversation history, RAG, and MCP are in place. Just some nitpicks from my side.
lightspeed-stack.yaml (Outdated)
@@ -22,3 +22,4 @@ user_data_collection:
  transcripts_storage: "/tmp/data/transcripts"
authentication:
  module: "noop"
default_estimation_tokenizer: "cl100k_base"
Probably worth setting it up in constants.py and allowing it to be overridden by anything else in the customization block? WDYT?
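One possible shape for this suggestion (a sketch only; the constant and helper names are assumptions, not the final code):

```python
# constants.py
# Default tiktoken encoding used for token estimation when no override is configured.
DEFAULT_ESTIMATION_TOKENIZER = "cl100k_base"


# configuration lookup (illustrative): prefer the customization block, else the constant
def resolve_estimation_tokenizer(customization) -> str:
    return getattr(customization, "default_estimation_tokenizer", None) or DEFAULT_ESTIMATION_TOKENIZER
```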
@@ -187,13 +193,21 @@ def select_model_id(models: ModelListResponse, query_request: QueryRequest) -> s
    return model_id


def _build_toolgroups(client: LlamaStackClient) -> list[Toolgroup] | None:
is it part of this PR?
It was to alleviate a "too many local variables" error that appeared while I was fixing merge conflicts.
Probably we can just # pylint: disable=too-many-locals for this
src/models/config.py (Outdated)
@@ -154,6 +154,7 @@ class Configuration(BaseModel):
        AuthenticationConfiguration()
    )
    customization: Optional[Customization] = None
    default_estimation_tokenizer: str = "cl100k_base"
Ditto, the constant might be stored in constants.py with some comments about what it means.
src/models/responses.py (Outdated)
    """

    conversation_id: Optional[str] = None
    response: str
    input_tokens: Optional[int] = None
AFAIK it is not optional: you set 0 for any exceptions, so it's still int?
Thank you
I changed the status to Draft pending additions of history counting and at least some comments about where to add MCP and RAG related counting.
* Token counts consider full conversation history
* Token counts not defined as optional
* Added default_estimation_tokenizer to constants and customization
I've implemented the estimation for message history as well as attachments, and addressed the other feedback. AFAICT so far, I think it would only really be possible to count tokens in RAG chunks and MCP queries from the llama-stack side; please clue me in if I've misunderstood. Ultimately I'm not sure how useful this all is, but I'm opening it up for review so we can decide whether it's worth including or not, and so y'all can continue to point out what I've missed. :)
Actionable comments posted: 3
🧹 Nitpick comments (4)
src/utils/token_counter.py (3)
87-126: Fix parameter documentation mismatch. The docstring parameter description doesn't match the actual method signature.

  Args:
      system_prompt: The system prompt message content.
-     query: The user's query message content.
+     query_request: The user's query request object.
      response: The assistant's response message content (optional).
204-206: Use hardcoded constants appropriately. The role formatting overhead is hardcoded to 4 tokens, which is documented as typical for OpenAI models. Consider making this configurable per model type for better accuracy.

- # Add role overhead (varies by model, 4 is typical for OpenAI models)
- role_formatting_overhead = 4
+ # Add role overhead (varies by model, 4 is typical for OpenAI models)
+ role_formatting_overhead = self._get_role_formatting_overhead()

Then add a method to calculate model-specific overhead:

def _get_role_formatting_overhead(self) -> int:
    """Get role formatting overhead based on model type."""
    # Could be made configurable based on model_id in the future
    return 4
226-228: Review hardcoded conversation overhead values. The conversation overhead calculation uses hardcoded values that may not be accurate for all models or conversation formats. Consider making these values configurable or model-specific:

- base_overhead = 3  # Start of conversation tokens (based on OpenAI chat format)
- separator_overhead = max(0, (message_count - 1) * 1)  # Message separator tokens
+ base_overhead = self._get_base_conversation_overhead()
+ separator_overhead = max(0, (message_count - 1) * self._get_separator_overhead())

src/app/endpoints/streaming_query.py (1)
167-167: Verify TODO comment relevance. The TODO comment for truncation implementation should be addressed or tracked properly.
The TODO comment indicates missing truncation functionality. Would you like me to create a tracking issue for this feature or help implement the truncation logic?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
- docs/openapi.json (2 hunks)
- pyproject.toml (1 hunks)
- src/app/endpoints/query.py (5 hunks)
- src/app/endpoints/streaming_query.py (8 hunks)
- src/constants.py (1 hunks)
- src/models/config.py (1 hunks)
- src/models/responses.py (1 hunks)
- src/utils/token_counter.py (1 hunks)
- tests/unit/app/endpoints/test_query.py (19 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (20 hunks)
- tests/unit/utils/test_token_counter.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/constants.py
🚧 Files skipped from review as they are similar to previous changes (8)
- pyproject.toml
- src/models/config.py
- src/models/responses.py
- tests/unit/app/endpoints/test_query.py
- tests/unit/utils/test_token_counter.py
- tests/unit/app/endpoints/test_streaming_query.py
- src/app/endpoints/query.py
- docs/openapi.json
🔇 Additional comments (10)
src/utils/token_counter.py (5)
1-32: Well-structured module with comprehensive imports and documentation. The module setup is clean with proper imports, type annotations, and a clear module docstring. The global TTL cache for conversation tracking is appropriately scoped and configured.
46-73: Robust tokenizer initialization with proper fallback handling. The initialization logic correctly handles model-specific tokenization with fallback to a default tokenizer. The error handling and logging are appropriate.
74-85: Simple and correct token counting implementation. The method handles edge cases (empty/None text) appropriately and uses the tiktoken encoder correctly.
231-244: Appropriate use of LRU cache for performance optimization. The cached factory function is well-implemented and will prevent redundant tokenizer initialization.
247-256: Clean configuration access pattern. The function correctly handles nested configuration access with proper fallback to the default constant.

src/app/endpoints/streaming_query.py (5)
29-29: Proper import of token counter utility. The import is correctly placed and follows the established import organization.
103-130: Good API improvement with token metrics integration. The function signature update properly integrates token usage metrics while maintaining backward compatibility through default values in the metrics_map.
215-221: Correct unpacking of token usage from retrieve_response. The function call correctly handles the new third return value containing token usage data.
288-296: Well-designed helper function for toolgroup construction. The extraction of toolgroup building logic into a separate async function improves code organization and reusability.
304-304: Verify return type annotation matches implementation. The function signature indicates it returns a tuple with token usage data, which aligns with the implementation.
def count_conversation_turn_tokens(
    self,
    conversation_id: str,
    system_prompt: str,
    query_request: QueryRequest,
    response: str = "",
) -> dict[str, int]:
    """Count tokens for a conversation turn with cumulative tracking.

    This method estimates token usage for a conversation turn and tracks
    cumulative input tokens across the conversation. It accounts for the
    fact that Agent conversations include the entire message history in
    each turn.

    Args:
        conversation_id: The conversation ID to track tokens for.
        system_prompt: The system prompt message content.
        query: The user's query message content.
        response: The assistant's response message content (optional).

    Returns:
        A dictionary containing:
        - 'input_tokens': Cumulative input tokens for the conversation
        - 'output_tokens': Total tokens in the response message
    """
    # Get the current turn's token usage
    turn_token_usage = self.count_turn_tokens(
        system_prompt, query_request, response
    )

    # Get cumulative input tokens for this conversation
    cumulative_input_tokens = _conversation_cache.get(conversation_id, 0)

    # Add this turn's input tokens to the cumulative total
    new_cumulative_input_tokens = (
        cumulative_input_tokens + turn_token_usage["input_tokens"]
    )
    _conversation_cache[conversation_id] = new_cumulative_input_tokens

    # TODO(csibbitt) - Add counting for MCP and RAG content

    logger.debug(
        "Token usage for conversation %s: turn input=%d, cumulative input=%d, output=%d",
        conversation_id,
        turn_token_usage["input_tokens"],
        new_cumulative_input_tokens,
        turn_token_usage["output_tokens"],
    )

    return {
        "input_tokens": new_cumulative_input_tokens,
        "output_tokens": turn_token_usage["output_tokens"],
    }
💡 Verification agent
🧩 Analysis chain
Fix parameter documentation and consider thread safety.
The docstring has a parameter mismatch, and the global cache usage may have thread safety implications.
Args:
conversation_id: The conversation ID to track tokens for.
system_prompt: The system prompt message content.
- query: The user's query message content.
+ query_request: The user's query request object.
response: The assistant's response message content (optional).
Consider thread safety for the global cache: the global _conversation_cache is accessed without synchronization, which could lead to race conditions in concurrent environments. Consider using thread-safe operations or documenting the thread safety assumptions.
🏁 Script executed:
#!/bin/bash
# Check if the application uses async/concurrent processing that might affect cache safety
rg -A 10 -B 5 "async def|asyncio|threading|concurrent" --type py
Length of output: 66393
Update parameter docs and synchronize conversation cache access
The docstring's Args section still refers to query instead of query_request, and accessing the global _conversation_cache without coordination can lead to lost updates under concurrent requests.
• File src/utils/token_counter.py, method count_conversation_turn_tokens:
  – Fix the docstring parameter name
  – Wrap cache reads/writes in an asyncio.Lock or similar to prevent race conditions
def count_conversation_turn_tokens(
self,
conversation_id: str,
system_prompt: str,
query_request: QueryRequest,
response: str = "",
) -> dict[str, int]:
"""
- Args:
- conversation_id: The conversation ID to track tokens for.
- system_prompt: The system prompt message content.
- query: The user's query message content.
- response: The assistant's response message content (optional).
+ Args:
+ conversation_id: The conversation ID to track tokens for.
+ system_prompt: The system prompt message content.
+ query_request: The user's query request object.
+ response: The assistant's response message content (optional).
"""
Suggested thread-safety refactor:
- Add at module scope: _cache_lock = asyncio.Lock()
- In count_conversation_turn_tokens, guard cache operations:
    async with _cache_lock:
        cumulative = _conversation_cache.get(conversation_id, 0)
        new_total = cumulative + turn_token_usage["input_tokens"]
        _conversation_cache[conversation_id] = new_total
Or document that this cache is only safe under a single-threaded event loop.
🤖 Prompt for AI Agents
In src/utils/token_counter.py lines 128 to 180, update the docstring in
count_conversation_turn_tokens to replace the parameter name 'query' with
'query_request' for accuracy. Additionally, to prevent race conditions when
accessing the global _conversation_cache, introduce an asyncio.Lock at the
module level (e.g., _cache_lock = asyncio.Lock()) and wrap all cache read and
write operations inside an 'async with _cache_lock' block within the method.
This ensures thread-safe updates to the conversation token counts under
concurrent requests.
# Currently (2025-07-08) the usage is not returned by the API, so we need to estimate
# try:
#     output_tokens = response.usage.get("completion_tokens", 0)
# except AttributeError:
# Estimate output tokens from complete response
try:
    token_counter = get_token_counter(model_id)
    output_tokens = token_counter.count_tokens(complete_response)
    logger.debug("Estimated output tokens: %s", output_tokens)
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate output tokens: %s", e)
    output_tokens = 0

metrics_map = {
    "input_tokens": token_usage["input_tokens"],
    "output_tokens": output_tokens,
}
yield stream_end_event(metadata_map, metrics_map)
🛠️ Refactor suggestion
Improve error handling and remove commented code.
The token counting implementation has broad exception handling and contains commented-out code that should be cleaned up.
- # Currently (2025-07-08) the usage is not returned by the API, so we need to estimate
- # try:
- # output_tokens = response.usage.get("completion_tokens", 0)
- # except AttributeError:
- # Estimate output tokens from complete response
try:
token_counter = get_token_counter(model_id)
output_tokens = token_counter.count_tokens(complete_response)
logger.debug("Estimated output tokens: %s", output_tokens)
- except Exception as e: # pylint: disable=broad-exception-caught
+ except (KeyError, ValueError, AttributeError) as e:
logger.warning("Failed to estimate output tokens: %s", e)
output_tokens = 0
Rationale: Remove outdated commented code and use more specific exception handling instead of catching all exceptions.
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query.py around lines 240 to 257, remove the
commented-out code related to usage retrieval since it is outdated. Replace the
broad except Exception clause with more specific exception handling that targets
only the expected errors from get_token_counter or count_tokens, such as
AttributeError or a custom token counting error if applicable. This will improve
error clarity and maintain cleaner code.
# Currently (2025-07-08) the usage is not returned by the API, so we need to estimate it
# try:
#     token_usage = {
#         "input_tokens": response.usage.get("prompt_tokens", 0),
#         "output_tokens": 0,  # Will be calculated during streaming
#     }
# except AttributeError:
#     # Estimate input tokens (Output will be calculated during streaming)
try:
    token_counter = get_token_counter(model_id)
    token_usage = token_counter.count_conversation_turn_tokens(
        conversation_id, system_prompt, query_request
    )
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate token usage: %s", e)
    token_usage = {
        "input_tokens": 0,
        "output_tokens": 0,
    }

return response, conversation_id, token_usage
🛠️ Refactor suggestion
Improve error handling and remove commented code.
Similar to the earlier issue, this section has broad exception handling and outdated commented code.
- # Currently (2025-07-08) the usage is not returned by the API, so we need to estimate it
- # try:
- # token_usage = {
- # "input_tokens": response.usage.get("prompt_tokens", 0),
- # "output_tokens": 0, # Will be calculated during streaming
- # }
- # except AttributeError:
- # # Estimate input tokens (Output will be calculated during streaming)
try:
token_counter = get_token_counter(model_id)
token_usage = token_counter.count_conversation_turn_tokens(
conversation_id, system_prompt, query_request
)
- except Exception as e: # pylint: disable=broad-exception-caught
+ except (KeyError, ValueError, AttributeError) as e:
logger.warning("Failed to estimate token usage: %s", e)
token_usage = {
"input_tokens": 0,
"output_tokens": 0,
}
Rationale: Remove outdated commented code and use more specific exception handling.
🤖 Prompt for AI Agents
In src/app/endpoints/streaming_query.py around lines 359 to 379, remove the
outdated commented code related to token usage estimation and replace the broad
exception handling with more specific exceptions that can be raised by
get_token_counter or count_conversation_turn_tokens. This will clean up the code
and improve error handling by catching only expected errors instead of all
exceptions.
This patch is adding the /metrics endpoint to the project.
Differences with road-core/service:
* The metrics are prefixed with "ls_" instead of "ols_"
* The "provider_model_configuration" does not set the non-default model/providers to 0 because we currently do not have a way to set a default model/provider in the configuration. A TODO was left in the code.
Supported metrics:
* rest_api_calls_total: Counter to track REST API calls
* response_duration_seconds: Histogram to measure how long it takes to handle requests
* provider_model_configuration: Indicates what provider + model customers are using
* llm_calls_total: How many LLM calls were made for each provider + model
* llm_calls_failures_total: How many LLM calls failed
* llm_calls_validation_errors_total: How many LLM calls had validation errors
Missing metrics:
* llm_token_sent_total: How many tokens were sent
* llm_token_received_total: How many tokens were received
The above metrics are missing because token counting PR (lightspeed-core#215) is not merged yet.
Signed-off-by: Lucas Alvares Gomes <lucasagomes@gmail.com>
logger = logging.getLogger(__name__)

# Class-level cache to track cumulative input tokens for each conversation
_conversation_cache: TTLCache[str, int] = TTLCache(maxsize=1000, ttl=3600)
+1, there's a ticket for it at https://issues.redhat.com/browse/LCORE-298
Nothing much to do as part of this patch until it's properly implemented.
Thanks Chris! Overall it looks pretty good, I would say. It does need to be rebased and have the merge conflicts resolved.
        conversation_id, system_prompt, query_request, response_content
    )
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate token usage: %s", e)
nit: This probably should be an error instead of a warning
    output_tokens = token_counter.count_tokens(complete_response)
    logger.debug("Estimated output tokens: %s", output_tokens)
except Exception as e:  # pylint: disable=broad-exception-caught
    logger.warning("Failed to estimate output tokens: %s", e)
ditto: s/warning/error/g
Description
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
I have not tested the streaming api yet, but am working on that now. I wanted to get this published for early feedback.
Summary by CodeRabbit
New Features
- input_tokens and output_tokens are now included in API responses for both standard and streaming queries, providing visibility into tokens sent to and received from language models.
Bug Fixes
Documentation
Tests