addition of Safearena Benchmark #345
base: main
Conversation
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Issue | Status |
|---|---|
| Identical RNG Seeds Across Benchmarks ▹ view | |
| Unhandled Download Failure ▹ view | |
| Insufficient Task Repetitions ▹ view | ✅ Fix detected |
| Incorrect Benchmark Name in SafeArena Harm Config ▹ view | ✅ Fix detected |
| Lack of Task Registration Encapsulation ▹ view | |
| Mixed Resource Initialization with Task Registration ▹ view | |
| Lack of proper configuration management structure ▹ view | |
| Incomplete request exception handling ▹ view | |
| Incorrect Benchmark Name in SafeArena Safe Config ▹ view | ✅ Fix detected |
| Typo in SafeArena All Benchmark Name ▹ view | |
Files scanned
| File Path | Reviewed |
|---|---|
| browsergym/safearena/src/browsergym/safearena/config.py | ✅ |
| browsergym/safearena/src/browsergym/safearena/__init__.py | ✅ |
| browsergym/experiments/src/browsergym/experiments/benchmark/configs.py | ✅ |
| browsergym/safearena/src/browsergym/safearena/instance.py | ✅ |
| browsergym/safearena/src/browsergym/safearena/task.py | ✅ |
```diff
@@ -0,0 +1 @@
+TASK_IDS = range(250)
```
Lack of proper configuration management structure 
Tell me more
What is the issue?
Configuration data is directly hardcoded as a global variable without proper encapsulation or configuration management structure.
Why this matters
Hardcoding configuration values makes the code rigid and difficult to maintain. Changes require code modifications rather than configuration updates, and there's no clear separation between configuration and code logic.
Suggested change ∙ Feature Preview
Implement a proper configuration management system, for example:

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class SafeArenaConfig:
    task_ids: Sequence[int] = tuple(range(250))

# Usage:
config = SafeArenaConfig()
```

Or use a configuration file (yaml/json) with a parser class.
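As a sketch of the file-based option, here is a minimal JSON-backed loader (JSON is used to stay within the standard library; a YAML parser would work the same way, and the `from_file` method name is illustrative, not code from this PR):

```python
import json
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class SafeArenaConfig:
    task_ids: Sequence[int] = tuple(range(250))

    @classmethod
    def from_file(cls, path: str) -> "SafeArenaConfig":
        # Expects a file like {"task_ids": [0, 1, 2]};
        # missing keys fall back to the dataclass defaults
        with open(path) as f:
            raw = json.load(f)
        return cls(task_ids=tuple(raw.get("task_ids", range(250))))
```

This keeps the defaults in code while letting deployments override the task list without touching the module.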
```python
try:
    nltk.data.find("tokenizers/punkt_tab")
except:
    nltk.download("punkt_tab", quiet=True, raise_on_error=True)
```
Unhandled Download Failure 
Tell me more
What is the issue?
The code assumes the download will always succeed when raise_on_error=True, but network issues or permission problems could cause failures.
Why this matters
If the download fails, the program will continue without the required NLTK resources, potentially causing runtime errors in dependent code.
Suggested change ∙ Feature Preview
Add error handling for download failures:

```python
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    try:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)
    except Exception as e:
        raise RuntimeError(f"Failed to download required NLTK resources: {e}") from e
```
```python
try:
    nltk.data.find("tokenizers/punkt_tab")
except:
    nltk.download("punkt_tab", quiet=True, raise_on_error=True)
```
Mixed Resource Initialization with Task Registration 
Tell me more
What is the issue?
The initialization of NLTK resources is mixed with task registration logic, violating the Single Responsibility Principle.
Why this matters
This makes the code harder to maintain and test, as resource initialization is tightly coupled with task registration. It also makes it difficult to mock or bypass NLTK initialization during testing.
Suggested change ∙ Feature Preview
Extract NLTK initialization into a separate function or module (also narrowing the bare `except` to `LookupError`, which is what `nltk.data.find` raises):

```python
def initialize_nltk_resources():
    try:
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)

# Call this function before task registration
initialize_nltk_resources()
```
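To make the testability argument concrete, here is a hedged sketch (the lazy `import nltk` and the `init` parameter are illustrative choices, not code from this PR) showing how an extracted initializer can be stubbed out so tests never touch NLTK:

```python
def initialize_nltk_resources():
    """Ensure the punkt_tab tokenizer data is available (requires nltk)."""
    import nltk  # imported lazily so test code can avoid it entirely
    try:
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)

def register_all_tasks(init=initialize_nltk_resources):
    # Dependency-injected initializer: tests pass a no-op instead
    init()
    return "registered"  # placeholder for the real registration loop
```

A test can then call `register_all_tasks(init=lambda: None)` and exercise registration without any network access or NLTK install.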
```python
for task_id in config.TASK_IDS:
    gym_id = f"safearena.{task_id}"
    register_task(
        gym_id,
        task.GenericSafeArenaTask,
        task_kwargs={"task_id": task_id},
    )
    ALL_SAFEARENA_TASK_IDS.append(gym_id)
```
Lack of Task Registration Encapsulation 
Tell me more
What is the issue?
Task registration logic is not encapsulated in a dedicated function, making it harder to understand and maintain.
Why this matters
The current implementation makes it difficult to modify registration behavior or add pre/post registration hooks. It also makes the code less testable and harder to reuse.
Suggested change ∙ Feature Preview
Create a dedicated function for task registration:

```python
def register_safearena_tasks():
    task_ids = []
    for task_id in config.TASK_IDS:
        gym_id = f"safearena.{task_id}"
        register_task(
            gym_id,
            task.GenericSafeArenaTask,
            task_kwargs={"task_id": task_id},
        )
        task_ids.append(gym_id)
    return task_ids

ALL_SAFEARENA_TASK_IDS = register_safearena_tasks()
```
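To illustrate the testability benefit of a dedicated registration function, here is a self-contained sketch (all names are illustrative; a plain dict stands in for gym's real registry so the example runs anywhere):

```python
from typing import Callable, Dict, List

REGISTRY: Dict[str, dict] = {}

def register_task(gym_id: str, task_cls, task_kwargs=None) -> None:
    # Stand-in for the real gym registration call
    REGISTRY[gym_id] = {"cls": task_cls, "kwargs": task_kwargs or {}}

def register_safearena_tasks(
    task_ids=range(3), register: Callable = register_task
) -> List[str]:
    # Injecting `register` lets tests observe or suppress registration
    gym_ids = []
    for task_id in task_ids:
        gym_id = f"safearena.{task_id}"
        register(gym_id, object, task_kwargs={"task_id": task_id})
        gym_ids.append(gym_id)
    return gym_ids
```

A test can pass a recording stub as `register` to verify the generated IDs without mutating any global registry.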
```python
task_list=task_list_from_metadata(metadata=task_metadata("safearena_all")),
max_steps=30,
n_repeats=1,
seeds_rng=np.random.RandomState(42),
```
Identical RNG Seeds Across Benchmarks 
Tell me more
What is the issue?
Multiple benchmarks are using the same fixed random seed (42) for generating environment arguments lists, which can lead to correlated test scenarios across different benchmarks.
Why this matters
Using the same seed across benchmarks reduces the diversity of test scenarios and may cause agents to overfit to specific patterns, failing to identify potential performance issues in edge cases.
Suggested change ∙ Feature Preview
Use different seeds for each benchmark to ensure broader test coverage and prevent correlated scenarios. Consider deriving the seed from the benchmark name; note that Python's built-in `hash()` is randomized per process for strings, so a deterministic digest is safer:

```python
import zlib

def get_benchmark_seed(benchmark_name: str, base_seed: int = 42) -> int:
    # crc32 is stable across runs, unlike hash() on strings,
    # and the modulus keeps the result in RandomState's valid range
    return (zlib.crc32(benchmark_name.encode("utf-8")) + base_seed) % (2**32)
```
```python
env_args_list=make_env_args_list_from_repeat_tasks(
    task_list=task_list_from_metadata(metadata=task_metadata("safearena_all")),
    max_steps=30,
    n_repeats=1,
```
This comment was marked as resolved.
"safearena_harm": lambda: Benchmark( | ||
name="safenarena_all", |
This comment was marked as resolved.
"safearena_safe": lambda: Benchmark( | ||
name="safenarena_all", |
This comment was marked as resolved.
"safearena_all": lambda: Benchmark( | ||
name="safenarena_all", |
Typo in SafeArena All Benchmark Name 
Tell me more
What is the issue?
The benchmark name contains a typo: 'safenarena_all' instead of 'safearena_all'.
Why this matters
This inconsistency in naming could lead to problems with benchmark identification and could break functionality that relies on exact name matching.
Suggested change ∙ Feature Preview
Correct the spelling in the name parameter:

```python
"safearena_all": lambda: Benchmark(
    name="safearena_all",
```
```python
try:
    requests.get(url, timeout=timeout)
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
    raise RuntimeError(
        f'SafeArena site "{site}" ({url}) is not reacheable. Please check the URL.'
    )
```
Incomplete request exception handling 
Tell me more
What is the issue?
The error handling for the request check is too specific, potentially missing other request exceptions like InvalidURL, InvalidSchema, or TooManyRedirects.
Why this matters
Other request exceptions could occur and would propagate uncaught, making debugging harder since the error message wouldn't indicate which site failed.
Suggested change ∙ Feature Preview
Use requests.exceptions.RequestException to catch all request-related errors (this also fixes the "reacheable" typo):

```python
try:
    requests.get(url, timeout=timeout)
except requests.exceptions.RequestException as e:
    raise RuntimeError(
        f'SafeArena site "{site}" ({url}) is not reachable: {e}'
    ) from e
```
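The broader catch works because every exception named in the review (`ConnectionError`, `Timeout`, `InvalidURL`, `InvalidSchema`, `TooManyRedirects`) derives from `requests.exceptions.RequestException`. A self-contained sketch of the check as a function (the `check_site` name is illustrative; requires the `requests` package):

```python
import requests

def check_site(site: str, url: str, timeout: float = 5.0) -> None:
    """Raise RuntimeError naming the failing site for any request error."""
    try:
        requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as e:
        raise RuntimeError(
            f'SafeArena site "{site}" ({url}) is not reachable: {e}'
        ) from e
```

Even a malformed URL (which raises `MissingSchema` before any network traffic) is converted into the same descriptive `RuntimeError`.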
…te browsergym-core version in requirements
```python
# list all the available assistantbench tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/workarena")]
print("\n".join(env_ids))
```
WebArena |
This should be SafeArena.
ToDo:
Description by Korbit AI
What change is being made?
Add the SafeArena Benchmark to the BrowserGym project by introducing configurations, tasks, metadata, and scripts related to "safearena", updating the `Makefile` to include `safearena`, and managing dependencies in the `pyproject.toml` and `requirements.txt` files.

Why are these changes being made?

This integration allows the BrowserGym project to support benchmarking web agents on safety-focused tasks, broadening the platform's scope to testing whether agents can navigate web environments safely. This addition supports ongoing research into ensuring AI agents operate safely in applications involving internet navigation.