Skip to content

Conversation

alzambranolu13
Copy link

@alzambranolu13 alzambranolu13 commented May 27, 2025

ToDo:

  • Update ReadMe
  • Add tests similar to VisualWebArena
  • Refactor similar to VisualWebArena (remove any duplication)

Description by Korbit AI

What change is being made?

Add the SafeArena Benchmark to the BrowserGym project by introducing configurations, tasks, metadata, and scripts related to "safearena", updating the Makefile to include safearena, and managing dependencies in the pyproject.toml and requirements.txt files.

Why are these changes being made?

This integration allows the BrowserGym project to support and benchmark web agents on safety-focused tasks, thereby enhancing the scope of the platform in testing agent capability against safely navigating web environments. This addition caters to the ongoing research in ensuring AI agents operate safely in applications involving internet navigation.

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description

Copy link

@korbit-ai korbit-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category Issue Status
Performance Identical RNG Seeds Across Benchmarks ▹ view
Error Handling Unhandled Download Failure ▹ view
Performance Insufficient Task Repetitions ▹ view ✅ Fix detected
Logging Incorrect Benchmark Name in SafeArena Harm Config ▹ view ✅ Fix detected
Design Lack of Task Registration Encapsulation ▹ view
Design Mixed Resource Initialization with Task Registration ▹ view
Design Lack of proper configuration management structure ▹ view
Error Handling Incomplete request exception handling ▹ view
Logging Incorrect Benchmark Name in SafeArena Safe Config ▹ view ✅ Fix detected
Functionality Typo in SafeArena All Benchmark Name ▹ view
Files scanned
File Path Reviewed
browsergym/safearena/src/browsergym/safearena/config.py
browsergym/safearena/src/browsergym/safearena/init.py
browsergym/experiments/src/browsergym/experiments/benchmark/configs.py
browsergym/safearena/src/browsergym/safearena/instance.py
browsergym/safearena/src/browsergym/safearena/task.py

Explore our documentation to understand the languages and file types we support and the files we ignore.

Check out our docs on how you can make Korbit work best for you and your team.

Loving Korbit!? Share us on LinkedIn Reddit and X

@@ -0,0 +1 @@
TASK_IDS = range(250)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lack of proper configuration management structure category Design

Tell me more
What is the issue?

Configuration data is directly hardcoded as a global variable without proper encapsulation or configuration management structure.

Why this matters

Hardcoding configuration values makes the code rigid and difficult to maintain. Changes require code modifications rather than configuration updates, and there's no clear separation between configuration and code logic.

Suggested change ∙ Feature Preview

Implement a proper configuration management system, for example:

from dataclasses import dataclass
from typing import Sequence

@dataclass
class SafeArenaConfig:
    task_ids: Sequence[int] = tuple(range(250))

# Usage:
config = SafeArenaConfig()

Or use a configuration file (yaml/json) with a parser class.

Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

try:
nltk.data.find("tokenizers/punkt_tab")
except:
nltk.download("punkt_tab", quiet=True, raise_on_error=True)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unhandled Download Failure category Error Handling

Tell me more
What is the issue?

The code assumes the download will always succeed when raise_on_error=True, but network issues or permission problems could cause failures.

Why this matters

If the download fails, the program will continue without the required NLTK resources, potentially causing runtime errors in dependent code.

Suggested change ∙ Feature Preview

Add error handling for download failures:

try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    try:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)
    except Exception as e:
        raise RuntimeError(f"Failed to download required NLTK resources: {e}")
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines +9 to +12
try:
nltk.data.find("tokenizers/punkt_tab")
except:
nltk.download("punkt_tab", quiet=True, raise_on_error=True)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixed Resource Initialization with Task Registration category Design

Tell me more
What is the issue?

The initialization of NLTK resources is mixed with task registration logic, violating the Single Responsibility Principle.

Why this matters

This makes the code harder to maintain and test, as resource initialization is tightly coupled with task registration. It also makes it difficult to mock or bypass NLTK initialization during testing.

Suggested change ∙ Feature Preview

Extract NLTK initialization into a separate function or module:

def initialize_nltk_resources():
    try:
        nltk.data.find("tokenizers/punkt_tab")
    except:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)

# Call this function before task registration
initialize_nltk_resources()
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines +17 to +24
for task_id in config.TASK_IDS:
gym_id = f"safearena.{task_id}"
register_task(
gym_id,
task.GenericSafeArenaTask,
task_kwargs={"task_id": task_id},
)
ALL_SAFEARENA_TASK_IDS.append(gym_id)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lack of Task Registration Encapsulation category Design

Tell me more
What is the issue?

Task registration logic is not encapsulated in a dedicated function, making it harder to understand and maintain.

Why this matters

The current implementation makes it difficult to modify registration behavior or add pre/post registration hooks. It also makes the code less testable and harder to reuse.

Suggested change ∙ Feature Preview

Create a dedicated function for task registration:

def register_safearena_tasks():
    task_ids = []
    for task_id in config.TASK_IDS:
        gym_id = f"safearena.{task_id}"
        register_task(
            gym_id,
            task.GenericSafeArenaTask,
            task_kwargs={"task_id": task_id},
        )
        task_ids.append(gym_id)
    return task_ids

ALL_SAFEARENA_TASK_IDS = register_safearena_tasks()
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

task_list=task_list_from_metadata(metadata=task_metadata("safearena_all")),
max_steps=30,
n_repeats=1,
seeds_rng=np.random.RandomState(42),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identical RNG Seeds Across Benchmarks category Performance

Tell me more
What is the issue?

Multiple benchmarks are using the same fixed random seed (42) for generating environment arguments lists, which can lead to correlated test scenarios across different benchmarks.

Why this matters

Using the same seed across benchmarks reduces the diversity of test scenarios and may cause agents to overfit to specific patterns, failing to identify potential performance issues in edge cases.

Suggested change ∙ Feature Preview

Use different seeds for each benchmark to ensure broader test coverage and prevent correlated scenarios. Consider using a seed generation function:

def get_benchmark_seed(benchmark_name: str, base_seed: int = 42) -> int:
    return hash(benchmark_name) + base_seed
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines 268 to 273
env_args_list=make_env_args_list_from_repeat_tasks(
task_list=task_list_from_metadata(metadata=task_metadata("safearena_all")),
max_steps=30,
n_repeats=1,

This comment was marked as resolved.

Comment on lines 276 to 277
"safearena_harm": lambda: Benchmark(
name="safenarena_all",

This comment was marked as resolved.

Comment on lines 290 to 291
"safearena_safe": lambda: Benchmark(
name="safenarena_all",

This comment was marked as resolved.

Comment on lines +262 to +265
"safearena_all": lambda: Benchmark(
name="safenarena_all",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in SafeArena All Benchmark Name category Functionality

Tell me more
What is the issue?

The benchmark name contains a typo: 'safenarena_all' instead of 'safearena_all'.

Why this matters

This inconsistency in naming could lead to problems with benchmark identification and could break functionality that relies on exact name matching.

Suggested change ∙ Feature Preview

Correct the spelling in the name parameter:

"safearena_all": lambda: Benchmark(
        name="safearena_all",
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines +145 to +156
try:
requests.get(url, timeout=timeout)
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
raise RuntimeError(
f'SafeArena site "{site}" ({url}) is not reacheable. Please check the URL.'
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete request exception handling category Error Handling

Tell me more
What is the issue?

The error handling for the request check is too specific, potentially missing other request exceptions like InvalidURL, InvalidSchema, or TooManyRedirects.

Why this matters

Other request exceptions could occur and would propagate uncaught, making debugging harder since the error message wouldn't indicate which site failed.

Suggested change ∙ Feature Preview

Use requests.exceptions.RequestException to catch all request-related errors:

try:
    requests.get(url, timeout=timeout)
except requests.exceptions.RequestException as e:
    raise RuntimeError(
        f'SafeArena site "{site}" ({url}) is not reacheable: {str(e)}'
    )
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

@amanjaiswal73892 amanjaiswal73892 marked this pull request as draft June 16, 2025 15:01
@amanjaiswal73892 amanjaiswal73892 added the enhancement New feature or request label Jul 16, 2025
# list all the available assistantbench tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/workarena")]
print("\n".join(env_ids))
```

WebArena
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be SafeArena.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants