addition of Safearena Benchmark #345
base: main
Conversation
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Issue | Status |
|---|---|
| Identical RNG Seeds Across Benchmarks ▹ view | |
| Unhandled Download Failure ▹ view | |
| Insufficient Task Repetitions ▹ view | ✅ Fix detected |
| Incorrect Benchmark Name in SafeArena Harm Config ▹ view | ✅ Fix detected |
| Lack of Task Registration Encapsulation ▹ view | |
| Mixed Resource Initialization with Task Registration ▹ view | |
| Lack of proper configuration management structure ▹ view | |
| Incomplete request exception handling ▹ view | |
| Incorrect Benchmark Name in SafeArena Safe Config ▹ view | ✅ Fix detected |
| Typo in SafeArena All Benchmark Name ▹ view | |
Files scanned
| File Path | Reviewed |
|---|---|
| browsergym/safearena/src/browsergym/safearena/config.py | ✅ |
| browsergym/safearena/src/browsergym/safearena/__init__.py | ✅ |
| browsergym/experiments/src/browsergym/experiments/benchmark/configs.py | ✅ |
| browsergym/safearena/src/browsergym/safearena/instance.py | ✅ |
| browsergym/safearena/src/browsergym/safearena/task.py | ✅ |
```diff
@@ -0,0 +1 @@
+TASK_IDS = range(250)
```
Lack of proper configuration management structure 
Tell me more
What is the issue?
Configuration data is directly hardcoded as a global variable without proper encapsulation or configuration management structure.
Why this matters
Hardcoding configuration values makes the code rigid and difficult to maintain. Changes require code modifications rather than configuration updates, and there's no clear separation between configuration and code logic.
Suggested change ∙ Feature Preview
Implement a proper configuration management system, for example:

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class SafeArenaConfig:
    task_ids: Sequence[int] = tuple(range(250))

# Usage:
config = SafeArenaConfig()
```

Or use a configuration file (yaml/json) with a parser class.
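As a sketch of the file-based option, here is a minimal JSON-backed loader (JSON is used to stay within the standard library; a YAML parser would work the same way, and the `from_file` method name is illustrative, not code from this PR):

```python
import json
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class SafeArenaConfig:
    task_ids: Sequence[int] = tuple(range(250))

    @classmethod
    def from_file(cls, path: str) -> "SafeArenaConfig":
        # Expects a file like {"task_ids": [0, 1, 2]};
        # missing keys fall back to the dataclass defaults
        with open(path) as f:
            raw = json.load(f)
        return cls(task_ids=tuple(raw.get("task_ids", range(250))))
```

This keeps the defaults in code while letting deployments override the task list without touching the module.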
```python
try:
    nltk.data.find("tokenizers/punkt_tab")
except:
    nltk.download("punkt_tab", quiet=True, raise_on_error=True)
```
Unhandled Download Failure 
Tell me more
What is the issue?
The code assumes the download will always succeed when raise_on_error=True, but network issues or permission problems could cause failures.
Why this matters
If the download fails, the program will continue without the required NLTK resources, potentially causing runtime errors in dependent code.
Suggested change ∙ Feature Preview
Add error handling for download failures:

```python
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    try:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)
    except Exception as e:
        raise RuntimeError(f"Failed to download required NLTK resources: {e}") from e
```
```python
try:
    nltk.data.find("tokenizers/punkt_tab")
except:
    nltk.download("punkt_tab", quiet=True, raise_on_error=True)
```
Mixed Resource Initialization with Task Registration 
Tell me more
What is the issue?
The initialization of NLTK resources is mixed with task registration logic, violating the Single Responsibility Principle.
Why this matters
This makes the code harder to maintain and test, as resource initialization is tightly coupled with task registration. It also makes it difficult to mock or bypass NLTK initialization during testing.
Suggested change ∙ Feature Preview
Extract NLTK initialization into a separate function or module (also narrowing the bare `except` to `LookupError`, which is what `nltk.data.find` raises):

```python
def initialize_nltk_resources():
    try:
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)

# Call this function before task registration
initialize_nltk_resources()
```
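To make the testability argument concrete, here is a hedged sketch (the lazy `import nltk` and the `init` parameter are illustrative choices, not code from this PR) showing how an extracted initializer can be stubbed out so tests never touch NLTK:

```python
def initialize_nltk_resources():
    """Ensure the punkt_tab tokenizer data is available (requires nltk)."""
    import nltk  # imported lazily so test code can avoid it entirely
    try:
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        nltk.download("punkt_tab", quiet=True, raise_on_error=True)

def register_all_tasks(init=initialize_nltk_resources):
    # Dependency-injected initializer: tests pass a no-op instead
    init()
    return "registered"  # placeholder for the real registration loop
```

A test can then call `register_all_tasks(init=lambda: None)` and exercise registration without any network access or NLTK install.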
```python
for task_id in config.TASK_IDS:
    gym_id = f"safearena.{task_id}"
    register_task(
        gym_id,
        task.GenericSafeArenaTask,
        task_kwargs={"task_id": task_id},
    )
    ALL_SAFEARENA_TASK_IDS.append(gym_id)
```
Lack of Task Registration Encapsulation 
Tell me more
What is the issue?
Task registration logic is not encapsulated in a dedicated function, making it harder to understand and maintain.
Why this matters
The current implementation makes it difficult to modify registration behavior or add pre/post registration hooks. It also makes the code less testable and harder to reuse.
Suggested change ∙ Feature Preview
Create a dedicated function for task registration:

```python
def register_safearena_tasks():
    task_ids = []
    for task_id in config.TASK_IDS:
        gym_id = f"safearena.{task_id}"
        register_task(
            gym_id,
            task.GenericSafeArenaTask,
            task_kwargs={"task_id": task_id},
        )
        task_ids.append(gym_id)
    return task_ids

ALL_SAFEARENA_TASK_IDS = register_safearena_tasks()
```
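To illustrate the testability benefit of a dedicated registration function, here is a self-contained sketch (all names are illustrative; a plain dict stands in for gym's real registry so the example runs anywhere):

```python
from typing import Callable, Dict, List

REGISTRY: Dict[str, dict] = {}

def register_task(gym_id: str, task_cls, task_kwargs=None) -> None:
    # Stand-in for the real gym registration call
    REGISTRY[gym_id] = {"cls": task_cls, "kwargs": task_kwargs or {}}

def register_safearena_tasks(
    task_ids=range(3), register: Callable = register_task
) -> List[str]:
    # Injecting `register` lets tests observe or suppress registration
    gym_ids = []
    for task_id in task_ids:
        gym_id = f"safearena.{task_id}"
        register(gym_id, object, task_kwargs={"task_id": task_id})
        gym_ids.append(gym_id)
    return gym_ids
```

A test can pass a recording stub as `register` to verify the generated IDs without mutating any global registry.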
```python
task_list=task_list_from_metadata(metadata=task_metadata("safearena_all")),
max_steps=30,
n_repeats=1,
seeds_rng=np.random.RandomState(42),
```
Identical RNG Seeds Across Benchmarks 
Tell me more
What is the issue?
Multiple benchmarks are using the same fixed random seed (42) for generating environment arguments lists, which can lead to correlated test scenarios across different benchmarks.
Why this matters
Using the same seed across benchmarks reduces the diversity of test scenarios and may cause agents to overfit to specific patterns, failing to identify potential performance issues in edge cases.
Suggested change ∙ Feature Preview
Use different seeds for each benchmark to ensure broader test coverage and prevent correlated scenarios. Consider deriving the seed from the benchmark name; note that Python's built-in `hash()` is randomized per process for strings, so a deterministic digest is safer:

```python
import zlib

def get_benchmark_seed(benchmark_name: str, base_seed: int = 42) -> int:
    # crc32 is stable across runs, unlike hash() on strings,
    # and the modulus keeps the result in RandomState's valid range
    return (zlib.crc32(benchmark_name.encode("utf-8")) + base_seed) % (2**32)
```
```python
env_args_list=make_env_args_list_from_repeat_tasks(
    task_list=task_list_from_metadata(metadata=task_metadata("safearena_all")),
    max_steps=30,
    n_repeats=1,
```
This comment was marked as resolved.
"safearena_harm": lambda: Benchmark( | ||
name="safenarena_all", |
This comment was marked as resolved.
"safearena_safe": lambda: Benchmark( | ||
name="safenarena_all", |
This comment was marked as resolved.
"safearena_all": lambda: Benchmark( | ||
name="safenarena_all", |
Typo in SafeArena All Benchmark Name 
Tell me more
What is the issue?
The benchmark name contains a typo: 'safenarena_all' instead of 'safearena_all'.
Why this matters
This inconsistency in naming could lead to problems with benchmark identification and could break functionality that relies on exact name matching.
Suggested change ∙ Feature Preview
Correct the spelling in the name parameter:

```python
"safearena_all": lambda: Benchmark(
    name="safearena_all",
```
```python
try:
    requests.get(url, timeout=timeout)
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
    raise RuntimeError(
        f'SafeArena site "{site}" ({url}) is not reacheable. Please check the URL.'
    )
```
Incomplete request exception handling 
Tell me more
What is the issue?
The error handling for the request check is too specific, potentially missing other request exceptions like InvalidURL, InvalidSchema, or TooManyRedirects.
Why this matters
Other request exceptions could occur and would propagate uncaught, making debugging harder since the error message wouldn't indicate which site failed.
Suggested change ∙ Feature Preview
Use requests.exceptions.RequestException to catch all request-related errors (this also fixes the "reacheable" typo):

```python
try:
    requests.get(url, timeout=timeout)
except requests.exceptions.RequestException as e:
    raise RuntimeError(
        f'SafeArena site "{site}" ({url}) is not reachable: {e}'
    ) from e
```
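The broader catch works because every exception named in the review (`ConnectionError`, `Timeout`, `InvalidURL`, `InvalidSchema`, `TooManyRedirects`) derives from `requests.exceptions.RequestException`. A self-contained sketch of the check as a function (the `check_site` name is illustrative; requires the `requests` package):

```python
import requests

def check_site(site: str, url: str, timeout: float = 5.0) -> None:
    """Raise RuntimeError naming the failing site for any request error."""
    try:
        requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as e:
        raise RuntimeError(
            f'SafeArena site "{site}" ({url}) is not reachable: {e}'
        ) from e
```

Even a malformed URL (which raises `MissingSchema` before any network traffic) is converted into the same descriptive `RuntimeError`.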
…te browsergym-core version in requirements
```python
# list all the available assistantbench tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/workarena")]
print("\n".join(env_ids))
```
WebArena |
This should be SafeArena.
ToDo:
Description by Korbit AI
What change is being made?
Add the SafeArena Benchmark to the BrowserGym project by introducing configurations, tasks, metadata, and scripts related to "safearena", updating the `Makefile` to include `safearena`, and managing dependencies in the `pyproject.toml` and `requirements.txt` files.

Why are these changes being made?

This integration allows the BrowserGym project to support benchmarking web agents on safety-focused tasks, broadening the platform's scope to testing whether agents can navigate web environments safely. This addition supports ongoing research into ensuring AI agents operate safely in applications involving internet navigation.