
[BUG] Model Parallelism is incorrectly disabled on single multi-GPU machines #895

@huaanrui
Describe the bug

When running an evaluation with accelerate launch on a single machine with multiple GPUs, setting model_parallel=True in the TransformersModelConfig has no effect. The library logs that it is "not in a distributed setting" and force-overrides model_parallel to False.

To Reproduce

1. Set up an evaluation script on a single machine with 2 or more GPUs.
2. In the script, create a TransformersModelConfig for a large model (e.g., 7B+) and explicitly set model_parallel=True.
3. Launch the script with accelerate launch --num_processes=2 your_script.py.
4. Observe the log output: We are not in a distributed setting. Setting model_parallel to False.

Expected behavior

lighteval should respect the model_parallel=True configuration in a single-node, multi-GPU environment. It should proceed to shard the model across the available GPUs using device_map="auto" instead of disabling the feature.
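To illustrate the expected behavior, here is a minimal sketch of a gating rule that would honor model_parallel=True on a single multi-GPU node. This is not lighteval's actual code; the function name resolve_model_parallel and its parameters are hypothetical, and the assumption is that model parallelism is worthwhile whenever each launched process has more than one GPU available to shard across.

```python
def resolve_model_parallel(requested: bool, num_processes: int, num_gpus: int) -> bool:
    """Hypothetical sketch: decide whether to honor model_parallel=True.

    requested     -- the user's model_parallel setting from the config
    num_processes -- processes started by `accelerate launch`
    num_gpus      -- GPUs visible on this node (e.g. torch.cuda.device_count())
    """
    if not requested:
        return False
    # Instead of requiring a multi-node distributed setup, check whether
    # each process can shard its model copy across more than one GPU.
    gpus_per_process = num_gpus // max(num_processes, 1)
    return gpus_per_process > 1
```

Under this rule, a single process on a 2-GPU machine would keep model parallelism enabled (and could then load the model with device_map="auto"), while a run where every process already owns exactly one GPU would fall back to data parallelism.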

Version info

Built from source.

Labels: bug