Transformer.from_folder - Can we specify FlashAttention ? #238

@amitbcp

Description

Python -VV

-

Pip Freeze

pip freeze | grep mistral
mistral_common==1.5.1
mistral_inference==1.5.0
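
The `grep mistral` filter hides the other packages that may matter for attention speed. A fuller report could be collected with a small sketch like the one below. The distribution names in `PACKAGES` are assumptions (in particular `xformers` and `flash_attn`), and `installed_versions` is a hypothetical helper, not part of `mistral_inference`.

```python
from importlib import metadata

# Assumed distribution names; adjust to match the environment being checked.
PACKAGES = ["mistral_inference", "mistral_common", "torch", "xformers", "flash_attn"]


def installed_versions(names):
    """Return {name: version string, or None if the distribution is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

The output can be pasted in place of the filtered `pip freeze` listing.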

Reproduction Steps

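# Excerpt from a method of the wrapper class; `model_path` is its argument.
# `get_cache_path` is a helper defined elsewhere in the calling project.
import logging
import os

from huggingface_hub import snapshot_download

self.model_path = model_path
try:
    from mistral_inference.transformer import Transformer
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
except ImportError as err:
    logging.critical('Please install `mistral-inference` and `mistral_common`')
    raise err

# Use a local folder if one is given; otherwise download from the Hub.
if os.path.exists(model_path):
    cache_path = model_path
else:
    if get_cache_path(model_path) is None:
        snapshot_download(repo_id=model_path)
    cache_path = get_cache_path(self.model_path, repo_type='models')

self.tokenizer = MistralTokenizer.from_file(f'{cache_path}/tekken.json')
# Load the weights on the CPU first, then move the model to the GPU.
model = Transformer.from_folder(cache_path, device='cpu')
model.cuda()
self.model = model
self.max_tokens = 2048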

Expected Behavior

  1. Inference for Pixtral is very slow. Is there a way to specify that flash-attention 2 should be used?
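
To put a number on "very slow", the generation call could be wrapped in a timer. This is a minimal sketch: `time_call` is a hypothetical helper, and the function passed to it stands in for whatever wraps the model's generation step.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn `repeats` times; return (last result, list of wall-clock seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        # Note: for GPU work, synchronize the device inside fn before it returns,
        # otherwise the timer may stop before the kernels have finished.
        durations.append(time.perf_counter() - start)
    return result, durations


if __name__ == "__main__":
    # Stand-in workload; replace with the real generation call.
    _, seconds = time_call(sum, range(1_000_000))
    print([f"{s:.4f}s" for s in seconds])
```

Reporting the per-call times together with the number of generated tokens makes the slowdown easier to compare across setups.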

Additional Context

No response

Suggested Solutions

No response

Labels: bug
