Ran out of memory in memory space vmem / Extra memory due to padding #7942

Open

radna0 opened this issue Aug 31, 2024 · 2 comments

Comments

radna0 commented Aug 31, 2024

🐛 Bug

The error seems to be related to pixel_values being padded:

WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
config.json: 100%|█████████████████████████████████████████████████████████| 3.95k/3.95k [00:00<00:00, 23.8MB/s]
configuration_internvl_chat.py: 100%|██████████████████████████████████████| 3.85k/3.85k [00:00<00:00, 26.1MB/s]
configuration_intern_vit.py: 100%|█████████████████████████████████████████| 5.55k/5.55k [00:00<00:00, 29.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
configuration_internlm2.py: 100%|██████████████████████████████████████████| 7.00k/7.00k [00:00<00:00, 40.8MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_internvl_chat.py
- configuration_intern_vit.py
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_internvl_chat.py: 100%|███████████████████████████████████████████| 16.3k/16.3k [00:00<00:00, 70.4MB/s]
modeling_internlm2.py: 100%|███████████████████████████████████████████████| 61.2k/61.2k [00:00<00:00, 77.4MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
conversation.py: 100%|█████████████████████████████████████████████████████| 15.0k/15.0k [00:00<00:00, 82.2MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_intern_vit.py: 100%|██████████████████████████████████████████████| 18.1k/18.1k [00:00<00:00, 75.5MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_internvl_chat.py
- modeling_internlm2.py
- conversation.py
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
FlashAttention2 is not installed.
model.safetensors.index.json: 100%|█████████████████████████████████████████| 51.2k/51.2k [00:00<00:00, 637kB/s]
model-00001-of-00004.safetensors: 100%|████████████████████████████████████| 4.94G/4.94G [08:10<00:00, 10.1MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████| 4.92G/4.92G [01:28<00:00, 55.8MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████| 4.92G/4.92G [01:25<00:00, 57.2MB/s]
model-00004-of-00004.safetensors: 100%|████████████████████████████████████| 1.38G/1.38G [00:34<00:00, 39.8MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████| 4/4 [11:40<00:00, 175.20s/it]
Warning: Flash attention is not available, using eager attention instead.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.86it/s]
generation_config.json: 100%|███████████████████████████████████████████████████| 115/115 [00:00<00:00, 679kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████| 4.00k/4.00k [00:00<00:00, 26.3MB/s]
tokenization_internlm2.py: 100%|███████████████████████████████████████████| 8.79k/8.79k [00:00<00:00, 54.9MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- tokenization_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.model: 100%|█████████████████████████████████████████████████████| 1.48M/1.48M [00:00<00:00, 13.1MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████| 179/179 [00:00<00:00, 2.05MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████| 844/844 [00:00<00:00, 8.83MB/s]
Traceback (most recent call last):
  File "/home/kojoe/EasyAnimate/easyanimate/image_caption/template.py", line 132, in <module>
    response = model.chat(tokenizer, pixel_values, question, generation_config)
  File "/dev/shm/modules/transformers_modules/radna/XLA-InternVL2-8B/746cd35e611234c48f8dc5c61dbe30b5a782a208/modeling_internvl_chat.py", line 356, in chat
    generation_output = self.generate(
  File "/home/kojoe/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/dev/shm/modules/transformers_modules/radna/XLA-InternVL2-8B/746cd35e611234c48f8dc5c61dbe30b5a782a208/modeling_internvl_chat.py", line 410, in generate
    outputs = self.language_model.generate(
  File "/home/kojoe/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3038, in _sample
    unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/stopping_criteria.py", line 511, in __call__
    is_done = is_done | criteria(input_ids, scores, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/stopping_criteria.py", line 502, in __call__
    is_done = torch.isin(input_ids[:, -1], self.eos_token_id)
RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 29.95M of 16.00M vmem. Exceeded vmem capacity by 13.95M.

Program vmem requirement 29.95M:
    scoped           29.95M

  Largest program allocations in vmem:

  1. Size: 29.66M
     XLA label: register allocator spill slots call depth 2
     Allocation type: scoped
     ==========================

  2. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  3. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  4. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  5. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  6. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  7. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  8. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  9. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  10. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  11. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  12. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  13. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================
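
For reference, the reported 42.7x expansion on the f32[128,3] buffers is consistent with the minor dimension being padded up to the 128-wide lane tile. A minimal sketch of that arithmetic (assuming f32 elements and the last dimension padded from 3 to 128):

ELEM_BYTES = 4  # f32

unpadded = 128 * 3 * ELEM_BYTES    # 1536 B   -> the "Unpadded size: 1.5K" above
padded = 128 * 128 * ELEM_BYTES    # 65536 B  -> the "Size: 64.0K" above
extra = padded - unpadded          # 64000 B  -> "Extra memory due to padding: 62.5K"

print(padded / unpadded)           # ~42.67   -> the reported "42.7x expansion"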

To Reproduce

Steps to reproduce the behavior:

  1. Create template.py:
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
import os
import torch_xla
import torch_xla.distributed.spmd as xs
import torch_xla.core.xla_model as xm
from torch_xla import runtime as xr

xr.use_spmd(auto=False)
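# (use_spmd switches the runtime into SPMD execution mode; auto=False keeps
# XLA's auto-sharding pass off, so shardings are marked explicitly below)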

from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
    _prepare_spmd_partition_spec,
    SpmdFullyShardedDataParallel as FSDPv2,
)


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (i x j) whose tile count is within [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values



path = 'radna/XLA-InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    ).eval()


# Define the mesh and partition_spec
num_devices = xr.global_runtime_device_count()
mesh_shape = (num_devices, 1)
device_ids = np.array(range(num_devices))
# Note: the mesh must have an axis named 'fsdp'; weights and activations will be sharded along it.
mesh = xs.Mesh(device_ids, mesh_shape, ("fsdp", "model"))
xs.set_global_mesh(mesh)




model = FSDPv2(model)
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./image1.jpg', max_num=1).to(torch.bfloat16).to(xm.xla_device())
generation_config = dict(max_new_tokens=1024, do_sample=True)

xs.mark_sharding(pixel_values, xs.get_global_mesh(), _prepare_spmd_partition_spec(pixel_values, shard_maximal=True))
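# (assumption: with shard_maximal=True the helper picks the largest dimension
# of pixel_values and maps it onto the 'fsdp' mesh axis in the partition spec)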

# single-image, single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')


  2. Run python template.py

Expected behavior

Should run the modified XLA version of the InternVL2-8B model from https://huggingface.co/radna/XLA-InternVL2-8B

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
  • torch_xla version: nightly 2.5

Additional context

Reproducible on TPU v2 and v3

JackCaoG (Collaborator) commented Sep 3, 2024

Padding is not the issue; the issue is:

  1. Size: 29.66M
     XLA label: register allocator spill slots call depth 2
     Allocation type: scoped
     ==========================

If you can dump the HLO following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#common-debugging-environment-variables-combinations, we can open a bug with the XLA team. @will-cromar, since you are on call this week.
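
For reference, one way to produce such a dump (a minimal sketch, assuming the debugging environment variables described in TROUBLESHOOTING.md; they must be set before torch_xla is imported):

import os

# Annotate the lazy-tensor IR with Python frame info and save the compiled
# graphs as HLO text; the runtime appends an ordinal suffix to the file name
# (which would explain the save1.hlo.0 name attached below).
os.environ["XLA_IR_DEBUG"] = "1"
os.environ["XLA_HLO_DEBUG"] = "1"
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"
os.environ["XLA_SAVE_TENSORS_FILE"] = "save1.hlo"

import torch_xla  # import only after the variables above are set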

radna0 (Author) commented Sep 4, 2024

I followed the guide, and here is the HLO file. @JackCaoG @will-cromar (renamed to .txt, because GitHub doesn't allow the .hlo format):
save1.hlo.0.txt
