Ran out of memory in memory space vmem / Extra memory due to padding #7942

Open

radna0 opened this issue Aug 31, 2024 · 2 comments

Comments

radna0 commented Aug 31, 2024

🐛 Bug

The error seems to be related to pixel_values being padded:

WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
config.json: 100%|█████████████████████████████████████████████████████████| 3.95k/3.95k [00:00<00:00, 23.8MB/s]
configuration_internvl_chat.py: 100%|██████████████████████████████████████| 3.85k/3.85k [00:00<00:00, 26.1MB/s]
configuration_intern_vit.py: 100%|█████████████████████████████████████████| 5.55k/5.55k [00:00<00:00, 29.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
configuration_internlm2.py: 100%|██████████████████████████████████████████| 7.00k/7.00k [00:00<00:00, 40.8MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- configuration_internvl_chat.py
- configuration_intern_vit.py
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_internvl_chat.py: 100%|███████████████████████████████████████████| 16.3k/16.3k [00:00<00:00, 70.4MB/s]
modeling_internlm2.py: 100%|███████████████████████████████████████████████| 61.2k/61.2k [00:00<00:00, 77.4MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
conversation.py: 100%|█████████████████████████████████████████████████████| 15.0k/15.0k [00:00<00:00, 82.2MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_intern_vit.py: 100%|██████████████████████████████████████████████| 18.1k/18.1k [00:00<00:00, 75.5MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- modeling_internvl_chat.py
- modeling_internlm2.py
- conversation.py
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
FlashAttention2 is not installed.
model.safetensors.index.json: 100%|█████████████████████████████████████████| 51.2k/51.2k [00:00<00:00, 637kB/s]
model-00001-of-00004.safetensors: 100%|████████████████████████████████████| 4.94G/4.94G [08:10<00:00, 10.1MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████| 4.92G/4.92G [01:28<00:00, 55.8MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████| 4.92G/4.92G [01:25<00:00, 57.2MB/s]
model-00004-of-00004.safetensors: 100%|████████████████████████████████████| 1.38G/1.38G [00:34<00:00, 39.8MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████| 4/4 [11:40<00:00, 175.20s/it]
Warning: Flash attention is not available, using eager attention instead.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.86it/s]
generation_config.json: 100%|███████████████████████████████████████████████████| 115/115 [00:00<00:00, 679kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████| 4.00k/4.00k [00:00<00:00, 26.3MB/s]
tokenization_internlm2.py: 100%|███████████████████████████████████████████| 8.79k/8.79k [00:00<00:00, 54.9MB/s]
A new version of the following files was downloaded from https://huggingface.co/radna/XLA-InternVL2-8B:
- tokenization_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.model: 100%|█████████████████████████████████████████████████████| 1.48M/1.48M [00:00<00:00, 13.1MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████| 179/179 [00:00<00:00, 2.05MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████| 844/844 [00:00<00:00, 8.83MB/s]
Traceback (most recent call last):
  File "/home/kojoe/EasyAnimate/easyanimate/image_caption/template.py", line 132, in <module>
    response = model.chat(tokenizer, pixel_values, question, generation_config)
  File "/dev/shm/modules/transformers_modules/radna/XLA-InternVL2-8B/746cd35e611234c48f8dc5c61dbe30b5a782a208/modeling_internvl_chat.py", line 356, in chat
    generation_output = self.generate(
  File "/home/kojoe/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/dev/shm/modules/transformers_modules/radna/XLA-InternVL2-8B/746cd35e611234c48f8dc5c61dbe30b5a782a208/modeling_internvl_chat.py", line 410, in generate
    outputs = self.language_model.generate(
  File "/home/kojoe/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3038, in _sample
    unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/stopping_criteria.py", line 511, in __call__
    is_done = is_done | criteria(input_ids, scores, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/stopping_criteria.py", line 502, in __call__
    is_done = torch.isin(input_ids[:, -1], self.eos_token_id)
RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 29.95M of 16.00M vmem. Exceeded vmem capacity by 13.95M.

Program vmem requirement 29.95M:
    scoped           29.95M

  Largest program allocations in vmem:

  1. Size: 29.66M
     XLA label: register allocator spill slots call depth 2
     Allocation type: scoped
     ==========================

  2. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  3. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  4. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  5. Size: 64.0K
     Shape: f32[128,3]{1,0}
     Unpadded size: 1.5K
     Extra memory due to padding: 62.5K (42.7x expansion)
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  6. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  7. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  8. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  9. Size: 4.0K
     Shape: u8[4096]{0}
     Unpadded size: 4.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  10. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  11. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  12. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================

  13. Size: 2.0K
     Shape: u8[2048]{0}
     Unpadded size: 2.0K
     XLA label: reduce-window.8 = reduce-window(bitcast.1020, bitcast.1021, constant.3067, constant.3067), window={size=1x1x128 pad=0_0x0_0x127_0}, to_apply=AddComputation.5421.clone
     Allocation type: scoped
     ==========================
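
For reference, the reported 42.7x expansion on the f32[128,3] buffers is consistent with the minor dimension being padded up to the 128-wide lane tile. A minimal sketch of that arithmetic (assuming f32 elements and the last dimension padded from 3 to 128):

ELEM_BYTES = 4  # f32

unpadded = 128 * 3 * ELEM_BYTES    # 1536 B   -> the "Unpadded size: 1.5K" above
padded = 128 * 128 * ELEM_BYTES    # 65536 B  -> the "Size: 64.0K" above
extra = padded - unpadded          # 64000 B  -> "Extra memory due to padding: 62.5K"

print(padded / unpadded)           # ~42.67   -> the reported "42.7x expansion"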

To Reproduce

Steps to reproduce the behavior:

  1. Create template.py:
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
import os
import torch_xla
import torch_xla.distributed.spmd as xs
import torch_xla.core.xla_model as xm
from torch_xla import runtime as xr

xr.use_spmd(auto=False)
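# (use_spmd switches the runtime into SPMD execution mode; auto=False keeps
# XLA's auto-sharding pass off, so shardings are marked explicitly below)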

from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
    _prepare_spmd_partition_spec,
    SpmdFullyShardedDataParallel as FSDPv2,
)


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (i x j) whose tile count is within [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values



path = 'radna/XLA-InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    ).eval()


# Define the mesh and partition_spec
num_devices = xr.global_runtime_device_count()
mesh_shape = (num_devices, 1)
device_ids = np.array(range(num_devices))
# Note: the mesh must have an axis named 'fsdp'; weights and activations will be sharded along it.
mesh = xs.Mesh(device_ids, mesh_shape, ("fsdp", "model"))
xs.set_global_mesh(mesh)




model = FSDPv2(model)
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./image1.jpg', max_num=1).to(torch.bfloat16).to(xm.xla_device())
generation_config = dict(max_new_tokens=1024, do_sample=True)

xs.mark_sharding(pixel_values, xs.get_global_mesh(), _prepare_spmd_partition_spec(pixel_values, shard_maximal=True))
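# (assumption: with shard_maximal=True the helper picks the largest dimension
# of pixel_values and maps it onto the 'fsdp' mesh axis in the partition spec)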

# single-image, single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')


  2. Run python template.py

Expected behavior

Should run the modified XLA version of the InternVL2-8B model from https://huggingface.co/radna/XLA-InternVL2-8B

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
  • torch_xla version: nightly 2.5

Additional context

Reproducible on TPU v2 and v3

JackCaoG (Collaborator) commented Sep 3, 2024

Padding is not the issue; the issue is:

  1. Size: 29.66M
     XLA label: register allocator spill slots call depth 2
     Allocation type: scoped
     ==========================

If you can dump the HLO following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#common-debugging-environment-variables-combinations, we can open a bug with the XLA team. @will-cromar, since you are on call this week.
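
For reference, one way to produce such a dump (a minimal sketch, assuming the debugging environment variables described in TROUBLESHOOTING.md; they must be set before torch_xla is imported):

import os

# Annotate the lazy-tensor IR with Python frame info and save the compiled
# graphs as HLO text; the runtime appends an ordinal suffix to the file name
# (which would explain the save1.hlo.0 name attached below).
os.environ["XLA_IR_DEBUG"] = "1"
os.environ["XLA_HLO_DEBUG"] = "1"
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"
os.environ["XLA_SAVE_TENSORS_FILE"] = "save1.hlo"

import torch_xla  # import only after the variables above are set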

radna0 (Author) commented Sep 4, 2024

I followed the guide, and here is the HLO file. @JackCaoG @will-cromar (renamed to .txt, because GitHub doesn't allow the .hlo format):
save1.hlo.0.txt
