DGS Corpus download: ValueError: Message tensorflow_copy.Example exceeds maximum protobuf size of 2GB #90

@cleong110

Description


Trying to download the DGS Corpus on a very powerful machine with plenty of RAM, I get a ValueError during `download_and_prepare`. Apparently this can be fixed with the proto splitter?

https://discuss.ai.google.dev/t/fix-the-notorious-graphdef-2gb-limitation/29392

Traceback:

Traceback (most recent call last):                                                                                                                                         
  File "sldata_download.py", line 20, in <module>             
    dataset = tfds.load(name=str(args.dataset_name), data_dir=args.data_dir)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 166, in __call__
    return function(*args, **kwargs)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/load.py", line 639, in load
    _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/load.py", line 498, in _download_and_prepare_builder
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 166, in __call__
    return function(*args, **kwargs)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 691, in download_and_prepare
    self._download_and_prepare(
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1583, in _download_and_prepare
    future = split_builder.submit_split_generation(
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 341, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 418, in _build_from_generator
    writer.write(key, example)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/writer.py", line 238, in write
    serialized_example = self._serializer.serialize_example(example=example)
  File "/opt/conda/envs/sldata/lib/python3.8/site-packages/tensorflow_datasets/core/example_serializer.py", line 98, in serialize_example
    return self.get_tf_example(example).SerializeToString()
ValueError: Message tensorflow_copy.Example exceeds maximum protobuf size of 2GB: 15513563426
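For scale: the serialized example reported in the traceback is about 15.5 GB, roughly seven times protobuf's hard 2 GiB per-message serialization limit, so no amount of RAM helps here. A quick sanity check (both numbers come straight from the error message above):

```python
# Protobuf refuses to serialize any single message larger than 2 GiB.
PROTOBUF_MAX_BYTES = 2**31 - 1  # 2147483647 bytes

# Size of the offending tensorflow Example, copied from the ValueError above.
example_bytes = 15_513_563_426

# The example is far over the limit -- about 7.2x too large.
print(example_bytes > PROTOBUF_MAX_BYTES)                 # True
print(round(example_bytes / PROTOBUF_MAX_BYTES, 1))       # 7.2
```

This suggests the fix has to shrink or split the example itself (e.g. a config that excludes the raw video), not just raise a resource limit.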

Download script and command

Environment installed as noted in #89, with Python 3.8, webvtt-py, and lxml.

# https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/autsl/autsl.py

import tensorflow_datasets as tfds
import sign_language_datasets.datasets
import itertools
from sign_language_datasets.datasets.config import SignDatasetConfig
from pathlib import Path
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="attempt to download a dataset from sign-language-datasets, e.g. 'dgs_corpus/holistic'")
    parser.add_argument("dataset_name", help="something like 'dgs_corpus'")
    # expanduser() so the default "~" resolves to the home directory instead of a literal "~" folder
    parser.add_argument("--data_dir", type=Path, default=Path("~/tfds_sign_language_datasets").expanduser())
    args = parser.parse_args()

    # config = SignDatasetConfig(name="only-annotations", version="1.0.0", include_video=False)
    # config = SignDatasetConfig(name="holistic")
    # autsl = tfds.load(name='autsl', data_dir=data_dir, builder_kwargs={"config": config})
    # autsl = tfds.load(name='autsl/holistic', data_dir=data_dir)
    dataset = tfds.load(name=str(args.dataset_name), data_dir=args.data_dir)

    for datum in itertools.islice(dataset["train"], 0, 10):
        print("datum:")  # was print(f"datum"), an f-string with nothing interpolated
        print(datum)

Command

python sldata_download.py dgs_corpus --data_dir /data/petabyte/cleong/data/tfds_sign_language_datasets/
