
Switch to PyTorch Dataloader and HF datasets

Closed Vlad-Andrei BĂDOIU (78692) requested to merge vladb/py_dataloader into main

Pull Request Title

Description

Wants to merge: vladb/py_dataloader into main

Type of change

  • Bug fix
  • New feature
  • Enhancement
  • Documentation update
  • Other (specify right below)

Merge request commits

  • Switch to PyTorch Dataloader

Related Issues

Screenshots or GIFs

Checklist

  • I have tested the code with the changes manually.
  • My code follows the project's style guidelines.
  • I have documented my code for others to understand.
  • I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

Additional Notes

@mentions

Edited by Vlad-Andrei BĂDOIU (78692)

Merge request reports

Pipeline #55079 passed

Pipeline passed for 2e627e6b on vladb/py_dataloader

Closed by Alexandru-Mihai GHERGHESCU 10 months ago (May 18, 2024 8:58am UTC)

Activity
- import time
- import random
- from typing import Tuple, Iterator, Iterable
+ from typing import (
+     Optional,
+     Union,
+     Generator,
+     Any,
+     Callable
+ )
+ from typing import Iterator
  import torch
  from torch import nn

- from optimus.datasets import WikiText103Dataset
  from optimus.tokenizers import SentencePieceTokenizer
- from optimus.dataloader import OptimusDataLoader
+ from optimus.dataloader import *
  from optimus.models import OptimusTransformer
  from optimus.trainer import Trainer

+ from datasets import load_dataset
  tok = SentencePieceTokenizer(model_path=tokenizer_path)

  # load dataset splits
- train_ds = WikiText103Dataset(split='train')
- test_ds = WikiText103Dataset(split='test')
+ train_ds = load_dataset('wikitext', 'wikitext-2-v1', split='train', streaming=False)
+ test_ds = load_dataset('wikitext', 'wikitext-2-v1', split='test', streaming=False)
+
+ # toknize splits
    • Comment on lines +67 to +68

      Remove streaming; the default is False, and spelling it out here doesn't help much anyway, since we can't set it to True either way: the DataLoader needs the dataset's __len__(), but with streaming=True the result is an IterableDataset, which has no valid __len__().

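The reviewer's point about streaming=True can be seen directly with plain PyTorch datasets. A minimal sketch of the constraint (assuming only torch is available; the MapStyle and Streamed classes are hypothetical stand-ins, not part of this repo): a streaming HF dataset behaves like an IterableDataset, which has no __len__(), so the DataLoader cannot build its default sampler or shuffle.

```python
from torch.utils.data import DataLoader, Dataset, IterableDataset

class MapStyle(Dataset):
    # map-style dataset: __len__ + __getitem__, which is what
    # load_dataset(..., streaming=False) effectively provides
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

class Streamed(IterableDataset):
    # iterable-style dataset: only __iter__, which is what
    # streaming=True would provide
    def __init__(self, items):
        self.items = items
    def __iter__(self):
        return iter(self.items)

data = list(range(10))

# map-style works with the default sampler, including shuffling
loader = DataLoader(MapStyle(data), batch_size=4, shuffle=True)
n_batches = len(loader)  # ceil(10 / 4) == 3

# iterable-style: DataLoader rejects shuffle=True outright, because
# without __len__() it cannot construct a shuffling sampler
try:
    DataLoader(Streamed(data), batch_size=4, shuffle=True)
    rejected = False
except ValueError:
    rejected = True
```

This is why the merge request keeps the map-style (non-streaming) path: the trainer relies on knowing the number of batches up front.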