Fix datasets

Alexandru-Mihai GHERGHESCU requested to merge fix/datasets_links into main

Description

Wants to merge: fix/datasets_links into main

Type of change

  • Bug fix
  • New feature
  • Enhancement
  • Documentation update
  • Other (specify right below)

Merge request commits

  • Fix dataset links, slightly refactor

Fix some issues with the datasets:

  • fix wikitext103 dead link
  • fix the tinystories link (the download now yields exactly the same dataset as the one obtained via Hugging Face's load_dataset() interface)
  • reduce the number of splits (from ['train', 'test', 'valid'] to ['train', 'test']) for all datasets
  • add extract_tgz() method to dataset utils
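
For reviewers who want a picture of what a helper like extract_tgz() might look like, here is a minimal sketch. The signature (archive_path, dest_dir), the return value, and the path-safety check are assumptions made for illustration; the actual implementation in this merge request may differ.

```python
import tarfile
from pathlib import Path


def extract_tgz(archive_path, dest_dir):
    """Hypothetical helper: unpack a .tgz/.tar.gz archive into dest_dir.

    Returns the list of paths the archive members were extracted to.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path, mode="r:gz") as tar:
        members = tar.getmembers()

        # Refuse archives whose entries would land outside dest_dir.
        for member in members:
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")

        # Newer Python versions support extraction filters; use the
        # conservative "data" filter when it is available.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, members=members, **kwargs)

    return [dest / member.name for member in members]
```

A call such as extract_tgz("archive.tgz", "data/") (both paths are placeholders) would unpack the archive under data/ and return the extracted paths. Checking member paths before extraction is a design choice of this sketch, since dataset archives come from external links.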

Related Issues

Screenshots or GIFs

Checklist

  • I have tested the code with the changes manually.
  • My code follows the project's style guidelines.
  • I have documented my code for others to understand.
  • I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

Additional Notes

@mentions
