Fix datasets
Description
Wants to merge: fix/datasets_links into main
Type of change
- Bug fix
- New feature
- Enhancement
- Documentation update
- Other (specify right below)
Merge request commits
- Fix dataset links, slightly refactor
Fix some issues with the datasets:
- fix wikitext103 dead link
- fix the tinystories link (the download now yields exactly the same dataset as the one obtained via Hugging Face's load_dataset() interface)
- reduce the number of splits (from ['train', 'test', 'valid'] to ['train', 'test']) for all datasets
- add extract_tgz() method to dataset utils
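For reviewers, here is a minimal sketch of what a helper like extract_tgz() might look like. It is not the code in this merge request: the signature, the return value, and the path-safety check are assumptions made for illustration, using only the standard-library tarfile module.

```python
import tarfile
from pathlib import Path


def extract_tgz(archive_path, dest_dir):
    """Extract a .tgz / .tar.gz archive into dest_dir.

    Hypothetical signature: the real helper in the dataset utils may differ.
    Returns the list of extracted paths.
    """
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path, mode="r:gz") as tar:
        members = tar.getmembers()
        # Refuse members that would be written outside dest_dir.
        for member in members:
            target = (dest_dir / member.name).resolve()
            if target != dest_dir and dest_dir not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        # Use the stricter "data" extraction filter where the interpreter supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(path=dest_dir, members=members, **kwargs)

    return [dest_dir / member.name for member in members]
```

A call such as extract_tgz("downloads/dataset.tgz", "data/dataset") (both paths are placeholders) would unpack the archive into the target directory, creating it if it does not exist.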
Related Issues
Screenshots or GIFs
Checklist
- I have tested the code with the changes manually.
- My code follows the project's style guidelines.
- I have documented my code for others to understand.
- I have updated documentation as needed (including README.md, code comments and doc strings).