
Fix tokenizer typos, add newlines

Merged Alexandru-Mihai GHERGHESCU requested to merge fix/tokenizer into main

Fix some issues in the tokenizer.

Mainly, fix a problem where newlines were missing from the tokenizer model. As a result, whitespace was silently deleted during tokenization and newlines could never be produced. This could cause issues for datasets such as wikitext103, where newlines separate titles from the actual text.
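
To illustrate the failure mode, here is a minimal sketch using a toy character-level tokenizer. It is not the tokenizer from this merge request: `build_vocab`, `encode`, and `decode` are hypothetical helpers, and the assumption is simply that symbols missing from the vocabulary are skipped rather than raising an error.

```python
def build_vocab(chars):
    """Assign an integer id to every distinct character (toy vocabulary)."""
    return {ch: i for i, ch in enumerate(sorted(set(chars)))}


def encode(text, vocab):
    """Map characters to ids, silently skipping any character not in the vocabulary."""
    return [vocab[ch] for ch in text if ch in vocab]


def decode(ids, vocab):
    """Map ids back to characters."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


# A wikitext-style sample: the newlines separate the title from the body.
sample = " = Title = \n\n Body text . \n"

# Vocabulary built without the newline character (the buggy case, by assumption).
vocab_without_newline = build_vocab(sample.replace("\n", ""))
# Vocabulary that includes the newline character (the fixed case).
vocab_with_newline = build_vocab(sample)

lossy = decode(encode(sample, vocab_without_newline), vocab_without_newline)
lossless = decode(encode(sample, vocab_with_newline), vocab_with_newline)

print(repr(lossy))     # title and body run together, no newlines
print(repr(lossless))  # round-trips to the original sample
```

With the newline absent from the vocabulary, the title and the body collapse onto a single line after a round trip, so the boundary between them is no longer recoverable from the tokens alone.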

Re-train the wikitext tokenizers with newlines.
