Fix tokenizer typos, add newlines (!8) · Merge requests · NetSys / Optimus Prime · GitLab


Merged Alexandru-Mihai GHERGHESCU requested to merge fix/tokenizer into main 1 year ago

Fix some issues in the tokenizer.

Mainly, fix a problem where newlines weren't present in the tokenizer model. As a result, any whitespace was silently deleted and newlines were lost from the tokenized text. This could cause problems for datasets such as wikitext103, where newlines separate titles from the actual text.

Re-train the wikitext tokenizers with newlines.
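
To illustrate the failure mode described above, here is a minimal sketch of a round-trip check. It does not use the project's tokenizer: the `encode` and `decode` helpers, the character-level vocabulary, and the sample string are all hypothetical stand-ins, and the assumption is only that pieces missing from the vocabulary are dropped silently.

```python
def encode(text, vocab):
    """Greedy longest-match encoding over a toy vocabulary (hypothetical helper).

    Characters that match no vocabulary piece are skipped silently, which is
    the behaviour assumed here for a tokenizer model without a newline piece.
    """
    ids = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No piece matches at this position: drop the character.
            i += 1
    return ids


def decode(ids, vocab):
    """Map token ids back to their pieces and join them (hypothetical helper)."""
    inverse = {idx: piece for piece, idx in vocab.items()}
    return "".join(inverse[idx] for idx in ids)


# Made-up sample in the spirit of a title line followed by body text.
sample = " = Title = \n Some text . \n"

# Toy vocabulary built from the sample, deliberately leaving out "\n".
vocab_without = {p: i for i, p in enumerate(sorted(set(sample) - {"\n"}))}

# The same vocabulary with a newline piece added.
vocab_with = dict(vocab_without)
vocab_with["\n"] = len(vocab_with)

lossy = decode(encode(sample, vocab_without), vocab_without)
intact = decode(encode(sample, vocab_with), vocab_with)

print(repr(lossy))
print(repr(intact))
```

With the newline piece missing, the title and the body text run together after decoding; with it present, the text round-trips unchanged. A check of this kind on a few dataset lines can catch the problem before training.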

Activity

Alexandru-Mihai GHERGHESCU requested review from @vlad_andrei.badoiu1 1 year ago

Alexandru-Mihai GHERGHESCU assigned to @agherghescu2411 1 year ago

Vlad-Andrei BĂDOIU (78692) merged 1 year ago

Vlad-Andrei BĂDOIU (78692) mentioned in commit f3c62726 1 year ago
