Add Hebrew Wikipedia Sentences Corpus to General Corpora by tomron87 · Pull Request #138 · NNLP-IL/Hebrew-Resources

tomron87 · 2026-02-14T12:57:42Z

Adds the Hebrew Wikipedia Sentences Corpus to the General Corpora section.

Dataset summary:

- ~11M cleaned, deduplicated Hebrew sentences extracted from 366K Hebrew Wikipedia articles
- Crawled February 2026 via the MediaWiki API
- Rich metadata per sentence: article ID, title, categories, sentence position, word count, Hebrew character ratio
- License: CC BY-SA 3.0 (inherited from Wikipedia)
- Intended for language modeling, text classification, NER, sentence similarity, and embedding benchmarks
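
For reviewers unfamiliar with the per-sentence metadata, here is a minimal sketch of how the word count and Hebrew character ratio fields might be computed. It is illustrative only and not taken from the corpus code: the function names, the dictionary keys, and the definition of the ratio (Hebrew letters divided by all alphabetic characters) are assumptions, and the corpus may define these fields differently.

```python
def hebrew_char_ratio(sentence: str) -> float:
    """Share of alphabetic characters that are Hebrew letters (assumed definition)."""
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        return 0.0
    # U+05D0 (alef) through U+05EA (tav) covers the Hebrew letters, including final forms.
    hebrew = sum(1 for ch in letters if "\u05d0" <= ch <= "\u05ea")
    return hebrew / len(letters)


def word_count(sentence: str) -> int:
    """Whitespace-delimited token count (assumed definition)."""
    return len(sentence.split())


def sentence_metadata(sentence: str, position: int) -> dict:
    """Build a hypothetical metadata record; the key names are illustrative."""
    return {
        "sentence": sentence,
        "sentence_position": position,
        "word_count": word_count(sentence),
        "hebrew_char_ratio": hebrew_char_ratio(sentence),
    }
```

A ratio like this is typically used as a filter threshold, for example to drop sentences that are mostly Latin-script names or formulas.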

Add Hebrew Wikipedia Sentences Corpus to General Corpora

4efec2d
