Releases: bltlab/mot

v1.11

08 May 22:28
9204cad

Scrape through March 16, 2025, when VOA paused publication.

V1.10

28 Oct 18:13
9204cad

Scrape up to Oct 1, 2024

V1.9

09 Apr 03:11
bfc960d

  • Scrape up to April 1, 2024

  • Better filtering out of <!-- IMAGE --> and variants
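
As a rough illustration of what filtering `<!-- IMAGE -->` and its variants could look like, here is a minimal sketch. It is not the project's actual code: the pattern, the function name `strip_image_placeholders`, and the assumption that variants differ only in case and whitespace are all hypothetical.

```python
import re

# Hypothetical pattern: an HTML comment whose body is the word IMAGE,
# tolerating case and whitespace variants such as <!--image--> or <!--  Image  -->.
IMAGE_COMMENT = re.compile(r"<!--\s*IMAGE\s*-->", re.IGNORECASE)


def strip_image_placeholders(paragraphs):
    """Remove image placeholder comments and drop paragraphs left empty."""
    cleaned = []
    for paragraph in paragraphs:
        text = IMAGE_COMMENT.sub("", paragraph).strip()
        if text:  # skip paragraphs that contained only a placeholder
            cleaned.append(text)
    return cleaned
```

Paragraphs that consist solely of a placeholder are dropped, while paragraphs with other text keep that text.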

V1.8

20 Nov 19:21
63ef942

  • Added scraping from April 2023 to November 15, 2023

1.7

04 May 23:14

  • Additional data scraped from October 2022 to end of April 2023

v1.6

11 Oct 22:40
7511f2c

v1.5

02 Sep 20:08
8f697c5

  • Added segmentation for remaining languages
  • Improvements to some of the existing segmentation models
  • Cases of both under-segmentation and over-segmentation were found and addressed

v1.4

08 Jul 20:22
6263108

  • Updated scrape through July 1, 2022
  • Fixed missing yue documents
  • Changed yue to cmn, and voacambodia from khm to eng
  • Improved author extraction from metadata
  • Improved extraction of paragraph splits

v1.3

16 Jun 18:24
d8c1df5

Release 1.3 with updated scrapes through the end of May 2022.

v1.2

12 May 15:32
91162df

  • Added segmentation for all languages except: ben, bod, kat, kur
  • Better publication date coverage
  • Removed zero-width spaces from segmentation and tokenization output for Thai, Lao, and Khmer (zero-width spaces are kept in the original paragraph text)
  • Release as described in camera-ready LREC 2022 paper
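
The zero-width space handling noted above can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: the helper name `strip_zwsp` is hypothetical, and it assumes the character in question is U+200B (ZERO WIDTH SPACE) and that tokens are plain Python strings.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE, assumed to be the character in question


def strip_zwsp(tokens):
    """Return tokens with zero-width spaces removed, dropping tokens left empty."""
    cleaned = (token.replace(ZWSP, "") for token in tokens)
    return [token for token in cleaned if token]


# The original paragraph text is left untouched; only derived output is cleaned.
paragraph = "ສະ\u200bບາຍ"
tokens = strip_zwsp(paragraph.split(ZWSP))
```

Cleaning only the derived tokens, rather than the paragraph itself, mirrors the note that the original text keeps its zero-width spaces.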