mC4 sampling & pre-processing #61

@sbmaruf

Hi @TevenLeScao,

I think there are some confusing and broken links in the mC4 data preprocessing section. Could you take a look?

Both of these links are broken:

  1. mc4_preprocessing
  2. mc4_sampled_raw

The original links should be:

  1. mc4_preprocessing
  2. mc4_sampled_raw

In addition, the multinomial data processing code that creates the different language splits is in this pull request: bigscience-workshop/Megatron-DeepSpeed#9

Here are a few things:

  1. Did you use this data in any of your experiments?
  2. If not, then I think you can update the doc: https://github.com/bigscience-workshop/bigscience/tree/master/data/mc4

If you want to keep the code for reference, I'm happy to open a pull request here. If not, I'll close the pull request in the bigscience-workshop/Megatron-DeepSpeed repo.

Let me know what you think.
