PS and KM basic support#17
Conversation
dgant
commented
Mar 13, 2020
- Modified download script to acquire PS/KM/HI/FA
- Added prepare.sh for PS/KM parallel and multilingual training
… PS/KM parallel and multilingual training.
| cat "${DEVTEST_HI}/dev.hi" > $DATA/valid.hi-en.hi | ||
| cat "${DEVTEST_HI}/dev.en" > $DATA/valid.hi-en.en | ||
| cat "${DEVTEST_HI}/test.hi" > $DATA/test.hi-en.hi | ||
| cat "${DEVTEST_HI}/test.en" > $DATA/test.hi-en.en |
There was a problem hiding this comment.
DEVTEST_PSKM and DEVTEST_HI are placeholders for where these datasets will eventually live.
|
Hi @dgant! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks! |
guzmanhe
left a comment
There was a problem hiding this comment.
Thanks for this PR.
I've added a few comments. Most importantly, please check the contributor license process.
| download_opus_data $KM_ROOT $KM_TGT | ||
| download_opus_data $PS_ROOT $PS_TGT | ||
| download_opus_data $FA_ROOT $FA_TGT | ||
| #download_opus_data $HI_ROOT $HI_TGT |
| elif [ "$TGT" = "ps" ]; then | ||
| URLS=("${PS_OPUS_URLS[@]}") | ||
| DATASETS=("${PS_OPUS_DATASETS[@]}") | ||
| elif [ "$TGT" = "fa" ]; then |
There was a problem hiding this comment.
We don't have Farsi in the original flores. Is there a reason why are you including it in the target?
There was a problem hiding this comment.
I believe the intent was to improve the Pashto performance of a multilingual model by training for Farsi as well, being a related language. The Pashto-English parallel corpus was very limited so being able to make use of the Farsi-English corpus would hopefully be a boon. I don't recall whether we tested this or if we did whether it had an impact.
| @@ -0,0 +1,90 @@ | |||
| import argparse | |||
There was a problem hiding this comment.
Can you add more information on what is the intended use of this script?
Deduplication is certainly useful, but newcomers might not be so familiar
|
Hi @guzmanhe. I opened this PR while I was working at FAIR with Peng-Jen and Marc'Aurelio. I left FB two years ago so my contributor license agreement status has lapsed :). I won't be making any further changes to this pull request, so if there's not interest on your end to see this through you can close it. |