Key terms updates necessary for use in SILNLP #257

Enkidu93 · 2026-01-02T21:51:11Z

Added support for capturing renderings patterns, references, and term domains. Moved to using a KeyTerm data structure rather than tuples.

(This also includes porting of recent changes in Machine sillsdev/machine#362 and sillsdev/machine#368)

Enkidu93 · 2026-01-02T21:52:34Z

Also, update machine.py library version

codecov-commenter · 2026-01-07T17:11:24Z

Codecov Report

❌ Patch coverage is 82.19697% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.68%. Comparing base (acff116) to head (e33823f).

Files with missing lines	Patch %	Lines
.../corpora/test_usfm_versification_error_detector.py	4.76%	20 Missing ⚠️
...chine/corpora/usfm_versification_error_detector.py	30.00%	14 Missing ⚠️
machine/jobs/translation_file_service.py	33.33%	4 Missing ⚠️
...tion/huggingface/hugging_face_nmt_model_trainer.py	93.47%	3 Missing ⚠️
...hine/corpora/paratext_project_terms_parser_base.py	96.07%	2 Missing ⚠️
...hine/corpora/file_paratext_project_terms_parser.py	85.71%	1 Missing ⚠️
machine/corpora/n_parallel_text_row.py	85.71%	1 Missing ⚠️
...a/paratext_project_versification_error_detector.py	0.00%	1 Missing ⚠️
machine/corpora/text_file_alignment_corpus.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #257      +/-   ##
==========================================
- Coverage   90.74%   90.68%   -0.07%     
==========================================
  Files         352      355       +3     
  Lines       22337    22508     +171     
==========================================
+ Hits        20270    20411     +141     
- Misses       2067     2097      +30

ddaspit

@ddaspit partially reviewed 11 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93).

machine/corpora/key_term_row.py line 0 at r1 (raw file):
This file should be named key_term.py.

Enkidu93

@Enkidu93 made 3 comments.
Reviewable status: 8 of 21 files reviewed, 1 unresolved discussion (waiting on @ddaspit).

machine/corpora/key_term_row.py line at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This file should be named key_term.py.

Done.

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 362 at r3 (raw file):

            ).tokens()
            src_term_partial_word_tokens.remove("▁")
            src_term_partial_word_tokens.remove("\ufffc")

This is mirroring code in silnlp more-or-less exactly. I made an issue for creating a shared utility function that could do some of this. I also experimented with finding a safe way to be able to do this with non-fast tokenizers. It's something we should look into as needed but I decided that it was taking too much time.
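For readers skimming the snippet above: `list.remove` raises `ValueError` when the item is absent and only drops the first occurrence, so a shared utility might prefer a tolerant filter. The following is a minimal sketch under that assumption only. `strip_marker_tokens`, `MARKER_TOKENS`, and the sample token list are hypothetical and are not taken from machine.py or silnlp.

```python
from typing import Iterable, List

# Marker tokens seen in the snippet above: the SentencePiece word-boundary
# symbol ("\u2581") and the object-replacement character ("\ufffc").
MARKER_TOKENS = frozenset({"\u2581", "\ufffc"})


def strip_marker_tokens(tokens: Iterable[str]) -> List[str]:
    """Return the tokens with every marker token dropped (hypothetical helper).

    Unlike list.remove, this does not raise when a marker is missing, and it
    drops all occurrences rather than only the first one.
    """
    return [token for token in tokens if token not in MARKER_TOKENS]


# Made-up token sequence for illustration; this is not real tokenizer output.
sample_tokens = ["\u2581", "\ufffc", "tele", "phone"]
partial_word_tokens = strip_marker_tokens(sample_tokens)
```

Whether dropping every occurrence is the right behavior depends on the tokenizer, so this would need checking against the silnlp output before being adopted.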

tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 130 at r3 (raw file):

        corpus = source_corpus.align_rows(target_corpus)

        terms_corpus = DictionaryTextCorpus(MemoryText("terms", [TextRow("terms", 1, ["telephone"])])).align_rows(

I don't love that this test doesn't really cover whether the terms are affecting the result. I just stuck this in here for code coverage (no exceptions thrown, etc.), but I couldn't adapt our one true fine-tuning test because it uses a non-fast tokenizer. I looked for alternatives but couldn't find anything that works. I did confirm in the debugger that everything was being tokenized properly. Maybe we should consider outputting some kind of artifact in ClearML (?) with the tokenized data so we have something to compare apples-to-apples to the tokenized experiment txt files in silnlp.

Enkidu93 · 2026-01-13T16:44:31Z

Fixes #256 #240

ddaspit

@ddaspit partially reviewed 9 files and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @Enkidu93).

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

        training_args: Seq2SeqTrainingArguments,
        corpus: Union[ParallelTextCorpus, Dataset],
        terms_corpus: Optional[Union[ParallelTextCorpus, Dataset]] = None,

I think it would be better to add some type of metadata to a row to indicate what kind of data it is, rather than pass around two separate corpora. It feels more in spirit with the design goals of the corpus API in Machine. We could have a generic metadata dictionary that allows you to set whatever you want. Or, we could add a new "data type" property with a predefined set of values, such as gloss, sentence, section, document, etc. What do you think?

Enkidu93

@Enkidu93 made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @ddaspit).

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think it would be better to add some type of metadata to a row to indicate what kind of data it is, rather than pass around two separate corpora. It feels more in spirit with the design goals of the corpus API in Machine. We could have a generic metadata dictionary that allows you to set whatever you want. Or, we could add a new "data type" property with a predefined set of values, such as gloss, sentence, section, document, etc. What do you think?

Yes, I did think of this. If we wanted to add a field to each row, I think an enum like you mentioned might be the better option. We could add it to each text and then expose it through the row as well (?). There were two reasons I didn't already tackle this:

1. It will add a little complexity to the to_hf_dataset() function as well as to the preprocess_function when we call map() on the dataset.
2. I wondered how much it was worth doing this when we only have two categories. Do we foresee additional text types? Or maybe we don't foresee other text types but do we see ourselves needing some kind of tagging for other attributes? This would affect how it should be implemented.

A competing (or perhaps supplemental) option would be for this function to take something like corpora: List[Tuple[CorpusOptions, Union[ParallelCorpus, Dataset]]] or corpora: List[CorpusConfig]. Then the processing options could be more dynamic - e.g., 'include partial words for corpus 1, tag corpus 2 with such-and-such tag, etc.' - regardless of a text type field and it could be specified by the calling code.
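The per-corpus configuration idea could look roughly like the sketch below. It is an illustration only: the fields of `CorpusOptions` and `CorpusConfig`, the `iter_training_pairs` helper, and the behavior of prefixing the source with a tag are all assumptions made for this example, and a plain list of (source, target) pairs stands in for a ParallelTextCorpus or Dataset.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class CorpusOptions:
    # Hypothetical switches named after the examples in the comment above.
    include_partial_words: bool = False
    tag: Optional[str] = None


@dataclass(frozen=True)
class CorpusConfig:
    # Stand-in for a parallel corpus: a sequence of (source, target) pairs.
    corpus: Sequence[Tuple[str, str]]
    options: CorpusOptions = field(default_factory=CorpusOptions)


def iter_training_pairs(corpora: List[CorpusConfig]) -> Iterator[Tuple[str, str, bool]]:
    """Yield (source, target, include_partial_words) using each corpus's options."""
    for config in corpora:
        for src, trg in config.corpus:
            if config.options.tag is not None:
                # Assumed behavior: prefix the source text with the tag.
                src = f"{config.options.tag} {src}"
            yield src, trg, config.options.include_partial_words


corpora = [
    CorpusConfig([("hello", "hola")]),
    CorpusConfig(
        [("telephone", "telefono")],
        CorpusOptions(include_partial_words=True, tag="<term>"),
    ),
]
pairs = list(iter_training_pairs(corpora))
```

The calling code decides how each corpus is processed, which is the point of this option. The trade-off is that the trainer still receives several corpora instead of one.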

ddaspit

@ddaspit made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @Enkidu93).

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Yes, I did think of this. If we wanted to add a field to each row, I think an enum like you mentioned might be the better option. We could add it to each text and then expose it through the row as well (?). There were two reasons I didn't already tackle this:

1. It will add a little complexity to the to_hf_dataset() function as well as to the preprocess_function when we call map() on the dataset.
2. I wondered how much it was worth doing this when we only have two categories. Do we foresee additional text types? Or maybe we don't foresee other text types but do we see ourselves needing some kind of tagging for other attributes? This would affect how it should be implemented.

A competing (or perhaps supplemental) option would be for this function to take something like corpus: List[Tuple[CorpusOptions, Union[ParallelCorpus, Dataset]]] or corpus: List[CorpusConfig]. Then the processing options could be more dynamic - e.g., 'include partial words for corpus 1, tag corpus 2 with such-and-such tag, etc.' - regardless of a text type field and it could be specified by the calling code.

I do think it would be generally useful to have a way of tagging each row with a "data type" field. I'm okay with added complexity in a couple of places. I also like the second option that you proposed as a shorter-term solution. Would the second option be a lot easier and quicker to implement than the tagging?

Enkidu93

@Enkidu93 made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @ddaspit).

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I do think it would be generally useful to have a way of tagging each row with a "data type" field. I'm okay with added complexity in a couple of places. I also like the second option that you proposed as a shorter-term solution. Would the second option be a lot easier and quicker to implement than the tagging?

OK! I think the second solution would require fewer edits, since adding the data type attribute is going to affect quite a few classes, all of which (I think) will need to be ported. But if it's what you think we should do ultimately, we might as well just go ahead with it 👍.

ddaspit

@ddaspit made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @Enkidu93).

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK! I think the second solution would require fewer edits, since adding the data type attribute is going to affect quite a few classes, all of which (I think) will need to be ported. But if it's what you think we should do ultimately, we might as well just go ahead with it 👍.

Ok, let's go ahead and implement the "data type" field, then we don't have to pass around multiple corpora.

Enkidu93

@Enkidu93 made 1 comment.
Reviewable status: 9 of 41 files reviewed, 1 unresolved discussion (waiting on @ddaspit).

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Ok, let's go ahead and implement the "data type" field, then we don't have to pass around multiple corpora.

Alright - I've gone ahead and added the data type field - hopefully satisfactorily - and updated the preprocessing code accordingly. Sorry this took a little while!

(Super minor, but I don't like how we have competing names for source and target throughout the codebases: source/src and target/trg/tgt. Do you have a preference regarding variable names that include source/target? If so, I can slowly normalize the naming as I edit code.)
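The idea of tagging each row, instead of passing a separate terms corpus, can be sketched as follows. This is a simplified illustration, not the code in this PR. `Row` and `split_by_content_type` are hypothetical, the enum name and its `SEGMENT`, `SENTENCE`, `PASSAGE`, and `DOCUMENT` members follow names that appear elsewhere in this review, and the `WORD` member is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple


class TextRowContentType(Enum):
    # WORD is assumed here; the actual enum in machine.py may differ.
    WORD = auto()
    SEGMENT = auto()
    SENTENCE = auto()
    PASSAGE = auto()
    DOCUMENT = auto()


@dataclass
class Row:
    # Stand-in for a parallel text row; not the machine.py class.
    source: str
    target: str
    content_type: TextRowContentType = TextRowContentType.SEGMENT


def split_by_content_type(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate word-level rows (e.g. key terms) from all other rows."""
    terms = [row for row in rows if row.content_type is TextRowContentType.WORD]
    others = [row for row in rows if row.content_type is not TextRowContentType.WORD]
    return terms, others


rows = [
    Row("In the beginning", "En el principio"),
    Row("telephone", "telefono", TextRowContentType.WORD),
]
term_rows, other_rows = split_by_content_type(rows)
```

With the tag on the row, preprocessing can branch per row inside a single corpus, so the trainer no longer needs a second corpus argument.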

Enkidu93 · 2026-01-21T22:09:07Z

machine/corpora/data_type.py line 9 at r5 (raw file):

    SENTENCE = auto()
    PASSAGE = auto()
    DOCUMENT = auto()

Word and segment

ddaspit

@ddaspit reviewed 35 files and all commit messages, made 8 comments, and resolved 2 discussions.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93).

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Alright - I've gone ahead and added the data type field - hopefully satisfactorily - and updated the preprocessing code accordingly. Sorry this took a little while!

(Super minor, but I don't like how we have competing names for source and target throughout the codebases: source/src and target/trg/tgt. Do you have a preference regarding variable names that include source/target? If so, I can slowly normalize the naming as I edit code.)

I think I've typically used src and trg in the past.

machine/corpora/parallel_text_corpus.py line 405 at r6 (raw file):

        translation_column: str = "translation",
        alignment_column: Optional[str] = "alignment",
        data_type_column: Optional[str] = "data_type",

This should be content_type.

machine/corpora/n_parallel_text_row.py line 29 at r6 (raw file):

    @property
    def data_type(self) -> TextRowContentType:

This should be content_type.

machine/corpora/text_row.py line 39 at r6 (raw file):

    @property
    def data_type(self) -> TextRowContentType:

This should be renamed to content_type.

machine/corpora/memory_text.py line 11 at r6 (raw file):

class MemoryText(Text):
    def __init__(
        self, id: str, rows: Iterable[TextRow] = [], data_type: TextRowContentType = TextRowContentType.SEGMENT

I don't think we need to configure the type for this class.

machine/corpora/text.py line 19 at r6 (raw file):

    @property
    @abstractmethod
    def data_type(self) -> TextRowContentType: ...

I don't think we need this property. We only need it on the text row classes.

machine/corpora/text_file_text_corpus.py line 33 at r6 (raw file):

                data_types[pattern_index] if pattern_index < len(data_types) else TextRowContentType.SEGMENT,
            )
            for id, filename, pattern_index in get_files(file_patterns)

Can we use enumerate instead of passing the index from get_files?

machine/corpora/parallel_text_row.py line 58 at r6 (raw file):

    @property
    def data_type(self) -> TextRowContentType:

This should be named content_type.

Enkidu93

@Enkidu93 made 8 comments.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit).

machine/corpora/memory_text.py line 11 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think we need to configure the type for this class.

OK, I wasn't sure since the other texts have an overall type associated with them, but yeah, it wouldn't really be used for anything.

machine/corpora/n_parallel_text_row.py line 29 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be content_type.

Done.

machine/corpora/parallel_text_corpus.py line 405 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be content_type.

Done.

machine/corpora/parallel_text_row.py line 58 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be named content_type.

Done.

machine/corpora/text.py line 19 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think we need this property. We only need it on the text row classes.

OK, done.

machine/corpora/text_file_text_corpus.py line 33 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Can we use enumerate instead of passing the index from get_files?

Each pattern can yield multiple files, so I don't believe that we can just use enumerate.
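To illustrate why the position from enumerate differs from the pattern index, here is a small sketch. The `get_files` below is a simplified stand-in written for this example. It matches against an in-memory list of names rather than the file system, and it does not reproduce the signature of the real helper.

```python
import fnmatch
from typing import Iterator, Sequence, Tuple


def get_files(file_patterns: Sequence[str], available: Sequence[str]) -> Iterator[Tuple[str, int]]:
    """Yield (filename, pattern_index) for every name that matches a pattern."""
    for pattern_index, pattern in enumerate(file_patterns):
        for filename in sorted(available):
            if fnmatch.fnmatchcase(filename, pattern):
                yield filename, pattern_index


patterns = ["terms_*.txt", "verses_*.txt"]
names = ["terms_a.txt", "terms_b.txt", "verses_a.txt"]

matches = list(get_files(patterns, names))
# Index of the pattern that produced each file.
pattern_indices = [pattern_index for _, pattern_index in matches]
# Position of each file if the caller enumerated the results instead.
file_positions = [position for position, _ in enumerate(matches)]
```

Here the first pattern matches two files, so the enumerate position of the third file is 2 while its pattern index is 1. Looking up a per-pattern content type by enumerate position would therefore pick the wrong entry.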

machine/corpora/text_row.py line 39 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be renamed to content_type.

Done.

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think I've typically used src and trg in the past.

OK, it looks like tgt is used just in the huggingface classes 🤔, but there is a mix of source_... and src_.. (same for target) elsewhere in machine.py. I guess I'll leave it for now. Maybe huggingface uses tgt somewhere?

ddaspit

@ddaspit reviewed 17 files and all commit messages, made 3 comments, and resolved 7 discussions.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93).

machine/corpora/text_file_text_corpus.py line 33 at r6 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Each pattern can yield multiple files, so I don't believe that we can just use enumerate.

Oh, I see.

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, it looks like tgt is used just in the huggingface classes 🤔, but there is a mix of source_... and src_.. (same for target) elsewhere in machine.py. I guess I'll leave it for now. Maybe huggingface uses tgt somewhere?

I'm guessing the Huggingface classes use tgt, because that is the convention that is used in Huggingface libraries.

machine/corpora/text_base.py line 12 at r7 (raw file):

        self._id = id
        self._sort_key = sort_key
        self._content_type = content_type

I think _default_content_type would be better. This is only used to set the content type when _create_row is called. It is possible for a text to create rows with different content types.
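A rough sketch of the distinction being drawn: the text holds only a default, and each row can still carry its own content type. `SketchText`, the tuple used as a row, and the optional `content_type` argument on `_create_row` are assumptions for illustration and do not reproduce the machine.py classes.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class TextRowContentType(Enum):
    # Reduced to two members for this sketch.
    SEGMENT = auto()
    SENTENCE = auto()


class SketchText:
    """Hypothetical text whose rows fall back to a default content type."""

    def __init__(
        self,
        id: str,
        default_content_type: TextRowContentType = TextRowContentType.SEGMENT,
    ) -> None:
        self._id = id
        self._default_content_type = default_content_type

    def _create_row(
        self, segment: str, content_type: Optional[TextRowContentType] = None
    ) -> Tuple[str, TextRowContentType]:
        # Use the text-level default only when the caller gives no explicit type.
        if content_type is None:
            content_type = self._default_content_type
        return segment, content_type


text = SketchText("MAT")
default_row = text._create_row("In the beginning")
explicit_row = text._create_row("A full sentence.", TextRowContentType.SENTENCE)
```

Naming the attribute as a default makes it clear that it does not describe every row the text produces.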

Enkidu93

@Enkidu93 made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit).

machine/corpora/text_base.py line 12 at r7 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think _default_content_type would be better. This is only used to set the content type when _create_row is called. It is possible for a text to create rows with different content types.

Done.

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm guessing the Huggingface classes use tgt, because that is the convention that is used in Huggingface libraries.

That's what I was thinking, so I'll just leave it as-is.

ddaspit

@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93).

machine/corpora/text_base.py line 9 at r11 (raw file):

class TextBase(Text):
    def __init__(self, id: str, sort_key: str, content_type: TextRowContentType = TextRowContentType.SEGMENT) -> None:

You forgot to rename the parameter.

Enkidu93

@Enkidu93 made 1 comment.
Reviewable status: 41 of 43 files reviewed, 1 unresolved discussion (waiting on @ddaspit).

machine/corpora/text_base.py line 9 at r11 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You forgot to rename the parameter.

Oh, that's true - done.

ddaspit

@ddaspit reviewed 2 files and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

Enkidu93 requested a review from ddaspit January 2, 2026 21:51

ddaspit requested changes Jan 9, 2026

Enkidu93 requested a review from ddaspit January 13, 2026 16:27

Enkidu93 commented Jan 13, 2026

ddaspit requested changes Jan 15, 2026

Enkidu93 commented Jan 19, 2026

ddaspit reviewed Jan 20, 2026

Enkidu93 commented Jan 20, 2026

ddaspit reviewed Jan 21, 2026

Enkidu93 commented Jan 21, 2026

Enkidu93 added 9 commits January 21, 2026 14:41

Update key terms handling to support silnlp features; port sillsdev/machine#362

1bde009

Port sillsdev/machine#362; fix gloss inclusion

891cc1a

Remove unused imports

9b1a32f

Pass settings not just versification to detector

5b5051c

Remove non-pt localizations from pt xml (to match Machine C#)

3c3b3f5

Address reviewer comment; add key terms as partial words

ba0b914

Split long line

9e6b1a8

Port 'Expose chapter numbers' sillsdev/machine#369

7d44497

Add data type property to rows

74f47b1

Enkidu93 force-pushed the key_terms_updates branch from 98d4c0a to 74f47b1 Compare January 21, 2026 19:41

Fix pyright errors

0898f1d

Update row data type naming

f25702e

ddaspit requested changes Jan 22, 2026

Change data_type property to content_type; address other reviewer comments

8087e0d

Enkidu93 commented Jan 22, 2026

Remove unused import

8e66991

Enkidu93 added 2 commits January 22, 2026 14:22

Remove another unused import

0a86fce

Remove content_type parameter from MemoryText constructor

bd5c719

ddaspit reviewed Jan 22, 2026

Rename text_base content_type to default_content_type

7de8cba

Enkidu93 commented Jan 22, 2026

ddaspit reviewed Jan 22, 2026

Rename parameter

e33823f

Enkidu93 commented Jan 22, 2026

ddaspit approved these changes Jan 22, 2026

Enkidu93 merged commit f1dc4f1 into main Jan 22, 2026
14 checks passed

Enkidu93 deleted the key_terms_updates branch January 22, 2026 20:26
