Annotation transfer #349

Open
ZuzanaSebb wants to merge 7 commits into bokulich-lab:main from ZuzanaSebb:annotation-transfer

@ZuzanaSebb

Description

Add transfer_eggnog_annotations action.

What it does

  • MAG-level annotations + FeatureData[MAG] → copy files for matching MAG IDs (after dereplication).
  • Contig-level annotations + FeatureMap[MAGtoContigs] → aggregate into per-MAG files by mapping each gene to its contig.
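The contig-level case can be pictured with a small sketch. It is not the code from this PR: it assumes the MAG-to-contig map is a plain `{mag_id: [contig_id, ...]}` dictionary and that gene IDs follow the `<contig_id>_<gene_index>` pattern, and all function names are hypothetical.

```python
from collections import defaultdict


def invert_mag_to_contigs(mag_to_contigs: dict) -> dict:
    """Turn {mag_id: [contig_id, ...]} into {contig_id: mag_id}."""
    contig_to_mag = {}
    for mag_id, contigs in mag_to_contigs.items():
        for contig_id in contigs:
            contig_to_mag[contig_id] = mag_id
    return contig_to_mag


def contig_of(gene_id: str) -> str:
    # Assumption: gene IDs look like "<contig_id>_<gene_index>",
    # so everything before the last underscore is the contig ID.
    return gene_id.rsplit("_", 1)[0]


def group_genes_by_mag(gene_ids, mag_to_contigs: dict) -> dict:
    """Group gene IDs under the MAG that contains their contig."""
    contig_to_mag = invert_mag_to_contigs(mag_to_contigs)
    per_mag = defaultdict(list)
    for gene_id in gene_ids:
        mag_id = contig_to_mag.get(contig_of(gene_id))
        if mag_id is not None:  # genes on unbinned contigs are skipped
            per_mag[mag_id].append(gene_id)
    return dict(per_mag)
```

Genes whose contig is not part of any MAG are dropped in this sketch; whether the real action should warn about them is a separate design question.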

AI Disclosure

  • [ ] NO AI USED.
  • [x] AI USED.

AI Usage Details

Claude (Sonnet 4.6) was used to draft the initial implementation; I refined the methods and the final action.
Claude (Opus 4.8) was used to draft the initial implementation of the unit tests.

@ZuzanaSebb requested a review from misialq on June 8, 2026, 06:50
@ZuzanaSebb linked an issue on Jun 8, 2026 that may be closed by this pull request
@ZuzanaSebb force-pushed the annotation-transfer branch from 0d0b946 to 934aa06 on June 8, 2026, 12:29

@misialq left a comment

Contributor

Hey @ZuzanaSebb, here's my first round of comments - once you update these I'll take your code for a spin :D

Also, just a reminder: in all the plugin repos in our organization we try to name PRs the same way as commits (a prefix followed by a semi-verbose description, typically starting with a verb in the imperative form).

return result


def _get_mag_ids_from_feature_data(mags: MAGSequencesDirFmt) -> set:

Do we really need this method? It literally does a single thing.

def _copy_annotation_files(
    source_annotations: OrthologAnnotationDirFmt,
    mag_ids: set,
    result: OrthologAnnotationDirFmt,

You should just make this method create and return result itself, since nothing really happens to it before it is passed in. That also makes it a bit more explicit what this method actually does and returns.
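As a rough illustration of the "create and return" shape, here is a minimal sketch that uses plain directories instead of `OrthologAnnotationDirFmt`. The function name and the `<mag_id>.annotations` file naming are assumptions made for the example, not taken from the PR.

```python
import shutil
import tempfile
from pathlib import Path


def copy_annotation_files(source_dir: Path, mag_ids: set) -> Path:
    """Copy the annotation files for the given MAG IDs into a new directory.

    The output directory is created here and returned, so the caller does not
    have to build it first and pass it in.
    """
    result = Path(tempfile.mkdtemp())
    for mag_id in sorted(mag_ids):
        # Assumption: one file per MAG, named "<mag_id>.annotations".
        src = source_dir / f"{mag_id}.annotations"
        shutil.copy(src, result / src.name)
    return result
```

With this shape the call site reads `result = copy_annotation_files(source, ids)`, and a test can inspect the returned directory directly.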

Comment on lines +203 to +205
matched_ids = mag_ids & set(annotation_dict.keys())
if not matched_ids:
    raise ValueError("No annotation files matched the destination MAG IDs.")

I think this check should happen before we call this method - that removes the need to pass mag_ids and gives a simpler interface. Actually, you could even move it to a separate method, since you have an additional check below - that way the validation is handled by one testable method and the copying by another, each responsible for a different part of the pipeline.
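One possible shape for that split, sketched with plain sets and dictionaries. Only the error message comes from the diff; the function names and the `annotation_dict` layout (`{mag_id: file_path}`) are assumptions.

```python
def validate_mag_ids(mag_ids: set, annotation_ids: set) -> set:
    """Return the IDs present in both sets, or raise if nothing overlaps."""
    matched = mag_ids & annotation_ids
    if not matched:
        raise ValueError("No annotation files matched the destination MAG IDs.")
    return matched


def select_annotations(annotation_dict: dict, matched_ids: set) -> dict:
    """Pick the already validated entries; no checks happen here."""
    return {mag_id: annotation_dict[mag_id] for mag_id in sorted(matched_ids)}
```

The validation can then be unit-tested without touching the file system, and the copying step only ever sees IDs that are known to exist.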


def transfer_eggnog_annotations(
    ortholog_annotations: OrthologAnnotationDirFmt,
    destination: Union[MAGSequencesDirFmt, MAGtoContigsDirFmt],

This is confusing. The destination should always be represented by the same kind of semantic type, either SampleData[MAGs] or FeatureData[MAG]. Right now you are mixing in the contig map. I think the map should become an additional input, required when SampleData[MAGs] is provided as the source of the annotations or when FeatureData[MAG] is provided as the destination (whichever of those makes more sense for your pipeline).
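To make the idea of a separate, conditionally required input concrete, here is a hedged sketch of the argument check only. The function name, the boolean flag and the returned mode strings are all hypothetical, and rejecting an unused map is a design choice of the sketch rather than something requested above.

```python
from typing import Optional


def choose_transfer_mode(
    contig_level_annotations: bool,
    contig_map: Optional[dict] = None,
) -> str:
    """Decide how to transfer annotations, given an optional MAG-to-contig map."""
    if contig_level_annotations and contig_map is None:
        raise ValueError(
            "Contig-level annotations require a MAG-to-contig map."
        )
    if not contig_level_annotations and contig_map is not None:
        raise ValueError(
            "A MAG-to-contig map is only used with contig-level annotations."
        )
    # "aggregate": build per-MAG files through the contig map.
    # "copy": copy the files of matching MAG IDs as they are.
    return "aggregate" if contig_level_annotations else "copy"
```

The destination keeps a single type this way, and the map becomes an optional argument that is validated up front.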

Comment on lines +240 to +243
df = pd.read_csv(fp, sep="\t", skiprows=4)
# drop trailing comment only if present
first_col = df.columns[0]
df = df[~df[first_col].astype(str).str.startswith("##")]

I'm wondering whether you could achieve the same by simply reading in every file as OrthologFileFmt and viewing as a df?

df = df[~df[first_col].astype(str).str.startswith("##")]
frames.append(df)

all_annotations = pd.concat(frames, ignore_index=True)

We need to evaluate carefully whether this will work well when one has hundreds of samples with thousands of annotations. I'm a bit worried that the memory will blow up 😅
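If the concatenation does turn out to be a problem, one alternative is to stream each file and append rows to the per-MAG outputs as they are read, so that only one line is held in memory at a time. The sketch below uses plain file I/O rather than pandas and makes several assumptions: the first tab-separated column is the gene ID in `<contig_id>_<gene_index>` form, lines starting with `##` are comments, the first remaining line of each file is the header, and the output directory starts empty. The function name and the `.tsv` output naming are hypothetical.

```python
from pathlib import Path


def stream_annotations_to_mags(annotation_files, contig_to_mag: dict, out_dir) -> list:
    """Append each annotation row to its MAG's file, one line at a time."""
    out_dir = Path(out_dir)
    seen = set()  # MAGs whose output file already has a header
    for fp in annotation_files:
        header = None
        with open(fp) as src:
            for line in src:
                if not line.strip() or line.startswith("##"):
                    continue  # skip blank lines and "##" comment lines
                line = line.rstrip("\n") + "\n"  # last line may lack a newline
                if header is None:
                    header = line  # first non-comment line is the header
                    continue
                gene_id = line.split("\t", 1)[0]
                mag_id = contig_to_mag.get(gene_id.rsplit("_", 1)[0])
                if mag_id is None:
                    continue  # gene sits on a contig outside every MAG
                # Reopening per row keeps the sketch short; a real
                # implementation would likely cache open file handles.
                with open(out_dir / f"{mag_id}.tsv", "a") as dst:
                    if mag_id not in seen:
                        dst.write(header)
                        seen.add(mag_id)
                    dst.write(line)
    return sorted(seen)
```

Memory use then no longer depends on the number of samples, at the cost of more file operations.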

Development

Successfully merging this pull request may close these issues.

ENH: add an action to transfer functional annotations

4 participants