Annotation transfer#349
Conversation
63104d1 to
fe16f49
Compare
0d0b946 to
934aa06
Compare
misialq
left a comment
There was a problem hiding this comment.
Hey @ZuzanaSebb, here's my first round of comments - once you update these I'll take your code for a spin :D
Also, just a reminder, in all the plugin repos in our organization we are trying to follow the PR naming scheme same as commit naming (use a prefix and a semi-verbose description, typically using a verb in imperative form).
| return result | ||
|
|
||
|
|
||
| def _get_mag_ids_from_feature_data(mags: MAGSequencesDirFmt) -> set: |
There was a problem hiding this comment.
Do we really need this method? It literally does a single thing.
| def _copy_annotation_files( | ||
| source_annotations: OrthologAnnotationDirFmt, | ||
| mag_ids: set, | ||
| result: OrthologAnnotationDirFmt, |
There was a problem hiding this comment.
You should just make this method return results as nothing really happens to them before they are being passed to this method. This also makes it a bit more explicit what this method actually does/returns.
| matched_ids = mag_ids & set(annotation_dict.keys()) | ||
| if not matched_ids: | ||
| raise ValueError("No annotation files matched the destination MAG IDs.") |
There was a problem hiding this comment.
I think this check should happen already before we call this method - it will remove the need to pass the mag_ids and provide a simpler interface. Actually, you could even move it to a separate method as you have some additional check below - that way the validation can be taken care of one testable method and the copying by another one, making both of them responsible for different parts of the pipeline.
|
|
||
| def transfer_eggnog_annotations( | ||
| ortholog_annotations: OrthologAnnotationDirFmt, | ||
| destination: Union[MAGSequencesDirFmt, MAGtoContigsDirFmt], |
There was a problem hiding this comment.
This is confusing. The destination should be represented by the same kind of semantic type, either SampleData[MAGs] or FeatureData[MAG]. Now, you are mixing in the contig map. I think this should become an additional input required when SampleData[MAGs] were provided as source of the annotations or if FeatureData[MAG] was provided as the destination (whichever of those makes more sense for your pipeline).
| df = pd.read_csv(fp, sep="\t", skiprows=4) | ||
| # drop trailing comment only if present | ||
| first_col = df.columns[0] | ||
| df = df[~df[first_col].astype(str).str.startswith("##")] |
There was a problem hiding this comment.
I'm wondering whether you could achieve the same by simply reading in every file as OrthologFileFmt and viewing as a df?
| df = df[~df[first_col].astype(str).str.startswith("##")] | ||
| frames.append(df) | ||
|
|
||
| all_annotations = pd.concat(frames, ignore_index=True) |
There was a problem hiding this comment.
We need to evaluate carefully whether this will work well when one has hundreds of samples with thousands of annotations. I'm a bit worried that the memory will blow up 😅
Description
Add transfer_eggnog_annotations action.
What it does
FeatureData[MAG]→ copy files for matching MAG IDs (after dereplication).FeatureMap[MAGtoContigs]→ aggregate into per-MAG files by mapping each gene to its contig.AI Disclosure
AI Usage Details
Claude (Sonnet 4.6) was used to draft the initial implementation; I refined the methods and the final action.
Claude (Opus 4.8) was used to draft the initial implementation of the unit tests.