Better (readable) tag names #435

@Tiendil

After gh-420 (ff-468) was implemented, we got tag form normalization, which is great for deduplication but hurts readability.

Currently, we use spaCy to choose the best form of each tag part by calculating the cosine similarity between neighboring parts of the tag. This works well in general but fails in some cases.

For example, public-works-department is normalized to public-work-department, which is not very readable (public-works-department is an idiom). This happens because the sum of the cosine similarities between public & work and between work & department wins. That choice is correct for tags like public-work and work-department, but it leads to the wrong result for public-works-department.
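To make the failure mode concrete, here is a minimal sketch of the selection described above. It is not the project's actual code: `choose_form`, `similarity`, and the numbers in `TOY_SIMILARITY` are hypothetical and invented for illustration, standing in for the cosine similarities that spaCy vectors would provide.

```python
from itertools import product

# Hypothetical similarity scores, invented for illustration only.
# In the real pipeline these would come from spaCy word vectors.
TOY_SIMILARITY = {
    ("public", "work"): 0.40,
    ("public", "works"): 0.35,
    ("work", "department"): 0.45,
    ("works", "department"): 0.38,
}


def similarity(left: str, right: str) -> float:
    """Look up the toy similarity of two neighboring parts."""
    return TOY_SIMILARITY.get((left, right), 0.0)


def choose_form(candidates_per_part: list[list[str]]) -> str:
    """Pick the combination of part forms with the highest
    sum of similarities between neighboring parts."""

    def score(parts: tuple[str, ...]) -> float:
        return sum(similarity(a, b) for a, b in zip(parts, parts[1:]))

    best = max(product(*candidates_per_part), key=score)
    return "-".join(best)


# Each pair is scored locally, so the singular form wins even though
# the three-part phrase as a whole reads better with "works".
print(choose_form([["public"], ["work", "works"], ["department"]]))
```

With these toy scores the singular form wins both neighboring pairs, so the three-part tag inherits it. Nothing in the score looks at the phrase as a whole.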

Many more examples can be found in the tests.

We should either find a better way to choose the best form of each part or implement separate functionality for calculating tag names (outside of normalization).

Options (for both approaches)

  • Implement our own frequency-based analysis on top of the raw LLM output.
  • Find an open dataset of tags, for example, Wikipedia page names.
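As a rough illustration of the frequency-based option, the sketch below groups raw tags by their normalized form and uses the most frequent raw spelling as the readable name. All names here are hypothetical: `readable_names` is not an existing function, and `toy_normalize` is a crude stand-in for the real normalizer, not how it actually works.

```python
from collections import Counter, defaultdict
from typing import Callable, Iterable


def toy_normalize(raw_tag: str) -> str:
    """Crude stand-in for the real normalizer (assumption):
    strips a trailing "s" from every part of the tag."""
    parts = raw_tag.split("-")
    return "-".join(p[:-1] if p.endswith("s") else p for p in parts)


def readable_names(
    raw_tags: Iterable[str], normalize: Callable[[str], str]
) -> dict[str, str]:
    """Map each normalized tag to its most frequent raw form."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for raw in raw_tags:
        counts[normalize(raw)][raw] += 1
    return {
        normalized: forms.most_common(1)[0][0]
        for normalized, forms in counts.items()
    }


raw_llm_tags = [
    "public-works-department",
    "public-works-department",
    "public-work-department",
]
print(readable_names(raw_llm_tags, toy_normalize))
```

In this arrangement the normalized form stays the deduplication key, and the readable name is derived separately from observed usage, which matches the "outside of normalization" approach. The same lookup could in principle be fed by counts from an open dataset instead of LLM output.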
