After gh-420 (ff-468) was implemented, we got tag form normalization, which is great for deduplication, but not so great for readability.
Currently, we use spaCy to choose the best tag form by calculating the cosine similarity between neighboring parts of the tag. This works well, but fails in some cases.
For example, `public-works-department` is normalized to `public-work-department`, which is not very readable (`public-works-department` is an idiom). This happens because the sum of cosine similarities between `public` & `work` and between `work` & `department` wins. That is correct for tags like `public-work` and `work-department`, but leads to a wrong result for `public-works-department`.
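To make the failure mode concrete, here is a minimal sketch of the neighbor-pair scoring described above. The toy 2-D vectors and function names are invented for illustration; the actual implementation relies on spaCy word vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def neighbor_score(parts, vectors):
    # Sum of cosine similarities over neighboring part pairs,
    # mirroring the current normalization heuristic.
    return sum(cosine(vectors[a], vectors[b]) for a, b in zip(parts, parts[1:]))

def best_form(candidates, vectors):
    # The candidate with the highest neighbor-pair score wins.
    return max(candidates, key=lambda parts: neighbor_score(parts, vectors))

# Toy vectors (pure illustration, not real embeddings): "work" is
# deliberately placed closer to its neighbors than "works".
vectors = {
    "public": (1.0, 0.0),
    "work": (1.0, 0.2),
    "works": (0.0, 1.0),
    "department": (1.0, 0.1),
}
candidates = [
    ("public", "work", "department"),
    ("public", "works", "department"),
]
```

With these toy vectors, `best_form` picks `("public", "work", "department")`: the pairwise scores favor the singular form even though the plural is the idiomatic whole, which is exactly the behavior reported here.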
Many more examples can be found in the tests.
We should either find a better way to choose the best part forms, or implement separate functionality to compute tag display names (outside of normalization).
Options (for both approaches)
- Implement our own frequency-based analysis on top of the raw LLM output.
- Find an open dataset of tag-like names, for example Wikipedia page titles.
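For the frequency-based option, a rough sketch of what an analysis on top of the raw LLM output could look like: prefer, for each normalized part, the surface form that occurs most often in the raw tags. The function names and the trivial singularizer below are hypothetical.

```python
from collections import Counter

def preferred_forms(raw_tags, normalize_part):
    """For each normalized part, record which raw surface form
    appears most often across the raw LLM output."""
    counts = {}  # normalized part -> Counter of raw surface forms
    for tag in raw_tags:
        for part in tag.split("-"):
            counts.setdefault(normalize_part(part), Counter())[part] += 1
    return {norm: c.most_common(1)[0][0] for norm, c in counts.items()}

def display_name(normalized_tag, forms):
    # Normalization (and thus deduplication) stays untouched;
    # only the display name substitutes the most frequent raw form.
    return "-".join(forms.get(p, p) for p in normalized_tag.split("-"))
```

For instance, if `public-works-department` and `public-works` dominate the raw output, the form table maps the normalized part `work` back to `works`, so the normalized tag `public-work-department` is displayed as `public-works-department` while its canonical form stays unchanged.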