After gh-420 (ff-468) was implemented, we got tag form normalization, which is great for deduplication, but not so great for readability.
Currently, we use spaCy to choose the best tag form by calculating the cosine similarity between neighboring parts of the tag. This works well, but fails in some cases.
For example, `public-works-department` is normalized to `public-work-department`, which is not very readable (`public-works-department` is an idiom). This happens because the sum of cosine similarities between `public` & `work` and between `work` & `department` wins. That is correct for tags like `public-work` and `work-department`, but leads to a wrong result for `public-works-department`.
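To make the failure mode concrete, here is a minimal sketch of the neighbor-pair scoring described above. The toy 2-D vectors and function names are invented for illustration; the actual implementation relies on spaCy word vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def neighbor_score(parts, vectors):
    # Sum of cosine similarities over neighboring part pairs,
    # mirroring the current normalization heuristic.
    return sum(cosine(vectors[a], vectors[b]) for a, b in zip(parts, parts[1:]))

def best_form(candidates, vectors):
    # The candidate with the highest neighbor-pair score wins.
    return max(candidates, key=lambda parts: neighbor_score(parts, vectors))

# Toy vectors (pure illustration, not real embeddings): "work" is
# deliberately placed closer to its neighbors than "works".
vectors = {
    "public": (1.0, 0.0),
    "work": (1.0, 0.2),
    "works": (0.0, 1.0),
    "department": (1.0, 0.1),
}
candidates = [
    ("public", "work", "department"),
    ("public", "works", "department"),
]
```

With these toy vectors, `best_form` picks `("public", "work", "department")`: the pairwise scores favor the singular form even though the plural is the idiomatic whole, which is exactly the behavior reported here.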
Many more examples can be found in the tests.
We should either find a better way to choose the best part forms, or implement separate functionality to compute tag display names (outside of normalization).
Options (for both approaches)
- Implement our own frequency-based analysis on top of the raw LLM output.
- Find an open dataset of tag-like names, for example Wikipedia page titles.
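For the frequency-based option, a rough sketch of what an analysis on top of the raw LLM output could look like: prefer, for each normalized part, the surface form that occurs most often in the raw tags. The function names and the trivial singularizer below are hypothetical.

```python
from collections import Counter

def preferred_forms(raw_tags, normalize_part):
    """For each normalized part, record which raw surface form
    appears most often across the raw LLM output."""
    counts = {}  # normalized part -> Counter of raw surface forms
    for tag in raw_tags:
        for part in tag.split("-"):
            counts.setdefault(normalize_part(part), Counter())[part] += 1
    return {norm: c.most_common(1)[0][0] for norm, c in counts.items()}

def display_name(normalized_tag, forms):
    # Normalization (and thus deduplication) stays untouched;
    # only the display name substitutes the most frequent raw form.
    return "-".join(forms.get(p, p) for p in normalized_tag.split("-"))
```

For instance, if `public-works-department` and `public-works` dominate the raw output, the form table maps the normalized part `work` back to `works`, so the normalized tag `public-work-department` is displayed as `public-works-department` while its canonical form stays unchanged.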