Feature/malayalam normalizer #161

Open
spigelspike wants to merge 2 commits into k2-fsa:master from spigelspike:feature/malayalam-normalizer


@spigelspike

What this does

Adds text normalization for Malayalam (language="ml"). Numbers, currency symbols, time, percentages, etc. get converted to spoken Malayalam words before the text hits the tokenizer.

Without this, the model tries to pronounce raw symbols such as ₹, %, and 5:30 literally, and the resulting audio sounds broken.

Examples

| Input | Normalized |
|---|---|
| ₹250 | ഇരുനൂറ്റി അമ്പത് രൂപ |
| 5:30 | അഞ്ച് മുപ്പത് |
| AICTE | എ ഐ സി ടി ഇ |
| 10.5 | പത്ത് പോയിന്റ് അഞ്ച് |
| 1/2 | അര |
| 1st | ഒന്നാമത്തെ |
| $100 | നൂറ് ഡോളർ |
| 10km | പത്ത് കിലോമീറ്റർ |
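The AICTE row can be illustrated with a small sketch of acronym spelling. This is not the code from `malayalam.py`: the function name `spell_acronyms`, the regular expression, and the decision to leave unknown letters untouched are all assumptions, and the letter table contains only the five letters that appear in the example.

```python
import re

# Letter names taken only from the AICTE example; a real table would
# need all 26 letters. This mapping is illustrative, not the PR's own.
_LETTERS = {"A": "എ", "I": "ഐ", "C": "സി", "T": "ടി", "E": "ഇ"}

# Runs of two or more ASCII capitals are treated as acronyms (an assumption).
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def spell_acronyms(text: str) -> str:
    """Replace all-caps tokens with space-separated Malayalam letter names."""

    def repl(match: re.Match) -> str:
        word = match.group(0)
        # Leave the token as-is if any letter is missing from the table.
        if any(ch not in _LETTERS for ch in word):
            return word
        return " ".join(_LETTERS[ch] for ch in word)

    return _ACRONYM.sub(repl, text)
```

Calling `spell_acronyms("AICTE")` yields the string in the Normalized column for that row, while a token such as `NASA` passes through unchanged because `N` and `S` are not in this partial table.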

Files changed

  • omnivoice/utils/text_normalization/__init__.py — language dispatcher
  • omnivoice/utils/text_normalization/malayalam.py — the normalizer
  • omnivoice/models/omnivoice.py — calls normalize_text() inside _preprocess_all(), guarded by a language check so that other languages are unaffected
  • tests/test_malayalam_normalizer.py — 58 tests
  • docs/tips.md — added a note about this feature
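
A rough sketch of how a language dispatcher with a language guard might look. Only the name `normalize_text` comes from this PR; its signature, the registry dict, and the stand-in `normalize_malayalam` (which knows just two rows from the examples) are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for the real normalizer in malayalam.py.
# It only handles two inputs copied from the PR's examples.
_EXAMPLES = {"$100": "നൂറ് ഡോളർ", "10km": "പത്ത് കിലോമീറ്റർ"}


def normalize_malayalam(text: str) -> str:
    for raw, spoken in _EXAMPLES.items():
        text = text.replace(raw, spoken)
    return text


# Registry keyed by language code; only "ml" is registered here.
_NORMALIZERS: Dict[str, Callable[[str], str]] = {"ml": normalize_malayalam}


def normalize_text(text: str, language: Optional[str] = None) -> str:
    """Apply the normalizer for `language`, or return the text unchanged."""
    normalizer = _NORMALIZERS.get(language or "")
    return normalizer(text) if normalizer else text
```

With this shape, `normalize_text("$100", "ml")` is rewritten while `normalize_text("$100", "en")` and `normalize_text("$100")` return the input untouched, which is the "other languages are unaffected" property described above.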

Audio samples (before/after)

Test 1: "എന്റെ കയ്യിൽ $100 ഉം Rs. 50 ഉം ഉണ്ട്. ഇതിന് £200 വിലവരും."

Test 2: "എനിക്ക് ₹100 ഉണ്ട്, സമയം 5:30 ആയി. AICTE മോഡൽ 10.5 സ്കോർ നേടി. 1/2 ലിറ്റർ പാലും 1st പ്രൈസും കിട്ടി"

Add built-in Malayalam (ml) text normalization that automatically converts numbers, currency (₹), percentages (%), time, decimals, fractions, ordinals, acronyms, and measurement units into their Malayalam spoken-word equivalents during TTS preprocessing.

Adds support for multiple global currencies, including Dollars ($), Pounds (£), Euros (€), and Yen (¥), plus alternative Rupee formats (Rs, Rs.), in both prefix and suffix positions.
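
As an illustration of matching currency in both prefix and suffix positions, here is a minimal sketch. It is not the PR's implementation: the function name `find_currency`, the regular expression, and the use of ISO codes as output are assumptions. The actual normalizer emits Malayalam words rather than codes, and a production pattern would need word-boundary handling so that "Rs" inside ordinary words is not matched.

```python
import re
from typing import List, Tuple

# Symbols listed in the commit message, mapped to ISO codes purely for
# illustration (the real normalizer produces spoken Malayalam instead).
_SYMBOLS = {
    "₹": "INR", "Rs.": "INR", "Rs": "INR",
    "$": "USD", "£": "GBP", "€": "EUR", "¥": "JPY",
}

# Longest alternatives first so "Rs." is preferred over "Rs".
_SYM = "|".join(re.escape(s) for s in sorted(_SYMBOLS, key=len, reverse=True))
_NUM = r"\d+(?:\.\d+)?"

# Either symbol-then-number (prefix) or number-then-symbol (suffix).
_CURRENCY = re.compile(
    rf"(?P<pre>{_SYM})\s?(?P<n1>{_NUM})|(?P<n2>{_NUM})\s?(?P<post>{_SYM})"
)


def find_currency(text: str) -> List[Tuple[str, str]]:
    """Return (amount, currency_code) pairs in order of appearance."""
    found = []
    for m in _CURRENCY.finditer(text):
        symbol = m.group("pre") or m.group("post")
        amount = m.group("n1") or m.group("n2")
        found.append((amount, _SYMBOLS[symbol]))
    return found
```

Run on the Test 1 sentence from the audio samples, this picks up `$100`, `Rs. 50`, and `£200` as three separate amounts; a suffix form such as `100$` or `50 Rs.` is caught by the second alternative.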