
Implement half-width katakana support #21

Merged
PSeitz merged 6 commits into PSeitz:master from exoego:half-width-katakana
Apr 9, 2026

Conversation

exoego commented Aug 13, 2025

Contributor

Closes #19
(recreation of #20 due to accidental repo cleanup)

  • Adds utils/halfwidth_katakana_to_hiragana.
    • Counterintuitively, this is not a "half-width to full-width" conversion: a ..._to_hiragana util can be called inside to_hiragana's character loop, so the input is not enumerated multiple times.
  • In to_katakana and to_romaji, the util is invoked only when the input contains half-width kana.
    • In the worst case, when no half-width katakana is present, that check enumerates the entire input.
  • Neither to_halfwidth_katakana nor a new option is added, since I don't need them.
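
The "only when the input contains half-width kana" check from the second bullet could look roughly like the sketch below. The names `contains_halfwidth_katakana` and `normalize_if_needed` are hypothetical and not taken from this PR, the conversion itself is passed in as a closure rather than implemented, and the range U+FF66..=U+FF9F is the Unicode half-width katakana range (including the half-width prolonged sound mark and the two voicing marks).

```rust
use std::borrow::Cow;

/// Hypothetical helper: true if any char lies in the half-width katakana
/// range U+FF66..=U+FF9F.
fn contains_halfwidth_katakana(input: &str) -> bool {
    input.chars().any(|c| ('\u{FF66}'..='\u{FF9F}').contains(&c))
}

/// Hypothetical wrapper: runs `convert` only when half-width katakana is
/// present, otherwise borrows the input unchanged (no allocation).
fn normalize_if_needed<'a>(input: &'a str, convert: impl Fn(&str) -> String) -> Cow<'a, str> {
    if contains_halfwidth_katakana(input) {
        Cow::Owned(convert(input))
    } else {
        // Worst case for the check: the whole input was scanned and nothing matched.
        Cow::Borrowed(input)
    }
}
```

Because `any` short-circuits, the scan stops at the first half-width character; only input without any half-width katakana is walked to the end.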

Comment thread src/utils/katakana_to_hiragana.rs Outdated
hira.push(hira_char);
previous_kana = Some(hira_char);
} else if is_char_halfwidth_katakana(input_char) {
let result = HALFWIDTH_KATAKANA_TO_HIRAGANA_NODE_TREE.get(&chars[index..]);

Owner

I'd prefer a pos variable that gets incremented instead of the previous_read_forward_count.

Contributor Author

Do you mean this?
fbbf388


exoego commented Jan 8, 2026

Contributor Author

@PSeitz Could you take another look at this so that I can finish it? 🙇

@exoego exoego requested a review from PSeitz April 8, 2026 05:09

PSeitz commented Apr 8, 2026

Owner

Sorry for the delay, will come back shortly to continue the review

Comment thread src/utils/katakana_to_hiragana.rs Outdated
previous_kana = Some(hira_char);
} else if is_char_halfwidth_katakana(input_char) {
let result = HALFWIDTH_KATAKANA_TO_HIRAGANA_NODE_TREE.get(&chars[index..]);
result.0.chars().for_each(|char| hira.push(char));

Owner

That's unidiomatic Rust. Prefer:

hira.extend(result.0.chars());
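
For comparison, here is a small sketch of the two styles side by side. The function names are hypothetical and `converted` stands in for `result.0`; both functions append the same characters, and the second relies on `String` implementing `Extend<char>`.

```rust
/// Appends the chars one at a time via a closure (the style flagged in the review).
fn append_with_for_each(hira: &mut String, converted: &str) {
    converted.chars().for_each(|c| hira.push(c));
}

/// Appends the same chars through `String`'s `Extend<char>` implementation.
fn append_with_extend(hira: &mut String, converted: &str) {
    hira.extend(converted.chars());
}
```

If the converted value is already a string slice, `push_str` would be another option; whether that applies here depends on the actual type of `result.0`.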

Contributor Author

Oh, that's simple. Refactored accordingly in f9597f4

Comment thread src/to_hiragana.rs
assert_eq!(to_hiragana("ダヂヅデド"), "だぢづでど");
assert_eq!(to_hiragana("バビブベボ"), "ばびぶべぼ");
assert_eq!(to_hiragana("パピプペポ"), "ぱぴぷぺぽ");
assert_eq!(to_hiragana("ヴ"), "ゔ");

Owner

I think half-width input is handled differently. Can you check:

assert_eq!("ｽｰﾊﾟｰ".to_hiragana(), "スーパー".to_hiragana());

Contributor Author

Implemented the long-vowel transformation in fd05bc4

Comment thread src/utils/katakana_to_hiragana.rs Outdated
let mut count: usize = 0;
let chars = input.chars().collect::<Vec<_>>();

for (index, input_char) in input.chars().enumerate() {

Owner

We can iterate over the chars vec directly and index via pos.

Then count += 1 happens at the end of the loop, and the half-width case additionally does:

count += result.1 - 1;
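
A minimal sketch of the cursor-based loop suggested above, assuming a lookup that returns the converted text together with the number of input chars it consumed (mirroring how `result.0` and `result.1` are used in the diff). The function `lookup_halfwidth` and its tiny table are hypothetical stand-ins for `HALFWIDTH_KATAKANA_TO_HIRAGANA_NODE_TREE`, not the crate's actual implementation, and `halfwidth_to_hiragana_sketch` is likewise an illustrative name.

```rust
/// Hypothetical stand-in for the node-tree lookup: returns the hiragana for
/// the longest match at the start of `chars` and how many chars it consumed.
fn lookup_halfwidth(chars: &[char]) -> Option<(&'static str, usize)> {
    match chars {
        // half-width HA + half-width semi-voiced mark
        ['\u{FF8A}', '\u{FF9F}', ..] => Some(("\u{3071}", 2)), // pa
        // half-width HA + half-width voiced mark
        ['\u{FF8A}', '\u{FF9E}', ..] => Some(("\u{3070}", 2)), // ba
        ['\u{FF8A}', ..] => Some(("\u{306F}", 1)),             // ha
        ['\u{FF7D}', ..] => Some(("\u{3059}", 1)),             // su
        _ => None,
    }
}

fn halfwidth_to_hiragana_sketch(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut pos = 0;
    while pos < chars.len() {
        match lookup_halfwidth(&chars[pos..]) {
            Some((hira, consumed)) => {
                out.push_str(hira);
                pos += consumed; // jump past everything the match consumed
            }
            None => {
                out.push(chars[pos]); // pass other chars through unchanged
                pos += 1;
            }
        }
    }
    out
}
```

Advancing the cursor by the consumed count means a trailing voicing mark is never visited on its own, so no separate "skip the next char" bookkeeping is needed.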

Contributor Author

Refactored in 7021f8a

exoego added 3 commits April 9, 2026 11:48
This removes the read-ahead skip check, since the cursor jumps past consumed halfwidth katakana directly
Comment thread src/utils/katakana_to_hiragana.rs Outdated
// the long-vowel transformation below applies uniformly.
let chars: Vec<char> = input
.chars()
.map(|c| if c == 'ｰ' { 'ー' } else { c })

Owner

Instead of replacing, we should update the method to also cover the half-width version (and rename the method to is_prolonged_sound)

/// Returns true if char is 'ー'
pub fn is_char_long_dash(char: char) -> bool {
    char as u32 == PROLONGED_SOUND_MARK
}
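
One way the renamed predicate could look, as a sketch only: the constant name `HALFWIDTH_PROLONGED_SOUND_MARK` is an assumption for illustration, and the code actually committed in this PR may differ. U+30FC and U+FF70 are the Unicode code points of the full-width and half-width prolonged sound marks.

```rust
// Code points of the two marks. The half-width constant name is hypothetical.
const PROLONGED_SOUND_MARK: u32 = 0x30FC; // KATAKANA-HIRAGANA PROLONGED SOUND MARK
const HALFWIDTH_PROLONGED_SOUND_MARK: u32 = 0xFF70; // HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK

/// Returns true if `c` is the full-width or the half-width prolonged sound mark.
pub fn is_prolonged_sound(c: char) -> bool {
    matches!(c as u32, PROLONGED_SOUND_MARK | HALFWIDTH_PROLONGED_SOUND_MARK)
}
```

With the predicate accepting both forms, the input no longer needs a pre-pass that rewrites one mark into the other.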

Contributor Author

Refactored in ad6b77d

@exoego exoego force-pushed the half-width-katakana branch from 5fe53f9 to ad6b77d Compare April 9, 2026 05:18
@exoego exoego requested a review from PSeitz April 9, 2026 05:35
@PSeitz PSeitz merged commit 1738330 into PSeitz:master Apr 9, 2026
3 checks passed

PSeitz commented Apr 9, 2026

Owner

Thanks for the PR!

@exoego exoego deleted the half-width-katakana branch April 9, 2026 06:03
@exoego exoego restored the half-width-katakana branch April 9, 2026 12:37

Development

Successfully merging this pull request may close these issues.

feature: hankaku-kana (half-width kana) support

2 participants