Conversation


@robinpats182 robinpats182 commented Jan 12, 2026

  • Created text_utils.py with normalize_text() for NFC Unicode normalization
  • Modified _calculate_rouge_1_scores() to normalize texts before comparison
  • Added automatic detection: word-level ROUGE for space-separated text, character-level scoring for non-space-separated text
  • Implemented _calculate_character_level_rouge() using Counter to calculate precision/recall/F-measure from character overlap
  • Created test_non_english_eval.py for Thai text evaluation

Previously, the metric returned 0.0 for non-English text; it now produces proportional scores based on character-frequency overlap (see the normalization sketch below).
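
For illustration, here is a minimal sketch of what the normalize_text() helper in text_utils.py could look like, assuming it does nothing beyond NFC normalization via the standard library; the actual implementation in this PR may include additional handling:

import unicodedata


def normalize_text(text: str) -> str:
  """Returns the NFC-normalized form of `text`.

  NFC composes base characters and combining marks into their canonical
  form, so visually identical strings (e.g. Thai text produced with
  different encodings of the same glyphs) compare equal.
  """
  return unicodedata.normalize("NFC", text)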

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

If applicable, please follow the issue templates to provide as much detail as
possible.

Problem:
A clear and concise description of what the problem is.

Solution:
A clear and concise description of what you want to happen and why you choose
this solution.

Testing Plan

Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.

Unit Tests:

  • I have added test_non_english_eval.py for testing
  • All unit tests don't pass locally.

Please include a summary of passed pytest results.
tests/unittests/evaluation/test_non_english_eval.py::test_debug_normalization Reference: 'สวัสดี'
Candidate: 'สวัสดี'
Normalized reference: 'สวัสดี'
Normalized candidate: 'สวัสดี'
Are they equal after normalization? True
ROUGE score: Score(precision=1.0, recall=1.0, fmeasure=1.0)
PASSED

Manual End-to-End (E2E) Tests:

Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Add any other context or screenshots about the feature request here.


google-cla bot commented Jan 12, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @robinpats182, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the ROUGE-1 evaluation metric by introducing comprehensive support for non-English and non-space-separated languages. It achieves this by implementing Unicode text normalization and a dynamic mechanism to select between word-level and character-level ROUGE calculations, ensuring that text similarity scores are accurately computed across a wider range of linguistic inputs.

Highlights

  • Enhanced ROUGE-1 Evaluation: Applied NFC Unicode normalization to ensure consistent text comparison, especially for non-English languages, and introduced an automatic detection mechanism within _calculate_rouge_1_scores() to switch between word-level ROUGE (for space-separated text) and character-level ROUGE (for non-space-separated text); a sketch of this dispatch appears after this list.
  • Character-Level ROUGE: Developed a new function, _calculate_character_level_rouge(), which uses collections.Counter to compute ROUGE-1 precision, recall, and F-measure based on character overlap for languages without explicit word delimiters.
  • Non-English Text Support: Resolved an issue where ROUGE-1 scores for non-English texts previously returned 0.0, now providing accurate, proportional scores based on character frequency overlap.
  • New Utility and Test Files: Added a new utility file, text_utils.py, for text normalization and a new test file, test_non_english_eval.py, specifically for validating the ROUGE evaluation with non-English (e.g., Thai) text.
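
To make the detection mechanism concrete, here is a hedged sketch of how the dispatch inside _calculate_rouge_1_scores() might look. It assumes the word-level branch uses the rouge_score package's RougeScorer and that normalize_text() and _calculate_character_level_rouge() live in the same module (both are shown elsewhere in this PR); the exact heuristic and signatures in the actual change may differ:

from rouge_score import rouge_scorer

from .text_utils import normalize_text


def _calculate_rouge_1_scores(candidate: str, reference: str):
  """Dispatches between word-level and character-level ROUGE-1 scoring."""
  candidate = normalize_text(candidate)
  reference = normalize_text(reference)

  # Texts with no spaces (e.g. Thai) cannot be tokenized into words by
  # whitespace, so fall back to character-level overlap scoring.
  if " " not in candidate.strip() and " " not in reference.strip():
    return _calculate_character_level_rouge(candidate, reference)

  # Space-separated text keeps the word-level ROUGE-1 behaviour.
  scorer = rouge_scorer.RougeScorer(["rouge1"])
  return scorer.score(target=reference, prediction=candidate)["rouge1"]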


@adk-bot added the "eval [Component] This issue is related to evaluation" label on Jan 12, 2026
Collaborator

adk-bot commented Jan 12, 2026

Response from ADK Triaging Agent

Hello @robinpats182, thank you for your contribution!

To help us review this PR, could you please provide the following information from our contribution guidelines:

  • In the "Testing Plan" section, please describe the manual end-to-end (E2E) tests that you ran to verify your changes.
  • Could you also clarify the statement "All unit tests don't pass locally" and ensure that all new and existing unit tests are passing?
  • Please complete the checklist in the PR description.

This information will help reviewers to review your PR more efficiently. Thanks!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable enhancement by adding support for non-English, non-space-separated languages to the ROUGE-1 score calculation. The approach of using character-level comparison for such languages is sound, and the code is well-structured with the normalization logic extracted into a text_utils.py file. My review includes suggestions to improve code style, efficiency, and testing practices. In final_response_match_v1.py, I've suggested moving imports, defining a namedtuple at the module level for efficiency, and correcting a docstring. For test_non_english_eval.py, I've proposed refactoring the debug script into a proper, parameterized unit test to cover more scenarios. Lastly, a couple of new files are missing a final newline character, which is a common Python style convention. Overall, these are great changes, and addressing the feedback will make the code more robust and maintainable.

Comment on lines 131 to 165
def _calculate_character_level_rouge(candidate: str, reference: str):
  """Calculates character-level ROUGE-1 score for non-space-separated text.

  Args:
    candidate: The candidate text (already normalized).
    reference: The reference text (already normalized).

  Returns:
    A Score namedtuple with precision, recall, and fmeasure.
  """
  from collections import Counter, namedtuple

  if not reference or not candidate:
    Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])
    return Score(precision=0.0, recall=0.0, fmeasure=0.0)

  # Count character occurrences
  ref_chars = Counter(reference)
  cand_chars = Counter(candidate)

  # Calculate overlapping characters
  overlap = sum((ref_chars & cand_chars).values())

  # Calculate precision and recall
  precision = overlap / len(candidate) if len(candidate) > 0 else 0.0
  recall = overlap / len(reference) if len(reference) > 0 else 0.0

  # Calculate F-measure
  if precision + recall > 0:
    fmeasure = 2 * (precision * recall) / (precision + recall)
  else:
    fmeasure = 0.0

  Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])
  return Score(precision=precision, recall=recall, fmeasure=fmeasure)
Contributor


high

This function has a few areas for improvement regarding style and efficiency:

  • Imports: The import from collections import Counter, namedtuple should be at the top of the file, per PEP 8.
  • namedtuple definition: The Score namedtuple is defined twice inside this function. This is inefficient as it's redefined on every call. It should be defined once at the module level.
  • Redundant checks: The checks for non-zero length before division are redundant, as the if not reference or not candidate: guard at the beginning already handles this.

Here is a suggested refactoring that addresses these points. Please remember to move the import to the top of the file and define Score at the module level.

# At top of file:
from collections import Counter, namedtuple
# ...

# At module level, after imports:
Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])


def _calculate_character_level_rouge(candidate: str, reference: str):
  """Calculates character-level ROUGE-1 score for non-space-separated text.
  
  Args:
    candidate: The candidate text (already normalized).
    reference: The reference text (already normalized).
  
  Returns:
    A Score namedtuple with precision, recall, and fmeasure.
  """
  if not reference or not candidate:
    return Score(precision=0.0, recall=0.0, fmeasure=0.0)
  
  # Count character occurrences
  ref_chars = Counter(reference)
  cand_chars = Counter(candidate)
  
  # Calculate overlapping characters
  overlap = sum((ref_chars & cand_chars).values())
  
  # Calculate precision and recall
  precision = overlap / len(candidate)
  recall = overlap / len(reference)
  
  # Calculate F-measure
  if precision + recall > 0:
    fmeasure = 2 * (precision * recall) / (precision + recall)
  else:
    fmeasure = 0.0
  
  return Score(precision=precision, recall=recall, fmeasure=fmeasure)
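
As a quick sanity check of the character-overlap arithmetic, a hypothetical call to the refactored helper above: with candidate "ab" and reference "ac", only "a" overlaps, so precision = recall = 1/2 and the F-measure works out to 0.5.

score = _calculate_character_level_rouge(candidate="ab", reference="ac")
print(score)  # Score(precision=0.5, recall=0.5, fmeasure=0.5)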

Comment on lines 17 to 40
from __future__ import annotations


def test_debug_normalization():
  """Debug test to see if normalization is being applied."""
  from google.adk.evaluation.final_response_match_v1 import _calculate_rouge_1_scores
  from google.adk.evaluation.text_utils import normalize_text

  reference = "สวัสดี"
  candidate = "สวัสดี"

  # Check normalization directly
  norm_ref = normalize_text(reference)
  norm_cand = normalize_text(candidate)

  print(f"Reference: {repr(reference)}")
  print(f"Candidate: {repr(candidate)}")
  print(f"Normalized reference: {repr(norm_ref)}")
  print(f"Normalized candidate: {repr(norm_cand)}")
  print(f"Are they equal after normalization? {norm_ref == norm_cand}")

  # Now test the actual function
  score = _calculate_rouge_1_scores(candidate, reference)
  print(f"ROUGE score: {score}")
(No newline at end of file)
Contributor


high

This test file contains a debug-style test with print statements instead of assertions. To make the tests more robust and automated, it's better to use pytest features like parametrize to cover multiple scenarios and assert to verify the results. This also makes the test suite cleaner by not printing to standard output during runs.

Here is a suggested replacement for the current test function that uses pytest.mark.parametrize to cover several cases, including perfect matches, partial matches, and no matches for non-space-separated text.

from __future__ import annotations

from collections import namedtuple

import pytest

from google.adk.evaluation.final_response_match_v1 import (
    _calculate_rouge_1_scores,
)

Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


@pytest.mark.parametrize(
    "candidate, reference, expected_score",
    [
        # Perfect match
        ("สวัสดี", "สวัสดี", Score(1.0, 1.0, 1.0)),
        # Partial match
        ("ab", "ac", Score(0.5, 0.5, 0.5)),
        # No match
        ("abc", "def", Score(0.0, 0.0, 0.0)),
        # Candidate is subset of reference
        ("a", "ab", Score(1.0, 0.5, 2 / 3)),
        # Empty candidate
        ("", "abc", Score(0.0, 0.0, 0.0)),
        # Empty reference
        ("abc", "", Score(0.0, 0.0, 0.0)),
        # Both empty
        ("", "", Score(0.0, 0.0, 0.0)),
    ],
)
def test_character_level_rouge(candidate, reference, expected_score):
  """Tests character-level ROUGE for various non-space-separated strings."""
  actual_score = _calculate_rouge_1_scores(candidate, reference)
  assert actual_score.precision == pytest.approx(expected_score.precision)
  assert actual_score.recall == pytest.approx(expected_score.recall)
  assert actual_score.fmeasure == pytest.approx(expected_score.fmeasure)

from .evaluator import EvaluationResult
from .evaluator import Evaluator
from .evaluator import PerInvocationResult
from .text_utils import normalize_text #importing normalize_text function for non-English text comparison
Contributor


medium

As a matter of style, it's better to avoid inline comments on import statements, as per PEP 8. The purpose of the import is clear from the code that uses it. Please remove the comment.

Suggested change
from .text_utils import normalize_text #importing normalize_text function for non-English text comparison
from .text_utils import normalize_text

    reference: The ground-truth text to compare against.
  Returns:
    A dictionary containing the ROUGE-1 precision, recall, and f-measure.
Contributor


medium

The docstring is incorrect. This function returns a Score namedtuple, not a dictionary. Please update the docstring to reflect the actual return type. This will improve clarity for future developers.

Suggested change
A dictionary containing the ROUGE-1 precision, recall, and f-measure.
A Score namedtuple containing the ROUGE-1 precision, recall, and f-measure.

…s to use pytest assertions instead of prints

Labels

eval [Component] This issue is related to evaluation

Development

Successfully merging this pull request may close these issues.

Eval fails for non-English languages

2 participants