Modified _calculate_rouge_1_scores() to normalize texts before comparison and implemented _calculate_character_level_rouge() using Counter to calculate precision/recall/F-measure from character overlap #4131
base: main
Conversation
- Created text_utils.py with normalize_text() for NFC Unicode normalization
- Modified _calculate_rouge_1_scores() to normalize texts before comparison
- Added automatic detection: word-level ROUGE for space-separated text, character-level scoring for non-space-separated text
- Implemented _calculate_character_level_rouge() using Counter to calculate precision/recall/F-measure from character overlap
- Created test_non_english_eval.py for Thai text evaluation

Previously returned 0.0 for non-English text. Now provides proportional scores based on character frequency overlap.
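For context, a minimal sketch of what the described NFC normalization and space-separation check could look like (illustrative only; `_is_space_separated` is a hypothetical helper name, and the actual code in text_utils.py and final_response_match_v1.py may differ):

```python
import unicodedata


def normalize_text(text: str) -> str:
  """Applies NFC Unicode normalization so canonically equivalent strings compare equal."""
  return unicodedata.normalize("NFC", text)


def _is_space_separated(text: str) -> bool:
  """Hypothetical helper: treat text containing internal whitespace as word-segmented."""
  return any(ch.isspace() for ch in text.strip())
```

With a check like this, `_calculate_rouge_1_scores()` can keep the existing word-level ROUGE-1 path for space-separated text and fall back to `_calculate_character_level_rouge()` otherwise.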
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello @robinpats182, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the ROUGE-1 evaluation metric by introducing comprehensive support for non-English and non-space-separated languages. It achieves this by implementing Unicode text normalization and a dynamic mechanism to select between word-level and character-level ROUGE calculations, ensuring that text similarity scores are accurately computed across a wider range of linguistic inputs.
Response from ADK Triaging Agent

Hello @robinpats182, thank you for your contribution! To help us review this PR, could you please provide the following information from our contribution guidelines:

This information will help reviewers review your PR more efficiently. Thanks!
Code Review
This pull request introduces a valuable enhancement by adding support for non-English, non-space-separated languages to the ROUGE-1 score calculation. The approach of using character-level comparison for such languages is sound, and the code is well-structured, with the normalization logic extracted into a text_utils.py file. My review includes suggestions to improve code style, efficiency, and testing practices. In final_response_match_v1.py, I've suggested moving imports, defining a namedtuple at the module level for efficiency, and correcting a docstring. For test_non_english_eval.py, I've proposed refactoring the debug script into a proper, parameterized unit test covering more scenarios. Lastly, a couple of new files are missing a final newline character; ending files with a newline is a common Python style convention. Overall, these are great changes, and addressing the feedback will make the code more robust and maintainable.
```python
def _calculate_character_level_rouge(candidate: str, reference: str):
  """Calculates character-level ROUGE-1 score for non-space-separated text.

  Args:
    candidate: The candidate text (already normalized).
    reference: The reference text (already normalized).

  Returns:
    A Score namedtuple with precision, recall, and fmeasure.
  """
  from collections import Counter, namedtuple

  if not reference or not candidate:
    Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])
    return Score(precision=0.0, recall=0.0, fmeasure=0.0)

  # Count character occurrences
  ref_chars = Counter(reference)
  cand_chars = Counter(candidate)

  # Calculate overlapping characters
  overlap = sum((ref_chars & cand_chars).values())

  # Calculate precision and recall
  precision = overlap / len(candidate) if len(candidate) > 0 else 0.0
  recall = overlap / len(reference) if len(reference) > 0 else 0.0

  # Calculate F-measure
  if precision + recall > 0:
    fmeasure = 2 * (precision * recall) / (precision + recall)
  else:
    fmeasure = 0.0

  Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])
  return Score(precision=precision, recall=recall, fmeasure=fmeasure)
```
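To make the Counter-based overlap concrete, here is a small worked example (illustrative only, not part of the diff):

```python
from collections import Counter

reference = "สวัสดี"      # "hello", 6 characters
candidate = "สวัสดีครับ"  # "hello" plus a polite particle, 10 characters

overlap = sum((Counter(reference) & Counter(candidate)).values())  # 6
precision = overlap / len(candidate)                      # 6 / 10 = 0.6
recall = overlap / len(reference)                         # 6 / 6 = 1.0
fmeasure = 2 * precision * recall / (precision + recall)  # 0.75
```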
This function has a few areas for improvement regarding style and efficiency:

- Imports: The `from collections import Counter, namedtuple` import should be at the top of the file, per PEP 8.
- namedtuple definition: The `Score` namedtuple is defined twice inside this function, which is inefficient because it is redefined on every call. It should be defined once at the module level.
- Redundant checks: The non-zero-length checks before division are redundant, as the `if not reference or not candidate:` guard at the beginning already handles this case.

Here is a suggested refactoring that addresses these points. Please remember to move the import to the top of the file and define `Score` at the module level.
```python
# At top of file:
from collections import Counter, namedtuple

# ...

# At module level, after imports:
Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])


def _calculate_character_level_rouge(candidate: str, reference: str):
  """Calculates character-level ROUGE-1 score for non-space-separated text.

  Args:
    candidate: The candidate text (already normalized).
    reference: The reference text (already normalized).

  Returns:
    A Score namedtuple with precision, recall, and fmeasure.
  """
  if not reference or not candidate:
    return Score(precision=0.0, recall=0.0, fmeasure=0.0)

  # Count character occurrences
  ref_chars = Counter(reference)
  cand_chars = Counter(candidate)

  # Calculate overlapping characters
  overlap = sum((ref_chars & cand_chars).values())

  # Calculate precision and recall
  precision = overlap / len(candidate)
  recall = overlap / len(reference)

  # Calculate F-measure
  if precision + recall > 0:
    fmeasure = 2 * (precision * recall) / (precision + recall)
  else:
    fmeasure = 0.0

  return Score(precision=precision, recall=recall, fmeasure=fmeasure)
```

tests/unittests/evaluation/test_non_english_eval.py:

```python
from __future__ import annotations


def test_debug_normalization():
  """Debug test to see if normalization is being applied."""
  from google.adk.evaluation.final_response_match_v1 import _calculate_rouge_1_scores
  from google.adk.evaluation.text_utils import normalize_text

  reference = "สวัสดี"
  candidate = "สวัสดี"

  # Check normalization directly
  norm_ref = normalize_text(reference)
  norm_cand = normalize_text(candidate)

  print(f"Reference: {repr(reference)}")
  print(f"Candidate: {repr(candidate)}")
  print(f"Normalized reference: {repr(norm_ref)}")
  print(f"Normalized candidate: {repr(norm_cand)}")
  print(f"Are they equal after normalization? {norm_ref == norm_cand}")

  # Now test the actual function
  score = _calculate_rouge_1_scores(candidate, reference)
  print(f"ROUGE score: {score}")
```
This test file contains a debug-style test with print statements instead of assertions. To make the tests more robust and automated, it's better to use pytest features like parametrize to cover multiple scenarios and assert to verify the results. This also makes the test suite cleaner by not printing to standard output during runs.
Here is a suggested replacement for the current test function that uses pytest.mark.parametrize to cover several cases, including perfect matches, partial matches, and no matches for non-space-separated text.
```python
from __future__ import annotations

from collections import namedtuple

import pytest

from google.adk.evaluation.final_response_match_v1 import (
    _calculate_rouge_1_scores,
)

Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


@pytest.mark.parametrize(
    "candidate, reference, expected_score",
    [
        # Perfect match
        ("สวัสดี", "สวัสดี", Score(1.0, 1.0, 1.0)),
        # Partial match
        ("ab", "ac", Score(0.5, 0.5, 0.5)),
        # No match
        ("abc", "def", Score(0.0, 0.0, 0.0)),
        # Candidate is subset of reference
        ("a", "ab", Score(1.0, 0.5, 2 / 3)),
        # Empty candidate
        ("", "abc", Score(0.0, 0.0, 0.0)),
        # Empty reference
        ("abc", "", Score(0.0, 0.0, 0.0)),
        # Both empty
        ("", "", Score(0.0, 0.0, 0.0)),
    ],
)
def test_character_level_rouge(candidate, reference, expected_score):
  """Tests character-level ROUGE for various non-space-separated strings."""
  actual_score = _calculate_rouge_1_scores(candidate, reference)
  assert actual_score.precision == pytest.approx(expected_score.precision)
  assert actual_score.recall == pytest.approx(expected_score.recall)
  assert actual_score.fmeasure == pytest.approx(expected_score.fmeasure)
```

final_response_match_v1.py:

```python
from .evaluator import EvaluationResult
from .evaluator import Evaluator
from .evaluator import PerInvocationResult
from .text_utils import normalize_text #importing normalize_text function for non-English text comparison
```
As a matter of style, it's better to avoid inline comments on import statements, as per PEP 8. The purpose of the import is clear from the code that uses it. Please remove the comment.
```diff
-from .text_utils import normalize_text #importing normalize_text function for non-English text comparison
+from .text_utils import normalize_text
```
```python
    reference: The ground-truth text to compare against.

  Returns:
    A dictionary containing the ROUGE-1 precision, recall, and f-measure.
```
The docstring is incorrect. This function returns a Score namedtuple, not a dictionary. Please update the docstring to reflect the actual return type. This will improve clarity for future developers.
```diff
-    A dictionary containing the ROUGE-1 precision, recall, and f-measure.
+    A Score namedtuple containing the ROUGE-1 precision, recall, and f-measure.
```
Previously returned 0.0 for non-English text. Now provides proportional scores based on character frequency overlap.
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
2. Or, if no issue exists, describe the change:
If applicable, please follow the issue templates to provide as much detail as
possible.
Problem:
A clear and concise description of what the problem is.
Solution:
A clear and concise description of what you want to happen and why you choose
this solution.
Testing Plan
Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.
Unit Tests:
Please include a summary of passed pytest results.

```
tests/unittests/evaluation/test_non_english_eval.py::test_debug_normalization
Reference: 'สวัสดี'
Candidate: 'สวัสดี'
Normalized reference: 'สวัสดี'
Normalized candidate: 'สวัสดี'
Are they equal after normalization? True
ROUGE score: Score(precision=1.0, recall=1.0, fmeasure=1.0)
PASSED
```
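For reference, this output can be reproduced locally with a standard pytest invocation such as `pytest tests/unittests/evaluation/test_non_english_eval.py -v` (assuming the package and its test dependencies are installed in the environment).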
Manual End-to-End (E2E) Tests:
Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.
Checklist
Additional context
Add any other context or screenshots about the feature request here.