Modified _calculate_rouge_1_scores() to normalize texts before comparison and implemented _calculate_character_level_rouge() using Counter to calculate precision/recall/F-measure from character overlap #4131
base: main
Conversation
- Created text_utils.py with normalize_text() for NFC Unicode normalization
- Modified _calculate_rouge_1_scores() to normalize texts before comparison
- Added automatic detection: word-level ROUGE for space-separated text, character-level scoring for non-space-separated text
- Implemented _calculate_character_level_rouge() using Counter to calculate precision/recall/F-measure from character overlap
- Created test_non_english_eval.py for Thai text evaluation

Previously returned 0.0 for non-English text. Now provides proportional scores based on character frequency overlap.
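For context, a minimal sketch of what the described NFC normalization and space-separation check could look like (illustrative only; `_is_space_separated` is a hypothetical helper name, and the actual code in text_utils.py and final_response_match_v1.py may differ):

```python
import unicodedata


def normalize_text(text: str) -> str:
  """Applies NFC Unicode normalization so canonically equivalent strings compare equal."""
  return unicodedata.normalize("NFC", text)


def _is_space_separated(text: str) -> bool:
  """Hypothetical helper: treat text containing internal whitespace as word-segmented."""
  return any(ch.isspace() for ch in text.strip())
```

With a check like this, `_calculate_rouge_1_scores()` can keep the existing word-level ROUGE-1 path for space-separated text and fall back to `_calculate_character_level_rouge()` otherwise.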
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello @robinpats182, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the ROUGE-1 evaluation metric by introducing comprehensive support for non-English and non-space-separated languages. It achieves this by implementing Unicode text normalization and a dynamic mechanism to select between word-level and character-level ROUGE calculations, ensuring that text similarity scores are accurately computed across a wider range of linguistic inputs.
Response from ADK Triaging Agent

Hello @robinpats182, thank you for your contribution! To help us review this PR, could you please provide the following information from our contribution guidelines:

This information will help reviewers review your PR more efficiently. Thanks!
Code Review
This pull request introduces a valuable enhancement by adding support for non-English, non-space-separated languages to the ROUGE-1 score calculation. The approach of using character-level comparison for such languages is sound, and the code is well-structured, with the normalization logic extracted into a text_utils.py file. My review includes suggestions to improve code style, efficiency, and testing practices. In final_response_match_v1.py, I've suggested moving imports, defining a namedtuple at the module level for efficiency, and correcting a docstring. For test_non_english_eval.py, I've proposed refactoring the debug script into a proper, parameterized unit test covering more scenarios. Lastly, a couple of new files are missing a final newline character; ending files with a newline is a common Python style convention. Overall, these are great changes, and addressing the feedback will make the code more robust and maintainable.
```python
def _calculate_character_level_rouge(candidate: str, reference: str):
  """Calculates character-level ROUGE-1 score for non-space-separated text.

  Args:
    candidate: The candidate text (already normalized).
    reference: The reference text (already normalized).

  Returns:
    A Score namedtuple with precision, recall, and fmeasure.
  """
  from collections import Counter, namedtuple

  if not reference or not candidate:
    Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])
    return Score(precision=0.0, recall=0.0, fmeasure=0.0)

  # Count character occurrences
  ref_chars = Counter(reference)
  cand_chars = Counter(candidate)

  # Calculate overlapping characters
  overlap = sum((ref_chars & cand_chars).values())

  # Calculate precision and recall
  precision = overlap / len(candidate) if len(candidate) > 0 else 0.0
  recall = overlap / len(reference) if len(reference) > 0 else 0.0

  # Calculate F-measure
  if precision + recall > 0:
    fmeasure = 2 * (precision * recall) / (precision + recall)
  else:
    fmeasure = 0.0

  Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])
  return Score(precision=precision, recall=recall, fmeasure=fmeasure)
```
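To make the Counter-based overlap concrete, here is a small worked example (illustrative only, not part of the diff):

```python
from collections import Counter

reference = "สวัสดี"      # "hello", 6 characters
candidate = "สวัสดีครับ"  # "hello" plus a polite particle, 10 characters

overlap = sum((Counter(reference) & Counter(candidate)).values())  # 6
precision = overlap / len(candidate)                      # 6 / 10 = 0.6
recall = overlap / len(reference)                         # 6 / 6 = 1.0
fmeasure = 2 * precision * recall / (precision + recall)  # 0.75
```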
This function has a few areas for improvement regarding style and efficiency:

- Imports: The `from collections import Counter, namedtuple` import should be at the top of the file, per PEP 8.
- namedtuple definition: The `Score` namedtuple is defined twice inside this function, which is inefficient because it is redefined on every call. It should be defined once at the module level.
- Redundant checks: The non-zero-length checks before division are redundant, as the `if not reference or not candidate:` guard at the beginning already handles this case.

Here is a suggested refactoring that addresses these points. Please remember to move the import to the top of the file and define `Score` at the module level.
```python
# At top of file:
from collections import Counter, namedtuple

# ...

# At module level, after imports:
Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])


def _calculate_character_level_rouge(candidate: str, reference: str):
  """Calculates character-level ROUGE-1 score for non-space-separated text.

  Args:
    candidate: The candidate text (already normalized).
    reference: The reference text (already normalized).

  Returns:
    A Score namedtuple with precision, recall, and fmeasure.
  """
  if not reference or not candidate:
    return Score(precision=0.0, recall=0.0, fmeasure=0.0)

  # Count character occurrences
  ref_chars = Counter(reference)
  cand_chars = Counter(candidate)

  # Calculate overlapping characters
  overlap = sum((ref_chars & cand_chars).values())

  # Calculate precision and recall
  precision = overlap / len(candidate)
  recall = overlap / len(reference)

  # Calculate F-measure
  if precision + recall > 0:
    fmeasure = 2 * (precision * recall) / (precision + recall)
  else:
    fmeasure = 0.0

  return Score(precision=precision, recall=recall, fmeasure=fmeasure)
```

tests/unittests/evaluation/test_non_english_eval.py:

```python
from __future__ import annotations


def test_debug_normalization():
  """Debug test to see if normalization is being applied."""
  from google.adk.evaluation.final_response_match_v1 import _calculate_rouge_1_scores
  from google.adk.evaluation.text_utils import normalize_text

  reference = "สวัสดี"
  candidate = "สวัสดี"

  # Check normalization directly
  norm_ref = normalize_text(reference)
  norm_cand = normalize_text(candidate)

  print(f"Reference: {repr(reference)}")
  print(f"Candidate: {repr(candidate)}")
  print(f"Normalized reference: {repr(norm_ref)}")
  print(f"Normalized candidate: {repr(norm_cand)}")
  print(f"Are they equal after normalization? {norm_ref == norm_cand}")

  # Now test the actual function
  score = _calculate_rouge_1_scores(candidate, reference)
  print(f"ROUGE score: {score}")
```
This test file contains a debug-style test with print statements instead of assertions. To make the tests more robust and automated, it's better to use pytest features like parametrize to cover multiple scenarios and assert to verify the results. This also makes the test suite cleaner by not printing to standard output during runs.
Here is a suggested replacement for the current test function that uses pytest.mark.parametrize to cover several cases, including perfect matches, partial matches, and no matches for non-space-separated text.
```python
from __future__ import annotations

from collections import namedtuple

import pytest

from google.adk.evaluation.final_response_match_v1 import (
    _calculate_rouge_1_scores,
)

Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


@pytest.mark.parametrize(
    "candidate, reference, expected_score",
    [
        # Perfect match
        ("สวัสดี", "สวัสดี", Score(1.0, 1.0, 1.0)),
        # Partial match
        ("ab", "ac", Score(0.5, 0.5, 0.5)),
        # No match
        ("abc", "def", Score(0.0, 0.0, 0.0)),
        # Candidate is subset of reference
        ("a", "ab", Score(1.0, 0.5, 2 / 3)),
        # Empty candidate
        ("", "abc", Score(0.0, 0.0, 0.0)),
        # Empty reference
        ("abc", "", Score(0.0, 0.0, 0.0)),
        # Both empty
        ("", "", Score(0.0, 0.0, 0.0)),
    ],
)
def test_character_level_rouge(candidate, reference, expected_score):
  """Tests character-level ROUGE for various non-space-separated strings."""
  actual_score = _calculate_rouge_1_scores(candidate, reference)
  assert actual_score.precision == pytest.approx(expected_score.precision)
  assert actual_score.recall == pytest.approx(expected_score.recall)
  assert actual_score.fmeasure == pytest.approx(expected_score.fmeasure)
```

final_response_match_v1.py:

```python
from .evaluator import EvaluationResult
from .evaluator import Evaluator
from .evaluator import PerInvocationResult
from .text_utils import normalize_text #importing normalize_text function for non-English text comparison
```
As a matter of style, it's better to avoid inline comments on import statements, as per PEP 8. The purpose of the import is clear from the code that uses it. Please remove the comment.
```diff
-from .text_utils import normalize_text #importing normalize_text function for non-English text comparison
+from .text_utils import normalize_text
```
```python
    reference: The ground-truth text to compare against.

  Returns:
    A dictionary containing the ROUGE-1 precision, recall, and f-measure.
```
The docstring is incorrect. This function returns a Score namedtuple, not a dictionary. Please update the docstring to reflect the actual return type. This will improve clarity for future developers.
```diff
-    A dictionary containing the ROUGE-1 precision, recall, and f-measure.
+    A Score namedtuple containing the ROUGE-1 precision, recall, and f-measure.
```
Previously returned 0.0 for non-English text. Now provides proportional scores based on character frequency overlap.
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
2. Or, if no issue exists, describe the change:
If applicable, please follow the issue templates to provide as much detail as
possible.
Problem:
A clear and concise description of what the problem is.
Solution:
A clear and concise description of what you want to happen and why you choose
this solution.
Testing Plan
Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.
Unit Tests:
Please include a summary of passed pytest results.

```
tests/unittests/evaluation/test_non_english_eval.py::test_debug_normalization
Reference: 'สวัสดี'
Candidate: 'สวัสดี'
Normalized reference: 'สวัสดี'
Normalized candidate: 'สวัสดี'
Are they equal after normalization? True
ROUGE score: Score(precision=1.0, recall=1.0, fmeasure=1.0)
PASSED
```
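For reference, this output can be reproduced locally with a standard pytest invocation such as `pytest tests/unittests/evaluation/test_non_english_eval.py -v` (assuming the package and its test dependencies are installed in the environment).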
Manual End-to-End (E2E) Tests:
Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.
Checklist
Additional context
Add any other context or screenshots about the feature request here.