[Near Deduplication] Tokenization #10

@ChenghaoMou

As we extend deduplication to a wide range of languages, the choice of tokenization method will affect the final results.

The current script uses a simple regex and unigrams for the MinHash calculation. What are the consequences of using a different configuration?
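
To make the question concrete, here is a minimal sketch of how the tokenization choice changes the set similarity that MinHash estimates. It computes exact Jaccard similarity rather than MinHash signatures, and everything in it is an assumption for illustration: the `\W+` regex, the helper names (`word_ngrams`, `char_ngrams`, `jaccard`), and the sample sentences are hypothetical and not taken from the current script.

```python
import re
from typing import Set

# Assumed splitting pattern; the regex in the actual script may differ.
NON_WORD = re.compile(r"\W+")


def word_ngrams(text: str, n: int = 1) -> Set[str]:
    """Split on non-word characters, then build word-level n-grams."""
    tokens = [t for t in NON_WORD.split(text.lower()) if t]
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Build character-level n-grams, independent of word boundaries."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity, the quantity MinHash approximates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Same words, different order: unigrams cannot tell these apart.
en_a = "the quick brown fox jumps over the lazy dog"
en_b = "the lazy dog jumps over the quick brown fox"
en_unigram = jaccard(word_ngrams(en_a, 1), word_ngrams(en_b, 1))
en_bigram = jaccard(word_ngrams(en_a, 2), word_ngrams(en_b, 2))

# Chinese text has no spaces, so the regex yields one token per sentence.
zh_a = "我喜欢自然语言处理"
zh_b = "我喜欢自然语言处理吗"
zh_unigram = jaccard(word_ngrams(zh_a, 1), word_ngrams(zh_b, 1))
zh_char3 = jaccard(char_ngrams(zh_a, 3), char_ngrams(zh_b, 3))

print("en unigram:", round(en_unigram, 3))
print("en bigram: ", round(en_bigram, 3))
print("zh unigram:", round(zh_unigram, 3))
print("zh char-3: ", round(zh_char3, 3))
```

In this toy case, unigrams treat the reordered English sentences as identical while bigrams do not, and the regex-based unigram split sees the two nearly identical Chinese sentences as completely different because each sentence becomes a single token. Character n-grams avoid the second problem, at the cost of larger shingle sets.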
