
[Near Deduplication] Post processing #9

@ChenghaoMou

The current script builds clusters of duplicates, but there are cases where it might yield unwanted results:

When doc B is clustered under doc A's name, another doc C can in turn be clustered under B's name (A~B, B~C, C!~A). When we then delete the non-"extreme" docs from each cluster, we can end up keeping both A and B in the results, even though they are near-duplicates of each other.

A better way to delete duplicates is to find communities within each connected component. This approach is used in https://github.com/src-d/gemini.
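
As a rough illustration of the first half of that idea, here is a minimal sketch that groups docs into connected components with union-find and keeps one representative per component. It assumes the near-duplicate pairs are already available as a list of `(doc, doc)` tuples; the function name `connected_components` and the sample pairs are hypothetical, not taken from the current script. It does not do community detection: it treats every component as a single duplicate group, so in the A~B, B~C, C!~A case it would also drop C.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group docs linked by near-duplicate pairs (hypothetical helper)."""
    parent = {}

    def find(x):
        # Register unseen docs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root to the smaller one, so the
            # representative of a component is always its smallest doc id.
            parent[max(ra, rb)] = min(ra, rb)

    for a, b in pairs:
        union(a, b)

    components = defaultdict(set)
    for doc in parent:
        components[find(doc)].add(doc)
    return dict(components)


# Example from the description: A~B and B~C, plus an unrelated pair D~E.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(pairs)

# Keep exactly one doc per component (the key) and drop the rest.
keep = set(components)
```

With this grouping, A and B can no longer both survive, because they land in the same component. Running community detection inside each component, as the issue proposes, would be a refinement on top of this to avoid over-deleting docs in long chains.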
