Duplicate datasets. #167

@alexzwanenburg

While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.

  • cmc and contraceptive are the same. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate cmc and contraceptive datasets.
  • 195_auto_price and 207_autoPrice. The symboling feature is shifted between the two datasets. Note: the underlying dataset seems to be the same as the one used for auto. The datasets differ in the target, which is price for 195_auto_price and 207_autoPrice and symboling for auto, and in how missing values were removed. The original dataset can be found on the UCI ML repository.
    • Parse data from the original into the expected format with price as target.
    • Parse data from the original into the expected format with symboling as target.
    • Ensure that Description of each new dataset references the other.
    • Deprecate 195_auto_price, 207_autoPrice and auto datasets.
  • glass and prnn_fglass. The target class levels seem to be switched between datasets. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate glass and prnn_fglass datasets.
  • heart_c, cleve, cleveland_nominal and cleveland. The cleve and heart_c data sets have a binarized target (vs. ordinal in the other two datasets); the cleveland_nominal data set contains only a feature subset. The original can be found on the UCI ML repository.
  • heart_statlog is a subset of the cleve data set.
  • heart_h and hungarian appear to be the same.
    • Parse Cleveland data from the original into the expected format.
    • Parse Hungarian data from the original into the expected format.
    • Parse Switzerland data (currently missing) from the original into the expected format.
    • Parse VA Long Beach data (currently missing) from the original into the expected format.
    • Deprecate heart_c, cleve, cleveland_nominal, cleveland, heart_statlog, heart_h and hungarian datasets.
  • colic and horse_colic appear to be the same. The original can be found on the UCI ML repository. This issue was also mentioned in "horse-colic class labels are in [1,2]" (#75).
    • Parse data from the original into the expected format.
    • Deprecate colic and horse_colic datasets.
  • vote and house_votes_84 are identical.
    • Identify original source.
    • Parse data from the original into the expected format.
    • Deprecate vote and house_votes_84 datasets.
  • breast_cancer_wisconsin and wdbc are the same. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate breast_cancer_wisconsin and wdbc datasets.
  • australian, buggyCrx, credit_a and crx are identical or based on the same data.
    • Identify original source.
    • Parse data from the original into the expected format.
    • Deprecate australian, buggyCrx, credit_a and crx datasets.
  • breast_w and breast are based on the same data. The breast dataset has a Sample code number feature that is not present in breast_w. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate breast_w and breast datasets.
  • diabetes and pima appear to be identical.
    • Identify original source. This dataset appears to have been hosted at the UCI ML repository. However, the original owner seems to have withdrawn permission to use this dataset.
    • Parse data from the original into the expected format.
    • Deprecate diabetes and pima datasets.
  • credit_g and german appear to be identical.
    • Identify original source. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate credit_g and german datasets.
  • solar_flare_2 and flare derive from the same data, but differ in the way the target is formulated. solar_flare_2 also contains two additional features.
    • Identify original source. The original can be found on the UCI ML repository. There are three targets, of which one is useful for ML prediction. The additional features in solar_flare_2 are in fact the other two targets.
    • Parse data from the original into the expected format.
    • Deprecate solar_flare_2 and flare datasets.
  • car and car_evaluation are based on the same dataset. In the car_evaluation dataset, several categorical (ordinal) features from car are one-hot encoded. The original can be found on the UCI ML repository. This issue was also mentioned in "car and car_evaluation seem to be identical" (#84).
    • Parse data from the original into the expected format.
    • Deprecate car and car_evaluation datasets.
  • chess and kr_vs_kp are identical. The original can be found on the UCI ML repository.
    • Parse data from the original into the expected format.
    • Deprecate chess and kr_vs_kp datasets.
  • satimage and 294_satellite_image are the same, except that 294_satellite_image incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and its target has multiple (6) classes.
    • Parse data from the original into the expected format.
    • Deprecate satimage and 294_satellite_image datasets.
  • 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act are based on the same dataset, with the difference being that 227_cpu_small and 562_cpu_small have fewer features.
    • Identify original source.
    • Parse data from the original into the expected format.
    • Deprecate 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act datasets.
  • poker and 1595_poker are identical except for the target specification. The original can be found on the UCI ML repository, and its documentation suggests that the target is ordinal.
    • Parse data from the original into the expected format.
    • Deprecate poker and 1595_poker datasets.
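
For anyone who wants to repeat this kind of comparison, here is a minimal sketch of one way to flag candidate duplicates and column subsets. It is not the procedure used to build the list above. It assumes each dataset has already been loaded as a dict of column name to list of values, and the function names (column_fingerprint, compare_tables) and the toy values are hypothetical. Columns are matched by content rather than by name, since the same feature is often named differently across datasets.

```python
import hashlib


def column_fingerprint(values):
    """Hash a column's values, ignoring row order (values compared as strings)."""
    digest = hashlib.sha256()
    for value in sorted(str(v) for v in values):
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["1", "23"] differs from ["12", "3"]
    return digest.hexdigest()


def compare_tables(a, b):
    """Classify table b relative to table a using column contents only."""
    fp_a = {column_fingerprint(values) for values in a.values()}
    fp_b = {column_fingerprint(values) for values in b.values()}
    if fp_a == fp_b:
        return "same columns"
    if fp_b < fp_a:
        return "b is a column subset of a"
    if fp_a < fp_b:
        return "a is a column subset of b"
    if fp_a & fp_b:
        return "partial overlap"
    return "no shared columns"


# Toy, made-up tables: the second has fewer columns, other names, shuffled rows.
full = {"age": [63, 41, 57], "chol": [233, 204, 354], "target": [1, 0, 0]}
subset = {"a": [41, 57, 63], "target": [0, 0, 1]}
print(compare_tables(full, subset))
```

This is only a screening step. Because each column is sorted independently, a match does not prove that the rows line up, and columns that were shifted, relabelled, one-hot encoded, or stored with a different numeric formatting (1 vs. 1.0) will not match at all. Candidates flagged this way still need a manual look.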

My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues:
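
As a rough illustration of the recurring "parse data from the original into the expected format" step, the sketch below converts headerless comma-separated text into a tab-separated table with the label in a final column named target. It assumes that this is the expected format, that "?" marks a missing value in the original, and that missing values should be kept as empty cells; all three should be checked against the pmlb contribution guidelines. The function names, column names, and sample rows are illustrative only.

```python
import csv
import gzip
import io


def to_expected_format(raw_text, column_names, target_name, missing_token="?"):
    """Return TSV text with feature columns first and a final 'target' column."""
    features = [name for name in column_names if name != target_name]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(features + ["target"])
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:
            continue  # skip blank lines in the original file
        record = dict(zip(column_names, (cell.strip() for cell in row)))
        # Assumption: keep missing values as empty cells instead of dropping rows.
        cleaned = {k: ("" if v == missing_token else v) for k, v in record.items()}
        writer.writerow([cleaned[name] for name in features] + [cleaned[target_name]])
    return out.getvalue()


def write_tsv_gz(path, tsv_text):
    """Write the converted table as gzip-compressed text."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        handle.write(tsv_text)


# Illustrative rows and column names, not the real file layout.
raw = "3,?,alfa-romero,13495\n1,164,audi,13950\n"
names = ["symboling", "normalized_losses", "make", "price"]

# The same original can yield two datasets that differ only in the target.
price_version = to_expected_format(raw, names, target_name="price")
symboling_version = to_expected_format(raw, names, target_name="symboling")
print(price_version)
```

Generating every variant from one parsed original, as in the last two calls, keeps the missing-value handling identical across variants, which is one of the differences noted between the existing copies.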
