While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.
- cmc and contraceptive are the same. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate `cmc` and `contraceptive` datasets.
- 195_auto_price and 207_autoPrice. The `symboling` feature is shifted between the two datasets. Note: the underlying dataset seems to be the same as the one used for auto. The datasets differ in the target, which is price for `195_auto_price` and `207_autoPrice`, and symboling for `auto`, as well as in how missing values were removed. The original dataset may be found on the UCI ML repository.
- Parse data from the original into the expected format with price as target.
- Parse data from the original into the expected format with symboling as target.
- Ensure that the `Description` of each new dataset references the other.
- Deprecate `195_auto_price`, `207_autoPrice` and `auto` datasets.
- glass and prnn_fglass. The target class levels seem to be switched between datasets. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate `glass` and `prnn_fglass` datasets.
- heart_c, cleve, cleveland_nominal and cleveland. The `cleve` and `heart_c` datasets have a binarized target (vs. ordinal in the other two datasets); the `cleveland_nominal` dataset contains only a feature subset. The original can be found on the UCI ML repository.
- heart_statlog is a subset of the `cleve` dataset.
- heart_h and hungarian appear to be the same.
- Parse Cleveland data from the original into the expected format.
- Parse Hungarian data from the original into the expected format.
- Parse Switzerland data (currently missing) from the original into the expected format.
- Parse VA Long beach data (currently missing) from the original into the expected format.
- Deprecate `heart_c`, `cleve`, `cleveland_nominal`, `cleveland`, `heart_statlog`, `heart_h` and `hungarian` datasets.
- colic and horse_colic appear to be the same. The original can be found on the UCI ML repository. This issue was also mentioned in "horse-colic class labels are in [1,2]" (#75).
- Parse data from the original into the expected format.
- Deprecate `colic` and `horse_colic` datasets.
- vote and house_votes_84 are identical.
- Identify original source.
- Parse data from the original into the expected format.
- Deprecate `vote` and `house_votes_84` datasets.
- breast_cancer_wisconsin and wdbc are the same. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate `breast_cancer_wisconsin` and `wdbc` datasets.
- australian, buggyCrx, credit_a and crx are identical or based on the same data.
- Identify original source.
- Parse data from the original into the expected format.
- Deprecate `australian`, `buggyCrx`, `credit_a` and `crx` datasets.
- breast_w and breast are based on the same data. The `breast` dataset has a `Sample code number` feature that is not present in `breast_w`. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate `breast_w` and `breast` datasets.
- diabetes and pima appear to be identical.
- Identify original source. This dataset appears to have been hosted at the UCI ML repository. However, the original owner seems to have withdrawn permission to use this dataset.
- Parse data from the original into the expected format.
- Deprecate `diabetes` and `pima` datasets.
- credit_g and german appear to be identical.
- Identify original source. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate `credit_g` and `german` datasets.
- solar_flare_2 and flare derive from the same data, but differ in the way the target is formulated. `solar_flare_2` also contains two additional features.
- Identify original source. The original can be found on the UCI ML repository. There are three targets, of which one is useful for ML prediction. The additional features in `solar_flare_2` are in fact the other two targets.
- Parse data from the original into the expected format.
- Deprecate `solar_flare_2` and `flare` datasets.
- car and car_evaluation are based on the same dataset. In the `car_evaluation` dataset several categorical (ordinal) features from `car` are one-hot-encoded. The original can be found on the UCI ML repository. This issue was also mentioned in "car and car_evaluation seem to be identical" (#84).
- Parse data from the original into the expected format.
- Deprecate `car` and `car_evaluation` datasets.
- chess and kr_vs_kp are identical. The original can be found on the UCI ML repository.
- Parse data from the original into the expected format.
- Deprecate `chess` and `kr_vs_kp` datasets.
- satimage and 294_satellite_image are the same, with the exception that `294_satellite_image` incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and has multiple (6) classes as target.
- Parse data from the original into the expected format.
- Deprecate `satimage` and `294_satellite_image` datasets.
- 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act are based on the same dataset, with the difference being that `227_cpu_small` and `562_cpu_small` have fewer features.
- Identify original source.
- Parse data from the original into the expected format.
- Deprecate `197_cpu_act`, `227_cpu_small`, `562_cpu_small` and `573_cpu_act` datasets.
- poker and 1595_poker are identical except for the target specification. The original can be found on the UCI ML repository, and its documentation suggests that the target is ordinal.
- Parse data from the original into the expected format.
- Deprecate `poker` and `1595_poker` datasets.
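For anyone who wants to re-check the overlaps listed above, here is a minimal sketch of the kind of comparison involved. It is not the procedure used to compile this list. It assumes each dataset has already been loaded as a list of column names plus a list of row tuples; loading is left out. The helper names `row_counts`, `classify_overlap` and `relabelling` are hypothetical, and the toy rows are made up rather than taken from any pmlb dataset.

```python
from collections import Counter


def row_counts(columns, rows, keep):
    """Count each row restricted to the columns in `keep`, ignoring row order."""
    positions = [columns.index(name) for name in keep]
    return Counter(tuple(row[p] for p in positions) for row in rows)


def classify_overlap(cols_a, rows_a, cols_b, rows_b):
    """Describe how table A relates to table B on their shared columns."""
    shared = sorted(set(cols_a) & set(cols_b))
    if not shared:
        return "no shared columns"
    a = row_counts(cols_a, rows_a, shared)
    b = row_counts(cols_b, rows_b, shared)
    if a == b:
        if set(cols_a) == set(cols_b):
            return "identical"
        return "same rows, different feature sets"
    if all(count <= b[row] for row, count in a.items()):
        return "A is a row subset of B"
    if all(count <= a[row] for row, count in b.items()):
        return "B is a row subset of A"
    return "different"


def relabelling(values_a, values_b):
    """Return the one-to-one mapping from values_a to values_b, or None.

    Assumption: both columns list the same samples in the same order.
    """
    if len(values_a) != len(values_b):
        return None
    forward, backward = {}, {}
    for x, y in zip(values_a, values_b):
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return None
    return forward


# Toy data, invented for illustration only.
cols = ["f1", "f2", "target"]
rows = [(1, 0, "a"), (2, 1, "b"), (2, 1, "b")]

print(classify_overlap(cols, rows, cols, rows[::-1]))   # same rows, shuffled
print(classify_overlap(cols, rows[:2], cols, rows))     # fewer rows in A
print(relabelling(["a", "b", "b"], [1, 0, 0]))          # switched class levels
```

Matching on shared column names only works when both copies use the same names, which is often not the case for these datasets. Encoding differences such as the one-hot-encoded features in `car_evaluation` would also need to be undone before a comparison like this is meaningful.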
My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues: