Update on004588 to v1.0.2 by AmanJaiswal1503 · Pull Request #1 · nemarDatasets/on004588

AmanJaiswal1503 · 2026-05-22T19:42:01Z

Dataset Update

Bumps on004588 from 1.0.1 to 1.0.2.

Changed files

.bidsignore
README.md
dataset_description.json
participants.tsv
sub-S37/eeg/sub-S37_task-unnamed_events.tsv
task-unnamed_events.json

AmanJaiswal1503 · 2026-06-01T17:36:10Z

Claude Review

This PR is a NEMAR curation pass on on004588 (the Neuma neuromarketing dataset), taking the validator from 43 errors + 1850 warnings to 0 errors + 1765 warnings. It bumps Version 1.0.0 → 1.0.2 and BIDSVersion 1.8.0 → 1.11.1. Six files change: dataset_description.json, task-unnamed_events.json, participants.tsv, one subject events table, .bidsignore, and README.md. Walking through what it changes:

dataset_description.json (four field-level edits). DatasetType: "raw" is added — without it the validator falls through to derivative rules and emits cascading warnings, so adding it is correct for a raw-acquisition dataset. BIDSVersion moves from 1.8.0 to 1.11.1 to match the validator the curation is being run against; BIDS minor bumps are backward-compatible so this is safe. ReferencesAndLinks was [""] (a one-element array containing an empty string, which is not a valid URL and is what the validator was complaining about) and is now []. Version is bumped from 1.0.0 to 1.0.2 — note the README mentions the BIDSVersion change but says it moved from 1.2 to 1.11.1, when the actual source value was 1.8.0, and it doesn't mention the Version bump at all; worth tightening up the changelog before merge.
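The four field-level edits above can be sketched as a small transformation. This is an illustrative Python sketch, not the script used for the curation: the function name `curate_description` and the `Name` value in the sample input are hypothetical, and only the four fields named in the review are touched.

```python
import json

def curate_description(desc: dict) -> dict:
    """Return a copy of a dataset_description dict with the four edits applied."""
    out = dict(desc)
    out["DatasetType"] = "raw"        # raw-acquisition dataset
    out["BIDSVersion"] = "1.11.1"     # match the validator being run
    # Drop empty-string entries, so [""] becomes [].
    out["ReferencesAndLinks"] = [
        link for link in desc.get("ReferencesAndLinks", []) if link.strip()
    ]
    out["Version"] = "1.0.2"
    return out

# Hypothetical "before" state built from the values quoted in the review.
before = {
    "Name": "example",
    "BIDSVersion": "1.8.0",
    "ReferencesAndLinks": [""],
    "Version": "1.0.0",
}
after = curate_description(before)
print(json.dumps(after, indent=2))
```

The input dict is copied rather than mutated, so the before and after states can be diffed field by field.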

task-unnamed_events.json (rewritten from [] to a proper object). BIDS sidecar JSON files must be objects, not arrays — the original empty array fired a top-level structural error. The rewrite documents the two non-standard columns that the per-recording events tables carry: sample (the sample index) and value (the event label, declared as free-form text because the observed labels span dozens of distinct mouse/keyboard/wheel strings that no reasonable enum would cover). The description on value correctly points to representative observed labels rather than inventing a controlled vocabulary.
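The structural rule behind this fix (a sidecar must parse to a JSON object, not an array) is easy to check mechanically. A minimal Python sketch, assuming nothing beyond the standard library; the function name `sidecar_problem` is hypothetical and this is not how the validator itself is implemented.

```python
import json

def sidecar_problem(text: str):
    """Return None if text parses to a JSON object, else a short problem description."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return f"top level is {type(data).__name__}, expected an object"
    return None

print(sidecar_problem("[]"))   # the original empty array is rejected
print(sidecar_problem("{}"))   # an object, even an empty one, passes this check
```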

participants.tsv (5 IDs zero-padded). The on-disk subject directories use the zero-padded sub-SNN form (I confirmed: 42 subject directories, all sub-S01 through sub-S44 style). The TSV listed sub-S1, sub-S2, sub-S3, sub-S5, sub-S6 for the first five rows — an inconsistency the validator flags. Those five entries are now sub-S01, sub-S02, sub-S03, sub-S05, sub-S06. I verified row-by-row that only those five cells changed; the remaining 37 rows are byte-identical. participant_id is the only column.
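The padding rule can be written as a one-line normalization. A hedged Python sketch: the helper name `pad_participant_id` and the two-digit width are assumptions drawn from the sub-SNN form described above, not code from the PR.

```python
import re

# Matches IDs of the form sub-S<digits>; anything else is left untouched.
_ID = re.compile(r"^sub-S(\d+)$")

def pad_participant_id(pid: str, width: int = 2) -> str:
    """Zero-pad the numeric part of a sub-S<N> identifier to `width` digits."""
    match = _ID.match(pid)
    if match is None:
        return pid
    return f"sub-S{int(match.group(1)):0{width}d}"

for pid in ["sub-S1", "sub-S2", "sub-S3", "sub-S5", "sub-S6", "sub-S37"]:
    print(pid, "->", pad_participant_id(pid))
```

IDs that are already padded pass through unchanged, which is consistent with only five cells differing in the diff.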

sub-S37/eeg/sub-S37_task-unnamed_events.tsv (final row dropped, 245 data rows preserved). The last row at sample 200123 had MouseButt followed by stray non-UTF-8 bytes (\xbf\x3e\xcf in the original), almost certainly a truncated write of MouseButtonLeft pressed by analogy with row 199902. That one row failed encoding validation and cascaded into missing-column errors on the whole file. The fix drops the row rather than inventing the likely intended label — the right call, since fabricating event data is worse than losing one event. I diffed the first 246 lines of the original against the new 246-line file: byte-identical, so nothing else in the events stream moved.
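Dropping a row that fails UTF-8 decoding, rather than repairing it, can be sketched as follows. This is an illustrative Python sketch on synthetic bytes: the function name `split_decodable` is hypothetical, and the onset and duration cells of the truncated row are placeholders (`n/a`), since the review does not quote them.

```python
def split_decodable(raw: bytes):
    """Split raw TSV bytes into (decodable text lines, undecodable raw lines)."""
    kept, dropped = [], []
    for line in raw.splitlines():
        try:
            kept.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            dropped.append(line)   # keep the raw bytes for reporting upstream
    return kept, dropped

# Synthetic example: a header, one intact row, and a row ending in the
# stray bytes described above.
raw = (
    b"onset\tduration\tsample\tvalue\n"
    b"1.31000\t1.0000\t393\tMouseButtonLeft pressed\n"
    b"n/a\tn/a\t200123\tMouseButt\xbf\x3e\xcf\n"
)
kept, dropped = split_decodable(raw)
print(len(kept), "lines kept;", len(dropped), "dropped")
```

Keeping the dropped bytes around, instead of discarding them, makes it straightforward to hand the damaged row back to the original lab.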

.bidsignore (patterns broadened from anchored to unanchored). Each subject carries a non-BIDS eye_tracker/ modality directory holding the paired ET .set/.fdt recording, and the dataset root carries a QuestionnaireResponses/ folder; neither is a BIDS modality, so both belong on the ignore list. The original **/eye_tracker/** pattern only matches files inside the directory and leaves the directory entry itself flagged as not-included. The new triple */eye_tracker, */eye_tracker/, */eye_tracker/** (and the matching shape for QuestionnaireResponses) covers the directory entry and its contents. Functionally equivalent at the file level, just suppresses the directory-entry warning the validator was emitting.

README.md (curation log). Lays out the curation rationale and the "left untouched" list — manufacturer / model / software / institution / cognitive-atlas URIs across the per-subject _eeg.json sidecars are deliberately left blank rather than guessed, which is the right call for study-specific facts the original lab needs to confirm. The remaining 1765 warnings are correctly characterized as recommended-but-missing fields (1722 sidecar entries, 42 event-onset-ordering soft warnings, 1 missing GeneratedBy at the dataset level), not structural defects.

Data integrity

Six files changed, all text: one .json, one .tsv (participants), one _events.tsv (one subject), one .bidsignore, one dataset_description.json, and the README.md. Zero binary files touched — I spot-checked four .set/.fdt blobs (S18 EEG and ET) and confirmed identical git SHAs on main and the PR branch, which is what you'd expect for git-annex symlinks that weren't modified. No _channels.tsv, _electrodes.tsv, or _coordsystem.json files appear in the diff. This dataset has no _scans.tsv files, so the Z-suffix question doesn't arise. Programmatic diff confirms no signal data, event timing, sample indices, or channel-level metadata was altered beyond the single corrupted S37 row.
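The SHA comparison rests on how git names objects: a blob ID is the SHA-1 of a short header followed by the content. A minimal Python sketch, assuming a SHA-1 repository; the function name `git_blob_sha1` is illustrative and not part of any tool used in this review. For an annexed file stored as a symlink, the hashed content is the link target rather than the binary payload, so matching IDs show that the pointer was not changed.

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Compute the object ID git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Identical content always yields the identical ID, which is why equal IDs
# on main and on the PR branch imply the stored blob did not change.
print(git_blob_sha1(b""))
```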

Recommendations

Tighten the README changelog before merge: it says BIDSVersion went from 1.2 to 1.11.1 (the source was actually 1.8.0) and omits the Version bump from 1.0.0 to 1.0.2. The fixes themselves are fine; just the documentation drifted.
Consider filing the MouseButt-truncated S37 row upstream with the original lab — it's plausibly recoverable from the raw acquisition log if they still have it, and the event was real, just lost in serialization.
The eeg-verify harness UnicodeDecodeError noted in the README's "Harness note" is worth filing upstream if not already tracked, since that's the exact failure mode the S37 row would have provoked.

arnodelorme · 2026-06-04T13:29:51Z

Split values into different columns and document accordingly.

AmanJaiswal1503 · 2026-06-05T18:35:18Z

Hi Arno,

To split the value column into structured columns without inventing meaning, here's what the source already gives us. Across all 42 subjects the column has only 17 unique strings, falling into three classes:

1. Stimulus presentation (902 rows, 6 unique strings)
The strings are already structured key/value triples:

Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1
Category:IMG=ID:FYLLADIO_2.tif=Type:Leaflet_Images_1
...
Category:IMG=ID:FYLLADIO_6.tif=Type:Leaflet_Images_1

Three key:value pairs joined by =. Splitting is purely a parse, no interpretation.

2. Mouse/wheel response (3741 rows, 9 unique strings)

MouseButtonLeft pressed     MouseButtonLeft released
MouseButtonMiddle pressed   MouseButtonMiddle released
MouseButtonRight pressed    MouseButtonRight released
MouseWheelDown120 pressed
MouseWheelUp120 pressed

Each is <device> <action> separated by a space.

3. Trial markers (168 rows, 2 strings)
fixation_cross, EOE — atomic, no internal structure to split.

Proposed columns

Keep value verbatim, add five columns:

column	content	notes
`event_class`	`stimulus` / `response` / `marker`	derived from which vocabulary the cell matches
`stim_id`	the `ID:` field for stimulus rows (e.g. `FYLLADIO_1.tif`)	`n/a` otherwise
`stim_category`	the `Type:` field for stimulus rows (e.g. `Leaflet_Images_1`)	`n/a` otherwise
`device`	`MouseButtonLeft` / `MouseWheelDown120` / etc. for response rows	`n/a` otherwise
`action`	`pressed` / `released` for response rows	`n/a` otherwise

Each new cell is either a literal substring of the original value or a class label that follows unambiguously from a regex match, so the row-level mapping is 100% defensible. task-unnamed_events.json would document the five new columns.

Sample before/after

Before:

onset            duration   sample   value
175.97000        1.0000     52791    Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1
1.31000          1.0000     393      MouseButtonLeft pressed
6.25667          1.0000     1877     fixation_cross

After:

onset            duration   sample   value                                                  event_class   stim_id            stim_category        device              action
175.97000        1.0000     52791    Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1   stimulus      FYLLADIO_1.tif     Leaflet_Images_1     n/a                 n/a
1.31000          1.0000     393      MouseButtonLeft pressed                                response      n/a                n/a                  MouseButtonLeft     pressed
6.25667          1.0000     1877     fixation_cross                                         marker        n/a                n/a                  n/a                 n/a
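The mapping shown in the sample can be sketched as a pure regex parse. This is an illustrative Python sketch, not the script used for the PR: the names `split_value`, `STIM`, `RESP` and `MARKERS` are hypothetical, and the patterns assume the three vocabularies listed above are exhaustive.

```python
import re

# Stimulus rows: three key:value pairs joined by "=".
STIM = re.compile(
    r"^Category:(?P<category>[^=]+)=ID:(?P<stim_id>[^=]+)=Type:(?P<stim_category>[^=]+)$"
)
# Response rows: "<device> <action>" separated by a single space.
RESP = re.compile(r"^(?P<device>Mouse\w+) (?P<action>pressed|released)$")
# Trial markers are atomic strings.
MARKERS = {"fixation_cross", "EOE"}
NA = "n/a"

def split_value(value: str) -> dict:
    """Project one `value` cell onto the proposed columns."""
    row = {"event_class": NA, "stim_id": NA, "stim_category": NA,
           "device": NA, "action": NA}
    match = STIM.match(value)
    if match:
        row.update(event_class="stimulus", stim_id=match["stim_id"],
                   stim_category=match["stim_category"])
        return row
    match = RESP.match(value)
    if match:
        row.update(event_class="response", device=match["device"],
                   action=match["action"])
        return row
    if value in MARKERS:
        row["event_class"] = "marker"
        return row
    # Refuse to guess: an unknown label is an error, not a new class.
    raise ValueError(f"unrecognized event label: {value!r}")

for value in ["Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1",
              "MouseButtonLeft pressed", "fixation_cross"]:
    print(split_value(value))
```

Raising on an unmatched label keeps the parse from silently inventing a class for strings outside the observed vocabulary.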

Two questions before I push the rewrite:

Are these column names OK? (event_class, stim_id, stim_category, device, action)
For stimulus rows the Category: field is always literally IMG (every cell). Do you want it as its own column too, or is that redundant?

Thanks!

arnodelorme · 2026-06-05T18:41:00Z

On Jun 5, 2026, at 20:35, Aman Jaiswal wrote:

• Are these column names OK? (event_class, stim_id, stim_category, device, action)

Yes, they are good. I think device and action could be grouped.

• For stimulus rows the Category: field is always literally IMG (every cell). Do you want it as its own column too, or is that redundant?

Yes, not needed if it does not have information. Arno

…id, stim_category

Maintainer feedback (Arno) on PR #1 asked for the `value` column to be split into structured columns. The source `value` field carries three vocabularies (stimulus / response / marker), each with internal structure that can be parsed mechanically.

Adds three new columns to every events.tsv (84 files: 42 subjects × 2 modalities, eeg/ and eye_tracker/):

- event_class: stimulus / response / marker (derived from `value`)
- stim_id: filename from `ID:` field for stimulus rows; n/a otherwise
- stim_category: label from `Type:` field for stimulus rows; n/a otherwise

The original `value` column is preserved verbatim alongside the new columns. For response rows, `value` already carries the `<device> <action>` pair so device/action are not re-split into separate columns (per Arno's "device and action could be grouped" reply).

Per-row mapping is purely a regex parse of the source `value`: every new cell is either a literal substring of the original or a class label that follows unambiguously from the regex match. No interpretation, no invented metadata.

task-unnamed_events.json updated to document the three new columns.

dataset_description.json: Version 1.0.2 → 1.0.3.

Validator confirms 0 errors / 1765 warnings (identical breakdown to pre-split: 1722 SIDECAR_KEY_RECOMMENDED + 42 EVENT_ONSET_ORDER + 1 JSON_KEY_RECOMMENDED). Binary .set/.fdt payloads remain byte-identical.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

AmanJaiswal1503 · 2026-06-05T18:55:41Z

Hi Arno,

Pushed the column split as 37e0472. Going with the simpler scheme: three new columns, no merged device/action column (since value already carries the <device> <action> pair verbatim for response rows and event_class=response makes them filterable).

Applied:

Every events.tsv (84 files, 42 subjects × eeg/ + eye_tracker/) now has columns onset, duration, sample, value, event_class, stim_id, stim_category.
value is preserved verbatim; the three new columns are pure regex projections of it.
task-unnamed_events.json documents the three new columns, with a Levels enum on event_class (stimulus / response / marker).
dataset_description.json Version 1.0.2 to 1.0.3.
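For reference, the documented columns can be sketched as a sidecar object with a `Levels` map on event_class. This is a hedged Python sketch, not the committed file: the description strings paraphrase this thread, and the helper `undocumented_columns`, together with the choice to treat only onset and duration as self-describing, is an assumption.

```python
import json

# Columns treated as needing no sidecar entry in this sketch (assumption).
SELF_DESCRIBING = {"onset", "duration"}

sidecar = {
    "sample": {"Description": "Sample index of the event."},
    "value": {"Description": "Event label, free-form text, preserved verbatim."},
    "event_class": {
        "Description": "stimulus / response / marker (derived from value).",
        "Levels": {
            "stimulus": "Image presentation (Category:IMG=... rows).",
            "response": "Mouse / wheel rows.",
            "marker": "fixation_cross, EOE.",
        },
    },
    "stim_id": {"Description": "Filename from the ID: field for stimulus rows; n/a otherwise."},
    "stim_category": {"Description": "Label from the Type: field for stimulus rows; n/a otherwise."},
}

def undocumented_columns(header, sidecar):
    """List header columns that have no entry in the sidecar."""
    return [c for c in header if c not in SELF_DESCRIBING and c not in sidecar]

header = "onset duration sample value event_class stim_id stim_category".split()
print(undocumented_columns(header, sidecar))
print(json.dumps(sidecar["event_class"], indent=2))
```

A check like this, run over every events.tsv header, would flag any column added to the tables without a matching sidecar entry.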

Row counts (4811 total):

902 stimulus (image presentation, Category:IMG=... rows)
3741 response (mouse / wheel rows)
168 marker (fixation_cross, EOE)

Validator state (deno + jsr:@bids/validator): 0 errors, 1765 warnings. Same breakdown as before the split (1722 SIDECAR_KEY_RECOMMENDED + 42 EVENT_ONSET_ORDER + 1 JSON_KEY_RECOMMENDED), so the new columns did not introduce any new findings.

Binary .set / .fdt payloads are byte-identical. Ready for merge.

Thanks!

The previous commit (37e0472) incorrectly bumped Version from 1.0.2 to 1.0.3 in dataset_description.json. Version tracks the dataset's release lineage and is set by NEMAR's Update-to-vX.Y.Z automation, not by curation edits. Our metadata fixes do not constitute a new dataset version, so Version is reverted to 1.0.2 and the corresponding line in the README curation log is removed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The README now has one `## NEMAR curation changes` section describing what the curated dataset looks like vs the OpenNeuro source, with no dated revision blocks and no narration of which edits were earlier vs later. File-grouped subsections list the changes (dataset description, events sidecar + columns, S37 truncated row, participants padding, .bidsignore patterns) neutrally; no maintainer-feedback framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

AmanJaiswal1503 added 3 commits May 22, 2026 12:41

Update on004588 to 1.0.2

e820bc5

Curation corrections: drop nemar-cli GeneratedBy (source declares non…

fed2fc8

…e), set BIDSVersion to validator (1.11.1); rewrite README in plainer language grouped by file (maintainer feedback)

Merge remote-tracking branch 'origin/main' into update/on004588-mphbswkd

35e8115

# Conflicts: # .bidsignore

AmanJaiswal1503 and others added 2 commits June 5, 2026 12:12

AmanJaiswal1503 wants to merge 6 commits into main from update/on004588-mphbswkd
