Update on004588 to v1.0.2#1

Open
AmanJaiswal1503 wants to merge 6 commits into
mainfrom
update/on004588-mphbswkd
@AmanJaiswal1503

Dataset Update

Bumps on004588 from 1.0.1 to 1.0.2.

Changed files

  • .bidsignore
  • README.md
  • dataset_description.json
  • participants.tsv
  • sub-S37/eeg/sub-S37_task-unnamed_events.tsv
  • task-unnamed_events.json

…e), set BIDSVersion to validator (1.11.1); rewrite README in plainer language grouped by file (maintainer feedback)
@AmanJaiswal1503

Author

Claude Review

This PR is a NEMAR curation pass on on004588 (the Neuma neuromarketing dataset), taking the validator from 43 errors + 1850 warnings to 0 errors + 1765 warnings. It bumps Version 1.0.0 → 1.0.2 and BIDSVersion 1.8.0 → 1.11.1. Six files change: dataset_description.json, task-unnamed_events.json, participants.tsv, one subject events table, .bidsignore, and README.md. Walking through what it changes:

dataset_description.json (four field-level edits). DatasetType: "raw" is added — without it the validator falls through to derivative rules and emits cascading warnings, so adding it is correct for a raw-acquisition dataset. BIDSVersion moves from 1.8.0 to 1.11.1 to match the validator the curation is being run against; BIDS minor bumps are backward-compatible so this is safe. ReferencesAndLinks was [""] (a one-element array containing an empty string, which is not a valid URL and is what the validator was complaining about) and is now []. Version is bumped from 1.0.0 to 1.0.2 — note the README mentions the BIDSVersion change but says it moved from 1.2 to 1.11.1, when the actual source value was 1.8.0, and it doesn't mention the Version bump at all; worth tightening up the changelog before merge.
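
The field-level edits above can be sketched in Python. This is a minimal illustration under stated assumptions, not the script used in the PR: the helper name `curate_description` and the `Name` value in the example are hypothetical, and the Version bump is left out because it is a plain assignment.

```python
import json

def curate_description(desc: dict, bids_version: str = "1.11.1") -> dict:
    """Return a copy of a dataset_description dict with the edits applied.

    Hypothetical helper: name and signature are assumptions for illustration.
    """
    out = dict(desc)
    # Declare the dataset as raw if the field is absent.
    out.setdefault("DatasetType", "raw")
    # Align BIDSVersion with the validator the curation runs against.
    out["BIDSVersion"] = bids_version
    # Drop empty or whitespace-only entries such as [""].
    links = out.get("ReferencesAndLinks", [])
    out["ReferencesAndLinks"] = [s for s in links if isinstance(s, str) and s.strip()]
    return out

# "Name" is a placeholder value, not the real dataset name.
before = {"Name": "example", "BIDSVersion": "1.8.0", "ReferencesAndLinks": [""]}
after = curate_description(before)
print(json.dumps(after, indent=2))
```

The function copies the input rather than mutating it, so the original values stay available for a before/after comparison.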

task-unnamed_events.json (rewritten from [] to a proper object). BIDS sidecar JSON files must be objects, not arrays — the original empty array fired a top-level structural error. The rewrite documents the two non-standard columns that the per-recording events tables carry: sample (the sample index) and value (the event label, declared as free-form text because the observed labels span dozens of distinct mouse/keyboard/wheel strings that no reasonable enum would cover). The description on value correctly points to representative observed labels rather than inventing a controlled vocabulary.
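
A sidecar of the shape described above might look like the following when built in Python. The description strings are placeholder wording, not the text committed in the PR; only the two column names (sample, value) and the object-not-array requirement come from the review.

```python
import json

# Descriptions below are placeholder wording, not the text committed in the PR.
events_sidecar = {
    "sample": {
        "Description": "Sample index of the event in the recording.",
    },
    "value": {
        # Free-form text: no Levels enum is declared for this column.
        "Description": (
            "Free-form event label as recorded, e.g. 'MouseButtonLeft pressed'."
        ),
    },
}

text = json.dumps(events_sidecar, indent=2)
# A sidecar must parse back to an object (dict), never to a list.
assert isinstance(json.loads(text), dict)
print(text)
```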

participants.tsv (5 IDs zero-padded). The on-disk subject directories use the zero-padded sub-SNN form (I confirmed: 42 subject directories, all sub-S01 through sub-S44 style). The TSV listed sub-S1, sub-S2, sub-S3, sub-S5, sub-S6 for the first five rows — an inconsistency the validator flags. Those five entries are now sub-S01, sub-S02, sub-S03, sub-S05, sub-S06. I verified row-by-row that only those five cells changed; the remaining 37 rows are byte-identical. participant_id is the only column.
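
The padding fix can be expressed as a small idempotent transform. This is a sketch that assumes every ID follows the sub-S<digits> shape seen in this dataset; the function name `pad_participant_id` is hypothetical.

```python
import re

# Assumes IDs follow the sub-S<digits> shape seen in this dataset.
_ID = re.compile(r"^sub-S(\d+)$")

def pad_participant_id(pid: str, width: int = 2) -> str:
    """Zero-pad the numeric part of a sub-S<digits> ID; leave other strings alone."""
    m = _ID.match(pid)
    if m is None:
        return pid
    return f"sub-S{m.group(1).zfill(width)}"

ids = ["sub-S1", "sub-S2", "sub-S3", "sub-S5", "sub-S6", "sub-S10", "sub-S44"]
print([pad_participant_id(p) for p in ids])
```

Already padded IDs pass through unchanged, so running the transform over the whole column only alters the cells that need it.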

sub-S37/eeg/sub-S37_task-unnamed_events.tsv (final row dropped, 245 data rows preserved). The last row at sample 200123 had MouseButt followed by stray non-UTF-8 bytes (\xbf\x3e\xcf in the original), almost certainly a truncated write of MouseButtonLeft pressed by analogy with row 199902. That one row failed encoding validation and cascaded into missing-column errors on the whole file. The fix drops the row rather than inventing the likely intended label — the right call, since fabricating event data is worse than losing one event. I diffed the first 246 lines of the original against the new 246-line file: byte-identical, so nothing else in the events stream moved.
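
One way to locate a row like this is to decode the file line by line and record where decoding fails. The sketch below is an assumption about how such a check could be written, not the method used in the PR; the sample bytes are synthetic apart from the MouseButt prefix and the stray byte sequence quoted above.

```python
def find_undecodable_rows(raw: bytes) -> list[int]:
    """Return 1-based line numbers whose bytes are not valid UTF-8."""
    bad = []
    for lineno, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(lineno)
    return bad

# Synthetic three-line table; onset and sample values are made up.
sample = (
    b"onset\tduration\tsample\tvalue\n"
    b"1.31\t1.0\t393\tMouseButtonLeft pressed\n"
    b"2.00\t1.0\t600\tMouseButt\xbf\x3e\xcf\n"
)
print(find_undecodable_rows(sample))
```

Working on raw bytes matters here: opening the file in text mode would raise on the first bad byte and hide which row caused it.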

.bidsignore (patterns broadened from anchored to unanchored). Each subject carries a non-BIDS eye_tracker/ modality directory holding the paired ET .set/.fdt recording, and the dataset root carries a QuestionnaireResponses/ folder; neither is a BIDS modality, so both belong on the ignore list. The original **/eye_tracker/** pattern only matches files inside the directory and leaves the directory entry itself flagged as not-included. The new triple */eye_tracker, */eye_tracker/, */eye_tracker/** (and the matching shape for QuestionnaireResponses) covers the directory entry and its contents. At the file level the new patterns are functionally equivalent to the old one; the only difference is that they suppress the directory-entry warning the validator was emitting.

README.md (curation log). Lays out the curation rationale and the "left untouched" list — manufacturer / model / software / institution / cognitive-atlas URIs across the per-subject _eeg.json sidecars are deliberately left blank rather than guessed, which is the right call for study-specific facts the original lab needs to confirm. The remaining 1765 warnings are correctly characterized as recommended-but-missing fields (1722 sidecar entries, 42 event-onset-ordering soft warnings, 1 missing GeneratedBy at the dataset level), not structural defects.

Data integrity

Six files changed, all text: one .json, one .tsv (participants), one _events.tsv (one subject), one .bidsignore, one dataset_description.json, and the README.md. Zero binary files touched — I spot-checked four .set/.fdt blobs (S18 EEG and ET) and confirmed identical git SHAs on main and the PR branch, which is what you'd expect for git-annex symlinks that weren't modified. No _channels.tsv, _electrodes.tsv, or _coordsystem.json files appear in the diff. This dataset has no _scans.tsv files, so the Z-suffix question doesn't arise. Programmatic diff confirms no signal data, event timing, sample indices, or channel-level metadata was altered beyond the single corrupted S37 row.

Recommendations

  • Tighten the README changelog before merge: it says BIDSVersion went from 1.2 to 1.11.1 (the source was actually 1.8.0) and omits the Version bump from 1.0.0 to 1.0.2. The fixes themselves are fine; just the documentation drifted.
  • Consider filing the MouseButt-truncated S37 row upstream with the original lab — it's plausibly recoverable from the raw acquisition log if they still have it, and the event was real, just lost in serialization.
  • The eeg-verify harness UnicodeDecodeError noted in the README's "Harness note" is worth filing upstream if not already tracked, since that's the exact failure mode the S37 row would have provoked.

@arnodelorme

Split values into different columns and document accordingly.

@AmanJaiswal1503

Author

Hi Arno,

To split the value column into structured columns without inventing meaning, here's what the source already gives us. Across all 42 subjects the column has only 17 unique strings, falling into three classes:

1. Stimulus presentation (902 rows, 6 unique strings)
The strings are already structured key/value triples:

Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1
Category:IMG=ID:FYLLADIO_2.tif=Type:Leaflet_Images_1
...
Category:IMG=ID:FYLLADIO_6.tif=Type:Leaflet_Images_1

Three key:value pairs joined by =. Splitting is purely a parse, no interpretation.

2. Mouse/wheel response (3741 rows, 9 unique strings)

MouseButtonLeft pressed     MouseButtonLeft released
MouseButtonMiddle pressed   MouseButtonMiddle released
MouseButtonRight pressed    MouseButtonRight released
MouseWheelDown120 pressed
MouseWheelUp120 pressed

Each is <device> <action> separated by a space.

3. Trial markers (168 rows, 2 strings)
fixation_cross, EOE — atomic, no internal structure to split.

Proposed columns

Keep value verbatim, add five columns:

| column | content | notes |
| --- | --- | --- |
| event_class | stimulus / response / marker | derived from which vocabulary the cell matches |
| stim_id | the ID: field for stimulus rows (e.g. FYLLADIO_1.tif) | n/a otherwise |
| stim_category | the Type: field for stimulus rows (e.g. Leaflet_Images_1) | n/a otherwise |
| device | for response rows, MouseButtonLeft / MouseWheelDown120 / etc. | n/a otherwise |
| action | for response rows, pressed / released | n/a otherwise |

Each new cell is either a literal substring of the original value or a class label that follows unambiguously from a regex match, so every row-level mapping can be traced back to the source. task-unnamed_events.json would document the five new columns.
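
The parse can be sketched in Python as below. This is a minimal sketch, assuming the three vocabularies listed in this comment are exhaustive; the function name `split_value` and the exact regexes are illustrative, not the script used for the rewrite. Strings that match none of the vocabularies fall through with event_class set to n/a, which is an assumption rather than agreed behaviour.

```python
import re

# Stimulus rows: three key:value pairs joined by "=".
STIM = re.compile(
    r"^Category:(?P<category>[^=]+)=ID:(?P<stim_id>[^=]+)=Type:(?P<stim_type>[^=]+)$"
)
# Response rows: "<device> <action>" separated by a single space.
RESP = re.compile(r"^(?P<device>Mouse\S+) (?P<action>pressed|released)$")
MARKERS = {"fixation_cross", "EOE"}
NA = "n/a"

def split_value(value: str) -> dict:
    """Project one value cell onto the proposed columns (hypothetical helper)."""
    row = {"event_class": NA, "stim_id": NA, "stim_category": NA,
           "device": NA, "action": NA}
    if (m := STIM.match(value)):
        row.update(event_class="stimulus",
                   stim_id=m["stim_id"], stim_category=m["stim_type"])
    elif (m := RESP.match(value)):
        row.update(event_class="response",
                   device=m["device"], action=m["action"])
    elif value in MARKERS:
        row["event_class"] = "marker"
    return row

for v in ("Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1",
          "MouseButtonLeft pressed", "fixation_cross"):
    print(split_value(v))
```

Every output cell is either a captured substring of the input or one of the three class labels, which keeps the mapping mechanical.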

Sample before/after

Before:

onset            duration   sample   value
175.97000        1.0000     52791    Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1
1.31000          1.0000     393      MouseButtonLeft pressed
6.25667          1.0000     1877     fixation_cross

After:

onset            duration   sample   value                                                  event_class   stim_id            stim_category        device              action
175.97000        1.0000     52791    Category:IMG=ID:FYLLADIO_1.tif=Type:Leaflet_Images_1   stimulus      FYLLADIO_1.tif     Leaflet_Images_1     n/a                 n/a
1.31000          1.0000     393      MouseButtonLeft pressed                                response      n/a                n/a                  MouseButtonLeft     pressed
6.25667          1.0000     1877     fixation_cross                                         marker        n/a                n/a                  n/a                 n/a

Two questions before I push the rewrite:

  1. Are these column names OK? (event_class, stim_id, stim_category, device, action)
  2. For stimulus rows the Category: field is always literally IMG (every cell). Do you want it as its own column too, or is that redundant?

Thanks!

@arnodelorme

arnodelorme commented Jun 5, 2026 via email

…id, stim_category

Maintainer feedback (Arno) on PR #1 asked for the `value` column to be split
into structured columns. The source `value` field carries three vocabularies
(stimulus / response / marker), each with internal structure that can be
parsed mechanically.

Adds three new columns to every events.tsv (84 files: 42 subjects × 2
modalities — eeg/ and eye_tracker/):
- event_class: stimulus / response / marker (derived from `value`)
- stim_id: filename from `ID:` field for stimulus rows; n/a otherwise
- stim_category: label from `Type:` field for stimulus rows; n/a otherwise

The original `value` column is preserved verbatim alongside the new columns.
For response rows, `value` already carries the `<device> <action>` pair so
device/action are not re-split into separate columns (per Arno's "device and
action could be grouped" reply).

Per-row mapping is purely a regex parse of the source `value` — every new
cell is either a literal substring of the original or a class label that
follows unambiguously from the regex match. No interpretation, no invented
metadata.

task-unnamed_events.json updated to document the three new columns.
dataset_description.json: Version 1.0.2 → 1.0.3.

Validator confirms 0 errors / 1765 warnings (identical breakdown to
pre-split: 1722 SIDECAR_KEY_RECOMMENDED + 42 EVENT_ONSET_ORDER + 1
JSON_KEY_RECOMMENDED). Binary .set/.fdt payloads remain byte-identical.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@AmanJaiswal1503

Author

Hi Arno,

Pushed the column split as 37e0472. Going with the simpler scheme: three new columns, no merged device/action column (since value already carries the <device> <action> pair verbatim for response rows and event_class=response makes them filterable).

Applied:

  • Every events.tsv (84 files, 42 subjects × eeg/ + eye_tracker/) now has columns onset, duration, sample, value, event_class, stim_id, stim_category.
  • value is preserved verbatim; the three new columns are pure regex projections of it.
  • task-unnamed_events.json documents the three new columns, with a Levels enum on event_class (stimulus / response / marker).
  • dataset_description.json Version 1.0.2 to 1.0.3.
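
A check that the event_class cells stay within the declared Levels could look like the sketch below. The level descriptions are placeholder wording, not the committed sidecar text, and the helper name `undeclared_levels` is hypothetical; the two sample rows reuse values from the before/after example earlier in this thread.

```python
import csv
import io

# Level descriptions are placeholder wording, not the committed sidecar text.
EVENT_CLASS = {
    "Description": "Event class derived from the value column.",
    "Levels": {
        "stimulus": "Image presentation row.",
        "response": "Mouse or wheel row.",
        "marker": "Trial marker such as fixation_cross or EOE.",
    },
}

def undeclared_levels(tsv_text: str, column: str, spec: dict) -> set[str]:
    """Return values found in `column` that the sidecar Levels do not declare."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = {row[column] for row in reader}
    return seen - set(spec["Levels"])

tsv = (
    "onset\tduration\tsample\tvalue\tevent_class\tstim_id\tstim_category\n"
    "1.31000\t1.0000\t393\tMouseButtonLeft pressed\tresponse\tn/a\tn/a\n"
    "6.25667\t1.0000\t1877\tfixation_cross\tmarker\tn/a\tn/a\n"
)
print(undeclared_levels(tsv, "event_class", EVENT_CLASS))
```

An empty result means every cell in the column is covered by the enum.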

Row counts (4811 total):

  • 902 stimulus (image presentation, Category:IMG=... rows)
  • 3741 response (mouse / wheel rows)
  • 168 marker (fixation_cross, EOE)

Validator state (deno + jsr:@bids/validator): 0 errors, 1765 warnings. Same breakdown as before the split (1722 SIDECAR_KEY_RECOMMENDED + 42 EVENT_ONSET_ORDER + 1 JSON_KEY_RECOMMENDED), so the new columns did not introduce any new findings.

Binary .set / .fdt payloads are byte-identical. Ready for merge.

Thanks!

AmanJaiswal1503 and others added 2 commits June 5, 2026 12:12
The previous commit (37e0472) incorrectly bumped Version from 1.0.2 to
1.0.3 in dataset_description.json. Version tracks the dataset's release
lineage and is set by NEMAR's Update-to-vX.Y.Z automation, not by
curation edits. Our metadata fixes do not constitute a new dataset
version, so Version is reverted to 1.0.2 and the corresponding line in
the README curation log is removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The README now has one `## NEMAR curation changes` section describing what
the curated dataset looks like vs the OpenNeuro source, with no dated
revision blocks and no narration of which edits were earlier vs later.
File-grouped subsections list the changes (dataset description, events
sidecar + columns, S37 truncated row, participants padding, .bidsignore
patterns) neutrally; no maintainer-feedback framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>