Make LD parser more resilient

https://github.com/indix/web-auto-extractor/blob/2d15ce45c8a8ef8387f5c8035817fe45a607081c/src/parsers/jsonld-parser.js#L8-L22

The current JSON-LD parser assumes that every `application/ld+json` script contains clean, valid JSON and nothing else. In practice that is often not the case:

* Some websites (e.g. www.empireonline.com/) have JSON that includes raw newlines, which makes it invalid JSON (presumably inside string values, since newlines between tokens are valid whitespace).
* Some websites (e.g. Variety) have JSON that is wrapped in CDATA comments, e.g. https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2. The payload also has a trailing `;` after the JSON.
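To make the two failure modes concrete, here is a minimal sketch of how `JSON.parse` reacts to each. The payloads are made up to resemble the cases above (they are not copied from the real sites), and `tryParse` is a hypothetical helper used only for this illustration.

```js
// Hypothetical payloads modelled on the two cases above.
// The \n in the first literal becomes a raw newline inside a JSON string value.
const withRawNewline = '{"@type": "Movie", "description": "line one\nline two"}';
const withCdata = '//<![CDATA[\n{"@type": "Article", "headline": "Example"};\n//]]>';

// Hypothetical helper: reports whether JSON.parse accepts the text.
const tryParse = (text) => {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (error) {
    return { ok: false, message: error.message };
  }
};

console.log(tryParse(withRawNewline).ok); // false
console.log(tryParse(withCdata).ok); // false

// A newline between tokens is ordinary whitespace and parses fine.
console.log(tryParse('{\n"@type": "Movie"\n}').ok); // true
```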

Here is how I've implemented an LD+JSON parser in my local project:

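```js
// Assumed imports: the snippet relies on jsdom and a `debug` logger.
import { JSDOM } from 'jsdom';
import createDebug from 'debug';

// The namespace is a placeholder.
const debug = createDebug('jsonld-parser');

// Flow-typed: returns one parsed object per ld+json script tag.
export default (html: string): $ReadOnlyArray<Object> => {
  const dom = new JSDOM(html);

  const nodes = Array.from(dom.window.document.querySelectorAll('script[type="application/ld+json"]'));

  return nodes.map((node) => {
    if (!node || typeof node.innerHTML !== 'string') {
      throw new TypeError('Unexpected content.');
    }

    let body = node.innerHTML;

    debug('body', body);

    // Some websites (e.g. Empire) have JSON that includes raw newlines, i.e. invalid JSON.
    body = body.replace(/\n/g, '');

    // Some websites (e.g. Variety) have JSON that is wrapped in CDATA comments, e.g.
    // https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2
    // Slicing from the first `{` to the last `}` also drops the trailing `;`.
    body = body.slice(body.indexOf('{'), body.lastIndexOf('}') + 1);

    return JSON.parse(body);
  });
};
```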

Thus far it works with all the sites I have been testing.
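One caveat with slicing from the first `{` to the last `}`: it assumes the payload is a single top-level object, so a top-level array of objects would be cut down to something that no longer parses. Below is a rough sketch of a variant that accepts both shapes. `parseLdJsonBody` is a hypothetical name, and the approach assumes the JSON data itself never contains the CDATA markers.

```js
// Hypothetical helper: parses the text of one ld+json script tag.
const parseLdJsonBody = (body) => {
  const cleaned = body
    // Remove CDATA markers first; `<![CDATA[` itself contains `[`,
    // which would otherwise be mistaken for the start of an array.
    .replace(/<!\[CDATA\[|\]\]>/g, '')
    // Drop raw line breaks (both \n and \r), as in the snippet above.
    .replace(/[\r\n]+/g, '');

  // Locate the first opening bracket of either kind.
  const start = cleaned.search(/[{\[]/);
  if (start === -1) {
    throw new SyntaxError('No JSON object or array found.');
  }

  // Match it with the last closing bracket of the same kind.
  const closing = cleaned[start] === '{' ? '}' : ']';
  const end = cleaned.lastIndexOf(closing);
  if (end < start) {
    throw new SyntaxError('Unterminated JSON object or array.');
  }

  // Anything outside [start, end], such as `//` or a trailing `;`, is discarded.
  return JSON.parse(cleaned.slice(start, end + 1));
};

console.log(parseLdJsonBody('//<![CDATA[\n{"@type": "Article"};\n//]]>'));
console.log(parseLdJsonBody('[{"@type": "A"}, {"@type": "B"}]').length); // 2
```

Note that stripping line breaks joins the text on either side of a raw newline inside a string value, which is the same trade-off the original snippet makes.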

For reference, this is the snippet from the current parser that the permalink above points to (`src/parsers/jsonld-parser.js`, lines 8-22):

	$html('script[type="application/ld+json"]').each((index, item) => {
	  try {
	    let parsedJSON = JSON.parse($(item).text())
	    if (!Array.isArray(parsedJSON)) {
	      parsedJSON = [parsedJSON]
	    }
	    parsedJSON.forEach(obj => {
	      const type = obj['@type']
	      jsonldData[type] = jsonldData[type] || []
	      jsonldData[type].push(obj)
	    })
	  } catch (e) {
	    console.log(`Error in jsonld parse - ${e}`)
	  }
	})
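The current parser has one property worth keeping: it wraps each script tag in its own `try`/`catch`, so one malformed block does not discard the others. The sketch below keeps that behaviour and the grouping by `@type`, but makes the parsing step injectable so a more tolerant parser can be swapped in. `groupJsonLdByType` and its `parseBody` parameter are hypothetical names, and the function takes plain strings rather than DOM nodes so it can run on its own.

```js
// Hypothetical helper mirroring the grouping logic of the current parser.
// `bodies` is an array of script-tag contents; `parseBody` defaults to JSON.parse.
const groupJsonLdByType = (bodies, parseBody = JSON.parse) => {
  const jsonldData = {};

  for (const body of bodies) {
    try {
      const parsed = parseBody(body);
      // Normalise a single object to a one-element array.
      const items = Array.isArray(parsed) ? parsed : [parsed];

      for (const item of items) {
        const type = item['@type'];
        jsonldData[type] = jsonldData[type] || [];
        jsonldData[type].push(item);
      }
    } catch (error) {
      // A bad block is logged and skipped; the remaining blocks are still processed.
      console.log(`Error in jsonld parse - ${error}`);
    }
  }

  return jsonldData;
};

// Usage with a tolerant parser based on the slicing approach from this issue.
const tolerantParse = (body) => {
  const flattened = body.replace(/\n/g, '');
  return JSON.parse(flattened.slice(flattened.indexOf('{'), flattened.lastIndexOf('}') + 1));
};

const grouped = groupJsonLdByType(
  ['{"@type": "Movie", "name": "A"};', 'not json at all'],
  tolerantParse
);
console.log(Object.keys(grouped)); // [ 'Movie' ]
```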
