$html('script[type="application/ld+json"]').each((index, item) => {
  try {
    let parsedJSON = JSON.parse($(item).text())
    if (!Array.isArray(parsedJSON)) {
      parsedJSON = [parsedJSON]
    }
    parsedJSON.forEach(obj => {
      const type = obj['@type']
      jsonldData[type] = jsonldData[type] || []
      jsonldData[type].push(obj)
    })
  } catch (e) {
    console.log(`Error in jsonld parse - ${e}`)
  }
})
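The current JSON-LD parser assumes a perfect-world scenario: it passes each script body straight to JSON.parse, so any block that is not strictly valid JSON is logged and dropped.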
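To make the failure modes concrete, here is a small sketch of script bodies that JSON.parse rejects as-is. The sample strings are hypothetical, modelled on the kinds of markup described in this issue rather than copied from any real site, and `parsesAsIs` is a helper name introduced only for this illustration.

```javascript
// Hypothetical script bodies; none of these are taken from a real page.
const samples = {
  valid: '{"@type": "Movie", "name": "Alien"}',
  // A raw line break inside a string literal is not allowed by the JSON grammar.
  newlineInString: '{"@type": "Movie", "name": "Alien:\nCovenant"}',
  // CDATA markers wrapped around the payload.
  cdataWrapped: '//<![CDATA[\n{"@type": "Movie", "name": "Alien"}\n//]]>',
  // A stray semicolon after the closing brace.
  trailingSemicolon: '{"@type": "Movie", "name": "Alien"};',
};

// Returns true when JSON.parse accepts the body unchanged, false when it throws.
const parsesAsIs = (body) => {
  try {
    JSON.parse(body);
    return true;
  } catch (error) {
    return false;
  }
};

const results = Object.fromEntries(
  Object.entries(samples).map(([name, body]) => [name, parsesAsIs(body)])
);

console.log(results);
// → { valid: true, newlineInString: false, cdataWrapped: false, trailingSemicolon: false }
```

With the existing parser, each of the three failing bodies would end up in the catch branch and contribute nothing to the extracted data.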
Here is how I've implemented an LD+JSON parser in my local project:
(html: string): $ReadOnlyArray<Object> => {
  const dom = new JSDOM(html);
  const nodes = Object.values(dom.window.document.querySelectorAll('script[type="application/ld+json"]'));

  return nodes.map((node) => {
    if (!node || typeof node.innerHTML !== 'string') {
      throw new TypeError('Unexpected content.');
    }

    let body = node.innerHTML;

    debug('body', body);

    // Some websites (e.g. Empire) have JSON that includes newlines, i.e. invalid JSON.
    body = body.replace(/\n/g, '');

    // Some websites (e.g. Variety) have JSON that is wrapped in CDATA comments, e.g.
    // https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2
    body = body.slice(body.indexOf('{'), body.lastIndexOf('}') + 1);

    return JSON.parse(body);
  });
};
Thus far it works with all the sites I have been testing.
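As a rough sketch of how the two clean-up steps above could be folded into the existing group-by-@type logic, the following works on raw script bodies only, so it needs neither cheerio nor JSDOM. The names `cleanBody`, `parseLenient` and `groupJsonLd` are hypothetical. One deliberate difference from the snippet above: the body is parsed unchanged first and the clean-up is only used as a fallback. This is an assumption on my part, made because slicing from the first `{` to the last `}` would turn a valid top-level array into invalid JSON.

```javascript
// Applies the two clean-up steps: strip newlines, then keep only the text
// between the first '{' and the last '}'.
const cleanBody = (body) => {
  const withoutNewlines = body.replace(/\n/g, '');
  return withoutNewlines.slice(
    withoutNewlines.indexOf('{'),
    withoutNewlines.lastIndexOf('}') + 1
  );
};

// Tries the body unchanged first, so valid input (including top-level arrays)
// is never altered; falls back to the cleaned body otherwise.
const parseLenient = (body) => {
  try {
    return JSON.parse(body);
  } catch (error) {
    return JSON.parse(cleanBody(body));
  }
};

// Groups JSON-LD objects by '@type', mirroring the existing parser.
// `bodies` is an array of raw <script> contents; HTML extraction is left out.
const groupJsonLd = (bodies) => {
  const grouped = {};
  bodies.forEach((body) => {
    try {
      let parsed = parseLenient(body);
      if (!Array.isArray(parsed)) {
        parsed = [parsed];
      }
      parsed.forEach((obj) => {
        const type = obj['@type'];
        grouped[type] = grouped[type] || [];
        grouped[type].push(obj);
      });
    } catch (error) {
      // Still invalid after clean-up: skip this body, as the existing parser does.
    }
  });
  return grouped;
};

// Usage with hypothetical bodies.
const grouped = groupJsonLd([
  '//<![CDATA[\n{"@type": "Movie", "name": "A"}\n//]]>',
  '[{"@type": "Person", "name": "B"}, {"@type": "Movie", "name": "C"}]',
  'not json',
  '{"@type": "Movie", "name": "D"};',
]);

console.log(Object.keys(grouped));
```

Because the slice ends at the last `}`, the fallback also discards anything trailing the object, such as a stray semicolon.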
web-auto-extractor/src/parsers/jsonld-parser.js
Lines 8 to 22 in 2d15ce4