I'm trying to read the JSON string that is inside the <pre> element here:

http://nlp.stanford.edu:8080/corenlp/process?input=hello%20world&outputFormat=json

If I copy-paste the string with the mouse, I can JSON.parse() it. But if I read it programmatically, I get an error.

Here is my code:

var request = require('request'); // to make POST requests
var Entities = require('html-entities').AllHtmlEntities; // to decode the json string (i.e. get rid of nbsp and quot's)
var fs = require('fs')
// Set the headers
var headers = {
    'User-Agent': 'Super Agent/0.0.1',
    'Content-Type': 'application/x-www-form-urlencoded'
}

// Configure the request
var options = {
    url: 'http://nlp.stanford.edu:8080/corenlp/process',
    method: 'POST',
    headers: headers,
    form: {
        'input': 'hello world',
        'outputFormat': 'json'
    }
}

// Start the request
request(options, function(error, response, body) {
    if (!error && response.statusCode == 200) {
        // Print out the response body
        console.log("body: " + body)
        let cheerio = require('cheerio')
        let $ = cheerio.load(body)
        var inside = $('pre').text();
        inside = Entities.decode(inside.toString());
        //console.log("inside "+ inside);
        var obj = JSON.parse(inside);
        console.log(obj);
    }
})

But I get the following error:

undefined:2
  "sentences": [
^

SyntaxError: Unexpected token   in JSON at position 2
    at JSON.parse (<anonymous>)

And here is an excerpt from the output of the link, i.e. what I want to parse into obj:

{
&nbsp;&nbsp;&quot;sentences&quot;: [
&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&quot;index&quot;: &quot;0&quot;,
...
&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;]
}

How can I JSON.parse() such a string?

Thanks,

  • @JaredSmith Thanks, I already tried that, it's included in my code. But I'm still failing to parse it correctly. Commented Jan 13, 2017 at 17:31
  • what does the decoded string look like? JSON.parse shouldn't care about whitespace... Commented Jan 13, 2017 at 17:35
  • It looks like a regular string when I print on the console. But in the very beginning, the space before { is diagnosed as an unexpected token. Commented Jan 13, 2017 at 17:38
  • Try trimming the string prior to parsing to remove the leading whitespace. It shouldn't matter, but it's worth trying since the error is pointing you to that spot. Commented Jan 13, 2017 at 17:41

2 Answers

Final Answer

Both the output and the error you posted point to a problem parsing the whitespace character right after the opening JSON brace. I suggest removing all whitespace that is not inside quotes.

As follows:

var obj = JSON.parse(str.replace(/(\s+?(?={))|(^\s+)|(\r|\n)|((?=[\[:,])\s+)/gm,''));
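
To see why this works on output like the one in the question, note that JavaScript's \s also matches U+00A0, the character that &nbsp; decodes to. Here is a minimal sketch, assuming a shortened, hand-written sample (sample is an illustrative stand-in for the decoded <pre> text, not the real server response):

```javascript
// Hypothetical sample: decoded <pre> text, indented with U+00A0 (what &nbsp; becomes).
const NBSP = '\u00A0';
const sample = [
  '{',
  NBSP.repeat(2) + '"sentences": [',
  NBSP.repeat(4) + '{',
  NBSP.repeat(6) + '"index": "0"',
  NBSP.repeat(4) + '}',
  NBSP.repeat(2) + ']',
  '}'
].join('\n');

// Same regex as above: it strips line breaks and per-line indentation,
// and \s covers U+00A0 as well as ordinary spaces.
const cleaned = sample.replace(/(\s+?(?={))|(^\s+)|(\r|\n)|((?=[\[:,])\s+)/gm, '');
const obj = JSON.parse(cleaned);
console.log(obj.sentences[0].index);
```

The regular space after each colon is left alone, which is fine because JSON.parse accepts U+0020 between tokens.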

Original Answer

I suggest you remove all white spaces.

So, var obj = JSON.parse(inside.replace(/\s/g,'')); should work

Here is a JSFiddle example

EDIT

Better: var obj = JSON.parse(str.replace(/(\s+?(?={))|(^\s+)|(\r|\n)|((?=[\[:,])\s+)/gm,'')); will leave spaces inside quotes as they are, which matters since the "parse" field has spaces in its value

4 Comments

This will replace all whitespace... including those in string values in the json.
I tried before with inside = inside.replace(/(\r\n|\n|\r)/gm,"");, which didn't work. However, this works! In this case, it looks like removing spaces is fine, since every element is a word that does not contain any spaces. So, thanks.. :)
@JaredSmith Thanks for your comment. edited my answer.
@remdevtec indeed, +1 to you

The problem is all of those &nbsp;s. Each one represents a non-breaking space character, U+00A0. Unfortunately, JSON.parse (correctly) chokes on those characters because the JSON spec (RFC 4627, since superseded by RFC 8259 with the same whitespace rule) treats only regular spaces (U+0020), tabs (U+0009), line feeds (U+000A), and carriage returns (U+000D) as whitespace.

You could do the hacky thing, which is to replace every U+00A0 with U+0020, but that would also affect non-breaking spaces inside of strings, which is not ideal.
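
For illustration, the blunt replacement could look like the sketch below. The raw string is a made-up input, not the actual CoreNLP response; it uses U+00A0 both as indentation and inside a string value to show the side effect:

```javascript
// Hypothetical input: U+00A0 as indentation and inside a string value.
const raw = '{\n\u00A0\u00A0"text": "hello\u00A0world"\n}';

let failed = false;
try {
  JSON.parse(raw); // U+00A0 is not JSON whitespace, so this throws
} catch (e) {
  failed = e instanceof SyntaxError;
}

// The blunt fix: turn every U+00A0 into a regular space (U+0020).
const parsed = JSON.parse(raw.replace(/\u00A0/g, ' '));
console.log(failed);
// The non-breaking space inside the value has been changed to a plain space too.
console.log(parsed.text === 'hello world');
```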

The best way to handle input data like this would be to use a JSON parsing library that is more tolerant of other kinds of whitespace characters.
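
If pulling in another library is not an option, a similar effect can be approximated by normalizing U+00A0 only outside string literals before calling JSON.parse. This is a rough sketch, not a full JSON tokenizer; normalizeNbsp is a hypothetical helper name and the input string is invented for the example:

```javascript
// Hypothetical helper: replace U+00A0 with U+0020 only outside JSON string literals.
function normalizeNbsp(text) {
  let out = '';
  let inString = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      out += ch;
      if (ch === '\\') {
        // Copy the escaped character verbatim so an escaped quote does not end the string.
        i++;
        if (i < text.length) out += text[i];
      } else if (ch === '"') {
        inString = false;
      }
    } else if (ch === '"') {
      inString = true;
      out += ch;
    } else {
      out += ch === '\u00A0' ? ' ' : ch;
    }
  }
  return out;
}

// Made-up input: U+00A0 indentation plus a U+00A0 and escaped quotes inside a value.
const input = '{\n\u00A0\u00A0"text": "say \\"hi\\"\u00A0now"\n}';
const result = JSON.parse(normalizeNbsp(input));
console.log(result.text === 'say "hi"\u00A0now');
```

Unlike the blanket replacement, this keeps any non-breaking spaces that are part of the data itself.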


Why aren't you running your own copy of CoreNLP? I imagine they don't want you scraping their server.

1 Comment

OMG, I didn't know they made a Node.js wrapper! Thank you so much!
