What could possibly go wrong with JSON parsing?

If you have been using JSON for years but have never wondered about the origins of JSON or its nuances, you would be forgiven. It’s not your fault. That’s JSON being true to its purpose.

json.org describes JSON as “easy for humans to read and write” and “easy for machines to parse and generate”. It was designed to be this simple so that you wouldn’t bat an eye when using it. Famously, when Douglas Crockford “discovered” JSON in the early 2000s, he printed the entire JSON grammar on business cards and shared them with people to popularise its adoption.

As adoption grew, specifications began to emerge. The good guys on the internet got together to put out a specification to stop the world from exploding. Today, most languages refer to either RFC 7159 (which came out in 2014) or RFC 8259 (which came out in 2017 and fixed a couple of minor things in RFC 7159). These RFCs essentially describe how the encoder and decoder should behave when presented with different kinds of input.

JSON does an almost perfect job of being the de-facto data exchange format over the network. But it does have some nuances. And revealing those nuances is the intention of this post.

Nuances

There are two types of issues that crop up when using JSON.

Type 1: Specification vs Implementation

The first issue relates to how different languages implement the JSON specification. While it’s true that there is a specification, it turns out that it leaves a lot of teeny-tiny details up to the language developers.

One interesting example of this freedom is that the implementing language may declare a maximum length of string that it considers valid. So technically, even a parser that accepts strings of at most 3 characters would still count as implementing the specification.

A direct consequence of this is interoperability issues between different languages, or even different versions of the same language.

This issue has been explored with extraordinary effort by Nicolas Seriot in his blog post. Do give it a read, but this figure conveys the point I want to make.

On the left-hand side are different kinds of tests, some straightforward, which almost all languages pass (for example, '{"id":0', with an unclosed bracket, is not valid JSON and should be rejected by all parsers), and some corner cases (for example, '{"id":0,}', with a trailing comma, is considered valid by some languages but not by others). On the top are the different languages that were tested. On the right are examples of the tests. Each colored square represents one of the six possible outcomes.
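To see how Go handles those two examples concretely, here is a small sketch (the helper name tryParse is mine):

```go
package main

import (
    "encoding/json"
    "fmt"
)

// tryParse reports whether Go's encoding/json accepts the input.
func tryParse(s string) bool {
    var v interface{}
    return json.Unmarshal([]byte(s), &v) == nil
}

func main() {
    fmt.Println(tryParse(`{"id":0`))   // unclosed bracket: rejected
    fmt.Println(tryParse(`{"id":0,}`)) // trailing comma: also rejected by Go
    fmt.Println(tryParse(`{"id":0}`))  // valid JSON: accepted
}
```

Go happens to reject the trailing comma; a parser in another language may accept it, and both are within the letter of the specification.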

Type 2: Dynamically typed vs Statically typed languages

The second kind of problem arises from the inherent differences between dynamically typed languages and statically typed languages, and the development behaviour that they induce in coders.

Since most of my work involves PHP and Golang, I will use examples from these languages. But the same would hold for any JSON communication where the encoder is written in a dynamically typed language and the decoder in a statically typed one.

JSON & PHP

PHP predates JSON by about 6 years, so they share a lot of history together. As such, PHP has implemented the JSON specification quite well. In Seriot’s tests, PHP (7.0.10) did not throw any “Should have succeeded but failed” or “Should have failed but succeeded” errors.

According to php.net, PHP implements a superset of RFC 7159 specification.

JSON & Golang

JSON is handled by the “encoding/json” package in Go. It implements the RFC 7159 specification, passes almost all of Seriot’s tests, and throws none of the serious errors.

All this looks good. Where is the problem then?

I will show with a few examples.

First we will set up our encoder and decoder scripts. They are fairly trivial.

Encoder (PHP):

<?php
$to_encode = 1;
$data = ['a' => $to_encode];
echo json_encode($data)."\n";
?>

We have a value $to_encode that we want to encode to JSON format.

And here is the decoder of this encoded value:

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
)

// "a" is a tag used to tell the json package
// to look for the "a" key in the received JSON
type B struct {
    A int `json:"a"`
}

func main() {
    // read input from stdin
    reader := bufio.NewReader(os.Stdin)
    str, _ := reader.ReadString('\n')
    // the json string
    fmt.Print("json: ", str)
    b := B{}
    err := json.Unmarshal([]byte(str), &b)
    if err != nil {
        fmt.Println("go: error occurred -", err)
        return
    }
    // decoded value
    fmt.Printf("go: %+v", b)
    fmt.Println()
}

To run the two scripts together, run on bash:

> php encode.php | go run decode_copy.go
json: {"a":1}
go: {A:1}

In the above two scripts, the parts we will be playing with are the encoded value and the struct definition. I will present the different examples in the following format:

$to_encode = 1; // PHP

// Go
type B struct {
    A int `json:"a"`
}

// Bash output
json: {"a":1}
go: {A:1}

We encoded the number 1 in PHP, and we got an int in Go. Simple enough.

As we modify $to_encode in our examples, we will have to modify A’s type in struct B for decoding to work.

Problem #1: representability of large numbers

Small numbers in PHP are passed as is in JSON, while integers too large for PHP_INT_MAX become floats, which json_encode emits in scientific notation. In Go, the former can be accommodated in an int, but the latter can be represented only as a float value.

Works:

$to_encode = 123; // PHP

// Go
type B struct {
    A int `json:"a"`
}

// Bash output
json: {"a":123}
go: {A:123}

Fails:

$to_encode = 123123123123123123123; // PHP

// Go
type B struct {
    A int `json:"a"`
}

// Bash output
json: {"a":1.2312312312312313e+20}
go: error occurred - json: cannot unmarshal number 1.2312312312312313e+20 into Go struct field B.a of type int

This problem can easily occur if testing is done with small numbers while production traffic carries big ones.

Solution:

One straightforward solution here is to always use float64 values while decoding to get maximum representability. But as shown in Problem #2, that also has its issues.

Problem #2: numbers as strings

Since PHP is dynamically typed, developers often do not worry about the data type they are working with, especially if it is being read from the DB and passed along in an API response.

Works (with some precision loss):

$to_encode = 3.141592653589793238462643383279; // PHP

// Go
type B struct {
    A float64 `json:"a"`
}

// Bash output
json: {"a":3.141592653589793}
go: {A:3.141592653589793}

Fails:

$to_encode = "3.141592653589793238462643383279"; // PHP

// Go
type B struct {
    A float64 `json:"a"`
}

// Bash output
json: {"a":"3.141592653589793238462643383279"}
go: error occurred - json: cannot unmarshal string into Go struct field B.a of type float64

This problem can occur if, for example, the value passed was being read from some IO, and some developer, while introducing changes, did a strval() on the variable. While the PHP code may continue to run as expected, consumers of the API would start breaking. Quite a big price to pay for such a small miss.

Possible solutions:

This problem has multiple solutions.

  1. One way to solve this would be to use the string tag, which indicates to the json package that the value will arrive as a string in the received response.

Resolved:

$to_encode = "3.141592653589793238462643383279"; // PHP

// Go
type B struct {
    A float64 `json:"a,string"`
}

// Bash output
json: {"a":"3.141592653589793238462643383279"}
go: {A:3.141592653589793}

However, in this case, if the encoder starts sending a float/int/bool instead of a string, then the parsing would fail.

Fails:

$to_encode = 3.141592653589793238462643383279; // PHP

// Go
type B struct {
    A float64 `json:"a,string"`
}

// Bash output
json: {"a":3.141592653589793}
go: error occurred - json: invalid use of ,string struct tag, trying to unmarshal unquoted value into float64

So, not the best resolution.

2. Another way to solve this is to always use the JSON_NUMERIC_CHECK option in PHP for all APIs. This ensures that any field with a number-like format is converted to a JSON number. (Beware, though: it coerces every number-like string, so values such as zero-padded IDs would lose their leading zeros.)

<?php
$to_encode = "3.141592653589793238462643383279";
$data = ['a' => $to_encode];
echo json_encode($data, JSON_NUMERIC_CHECK)."\n";
?>

So now this works:

$to_encode = "3.141592653589793238462643383279"; // PHP

// Go
type B struct {
    A float64 `json:"a"`
}

// Bash output
json: {"a":3.141592653589793}
go: {A:3.141592653589793}

3. However, the solution I like the most, because it puts the responsibility on the consumer side, is using json.Number from Go’s json package.

$to_encode = "3.141592653589793238462643383279"; // PHP

// Go
type B struct {
    A json.Number `json:"a"`
}

// Bash output
json: {"a":"3.141592653589793238462643383279"}
go: {A:3.141592653589793238462643383279}

json.Number is a string type under the hood. While decoding, if a field is declared as json.Number, then both JSON numbers and strings containing JSON numbers are stored in the field verbatim. The json.Number type also exposes Float64(), Int64() and String() methods to extract the value with the required type.

Problem #3: booleans as numbers/strings

In PHP, it is possible to use non-boolean values in a boolean context. According to php.net, the following values are considered false: the boolean false itself, the integers 0 and -0, the floats 0.0 and -0.0, the empty string "" and the string "0", an array with zero elements, and NULL. Every other value is considered true.

The most common non-boolean values used as booleans are the integers 1 and 0, standing in for true and false.

It is also possible that an API starts sending "True" and "False", the string versions, in the response. That can lead to parsing failures.

So while this works:

$to_encode = True; // PHP

// Go
type B struct {
    A bool `json:"a"`
}

// Bash output
json: {"a":true}
go: {A:true}

This fails:

$to_encode = 1; // PHP

// Go
type B struct {
    A bool `json:"a"`
}

// Bash output
json: {"a":1}
go: error occurred - json: cannot unmarshal number into Go struct field B.a of type bool

Solution:

This is slightly trickier to solve. Go’s json package does not provide any inbuilt support for such inconsistent behaviour from API providers.

So we have to build our own solution, using a custom data type that implements the Unmarshaler interface.

Let’s add the following type to our Go decoder program.

// requires "errors" and "strconv" in the import list
type jbool bool

func (jb *jbool) UnmarshalJSON(b []byte) error {
    var i interface{}
    err := json.Unmarshal(b, &i)
    if err != nil {
        return err
    }
    switch v := i.(type) {
    case bool:
        *jb = jbool(v)
    case string:
        // handles "true", "false", "1", "0", etc.
        x, err := strconv.ParseBool(v)
        if err != nil {
            return err
        }
        *jb = jbool(x)
    case float64:
        // JSON numbers always decode into float64 via interface{},
        // so no separate int cases are needed
        *jb = jbool(v != 0)
    default:
        return errors.New("value could not be decoded into jbool")
    }
    return nil
}

Then, this works:

$to_encode = 1; // PHP

// Go
type B struct {
    A jbool `json:"a"`
}

// Bash output
json: {"a":1}
go: {A:true}

Conclusion

If you are parsing JSON in Go and do not trust the API provider to be consistent with types, or the upstream is written in a dynamically typed language, prefer json.Number over float64 or int for all JSON numbers. For booleans, create a custom data type that handles the edge-case parsing for you.
