131 changes: 131 additions & 0 deletions docs/boundaries.md
@@ -0,0 +1,131 @@
---
redirect_from: "/boundaries.html"
title: Token Boundaries in Ion Text
description: "Details regarding how Ion text is tokenized and where values are split in cases where no whitespace is used between them."
---

# [Docs][docs]/ {{ page.title }}

Ion text values are best separated by explicit whitespace (space, horizontal tab, vertical tab, line feed, carriage return, form feed, and comments). However, the specification defines circumstances in which it is acceptable for values to appear immediately adjacent to each other.

In general, Ion text values can appear adjacent to each other when the boundary between the values is unambiguous. The boundaries between quoted symbols, double-quoted strings, triple-quoted strings, lobs, structs, lists, and S-expressions are unambiguous, so these elements can appear adjacent to one another in Ion text:

```ion
{% raw %}
// A valid Ion data stream
(1)[2]{a:struct}'hello'"world"{{ SSBsb3ZlIElvbiE= }}{{ "Same with clob" }}'annotations too!'::123
{% endraw %}
```

Additionally, any value can appear immediately *after* a value of these types, including numeric values, keywords, and identifiers.

Container separators delimit values unambiguously, so any value can also appear adjacent to a container boundary:

```ion
// Some valid, compact containers
[1,2,"three"]
{name:"Austin",message:"Hello world!"}
```

Identifiers (unquoted symbols) and reserved keywords (`null` and its typed variants, `true`, `false`, and `nan`) are terminated by the first non-identifier character encountered. The identifier characters are ASCII letters, digits, or the characters `$` (dollar sign) or `_` (underscore). This allows symbols to contain a reserved keyword as a prefix:

```ion
// All single symbols, even when prefixed by a keyword
trueSymbol
null123
nanfalse
```
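The termination rule above can be sketched in a few lines of Python. This is only an illustration, not code from any Ion implementation: the names `IDENTIFIER_CHARS` and `read_identifier` are hypothetical, and the sketch does not check that an identifier starts with a non-digit.

```python
import string

# Identifier characters as listed above: ASCII letters, digits, '$' and '_'.
# (Hypothetical helper names; not part of any Ion library.)
IDENTIFIER_CHARS = frozenset(string.ascii_letters + string.digits + "$_")


def read_identifier(text: str, start: int = 0) -> str:
    """Return the longest run of identifier characters beginning at `start`."""
    end = start
    # Stop at the first non-identifier character or at end of input.
    while end < len(text) and text[end] in IDENTIFIER_CHARS:
        end += 1
    return text[start:end]
```

Under this sketch, `read_identifier("trueSymbol")` consumes the whole input as a single token, while `read_identifier("true(story)")` stops at `(`, leaving the keyword `true` on its own.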

This means that any value that begins with a non-identifier character can appear immediately after an identifier or one of these reserved keywords:

```ion
// All valid pairs of two top-level values
abc-5
symbol["And a list"]
anotherOne'andAnother'
true(story)
null.float+inf
```

A typed null must be followed by a non-identifier character. Otherwise, the trailing identifier characters are read as part of the type name and the result is not valid Ion:

```ion
null.struct5E10 // ERROR: not a valid Ion value - struct5E10 is a bad null decorator
```

## Numeric stop-characters

Ion text enforces stricter rules on certain "numeric" types: integers, real values (decimals and floats, including the special float values `-inf` and `+inf` but excluding `nan`), and timestamps must each be followed by one of the following 15 numeric stop-characters: `{}[](),"' \t\n\r\v\f`.

This means that strings, quoted symbols, lobs, structs, lists, and S-expressions can appear immediately after a numeric or timestamp value in a top-level datagram or S-expression:

```ion
{% raw %}
// All of this is valid Ion
123{a: "struct"}456
-0.5[a, list]
5.34e9"then a string"
0xdeadbeef'hello'::world
+inf{{ "<-- also works with -inf" }}
(1["list"]2000T{a: "struct"}0b010101("sexp")4)
{% endraw %}
```
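As a rough illustration of the stop-character rule, the check can be written as a set-membership test. This is a sketch under assumptions, not an excerpt from an Ion parser: `NUMERIC_STOP_CHARS` and `is_terminated` are hypothetical names, and treating end of input as a valid terminator is an assumption made here for convenience.

```python
# The 15 numeric stop-characters listed above:
# { } [ ] ( ) , " ' space \t \n \r \v \f
NUMERIC_STOP_CHARS = frozenset("{}[](),\"' \t\n\r\v\f")


def is_terminated(text: str, end: int) -> bool:
    """Check the character that follows a numeric or timestamp token.

    `end` is the index just past the token. End of input is accepted
    here as a terminator (an assumption of this sketch).
    """
    return end == len(text) or text[end] in NUMERIC_STOP_CHARS
```

For example, `is_terminated('123{a: "struct"}456', 3)` is true because `{` is a stop-character, while `is_terminated("123//comment", 3)` is false because `/` is not in the set.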

Anything that is not a numeric stop-character appearing immediately after a numeric or timestamp value is a syntax error. This notably includes comments and S-expression operators:

```ion
123// a comment // ERROR: single-line comment is not a valid numeric stop
Contributor

This is surprising. Comments are explicitly called out in the spec as being equivalent to whitespace, and I believe that all of our implementations treat comments as valid whitespace for the purposes of terminating a numeric value.

Contributor Author

I also found this surprising, but some implementations do reject Ion that uses a comment immediately after a numeric.

The spec does say specifically:

In the text notation, real values [and int, ts] must be followed by one of the fifteen numeric stop-characters: `{}[](),\"\'\ \t\n\r\v\f`.

Which calls out specific whitespace characters \t\n\r\v\f and excludes comments.

Here is a comparison of our Ion JS, Python, Java, and Rust implementations. I have not tested Go or C yet.

Handling of 123//comment:

| Language | Success | Result |
|---|---|---|
| JavaScript | ❌: Error | `Error: invalid character after number` |
| Python | ❌: IonException | `IERR_INVALID_SYNTAX` |
| Java | | `$ion_1_0`<br>`123` |
| Rust | ❌: Decoding | `invalid Ion syntax encountered`<br>`offset=3`<br>`buffer head=<//comment>`<br>`buffer tail=<//comment>`<br>`buffer len=9` |

Handling of 123/*comment*/:

| Language | Success | Result |
|---|---|---|
| JavaScript | ❌: Error | `Error: invalid character after number` |
| Python | ❌: IonException | `IERR_INVALID_SYNTAX` |
| Java | | `$ion_1_0`<br>`123` |
| Rust | ❌: Decoding | `invalid Ion syntax encountered`<br>`offset=3`<br>`buffer head=</*comment*/>`<br>`buffer tail=</*comment*/>`<br>`buffer len=11` |

Package versions used:

| | JS | Python | Java | Rust |
|---|---|---|---|---|
| Ion version | 5.2.1 | 0.13.0 | 1.11.11 | 1.0.0-rc.7 |

Here's the code I used to test:

Java impl:

import com.amazon.ion.*;
import com.amazon.ion.system.IonSystemBuilder;
import com.amazon.ion.system.IonTextWriterBuilder;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.HashMap;
import java.util.Map;

public class Main {
    public static void main(String[] args) {
        ObjectMapper mapper = new ObjectMapper();
        Map<String, Object> response = new HashMap<>();
        
        try {
            String input = args[0];
            IonSystem ION = IonSystemBuilder.standard().build();
            IonValue datagram = ION.getLoader().load(input);
            StringBuilder sb = new StringBuilder();
            IonWriter writer = IonTextWriterBuilder.pretty().build(sb);
            datagram.writeTo(writer);
            writer.close();
            
            response.put("success", true);
            response.put("result", sb.toString());
            System.out.println(mapper.writeValueAsString(response));
        } catch (Exception e) {
            response.put("success", false);
            response.put("errorType", e.getClass().getSimpleName());
            response.put("error", e.toString());
            try {
                System.out.println(mapper.writeValueAsString(response));
            } catch (Exception jsonE) {
                System.out.println("{\"success\":false,\"errorType\":\"JsonProcessingException\",\"error\":\"" + jsonE.toString() + "\"}");
            }
        }
    }
}

JS impl:

import * as ion from 'ion-js'

try {
    const input = process.argv[2]
    const reader = ion.makeReader(input)
    const writer = ion.makePrettyWriter()
    writer.writeValues(reader)
    writer.close()

    const decoder = new TextDecoder("utf-8")
    console.log(JSON.stringify({
        success: true,
        result: decoder.decode(writer.getBytes())
    }))
} catch (e) {
    console.log(JSON.stringify({
        success: false,
        errorType: e.constructor.name,
        error: e.toString()
    }))
}

Python impl:

#!/usr/bin/env python3
import sys
import json
import amazon.ion.simpleion as ion

try:
    input_str = sys.argv[1]
    data = ion.loads(input_str, single_value=False)
    result = ion.dumps(data, sequence_as_stream=True, binary=False, indent='  ')
    
    print(json.dumps({
        "success": True,
        "result": result
    }))
except Exception as e:
    print(json.dumps({
        "success": False,
        "errorType": type(e).__name__,
        "error": str(e)
    }))

Rust impl:

use ion_rs::*;
use serde_json::json;
use std::env;

fn main() {
    let args: Vec<String> = env::args().collect();
    
    match run(&args[1]) {
        Ok(result) => {
            println!("{}", json!({
                "success": true,
                "result": result
            }));
        }
        Err(e) => {
            println!("{}", json!({
                "success": false,
                "errorType": format!("{:?}", e).split('(').next().unwrap_or("Error"),
                "error": format!("{}", e)
            }));
        }
    }
}

fn run(input: &str) -> IonResult<String> {
    let elements = Element::read_all(input.as_bytes())?;
    let out: String = elements.encode_as(v1_0::Text.with_format(TextFormat::Pretty))?;
    Ok(out)
}

5D3/* block this time */ // ERROR: block comment also cannot act as numeric stop
(10-.5*3) // ERROR: operators in an S-expression cannot act as numeric stop
(2007-01-01T~2007-12-31T) // ERROR: same goes for timestamps
-inf// a comment // ERROR: +inf and -inf also require valid numeric stops
```

Note that the special float value `nan` does not require a valid numeric stop:

```ion
nan// This is okay!
```

## Associativity of + and - in S-expressions

`+` and `-` can have two meanings in S-expressions based on context, either as operators or prefixes to a numeric value. If a minus or plus sign appears without another operator symbol immediately preceding it and the value immediately following is an integer or real number (for minus), or `inf` (for either), then the sign binds to the following value rather than acting as an operator symbol. For example, the following S-expressions all contain exactly 2 values:

```ion
// These S-expressions all have two elements
// The sign binds to the following value
(1 -2)
(+inf -.5E3)
(nan-inf)
(-16D-3 +inf)
```

The following S-expressions, on the other hand, all contain 3 values, one of which is an operator symbol:

```ion
(1 --123) // equivalent to (1 '--' 123) - minus sign is immediately preceded by another operator symbol character
(1 *+inf) // equivalent to (1 '*+' inf) - plus sign is immediately preceded by another operator symbol character
(1 -2000T) // equivalent to (1 '-' 2000T) - minus sign cannot bind to a timestamp
(1 +infx) // equivalent to (1 '+' infx) - +infx is not positive infinity, this is an operator and a symbol
```

To use an operator before a token that begins with `+` or `-`, use whitespace around the operator:

```ion
(10 + -6 = 4)
```

Note that if `+inf` or `-inf` appears in an S-expression but is not followed by a numeric stop-character, it is interpreted as an operator symbol followed by the identifier `inf`.

```ion
// In these S-expressions, the first +/-inf cannot be the special float values because they are not
// followed by a valid numeric stop-character. However, they can still be interpreted as an operator
// and the symbol 'inf'.
(+inf*3) // equivalent to ('+' inf '*' 3)
(-inf-inf) // equivalent to ('-' inf -inf)
```
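The interaction between the two rules (no operator character immediately before the sign, and a numeric stop-character immediately after `inf`) can be sketched as follows. This is a simplified illustration, not a tokenizer: `signed_inf_is_float` is a hypothetical helper, the operator character set is an assumption based on the Ion operator characters, and only the `+inf`/`-inf` case is handled.

```python
# Assumed set of S-expression operator characters (not taken from this document).
OPERATOR_CHARS = frozenset("!#%&*+-./;<=>?@^`|~")
# The 15 numeric stop-characters.
NUMERIC_STOP_CHARS = frozenset("{}[](),\"' \t\n\r\v\f")


def signed_inf_is_float(sexp: str, i: int) -> bool:
    """Decide whether the text at index `i` is the float +inf / -inf.

    Returns False when the sign should instead be read as (part of) an
    operator symbol, with `inf...` following as an identifier.
    """
    if sexp[i:i + 4] not in ("+inf", "-inf"):
        return False
    # A sign immediately preceded by another operator character
    # belongs to that operator symbol.
    if i > 0 and sexp[i - 1] in OPERATOR_CHARS:
        return False
    end = i + 4
    # The special float values require a numeric stop-character
    # (end of input is accepted here as an assumption).
    return end == len(sexp) or sexp[end] in NUMERIC_STOP_CHARS
```

With this sketch, the first `-inf` in `(-inf-inf)` is rejected because it is followed by `-`, while the second one is accepted because it is followed by `)`.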

Always use explicit whitespace around operators in S-expressions so that it is clear whether a sign is an operator or binds to the value that follows it.

<!-- references -->
[docs]: {{ site.baseurl }}/docs.html
9 changes: 6 additions & 3 deletions docs/spec.md
@@ -149,7 +149,8 @@ _1 // A symbol (ints cannot start with underscores)
```

In the text notation, integer values must be followed by one of the
-fifteen numeric stop-characters: `{}[](),\"\'\ \t\n\r\v\f`.
+fifteen numeric stop-characters: `{}[](),\"\'\ \t\n\r\v\f`. See [Text Token Boundaries](boundaries.html)
+for more details.

### Real Numbers {#real-numbers}

@@ -189,7 +190,8 @@ The `float` type denotes either 32-bit or 64-bit IEEE-754 floating-point values;
sizes may be supported in future versions of this specification.

In the text notation, real values must be followed by one of the
-fifteen numeric stop-characters: `{}[](),\"\'\ \t\n\r\v\f`.
+fifteen numeric stop-characters: `{}[](),\"\'\ \t\n\r\v\f`. See
+[Text Token Boundaries](boundaries.html) for more details.

The precision of `decimal` values, including trailing zeros, is significant and
is preserved through round-trips. Because most decimal values cannot be
@@ -265,7 +267,8 @@ not equivalent:
```

In the text notation, timestamp values must be followed by one of the
-fifteen numeric stop-characters: `{}[](),\"\'\ \t\n\r\v\f`.
+fifteen numeric stop-characters: `{}[](),\"\'\ \t\n\r\v\f`. See
+[Text Token Boundaries](boundaries.html) for more details.

### Strings {#string}
