Parsell: streaming parsers in Rust

Alan Jeffrey, Rust Bay Area Meetup, February 2016

Goals

Parse streaming input
No memory allocation
Infer complex types
No dynamic method calls

Goal 1: Parse streaming input

Regular parsing is all-at-once:

let parser = ···;
let mut test_data = "(x + 37) + y".chars();
match parser.init(&mut test_data).unwrap() {
Done(parsed) ⇒ ··· // yay, parsed it!
}

Goal 1: Parse streaming input

Streaming parsing is bit-by-bit:

let mut test_data1 = "(x + ".chars();
let mut test_data2 = "37) + y".chars();
match parser.init(&mut test_data1).unwrap() {
Continue(parsing) ⇒ match parsing.more(&mut test_data2) {
Done(parsed) ⇒ ··· // yay, parsed it!
} ··· }

Goal 1: Parse streaming input

Streaming parsing is painful.

Input arrives in chunks (e.g. from the network).

Pain: save/restore parser state at chunk boundaries.

Gain: much less buffering.

Goal 2: no memory allocation

We’d like the parsing library to use zero heap.

User code (e.g. building an AST) can do what it likes.

Can the library work with no memory allocation?

Goal 2: no memory allocation

No.

Contradicts goal 1.

Goal 2: no memory allocation

Remember save/restore parser state at chunk boundaries?

That state has to go somewhere.

Can’t go on the stack (not fixed size).

This just leaves the heap.

Revised goal 2

No memory allocation between chunk boundaries.

Goal 3: infer complex types

Parser libraries often have complex types.

For example, the parser “foo.or_else(bar).and_then(baz)”
has type “AndThen〈OrElse〈Foo,Bar〉,Baz〉”.
This is very similar to the Rust iterator library.

Users should not have to see complex types.

Yay type inference!

Goal 4: no dynamic method calls

Dynamic method calls can be expensive.

Can mess up instruction pipelining / prefetching.
Can mess up optimizations from inlining.

Can we do everything with static method calls?

Goal 4: no dynamic method calls

AFAIK no, at least not in Rust.

Contradicts previous goal.

Goal 4: no dynamic method calls

Rust does static method dispatch by type.

You write “this.that()”, rustc works out (this: SomeType)
and looks up (impl ThatTrait for SomeType),
at compile time.

Goal 4: no dynamic method calls

Imagine a parser for expressions like "(x + 37) + y".

Expressions are arbitrarily deep, so the parser is recursive.

Imagine a type Exp1〈P〉 that can parse expressions of depth 1,
and uses parser P each time it reaches parentheses.

We’re looking for the type Exp1〈Exp1〈Exp1〈···〉〉〉.

Goal 4: no dynamic method calls

Yay, Rust has recursive types, using Box.

struct Exp(Box〈Exp1〈Exp〉〉);

Yay! Recursive parsers, static dispatch, we’re good?

Goal 4: no dynamic method calls

Remember Users should not have to see complex types?

struct Exp(Box〈Exp1〈Exp〉〉);

Exp1 is one of those complex types.

AFAIK, Rust doesn’t allow type inference to make these invisible,
(e.g. in supertrait type arguments or associated type definitions).

Goal 4: no dynamic method calls

Trait objects to the rescue!

struct Exp(Box〈SomeParserTrait〉);

Yay: parser traits much are simpler than the complex parser types.

But: trait objects use dynamic dispatch.

Revised goal 4

No dynamic method calls between chunk boundaries.

Revised goals

Parse streaming input

No memory allocation between chunk boundaries

Infer complex types

No dynamic method calls between chunk boundaries

State of the art

Rust parser libraries

LALRPOP (Niko Matsakis)

Nom (Geoffroy Couprie)

Parsell (me)

···

Rust parser libraries: LALRPOP

Domain-specific language (à la yacc, Bison, JavaCC, ...).

Recognizes a rich class of languages (LALR).

All-at-once parsing.

Rust parser libraries: Nom

Parser library, makes heavy use of macros.

Recognizes a rich class of languages (CFG).

Streaming parsers.

Does some allocation / dynamic calls.

Hides complex types behind macros.

Rust parser libraries: Parsell

Parser library, no macros.

Recognizes a limited class of languages (LL(1)).

Streaming parsers.

Only does allocation / dynamic calls at chunk boundaries.

Inspired by Nom and Hutton/Meijer monadic parsing.

Parsell

Parsell: meets goals

Parse streaming input

No memory allocation between chunk boundaries

Infer complex types

No dynamic method calls between chunk boundaries

Parsell: example

let ALPHABETIC = character(char::is_alphabetic);
let ALPHANUMERIC = character(char::is_alphanumeric);
let parser = ALPHABETIC
.and_then(ALPHANUMERIC.star(String::new));
let mut test_data = "abc123!".chars();
match parser.init(&mut test_data).unwrap() {
Done(result) ⇒ assert_eq!(result, ('a', "bc123")), ...
}
assert_eq!(test_data.as_str(), "!");

Parsell: example with no copying

let parser = ALPHABETIC
.and_then(ALPHANUMERIC.star(ignore))
.buffer(); // Returns a Cow<'a,str>
let mut test_data = "abc123!".chars();
match parser.init(&mut test_data).unwrap() {
Done(result) ⇒ assert_eq!(result, Borrowed("abc123")), ...
}

Parsell: example with copying

let parser = ALPHABETIC
.and_then(ALPHANUMERIC.star(ignore))
.buffer(); // Returns a Cow<'a,str>
let mut test_data1 = "abc".chars();
let mut test_data2 = "123!".chars();
match parser.init(&mut test_data1).unwrap() {
Continue(parsing) ⇒ match parsing.more(&mut test_data2) {
Done(result) ⇒ assert_eq!(result, Owned(String::from("abc123"))), ... }, ... }

Parsell: implementation fragment

trait Stateful〈Ch, Str〉where Str: Iterator〈Item=Ch〉{
type Output;
fn more(self, data: &mut Str) → ParseResult〈Self, Self::Output〉;
}

enum ParseResult〈State, Output〉{
Continue(State),
Done(Output),
}

Parsell: lessons learned

Goals can be met (I wasn’t sure about this when I started!)

Rust borrowchk catches bugs that testing wouldn’t.

Linearity enforces safety (can’t call more after done).

Lifetime polymorphism / HRTB is tricky. (Thanks :eddyb!)

Needed traits PeekableIterator and ToStatic.

Hack without fear!

Thanks

IRL: Alan Jeffrey

EMail: ajeffrey@mozilla.com

Twitter: @asajeffrey

Repo: https://github.com/asajeffrey/parsell

Doc: http://asajeffrey.github.io/parsell

Crate: https://crates.io/crates/parsell