Parser combinators are one of the most awesome functional techniques for parsing strings into trees, like constructing JSON. In this talk from try! Swift, Yasuhiro Inami describes how they work by combining small parsers together to form more complex and practical ones.
Overview (00:00)
My name is Yasihiro Inami; I work at the LINE Corporation, a messenger app company. It is an honor to be a speaker at the try! Swift conference, with great developers from all over the world. In this talk I present one of the most fantastic techniques in functional programming: parser combinator.
Parser Combinator (01:06)
Parser (01:24)
Parser is an analyzer that takes a string as an input and creates a syntax tree. It consists of two parts: the lexer (takes a long string and converts it into small chunks, called tokens), and the syntactic analyzer (takes the tokens and creates a Syntax Tree).
Bottom-up, top-down (02:25)
When we do parsing, there are mainly two approaches: bottom-up and top-down parsing.
For example, there is a string in the bottom A = B + C * 2 ; D = 1
. How do we parse this, and in what order?
In bottom-up parsing, we take a closer look at the bottom tree first. We keep reading the text, we can expand the tree and, lastly, we reach the top. The parsing starts from the bottom to the top.
In top-down parsing, you read the partial text and, at some point (a bit earlier than bottom-up parsing), the top-down parser starts to predict what tree should come at the top. And it keeps guessing down to the bottom.
Parsing algorithms (03:35)
The parser combinator is a recourse in this parser, and it is a top-down parser. If you have ever read the Apple Swift Code before, the code is in leaf parse, and it also uses the same technique in C++. This is known as DSL imperative way.
Combinator (04:12)
This is called Combinator logic:
I = λx.x = { x in x }
K = λxy.x = { x, y in x }
S = λxyz.xz(yz)
= { x, y, z in x(z)(y(z))}
Y = S(K(SII))(S(S(KS)K)(K(SII)))
ι = λx.xSK
For now, think of it as a function (closures, in Swift). They can be combined together to form more interesting characteristics.
Parser combinator (04:48)
Parser combinator is a higher-order function that combines multiple parsers to form one big parser. Here (see slide) there are two parsers, the inside parsers, collaborating with each other.
Simple Parser in Swift (05:23)
How do we implement this in our first language, Swift?
Parser monad (05:33)
struct Parser<A> {
let parse: String -> (A, String)?
}
We prepare a simple struct called Parser
(this is a generic Parser<A>
). It is a container of the closure (let parse
). It takes String as input, and, in this case, the output is the tuple. The tuple consists of “output” and “remaining input”. Because the parsing sometimes fails, I use the Optional.
This structure is called state monad, and it may combine the monad or optional monad together. From here is monad’s showtime. It is not difficult; relax.
Applicative pure, alternative empty (06:27)
// Applicative pure
func pure<A>(a: A) -> Parser<A> {
return Parser { (a, $0) }
}
// Alternative empty
func empty<A>() -> Parser<A> {
return Parser { _ in nil }
}
The pure
is instantiating the parser. It parses the closure later. Inside that closure, you regain the tuple, which means this parser is always successful returning the variable <A>
, and it does not consume the input. For the empty
, the nil
is returned. That means this parser always fails.
Monad bind, functor fmap (07:05)
// Monad bind (>>=)
func >>- <A, B>(p: Parser<A>, f: A -> Parser<B>) -> Parser<B> {
return Parser { input in
if let (result, input2) = p.parse(input) {
return f(result).parse(input2)
} else {
return nil
}
}
}
// Functor fmap (<$>)
func <^> <A, B>(f: A -> B, p: Parser<A>) -> Parser<B> {
return p >>- { a in pure(f(a)) }
}
First, you receive the input, you see the closure, and the input is passed as an argument. You parse that with the given parser p
. If that parsing succeeds: you get a tuple of it, you extract the result part, and apply f
. And that f
is the result of the second parser. You use that parser for the main input, in this case, input2
. Otherwise, if the first parser fails in the first place, you would just return nil
. That means the whole parser fails (because the first parsing fails).
Once you have this fmap
operator, you can immediately get the functor fmap
. In Swift, we do not use $
, instead we use a ^
sign (we cannot use $
for the custom operator).
Alternative choice operator (08:18)
// Alternative choice (associative operation)
func <|> <A>(p: Parser<A>, q: Parser<A>) -> Parser<A> {
return Parser { input in
if let (result, input2) = p.parse(input) {
return (result, input2)
} else {
return q.parse(input)
}
}
}
You have a parser p
and q
, you try parser p
first. If that parsing succeeds, then finish, and you return that tuple. If it fails, you try q
instead.
Applicative sequential application (08:48)
I would like to mention three important sequential operations:
// Applicative sequential application
func <*> <A, B>(p: Parser<A -> B>, q: Parser<A>) -> Parser<B> {
return p >>- { f in f <^> q }
}
// Sequence actions, discarding the value of the second argument
func <* <A, B>(p: Parser<A>, q: Parser<B>) -> Parser<A> {
return const <^> p <*> q
}
// Sequence actions, discarding the value of the first argument
func *> <A, B>(p: Parser<A>, q: Parser<B>) -> Parser<B> {
return const(id) <^> p <*> q
}
-
The first one is a function application for monadic structure.
-
The second one is a sequence action (it looks like an arrow pointing to the left). It tries parser
p
thenq
; it uses both. But only uses the left side of the output ofp
; it discards the output ofq
. -
The last of our operators is the opposite. We try parser
p
thenq
. It is the same as before, but it only uses the output ofq
(notp
).
Parse one character (10:08)
func satisfy(predicate: Character -> Bool) -> Parser<Character> {
return Parser { input in
if let (head, tail) = uncons(input) where predicate(head) {
return (tail, head)
} else {
return nil
}
}
}
Here is the satisfy
parser I implemented (it is fundamental). It examines the input. You decompose it using the uncons
function, it is the head
and tail
. Examining the head part with a predicate: if that predicate succeeds, you get the decomposed tuple; otherwise, the poor parser fails.
You can also create any()
parser (can parse any character), or digit()
(can only parse the numeric number, and numeric character, or some specific character), and char(c: Character)
parser. It is readable, simple:
func any() -> Parser<Character> {
return satisfy(const(true))
}
func digit() -> Parser<Character> {
return satisfy { "0"..."9" ~= $0 }
}
func char(c: Character) -> Parser<Character> {
return satisfy { $0 == c }
}
Parse string (11:09)
In short, this is parsing the specific string. Otherwise it fails.
func string(str: String) -> Parser<String> {
if let (head, tail) = uncons(str) {
return char(head) *> string(tail) *> pure(str)
} else {
return pure("")
}
}
Combine parsers (11:29)
Here are some important parsers called many
and many1
. The parser p
is given, and they use that parser p
many times. The first many
parser uses the parser p
from zero occurrence to many occurrences, and it never fails. The many1
parser must be able to parse at least one occurrence of p
, otherwise it fails. They are mutually recursive.
func many<A>(p: Parser<A>) -> Parser<[A]> {
return many1(p) <|> pure([])
}
func many1<A>(p: Parser<A>) -> Parser<[A]> {
return cons <^> p <*> many(p)
}
Here are skipping parsers: skipMany
and skipMany1
. It looks similarly to the previous many
and many1
, but notice that the result part is now an empty void, an empty tuple (meaningless output).
It seems similar to using this parser for the grounds, but it is more efficient, it has better performance than many
and many1
. And we can implement important particles, skipSpaces
. As you can see, the implementation skipSpaces()
is common when we want to parse things. We want to skip spaces, but we do not care whether it is top or line with feet.
func skipMany<A>(p: Parser<A>) -> Parser<()> {
return skipMany1(p) <|> pure(())
}
func skipMany1<A>(p: Parser<A>) -> Parser<()> {
return p *> skipMany(p)
}
func skipSpaces() -> Parser<()> {
return skipMany(space)
}
Parse tokens (13:07)
Let’s add symbol(str: String)
and natural()
parsers. You might notice that there is a middle parser in the middle that is sandwiched 🍞 by skipSpaces()
parser. With applicative, I do not like operators pointing this direction. It first skips the head white space, then uprising middle parser, and lastly, skips the tail space. But it only uses the output of the middle parser. It does not care about the spaces, it only cares about the string (like a hamburger! 🍔). For the simple parser, the relevant (delicious) part (meat) is a string parser. For natural parser, the relevant part is many1(digit()))
(again, like a hamburger!), which converts the result party from string to integer. Only three lines, sometimes one line of code.
func symbol(str: String) -> Parser<String> {
return skipSpaces() *> string(str) <* skipSpaces()
}
func natural() -> Parser<Int> {
return skipSpaces() *>
({ Int(String($0))! } <^> many1(digit()))
<* skipSpaces()
}
Let’s Play (14:40)
Let’s use the +
and *
signs, parentheses, and natural numbers to make a simple calculator.
Simple Arithmetics (14:59)
The mathematical expression below is called Backus–Naur Form (BNF):
<expr> ::= <expr> + <term> | <term>
<term> ::= <factor> * <term> | <factor>
<factor> ::= ( <expr> ) | <number>
<number> ::= '0' | '1' | '2' | ...
It is common when you do parsing. In Swift, our first language, we implement the following:
func expr() -> Parser<Int> {
return term() >>- { t in // uses right recursion
(symbol("+") *> expr() >>- { e in pure(t + e) }) <|> pure(t)
}
}
func term() -> Parser<Int> {
return factor() >>- { f in
(symbol("*") *> term() >>- { t in pure(f * t) }) <|> pure(f)
}
}
func factor() -> Parser<Int> {
return (symbol("(") *> expr() <* symbol(")")) <|> natural()
}
I used the right recursion, but you can write it in almost one line. Less than three or maybe some more lines. A bit weird, but it is still simple. Here goes the calculation:
let (ans, _) = expr().parse(" ( 12 + 3 ) * 4+5")!
expect(ans) == 65
12 + 3, save on space plus 4+5 is expected to be equal to 65… and it works! 💪
More? TryParsec (16:58)
For those who want more on parsing, I developed TryParsec. It is a monadic parser combinator for this try! Swift conference (nowhere else!). It is inspired by Haskell, and especially the Attoparsec and Aeson libraries. It supports CSV, XML, and JSON. Technically, it can parse LL(*)
, with backtracking by default. This library is not using the do track catch in Swift. This Tryparsec… does not actually try anything, it does not use any try keywords, but please try it (because this is the try! Swift conference). This library is not open source yet (I will, right after this conference).
Here are some functions of TryParsec:
-
Basic Operators:
>>-
,<^>
,<*>
,*>
,<*
,<|>
,<?>
-
Combinators:
many
,many1
,manyTill
,zeroOrOne
,skipMany
,skipMany1
,sepBy
,sepBy1
,sepEndBy
,sepEndBy1
,count
,chainl
,chainl1
,chainr
,chainr1
-
Text (
UnicodeScalarView
):peek
,endOfInput
,satisfy
,skip
,skipWhile
,take
,takeWhile
,any
,char
,not
,string
,asciiCI
,oneOf
,noneOf
,space
,skipSpaces
,number
… (etc)
Parse JSON (19:02)
Let’s see how this TryParsec works with JSON.
enum JSON
(19:12)
enum JSON {
case String(Swift.String)
case Number(Double)
case Bool(Swift.Bool)
case Null
case Array([JSON])
case Object([Swift.String : JSON])
}
In TryParsec, the enum JSON
is already built in (common, basic type). Let’s say we have a JSON String, how do we parse? All we need to do is call parseJSON
, and you immediately get the JSON
enum tree. Simple, easy.
JSON example (19:21)
{
"string": "hello",
"num": 123.45,
"bool": true,
"null": null,
"array": ["hello", 9.99, false],
"dict": { "key": "value" },
"object": { "enabled": true }
}
Parse JSON (to AST) (19:30)
let json = parseJSON(jsonString).value
print(json)
In this process, I do not use any NSJSONSerialization
, or any object that is not involved in this process. It is all written in pure Swift. But we are not interested in this structure. This is an intermediate representation. We want our custom model to be mapped to it. We need JSON decoding and encoding features (which we all love ❤️), and Swift has about 20-30 libraries for doing so. Fortunately, TryParsec, is the 31st JSON decoding and encoding library. 💯
JSON Decoding (19:54)
struct Model
(20:38)
We saw the JSON string example, and we have model mapped this one. In the bottom I added a let dummy
. The JSON string does not have this field, but I added it.
struct Model {
let string: String
let num: Double
let bool: Bool
let null: Any?
let array: [Any]
let dict: [String : Any]
let subModel: SubModel
let dummy: Bool? // doesn't exist in raw JSON
}
FromJSON
protocol (21:02)
You have to implement the conformance to the FromJSON
protocol. You implement the static func fromJSON
, and you use the helper message code fromJSONObject
. This is common writing code applicative style. If you have ever tried the JSON library called Argo, TryParsec uses the same technique. Call function decode, and you get the mapped result.
extension Model: FromJSON {
static func fromJSON(json: JSON) -> Result<Model, JSON.ParseError> {
return fromJSONObject(json) {
curry(self.init)
<^> $0 !! "string"
<*> $0 !! "num"
<*> $0 !! "bool"
<*> $0 !! "null"
<*> $0 !! "array"
<*> $0 !! "dict"
<*> $0 !! "object" // mapping to SubModel
<*> $0 !? "dummy" // doesn't exist in raw JSON
}
}
}
… simple.
JSON Encoding (21:55)
ToJSON
protocol (22:03)
For JSON encoding, conform to the ToJSON
protocol, use the function toJSONObject
, and pass the array of the key-value pairs.
extension Model: ToJSON {
static func toJSON(model: Model) -> JSON {
return toJSONObject([
"string" ~ model.string,
"num" ~ model.num,
"bool" ~ model.bool,
"null" ~ model.null,
"array" ~ model.array,
"dict" ~ model.dict,
"subModel" ~ model.subModel
])
}
}
Encode to JSON string (22:13)
Call the function encode
, and you immediately get back the jsonString
.
let jsonString = encode(model)
print(jsonString)
Summary (TryParsec) (22:24)
TryParsec supports JSON, CSV and XML. It is simple, readable, and easy to create your own parsers (please try it). I also have the Xcode playground (you can also play around with it).
But let me point out some caveats:
- It does not have good performance yet (compared to
NSJSONSerialization
). FromJSON
/ToJSON
protocols (the mapping) are limited. Any library (Argo and other great libraries) has limitations… because Swift is currently limited.FromJSON
/ToJSON
does not work in some nested structures.
Hopefully, Swift 3 will solve these problems. Swift is now open source, it is evolving every day, and we have a great community.
Please check out my library, and help me improve the code base.
Q&A (24:25)
Q: Cool stuff. I have two questions, you can pick which one you want to answer. The first one is, is there any support for error? If the parsing fails can you annotate your parser combinators to say, “if this part fails, maybe I am going to throw this error here”. The second question is, how did you do the ToJSON
, do you have a printer combinator library?
Yasuhiro: I will answer the second question first. There is no printed JSON yet. I think I need to implement that. Now it is just one line JSON string. And the second one, is an error porting is important. But I did not implement it, because the performance is already bad at this point. I need to improve that first, and then hit our importing. Sorry about that.
Receive news and updates from Realm straight to your inbox