How to Train Your Swift: Examples of Computational Statistics in Swift

Swift’s ease of use and elegance of form make it perfect for math hobbyists looking to explore simple mathematical concepts. In this talk from try! Swift, Diana Zmuda presents a statistical model to rank data, a bag-of-words model to classify new elements, and a Markov Chain algorithm to generate entirely new data points. Over the course of the session, she walks through a variety of examples of exciting formulas implemented entirely in Swift, building up to a program trained to sort, classify, and generate data.


Overview (0:00)

I’m Diana, and I’m an iOS developer at Thoughtbot. I’m here to talk about a few examples of computational statistics and machine learning in Swift. Keep in mind that I am an iOS developer and not a mathematician; judge me on my correct use of the word tuple, and not my pronunciation of the word “Bernoulli trial”.

Today I am going to go over:

  • Different applications of models and formulae in Swift

  • Machine learning concepts

  • Using Swift to solve some problems (that we usually do not get to see during our day-to-day work, building Mac and mobile apps), such as:

    • Calculating a ranking system for data points based on preexisting scores

    • Creating a classifier for text using probabilities derived from our training data that we already know the classification for

    • Using a naive summarization algorithm to summarize paragraphs of text by surfacing only the most important and relevant sentences from the whole document

    • Creating a Markov chain text generator by creating a model, teaching it our entire data set, and then traversing through our model to generate text

Our Dataset (1:59)

Our data set will be anime descriptions, titles and ratings.

In the end, we will have a library that can rank different anime according to ratings, classify what genre an anime is in based on its description, create new descriptions and titles for fake anime, and summarize long descriptions of anime into terse, concise summaries.

Rating (2:53)

Let’s talk about calculating a rating. I’m not the greatest expert at anime, so when I choose a show, it is usually because one of my friends has recommended it. But this is the age of the internet, and word of mouth is dead. Let’s see how we can automate rankings so we can avoid relying on people.

Our problem is that we need to sort items (i.e. anime titles) by rank. We are given the number of positive and negative votes. However, we cannot be sure that the rating we get from those votes is accurate. When I say “accurate”, I mean how close the observed score is to what its true score is. And when I say “true score”, I mean the score it would have if everybody saw it and everybody rated it, and did not (for example) accidentally click the wrong button. We have to account for the overall quantity of votes. We can assume that as we get more votes, we can be more confident about the rating being exactly what we are observing.

We’ll use the lower bound of a Wilson score interval (one among other possible solutions). We want to use a system that is skeptical about any rating that is based on very few votes.

For our implementation, we first want to create a simple protocol that allows any object to be Rateable by exposing a votes tuple of up and down votes. Our use case for ranking is collections of objects, so we create an extension on CollectionType where all of the elements conform to the Rateable protocol. We use the sort function, sorting by the lower bound of our wilsonConfidenceInterval, because we are being a bit pessimistic.

Wilson’s score interval (4:25)

Let’s look at our implementation of the Wilson confidence interval. We return a tuple of the lower and upper bounds (that we mentioned before), even though we are only going to use the lower bound. If you wanted to be very optimistic, you could use the upper bound, but it would not be advisable. The inverse normal cumulative distribution function (inverseNormalCDF) gives a constant for whichever confidence level you require; 95% is very common, so you could just use a static (constant) number for z. Our lower and upper bounds rely on z, the number of positive votes, and the total number of votes. Being able to use special characters is very helpful and appropriate when you are implementing “mathy” formulae.
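For reference, the quantity being computed is the standard Wilson score interval, which the code below mirrors term for term:

$$
\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
$$

where n is the total number of votes, p̂ is the observed fraction of positive votes, and z is the inverse normal CDF value for the chosen confidence level (roughly 1.96 at 95%).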

protocol Rateable {
    var votes: (up: Int, down: Int) { get }
}
extension CollectionType where Generator.Element: Rateable {
    func sortByWilsonRanking() -> [Generator.Element] {
        return self.sort {
            return wilsonConfidenceInterval(
                $0.votes.up, $0.votes.down).lower >
                wilsonConfidenceInterval($1.votes.up, $1.votes.down).lower
        }
    }
}
func wilsonConfidenceInterval(positive: Int,
                              _ negative: Int,
                              confidence: Double = 0.95)
                              -> (lower: Double, upper: Double) {
    guard case let n = Double(positive + negative) where n != 0 else {
        return (0.0, 0.0)
    }

    // z for a two-sided interval at the given confidence level (≈1.96 at 95%).
    // inverseNormalCDF is assumed to be defined elsewhere; a hard-coded 1.96 also works.
    let z = inverseNormalCDF(1.0 - (1.0 - confidence) / 2.0)
    let z² = (z * z)

    let p̂ = 1.0 * Double(positive) / n

    let lhs = p̂ + z² / (2 * n)
    let rhs = z * sqrt((p̂ * (1 - p̂) + z² / (4 * n)) / n)
    let divisor = 1 + z² / n

    let lowerBound = (lhs - rhs) / divisor
    let upperBound = (lhs + rhs) / divisor

    return (lowerBound, upperBound)
}
struct Anime: Rateable {
    let name: String
    let votes: (up: Int, down: Int)

    init(name: String, votes: (up: Int, down: Int)) {
        self.name = name
        self.votes = votes
    }
}

let animeTT = Anime(name: "Tenjho Tenge", votes: (8, 3))
let animeMM = Anime(name: "Madoka Magica", votes: (159, 61))

[animeTT, animeMM].sortByWilsonRanking()

Let’s use our rating protocol that we created, and our Wilson score. We create an anime that conforms to our Rateable protocol, and we create new instances of the class with vote tuples. When we sort our list of these Anime structs by Wilson ranking, they rank themselves quite nicely. You can obviously do more than just two, but that is all that would fit.
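If you run those two through the formula with z ≈ 1.96, the lower bounds come out to roughly 0.43 for Tenjho Tenge (8 up, 3 down) and roughly 0.66 for Madoka Magica (159 up, 61 down), so Madoka Magica sorts first: both have about the same raw approval ratio (around 72%), but the larger sample earns more confidence.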

Now that we know how to rank our data, we will talk about classifying our data.

Tokenizing (7:12)

The first step to classifying is to tokenize your text. I have determined that the only anime I want to watch is anime about giant robots: I need a way to filter out any anime that is not in the genre I like. To classify our data, we need to represent our descriptions and model a correlation among all genre-A anime and a different correlation among all genre-B anime. To do this we will use a bag-of-words model.

Bag-of-Words (7:55)

Each anime description in our training data set will be broken down into a set of all of the words it contains. We strip out non-specific words (e.g. and, it, the). Then we determine which words will occur most reliably in text that is known to be in a specific category. For example, the word “space” appears often in mecha-genre anime descriptions.

Bernoulli trial (8:34)

We will use the Bernoulli document model to classify our anime. A Bernoulli trial is one with exactly two possible outcomes; in our case, the two outcomes are the two categories of anime.

func tokenize(text: String) -> Set<String> {
    func tagIsSignificant(tag: String) -> Bool {
        switch tag {
        case NSLinguisticTagParticle, NSLinguisticTagPronoun, NSLinguisticTagClassifier:
            return false
        default:
            return true
        }
    }

    var tokens: Set<String> = []
    text.enumerateLinguisticTagsInRange(text.startIndex..<text.endIndex,
        scheme: NSLinguisticTagSchemeLexicalClass,
        options: [.JoinNames, .OmitPunctuation, .OmitWhitespace, .OmitOther],
        orthography: nil) { tag, tokenRange, sentenceRange, _ in
            if tagIsSignificant(tag) {
                let token = text.substringWithRange(tokenRange)
                tokens.insert(token.lowercaseString)
            }
    }

    return tokens
}

We are creating a usable bag-of-words that excludes particles and pronouns (not domain-specific words). The tagIsSignificant function excludes all of those non-relevant words. I used a nested function because it is a small, hyper-local, singularly-relevant function that is only part of the tokenize function. After enumerating through all the linguistic tags, and inserting only words that are relevant, we have a workable bag-of-words.

class Classifier {
    private var training: [Classification: [Set<String>]] = [:]
}

For now, all our Classifier has is a variable that holds our training data set. The training data set is keyed off of the classification of each piece of data; we already know the classification of everything we put into it, because that is what makes it training data.
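The talk does not show how examples actually get into training; a minimal sketch of a hypothetical train method (reusing the tokenize function from the previous section, and assuming it sits in the same file so it can see the private property) might look like this:

extension Classifier {
    // Hypothetical helper: record a labeled example in the training set.
    // Because `vectors` (below) is computed lazily, all training should
    // happen before the first call to classify.
    func train(text: String, classification: Classification) {
        var examples = training[classification] ?? []
        examples.append(tokenize(text))
        training[classification] = examples
    }
}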

Classifying (10:18)

Now that we can create tokenized representations of our anime descriptions, we can classify new pieces of text. For example, we could determine whether an anime description should probably be classified as a “magical girl” or as a “giant robot” anime. Again, that is a Bernoulli trial because there are only two possible outcomes.

Naive Bayes (10:46)

We will be using a naive Bayes classifier to classify our text. It assumes that the probability of each word occurring in the text is independent of the probability of any other word occurring in the text.
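Concretely, for a document d the Bernoulli model scores each classification C from the per-class word probabilities p_t computed below (assuming a uniform prior over classes, which is what the classify code later does by starting every class at 1.0):

$$
\text{score}(C \mid d) = \prod_{t \in V_C} p_t^{\,[t \in d]}\,(1 - p_t)^{\,[t \notin d]}
$$

where V_C is the set of words seen in the training examples for C and [·] is 1 when the condition holds and 0 otherwise; we pick the classification with the highest score.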

Implementation

lazy var vectors: [Classification: [String: Double]] = {
    var vectors: [Classification: [String: Double]] = [:]
    for (classification, setsOfTokens) in self.training {
        var probabilities: [String: Double] = [:]
        let allTokens = Set(setsOfTokens.flatMap{$0})
        let count = setsOfTokens.count

        for token in allTokens {
            let n = setsOfTokens.map{$0.contains(token)}
                                .reduce(0){$0 + ($1 ? 1 : 0)}

            probabilities[token] = Double(n) / Double(count)
        }

        vectors[classification] = probabilities
    }

    return vectors
}()

lazy var creates vectors of word probabilities, using the keywords from our training data set. Technically, we are not creating vectors; we are creating dictionaries whose keys are the relevant words and whose values are probabilities. The probability is the word probability for each classification. For example, if the word “robot” is in 66% of giant-robot anime descriptions, its word probability is 66% (the word probability for magical girl anime is tracked entirely separately). We are calculating these probabilities and storing them in this loop.

typealias ClassificationType = protocol<RawRepresentable, Hashable>

class Classifier<C: ClassificationType> {
    typealias Classification = C

    private var training: [Classification: [Set<String>]] = [:]
}

Now that we have our training data set classified and our probabilities from that data set calculated and stored, we can create our classifier. The first step is to create a typealias for our classification type. This gives us the flexibility to classify practically anything (at least anything that is RawRepresentable and Hashable). Inside of our classifier, we typealias the generic type C to Classification for convenience.

func classify(text: String) -> Classification? {
    guard case let tokens = tokenize(text) where tokens.count > 0 else {
        return nil
    }
    var maximum: (classification: Classification?, score: Double) = (nil, 0.0)

    for (classification, probabilities) in vectors {
        var p = 1.0

        for (token, probability) in probabilities {
            if tokens.contains(token) {
                p *= probability
            } else {
                p *= (1.0 - probability)
            }
        }

        if maximum.score < p {
            maximum = (classification, p)
        }
    }

    return maximum.classification
}

Inside the classify function, we create a maximum tuple that is going to be our result. The tuple contains a classification and a score. These will end up being reset to the highest calculated score in our for loop.

Let’s run through the for loop for a giant robots classification. Our probability starts off at 1.0, which is 100%; if you had a prior probability that said 80% of all anime is about giant robots, you could use that instead. For each word that we have a word probability for, we check whether the text that we are classifying contains that word; if it does, we multiply the overall probability (which started out at 100%) by that token’s word likelihood. If it does not, we multiply the overall probability by the opposite of that word’s likelihood, i.e. that word’s likelihood of not appearing in a giant robots anime description. If the probability that we calculated for the giant robots classification ends up being higher than the one for the magical girl classification, the maximum will be set to the giant robots tuple, and at the very end we return the classification that had the highest calculated probability. If it is giant robots, I would end up watching it.
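To tie the pieces together, here is a hypothetical end-to-end sketch, assuming the train helper sketched earlier and placeholder arrays of descriptions whose genres we already know (a real training set would be much larger):

enum Genre: String {
    case MagicalGirl = "magical girl"
    case GiantRobot = "giant robot"
}

let classifier = Classifier<Genre>()

let magicalGirlDescriptions = ["A lonely girl makes a wish and gains magical powers...", /* … */]
let giantRobotDescriptions = ["A rookie pilot boards a giant robot to defend a space colony...", /* … */]

magicalGirlDescriptions.forEach { classifier.train($0, classification: .MagicalGirl) }
giantRobotDescriptions.forEach { classifier.train($0, classification: .GiantRobot) }

// With a realistically sized training set, a mecha-sounding description
// should come back as .GiantRobot.
classifier.classify("Teenagers pilot giant robots to protect the city from invaders")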

Summarizing (14:55)

Let’s try our hand at summarizing anime descriptions.

A data summarization is technically machine learning, and today we will be using a nice, simple implementation of that. To summarize our text, we want to find a representative subset of all of our sentences that we believe contains the information that is in the whole set of sentences. To determine if a sentence is representative of all of the sentences, we will compare it to every other sentence in our text. If the sentence is similar to many other sentences, we can say it is highly representative.

We will measure similarity between two sentences by how many words that they have in common. A sentence with many words in common with other sentences is very representative, and will most likely end up being in our summary at the end.

Implementation

Let’s start by splitting our text summary into pieces.

We have three new vars that return the String divided .ByParagraphs, .BySentences, and .ByWords. We will use paragraphs and sentences (alternatively, we could use words). These all return an array of strings.

extension String {
    var paragraphs: [String] {
        return enumeratedSubstringsWithOptions(.ByParagraphs)
    }

    var sentences: [String] {
        return enumeratedSubstringsWithOptions(.BySentences)
    }

    var words: [String] {
        return enumeratedSubstringsWithOptions(.ByWords)
    }
}
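The enumeratedSubstringsWithOptions helper these properties call is not shown in the talk; a minimal sketch of it, wrapping Foundation’s enumerateSubstringsInRange, might look like this:

import Foundation

private extension String {
    // Hypothetical helper: collect the substrings produced by
    // enumerateSubstringsInRange for the given enumeration option.
    func enumeratedSubstringsWithOptions(options: NSStringEnumerationOptions) -> [String] {
        var substrings: [String] = []
        enumerateSubstringsInRange(startIndex..<endIndex, options: options) {
            substring, _, _, _ in
            if let substring = substring {
                substrings.append(substring)
            }
        }
        return substrings
    }
}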

The enumeratedSubstringsWithOptions method calls enumerateSubstringsInRange with the provided option. Our next extension on String returns the representative score of every sentence in our String. We loop through every sentence and calculate and record a score for each one. A sentence’s representative score is the sum of its intersection with each and every other sentence in the text (the tokens property used below presumably reuses the tokenize function from the classifier section). We could have normalized our score based on the length of the sentence; e.g. shorter sentences could be weighted against longer ones (I like that this is a bit biased towards long sentences with many words, but that is a preference). After we have looped through all of our sentences, we return a dictionary that contains all of their representative scores.

var scoresBySentence: [String: Int] {
    var scores: [String: Int] = [:]

    let sentences = self.sentences

    for (i, sentence) in sentences.enumerate() {
        var score = 0
        let tokens = Set(sentence.tokens)

        for (j, comparison) in sentences.enumerate() {
            guard j != i else { continue }
            score += tokens.intersect(comparison.tokens).count
        }

        scores[sentence] = score
    }

    return scores
}

Let’s take advantage of the way that people write. Paragraphs are usually reliable delimiters between subjects within a document of text. Let’s use our previous var scoresBySentence and split our text by paragraphs. For each sentence in a paragraph, we can check its score in the scoresBySentence dictionary that we created, and return only the highest scoring sentence in that paragraph. After we append the highest scoring sentence from each paragraph, we have our final summary.

var paragraphSummaries: [String] {
    let scoresBySentence = self.scoresBySentence

    var summaries: [String] = []
    for paragraph in self.paragraphs {
        guard case let sentences = paragraph.sentences
            where sentences.count > 0
            else { continue }

        // maxElement takes an "is ordered before" predicate,
        // so use < to get the highest-scoring sentence.
        let bestSentence = sentences.maxElement {
            return (scoresBySentence[$0] ?? 0) < (scoresBySentence[$1] ?? 0)
        }
        summaries.append(bestSentence!)
    }

    return summaries
}

Using a six-paragraph summary of the film Akira, we can produce a series of sentences, each of which is the most relevant sentence from each paragraph. As you can see below, it is a high level, decent synopsis of the movie (even if some of the references are ambiguous):

[0] "One night, teen delinquent Shotaro Kaneda has his bosozoku gang, the Capsules, battle the Clowns, their rivals."
[1] "The Clowns ambush the two but the Capsules rescue them."
[2] "He fights Kei and exhumes Akira's remains but finds only tissue samples."
[3] "Tetsuo attempts to seek a cure from Kaori, but gets shot by Shikishima."
[4] "The other espers aid in the effort at the cost of being unable to return."
[5] "The explosion destroys most of Neo—Tokyo, killing Onishi in the process."

Generating (20:12)

Last step, we will create a Markov chain text generator. A Markov chain is a model of possible states that includes probabilities for transitioning from one state to another. The probability of moving from one state to another has to depend only on the current state, and not on any of the preceding states.

You may have seen Markov chain generators before; e.g. on Twitter, they use your old tweets, and then they tweet back at you in your own words. Most of those are in JavaScript (I think), but we will see one in Swift. We will use our Markov chain model to generate new anime titles. In our model, each state will be a word, the current state will be the current word, and the next state will be all possible next words. The probability for which possible state we transition to will be random. To represent this model in Swift, we will use dictionaries, where the key is the current state and the value is an array of possible next states.
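To make that concrete, here is a hypothetical illustration of what the memory dictionary defined below would contain after being taught the two made-up titles “giant robot war” and “giant space war” (using an empty string as the start/end seed):

// States reachable from the start seed, and from each word:
["":      ["giant", "giant"],
 "giant": ["robot", "space"],
 "robot": ["war"],
 "space": ["war"],
 "war":   ["", ""]]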

Training algorithm (20:30)

In our Markov object, our memory variable is the Markov chain that we will create and teach using our anime descriptions. First, we need to define a seed that will represent the beginning state and ending state of our model. The seed cannot be a word or letter that could appear in our text (I was using an Emoji, but that did not render quite correctly).

class Markov {
    static let seed = ""
    static let separator = " "

    var memory = [String : [String]]()
    var previousWord = Markov.seed

    func learn(sentence: String) {
        var parts = sentence.characters.split { $0 == " " }.map(String.init)
        parts.append(Markov.seed)

        for part in parts {
            learnWord(part)
        }

        previousWord = Markov.seed
    }
}

Implementation

func learnWord(newWord: String) -> [String : [String]] {
    if memory[previousWord] == nil {
        memory[previousWord] = []
    }

    memory[previousWord]?.append(newWord)
    previousWord = newWord

    return memory
}

The learn function splits the text up word by word, and learnWord adds each word to the Markov chain. We can see how a word is learned by adding it to the memory model: we look up our current state, which is the previous word, and then add the new word to the array of possible next words (creating a new array if one does not exist yet). Let’s use our model to generate a sequence of words. We will traverse through our model, using random probabilities when choosing the next word.

func ask(word: String = Markov.seed) -> String {
    let sentence = lookupWord(word, sentence: [String]())
    return sentence.joinWithSeparator(Markov.separator)
}

Our simple ask function recursively queries our Markov chain for the next state and joins the resulting chain of states together to form a new description. The lookupWord function randomly selects the next state given the current state, and then recursively calls itself until it encounters the seed, which signals the end of the chain, i.e. the end of the sentence.

func lookupWord(word: String, sentence: [String]) -> [String] {
    guard let possibleWords = memory[word] else { return sentence }

    let randomIndex = Int(arc4random_uniform(UInt32(possibleWords.count)))
    let nextWord = possibleWords[randomIndex]

    // The seed marks the end of a sentence, so stop before appending it.
    if nextWord == Markov.seed { return sentence }

    return lookupWord(nextWord, sentence: sentence + [nextWord])
}
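As a quick, hypothetical usage example: teach the chain a few made-up titles, then ask it for a new one.

let markov = Markov()
["giant robot war", "giant space war", "magical space girls"].forEach { markov.learn($0) }

markov.ask() // e.g. "giant space girls" - the exact output is random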

Recap (23:44)

We have written implementations of rating, classifying, generating, and summarizing text in Swift. Now we can use all of these frameworks that we have created to build an app that ranks, automatically classifies, summarizes, and generates new anime.

A fun note about tokenizing: in Japanese, if you strip away everything except for the kanji (which are the root characters), that is a perfect way of tokenizing a sentence or a piece of text, because kanji have no conjugation or plurality; they are root words, which is perfect for tokenization.

Swift is an amazing language for creating iOS and Mac apps, but given its ease of use, it is also a great tool for learning new concepts.

Summary (24:45)

Because these are such simple implementations of machine learning and statistical concepts, they are not perfect. Theoretically, we could have built all of these using Python and some of the very beautiful libraries available in that language. But I dare say it would have taken more time.

In the implementations that we discussed, we saw examples of gems that Swift has to offer when trying to write code that is very flexible and reusable. That trait is crucial when you are creating an implementation of a formula that can classify any object or sort generic objects. We used certain features of the language that are indispensable when you are modeling statistical formulae.

Given that Swift may someday become our back-end language of choice (and also the fact that Python cannot reign supreme as the king of natural languages and natural language processing forever), it is worthwhile to try our hand at approaching these back-end, computationally-intensive problems in Swift.

Further Reading (27:33)

Q&A (27:46)

Q: This process explodes the strings and creates in-memory objects. Do you know what the memory implications are of doing something like this, with a large body of text? Are there strategies to minimize that impact?

Diana: Quite frankly, when I was implementing these, I was not using incredibly large data sets - I did not see any lag from that. All were originally implemented inside of a playground, which ended up obviously having overhead; that was very slow, but that is for hacking it out. I think mostly what I was trying to illustrate in this talk is, you can hack it out, and even though it might not end up being the most efficient method, it is a tool for learning. If you wanted to do a deep dive, yes, you would probably want to explore things like that.


Diana Zmuda

Diana is an iOS developer at Thoughtbot. She co-wrote a book about building mobile apps in tandem with APIs called iOS on Rails. She is also an instructor for App Camp for Girls, a summer camp where young girls learn how to write software. Occasionally, she tweets iOS-related puns at @dazmuda.