Apple has offered an API for natural language processing since iOS 5, which allowed us to tokenize text, detect the language, and determine parts of speech. With Swift and the introduction of Playgrounds, it’s faster and more delightful than ever to experiment with linguistics. We welcomed Ayaka Nonaka, who leads the iOS team at Venmo, to give our next talk on NLP. She’ll go over how to build a spam detector in Swift, starting with the basic theory and ending with a fully functional Naive Bayes classifier.
To see the full code used in the demonstrations, you can visit Ayaka’s GitHub repo.
Naive Bayes Classifier
Theory (2:54)
The Naive Bayes classifier in machine learning learns from past examples, also known as training data. It is built on Bayes’ theorem, which lets you compute the probability of an event given that another event has happened. Thus, P(A|B) is equal to P(B|A) times P(A), all divided by P(B). This formula follows from the equation for the probability of both events A and B happening.
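Written out, the derivation starts from the joint probability: P(A and B) = P(A|B) × P(B), and equally P(A and B) = P(B|A) × P(A). Setting the two expressions equal and dividing by P(B) gives Bayes’ theorem: P(A|B) = P(B|A) × P(A) / P(B).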
Example: Spam & Sodium (5:16)
To work through some of this probability theory, with the larger goal of showing how it is relevant to the classifier, we’ll use an example where emails may contain the words ‘spam’ and ‘sodium’. Given the total number of emails and the number of them that contain the word ‘spam’, the word ‘sodium’, or both, we can calculate the probability that an email is spam given that it contains the word ‘sodium’. However, we run into problems when we try to calculate probabilities with another trigger word in addition to ‘sodium’, and even more difficulties when we add a third word.
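To make this concrete with made-up numbers (not from the talk): suppose there are 100 emails, 40 contain the word ‘spam’, 20 contain ‘sodium’, and 10 contain both. Then P(spam) = 40/100, P(sodium) = 20/100, and P(sodium|spam) = 10/40, so P(spam|sodium) = (10/40 × 40/100) / (20/100) = 0.5.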
Here’s where the Naive Bayes classifier comes in: by assuming conditional independence, we can simplify some of the probabilities so that we have known values to plug into the formula. We didn’t even have enough information to solve for the probabilities when classifying based on only two words, so this assumption helps us enormously. The English language has hundreds of thousands of words, and the naive assumption helps us avoid a problem that would otherwise be computationally unfeasible.
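In the same notation, the naive assumption says that each word appears independently of the others given the category. So for an email containing the words w1, w2, …, wn, P(spam|w1, …, wn) is proportional to P(spam) × P(w1|spam) × P(w2|spam) × … × P(wn|spam). Each factor P(wi|spam) is just a count from the training data, and we pick whichever category gives the larger value. This is what the classifier later in the talk computes, in log space.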
Code: NSLinguisticTagger! (13:28)
NSLinguisticTagger has existed since iOS 5. It is likely used in Siri, since it’s available on iOS but not on the Mac, and it has some pretty neat features. NSLinguisticTagger lets you do lemmatization, which converts a word like “running” to “run”. This goes beyond stemming, since it can also turn the word “was” into “be”. You can also do part-of-speech tagging: you can find out whether something is a noun, an adjective, or an adverb. Finally, NSLinguisticTagger can do language detection. All of this is done locally on the device, and no network calls are made to any language APIs.
Live Coding (14:28)
We’ll be using Swift Playgrounds to play around with Naive Bayes and NSLinguisticTagger. Playgrounds make it very easy to experiment, since they provide a quick feedback loop without requiring you to write tests first or commit to code before you’ve tried it out. For the code used in this demonstration, go here!
To start, the tag function takes in a string of text and a scheme, and it returns an array of tagged tokens. We have to specify which options to use for the NSLinguisticTagger; in this case I’ve said that I don’t care about whitespace, punctuation, or things that aren’t words. Then you can create a new linguistic tagger by giving it the available tag schemes. You give the tagger the string that you want to tokenize or tag, and then enumerate over it to get each token and its tag. Each token and tag pair is appended to the array, which is returned at the end.
import Foundation

typealias TaggedToken = (String, String?) // Can’t add tuples to an array without typealias. Compiler bug... Sigh.

func tag(text: String, scheme: String) -> [TaggedToken] {
    // Skip whitespace, punctuation, and anything that isn’t a word.
    let options: NSLinguisticTaggerOptions = .OmitWhitespace | .OmitPunctuation | .OmitOther
    let tagger = NSLinguisticTagger(tagSchemes: NSLinguisticTagger.availableTagSchemesForLanguage("en"),
        options: Int(options.rawValue))
    tagger.string = text
    var tokens: [TaggedToken] = []
    // Enumerate over the string, collecting each token and its tag.
    tagger.enumerateTagsInRange(NSMakeRange(0, count(text)), scheme: scheme, options: options) { tag, tokenRange, _, _ in
        let token = (text as NSString).substringWithRange(tokenRange)
        tokens.append((token, tag))
    }
    return tokens
}
The following functions call tag with different schemes as the second argument. The first, partOfSpeech, tags words as different parts of speech. The next function is lemmatize, which does the same thing but with the lemma scheme. The third function, language, tags each word with an abbreviation for the language it’s written in.
func partOfSpeech(text: String) -> [TaggedToken] {
    return tag(text, NSLinguisticTagSchemeLexicalClass)
}

// tags "I" as a pronoun, "went" as a verb, "to" as a preposition, etc.
partOfSpeech("I went to the store")

func lemmatize(text: String) -> [TaggedToken] {
    return tag(text, NSLinguisticTagSchemeLemma)
}

// turns "went" into "go"
lemmatize("I went to the store")

func language(text: String) -> [TaggedToken] {
    return tag(text, NSLinguisticTagSchemeLanguage)
}

language("Ik ben Ayaka")      // "I am Ayaka"; tag is "nl" for Nederlands (Dutch)
language("Ich bin Ayaka")     // "I am Ayaka"; tag is "de" for Deutsch (German)
language("私の名前は彩花です")   // "My name is Ayaka"; tag is "ja" for Japanese
Finally we get to the actual Naive Bayes classifier. The first lines declare some variables that keep track of counts so that we can compute probabilities. There’s an initializer that takes in a tokenizer, and then we get to the training and the classifying phases. The training phase is where you feed in past examples of what belongs in each category, and once you’ve done that, you can move on to classifying.
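Those declarations aren’t shown in the excerpts below, but based on the names the training and classifying code refers to (categoryOccurrences, tokenCount, smoothingParameter, the P functions, incrementToken, and incrementCategory), the skeleton looks roughly like this. It’s a sketch for orientation, not the exact code from Ayaka’s repo:

public typealias Category = String

public class NaiveBayesClassifier {
    public typealias Tokenizer = String -> [String]

    private let tokenizer: Tokenizer
    private let smoothingParameter = 1.0  // Laplace smoothing, so unseen words don’t zero out a score

    // Counts collected during training.
    private var categoryOccurrences: [Category: Int] = [:]
    private var tokenOccurrences: [String: [Category: Int]] = [:]
    private var trainingCount = 0
    private var tokenCount = 0

    public init(tokenizer: Tokenizer) {
        self.tokenizer = tokenizer
    }

    // MARK: - Probabilities

    // P(C=category): fraction of all training examples in the category.
    private func P(category: Category) -> Double {
        return Double(categoryOccurrences[category] ?? 0) / Double(trainingCount)
    }

    // P(C=category, W=token): fraction of all training examples that are in the category and contain the token.
    private func P(category: Category, _ token: String) -> Double {
        return Double(tokenOccurrences[token]?[category] ?? 0) / Double(trainingCount)
    }

    // MARK: - Counting

    private func incrementToken(token: String, category: Category) {
        if var counts = tokenOccurrences[token] {
            counts[category] = (counts[category] ?? 0) + 1
            tokenOccurrences[token] = counts
        } else {
            tokenCount++  // first time we’ve seen this token at all
            tokenOccurrences[token] = [category: 1]
        }
    }

    private func incrementCategory(category: Category) {
        categoryOccurrences[category] = (categoryOccurrences[category] ?? 0) + 1
    }

    // The training and classifying methods shown below live inside this class.
}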
Training isn’t too interesting: trainWithText tokenizes the text, passes the tokens to trainWithTokens, and counts occurrences. Classifying computes a score for every category the string could fall under and classifies it with the category of maximum score. You start by defining a max category and its score, which are updated every time a category’s score is greater than the current max; we then set the new category, update the score, and return the category that wins.
// MARK: - Training

public func trainWithText(text: String, category: Category) {
    trainWithTokens(tokenizer(text), category: category)
}

public func trainWithTokens(tokens: [String], category: Category) {
    let tokens = Set(tokens)
    for token in tokens {
        incrementToken(token, category: category)
    }
    incrementCategory(category)
    trainingCount++
}

// MARK: - Classifying

public func classifyText(text: String) -> Category? {
    return classifyTokens(tokenizer(text))
}

public func classifyTokens(tokens: [String]) -> Category? {
    // Compute argmax_cat [log(P(C=cat)) + sum_token(log(P(W=token|C=cat)))]
    var maxCategory: Category?
    var maxCategoryScore = -Double.infinity
    for (category, _) in categoryOccurrences {
        let pCategory = P(category)
        let score = tokens.reduce(log(pCategory)) { (total, token) in
            // P(W=token|C=cat) = P(C=cat, W=token) / P(C=cat)
            total + log((P(category, token) + smoothingParameter) / (pCategory + smoothingParameter * Double(tokenCount)))
        }
        if score > maxCategoryScore {
            maxCategory = category
            maxCategoryScore = score
        }
    }
    return maxCategory
}
Here’s some sample training data: we give the model existing data and their categories. Once we’ve trained the model, we can give it some new data to see what it thinks the new strings should be classified as.
nbc.trainWithText("spammy spam spam", category: "spam")
nbc.trainWithText("spam has a lot of sodium and cholesterol", category: "spam")
nbc.trainWithText("nom nom ham", category: "ham")
nbc.trainWithText("please put the ham and eggs in the fridge", category: "ham")
In our first version of the classifier, we simply tokenize by splitting the text on whitespace. In our second go-around, we use the function we wrote earlier, partOfSpeech, and in the third we use lemmatize. Each version improves our classifier slightly so that it correctly classifies the new data we give the model.
let nbc = NaiveBayesClassifier { (text: String) -> [String] in
    return text.componentsSeparatedByString(" ")
}
nbc.classifyText("sodium and cholesterol") // "spam"
nbc.classifyText("spam and eggs") // "spam"
nbc.classifyText("do you like spam?") // "spam"
nbc.classifyText("use the eggs in the fridge") // "ham"
nbc.classifyText("ham and eggs") // "ham"
nbc.classifyText("do you like ham?") // "spam", INCORRECT
nbc.classifyText("do you eat egg?") // "spam", INCORRECT
// partOfSpeech
let nbc = NaiveBayesClassifier { (text: String) -> [String] in
    return partOfSpeech(text).map { (token, _) in
        return token
    }
}
nbc.classifyText("do you like ham?") // changed to "ham"
nbc.classifyText("do you eat egg?") // "spam"
// lemmatize
let nbc = NaiveBayesClassifier { (text: String) -> [String] in
    return lemmatize(text).map { (token, tag) in
        return tag ?? token
    }
}
nbc.classifyText("do you eat egg?") // now is "ham"!
Relevant links (30:09)
- Natural Language Toolkit
- Parsimmon, an Obj-C linguistics toolkit for iOS
Q&A (31:56)
Q: Why do you take a log of the formula before computing probabilities?
Ayaka: If you don’t take the log and you have hundreds of thousands of words, each probability is going to be a very small positive number. If you multiply all of those together, the result becomes an extremely small floating point number, which can lead to floating point underflow. By taking the log, you can add things together instead and not worry about getting a number so small that computers can’t represent it.
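As a quick illustration of the underflow problem (with made-up numbers, not from the talk):

import Foundation

// Pretend each of 1,000 words has probability 0.001 given a category.
let probabilities = [Double](count: 1000, repeatedValue: 0.001)

// Multiplying underflows to 0.0 long before the last word...
let product = probabilities.reduce(1.0) { $0 * $1 }     // 0.0

// ...while summing the logs stays a perfectly ordinary number.
let logSum = probabilities.reduce(0.0) { $0 + log($1) } // about -6907.8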
Q: Do you use this sort of technology for Venmo as well?
Ayaka: No, this is just based on my side project Parsimmon. We don’t use anything related to this in Venmo.
Q: How would you extend this to also consider not just words but phrases that seem more spammy than others?
Ayaka: Instead of using just individual words, you can look at clusters of words. One option is bigrams, adjacent pairs of words, and group words together that way.
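As a rough sketch of that idea (not something from the talk), a bigram tokenizer could be plugged into the same classifier like this:

// Hypothetical bigram tokenizer: joins each adjacent pair of words into one token,
// so a phrase like "free money" becomes a single feature for the classifier.
func bigrams(text: String) -> [String] {
    let words = text.componentsSeparatedByString(" ")
    if words.count < 2 {
        return words
    }
    var result: [String] = []
    for i in 0..<(words.count - 1) {
        result.append(words[i] + " " + words[i + 1])
    }
    return result
}

let bigramNBC = NaiveBayesClassifier { (text: String) -> [String] in
    return bigrams(text)
}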
Q: You’re mapping quite a large number of words, so how can you actually scale this on a device where you have limited memory?
Ayaka: I haven’t tried load testing this method, but you’re welcome to try. There’s a bunch of examples online with data sets you can use that people use for research. You can test the limits using those and see how it does.
Q: How configurable is NSLinguisticTagger? In NLTK you can have different strategies like lemmatizing and stemming, so what kind of options does this give you in comparison?
Ayaka: Not that many. If you look at the options, you can omit words, punctuation, whitespace, and “other”. It’s not super configurable and it can’t do much more than what I showed you, but it’s still cool that it exists.
Q: How much of Siri is being done by the server and how much on the device?
Ayaka: I have no idea, or if it’s actually even used in Siri. It’s just speculation, since Apple won’t really say.