The two most common tokenization methods split a given string on whitespace or on non-word characters. Bloodhound provides implementations of both out of the box:
// returns ['one', 'two', 'twenty-five']
Bloodhound.tokenizers.whitespace(' one two twenty-five');

// returns ['one', 'two', 'twenty', 'five']
Bloodhound.tokenizers.nonword(' one two twenty-five');
For query tokenization, you'll almost always want to use one of the above methods. For datum tokenization, you may want to do something more sophisticated.
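For reference, here is a minimal sketch of where each tokenizer plugs in; the engine name and sample data are mine, not from the original:

// Minimal sketch: tokenizers are passed to the Bloodhound constructor.
// queryTokenizer splits the user's query; datumTokenizer splits each datum.
var engine = new Bloodhound({
  datumTokenizer: Bloodhound.tokenizers.whitespace,
  queryTokenizer: Bloodhound.tokenizers.whitespace,
  local: ['one', 'two', 'twenty-five'] // hypothetical sample data
});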
For data, you sometimes want tokens to be derived from several properties. For example, if you were building a search engine for GitHub repositories, it would probably be wise to have tokens derived from the repo's name, owner, and primary language:
var repos = [
  { name: 'example', owner: 'John Doe', language: 'JavaScript' },
  { name: 'another example', owner: 'Joe Doe', language: 'Scala' }
];

function customTokenizer(datum) {
  var nameTokens = Bloodhound.tokenizers.whitespace(datum.name);
  var ownerTokens = Bloodhound.tokenizers.whitespace(datum.owner);
  var languageTokens = Bloodhound.tokenizers.whitespace(datum.language);

  return nameTokens.concat(ownerTokens).concat(languageTokens);
}
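Wiring that up would look roughly like this (the engine name is mine):

var repoEngine = new Bloodhound({
  datumTokenizer: customTokenizer, // derives tokens from name, owner, and language
  queryTokenizer: Bloodhound.tokenizers.whitespace,
  local: repos
});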
There may also be a scenario in which you want datum tokenization to be handled on the backend. The best way to do this is to simply add a property to your data containing those tokens. You can then provide a tokenizer that just returns the already-existing tokens:
var sports = [
  { value: 'football', tokens: ['football', 'pigskin'] },
  { value: 'basketball', tokens: ['basketball', 'bball'] }
];

function customTokenizer(datum) {
  return datum.tokens;
}
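Since pre-computed tokens usually arrive from the backend along with the data itself, you'd likely load them via prefetch or remote rather than local. A sketch, with a hypothetical endpoint:

var sportsEngine = new Bloodhound({
  datumTokenizer: customTokenizer, // just hands back the precomputed datum.tokens
  queryTokenizer: Bloodhound.tokenizers.whitespace,
  prefetch: '/sports.json' // hypothetical endpoint returning the array above
});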
There are many other ways to approach datum tokenization; it really just depends on what you're trying to accomplish.
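For instance, recent versions of Bloodhound also ship object-aware variants of the built-in tokenizers; if I remember the API correctly, the hand-rolled repo tokenizer above could be shortened to:

// Roughly equivalent to customTokenizer for repos: tokenizes each listed
// property with the whitespace tokenizer and concatenates the results.
var repoTokenizer = Bloodhound.tokenizers.obj.whitespace('name', 'owner', 'language');
// repoTokenizer(repos[0]) => ['example', 'John', 'Doe', 'JavaScript']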
Unfortunately, this information isn't easy to find in the main documentation.