Probabilistic string matching in Python

I am writing a bot that bets on the Betfair website using their Python API. I want to bet on football matches while they are in play.

I have coded up an XML feed to give me live data from the games; however, the XML feed does not always use the same names for football teams as Betfair does.

For example, Betfair may refer to Manchester United as "Man Utd", while the XML feed may use "Manchester United" or some other variant. I'm not limited to popular markets, so building a standard Betfair-to-XML conversion table is not feasible.

I'm trying to use some kind of probabilistic string matching to give me an indication of whether the names from the two data sources refer to the same teams.

So far I have been playing with reverend, which seems to do Bayesian calculations; however, I don't think I am using it correctly, since I need to split the string into characters before training the guesser. I then simply average the probabilities that each letter is associated with each name. I know this is mathematically incorrect, but I thought it might be a workable heuristic.

Here is my code:

    import scorefeed
    from reverend.thomas import Bayes

    guesser = Bayes()
    teams = ['home', 'away']


    def train(team_no, name):
        # Train the guesser on each character of the team name.
        for char in name:
            guesser.train(teams[team_no], char)


    def untrain(team_no, name):
        # Undo the training so the next game starts from a clean guesser.
        for char in name:
            guesser.untrain(teams[team_no], char)


    def guess(name):
        home_guess = 0.0
        away_guess = 0.0
        for char in name:
            # Each result is a (label, probability) pair.
            for label, prob in guesser.guess(char):
                if label == teams[0]:
                    home_guess = home_guess + prob
                    print(home_guess)  # debug output
                if label == teams[1]:
                    away_guess = away_guess + prob
                    print(away_guess)  # debug output
        # Average the per-character probabilities over the name length.
        home_guess = home_guess / float(len(name))
        away_guess = away_guess / float(len(name))
        return [home_guess, away_guess]


    def game_match(betfair_game_string, feed_home, feed_away):
        # Split on ' V ' so a capital V inside a team name is not
        # mistaken for the separator.
        home_team, away_team = betfair_game_string.split(' V ', 1)
        train(0, home_team)
        train(1, away_team)
        probs = [guess(feed_home)[0], guess(feed_away)[1]]
        untrain(0, home_team)
        untrain(1, away_team)
        return probs


    print(game_match("Man Utd V Lpool", "Manchester United", "Liverpool"))

The probabilities obtained with the current setup are [0.4705411764705883, 0.5555]. I would really appreciate any ideas or improvements.

EDIT: I have had another thought. What I need is the probability that the match on Betfair and the match in the feed are the same match, but the code above gives me the probability that the first names match and the probability that the second names match separately. I need the probability that the first AND second names both match. So I coded the following function, which seems to give me more reasonable results:

    def prob_match(probs):
        prob_not_home = 1.0 - probs[0]
        prob_not_away = 1.0 - probs[1]
        prob_not_home_and_away = prob_not_home * prob_not_away
        prob_home_and_away = 1.0 - prob_not_home_and_away
        return prob_home_and_away
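A note on the calculation: 1 - (1 - p_home) * (1 - p_away) is the probability that at least one of the two names matches (home OR away), not that both do. If the two matches are treated as independent, which is an assumption, the probability that both match is simply the product. A minimal sketch contrasting the two, with function names chosen here for illustration:

```python
def prob_both(p_home, p_away):
    # P(home AND away) = P(home) * P(away), assuming independence.
    return p_home * p_away


def prob_either(p_home, p_away):
    # P(home OR away) = 1 - P(not home) * P(not away), assuming independence.
    # This is the quantity prob_match above actually computes.
    return 1.0 - (1.0 - p_home) * (1.0 - p_away)


probs = [0.4705411764705883, 0.5555]
print(prob_both(*probs))
print(prob_either(*probs))
```

The OR value is always at least as large as either input, which may be why it looks more reasonable; the AND value is never larger than the smaller input.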

I would appreciate any suggestions on various methods or recommendations of existing libraries that do the same, or advice on correcting my probabilistic calculations.

1 answer

Here is my advice. Read http://norvig.com/spell-correct.html, implement something based on it, and see how well it works. Hopefully it will work well enough.
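As a quick baseline to compare against before implementing anything more elaborate, the standard library's difflib can score how similar two names are. This is not Norvig's method, just a simpler swapped-in technique; the helper names and the list of feed teams below are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates):
    """Return (candidate, score) for the candidate most similar to name."""
    scored = ((c, similarity(name, c)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical names as they might appear in the XML feed.
feed_teams = ["Manchester United", "Liverpool"]

print(best_match("Man Utd", feed_teams))
print(best_match("Lpool", feed_teams))
```

The score is a similarity ratio, not a probability, so any threshold on it would have to be tuned against real data.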

Speed it up by caching the results on the fly, so that once it has worked out a guess for a given name, it simply reuses that guess.
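One way to sketch the caching is with functools.lru_cache, so each distinct name is only scored once. The matcher used here (difflib) and the FEED_TEAMS tuple are stand-ins chosen for illustration; any guessing function that takes a name could be wrapped the same way.

```python
from difflib import SequenceMatcher
from functools import lru_cache

# Hypothetical list of names from the XML feed.
FEED_TEAMS = ("Manchester United", "Liverpool")


@lru_cache(maxsize=None)
def cached_guess(betfair_name):
    """Return the most similar feed name; repeated calls hit the cache."""
    def score(candidate):
        return SequenceMatcher(None, betfair_name.lower(), candidate.lower()).ratio()
    return max(FEED_TEAMS, key=score)


cached_guess("Man Utd")   # computed
cached_guess("Man Utd")   # served from the cache
print(cached_guess.cache_info())
```

If the feed's list of names changes during a session, the cache would need clearing with cached_guess.cache_clear().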

Your implementation should have an exception report for the most dubious guesses so that you can manually review and either reject or correct them.
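The exception report could be as simple as splitting the guesses by a score threshold and listing the least certain ones first. The threshold of 0.6, the tuple layout, and the scores below are all illustrative assumptions, not measured results.

```python
def split_for_review(guesses, threshold=0.6):
    """Split (source_name, guessed_name, score) tuples into accepted and dubious.

    Dubious guesses are sorted with the lowest score first so the least
    certain ones are reviewed before the rest.
    """
    accepted = [g for g in guesses if g[2] >= threshold]
    dubious = sorted((g for g in guesses if g[2] < threshold), key=lambda g: g[2])
    return accepted, dubious


# Made-up scores, purely to show the shape of the report.
guesses = [
    ("Lpool", "Liverpool", 0.71),
    ("Man Utd", "Manchester United", 0.58),
]
accepted, dubious = split_for_review(guesses)
for source_name, guessed_name, score in dubious:
    print("REVIEW: %s -> %s (score %.2f)" % (source_name, guessed_name, score))
```

Names that have been manually confirmed or corrected can then be stored so they never need guessing again.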
