Check if string is a valid abbreviation for name

Question

I am trying to develop a Python algorithm to check whether a string can be an abbreviation for another word or name. For example:

fck is a match for fc kopenhavn since it matches the first characters of the words. fhk does not match.
fco must not match fc kopenhavn because nobody would abbreviate FC Kopenhavn as FCO.
irl is a match for in real life .
ifk is a match for ifk goteborg .
aik is a match for allmanna idrottskluben .
aid is a match for allmanna idrottsklubben . This is not a real abbreviation of the team name, but I think it is difficult to exclude unless you apply domain knowledge about how Swedish abbreviations are formed.
manu is a match for manchester united .

It is difficult to describe the exact rules of the algorithm, but I hope that my examples show what I need.

Update: I made a mistake by showing the strings with the matching letters in upper case. In the real scenario all letters are lowercase, so it is not as simple as checking which letters are uppercase.

+8

python string-matching slug abbreviation text-analysis

Björn Lindqvist Sep 7 '11 at 9:20

5 answers

This should do what you want.

    import re

    def is_abbrev(abbrev, text):
        pattern = ".*".join(abbrev.lower())
        return re.match("^" + pattern, text.lower()) is not None

The caret (^) ensures that the first character of the abbreviation matches the first character of the text, which should hold for most abbreviations.
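To make the pattern concrete, here is a minimal sketch of the same idea. The helper name build_pattern is hypothetical and only there for illustration, and the sketch assumes the abbreviation contains plain letters only (no regex metacharacters). It shows the regex that is generated and how it behaves on three inputs from the question; note that this version still accepts fco.

```python
import re

def build_pattern(abbrev):
    # Hypothetical helper: join the letters with ".*", e.g. "fck" -> "f.*c.*k".
    # Assumes abbrev holds plain letters only (nothing is regex-escaped).
    return ".*".join(abbrev.lower())

def is_abbrev(abbrev, text):
    # re.match only matches at the start of the text, so the first
    # letter of the abbreviation must be the first letter of the text.
    return re.match(build_pattern(abbrev), text.lower()) is not None

print(build_pattern("fck"))
for abbrev in ("fck", "fhk", "fco"):
    print(abbrev, is_abbrev(abbrev, "fc kopenhavn"))
```

Because ".*" may skip any number of characters, fco is accepted here (the o is found inside kopenhavn), which the question explicitly does not want.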

Edit: Your update has slightly changed the rules. If you join with "(|.*\s)" instead of ".*", consecutive abbreviation characters only match if they are adjacent in the text or if the next character appears at the beginning of a new word.

This matches fck with FC Kopenhavn but not fco. However, matching aik with allmanna idrottskluben will not work, since that requires knowledge of the Swedish language and is not so trivial.

Here is the new code with that slight modification:

    import re

    def is_abbrev(abbrev, text):
        # Raw string so that \s reaches the regex engine unchanged
        pattern = r"(|.*\s)".join(abbrev.lower())
        return re.match("^" + pattern, text.lower()) is not None
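As a quick check, the following sketch runs this stricter pattern against the examples from the question. It repeats the function so that it is self-contained; the examples list and the results dictionary are names chosen here for illustration only.

```python
import re

def is_abbrev(abbrev, text):
    # Between two abbreviation letters allow either nothing (adjacent letters)
    # or any run of characters ending in whitespace (jump to a word start).
    pattern = r"(|.*\s)".join(abbrev.lower())
    return re.match(pattern, text.lower()) is not None

# Example pairs taken from the question
examples = [
    ("fck", "fc kopenhavn"),
    ("fco", "fc kopenhavn"),
    ("irl", "in real life"),
    ("ifk", "ifk goteborg"),
    ("aik", "allmanna idrottskluben"),
    ("aid", "allmanna idrottskluben"),
    ("manu", "manchester united"),
]

results = {(a, t): is_abbrev(a, t) for a, t in examples}
for (a, t), ok in results.items():
    print(a, "|", t, "|", ok)
```

With this pattern fco is rejected, and aik is rejected as well because the k sits in the middle of the second word; the other five examples are accepted.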

+4

Michael Brennan Sep 7 '11 at 10:05

@Ocaso Protal asked in a comment how you would decide that aik is valid but aid is not, and he is right.

The algorithm that occurred to me works with a word-count threshold (the number of words separated by spaces).

    words = string.strip().split()
    if len(words) > 2:
        # take the first letter of every word
        abbrev = "".join(w[0] for w in words)
    elif len(words) == 2:
        # take two letters from the first word and one letter from the other
        abbrev = words[0][:2] + words[1][0]
    else:
        # single word: take the first three letters, or as you like
        abbrev = string.strip()[:3]

You must define your own logic; you cannot find the abbreviation blindly.

+4

Aamir Adnan Sep 7 '11 at 10:28

Your algorithm seems simple: the abbreviation is the concatenation of all uppercase letters. So:

    upper_case_letters = "QWERTYUIOPASDFGHJKLZXCVBNM"

    # collect every uppercase letter of the word
    abbrevation = ""
    for letter in word_i_want_to_check:
        if letter in upper_case_letters:
            abbrevation += letter

    # compare against the known abbreviations
    for abb in _list_of_abbrevations:
        if abb == abbrevation:
            great_success()

0

Dominik Sep 7 '11 at 9:28

That might be enough.

    def is_abbrevation(abbrevation, word):
        lowword = word.lower()
        lowabbr = abbrevation.lower()
        # every character of the abbreviation must occur somewhere in the word
        for c in lowabbr:
            if c not in lowword:
                return False
        return True

    print(is_abbrevation('fck', 'FC Kopenhavn'))

0

Juho Sep 7 '11 at 9:30

unutbu · Accepted Answer · 2011-09-07T09:28:07+0000

This passes all the tests, including some additional ones that I created. It uses recursion. Here are the rules I used:

- The first letter of the abbreviation must match the first letter of the text.
- The rest of the abbreviation (the abbreviation minus its first letter) must be an abbreviation for:
  - the remaining words, or
  - the remaining text starting at any position in the first word.

    tests = (
        ('fck', 'fc kopenhavn', True),
        ('fco', 'fc kopenhavn', False),
        ('irl', 'in real life', True),
        ('irnl', 'in real life', False),
        ('ifk', 'ifk gotebork', True),
        ('ifko', 'ifk gotebork', False),
        ('aik', 'allmanna idrottskluben', True),
        ('aid', 'allmanna idrottskluben', True),
        ('manu', 'manchester united', True),
        ('fz', 'faz zoo', True),
        ('fzz', 'faz zoo', True),
        ('fzzz', 'faz zoo', False),
    )

    def is_abbrev(abbrev, text):
        abbrev = abbrev.lower()
        text = text.lower()
        words = text.split()
        if not abbrev:
            return True
        if abbrev and not text:
            return False
        if abbrev[0] != text[0]:
            return False
        else:
            return (is_abbrev(abbrev[1:], ' '.join(words[1:])) or
                    any(is_abbrev(abbrev[1:], text[i+1:])
                        for i in range(len(words[0]))))

    for abbrev, text, answer in tests:
        result = is_abbrev(abbrev, text)
        print(abbrev, text, result, answer)
        assert result == answer
