How to classify / categorize strings according to regex rules in Python

I am writing an ETL script in Python that receives data in CSV files, validates and sanitizes the data, categorizes or classifies each row according to some rules, and finally loads it into a PostgreSQL database.

The data looks like this (simplified):

ColA, ColB, Timestamp, Timestamp, Journaltext, AmountA, AmountB

Each row represents a financial transaction. I want to categorize or classify the transactions based on some rules. The rules are basically regular expressions that match the text in the Journaltext column.

So what I want to do is something like this:

transactions = []
for row in rows:
    t = Transaction(category=classify(row.journaltext))
    transactions.append(t)

I'm not sure how to write the classify() function efficiently.

This is how the classification rules work:

  • There are several categories (more can and will be added later)
  • Each category has a set of substrings or regular expressions. If the Journaltext of a transaction matches one of these expressions or contains one of these substrings, the transaction belongs to that category.
  • A transaction can only be in one category.
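  • If category FOO has the substrings "foo" and "Foo", and category BAR has a more specific substring, then a transaction with Journaltext = 'food' belongs to FOO, while one with Journaltext = 'footballs' belongs to BAR.
  • If a transaction matches no category, it is classified as "UNKNOWN".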

Ok. So how do I do this in Python?


+5
2

You could do something like this:

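# Categories are checked in order; the first category with a matching substring wins.
categories = [
    ('cat1', ['foo']),
    ('cat2', ['football']),
    ('cat3', ['abc', 'aba', 'bca']),
]

def classify(text):
    for category, matches in categories:
        # "match in text" is a plain substring test
        if any(match in text for match in matches):
            return category
    return None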

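Python's in operator performs the substring test. If some rules need to be regular expressions, you could check isinstance(match, str) and treat any other entry as a compiled pattern.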
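To handle both plain substrings and regular expressions in one rule list, a sketch along these lines might work. It assumes every rule is either a str or a pattern built with re.compile; the matches() helper, the category names, and the patterns are made up for illustration, and 'football' is listed before 'foo' so that the more specific rule is tried first:

```python
import re

# Hypothetical rules for illustration: each entry is either a plain
# substring or a compiled regular expression. Order sets the priority.
categories = [
    ('cat2', ['football']),
    ('cat1', ['foo', re.compile(r'^Foo\d+')]),
    ('cat3', ['abc', 'aba', 'bca']),
]

def matches(rule, text):
    # Plain strings use the substring test; anything else is assumed
    # to be a compiled pattern and is searched anywhere in the text.
    if isinstance(rule, str):
        return rule in text
    return rule.search(text) is not None

def classify(text):
    for category, rules in categories:
        if any(matches(rule, text) for rule in rules):
            return category
    return None
```

With these example rules, classify('footballs') gives 'cat2' and classify('food') gives 'cat1'.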

+2

Another approach:

import re

def classify(journaltext):
    # Categories in priority order. Placeholder: you have to give the full list here.
    prio_list = ["FOO", "BAR", "UPS"]
    # dictionary:
    # - key is the name of the category, must match the name in the above prio_list
    # - value is the regex that identifies the category (placeholders here)
    matchers = {
        "FOO": "the regex for FOO",
        "BAR": "the regex for BAR",
        "UPS": "...",
    }
    for category in prio_list:
        # re.search matches anywhere in the text; re.match would only match at the start
        if re.search(matchers[category], journaltext):
            return category
    return "UNKNOWN"  # or you can "return None"

Notes:

  • prio_list sets the priority: the categories are tested in that order, and the first one that matches wins.
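If the same rules are applied to many rows, one variation is to compile the patterns once at module level and keep priority and pattern together in a single ordered list. This is only a sketch: the RULES name and the patterns are assumptions for illustration, not the real rules for FOO or BAR.

```python
import re

# Hypothetical rules, listed in priority order (first match wins).
# The patterns are placeholders for illustration, not real rules.
RULES = [
    ("BAR", re.compile(r"football")),
    ("FOO", re.compile(r"foo|Foo")),
]

def classify(journaltext):
    # Try each category in priority order and stop at the first hit.
    for category, pattern in RULES:
        if pattern.search(journaltext):
            return category
    return "UNKNOWN"
```

Adding a category then means adding one tuple at the right position in RULES.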

+2
