I am writing an ETL script in Python that receives data in CSV files, validates and cleans the data, categorizes or classifies each row according to some rules, and finally loads it into a PostgreSQL database.
The data looks like this (simplified):
ColA, ColB, Timestamp, Timestamp, Journaltext, AmountA, AmountB
Each row represents a financial transaction. I want to categorize or classify the transactions based on some rules. The rules are basically regular expressions that are matched against the text in the Journaltext column.
So what I want to do is something like this:
transactions = []
for row in rows:
    t = Transaction(category=classify(row.journaltext))
    transactions.append(t)
I'm not sure how to write the classify() function efficiently.
The classification rules are as follows:
- There are several categories (more can and will be added later)
- Each category has a set of substrings or regular expressions. If a transaction's Journaltext matches one of these expressions or contains one of these substrings, the transaction belongs to that category.
- A transaction can only be in one category.
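- For example, if category FOO has the substrings "foo" and "Foo", and category BAR has its own pattern, then Journaltext = 'food' should be classified as FOO, while Journaltext = 'footballs' should be classified as BAR.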
- If a transaction matches no category, it is assigned the category "UNKNOWN".
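One possible shape for classify() under the rules listed above is an ordered list of categories, each with its patterns compiled once, where the first matching category wins and "UNKNOWN" is the fallback. This is only a minimal sketch: the rule order (more specific categories first) is an assumption about how ties are resolved, the BAR pattern "football" is hypothetical and chosen purely for illustration, and the names RULES, COMPILED_RULES and DEFAULT_CATEGORY are made up for this example.

```python
import re

# Ordered rules: earlier entries take precedence over later ones.
# ASSUMPTION: precedence by order is not stated in the question.
# ASSUMPTION: the BAR pattern "football" is hypothetical.
RULES = [
    ("BAR", [r"football"]),
    ("FOO", [r"foo", r"Foo"]),
]

# Join each category's alternatives into one regex and compile it once,
# so classify() does not recompile patterns for every row.
COMPILED_RULES = [
    (category, re.compile("|".join(f"(?:{p})" for p in patterns)))
    for category, patterns in RULES
]

DEFAULT_CATEGORY = "UNKNOWN"


def classify(journaltext):
    """Return the first category whose pattern is found in journaltext."""
    for category, pattern in COMPILED_RULES:
        if pattern.search(journaltext):
            return category
    # No rule matched, so fall back to the default category.
    return DEFAULT_CATEGORY


print(classify("food"))       # matched by the FOO rule
print(classify("footballs"))  # matched by the BAR rule, which is checked first
print(classify("rent"))       # no rule matches
```

Adding a category then only means adding an entry to RULES. If some patterns are meant as plain substrings rather than regular expressions, they could be passed through re.escape() before being joined.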
Ok. So how would I do this in Python?