Methods for extracting regular expressions from a tagged dataset

Suppose I have a dataset of several hundred thousand rows (which happens to be natural language sentences, if that matters), each of which is labeled with a specific β€œlabel”. Each sentence is marked with exactly one label, and there are about 10 labels, each of which contains about 10% of the data set that belongs to them. There is a high degree of similarity to the sentence structure within the label.

I know that the above sounds like a classic example of a machine learning problem, but I want to ask a slightly different question. Are there any well-known methods for programmatically generating a set of regular expressions for each label that can successfully classify training data while still summarizing future test data?

I would be very pleased with the references to the literature; I understand that this will not be a simple algorithm :)

PS: I know that the usual way to classify is with machine learning methods such as SVM or such. However, I am clearly looking for a way to generate regular expressions. (I would be pleased with machine learning technology to create regular expressions, rather than using machine learning methods for the classification itself!)

+7
source share
3 answers

As far as I know, this is the subject of ongoing research into evolutionary computing.

Here are some examples:

See slides 40-44 on

https://cs.byu.edu/sites/default/files/Ken_De_Jong_slides.pdf (slides exist when sending this answer).

Also see

http://www.citeulike.org/user/bartolialberto/article/10710768

for a more detailed overview of the system presented at GECCO 2012.

+1
source

This problem is usually formed as the creation of finite state machines from sets of strings, rather than regular expressions, although you can obviously generate REs from FA, since they are equivalent .

If you are looking for the induction of automata , you can find quite a lot of literature on this topic, including GA approaches.

+3
source

Note. . This can help. This function below generates a RegEx pattern for a given value of a and b . Where a and b are both alpha strings. And the function will create a fair RegEx pattern to match the range between a and b . The function will take only the first three characters to create the template and create a result , which can be something like the starts-with() function in some language with a hint of the general benefit of RegEx.

A simple example of VB.NET

 Public Function GetRangePattern(ByVal f_surname As String, ByVal l_surname As String) As String Dim f_sn, l_sn As String Dim mnLength% = 0, mxLength% = 0, pdLength% = 0, charPos% = 0 Dim fsn_slice$ = "", lsn_slice$ = "" Dim rPattern$ = "^" Dim alphas As New Collection Dim tmpStr1$ = "", tmpStr2$ = "", tmpStr3$ = "" '///init local variables f_sn = f_surname.ToUpper.Trim l_sn = l_surname.ToUpper.Trim '///do null check If f_sn.Length = 0 Or l_sn.Length = 0 Then Return "-!ERROR!-" End If '///return if both equal If StrComp(f_sn, l_sn, CompareMethod.Text) = 0 Then Return "^" & f_sn & "$" End If '///return if 1st_name present in 2nd_name If InStr(1, l_sn, f_sn, CompareMethod.Text) > 0 Then tmpStr1 = f_sn tmpStr2 = l_sn.Replace(f_sn, vbNullString) If Len(tmpStr2) > 1 Then tmpStr3 = "[A-" & tmpStr2.Substring(1) & "]*" Else tmpStr3 = tmpStr2 & "*" End If tmpStr1 = "^" & tmpStr1 & tmpStr3 & ".*$" tmpStr1 = tmpStr1.ToUpper Return tmpStr1 End If '///initialize alphabets alphas.Add("A", CStr(Asc("A"))) alphas.Add("B", CStr(Asc("B"))) alphas.Add("C", CStr(Asc("C"))) alphas.Add("D", CStr(Asc("D"))) alphas.Add("E", CStr(Asc("E"))) alphas.Add("F", CStr(Asc("F"))) alphas.Add("G", CStr(Asc("G"))) alphas.Add("H", CStr(Asc("H"))) alphas.Add("I", CStr(Asc("I"))) alphas.Add("J", CStr(Asc("J"))) alphas.Add("K", CStr(Asc("K"))) alphas.Add("L", CStr(Asc("L"))) alphas.Add("M", CStr(Asc("M"))) alphas.Add("N", CStr(Asc("N"))) alphas.Add("O", CStr(Asc("O"))) alphas.Add("P", CStr(Asc("P"))) alphas.Add("Q", CStr(Asc("Q"))) alphas.Add("R", CStr(Asc("R"))) alphas.Add("S", CStr(Asc("S"))) alphas.Add("T", CStr(Asc("T"))) alphas.Add("U", CStr(Asc("U"))) alphas.Add("V", CStr(Asc("V"))) alphas.Add("W", CStr(Asc("W"))) alphas.Add("X", CStr(Asc("X"))) alphas.Add("Y", CStr(Asc("Y"))) alphas.Add("Z", CStr(Asc("Z"))) '///populate max-min length values mxLength = f_sn.Length If l_sn.Length > mxLength Then mnLength = mxLength mxLength = l_sn.Length Else mnLength = l_sn.Length End If '///padding values pdLength = mxLength - mnLength f_sn = f_sn.PadRight(mxLength, "A") 'f_sn = f_sn.PadRight(mxLength, "~") l_sn = l_sn.PadRight(mxLength, "Z") 'l_sn = l_sn.PadRight(mxLength, "~") '///get a range like A??-B?? If f_sn.Substring(0, 1).ToUpper <> l_sn.Substring(0, 1).ToUpper Then fsn_slice = f_sn.Substring(0, 3).ToUpper lsn_slice = l_sn.Substring(0, 3).ToUpper tmpStr1 = fsn_slice.Substring(0, 1) & fsn_slice.Substring(1, 1) & "[" & fsn_slice.Substring(2, 1) & "-Z]" tmpStr2 = lsn_slice.Substring(0, 1) & lsn_slice.Substring(1, 1) & "[A-" & lsn_slice.Substring(2, 1) & "]" tmpStr3 = "^(" & tmpStr1 & "|" & tmpStr2 & ").*$" Return tmpStr3 End If '///looping charwise For charPos = 0 To mxLength fsn_slice = f_sn.Substring(charPos, 1) lsn_slice = l_sn.Substring(charPos, 1) If StrComp(fsn_slice, lsn_slice, CompareMethod.Text) = 0 Then rPattern = rPattern & fsn_slice Else 'rPattern = rPattern & "(" If charPos < mxLength Then Try If Asc(fsn_slice) < Asc(lsn_slice) Then tmpStr1 = fsn_slice & "[" & f_sn.Substring(charPos + 1, 1) & "-Z" & "]|" If CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) < CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) Then tmpStr2 = "[" & CStr(alphas.Item(Key:=CStr(Asc(fsn_slice) + 1))) & "-" & CStr(alphas.Item(Key:=CStr(Asc(lsn_slice) - 1))) & "]|" Else tmpStr2 = vbNullString End If tmpStr3 = lsn_slice & "[A-" & l_sn.Substring(charPos + 1, 1) & "]" rPattern = rPattern & "(" & tmpStr1 & tmpStr2 & tmpStr3 & ").*$" 'MsgBox("f_sn:= " & f_sn & " -- l_sn:= " & l_sn & vbCr & rPattern) Exit For Else Return "-#ERROR#-" End If Catch ex As Exception Return "-|ERROR|-" & ex.Message End Try End If End If Next charPos Return rPattern End Function 

And he is called

 ?GetRangePattern("ABC","DEF") 

creates

 "^(AB[CZ]|DE[AF]).*$" 
0
source

All Articles