How to optimize the performance of regular expressions?

I have a very long regex. My regular expression is a combination of about 5,000 or more phrases.

In addition, the text on which I execute the regex is also huge. The text size is about 5 kilobytes.

Since both the regular expression and the input text are huge, it takes at least 2 minutes to execute the regular expression, which is not acceptable in my project.

So, I would like to know how I can optimize this. One way I can think of is to split the regex and use multiple threads to minimize execution time. Is this the right option or is there another way?

Part of my regex looks like this:

(ACS|ADDR.com Technologies|ADP private limited|ADP|ADP India private limited|AIT Software Services PTE limited|AMK Technologies private limited|ANMSoft Technologies private limited|ANZ Information Technology private limited|ASD Global India private Limited|ASD India private Limited|ASM Technologies private limited|AXA Group Solutions India private limited|AXA technology India limited|Aarkay Infonet private limited|AbsolutData Research and Analytics private limited|Accenture India private limited|Accenture Services India|Accenture Services P Limited|Accenture Services private Limited|Accenture|Accenture Software Private Limited|Accurum India private limited|AceTechnologies Inc|Aclat Inc|AcmeCeeYess Softech Private Limited|Adaequare India private limited|Adaequare Info private limited|Adea International private limited|Adea Technologies|Adeptra|Aditi Technologies|Adobe Systems|Adroit Business Solutions|Adroit and Claretdene Infotech private limited|Affron Infotech|Agile Software Enterprise private limited|Agilent Technologies International private limited|Akebono Soft Technologies private limited|AkebonoSoft Technologies private limited|Akmin Technologies|Algorhythm Technologies private limited|Allsec Technologies private limited|Alphonso Informex private limited|Altria Client Services|Altruist India private limited|Amdocs|Amdocs Development Center India private limited|Amdocs Development Centre India|American CyberSystems|American Express Service India private limited|American Stock Exchange|Amrok Securities|Anish Information Technology private limited|Ankhnet Informations private limited|Apex Technologies private limited|AppLabs|AppLabs Technologies private limited|Appshark India|Apptix Software private limited|Aquila Technologies|Arcot R and D Software private limited|Arsin Systems private limited|Ascendum Solutions private limited|AskMe Software private limited|Atos Origin private limited|Atos Origin|Atos Origin India private limited|Aurigo Software Technologies private limited|Aurona Technologies private limited|Autopower Software Solutions|Aztecsoft|BMC Software India private limited|Balasai Net private limited|Bayon Solutions private limited|Beachwood Computing Limited|Birlasoft limited|Blue Bird Technologies private limited|Blue Fountain Media private limited|Blue Star InfoTech|Boden Inc|Boston|Braahamam Net Solutions private limited|Braahmam Net Solutions private limited|Brain Soft technology private limited|Brigade Corporation Private Limited|Business Link Automation India private limited|BusinessLink Automation private limited|C Ahead Info Technologies India private limited|CDI Corporation|CCG India private limited|CEM Solutions|CGI Information Systems and Management Consultants private limited|CGI Information Systems private limited|CGI Information System and Management Consultants private limited|CGI Information and Management private limited|CGI Netvorks|CISCO Systems India private limited|CMC Limited|COMSYS Inc|CORE SHELL TECHNOLOGIES|CRC Software India private limited|CRV Executive Search private limited|CS Software Solutions private Limited|CSC India private Limited|CSS Corp private limited|Cambridge Solutions Limited|Cambridge Solutions|Cambridge Solutions Sdn. Bhd|Candor Ind. private limited|Candor India private limited|Canvas Creatives private limited|Canvera|Capgemini Business Service India Limited|Capgemini private)

I am using C# for this.

Please enlighten me!

+7
5 answers

You can significantly improve the performance of this regular expression by adding \b at the beginning:

 \b(ACS| ... |Z) 

This stops the engine from attempting a match at every character position; it will only attempt one at the start of each word.
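A minimal sketch of the effect, written in Python for brevity (the \b syntax is the same in .NET). The three-phrase list, the sample text and the variable names are made up for illustration:

```python
import re

# Hypothetical three-phrase subset of the question's list.
phrases = ["ACS", "ADP", "Accenture"]
alternation = "|".join(re.escape(p) for p in phrases)

plain = re.compile("(" + alternation + ")")
anchored = re.compile(r"\b(" + alternation + ")")

text = "FACS hired ADP and Accenture"

plain_hits = plain.findall(text)        # also finds the "ACS" inside "FACS"
anchored_hits = anchored.findall(text)  # only matches that begin at a word boundary

print(plain_hits)
print(anchored_hits)
```

Besides narrowing the candidate start positions, the leading \b also changes what counts as a match, so it is only appropriate if phrases embedded inside longer words should be ignored.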

+8

You can optimize the regular expression using atomic grouping or using possessive quantifiers where possible.

Also, if you have things like .* or .+ in your regular expression, which can be real memory and run-time hogs, replace them with (possessive) character classes, again where possible.
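As a rough illustration of the character-class advice, here is a small Python sketch. The sample text and variable names are invented, and the example only shows a plain negated class, not atomic groups or possessive quantifiers:

```python
import re

text = 'name="ADP" city="Boston"'

# Greedy .* first runs to the end of the text, then backtracks to the last quote.
greedy = re.findall(r'"(.*)"', text)

# The negated class stops at the first closing quote, so the engine
# does not have to overshoot and back up.
bounded = re.findall(r'"([^"]*)"', text)

print(greedy)
print(bounded)
```

The two patterns also return different results here, which is a reminder that such a replacement has to be checked against the intended matches, not just timed.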

For more specific answers you will need to post your regular expression.

Good luck

+7

One optimization is to extract common prefixes. Change occurrences such as

 (This is some text|This is some other text) 

to

 This is some (text|other text) 

This should also be done at every level. Change occurrences such as

 ABCD|ADCB|BACD|BADC|BCAD|BCDA|BDAC|BDCA|CABD 

to

 A(BCD|DCB)|B(A(CD|DC)|C(AD|DA)|D(AC|CA))|CABD 

The point of this optimization is that the regex engine does not have to test the same characters multiple times.

This can be achieved by sorting the phrases and comparing adjacent elements. Be careful not to split metacharacters: you do not want to break in the middle of .* or \. .

Another way would be to use a trie structure to find the common prefixes. It is more robust, but a bit more complicated.
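The trie idea could be sketched roughly as follows in Python. This is an illustration under assumptions, not the answerer's code: `build_trie` and `trie_to_regex` are hypothetical helper names, the phrases are treated as literal text (escaped with `re.escape`), and non-capturing groups are used instead of the capturing groups shown above:

```python
import re

def build_trie(phrases):
    """Insert every phrase into a nested-dict trie; the "" key marks a phrase end."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-phrase marker
    return root

def trie_to_regex(node):
    """Emit a pattern in which every shared prefix is written only once."""
    is_end = "" in node
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(k for k in node if k != "")]
    if not branches:
        return ""                  # leaf: nothing follows the phrase
    if len(branches) == 1 and not is_end:
        return branches[0]         # single continuation: no group needed
    body = "(?:" + "|".join(branches) + ")"
    # If a phrase ends here, whatever follows is optional.
    return body + "?" if is_end else body

# The nine strings used in the example above.
words = ["ABCD", "ADCB", "BACD", "BADC", "BCAD", "BCDA", "BDAC", "BDCA", "CABD"]
pattern = trie_to_regex(build_trie(words))
print(pattern)

# Phrases where one is a prefix of another (taken from the question's list).
nested = ["ADP", "ADP India", "Accenture"]
nested_re = re.compile(trie_to_regex(build_trie(nested)))
```

One behavioural difference to keep in mind: in a backtracking engine a plain alternation such as ADP|ADP India takes the first alternative that fits, whereas the factored form tries the longer continuation first, so the reported match can be longer than before.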

+7

I know this is old, but still ...

Alternation ("or"), like the other standard operations (concatenation and repetition), does not need manual optimization: most regex engines optimize it when they compile the pattern. Sometimes manual factoring even has the opposite effect: too many groups can hurt performance, because the engine must keep track of each group's match.

What does hit performance hard are lookahead and lookbehind rules, which are not used in your query.

In this case, the author could add \b at the beginning and at the end of the query to require whole-word matching, which would significantly limit the positions at which the engine starts a match.
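A small Python sketch of that suggestion, with \b on both sides and a non-capturing group so that no per-group match has to be tracked. The two-phrase list and the sample text are assumptions made for the example:

```python
import re

phrases = ["ADP", "Accenture"]  # hypothetical subset of the question's list

# (?:...) groups without capturing; \b on both sides requires whole words.
whole_word = re.compile(r"\b(?:" + "|".join(map(re.escape, phrases)) + r")\b")

text = "ADPX is not ADP, and Accentures is not Accenture"
hits = whole_word.findall(text)

print(hits)
print(whole_word.groups)  # number of capturing groups in the pattern
```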

+2

Python example (there is also a C-tool for optimizing regular expressions at https://github.com/ksx123/regex-optimization ):

 import hachoir_regex
 optimized = hachoir_regex.parse("(ACS|ADDR.com Technologies|ADP private limited|ADP|ADP India private limited|AIT Software Services PTE limited|AMK Technologies private limited|ANMSoft Technologies private limited|ANZ Information Technology private limited|ASD Global India private Limited|ASD India private Limited|ASM Technologies private limited|AXA Group Solutions India private limited|AXA technology India limited|Aarkay Infonet private limited|AbsolutData Research and Analytics private limited|Accenture India private limited|Accenture Services India|Accenture Services P Limited|Accenture Services private Limited|Accenture|Accenture Software Private Limited|Accurum India private limited|AceTechnologies Inc|Aclat Inc|AcmeCeeYess Softech Private Limited|Adaequare India private limited|Adaequare Info private limited|Adea International private limited|Adea Technologies|Adeptra|Aditi Technologies|Adobe Systems|Adroit Business Solutions|Adroit and Claretdene Infotech private limited|Affron Infotech|Agile Software Enterprise private limited|Agilent Technologies International private limited|Akebono Soft Technologies private limited|AkebonoSoft Technologies private limited|Akmin Technologies|Algorhythm Technologies private limited|Allsec Technologies private limited|Alphonso Informex private limited|Altria Client Services|Altruist India private limited|Amdocs|Amdocs Development Center India private limited|Amdocs Development Centre India|American CyberSystems|American Express Service India private limited|American Stock Exchange|Amrok Securities|Anish Information Technology private limited|Ankhnet Informations private limited|Apex Technologies private limited|AppLabs|AppLabs Technologies private limited|Appshark India|Apptix Software private limited|Aquila Technologies|Arcot R and D Software private limited|Arsin Systems private limited|Ascendum Solutions private limited|AskMe Software private limited|Atos Origin private limited|Atos Origin|Atos Origin India private limited|Aurigo Software Technologies private limited|Aurona Technologies private limited|Autopower Software Solutions|Aztecsoft|BMC Software India private limited|Balasai Net private limited|Bayon Solutions private limited|Beachwood Computing Limited|Birlasoft limited|Blue Bird Technologies private limited|Blue Fountain Media private limited|Blue Star InfoTech|Boden Inc|Boston|Braahamam Net Solutions private limited|Braahmam Net Solutions private limited|Brain Soft technology private limited|Brigade Corporation Private Limited|Business Link Automation India private limited|BusinessLink Automation private limited|C Ahead Info Technologies India private limited|CDI Corporation|CCG India private limited|CEM Solutions|CGI Information Systems and Management Consultants private limited|CGI Information Systems private limited|CGI Information System and Management Consultants private limited|CGI Information and Management private limited|CGI Netvorks|CISCO Systems India private limited|CMC Limited|COMSYS Inc|CORE SHELL TECHNOLOGIES|CRC Software India private limited|CRV Executive Search private limited|CS Software Solutions private Limited|CSC India private Limited|CSS Corp private limited|Cambridge Solutions Limited|Cambridge Solutions|Cambridge Solutions Sdn. Bhd|Candor Ind. private limited|Candor India private limited|Canvas Creatives private limited|Canvera|Capgemini Business Service India Limited|Capgemini private)")
 len(str(optimized))  # has length 3048

By comparison, the original string has a length of 3399. The longer the string, the more there is to optimize. This uses the hachoir-regex library. You can use it in addition to adding \b as suggested.

0