RegEx for splitting camelCase or TitleCase (advanced)

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

(?<!^)(?=[A-Z])

It works as expected:

  • value → value
  • camelValue → camel / Value
  • TitleValue → Title / Value

For example, with Java:

    String s = "loremIpsum";
    String[] words = s.split("(?<!^)(?=[A-Z])");
    // words equals new String[]{"lorem", "Ipsum"}

My problem is that in some cases this does not work:

  • Case 1: VALUE → V / A / L / U / E
  • Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext

In my opinion, the result should be as follows:

  • Case 1: VALUE
  • Case 2: eclipse / RCP / Ext

In other words, given n uppercase characters:

  • If n characters are followed by lowercase letters, the groups must be: (n-1 characters) / (n-th char + lower characters)
  • If n characters are at the end, the group must be: (n characters).

Any idea on how to improve this regex?

+67
java regex camelcasing title-case
Sep 29 '11
8 answers

The following regular expression works for all the above examples:

    public static void main(String[] args) {
        for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
            System.out.println(w);
        }
    }

It works by making the negative lookbehind reject not only matches at the beginning of the string, but also matches where the uppercase letter is preceded by another uppercase letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt", because it does not split between "RCP" and "Ext". That is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every uppercase letter that is followed by a lowercase letter, except at the start of the string.
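
To make the two clauses easier to check, here is a small sketch that runs the regex from this answer over all the test cases in the question. The class name CamelCaseSplitDemo and the helper splitWords are made up for illustration.

```java
public class CamelCaseSplitDemo {
    // Clause 1: before an uppercase letter that is not at the start
    //           and not preceded by another uppercase letter.
    // Clause 2: before an uppercase letter followed by a lowercase letter,
    //           except at the start of the string.
    static final String SPLIT_REGEX =
            "(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])";

    // Hypothetical helper: splits a camelCase/TitleCase string into words.
    static String[] splitWords(String s) {
        return s.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + String.join(" / ", splitWords(sample)));
        }
    }
}
```

For "eclipseRCPExt" the first clause produces the split before "R" and the second clause the split before "E", which gives eclipse / RCP / Ext.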

+89
Sep 29 '11 at 7:45

It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

The only difference from your desired results is with eclipseRCPExt, which I would argue is correctly split here.

Addendum - Improved Version

Note: This answer recently got an upvote, and I realized that there is a better way...

By adding a second alternative to the above regex, all of the OP's test cases are split correctly.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit (2013-08-24): Added the improved version to handle the RCPExt -> RCP / Ext case.
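
For anyone who wants to compare the two versions side by side, here is a minimal Java sketch. The class name LookaroundSplitDemo, the constant names and the helper split are only for illustration.

```java
public class LookaroundSplitDemo {
    // Original version: split between a lowercase and an uppercase letter.
    static final String SIMPLE = "(?<=[a-z])(?=[A-Z])";

    // Improved version: additionally split between an uppercase letter and
    // an uppercase letter that starts a new Title-case word.
    static final String IMPROVED =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Hypothetical helper: splits s with the given regex and joins the parts with " / ".
    static String split(String s, String regex) {
        return String.join(" / ", s.split(regex));
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String sample : samples) {
            System.out.println(sample
                    + " | simple: " + split(sample, SIMPLE)
                    + " | improved: " + split(sample, IMPROVED));
        }
    }
}
```

The two versions only differ on inputs where a run of capitals is followed by a Title-case word, such as eclipseRCPExt.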

+57
Sep 29 '11

Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

+22
Sep 29 '11 at 18:56

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own, which I have tested and which seems to do exactly what you are looking for:

    ((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

and here is an example of its use:

    ; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
    ;   (^[a-z]+)                      Match against any lower-case letters at the start of the string.
    ;   ([A-Z]{1}[a-z]+)               Match against Title case words (one upper case followed by lower case letters).
    ;   ([A-Z]+(?=([A-Z][a-z])|($)))   Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it is followed by the end of the string.
    newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
    newString := Trim(newString)

Here, I separate each word with a space, so here are a few examples of how the string is converted:

  • ThisIsATitleCASEString => This Is A Title CASE String
  • andThisOneIsCamelCASE => and This One Is Camel CASE
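
Since the question is tagged Java and the snippet above is a RegExReplace script, here is a rough Java port that collects the matched words into a list instead of inserting spaces. Only the regex comes from this answer; the class name WordMatchDemo and the method words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatchDemo {
    // Same regex as above: it matches the words themselves rather than the split positions.
    static final Pattern WORD = Pattern.compile(
            "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))");

    // Hypothetical helper: returns every word the regex finds, in order.
    static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            result.add(m.group(1)); // group 1 is the whole matched word
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", words("ThisIsATitleCASEString")));
        System.out.println(String.join(" ", words("andThisOneIsCamelCASE")));
    }
}
```

Note that anything the regex does not match (digits, for instance) is simply skipped by this port, whereas the replace-based version leaves it in place.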



The solution above does what the original post asks for, but I also needed a regex that handles camel and pascal case strings containing numbers, so I came up with this variation:

    ((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

and an example of its use:

    ; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
    ;   (^[a-z]+)                               Match against any lower-case letters at the start of the string.
    ;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
    ;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
    ;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it is followed by the end of the string or a number.
    newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
    newString := Trim(newString)

And here are some examples of how a string with numbers is converted using this regular expression:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
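
The same replace-and-trim approach can be sketched in Java with String.replaceAll. This is an untested-in-production port, not part of the original script; the class name NumberAwareSplitDemo and the method spaceOut are made up for the example.

```java
public class NumberAwareSplitDemo {
    // Number-aware regex from this answer, escaped as a Java string literal.
    static final String WORD_REGEX =
            "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))";

    // Hypothetical helper: appends a space after every matched word,
    // then trims the trailing space, like the script above.
    static String spaceOut(String s) {
        return s.replaceAll(WORD_REGEX, "$1 ").trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "myVariable123", "my2Variables",
            "The3rdVariableIsHere", "12345NumsAtTheStartIncludedToo"
        };
        for (String sample : samples) {
            System.out.println(sample + " => " + spaceOut(sample));
        }
    }
}
```

The third sample shows the known limitation: lowercase letters that directly follow digits ("rd") are not matched as a word, so they stay glued to the next word.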
+8
Mar 11 '12 at 6:40

To handle more letters than just A-Z:

 s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})"); 

Either:

  • Split after any lowercase letter that is followed by an uppercase letter.

E.g. parseXML -> parse, XML.

or

  • Split after any letter that is followed by an uppercase letter and then a lowercase letter.

E.g. XMLParser -> XML, Parser.




In a more readable form:

    // Imports added for completeness; JUnit 4 is assumed here.
    import static java.util.stream.Collectors.joining;
    import static org.junit.Assert.assertEquals;

    import java.util.regex.Pattern;
    import org.junit.Test;

    public class SplitCamelCaseTest {

        static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
        static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

        static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
                BETWEEN_LOWER_AND_UPPER + "|" + BEFORE_UPPER_AND_LOWER);

        public static String splitCamelCase(String s) {
            return SPLIT_CAMEL_CASE.splitAsStream(s)
                    .collect(joining(" "));
        }

        @Test
        public void testSplitCamelCase() {
            assertEquals("Camel Case", splitCamelCase("CamelCase"));
            assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
            assertEquals("XML Parser", splitCamelCase("XMLParser"));
            assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
            assertEquals("VALUE", splitCamelCase("VALUE"));
        }
    }
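
The tests above only use ASCII input, so here is a small sketch of what the Unicode classes change. The sample word, the class name UnicodeSplitDemo and the helper split are my own choices for illustration; the non-ASCII letters are written as escapes so the file compiles regardless of source encoding.

```java
public class UnicodeSplitDemo {
    // ASCII-only variant, as used in the earlier answers.
    static final String ASCII =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Unicode-aware variant from this answer.
    static final String UNICODE =
            "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    // "groesseAenderung" spelled with o-umlaut, sharp s and a capital A-umlaut.
    static final String SAMPLE = "gr\u00f6\u00dfe\u00c4nderung";

    // Hypothetical helper: plain String.split with the given regex.
    static String[] split(String s, String regex) {
        return s.split(regex);
    }

    public static void main(String[] args) {
        // The ASCII regex does not see the capital A-umlaut as uppercase, so nothing is split.
        System.out.println(split(SAMPLE, ASCII).length);
        // The Unicode regex splits before the capital A-umlaut.
        System.out.println(split(SAMPLE, UNICODE).length);
    }
}
```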
+2
Feb 11 '13 at 16:09

Brief

Both of the top answers here rely on positive lookbehinds, which are not supported by all regex flavors. The regex below captures the words of both PascalCase and camelCase strings and can be used in multiple languages.

Note: I realize that this question is about Java, but I also see this post mentioned in other questions tagged for different languages, as well as some comments on this question to the same effect.

Code

See this regex in use here

    ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Results

Input example

    eclipseRCPExt
    SomethingIsWrittenHere
    TEXTIsWrittenHERE
    VALUE
    loremIpsum

Output example

    eclipse RCP Ext
    Something Is Written Here
    TEXT Is Written HERE
    VALUE
    lorem Ipsum

Description

  • Match one or more uppercase alpha characters [A-Z]+
  • Or match zero or one uppercase alpha character [A-Z]? followed by one or more lowercase alpha characters [a-z]+
  • Ensure that what follows is an uppercase alpha character [A-Z] or a word boundary \b
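
Because this regex matches the words instead of the gaps between them, it is used with a find loop rather than with split. Below is a minimal Java sketch of that usage; the class name LookaheadOnlyDemo and the method words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadOnlyDemo {
    // Regex from this answer; it uses a lookahead but no lookbehind.
    static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    // Hypothetical helper: collects every word the regex matches, in order.
    static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = {
            "eclipseRCPExt", "SomethingIsWrittenHere",
            "TEXTIsWrittenHERE", "VALUE", "loremIpsum"
        };
        for (String sample : samples) {
            System.out.println(String.join(" ", words(sample)));
        }
    }
}
```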
+2
Sep 25 '17 at 15:54

You can use the expression below for Java:

    (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
0
Jul 10 '16 at 23:31

Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):

    import java.util.LinkedList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NameComponents {
        public static void main(String[] args) {
            String test = "_eclipse福福RCPExt";

            // Pattern.COMMENTS makes the regex ignore the whitespace inside the pattern
            Pattern componentPattern = Pattern.compile(
                    "_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)",
                    Pattern.COMMENTS);

            Matcher componentMatcher = componentPattern.matcher(test);
            List<String> components = new LinkedList<>();
            int endOfLastMatch = 0;
            while (componentMatcher.find()) {
                // matches should be consecutive
                if (componentMatcher.start() != endOfLastMatch) {
                    // do something horrible if you don't want garbage in between
                    // we're lenient though, any Chinese characters are lucky and get through as group
                    String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
                    components.add(startOrInBetween);
                }
                components.add(componentMatcher.group(1));
                endOfLastMatch = componentMatcher.end();
            }

            if (endOfLastMatch != test.length()) {
                // keep the unmatched tail; start() must not be called here
                // because the last find() returned false
                String end = test.substring(endOfLastMatch);
                components.add(end);
            }

            System.out.println(components);
        }
    }

This outputs [eclipse, 福福, RCP, Ext]. Converting to an array is of course simple.

0
Jun 03 '17 at 16:16