Splitting a string in Java into substrings of equal length while maintaining word boundaries

How can I split a string into parts of at most a given number of characters while keeping word boundaries intact?

Say, for example, I want to split the string "hello world" into substrings with a maximum of 7 characters. It should return

"hello " 

and

 "world" 

But my current implementation returns

 "hello w" 

and

 "orld " 

I use the following code, taken from "Split a string into substrings of equal length in Java", to split the input string into equal parts:

    public static List<String> splitEqually(String text, int size) {
        // Give the list the right capacity to start with. You could use an array
        // instead if you wanted.
        List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);

        for (int start = 0; start < text.length(); start += size) {
            ret.add(text.substring(start, Math.min(text.length(), start + size)));
        }
        return ret;
    }

Is it possible to keep word boundaries when splitting a string into substrings?

To be more specific, I need a line-breaking algorithm that takes into account the word boundaries provided by spaces instead of relying only on the character count. The length still matters, but as a maximum number of characters per part rather than a hard character length.

2 answers

If I understand your problem correctly, this code should do what you need (but it assumes that maxLenght is equal to or greater than the length of the longest word):

    // Requires java.util.regex.Matcher and java.util.regex.Pattern.
    String data = "Hello there, my name is not importnant right now."
            + " I am just simple sentecne used to test few things.";
    int maxLenght = 10;
    Pattern p = Pattern.compile("\\G\\s*(.{1," + maxLenght + "})(?=\\s|$)", Pattern.DOTALL);
    Matcher m = p.matcher(data);
    while (m.find())
        System.out.println(m.group(1));

Output:

    Hello
    there, my
    name is
    not
    importnant
    right now.
    I am just
    simple
    sentecne
    used to
    test few
    things.

A brief (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

(Let's just remember that in Java \ is special not only in regular expressions but also in string literals, so to use a predefined character class like \d we need to write it as "\\d" , because the \ also has to be escaped in the string literal.)
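To make the double escaping concrete, here is a minimal sketch. The class name EscapeDemo and the helper isDigits are illustrative names only and do not come from the original answer.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true if the input consists of one or more digits.
    static boolean isDigits(String input) {
        // The source text "\\d+" is the three-character regex \d+
        return Pattern.matches("\\d+", input);
    }

    public static void main(String[] args) {
        // The string literal "\\d" holds two characters: a backslash and 'd'.
        System.out.println("\\d".length());   // prints 2
        System.out.println(isDigits("2024")); // prints true
        System.out.println(isDigits("20x4")); // prints false
    }
}
```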

  • \G is an anchor representing the end of the previous match or, if there is no previous match yet (when we have just started searching), the beginning of the input (same as ^ )
  • \s* - represents zero or more whitespace characters ( \s represents whitespace, * is the "zero or more" quantifier)
  • (.{1,"+maxLenght+"}) - let's split it into parts (at runtime maxLenght will hold some numeric value, for example 10, so the regex will see it as (.{1,10}) )
    • . represents any character (by default it represents any character except line separators such as \n or \r , but thanks to the Pattern.DOTALL flag it can now represent any character; you can drop this flag argument if you want each line of the input to be split separately, since its start will be printed on a new line anyway)
    • {1,10} is a quantifier that lets the preceding element repeat 1 to 10 times (by default it tries to match as many repetitions as possible),
    • .{1,10} - so, based on the above, it simply represents "1 to 10 of any characters"
    • ( ) - parentheses create groups, structures that let us capture specific parts of the match (here the parentheses come after \\s* because we want to use only the part after the whitespace)
  • (?=\\s|$) is a look-ahead that makes sure the text matched by .{1,10} is followed by either
    • whitespace ( \\s ), OR (written as | )
    • the end of the input ( $ )

Thus, thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) placed after it, we require that the last character matched by .{1,10} is not part of an unfinished word (it must be followed by whitespace or the end of the input).
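The role of the look-ahead can be seen by running the pattern with and without it. The following is only a sketch: the class RegexWrapSketch and the methods wrap, chop and collect are hypothetical names, and, like the code above, wrap assumes no word is longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWrapSketch {
    // Pattern from the answer: each piece must be followed by whitespace or the end of the input.
    static List<String> wrap(String text, int maxLength) {
        return collect(Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL), text);
    }

    // Same pattern without the look-ahead: pieces may end in the middle of a word.
    static List<String> chop(String text, int maxLength) {
        return collect(Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})", Pattern.DOTALL), text);
    }

    // Gathers group 1 of every successive match into a list.
    private static List<String> collect(Pattern p, String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(wrap("hello world", 7)); // prints [hello, world]
        System.out.println(chop("hello world", 7)); // prints [hello w, orld]
    }
}
```

Note that the whitespace between the pieces is consumed by \s* and is not part of the captured group, so the first piece is "hello" rather than "hello ".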


A non-regex solution, in case someone is more comfortable (?) not using regular expressions:

    private String justify(String s, int limit) {
        StringBuilder justifiedText = new StringBuilder();
        StringBuilder justifiedLine = new StringBuilder();
        String[] words = s.split(" ");
        for (int i = 0; i < words.length; i++) {
            justifiedLine.append(words[i]).append(" ");
            // Close the line after the last word, or when the next word would not fit.
            if (i + 1 == words.length || justifiedLine.length() + words[i + 1].length() > limit) {
                // Drop the trailing space before emitting the line.
                justifiedLine.deleteCharAt(justifiedLine.length() - 1);
                justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
                justifiedLine = new StringBuilder();
            }
        }
        return justifiedText.toString();
    }

Test:

    String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidosus words."
            + " No carriage returns, tho -- since it would seem weird to count the words in a new line"
            + " as part of the previous paragraph length.";
    System.out.println(justify(text, 15));

Output:

    Long sentence
    with spaces,
    and punctuation
    too. And
    supercalifragilisticexpialidosus
    words. No
    carriage
    returns, tho --
    since it would
    seem weird to
    count the words
    in a new line
    as part of the
    previous
    paragraph
    length.

It takes into account words that are longer than the set limit, so it does not skip them (unlike the regex version, which simply stops processing when it reaches supercalifragilisticexpialidosus).

PS: the comment saying that all input words are expected to be shorter than the given limit was made after I came up with this solution ;)
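The stopping behavior of the regex version can be checked with a small sketch. It assumes the same pattern as the regex answer; the class RegexLongWordDemo and the method stoppedAt are hypothetical names. Because \G only allows a match to start where the previous one ended, find() gives up at the first word that exceeds the limit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLongWordDemo {
    // Hypothetical helper: returns the index in the text where the regex loop stopped matching.
    static int stoppedAt(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        int end = 0;
        while (m.find()) {
            end = m.end(); // end of the most recent successful match
        }
        return end;
    }

    public static void main(String[] args) {
        String text = "too. And supercalifragilisticexpialidosus words.";
        int end = stoppedAt(text, 15);
        // Only "too. And" is matched; everything from the long word onward is left over.
        System.out.println(end);                 // prints 8
        System.out.println(end < text.length()); // prints true
    }
}
```

Comparing the returned index with text.length() is one way to detect that part of the input was left unprocessed.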
