Unicode text ligature detection in Clojure / Java

Ligatures are characters that are rendered as a single glyph but represented by more than one code point. For example, the Devanagari त्र is a ligature that consists of the code points त + ् + र.
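As a small illustration of the point above, the following Java sketch lists the code points stored for त्र. The class and method names (CodePoints, describe) are made up for this example; only standard java.lang and java.util.stream calls are used.

```java
import java.util.stream.Collectors;

public class CodePoints {
    // Formats every code point of the text as U+XXXX, separated by spaces.
    public static String describe(String text) {
        return text.codePoints()
                   .mapToObj(cp -> String.format("U+%04X", cp))
                   .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // TA + VIRAMA + RA, written with escapes to avoid source-encoding issues.
        String tra = "\u0924\u094D\u0930";
        System.out.println(tra.codePointCount(0, tra.length())); // prints 3
        System.out.println(describe(tra)); // prints U+0924 U+094D U+0930
    }
}
```

So a plain code point count reports three characters for what is displayed as one ligature.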

When viewed in simple text editors such as Notepad, त्र is displayed as त् + र and stored as three Unicode characters. However, when the same file is opened in Firefox, it is displayed as the correct ligature.

So my question is how to detect such ligatures programmatically when reading a file from my code. Since Firefox does this, there must be a way to do it programmatically. Are there any Unicode properties that contain this information, or do I need to keep a map of all such ligatures?

The SVG CSS property text-rendering, when set to optimizeLegibility, does the same thing (combines code points into the correct ligature).

PS: I use Java.

EDIT

The purpose of my code is to count the characters in Unicode text, treating a ligature as a single character. So I need a way to collapse multiple code points into a single ligature.

java text unicode clojure
5 answers

While Aaron's answer is not entirely correct, it pushed me in the right direction. After reading the Java API docs for java.awt.font.GlyphVector and experimenting a lot in the REPL, I was able to write a function that does what I want.

The idea is to find the widths of the glyphs in the GlyphVector and combine each zero-width glyph with the last non-zero-width glyph found. The solution is in Clojure, but it should be translatable to Java if necessary.

    (ns net.abhinavsarkar.unicode
      (:import [java.awt.font TextAttribute GlyphVector]
               [java.awt Font]
               [javax.swing JTextArea]))

    ;; The font and the FontRenderContext are created once and closed over.
    (let [^java.util.Map text-attrs {
            TextAttribute/FAMILY "Arial Unicode MS"
            TextAttribute/SIZE 25
            TextAttribute/LIGATURES TextAttribute/LIGATURES_ON}
          font (Font/getFont text-attrs)
          ta (doto (JTextArea.) (.setFont font))
          frc (.getFontRenderContext (.getFontMetrics ta font))]
      (defn unicode-partition
        "takes an unicode string and returns a vector of strings by partitioning
        the input string in such a way that multiple code points of a single
        ligature are in same partition in the output vector"
        [^String text]
        (let [glyph-vector
                (.layoutGlyphVector
                  font, frc, (.toCharArray text),
                  0, (.length text), Font/LAYOUT_LEFT_TO_RIGHT)
              glyph-num (.getNumGlyphs glyph-vector)
              ;; getGlyphPositions returns x,y pairs; keep only the x values
              glyph-positions
                (map first (partition 2
                              (.getGlyphPositions glyph-vector 0 glyph-num nil)))
              ;; width of a glyph = next x position minus its own x position
              glyph-widths
                (map -
                  (concat (next glyph-positions)
                          [(.. glyph-vector getLogicalBounds width)])
                  glyph-positions)
              glyph-indices
                (seq (.getGlyphCharIndices glyph-vector 0 glyph-num nil))
              glyph-index-width-map (zipmap glyph-indices glyph-widths)
              ;; reorder the widths so that they are indexed by char position
              corrected-glyph-widths
                (vec (reduce
                        (fn [acc [k v]] (do (aset acc k v) acc))
                        (make-array Float (count glyph-index-width-map))
                        glyph-index-width-map))]
          ;; append each zero-width char to the previous partition
          (loop [idx 0 pidx 0 char-seq text acc []]
            (if (nil? char-seq)
              acc
              (if-not (zero? (nth corrected-glyph-widths idx))
                (recur (inc idx) (inc pidx) (next char-seq)
                  (conj acc (str (first char-seq))))
                (recur (inc idx) pidx (next char-seq)
                  (assoc acc (dec pidx)
                    (str (nth acc (dec pidx)) (first char-seq))))))))))
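For readers who prefer Java, the merging step of this idea can be sketched on its own. This is not a translation of the whole function: the per-character widths are passed in as a plain array (a hypothetical input that would in practice be derived from a GlyphVector), and the names ZeroWidthMerge and partition are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroWidthMerge {
    // widths[i] is the advance width attributed to text.charAt(i).
    // A char with zero width is appended to the previous partition.
    public static List<String> partition(String text, float[] widths) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            String ch = String.valueOf(text.charAt(i));
            if (widths[i] != 0f || parts.isEmpty()) {
                parts.add(ch); // starts a new partition
            } else {
                int last = parts.size() - 1;
                parts.set(last, parts.get(last) + ch); // joins the previous one
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "\u0924\u094D\u0930"; // TA + VIRAMA + RA
        // Made-up widths: the whole ligature width is attributed to the first char.
        float[] widths = {14.0f, 0.0f, 0.0f};
        System.out.println(partition(text, widths).size()); // prints 1
    }
}
```

Unlike the Clojure loop, this sketch starts a new partition when the very first character has zero width, which is a design choice made here to avoid indexing before the start of the list.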

Also published as a Gist.


The "Computer typesetting" section of the Wikipedia page says:

The Computer Modern Roman typeface provided with TeX includes the five common ligatures ff, fi, fl, ffi, and ffl. When TeX finds these combinations in a text, it substitutes the appropriate ligature, unless overridden by the typesetter.

This means that it is the editor that performs the substitution. Moreover,

Unicode maintains that ligaturing is a presentation issue rather than a character definition issue, and that, for example, "if a modern font is asked to display 'h' followed by 'r', and the font has an 'hr' ligature in it, it can display the ligature."

As far as I can see (I am interested in this topic and am reading several articles on it right now), the instructions for ligature substitution are embedded in the font. I dug deeper and found them for you: the GSUB (Glyph Substitution Table) and its Ligature Substitution Subtable, from the OpenType file format specification.

Next, you need to find a library that lets you peek inside OpenType font files, i.e. a file parser for quick access. Reading the following two discussions may give you some direction on how to do these substitutions:


What you are talking about are not ligatures (at least not in Unicode terms) but grapheme clusters. There is a Unicode Standard Annex that deals with finding text boundaries, including grapheme cluster boundaries:

http://www.unicode.org/reports/tr29/tr29-15.html#Grapheme_Cluster_Boundaries

Also see the description of tailored grapheme clusters in regular expressions:

http://unicode.org/reports/tr18/#Tailored_Graphemes_Clusters

And the definition of collation graphemes:

http://www.unicode.org/reports/tr10/#Collation_Graphemes

I think these are the starting points. The harder part will probably be finding a Java implementation of the Unicode Collation Algorithm that works for Devanagari locales. If you find one, you can analyze strings without resorting to OpenType features. This would be a bit cleaner, since OpenType is concerned purely with presentation details and not with the semantics of characters or grapheme clusters, whereas the collation algorithm and the tailored grapheme cluster boundary algorithm look as if they can be implemented independently of fonts.
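As a rough, font-independent starting point in Java, java.text.BreakIterator.getCharacterInstance iterates over user-perceived character boundaries. The sketch below counts them; the class and method names (GraphemeCount, countClusters) are invented here. Note the assumption: how Devanagari conjuncts such as त्र are grouped may depend on the JDK version and the Unicode rules it implements, so the result for such text should be verified on the target runtime.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCount {
    // Counts the character boundaries reported by BreakIterator.
    public static int countClusters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT forms one cluster.
        System.out.println(countClusters("e\u0301")); // prints 1
        // TA + VIRAMA + RA: the result may vary between JDK versions.
        System.out.println(countClusters("\u0924\u094D\u0930"));
    }
}
```

If the runtime does not group the conjunct as desired, a tailored rule set or a library that implements the tailored grapheme clusters from the reports linked above would be needed.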


You can get this information from the GlyphVector class.

For a given String, a font instance can create a GlyphVector that can provide information about the rendering of the text.

The layoutGlyphVector() method in Font can provide this.

The FLAG_COMPLEX_GLYPHS flag of GlyphVector can tell you whether the glyphs do not map 1 to 1 to the input characters.

The following code shows an example:

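    import java.awt.Font;
    import java.awt.font.FontRenderContext;
    import java.awt.font.GlyphVector;
    import javax.swing.JTextField;

    JTextField textField = new JTextField();
    // Assumption: the original snippet did not show where font comes from.
    Font font = textField.getFont();
    String textToTest = "abcdefg";
    FontRenderContext fontRenderContext = textField.getFontMetrics(font).getFontRenderContext();
    // Lay out the whole string (the original snippet stopped at index 4).
    GlyphVector glyphVector = font.layoutGlyphVector(fontRenderContext, textToTest.toCharArray(), 0, textToTest.length(), Font.LAYOUT_LEFT_TO_RIGHT);
    int layoutFlags = glyphVector.getLayoutFlags();
    // True when glyphs do not map one-to-one to the input chars.
    boolean hasComplexGlyphs = (layoutFlags & GlyphVector.FLAG_COMPLEX_GLYPHS) != 0;
    int numberOfGlyphs = glyphVector.getNumGlyphs();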

numberOfGlyphs should represent the number of glyphs used to display the input text.

Unfortunately, you need to create a Java GUI component to get the FontRenderContext.


I think what you are really looking for is Unicode normalization.

For Java, see http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html

By choosing the right normalization form, you may get what you are looking for.
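To check what normalization does to a given string, a minimal sketch with java.text.Normalizer follows. The class and method names (NormalizeDemo, nfcCodePoints) are invented for this example. It counts code points after NFC composition; a sequence only shrinks if Unicode defines a precomposed character for it.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Number of code points left after NFC (canonical composition).
    public static int nfcCodePoints(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT composes to the single code point U+00E9.
        System.out.println(nfcCodePoints("e\u0301")); // prints 1
        // TA + VIRAMA + RA has no precomposed form, so it stays three code points.
        System.out.println(nfcCodePoints("\u0924\u094D\u0930")); // prints 3
    }
}
```

For the Devanagari example from the question, NFC therefore does not collapse the conjunct on its own, so normalization would have to be combined with another approach to reach the desired count.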
