Normalization in Java

Java provides Unicode normalization capabilities through the java.text.Normalizer class, which allows strings to be normalized into different Unicode normalization forms.

Normalizer.normalize()

Returns a normalized copy of the input in the specified normalization form.

Syntax:

String Normalizer.normalize(CharSequence input, Normalizer.Form form)

Parameters:

input: The character sequence to normalize (any CharSequence, such as a String).
form: The desired normalization form, represented by the Normalizer.Form enum:
- Normalizer.Form.NFC (canonical decomposition, followed by canonical composition)
- Normalizer.Form.NFD (canonical decomposition)
- Normalizer.Form.NFKC (compatibility decomposition, followed by canonical composition)
- Normalizer.Form.NFKD (compatibility decomposition)

Return Value:

The normalized string. A NullPointerException is thrown if input or form is null.

Normalizer.isNormalized()

Checks if a string is already in the specified normalization form.

Syntax:

boolean Normalizer.isNormalized(CharSequence input, Normalizer.Form form)

Parameters:

input: The string to check.
form: The normalization form to check against.

Return Value:

true if the string is already in the specified normalization form, false otherwise.
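
The check described above can be sketched as follows. This is a minimal illustration, not a canonical usage pattern; the class name IsNormalizedExample and the helper isNfc are hypothetical names introduced for this example.

```java
import java.text.Normalizer;

class IsNormalizedExample {
    // Hypothetical helper: reports whether the text is already in NFC.
    static boolean isNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // "e" + combining acute accent
        String composed = "\u00E9";    // Precomposed "é"

        System.out.println(isNfc(decomposed)); // false
        System.out.println(isNfc(composed));   // true
        // The decomposed sequence is, however, already in NFD.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true
    }
}
```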

Examples

Example 1: Canonical Composition

import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        String str = "e\u0301"; // "e" + combining acute accent
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFC);
        System.out.println(normalized); // Outputs: "é"
    }
}

Example 2: Canonical Decomposition

import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        String str = "\u00E9"; // Precomposed "é"
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        System.out.println(decomposed); // Displays "é", now stored as "e" + U+0301 (two chars)
    }
}
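
Because the composed and decomposed forms render identically, printing the string does not reveal the difference. One way to make it visible is to list the code points, as in this sketch. The class name CodePointInspection and the helper describe are hypothetical, and String.codePoints() assumes Java 8 or later.

```java
import java.text.Normalizer;

class CodePointInspection {
    // Hypothetical helper: formats each code point of the text as U+XXXX.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String composed = "\u00E9"; // Precomposed "é"
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(describe(composed));   // U+00E9
        System.out.println(describe(decomposed)); // U+0065 U+0301
        System.out.println(decomposed.length());  // 2
    }
}
```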

Use Case: Comparing Strings

Strings may look identical but differ in their internal Unicode representations. Normalization ensures consistency for accurate comparisons.

import java.text.Normalizer;

public class NormalizationComparison {
    public static void main(String[] args) {
        String str1 = "e\u0301"; // "e" + combining acute accent
        String str2 = "\u00E9";  // Single precomposed character "é"

        System.out.println(str1.equals(str2)); // false
        System.out.println(Normalizer.normalize(str1, Normalizer.Form.NFC).equals(str2)); // true
    }
}
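
The comparison above works because the second string happens to be in NFC already. When neither input can be trusted, a safer pattern is to normalize both sides before comparing. The sketch below assumes NFC as the common form; the class name NormalizedEquality and the helper equalsNormalized are hypothetical.

```java
import java.text.Normalizer;

class NormalizedEquality {
    // Hypothetical helper: compares two strings after bringing both to NFC.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        System.out.println(equalsNormalized("e\u0301", "\u00E9")); // true
        System.out.println(equalsNormalized("\u00E9", "e\u0301")); // true (argument order does not matter)
        System.out.println(equalsNormalized("e", "\u00E9"));       // false (the accent is a real difference)
    }
}
```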

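Use Case: Compatibility Normalization

The compatibility forms (NFKC and NFKD) replace compatibility characters, such as circled digits, ligatures, and full-width letters, with their plain equivalents. This simplifies searching and matching, but the original distinction is lost, so the transformation cannot be reversed.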

import java.text.Normalizer;

public class CompatibilityNormalization {
    public static void main(String[] args) {
        String str = "①"; // Circled number one
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFKC);
        System.out.println(normalized); // Outputs: "1"
    }
}
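
For the circled digit, NFKC and NFKD give the same result. The two forms differ once the text also contains accented characters, as this sketch illustrates with a ligature followed by a precomposed "é". The class name CompatibilityForms and the sample string are assumptions made for this example.

```java
import java.text.Normalizer;

class CompatibilityForms {
    public static void main(String[] args) {
        String text = "\uFB01anc\u00E9"; // "ﬁ" ligature + "anc" + precomposed "é"

        String nfkc = Normalizer.normalize(text, Normalizer.Form.NFKC);
        String nfkd = Normalizer.normalize(text, Normalizer.Form.NFKD);

        // NFKC expands the ligature and keeps the accent composed.
        System.out.println(nfkc.length()); // 6
        // NFKD expands the ligature and also splits "é" into "e" + U+0301.
        System.out.println(nfkd.length()); // 7

        // The canonical forms leave compatibility characters untouched.
        String circled = "\u2460"; // Circled number one
        System.out.println(Normalizer.normalize(circled, Normalizer.Form.NFC).equals(circled)); // true
    }
}
```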

Limitations and Dependencies

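Performance: Normalization adds processing cost to every string it touches, which can become noticeable on large datasets.
Complexity: Java does not normalize strings implicitly, and methods such as String.equals() and String.hashCode() work on the raw char values. Normalization is therefore an explicit extra step in text processing workflows.
Dependencies: The Normalizer class is part of the standard Java Development Kit (JDK), in the java.text package since Java 6, so no additional dependencies are needed.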
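
One way to keep the extra step manageable is to normalize text once, at the point where it enters the program, instead of at every comparison. The sketch below shows that idea under stated assumptions: NFC is chosen as the storage form, and the class name NormalizeAtBoundary and the helper distinctNormalized are hypothetical.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NormalizeAtBoundary {
    // Hypothetical helper: normalizes each input to NFC once, then deduplicates.
    static Set<String> distinctNormalized(List<String> inputs) {
        Set<String> result = new LinkedHashSet<>();
        for (String s : inputs) {
            result.add(Normalizer.normalize(s, Normalizer.Form.NFC));
        }
        return result;
    }

    public static void main(String[] args) {
        // The first two entries are the same word in composed and decomposed form.
        List<String> raw = Arrays.asList("caf\u00E9", "cafe\u0301", "cafe");
        System.out.println(distinctNormalized(raw).size()); // 2
    }
}
```

After this step, ordinary equals() and hashCode() behave consistently on the stored values, so later code does not need to call Normalizer again.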