Normalization in Java
Java provides Unicode normalization capabilities through the java.text.Normalizer class, which allows strings to be normalized into different Unicode normalization forms.
Syntax:
String Normalizer.normalize(CharSequence input, Normalizer.Form form)
Parameters:
- input: The string to be normalized.
- form: The desired normalization form, represented by the Normalizer.Form enum:
  - Normalizer.Form.NFC (Canonical Composition)
  - Normalizer.Form.NFD (Canonical Decomposition)
  - Normalizer.Form.NFKC (Compatibility Composition)
  - Normalizer.Form.NFKD (Compatibility Decomposition)
Return Value:
- The normalized string.
Normalizer.isNormalized()
Checks if a string is already in the specified normalization form.
Syntax:
boolean Normalizer.isNormalized(CharSequence input, Normalizer.Form form)
Parameters:
- input: The string to check.
- form: The normalization form to check against.
Return Value:
- true if the string is already in the specified normalization form, false otherwise.
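As a minimal sketch of how isNormalized() behaves, the following checks a decomposed and a precomposed "é" against NFC and NFD. The class name IsNormalizedExample and the helper isNfc are hypothetical names chosen for illustration; only Normalizer.isNormalized() comes from the JDK.

```java
import java.text.Normalizer;

public class IsNormalizedExample {
    // Hypothetical helper: reports whether the text is already in NFC.
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";  // "e" + combining acute accent
        String precomposed = "\u00E9";  // single precomposed "é"

        System.out.println(isNfc(decomposed));   // false: NFC would compose the pair
        System.out.println(isNfc(precomposed));  // true
        // The decomposed sequence is already in NFD.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true
    }
}
```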
Examples
Example 1: Canonical Composition
import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        String str = "e\u0301"; // "e" + combining acute accent
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFC);
        System.out.println(normalized); // Outputs: "é"
    }
}
Example 2: Canonical Decomposition
import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        // Precomposed "é", written as an escape so the code point is unambiguous
        String str = "\u00E9";
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Still displays as "é", but is now two code points:
        // U+0065 (e) followed by U+0301 (combining acute accent)
        System.out.println(decomposed);
    }
}
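Because the composed and decomposed forms look the same when printed, it can help to list the code points directly. The following is a sketch only; the class CodePointInspection and the helper describe are hypothetical names, not part of the JDK.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

public class CodePointInspection {
    // Hypothetical helper: formats each code point of s as U+XXXX, space-separated.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // "é" as a single code point
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(describe(precomposed)); // U+00E9
        System.out.println(describe(decomposed));  // U+0065 U+0301
        System.out.println(decomposed.length());   // 2
    }
}
```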
Use Case: Comparing Strings
Two strings can look identical on screen yet contain different code point sequences, in which case equals() returns false. Normalizing them to the same form before comparing makes visually equivalent strings compare as equal.
import java.text.Normalizer;

public class NormalizationComparison {
    public static void main(String[] args) {
        String str1 = "e\u0301"; // "e" + combining acute accent
        String str2 = "\u00E9";  // Single precomposed character "é"
        System.out.println(str1.equals(str2)); // false
        // str2 is already in NFC, so only str1 needs normalizing here
        System.out.println(Normalizer.normalize(str1, Normalizer.Form.NFC).equals(str2)); // true
    }
}
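When neither input is known to be normalized, one option is to normalize both sides before comparing. The sketch below assumes NFC as the common form; the class NormalizedEquality and the method equalsNormalized are hypothetical names introduced for illustration.

```java
import java.text.Normalizer;

public class NormalizedEquality {
    // Hypothetical helper: compares two strings after normalizing both to NFC.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // "e" + combining acute accent
        String precomposed = "\u00E9"; // precomposed "é"

        System.out.println(equalsNormalized(decomposed, precomposed)); // true
        System.out.println(equalsNormalized(precomposed, decomposed)); // true
        System.out.println(equalsNormalized("e", precomposed));        // false
    }
}
```

Normalizing both arguments makes the comparison symmetric, so the result does not depend on which string happens to be precomposed.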
Use Case: Compatibility Decomposition
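Compatibility normalization (NFKC and NFKD) replaces compatibility characters, such as circled digits or ligatures, with their plain equivalents, which simplifies searching and comparison. The example below uses NFKC, which applies compatibility decomposition followed by canonical composition.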
import java.text.Normalizer;

public class CompatibilityNormalization {
    public static void main(String[] args) {
        String str = "①"; // Circled number one (U+2460)
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFKC);
        System.out.println(normalized); // Outputs: "1"
    }
}
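To illustrate how the compatibility forms differ from the canonical ones, the following sketch applies NFC and NFKC to the same compatibility characters. The class CompatibilityComparison and the helpers nfc and nfkc are hypothetical names; the inputs (circled digit one and the "fi" ligature) are chosen only as examples.

```java
import java.text.Normalizer;

public class CompatibilityComparison {
    // Hypothetical helpers wrapping Normalizer.normalize for each form.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String circledOne = "\u2460"; // "①"
        String ligatureFi = "\uFB01"; // "ﬁ" ligature

        // Canonical normalization leaves compatibility characters untouched.
        System.out.println(nfc(circledOne).equals(circledOne)); // true
        // Compatibility normalization maps them to plain equivalents.
        System.out.println(nfkc(circledOne)); // 1
        System.out.println(nfkc(ligatureFi)); // fi
    }
}
```

Compatibility mappings discard formatting distinctions, so they are usually better suited to search keys and comparisons than to text that will be displayed back to the user.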
Limitations and Dependencies
- Performance: Normalization has to examine every character and may allocate a new string, so it can add noticeable overhead when applied to large datasets.
- Complexity: Normalization requires explicit calls to the Normalizer class, making it an additional step in text processing workflows.
- Dependencies: The Normalizer class is part of the standard Java Development Kit (JDK), so no additional dependencies are needed.
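One way to keep the extra step in a single place is a small wrapper that returns the input unchanged when it is already normalized. This is a sketch under assumptions: the class NormalizationHelper and the method toNfc are hypothetical, and whether the isNormalized() check saves time in practice depends on the JDK implementation and the data, so it should be measured rather than assumed.

```java
import java.text.Normalizer;

public class NormalizationHelper {
    // Hypothetical helper: returns s itself if it is already in NFC,
    // otherwise returns a normalized copy.
    static String toNfc(String s) {
        if (Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String ascii = "plain text";
        String decomposed = "e\u0301"; // "e" + combining acute accent

        System.out.println(toNfc(ascii) == ascii);               // true: same instance returned
        System.out.println(toNfc(decomposed).equals("\u00E9"));  // true
    }
}
```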