Monday, 24 August 2015

Handle UTF8 file with BOM

According to Wikipedia, the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional and, if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM may also indicate which of the several Unicode representations the text is encoded in.
The common BOMs are:

Encoding      Representation (hexadecimal)   Representation (decimal)
UTF-8         EF BB BF                       239 187 191
UTF-16 (BE)   FE FF                          254 255
UTF-16 (LE)   FF FE                          255 254
UTF-32 (BE)   00 00 FE FF                    0 0 254 255
UTF-32 (LE)   FF FE 00 00                    255 254 0 0
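
The signatures listed above can be compared against the first bytes of a file to guess its encoding. The following is only a minimal sketch: the class name BomDetector and the method detect are made up for this illustration. Note that the UTF-32 (LE) signature starts with the same two bytes as UTF-16 (LE), so the longer signatures are tested first. Data that starts with FF FE 00 00 is reported as UTF-32 (LE) here, although it could in theory be UTF-16 (LE) text beginning with a NUL character.

```java
class BomDetector {

    /** Returns the encoding whose BOM starts the data, or null if no BOM matches. */
    static String detect(byte[] data) {
        // Longer signatures first: FF FE 00 00 also begins with FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32 (BE)";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32 (LE)";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16 (BE)";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16 (LE)";
        return null;
    }

    /** Compares the leading bytes (as unsigned values) with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample)); // prints UTF-8
    }
}
```

The mask with 0xFF is needed because Java bytes are signed, so (byte) 0xEF is stored as -17.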

UTF-8 files are a special case: adding a BOM to them is not recommended because it can break other tools, Java among them. Java assumes that a UTF-8 file has no BOM, so if one is present it is not discarded and is seen as data.
To create a UTF-8 file with a BOM, open Windows Notepad, create a simple text file and save it as utf8.txt with the encoding UTF-8.
Now if you examine the file content as binary, you will see the BOM (EF BB BF) at the beginning.
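
If no hex editor is at hand, the first bytes can also be printed from Java. This is a rough sketch only: the class HexPeek and the method firstBytesHex are invented names, and the temporary file is a stand-in for the file saved above, assumed to contain the BOM followed by helloworld.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HexPeek {

    /** Formats up to max leading bytes of the file as space-separated hex pairs. */
    static String firstBytesHex(Path file, int max) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = Files.newInputStream(file)) {
            int b;
            for (int i = 0; i < max && (b = in.read()) != -1; i++) {
                if (i > 0) sb.append(' ');
                sb.append(String.format("%02X", b));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the saved file: U+FEFF encoded as UTF-8, then the text.
        Path file = Files.createTempFile("utf8", ".txt");
        Files.write(file, "\uFEFFhelloworld".getBytes(StandardCharsets.UTF_8));
        System.out.println(firstBytesHex(file, 5)); // prints EF BB BF 68 65
        Files.delete(file);
    }
}
```

The first three bytes are the BOM; 68 and 65 are the letters h and e.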

Now let's read it with Java:

package com.java;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class BomExample {

    /**
     * @author Imroze.Mohammad
     */
    public static void main(String args[]) {
        try {
            FileInputStream fis = new FileInputStream("E:\\demo\\UTF8withBOM.txt");
            // The UTF-8 decoder does not strip the BOM; it is returned as U+FEFF
            BufferedReader r = new BufferedReader(new InputStreamReader(fis,
                    "UTF8"));
            for (String s = ""; (s = r.readLine()) != null;) {
                System.out.println(s);
            }
            r.close();
            System.exit(0);
        }

        catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}



The output contains a strange character at the beginning because the BOM is not discarded:
?helloworld
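
The stray character is U+FEFF itself, which the console usually cannot display. A small in-memory check, given here only as a sketch with an invented class name BomAsData and helper decode, shows that the decoder hands the BOM back as the first character of the string.

```java
import java.nio.charset.StandardCharsets;

class BomAsData {

    /** Decodes the bytes as UTF-8 exactly as given, without any BOM handling. */
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF followed by the five letters of "hello"
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};
        String decoded = decode(raw);
        System.out.println(decoded.length());              // prints 6, not 5
        System.out.println(decoded.charAt(0) == '\uFEFF'); // prints true
        System.out.println(decoded.substring(1));          // prints hello
    }
}
```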
The next example reads the same UTF-8 file and prints it without the BOM. We check the first line for the presence of the BOM and, if it is present, we simply discard it.


package com.java;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class BomExample {

    /**
     * @author Imroze.Mohammad
     */
    public static void main(String args[]) {
        try {
            FileInputStream fis = new FileInputStream("E:\\jeeworkspace\\talend_web_app\\demo\\UTF8withBOM.txt");
            BufferedReader r = new BufferedReader(new InputStreamReader(fis,
                    "UTF8"));
            boolean firstLine = true;
            for (String s = ""; (s = r.readLine()) != null;) {
                if (firstLine) {
                    // The BOM can only appear at the very start of the file
                    s = removeBOMChar(s);
                    firstLine = false;
                }
                System.out.println(s);
            }
            r.close();
            System.exit(0);
        }

        catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Drops a leading U+FEFF, which is how the decoded BOM appears in a String
    private static String removeBOMChar(String s) {
        if (s.startsWith("\uFEFF")) {
            s = s.substring(1);
        }
        return s;
    }

}
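
As an alternative to testing the first decoded line, the three BOM bytes can be skipped before the reader ever sees them. The sketch below uses the standard java.io.PushbackInputStream; the class BomSkipper and the method skipUtf8Bom are hypothetical names, and the in-memory stream stands in for the file used above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class BomSkipper {

    /** Wraps the stream and consumes a leading EF BB BF if it is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than asked for, so loop until 3 bytes or EOF
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r == -1) break;
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: push the bytes back
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "\uFEFFhelloworld".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(raw)), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // prints helloworld
        }
    }
}
```

Working at the byte level keeps the BOM check out of the line-reading loop, and a file without a BOM passes through unchanged.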


