According to Wikipedia, the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. Using a BOM is optional; if one is used, it should appear at the start of the text stream. Beyond its role as a byte-order indicator, the BOM may also indicate which of the several Unicode representations the text is encoded in.
The common BOMs are:
Encoding | Representation (hexadecimal) | Representation (decimal) |
UTF-8 | EF BB BF | 239 187 191 |
UTF-16 (BE) | FE FF | 254 255 |
UTF-16 (LE) | FF FE | 255 254 |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 |
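The table above can be turned into a small lookup. The following is a minimal sketch, not a complete encoding detector: the class name `BomSniffer` and the method `detect` are hypothetical names chosen for illustration. Note that the UTF-32 (LE) signature begins with the same two bytes as UTF-16 (LE), so the longer signature is tested first; a UTF-16 (LE) file whose first character is NUL would be misreported, which this sketch accepts.

```java
class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or null if no BOM matches. */
    static String detect(byte[] head) {
        // UTF-32 is tested before UTF-16: FF FE 00 00 also starts with FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Compares the first bytes of data, as unsigned values, with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample)); // prints UTF-8
    }
}
```

Passing the first four bytes of a file is enough for every signature in the table.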
UTF-8 files are a special case: adding a BOM to them is not recommended because it can break tools that do not expect one, Java among them. Java's UTF-8 decoder assumes the file has no BOM, so if one is present it is not discarded and is treated as data.
To create a UTF-8 file with a BOM, open Windows Notepad, create a simple text file and save it as utf8.txt with the encoding UTF-8 (in newer versions of Notepad, choose "UTF-8 with BOM").
Now if you examine the file content as binary, you see the BOM at the beginning.
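If no hex viewer is at hand, the first bytes can be dumped from Java itself. This is a rough sketch: `BomHexDump` and `hexHead` are hypothetical names, and the temporary file written in `main` stands in for the Notepad file so that the example runs anywhere.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomHexDump {

    /** Returns up to n leading bytes of the file as space-separated hex pairs. */
    static String hexHead(Path file, int n) throws IOException {
        byte[] buf = new byte[n];
        int len = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (len < n && (r = in.read(buf, len, n - len)) != -1) {
                len += r;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", buf[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the Notepad file: U+FEFF encodes to EF BB BF in UTF-8.
        Path tmp = Files.createTempFile("bom", ".txt");
        Files.write(tmp, "\uFEFFhelloworld".getBytes(StandardCharsets.UTF_8));
        System.out.println(hexHead(tmp, 5)); // prints EF BB BF 68 65
        Files.delete(tmp);
    }
}
```

The three bytes EF BB BF are the BOM; 68 65 are the first two letters of the text.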
If we read it with Java:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomExample {
    /**
     * @author Imroze.Mohammad
     */
    public static void main(String[] args) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream("E:\\demo\\UTF8withBOM.txt"), StandardCharsets.UTF_8))) {
            String s;
            while ((s = r.readLine()) != null) {
                // the first line still starts with the decoded BOM (U+FEFF)
                System.out.println(s);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
The output contains a strange character at the beginning because the BOM is not discarded:
?helloworld
The next example reads the same UTF-8 file and prints it. We check the first line for the BOM and, if it is present, simply discard it.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomExample {
    /**
     * @author Imroze.Mohammad
     */
    public static void main(String[] args) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream("E:\\jeeworkspace\\talend_web_app\\demo\\UTF8withBOM.txt"),
                StandardCharsets.UTF_8))) {
            boolean firstLine = true;
            String s;
            while ((s = r.readLine()) != null) {
                if (firstLine) {
                    // only the first line can carry the BOM
                    s = removeBOMChar(s);
                    firstLine = false;
                }
                System.out.println(s);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    /** Strips a leading U+FEFF (the decoded BOM) if present. */
    private static String removeBOMChar(String s) {
        if (s.startsWith("\uFEFF")) {
            s = s.substring(1);
        }
        return s;
    }
}
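An alternative to checking the first decoded line is to drop the BOM at the byte level, before the stream reaches the decoder. The sketch below is one possible way to do that with `PushbackInputStream`; the class `BomSkipper` and the method `skipUtf8Bom` are hypothetical names, and the in-memory stream in `main` stands in for the file so that the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class BomSkipper {

    /** Wraps the stream and consumes a leading EF BB BF if it is there. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        int r;
        // read up to three bytes; short reads and short streams are both possible
        while (n < 3 && (r = pb.read(head, n, 3 - n)) != -1) {
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: put the bytes back for the decoder
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\uFEFFhelloworld".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // prints helloworld
        }
    }
}
```

With a real file, the `ByteArrayInputStream` would be replaced by a `FileInputStream`. The reading loop then needs no first-line special case.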