The byte order mark (BOM) is a particular usage of the special Unicode character code, U+FEFF ZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:

  • the byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings;
  • the fact that the text stream's encoding is Unicode, to a high level of confidence;
  • which Unicode character encoding is used.

BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. If the bytes of the BOM are swapped, the result is U+FFFE, a noncharacter Unicode code point. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and will no longer need the BOM for processing.
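The byte-swap check described above can be sketched in Python. This is a minimal illustration, not a reference implementation; the helper name detect_order is hypothetical, and it assumes exactly two bytes of input.

```python
import struct

BOM = 0xFEFF

# Serialize the code point as a big-endian 16-bit unit, then read the same
# two bytes back as if they were little-endian.
raw = struct.pack(">H", BOM)           # the bytes FE FF
misread = struct.unpack("<H", raw)[0]  # 0xFFFE, a noncharacter


def detect_order(first_two: bytes) -> str:
    """Return 'big', 'little', or 'unknown' from exactly two leading bytes.

    Hypothetical helper for illustration only.
    """
    if struct.unpack(">H", first_two)[0] == BOM:
        return "big"
    if struct.unpack("<H", first_two)[0] == BOM:
        return "little"
    return "unknown"
```

Because U+FFFE should never appear in text, seeing it as the first code unit is strong evidence that the stream was read with the wrong byte order.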

The byte sequence of the BOM differs per Unicode encoding (including UTF-8 and ones outside the Unicode standard such as UTF-7, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM is called a "Unicode signature".

Usage

The BOM is, simply, the Unicode codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE, encoded in the current encoding. A text file beginning with the bytes FE FF suggests that the file is encoded in big-endian UTF-16.
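As an illustration of "encoded in the current encoding", the following Python sketch encodes U+FEFF with several standard codecs. The function name bom_bytes is an assumption made for this example. The explicit-order codec names (such as utf-16-be) are used because Python's generic "utf-16" and "utf-32" codecs prepend a BOM of their own when encoding.

```python
def bom_bytes(encoding: str) -> bytes:
    """Encode U+FEFF in the given encoding (hypothetical helper)."""
    return "\ufeff".encode(encoding)


for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    # bytes.hex(sep) requires Python 3.8 or later.
    print(f"{name:10} {bom_bytes(name).hex(' ').upper()}")
```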

The name ZWNBSP (zero-width no-break space) should be used if the BOM appears in the middle of a data stream. Unicode says it should be interpreted as a normal codepoint (namely a word joiner), not as a BOM. Since Unicode 3.2, this usage has been deprecated in favor of U+2060 WORD JOINER.

The Unicode 1.0 name for this codepoint is BYTE ORDER MARK.

UTF-8

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence EF BB BF.

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. UTF-8 always has the same byte order, so the BOM's only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work. The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature." An example of not following this recommendation is the IETF Syslog protocol, which requires text to be in UTF-8 and also requires the BOM.
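How a UTF-8 BOM is treated depends on the decoder. As one example, Python's standard library offers both a plain "utf-8" codec, which keeps a leading BOM as the character U+FEFF, and a "utf-8-sig" codec, which drops it when decoding and writes it when encoding. The sample text below is arbitrary.

```python
import codecs

data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

kept = data.decode("utf-8")          # BOM survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")  # leading BOM removed if present

# "utf-8-sig" also accepts input that has no BOM at all.
no_bom = "héllo".encode("utf-8").decode("utf-8-sig")

# When encoding, "utf-8-sig" prepends the signature EF BB BF.
signed = "héllo".encode("utf-8-sig")
```

Decoding with "utf-8-sig" discards the signature, so a program that needs to preserve the BOM across a round trip has to record its presence separately.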

Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance, many programming languages permit non-ASCII bytes in string literals but not at the start of the file.

A BOM is not necessary for detecting UTF-8 encoding. UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8, so the existence of such invalid sequences indicates the file is not UTF-8, while the lack of invalid sequences is a very strong indication that the text is UTF-8. Practically the only exception is text containing only ASCII-range bytes, as this may be a non-ASCII 7-bit encoding, but this is unlikely in any modern data and even then the difference from ASCII is minor (such as changing '\' to '¥').
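The validity test described above can be sketched as a strict decode attempt. The function name looks_like_utf8 is hypothetical, and the sketch assumes the whole input is available. A buffer cut off in the middle of a multi-byte sequence would be rejected.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_sample = "naïve café".encode("utf-8")
latin1_sample = "naïve café".encode("latin-1")  # same text, legacy encoding
```

Here the Latin-1 bytes fail the test because the byte for "ï" (0xEF) looks like the start of a three-byte UTF-8 sequence but is followed by an ASCII letter. Pure ASCII input passes, which matches the exception noted above.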

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad (prior to Windows 10 Build 1903), treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Windows PowerShell (up to 5.1) will add a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 added a utf8NoBOM value for the -Encoding switch on some cmdlets so that documents can be saved without a BOM. Google Docs also adds a BOM when converting a document to a plain text file for download.

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first bytes of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in text.

  • If the 16-bit units are represented in big-endian byte order ("UTF-16BE"), the BOM is the (hexadecimal) byte sequence FE FF
  • If the 16-bit units use little-endian order ("UTF-16LE"), the BOM is the (hexadecimal) byte sequence FF FE

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order.
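The difference between the generic and the explicit-order labels can be observed with Python's standard codecs, used here only as one example of a decoder. The generic "utf-16" codec consumes a leading BOM and uses it to pick the byte order, while "utf-16-be" treats the same bytes as an ordinary U+FEFF character.

```python
text = "A"
with_be_bom = b"\xfe\xff" + text.encode("utf-16-be")
with_le_bom = b"\xff\xfe" + text.encode("utf-16-le")

# The generic codec reads the BOM, selects the byte order, and drops it.
decoded_be = with_be_bom.decode("utf-16")
decoded_le = with_le_bom.decode("utf-16")

# The explicit-order codec keeps the leading U+FEFF in the output.
decoded_explicit = with_be_bom.decode("utf-16-be")
```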

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content". However, if a byte order mark is present, then that BOM is to be treated as "more authoritative than anything else".

Without a BOM, it is still fairly reliable to detect if text is UTF-16 and what byte order it is in if the text is sufficiently long. Characters in the Basic Latin block (U+0000 to U+007F) have a NUL (0x00) high byte; even in non-Latin scripts, line-endings and spaces (which are included in this block) are used frequently. If NUL bytes occur much more often at even offsets in the file (counting the first byte as offset 0), then the text is likely to be big-endian UTF-16; if they occur more often at odd offsets, it is likely to be little-endian UTF-16.
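A rough sketch of this NUL-counting heuristic follows. The function name guess_utf16_order and the ratio threshold of 2.0 are assumptions chosen for illustration. A real detector would tune the threshold and require a minimum amount of input.

```python
def guess_utf16_order(data: bytes, ratio: float = 2.0) -> str:
    """Guess UTF-16 byte order from where NUL bytes fall (hypothetical helper).

    Offsets are counted from 0, so the high byte of a big-endian code unit
    sits at an even offset.
    """
    even_nuls = data[0::2].count(0)
    odd_nuls = data[1::2].count(0)
    # max(..., 1) avoids treating one or two stray NULs as a signal.
    if even_nuls > ratio * max(odd_nuls, 1):
        return "big"
    if odd_nuls > ratio * max(even_nuls, 1):
        return "little"
    return "unknown"


sample = "Hello, world! This is a test."
```

The heuristic reports "unknown" for input with few or evenly spread NUL bytes, such as UTF-8 text or very short streams.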

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a UTF-16 NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or UTF-16 with a NUL first character is more likely. UTF-32 is easily detected without a BOM because every 4th byte is NUL.
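One way to handle the overlap between the UTF-32LE and UTF-16LE signatures is to test the longer patterns first. The sketch below makes that choice, which means it assumes that UTF-16LE text beginning with a NUL character is the less likely case. The name sniff_bom and the returned tuple shape are hypothetical. The BOM constants come from Python's standard codecs module.

```python
import codecs

# Four-byte signatures are listed first, so FF FE 00 00 is matched as
# UTF-32LE before its FF FE prefix can be taken for UTF-16LE.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM matches."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

A caller would typically decode data[length:] with the returned encoding name, falling back to other detection methods when no signature is found.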

Byte order marks by encoding

This table illustrates how the BOM is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (Windows-1252 with caret notation for the C0 controls):

Encoding     Representation (hexadecimal)  Representation (decimal)  Bytes interpreted as Windows-1252
UTF-8        EF BB BF                      239 187 191               ï»¿
UTF-16 (BE)  FE FF                         254 255                   þÿ
UTF-16 (LE)  FF FE                         255 254                   ÿþ
UTF-32 (BE)  00 00 FE FF                   0 0 254 255               ^@^@þÿ
UTF-32 (LE)  FF FE 00 00                   255 254 0 0               ÿþ^@^@
UTF-7        2B 2F 76                      43 47 118                 +/v
UTF-1        F7 64 4C                      247 100 76                ÷dL
UTF-EBCDIC   DD 73 66 73                   221 115 102 115           Ýsfs
SCSU         0E FE FF                      14 254 255                ^Nþÿ
BOCU-1       FB EE 28                      251 238 40                ûî(
GB18030      84 31 95 33                   132 49 149 51             „1•3
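The decimal and Windows-1252 columns can be derived from the hexadecimal column. The following sketch shows one way to do so, using caret notation for the C0 controls as in the table. The function names are hypothetical, and bytes that Windows-1252 leaves undefined are shown as U+FFFD here, which is a choice made for this example.

```python
def as_decimal(raw: bytes) -> str:
    """Render bytes as space-separated decimal values."""
    return " ".join(str(byte) for byte in raw)


def as_cp1252(raw: bytes) -> str:
    """Render bytes as Windows-1252 text, with ^X notation for C0 controls."""
    out = []
    for byte in raw:
        if byte < 0x20:
            # Caret notation: 0x00 becomes ^@, 0x0E becomes ^N, and so on.
            out.append("^" + chr(byte + 0x40))
        else:
            out.append(bytes([byte]).decode("cp1252", errors="replace"))
    return "".join(out)


scsu_bom = bytes.fromhex("0E FE FF")
print(as_decimal(scsu_bom), as_cp1252(scsu_bom))
```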
