Endianness - Byte Order Marker – Dhval Mudawal

Endianness is so simple, and yet I still confuse myself over big vs. little endian.

Big-endian machine: Stores data big end first. When looking at multiple bytes, the first byte (lowest address) is the most significant.
Little-endian machine: Stores data little end first. When looking at multiple bytes, the first byte (lowest address) is the least significant.

In a one-byte encoding (ASCII), endianness does not matter. But when a value or character takes two or more bytes, we need to agree on the order in which those bytes are stored.

Endianness is also referred to as the NUXI problem. Imagine the word UNIX stored in two 2-byte words. On a big-endian system, it would be stored as UNIX. On a little-endian system, the two bytes within each word are swapped, so it would be stored as NUXI.
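The NUXI effect can be reproduced in a few lines. This is a minimal Python sketch, assuming each pair of characters is treated as one unsigned 16-bit word; the variable names are only illustrative.

```python
import struct

# Treat "UNIX" as two 16-bit words: "UN" (0x554E) and "IX" (0x4958).
words = struct.unpack(">2H", b"UNIX")

# A big-endian machine writes each word high byte first.
big = struct.pack(">2H", *words)

# A little-endian machine writes each word low byte first.
little = struct.pack("<2H", *words)

print(big)     # b'UNIX'
print(little)  # b'NUXI'
```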

Big-endian matches how we read English, left to right, so the high-order byte is stored at position 0. Consider the 32-bit number 0xDEADBEEF.

Big-endian: The most significant byte is stored at the lowest byte address, so memory holds DE AD BE EF.

Little-endian: The least significant byte is stored at the lowest byte address, so memory holds EF BE AD DE.
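The two layouts of 0xDEADBEEF can be checked with Python's built-in `int.to_bytes`. This is just a quick sketch; the variable names are illustrative.

```python
value = 0xDEADBEEF

# Index 0 of each result corresponds to the lowest byte address.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))     # de ad be ef
print(little.hex(" "))  # ef be ad de

# Reading the bytes back with the matching order recovers the number.
restored = int.from_bytes(little, byteorder="little")
```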

Solution 1: Use a common format

Agree on one byte order for data on the wire, and convert to it with hton (host to network) before sending, even if your machine is already big-endian. Your program may become popular enough to be compiled on different machines, and you want your code to be portable (don't you?).

Similarly, there is a function ntoh (network to host) used to read data off the network. You need this to make sure you are correctly interpreting the network data into the host's format. You need to know the type of data you are receiving to decode it properly, and the conversion functions are:

 htons()--"Host to Network Short"
 htonl()--"Host to Network Long"
 ntohs()--"Network to Host Short"
 ntohl()--"Network to Host Long"

Remember that a single byte is a single byte, so order does not matter for it. These functions are declared in winsock2.h on Windows and in arpa/inet.h on POSIX systems. They are part of the sockets API, so machines that support TCP/IP networking have them available. They convert data to and from 'network byte order', which is big-endian.

Function	Purpose
`ntohs`	Convert a 16-bit quantity from network byte order to host byte order (big-endian to little-endian on a little-endian host).
`ntohl`	Convert a 32-bit quantity from network byte order to host byte order (big-endian to little-endian on a little-endian host).
`htons`	Convert a 16-bit quantity from host byte order to network byte order (little-endian to big-endian on a little-endian host).
`htonl`	Convert a 32-bit quantity from host byte order to network byte order (little-endian to big-endian on a little-endian host).
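Python's `socket` module exposes the same four functions, which makes it easy to try them out. The sketch below is only an illustration: the port and address values are made up, and the `"="` format in `struct.pack` stands in for a C program writing the converted integer straight from memory.

```python
import socket
import struct
import sys

port = 0x1234          # a 16-bit value, such as a port number
address = 0xC0A80001   # a 32-bit value, such as the IPv4 address 192.168.0.1

net_port = socket.htons(port)
net_address = socket.htonl(address)

# Writing the converted value in the host's native layout ("=") yields
# big-endian bytes, whatever the host's own byte order is.
wire_port = struct.pack("=H", net_port)
wire_address = struct.pack("=I", net_address)

print(sys.byteorder)          # "little" or "big", depending on the host
print(wire_port.hex(" "))     # 12 34
print(wire_address.hex(" "))  # c0 a8 00 01

# The receiver reverses the conversion with ntohs/ntohl.
received_port = socket.ntohs(net_port)
received_address = socket.ntohl(net_address)
```

On a big-endian host the hton calls return their argument unchanged, and the bytes on the wire come out the same.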

If the processor on which the TCP/IP stack is to be run is itself also Big-Endian, each of the four macros (i.e. ntohs, ntohl, htons, htonl) will be defined to do nothing and there will be no run-time performance impact. If, however, the processor is Little-Endian, the macros will reorder the bytes appropriately. These macros are routinely called when building and parsing network packets and when socket connections are created. Serious run-time performance penalties occur when using TCP/IP on a Little-Endian processor. For that reason, it may be unwise to select a Little-Endian processor for use in a device, such as a router or gateway, with an abundance of network functionality. (Excerpt from reference [1]).

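One additional problem with the host-to-network APIs is that the four standard functions handle only 16-bit and 32-bit values. For 64-bit data you need a non-standard, platform-specific extension or your own byte swap.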
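One way around the 64-bit gap is to serialize with an explicit byte order instead of converting in place. This is a minimal sketch using Python's `struct` module, not a replacement for the C functions; the sample value is arbitrary.

```python
import struct

value = 0x0102030405060708

# ">Q" packs an unsigned 64-bit integer in big-endian (network) order.
wire = struct.pack(">Q", value)
print(wire.hex(" "))  # 01 02 03 04 05 06 07 08

# The receiver unpacks with the same format string.
(decoded,) = struct.unpack(">Q", wire)
```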

Solution 2: Use a Byte Order Mark (BOM)

The other approach is to include a magic number, such as 0xFEFF, before every piece of data. If you read the magic number and it is 0xFEFF, it means the data is in the same format as your machine, and all is well.

If you read the magic number and it is 0xFFFE (it is backwards), it means the data was written in a format different from your own. You'll have to translate it.
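The magic-number check might look something like this. It is a sketch under assumptions: the payload is a flat list of unsigned 16-bit values, and `write_values` and `read_values` are hypothetical helpers, not part of any real protocol.

```python
import struct

BOM = 0xFEFF

def write_values(values, byteorder):
    """Hypothetical writer: prefix 16-bit values with a BOM in the writer's order."""
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack(f"{prefix}{len(values) + 1}H", BOM, *values)

def read_values(data, byteorder):
    """Hypothetical reader: byteorder is the reader's own byte order."""
    prefix = "<" if byteorder == "little" else ">"
    (marker,) = struct.unpack_from(f"{prefix}H", data, 0)
    if marker == 0xFFFE:
        # The marker reads backwards, so the writer used the opposite order.
        prefix = ">" if prefix == "<" else "<"
    elif marker != BOM:
        raise ValueError("missing byte order mark")
    count = (len(data) - 2) // 2
    return list(struct.unpack_from(f"{prefix}{count}H", data, 2))

# A big-endian writer and a little-endian reader still agree on the values.
message = write_values([1, 0x1234], "big")
print(read_values(message, "little"))  # [1, 4660]
```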

BOM adds overhead to all data that is transmitted. Even if you are only sending 2 bytes of data, you need to include a 2-byte BOM. Ouch!

Unicode uses a BOM (the code point U+FEFF) when storing multi-byte data in encodings such as UTF-16 and UTF-32 (some Unicode character encodings use 2, 3 or even 4 bytes per character). XML avoids this mess by defaulting to UTF-8, which stores Unicode information one byte at a time and so has no byte-order issue.
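Python's standard codecs show the Unicode BOM at work. A small sketch, with the single character "A" chosen only for illustration:

```python
import codecs

text = "A"

# The same character in three encodings.
print(text.encode("utf-16-be"))  # b'\x00A'
print(text.encode("utf-16-le"))  # b'A\x00'
print(text.encode("utf-8"))      # b'A'

# The generic "utf-16" decoder reads the BOM to choose the byte order,
# then strips it from the result.
from_be = (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + text.encode("utf-16-le")).decode("utf-16")
```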

Why are there endian issues at all? Can't we just get along?

Each byte-order system has its advantages. Little-endian machines let you read the lowest byte first, without reading the others. You can check whether a number is odd or even very easily (it is even if the last bit is 0), which is cool if you're into that kind of thing. Big-endian systems store data in memory the same way we humans think about data (left to right), which makes low-level debugging easier.
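The parity trick can be sketched as follows. The two helper functions are hypothetical and operate on raw bytes rather than on real machine memory.

```python
def is_odd_little_endian(raw: bytes) -> bool:
    """The low-order byte comes first, so only raw[0] is needed."""
    return bool(raw[0] & 1)

def is_odd_big_endian(raw: bytes) -> bool:
    """The low-order byte comes last, so the length must be known."""
    return bool(raw[-1] & 1)

seven = (7).to_bytes(4, byteorder="little")
ten = (10).to_bytes(4, byteorder="little")

print(is_odd_little_endian(seven))  # True
print(is_odd_little_endian(ten))    # False
```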

Resources -

http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/

http://www.codeproject.com/KB/cpp/endianness.aspx

http://people.cs.umass.edu/~verts/cs32/endian.html