Data representation is a complex subject. People have built careers on data representations, and some have lost sleep over them. While the parent post refers to binary and non-binary data, the subject of data representation is too broad for a single blog post. If you are as old as I am and lived through the standardization of data representations, you will understand. If you are a millennial, you can reap the benefits of that painful standardization of data structures. Semantic data is still open for standardization.
What is Data?
Data is a collection of datums; datum is the singular form, and data is the plural. In computing, “data” is also widely (and loosely) used as a singular noun.
A datum is a single piece of information (a single fact, a starting point for measurement): a character, a quantity, or a symbol on which computer operations (add, multiply, divide, reverse, flip) are applied. E.g., the character ‘H’ is a datum, and the string “Hello World” is data composed of individual character datums.
From now on, we will refer to both ‘H’ and ‘Hello World’ as Data.
What are Data Types?
Data types are attributes of data that tell the computer how the programmer intends to use the data. E.g., if the data type is a number, the programmer can add, multiply, and divide the data. If the data type is a character, the programmer can compose characters into strings. The operations add, multiply, and divide do not apply to characters.
Computers need to store, compute, and transfer different types of data.
The table below illustrates some common basic and composite data types (a short sketch of a few of them follows the table):
| Data Types | Examples |
| --- | --- |
| Characters and Symbols | ‘A’, ‘a’, ‘$’, ‘ह’, ‘छ’ |
| Digits | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 |
| Integers (signed and unsigned) | -24, -191, 533, 322 |
| Boolean (binary) | [True, False], [1, 0] |
| Floats (single precision) | -4.5f |
| Doubles (double precision) | -4.5d |
| Composite: Imaginary Numbers | a + b*i |
| Composite: Strings | “Heaven on Earth” |
| Composite: Arrays of Strings | [‘Heaven’, ‘on’, ‘Earth’] |
| Composite: Maps (key-value) | {‘sad’: ‘:(’, ‘happy’: ‘:)’} |
| Composite: Decimal (Fraction) | 22 / 7 |
| Composite: Enumerations | [Violet, Indigo, Blue, Green, Yellow, Orange, Red] |
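To make these types concrete, here is a minimal Java sketch (class and variable names are purely illustrative) declaring a few of the basic and composite types from the table:

```java
import java.util.List;
import java.util.Map;

public class DataTypes {
    public static void main(String[] args) {
        char symbol = 'ह';                 // character (a Unicode code point)
        int signedInt = -191;              // signed integer
        boolean flag = true;               // boolean
        float single = -4.5f;              // float (single precision)
        double dbl = -4.5d;                // double (double precision)
        String text = "Heaven on Earth";                                  // composite: string
        List<String> words = List.of("Heaven", "on", "Earth");            // composite: array of strings
        Map<String, String> moods = Map.of("sad", ":(", "happy", ":)");   // composite: map (key-value)
        System.out.println(symbol + " " + signedInt + " " + flag + " " + single + " " + dbl);
        System.out.println(text + " " + words + " " + moods);
    }
}
```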
What are Data Representations?
Logically, computers represent a datum by mapping it to a unique number, and data as a sequence of numbers. This representation makes computing consistent: everything is a number. For characters, the standard mapping (character set) is called “Unicode.”
| Example | Number (Unicode code point) | HTML | Comments |
| --- | --- | --- | --- |
| ‘A’ | U+0041 | &#x41; | 0x41 = 0d65 |
| ‘a’ | U+0061 | &#x61; | 0x61 = 0d97 |
| 8 | U+0038 | &#x38; | 0x38 = 0d56 |
| ‘ह’ | U+0939 | &#x939; | 0x939 = 0d2361 |
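A small Java sketch (illustrative only) that prints the Unicode code point behind each of the characters in the table above:

```java
public class CodePoints {
    public static void main(String[] args) {
        // Every character maps to a unique number (its Unicode code point).
        for (String s : new String[] {"A", "a", "8", "ह"}) {
            int codePoint = s.codePointAt(0);
            System.out.printf("'%s' -> U+%04X (0d%d)%n", s, codePoint, codePoint);
        }
    }
}
```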
These numbers can themselves be represented in base 2, 8, 10, or 16. Humans normally read and write numbers in base-10 (decimal), whereas base-2 (binary), base-8 (octal), and base-16 (hexadecimal) are the other standard base systems. The Unicode code points (mappings) above are written in hexadecimal.
| Base-10 | Base-2 | Base-8 | Base-16 |
| --- | --- | --- | --- |
| 0d25 (2*10^1 + 5*10^0) | 0b11001 (1*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0) | 031 (3*8^1 + 1*8^0) | 0x19 (1*16^1 + 9*16^0) |
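The same value can be printed in any base. A quick Java sketch (names are illustrative) using the standard library’s radix conversions:

```java
public class NumberBases {
    public static void main(String[] args) {
        int n = 25;
        System.out.println(Integer.toBinaryString(n));     // 11001  (base-2)
        System.out.println(Integer.toOctalString(n));      // 31     (base-8)
        System.out.println(Integer.toHexString(n));        // 19     (base-16)
        // ...and back: parse a string written in a given base (radix).
        System.out.println(Integer.parseInt("11001", 2));  // 25
    }
}
```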
Computers use base-2 (binary) to store, compute, and transfer data, because the electronic gates that make up the computer take binary inputs. Each storage cell in memory stores “one bit,” i.e., either a ‘0’ or a ‘1’. A group of 8 bits is a byte. The Arithmetic Logic Unit (ALU) uses combinations of AND, OR, NAND, XOR, and NOR gates to perform mathematical operations (add, subtract, multiply, divide) on the binary (base-2) representation of numbers. In modern storage systems (SSDs), each cell can store more than one bit of information; these are called MLCs (multi-level cells). E.g., TLCs store 3 bits of information, or 8 (2^3) stable states, per cell. MLCs help build fast, big, and cheap storage.
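As a toy illustration of how those gates add numbers, here is a 1-bit “half adder” sketched in Java using bitwise operators (a simplification for intuition, not how any real ALU is implemented):

```java
public class HalfAdder {
    public static void main(String[] args) {
        // A half adder combines two 1-bit inputs: sum = a XOR b, carry = a AND b.
        // Chaining such adders (with carry) is how multi-bit numbers are added.
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                int sum = a ^ b;    // XOR gate
                int carry = a & b;  // AND gate
                System.out.printf("%d + %d = carry %d, sum %d%n", a, b, carry, sum);
            }
        }
    }
}
```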
Historically, there have been many different character sets, e.g., ASCII for English and Windows-1252 (extended ASCII) used by Windows 95-era systems to represent additional characters and symbols. However, modern computers use the Unicode character set for (structural) interoperability between computer systems. The current Unicode character set (v13.0) has 143,859 assigned code points and can expand to 1,114,112 code points.
While all the characters in a character set can be mapped to whole numbers, precision numbers (floats, doubles) are represented differently in computers. They are represented as a composite of a sign, a mantissa (significand), and an exponent:
± mantissa * 2^exponent
| Decimal | Binary | Comment |
| --- | --- | --- |
| 1.5 | 1.1 | 1*2^0 + 1*2^-1 |
| 33.25 | 100001.01 | 1*2^5 + 0*2^4 + 0*2^3 + 0*2^2 + 0*2^1 + 1*2^0 + 0*2^-1 + 1*2^-2 |
The example below shows how 33.25 is converted to its float (single precision) representation: 1 sign bit, 8 exponent bits, and 23 mantissa bits (a Java check of the result follows the table):
| Step | Result |
| --- | --- |
| Convert 33.25 to binary | 100001.01 |
| Normalized form [(-1)^s * mantissa * 2^exponent] | (-1)^0 * 1.0000101 * 2^5 |
| Convert the exponent using biased notation; represent the decimal as binary | 5 + 127 = 132 decimal = 1000 0100 binary |
| Normalize the mantissa; adjust to 23 bits by padding 0s | 000 0101 0000 0000 0000 0000 |
| Represent the 4 bytes (32 bits) | 0100 0010 0000 0101 0000 0000 0000 0000 |
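You can check this result directly in Java; a small sketch (illustrative) that prints the raw IEEE 754 bits of 33.25f:

```java
public class FloatBits {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(33.25f);
        // Prints 42050000: sign 0, exponent 1000 0100 (132), mantissa 000 0101 followed by 0s.
        System.out.println(Integer.toHexString(bits));
        // toBinaryString drops the leading 0 of the sign bit, so 31 digits are printed.
        System.out.println(Integer.toBinaryString(bits));
    }
}
```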
Some scientific computing requires double precision to handle the underflow/overflow issues of single precision. Double precision (64 bits) uses 1 sign bit, 11 exponent bits, and 52 mantissa bits. There are also long doubles (extended precision), which can store up to 128 bits of information. This binary representation simplifies the arithmetic operations (add, multiply) in the electronics.
Despite great computer precision, some software manages decimals as two separate fields, either (numerator and denominator) or (digits before and after the decimal point), stored as multi-byte integers. These are called “Fraction” or “Decimal” data types and are usually used to store money, where precision loss is unacceptable (i.e., 20.20 USD is 20 dollars and 20 cents, not 20 dollars and 0.199999999999 dollars).
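A short Java sketch showing why an exact decimal type matters for money (BigDecimal is one such type; the values are illustrative):

```java
import java.math.BigDecimal;

public class MoneyPrecision {
    public static void main(String[] args) {
        // Binary floating point cannot represent 0.1 or 0.2 exactly.
        System.out.println(0.1 + 0.2);  // 0.30000000000000004
        // A decimal type keeps the exact value instead.
        BigDecimal total = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(total);      // 0.3
    }
}
```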
What is Data Encoding?
Encoding is converting data, represented by a sequence of numbers from the character-set mapping, into bits and bytes for storage or transfer of data. The encoding can be fixed width or variable width. Base64 encoding maps every 6 bits of input to one of 64 ASCII characters (A-Z, a-z, 0-9, and two special characters), each written as a fixed-width 8-bit byte. UTF-8 encoding uses a variable width (1-4 bytes) to represent the Unicode character set.
| Text | Base64 | UTF-8 |
| --- | --- | --- |
| earth | ZWFydGg= | 01100101 01100001 01110010 01110100 01101000 |
| éarth | w6lhcnRo | 11000011 10101001 01100001 01110010 01110100 01101000 |
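The same values can be reproduced with Java’s standard library; a minimal sketch (class name is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Encodings {
    public static void main(String[] args) {
        byte[] earth = "earth".getBytes(StandardCharsets.UTF_8);
        System.out.println(earth.length);                                  // 5 bytes
        System.out.println(Base64.getEncoder().encodeToString(earth));     // ZWFydGg=

        byte[] accented = "éarth".getBytes(StandardCharsets.UTF_8);
        System.out.println(accented.length);                               // 6 bytes ('é' takes 2 bytes in UTF-8)
        System.out.println(Base64.getEncoder().encodeToString(accented));  // w6lhcnRo
    }
}
```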
Base64 is usually used to make binary data safe for transfer over text-only media. E.g., a modem/printer might interpret raw binary data differently (sometimes as control commands), so Base64 encoding converts the data into plain ASCII to be media-safe. The data is still transferred as binary; however, since every byte is printable ASCII, the modem/printer is not confused. Notice that Base64 increases the number of bytes: “earth” (5 bytes) is encoded as “ZWFydGg=” (8 bytes). The data is decoded back to binary at the receiver’s end. The example below shows the process:
| Step | Description | Result |
| --- | --- | --- |
| 1 | earth (40 bits) | 01100101 01100001 01110010 01110100 01101000 |
| 2 | Pad with 0s so the bit count is a multiple of 6 at a byte boundary (48 bits = 6 bytes, a multiple of 6) | 01100101 01100001 01110010 01110100 01101000 00000000 |
| 3 | Regroup into 6-bit groups | 011001 010110 000101 110010 011101 000110 100000 000000 |
| 4 | Use the Base64 table to map each group to a character (see Wikipedia for the Base64 alphabet); groups made entirely of padding become ‘=’ | ZWFydGg= |
| 5 | Convert the ASCII characters to binary to store or transfer | 01011010 01010111 01000110 01111001 01100100 01000111 01100111 00111101 |
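At the receiver’s end the steps are reversed; in Java, decoding is a one-liner (a sketch, names illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        // Map each ASCII character back to its 6-bit value, regroup into
        // 8-bit bytes, and drop the padding bits.
        byte[] decoded = Base64.getDecoder().decode("ZWFydGg=");
        System.out.println(new String(decoded, StandardCharsets.UTF_8)); // earth
    }
}
```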
There are many different types of encodings – UTF-7, UTF-16, UTF-16BE, UTF-32, UCS-2, and many more.
What is Data Endianness?
Endianness is the order of bytes in memory/storage or transfer. There are two primary types of Endianness: big-endian and little-endian. You might be interested in middle-endian (mixed-endian), and you can google that on your own.
As you can see in the diagram below, the computer may store the data starting with the most significant byte (0x0A) or the least significant byte (0x0D).
[Diagram: byte order of a 4-byte value in memory, big-endian (most significant byte 0x0A first) vs. little-endian (least significant byte 0x0D first)]
Most modern computers are little-endian when they store multi-byte data. Networks, however, are consistently big-endian (network byte order). So, multi-byte values from little-endian machines must be converted to big-endian before they are sent over the network.
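A small Java sketch of the two byte orders (the 4-byte value 0x0A0B0C0D is an assumed example matching the bytes mentioned above):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Endianness {
    public static void main(String[] args) {
        int value = 0x0A0B0C0D;  // assumed example value: MSB 0x0A, LSB 0x0D

        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();

        System.out.printf("big-endian:    %02X %02X %02X %02X%n", big[0], big[1], big[2], big[3]);             // 0A 0B 0C 0D
        System.out.printf("little-endian: %02X %02X %02X %02X%n", little[0], little[1], little[2], little[3]); // 0D 0C 0B 0A
    }
}
```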
Summary: There are many data types: basic (chars, integers, floats) and composite (arrays, decimals). Data is mapped to numbers using a universal character set (Unicode). This data is represented as a sequence of Unicode code points and converted into bits/bytes using an encoding process. The encoding can be fixed-length (e.g., Base64, UTF-32) or variable-length (UTF-8, UTF-16). Computers can be little- or big-endian. Modern x86 (CISC) computers are little-endian; ARM (RISC) processors support both byte orders but usually run little-endian, while some older RISC architectures are big-endian. Networks are always big-endian.
Tips/Tricks: Stick to the Unicode character set and the UTF-8 encoding scheme. Use Base64 to make data media-safe for transfer (e.g., use the URL-safe Base64 variant to encode strings carried in HTTP URLs). Using a modern programming language (e.g., Java) abstracts you from endianness. If you are an embedded engineer programming in C, you need to write code that is endianness-safe (e.g., be careful with type casts and memcpy across byte boundaries).
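For the URL case specifically, Java ships a URL-safe Base64 variant that swaps ‘+’ and ‘/’ for ‘-’ and ‘_’; a quick sketch (the input string is just an example):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class UrlSafeBase64 {
    public static void main(String[] args) {
        byte[] data = "name=earth&mood=:)".getBytes(StandardCharsets.UTF_8);
        // Standard Base64 may produce '+' and '/', which are not URL-safe.
        System.out.println(Base64.getEncoder().encodeToString(data));
        // The URL-safe variant uses '-' and '_' instead.
        System.out.println(Base64.getUrlEncoder().encodeToString(data));
    }
}
```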
Even with all this structure, we cannot convey meaning (semantics). An ‘A’ for the computer is always U+0041. If the programmer wants to convey which ‘A’ is meant (for example, a letter grade, a blood type, or just the character), more information must be encoded for the receiver to interpret. More on that in future blogs.
This one was too long even for me!