# MD5 Algorithm: Working and Uses with C Code

Hash functions take strings or files of arbitrary length and compress them into shorter lines unique to the particular message or file processed. Hashing is a process of scrambling a piece of information …

9 mins Hash functions take strings or files of arbitrary length and compress them into shorter lines unique to the particular message or file processed. Hashing is a process of scrambling a piece of information or data beyond recognition to produce an output called the hash. The hash functions are computationally feasible when run in the forward direction, but they become computationally expensive when run backwards to derive the message or input from the hash value; this makes the hashes (or hash digests) practically irreversible. Mathematically, it can be represented as follows:

H : {0, 1} → {0, 1}n

The hash functions take any combination of zeros and ones, i.e. binary input, and calculate a hash value of fixed length (say n) for the file or message given as input. The hash value is practically different and unique for every individual file or message passed as input to the hash functions but, theoretically, as we can derive from the mathematical representation of the hash functions, the input set is of infinite elements, and the output set consists of a finite number of elements i.e. the hashes; so, there must be infinitely many messages for which the hash value will be same. A visual interpretation is shown below: Any cryptographic hash function is not completely unbreakable but the hash functions are designed in such a way that calculating the messages from their hash values is so computationally expensive that it will not be feasible for the computational power available today or the hash values should be practically irreversible. On the other hand, the same hash function should run in polynomial time and efficiently in the forward direction.

There are three main features of a cryptographic hash function, and they are as follows:

1. Collision resistance: A collision in a function H is a pair of distinct inputs x and x’ such that H(x) = H(x’). In this case, we also say that x and x’ collide under H. A function H is collision-resistant if it is infeasible for any polynomial-time algorithm to find a collision in H.
2. Preimage resistance: Informally, a hash function is preimage resistant if given a message x and some hash digest y. It is hard for a polynomial-time algorithm to find a value x’ such that H(x’) = y.
3. Second preimage resistance: In simple words, a hash function is said to be second preimage resistant if given a message and its hash value. It is computationally impractical to find another message such that they have the same hash value. In other words, a hash function is the second preimage resistant if given x; it is hard for a polynomial-time algorithm to find x’ such that H(x) = H(x’).

There are many fields of application for cryptographic hash functions like digital signatures, public-key encryption, integrity verification, message authentication, password protection, key agreement protocols, and many other cryptographic protocols. Currently, encryption of an email, sending a message on your mobile phone, connecting to an HTTPS website, or connecting to a remote machine through IPSec or SSH all use a hash function somewhere under the hood. Git revision control system uses a hash function to identify files and changes in a repository, host-based intrusion detection systems (HIDS) use them to detect modified files, network-based intrusion detection systems (NIDS) use hashes to detect known-malicious data going through a network, forensic analysts use hash values to prove that digital artefacts have not been modified, Bitcoin uses a hash function in its proof-of-work systems. There are many more such use-cases of hash functions.

Some names of hash functions are MD5, SHA-1, SHA-256, BLAKE, Keccak, etc. The MD5 and SHA-1 hash functions are cryptographically broken but are still used for non-cryptographic uses like integrity checks on local servers, etc. SHA-2 and SHA-3 are cryptographically secure and used widely for various purposes.

## MD5 Hash Algorithm

MD5 hash function was designed as a successor of MD4 by Ronald Rivest in 1991. It is a one-way hash function, initially designed for digital signatures. It has been broken and is not used for cryptographic-purpose anymore but still can be used for integrity check of files against unintentional corruption. It generates a 128-bit hash digest for a given input message or file. The algorithm for the MD5 hash function can be elaborated in six steps:

Some extra bits are added to the original message such that the total length of the output becomes 64-bit less than a multiple of 512-bit. Padding bits have the first bit like one, and the rest are 0’s. ### Appending length bits

To make the total length of the output a multiple of 512, we need to add a 64-bit length section to the message, which is determined by the size of the message. For example, if the message is a string is “ball”, then the length bit will be (8 x 4) bits = 32 bits = 00100000...00 (64 characters) in binary form and in little-endian representation.

### Initializing MD buffers and constants

Four 32-bit buffers that contain some hexadecimal values are initialised, as shown below:

1. A = 0x67452301
2. B = 0xefcdab89
4. D = 0x10325476

### Dividing the message into blocks and sub-blocks

The whole output generated after appending the padding and length bits is a multiple of 512 in length. So the message is now divided into n blocks of 512-bit each. Each block of 512-bit is now divided into 16 sub-blocks of 32-bit each and four rounds of operations are performed on each sub-block. 16 operations (one on each sub-block) are performed in each round i.e 64 operations in each block of 512-bit.

### Operation on each sub-block

Here, the figure below represents a single MD5 operation on a sub-block in each 512-bits block. Here,

1. a,b,c,d: Chaining variables
2. P: some non-linear function
3. M[i]: ith part of the message
4. t[k]: kth constant
5. CLSS: circular Left Shift by s-bits Here Single MD5 operation can also be represented in the equation as:

a = b + ((a + P(b,c,d) + M[i] + t[k]) <<< s)

In each round, 16 operations are performed and in each round different auxiliary functions or process (P) is used:

1. F(B,C,D) = (B ∧ C) ∨ (¬B ∧ D)
2. G(B,C,D) = (B ∧ D) ∨ (C ∧ ¬D)
3. H(B,C,D) = B ⊕ C ⊕ D
4. I(B,C,D) = C ⊕ (B ∨ ¬D)

### Generating the message digest

After all the rounds have been performed, the buffers a, b, c and d are added with the initial buffers A, B, C and D and are re-assigned to a, b, c and d. Now, a, b, c and d contain the MD5 output of the message, starting from a and ending with d, but the representation should be little-endian. The resultant buffer is passed on as the starting buffer for the next 512-bit block (if the current block is not the last 512-bit block). A similar cycle goes on until the last block from which we get the 128-bit MD5 digest or hash. ### Implementation of MD5 algorithm in C

This is a simple form of implementation only for short strings and not for files or large inputs. This program can be tweaked further and improved to handle large files and be a full-fledged md5 program. It is open for contribution, see the code on GitHub.

``````#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

// initial MD buffers
uint32_t A, B, C, D;
#define LEFTROTATE(x, c) (((x) << (c)) | ((x) >> (32 - (c))))

// function for circular left shift

void md5(uint8_t *in_msg, size_t in_len)
{
// define the constants
uint32_t t[] = {0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee, 0xf57c0faf, 0x4787c62a,
0xa8304613, 0xfd469501, 0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be, 0x6b901122, 0xfd987193,
0xa679438e, 0x49b40821, 0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa, 0xd62f105d, 0x02441453,
0xd8a1e681, 0xe7d3fbc8, 0x21e1cde6, 0xc33707d6, 0xf4d50d87, 0x455a14ed, 0xa9e3e905, 0xfcefa3f8,
0x676f02d9, 0x8d2a4c8a, 0xfffa3942, 0x8771f681, 0x6d9d6122, 0xfde5380c, 0xa4beea44, 0x4bdecfa9,
0xf6bb4b60, 0xbebfbc70, 0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x04881d05, 0xd9d4d039, 0xe6db99e5,
0x1fa27cf8, 0xc4ac5665, 0xf4292244, 0x432aff97, 0xab9423a7, 0xfc93a039, 0x655b59c3, 0x8f0ccc92,
0xffeff47d, 0x85845dd1, 0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1, 0xf7537e82, 0xbd3af235,

uint32_t s[] = {7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12,
17, 22, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14,
20, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16,
23, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21};

// initialize MD buffers
A = 0x67452301;
B = 0xefcdab89;
D = 0x10325476;

// copy of MD buffers
uint32_t a = A;
uint32_t b = B;
uint32_t c = C;
uint32_t d = D;

// calculate length of the message after padding --------- to be derived
// here it should be 448 bits = 448/8 = 56 bytes

// for length bit we need to add 64 bits = 64/8 = 8 bytes
// 56 + 8 = 64 bytes
uint8_t new_msg = {0};

// copy the messages ascii value to the new message
for (int i = 0; i < in_len; i++)
{
new_msg[i] = in_msg[i];
}

// add a 1 just after the message for the beginning of padding bits
// ascii of 128 = 10000000
new_msg[in_len] = 128;

// length bit: here a string is taken hence each character is of 1 byte
// thus, 8*in_len = length of message
uint8_t len_bit = 8 * in_len;

// add the length bit to the new message

// a loop for accessing 512-bit blocks
// a loop for accessing all 16 32-bit blocks in the 512-bit block
// here only one 512-bit block is there -> skipping the outer loop
uint32_t i;
uint32_t *lit_end_msg = (uint32_t *)(new_msg);

// print the new message in
for (i = 0; i < 64; i++)
{
uint32_t f, g;
if (i < 16)
{
f = (b & c) | ((~b) & d);
g = i;
}
else if (i < 32)
{
f = (d & b) | ((~d) & c);
g = (5 * i + 1) % 16;
}
else if (i < 48)
{
f = b ^ c ^ d;
g = (3 * i + 5) % 16;
}
else
{
f = c ^ (b | (~d));
g = (7 * i) % 16;
}

// swap buffers
f = f + a + t[i] + lit_end_msg[g];
a = d;
d = c;
c = b;
b = b + LEFTROTATE(f, s[i]);
}

// Add this chunk's hash to result so far:
A += a;
B += b;
C += c;
D += d;

// // print the new message
// for (int i = 0; i < 64; i++)
// {
// 	printf("%d: %d\n", i + 1, new_msg[i]);
// }
// // print the little endian message
// // 97 106 97 121 = 01100001 01101010 01100001 01111001 (ajay)
// // in lit endian = 01111001 01100001 01101010 01100001 (121 97 106 97 - yaja)
// for (int i = 0; i < 64; i++)
// {
// 	printf("%d: %d\n", i + 1, lit_end_msg[i]);
// }
}

int main()
{
char *msg = "12345"; // enter your message here or the string to hash
size_t msg_len = strlen(msg);

// invoke the md5 function
md5(msg, msg_len);

// hash
printf("Hash: %x%x%x%x\n", A, B, C, D);
// need to reverse the little endianness
uint8_t *temp = (uint8_t *)&A;
printf("MD5 Hash: %x%x%x%x", temp, temp, temp, temp);
temp = (uint8_t *)&B;
printf("%x%x%x%x", temp, temp, temp, temp);
temp = (uint8_t *)&C;
printf("%x%x%x%x", temp, temp, temp, temp);
temp = (uint8_t *)&D;
printf("%x%x%x%x", temp, temp, temp, temp);
}``````

### Output:

``````MD5 Hash: 29e45782db729fa1059d4294ede399
``````

## Reference:

1. Rivest R. (April 1992). “Step 4. Process Message in 16-Word Blocks”. The MD5 Message-Digest Algorithm. IETF. 