Saturday, August 18, 2007

Detecting overflow/underflow when adding/subtracting unsigned integers in C++

Assume three unsigned integers of the same type. In these examples, the type is char unsigned, but it can be any other unsigned integer type. In these examples, UCHAR_MAX is 255.

(Observe that, in any unsigned integer addition, where the highest possible value (2n − 1) is added to itself, the result is 2n − 1 + 2n − 1 = 2n − 2 where n is at least 8 or more in this context.)

Adding unsigned integers of the same type

When the sum is less than the augend then the sum is the result of an overflow.
char unsigned augend (UCHAR_MAX);
char unsigned const addend (UCHAR_MAX);
char unsigned const sum (augend + addend);

if (sum < augend)
{
std::puts ("Overflowed!");
}
sum = augend + addend
sum = 255 + 255
sum = 510 modulo 256 // Behind the scenes.
sum = 254

In the event of an unsigned integer overflow, the sum will always be less than the augend. In this example, an overflow occured because the sum (254) is less than the augend (255).

Subtracting unsigned integers of the same type

When the difference is greater than the minuend then the difference is the result of an underflow.
char unsigned minuend (0);
char unsigned const subtrahend (1);
char unsigned const difference (minuend − subtrahend);

if (difference > minuend)
{
std::puts ("Underflowed!");
}
The C++ standard (ISO 14882:2003): "Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2n where n is the number
of bits in the value representation of that particular size of integer." + "This implies that unsigned arithmetic does not overflow because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type."

The original test program
#include <climits>
#include <cstdio>

void addition ()
{
char unsigned augend (0);
char unsigned const addend (1);
char unsigned sum;

for (int loop_count (0); loop_count < INT_MAX; ++loop_count)
{
sum = augend + addend;

if (sum < augend) // Unsigned integer overflow?
{
std::printf ("[%3d] %3u + %u = %u (Overflowed!)\n", loop_count, augend, addend, sum);
break;
}
else
{
std::printf ("[%3d] %3u + %u = %u\n", loop_count, augend, addend, sum);
augend = sum;
}
}
}

void subtraction ()
{
char unsigned minuend (UCHAR_MAX);
char unsigned const subtrahend (1);
char unsigned difference;

for (int loop_count (0); loop_count < INT_MAX; ++loop_count)
{
difference = minuend - subtrahend;

if (difference > minuend) // Unsigned integer underflow?
{
std::printf ("[%3d] %3u - %u = %u (Underflowed!)\n", loop_count, minuend, subtrahend, difference);
break;
}
else
{
std::printf ("[%3d] %3u - %u = %u\n", loop_count, minuend, subtrahend, difference);
minuend = difference;
}
}
}

int main ()
{
addition ();
subtraction ();
}
To handle signed integers, see the pages of Bjarne Stroustrup. One method is demonstrated on pages 12 and 13 in the paper Abstraction and the C++ machine model [PDF].

(Motivation for this post) I looked but I could not find a simple technique to check whether an unsigned integer overflowed in C++, and here it is. As a bonus, the opposite allows checking for underflows.

Friday, August 17, 2007

Less anonymous posts

Anonymous posters are often tagged with a generic name, such as "nobody" (if you've been to SourceForge). Some forum systems publish the IP address of these posters for all to see, but a better solution would be to generate a number unique to the poster and the server, based on the IP address. This number would then be associated with each "nobody", so that it would be easier to see whether a poster is from different or like sources.

Of course, there are a number of variables involved (DHCP, remote connection with another network), and it really needs to be tested on a live system to see how well it works or doesn't; it should be usable in 80% of the cases.

What about registered users? They should also have such a number, or else a registered user could easily log off to debate with itself without anyone knowing.

The motivation for this proposal came from observing how easy it is for anonymous posters to fake a debate, and the need for "identifying" the source of anonymous posts. This is no real identification measure, but will filter simple abuse of a forum system from less knowledgable users. Besides, there's nothing stopping most people from registering multiple accounts (but only an "administrator" would typically have access to the information required to identify this kind of abuse).

IP addresses change, so ultimately, anonymous people should have the possibility of entering some name, which are easily discerned from registered users. It may be local to a thread for all intents and purposes.

The implementation of the one-way generated number should have one or more static source (at least the IP address) and one or more variable sources known to the server (seeds). Anyway, it's the implementation which ultimately decides the fate of this method.

Thursday, August 9, 2007

Naming scheme for digital storage units

Overall context

This paper defines a consistent naming scheme intended for use in the communication of storage units. This naming scheme is intended to replace less concrete symbols used to describe machine-level data structures. Examples of less concrete symbols are byte, char, int, long, word, dword, qword, short. It may also include information about endianness. It is not tied to binary.

Organization

  1. Definitions
    1. a primary storage unit
    2. a storage unit symbol
    3. a secondary storage unit
    4. compound storage units
      1. Endianness
  2. Thoughts
  3. About
    1. Notational Conventions
    2. Document

1. Definitions

1.1. a primary storage unit

The primary storage unit represents the number base (the radix) of a digit.

(Example 1:2) A device may have a number base of 2 (a radix of 2). Its primary storage unit could then be referred to as a binary digit (a BD, or more commonly a bit). The numerals for a binary digit are {0, 1}.

(Example 2:2) A device may have a number base of 3 (a radix of 3). Its primary storage unit could then be referred to as a ternary digit (a TD). The numerals for a ternary digit may then be {-1, 0, +1}.

Note: The number base (the radix) is at least two. There is no upper limit.

1.2. a storage unit symbol

The storage unit symbol is the letter D, short for digits, followed by a base 10 (denary (decimal)) positive integer representing the number of primary storage units. There is no space between the letter 'D' and the number. The number must be at least 1.

1.3. a secondary storage unit

The secondary storage unit must consist of at least one primary storage unit (at least one digit). The secondary storage unit is commonly the smallest addressable unit in a device. Larger storage units are formed by treating two or more secondary storage units as a compound storage unit. Compound storage units are subject to endianness. (Explained in the next section.)

(Example 1:3) A device may have eight primary storage units (eight digits) for its secondary storage unit. Its secondary storage unit may then be referred to as an octet (a group of eight). If its primary storage unit has a number base of 2 (a radix of 2) its secondary storage unit could be referred to as a binary octet.

"Binary Octet"
Weight2726252423222120
Value1286432168421

(Example 2:3) A device may have four primary storage units (four digits) for its secondary storage unit. Its secondary storage unit may then be referred to as a quaternary (a group of four). If its primary storage unit has a number base of 3 (a radix of 3) its secondary storage unit could be referred to as a ternary quaternary.

"Ternary Quaternary"
Weight33323130
Value27931

(Example 3:3) A device may have nine primary storage units (nine digits) for its secondary storage unit. Its secondary storage unit may then be referred to as a nonary (a group of nine). If its primary storage unit has a number base of 2 (a radix of 2) its secondary storage unit could be referred to as a binary nonary.

"Binary Nonary"
Weight282726252423222120
Value2561286432168421

1.4. compound storage units

A compound storage unit consists of two or more secondary storage units.

(Example 1:3) If the secondary storage unit consists of four primary storage units (four digits), then the smallest compound storage unit is a D8, the next smallest is a D12, and so on in multiples of the secondary storage unit (D4).

(Example 2:3) If the secondary storage unit consists of eight primary storage units (eight digits), then the smallest compound storage unit is a D16, the next smallest is a D24, and so on in multiples of the secondary storage unit (D8).

(Example 3:3) If the secondary storage unit consists of nine primary storage units (nine digits), then the smallest compound storage unit is a D18, the next smallest is a D27, and so on in multiples of the secondary storage unit (D9).

1.4.1. Endianness

(Background) Numbers are commonly read and evaluated from left to right. The left-most digit in a multi-digit number holds the most significant weight of the number. The right-most digit holds the least significant weight of the number. The number is said to be big-endian because the most significant digit is also the one which is read first. This assumes that a number is always read from left to right.

Endianness specifies whether the least significant, or most significant secondary storage unit (SSU) comes first, when a compound storage unit is accessed at an address [N].

The upper-case letter 'B', short for "Big-endian compound storage unit", follows the storage unit symbol (1.2.). There is no space between the storage unit symbol and the upper-case letter 'B'. The upper-case letter 'B' denotes that the compound storage unit is stored starting with its most significant SSU at address [N]. This is commonly referred to as big-endian.

The upper-case letter 'L', short for "Little-endian compound storage unit", follows the storage unit symbol (1.2.). There is no space between the storage unit symbol and the upper-case letter 'L'. The upper-case letter 'L' denotes that the compound storage unit is stored starting with its least significant SSU at address [N]. This is commonly referred to as little-endian.

(Example 1:1) A device may have a secondary storage unit which consists of eight primary storage units (D8). The primary storage unit is a binary digit for this example. A D32 compound storage unit has the value 16r80402010 (2,151,686,160) and is stored at address [N], where N is in units of a secondary storage unit. The table below shows how the memory organization of this value differs.

Type?Address?
[N][N + 1][N + 2][N + 3]
D32B16r8016r4016r2016r10
D32L16r1016r2016r4016r80

(Note: The convention common to most, if not all microprocessors, is to store data in an increasing order starting at an address [N].)

2. Thoughts

  • Using the same naming scheme to describe primitive data types; IU16L would be an integer (I) which is unsigned (U), 16-digit, and little-endian (L). IS16B would be an integer (I) which is signed (S), 16-digit, and big-endian (B). In the case of a signed integer, there is no information about the interpretation of the digits. Is it two's complement? Is it one's complement? This detail would typically be documented for that particular field, or stated elsewhere in the documentation if it applied to all signed integers, instead of being assumed as it is today (2007). There is also no information about the primary storage unit (the number base (the radix)) encoded anywhere. This must also be stated in the documentation, even though most devices operate in binary only, so much so that the word digital implies an association with the word binary.
  • An additional character could be prefixed to the storage unit symbol. It would denote whether a primary storage unit is binary, ternary, denary, and so on. This would be useful in contexts where different primary storage units are described. For instance, a D8 could be written BD8 for a secondary storage unit of eight binary digits (BDs), TD8 for a secondary storage unit of eight ternary digits (TDs), DD8 for a secondary storage unit of eight denary digits (DDs), and so on.
  • An additional character N could be used to specify native endianness. For instance, D16N may refer to a D16B, or a D16L, depending on the overlying context. It would also make the reader aware of that the endianness may be one or the other depending on an overlying context.

3. About

3.1. Notational Conventions

  • The general Smalltalk syntax <radix>r<digits> was used to represent hexadecimal numbers. There are several reasons for this. 1) The C variant has its origins in the peculiarity of parsing, rather than trying to appeal to the human mind using a more sensible syntax. This was not acceptable to sustain in this context. 2) Suffixing the number with a radix in subscript notation is not very usable in plain text. 3) Prefixing a number with $ was as poor a choice as the first one.
  • [N] is the array index notation used to denote an address N.

3.2. Document

  • Version: 1.0 (2006-12-03)
  • License: Free to distribute verbatim. Non-commercial. Non-modifiable.
  • Copyright: © 2006 A. R. Svendsen. All rights reserved.
  • Author: A. R. Svendsen

Tuesday, August 7, 2007

In retrospect

This quote has little to do with Supersymmetry, Extra Dimensions and the Origin of Mass.

[0h:41m:10s] "At the moment we have a system where we use Oracle at CERN, and MySQL at the other institutions. In retrospect, we might have been better off with MySQL everywhere, but there is a tendency for people who aren't experts to think; if you pay money it must be a better product."

Google Tech Talks: Marjorie D. Shapiro
Supersymmetry, Extra Dimensions and the Origin of Mass
June 18, 2007

Friday, August 3, 2007

The most important part of information

The most important part of information is (still) the date and time.