Saturday, August 18, 2007

Detecting overflow/underflow when adding/subtracting unsigned integers in C++

Assume three unsigned integers of the same type. In these examples, the type is char unsigned, but it can be any other unsigned integer type. In these examples, UCHAR_MAX is 255.

(Observe that, in any unsigned integer addition, where the highest possible value (2n − 1) is added to itself, the result is 2n − 1 + 2n − 1 = 2n − 2 where n is at least 8 or more in this context.)

Adding unsigned integers of the same type

When the sum is less than the augend then the sum is the result of an overflow.

char unsigned augend (UCHAR_MAX);
char unsigned const addend (UCHAR_MAX);
char unsigned const sum (augend + addend);

if (sum < augend)
{
std::puts ("Overflowed!");
}
sum = augend + addend
sum = 255 + 255
sum = 510 modulo 256 // Behind the scenes.
sum = 254

In the event of an unsigned integer overflow, the sum will always be less than the augend. In this example, an overflow occured because the sum (254) is less than the augend (255).

Subtracting unsigned integers of the same type

When the difference is greater than the minuend then the difference is the result of an underflow.
char unsigned minuend (0);
char unsigned const subtrahend (1);
char unsigned const difference (minuend − subtrahend);

if (difference > minuend)
{
std::puts ("Underflowed!");
}
The C++ standard (ISO 14882:2003): "Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2n where n is the number
of bits in the value representation of that particular size of integer." + "This implies that unsigned arithmetic does not overflow because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type."

The original test program
#include <climits>
#include <cstdio>

void addition ()
{
char unsigned augend (0);
char unsigned const addend (1);
char unsigned sum;

for (int loop_count (0); loop_count < INT_MAX; ++loop_count)
{
sum = augend + addend;

if (sum < augend) // Unsigned integer overflow?
{
std::printf ("[%3d] %3u + %u = %u (Overflowed!)\n", loop_count, augend, addend, sum);
break;
}
else
{
std::printf ("[%3d] %3u + %u = %u\n", loop_count, augend, addend, sum);
augend = sum;
}
}
}

void subtraction ()
{
char unsigned minuend (UCHAR_MAX);
char unsigned const subtrahend (1);
char unsigned difference;

for (int loop_count (0); loop_count < INT_MAX; ++loop_count)
{
difference = minuend - subtrahend;

if (difference > minuend) // Unsigned integer underflow?
{
std::printf ("[%3d] %3u - %u = %u (Underflowed!)\n", loop_count, minuend, subtrahend, difference);
break;
}
else
{
std::printf ("[%3d] %3u - %u = %u\n", loop_count, minuend, subtrahend, difference);
minuend = difference;
}
}
}

int main ()
{
addition ();
subtraction ();
}
To handle signed integers, see the pages of Bjarne Stroustrup. One method is demonstrated on pages 12 and 13 in the paper Abstraction and the C++ machine model [PDF].

(Motivation for this post) I looked but I could not find a simple technique to check whether an unsigned integer overflowed in C++, and here it is. As a bonus, the opposite allows checking for underflows.

Friday, August 17, 2007

Less anonymous posts

Anonymous posters are often tagged with a generic name, such as "nobody" (if you've been to SourceForge). Some forum systems publish the IP address of these posters for all to see, but a better solution would be to generate a number unique to the poster and the server, based on the IP address. This number would then be associated with each "nobody", so that it would be easier to see whether a poster is from different or like sources.

Of course, there are a number of variables involved (DHCP, remote connection with another network), and it really needs to be tested on a live system to see how well it works or doesn't; it should be usable in 80% of the cases.

What about registered users? They should also have such a number, or else a registered user could easily log off to debate with itself without anyone knowing.

The motivation for this proposal came from observing how easy it is for anonymous posters to fake a debate, and the need for "identifying" the source of anonymous posts. This is no real identification measure, but will filter simple abuse of a forum system from less knowledgable users. Besides, there's nothing stopping most people from registering multiple accounts (but only an "administrator" would typically have access to the information required to identify this kind of abuse).

IP addresses change, so ultimately, anonymous people should have the possibility of entering some name, which are easily discerned from registered users. It may be local to a thread for all intents and purposes.

The implementation of the one-way generated number should have one or more static source (at least the IP address) and one or more variable sources known to the server (seeds). Anyway, it's the implementation which ultimately decides the fate of this method.

Thursday, August 9, 2007

Naming scheme for digital storage units

Overall context

This paper defines a consistent naming scheme intended for use in the communication of storage units. This naming scheme is intended to replace less concrete symbols used to describe machine-level data structures. Examples of less concrete symbols are byte, char, int, long, word, dword, qword, short. It may also include information about endianness. It is not tied to binary.

Organization

  1. Definitions
    1. a primary storage unit
    2. a storage unit symbol
    3. a secondary storage unit
    4. compound storage units
      1. Endianness
  2. Thoughts
  3. About
    1. Notational Conventions
    2. Document

1. Definitions

1.1. a primary storage unit

The primary storage unit represents the number base (the radix) of a digit.

(Example 1:2) A device may have a number base of 2 (a radix of 2). Its primary storage unit could then be referred to as a binary digit (a BD, or more commonly a bit). The numerals for a binary digit are {0, 1}.

(Example 2:2) A device may have a number base of 3 (a radix of 3). Its primary storage unit could then be referred to as a ternary digit (a TD). The numerals for a ternary digit may then be {-1, 0, +1}.

Note: The number base (the radix) is at least two. There is no upper limit.

1.2. a storage unit symbol

The storage unit symbol is the letter D, short for digits, followed by a base 10 (denary (decimal)) positive integer representing the number of primary storage units. There is no space between the letter 'D' and the number. The number must be at least 1.

1.3. a secondary storage unit

The secondary storage unit must consist of at least one primary storage unit (at least one digit). The secondary storage unit is commonly the smallest addressable unit in a device. Larger storage units are formed by treating two or more secondary storage units as a compound storage unit. Compound storage units are subject to endianness. (Explained in the next section.)

(Example 1:3) A device may have eight primary storage units (eight digits) for its secondary storage unit. Its secondary storage unit may then be referred to as an octet (a group of eight). If its primary storage unit has a number base of 2 (a radix of 2) its secondary storage unit could be referred to as a binary octet.

"Binary Octet"
Weight2726252423222120
Value1286432168421

(Example 2:3) A device may have four primary storage units (four digits) for its secondary storage unit. Its secondary storage unit may then be referred to as a quaternary (a group of four). If its primary storage unit has a number base of 3 (a radix of 3) its secondary storage unit could be referred to as a ternary quaternary.

"Ternary Quaternary"
Weight33323130
Value27931

(Example 3:3) A device may have nine primary storage units (nine digits) for its secondary storage unit. Its secondary storage unit may then be referred to as a nonary (a group of nine). If its primary storage unit has a number base of 2 (a radix of 2) its secondary storage unit could be referred to as a binary nonary.

"Binary Nonary"
Weight282726252423222120
Value2561286432168421

1.4. compound storage units

A compound storage unit consists of two or more secondary storage units.

(Example 1:3) If the secondary storage unit consists of four primary storage units (four digits), then the smallest compound storage unit is a D8, the next smallest is a D12, and so on in multiples of the secondary storage unit (D4).

(Example 2:3) If the secondary storage unit consists of eight primary storage units (eight digits), then the smallest compound storage unit is a D16, the next smallest is a D24, and so on in multiples of the secondary storage unit (D8).

(Example 3:3) If the secondary storage unit consists of nine primary storage units (nine digits), then the smallest compound storage unit is a D18, the next smallest is a D27, and so on in multiples of the secondary storage unit (D9).

1.4.1. Endianness

(Background) Numbers are commonly read and evaluated from left to right. The left-most digit in a multi-digit number holds the most significant weight of the number. The right-most digit holds the least significant weight of the number. The number is said to be big-endian because the most significant digit is also the one which is read first. This assumes that a number is always read from left to right.

Endianness specifies whether the least significant, or most significant secondary storage unit (SSU) comes first, when a compound storage unit is accessed at an address [N].

The upper-case letter 'B', short for "Big-endian compound storage unit", follows the storage unit symbol (1.2.). There is no space between the storage unit symbol and the upper-case letter 'B'. The upper-case letter 'B' denotes that the compound storage unit is stored starting with its most significant SSU at address [N]. This is commonly referred to as big-endian.

The upper-case letter 'L', short for "Little-endian compound storage unit", follows the storage unit symbol (1.2.). There is no space between the storage unit symbol and the upper-case letter 'L'. The upper-case letter 'L' denotes that the compound storage unit is stored starting with its least significant SSU at address [N]. This is commonly referred to as little-endian.

(Example 1:1) A device may have a secondary storage unit which consists of eight primary storage units (D8). The primary storage unit is a binary digit for this example. A D32 compound storage unit has the value 16r80402010 (2,151,686,160) and is stored at address [N], where N is in units of a secondary storage unit. The table below shows how the memory organization of this value differs.

Type?Address?
[N][N + 1][N + 2][N + 3]
D32B16r8016r4016r2016r10
D32L16r1016r2016r4016r80

(Note: The convention common to most, if not all microprocessors, is to store data in an increasing order starting at an address [N].)

2. Thoughts

  • Using the same naming scheme to describe primitive data types; IU16L would be an integer (I) which is unsigned (U), 16-digit, and little-endian (L). IS16B would be an integer (I) which is signed (S), 16-digit, and big-endian (B). In the case of a signed integer, there is no information about the interpretation of the digits. Is it two's complement? Is it one's complement? This detail would typically be documented for that particular field, or stated elsewhere in the documentation if it applied to all signed integers, instead of being assumed as it is today (2007). There is also no information about the primary storage unit (the number base (the radix)) encoded anywhere. This must also be stated in the documentation, even though most devices operate in binary only, so much so that the word digital implies an association with the word binary.
  • An additional character could be prefixed to the storage unit symbol. It would denote whether a primary storage unit is binary, ternary, denary, and so on. This would be useful in contexts where different primary storage units are described. For instance, a D8 could be written BD8 for a secondary storage unit of eight binary digits (BDs), TD8 for a secondary storage unit of eight ternary digits (TDs), DD8 for a secondary storage unit of eight denary digits (DDs), and so on.
  • An additional character N could be used to specify native endianness. For instance, D16N may refer to a D16B, or a D16L, depending on the overlying context. It would also make the reader aware of that the endianness may be one or the other depending on an overlying context.

3. About

3.1. Notational Conventions

  • The general Smalltalk syntax <radix>r<digits> was used to represent hexadecimal numbers. There are several reasons for this. 1) The C variant has its origins in the peculiarity of parsing, rather than trying to appeal to the human mind using a more sensible syntax. This was not acceptable to sustain in this context. 2) Suffixing the number with a radix in subscript notation is not very usable in plain text. 3) Prefixing a number with $ was as poor a choice as the first one.
  • [N] is the array index notation used to denote an address N.

3.2. Document

  • Version: 1.0 (2006-12-03)
  • License: Free to distribute verbatim. Non-commercial. Non-modifiable.
  • Copyright: © 2006 A. R. Svendsen. All rights reserved.
  • Author: A. R. Svendsen

Wednesday, August 8, 2007

"Simultaneous" events

Have you ever read something, while at the "same time" listening to something (radio, TV, video, ...) in the "background", ánd, have you then noticed that, as you read the word, you heard someone utter it at the "exact time", ánd, have you noticed this coincidence several times?

This is not a rare occurence from personal experience, but it is still somewhat amusing, though not as amusing as it was at first. Because this happens with such a high frequency, I can only assume that it must be a common phenomenon.

There may even be a word describing this phenomenon, but that term may not be well known. I don't know about it.

What came into the comparison algorithm of the mind first, I wonder. The sound or the word?

Tuesday, August 7, 2007

In retrospect

This quote has little to do with Supersymmetry, Extra Dimensions and the Origin of Mass.

[0h:41m:10s] "At the moment we have a system where we use Oracle at CERN, and MySQL at the other institutions. In retrospect, we might have been better off with MySQL everywhere, but there is a tendency for people who aren't experts to think; if you pay money it must be a better product."

Google Tech Talks: Marjorie D. Shapiro
Supersymmetry, Extra Dimensions and the Origin of Mass
June 18, 2007

Friday, August 3, 2007

The most important part of information

The most important part of information is (still) the date and time.

Thursday, August 2, 2007

Questionable programming (QP) in C++

In questionable progamming (QP), there are two kinds of possible return values for non-special member functions; these are bool and void, the two possible return values for non-special member functions.

bool function_name ()
{
bool result (false);
set_status (error_function_name_not_implemented);
return result;
}
Note the structure. This is how all bool-returning functions must be written. They must always assume that the result is false. They must always give feedback. This is a general case because the function is not implemented and has nothing to show for it. Here's a more concrete example on structuring, but not so much on verbosity, but some (reset_error ()).
bool demonstration ()
{
bool result (false);

set_status (error_a); // Assumes a will return false.

if (a ())
{
set_status (error_b); // Assumes b will return false.

if (b ())
{
set_status (error_c); // Assumes c will return false..

if (c ())
{
reset_error (); // (set_status (status_none);)
result = true; // ..unless c doesn't.
}
}
}

return result;
}
Now, if the purpose of a function is to break a process into smaller pieces, resulting in a program flow which is easier to see, then that's the sole purpose of a function in questionable programming. Combine this with a descriptive use of words for functions, to make their use more clear. Observe that the function of functions is to act as language statements. You can view "if ()" the same way; "if" is a function which takes a bool as its input parameter, but in addition it's a language construct and gets the special treatment that allows the program to be structured in the C and C++ syntax way of things.

(A function that returns bool must work in an if-statement, but one that returns void will break an if-statement. No other return values exist.)

Questionable programming (QP) makes the program flow part clear. It effectively separates most of the data manipulation from the control logic, and it may make programs more verbose (it depends on the author).

So how does one return anything? If you know C++ then you already know; references. To return something you would use a reference. Of course, this is what a member function does; it manipulates internal state. It could also manipulate local state if it was passed a reference as a parameter, but in general, it would modify data members when called, though it is nothing wrong with modifying a parameter.

In summary
class song_t // "a song type (_t)"
{
public:

enum status_t // a status type (_t)
{
status_none
};

public:

song_t ()
: m_status (status_none)
{
}

void set_status (status_t const status)
{
m_status = status;
}

private:

status_t m_status;
};
One last thing in general C++ programming; use const wherever possible. The more the compiler knows beforehand..
int main (int const argc, char const * const * const argv)
{
}

LRESULT CALLBACK WindowProcedure
(
HWND const window,
UINT const message,
WPARAM const W,
LPARAM const L
)
{
// (...)
}

(PS: The verbosity part could probably need more words.)