Checking return values

Yes, I had to post one more, hopefully shorter, rant on logger.c. This has to do with “checking return values” in OpenSource code.

#ifndef MSG_NOSIGNAL
# define MSG_NOSIGNAL 0
#endif

 if (sendmsg(ctl->fd, &message, MSG_NOSIGNAL) < 0) {
     logger_reopen(ctl);
     if (sendmsg(ctl->fd, &message, MSG_NOSIGNAL) < 0)
         warn(_("send message failed"));
 }

Code like this is rampant in OpenSource. It passes the quick and dirty teen age “code review” but it is not production quality. If Linux and the OpenSource community wants to gain respect it needs to start writing production quality code. This means real QA not automated tests run via Jenkins which test nothing.

The above code is NOT checking the return value in a production quality manner. It isn’t catching the error and reporting it via some human readable log entry. Ideally it should catch the error and report it along with the human readable text associated with the error.

Error: 12345 in blah - Severe indigestion

The hapless schmoe who has to debug this has to modify the code before they can even gain access to the error in the debugger. This code adheres to the letter of “checking the return value” without adhering to the spirit of it in a production quality manner. Snippets like this make it nigh on impossible to port OpenSource code to a regular production platform.

Scarier still are the publicly traded corporations running production systems on operating systems containing this type of code. Then they wonder how they got hacked without knowing it. Well, quite simply, because error return values weren’t checked in a production quality manner.

 

It’s okay to hate the 12 year old boys who write the bulk of OpenSource code

Even if their biological clock states they are north of 40, they never got past 12 when it comes to coding. In case you can’t see the featured image, the source file came from here. And since they will _hopefully_ sweep that up, here is the first snippet.

/* this creates a timestamp based on current time according to the
 * fine rules of RFC3164, most importantly it ensures in a portable
 * way that the month day is correctly written (with a SP instead
 * of a leading 0). The function uses a static buffer which is
 * overwritten on the next call (just like ctime() does).
 */
static char const *rfc3164_current_time(void)
{
	static char time[32];
	struct timeval tv;
	struct tm *tm;
	static char const * const monthnames[] = {
		"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug",
		"Sep", "Oct", "Nov", "Dec"
	};

	logger_gettimeofday(&tv, NULL);
	tm = localtime(&tv.tv_sec);
	snprintf(time, sizeof(time),"%s %2d %2.2d:%2.2d:%2.2d",
		monthnames[tm->tm_mon], tm->tm_mday,
		tm->tm_hour, tm->tm_min, tm->tm_sec);
	return time;
}

The second problem is they are returning the address of a variable which is inside of the function. Those bounding {} of the function delineate the scope. The static qualifier means the function is only visible from within this source file. So, to try and defeat this, our little 12 year old stuck a static qualifier on the character array (string). ___Maybe___ in the latest and greatest C standard (which I haven’t read) the compiler is now required to BOTH save the value AND keep it externally accessible while the rest of the routine is being garbage collected away.

You see, the definition which was/is in most of the compilers out there, especially those which stopped updating at C99 or a bit sooner, that’s not a requirement. The “requirement” was that the variable only be initialized once and that it retain its value across multiple function calls. There was/is nothing about it still being externally accessible.

When you are writing OpenSource, you aren’t writing for the latest and greatest, you are writing for whatever happens to be out there.

While we are at it, choose “time” for a variable name, especially for a string, is Uber stupid. I couldn’t believe it compiled. My gut tells me that if this code really is working on PC platforms it is working due to a bad architecture and compiler with shit garbage collection.

Don’t worry, I’m not going to cover it all, this would beĀ  a 2+ million word blog post if I did that.

#define NILVALUE "-"
static void syslog_rfc3164_header(struct logger_ctl *const ctl)
{
	char pid[30], *hostname;

	*pid = '\0';
	if (ctl->pid)
		snprintf(pid, sizeof(pid), "[%d]", ctl->pid);

	if ((hostname = logger_xgethostname())) {
		char *dot = strchr(hostname, '.');
		if (dot)
			*dot = '\0';
	} else
		hostname = xstrdup(NILVALUE);

	xasprintf(&ctl->hdr, "<%d>%.15s %s %.200s%s: ",
		 ctl->pri, rfc3164_current_time(), hostname, ctl->tag, pid);

	free(hostname);
}

You see, I started cross compiling this because adding all of the advertised features to the existing logger for OpenVMS was going to take several days to a week. Cross compiling this _should_ have been quicker. Yes, the current C compiler

$ cc/ver
HP C V7.1-015 on OpenVMS Alpha V8.3

Stopped around the time of the C99 standard. I won’t go so far as to say C99 is “all” there. I’ve banged into some partially implemented stuff with the networks, but haven’t dug in to see if they are part of the standard or not.

So the above snippet nests a call to our original snippet inside of a print routine. Once this got ported to OpenVMS I could step through the first snippet and see a perfectly formatted date string. When the sprintf() got done I had a big block-o-nothin where the date should be.

Old Timers, we know these things. Global data isn’t a bad thing, it exists because we need it. All of these little tricks trying to bob and weave around the Grim Reaper of garbage collection are just time bombs waiting to go off. So, a quick cut and paste moving that static char definition of time to global scope brought about the expected compilation error.

static char time[32]; /* place to store a time return value */
............^
%CC-W-MIXLINKAGE, In this declaration, "time" is declared with both internal and external linkage. The previous declaration is at l
ine number 378 in file SYS$COMMON:[SYSLIB]DECC$RTLDEF.TLB;1 (text module TIME).
at line number 112 in file RADIAN_ROOT:[2018-VMS-SYSLOG]logger.c;148

So, changing time[32] to time_str[32] and changing the few references to it brought about the desired outcome. What “could” have been the problem?

  • Grim Reaper garbage collection is faster on old DS-10 than modern Intel PC
  • DECC puts the value in a safe place when function goes out of scope and reassigns value when function is re-entered. The original storage location, however, is garbage collected.
  • Choosing “time” as a variable name, especially for a string, was an incredibly stupid thing to do and it collided with the OS level definition in a way which whacked the string value once outside of the function.
  • Compiler developers, having been told for years by unemployable academics without a single line of QA’ed code in production at any for-profit company that global data is bad, came up with a tweaking of the “static” definition to avoid making global what obviously needed to be global. This was done without concern for existing compilers or code in production.

You can probably come up with more.

The point is, the 12 year old boys always want to play with the new toys while the production coders will keep it simple. Production coders who go to independent QA departments know better than to push the envelop. Code they write today could be ported to a Z-80 chip with a 1980s compiler. You think I’m kidding? There are production systems today still running OS/2 Warp.

OS/2 Warp job posting

Oh, and for the record. I knew it would take 3 days to a week to add the bulk of the features to the existing ported code. This __should__ have taken one day to port. I’m now on day 3 tracking down stuff like this. Too close to finished to quit.

Thanks!