Skip to content

Arrogance and humility

05-May-09

Arrogance and humility are two qualities that are essential to programmers (perhaps, to everyone?). It’s hard to maintain a balance between these two opposite behaviors, so practice is essential.

Humility is important to prevent wasting your own and other people’s time. The simple manifestation of this is in troubleshooting. If something breaks in a system that you just perturbed, it’s your fault. You did it. Maybe you’ll find that the author of Foo clobbered the libraries used by Bar, but you broke it by installing Foo. Maybe Foo was installed by some installation script.  You broke it by allowing the script to be run. Maybe an intern ran the script.  You broke it by letting the intern have root.

This isn’t about blame: it’s about failure analysis. Because so much of computer systems is design (thoughts expressed as code, as config files, as permissions etc. are design in this sense), the most likely source of failure is in our actions, or worse, our assumptions.

This stands in contrast to traditional engineering fields where hardware failure is a more normal source of problems.  Software doesn’t wear out; it doesn’t need to be painted or have parts replaced.  Instead, it rots when unused.

But humility, as I am defining it, is not an ultimate end.  And achieving perfection — or even any worthwhile result — through humility is impossible.  In addition to humility you need the ability to try anyway, believing you can pull it off, even though failure is guaranteed on the first attempt, the second, the twentieth.   And eventualy success when writing new software usually comes through the redefinition of success to mean whatever it was you finally made.

In short you need arrogance.

More about Python

20-Feb-09

I just realized that there is an (obvious) biblical objection to the language Python.  Adam and Eve used C, naturally.  The Python came into the Garden of Eden as a tempter, bringing dynamic language features.  Adam and Eve succumbed to temptation and were thrown out of the garden, and now they… uh… analogy breaks down.

Now they have to code in Objective-C.

That’ll teach ’em.

New MFC App with ATL, VC6.0

11-Jan-09

OK, so we’re still using VC6.0.   Gluttons for punishment, I guess.

I recently created a new MFC app and had some… opportunities getting it to compile, so I thought I’d document them here for the three other people in the world not on VC9.

First: when you start a new MFC app and switch the project-level #define of _MBCS to UNICODE,_UNICODE, then compilation suddenly breaks with the error “msvcrtd.lib(crtexew.obj) : error LNK2001: unresolved external symbol _WinMain@16″.  Solution: add /entry:”wWinMainCRTStartup” to the linker command line in the text area under the Link tab.  In newer versions of VC, there is an “Advanced” linker tab that allows you to do this through the dialog interface.  Not present in VC6.

Second: when you add ATL to your MFC project (Insert | New ATL Object… ; click Yes on the dialog, then click cancel.  Now you’re cooking with ATL even though you have no ATL objects.)  Suddenly your debug builds don’t compile anymore.  Problem: the project-level #define of _CRTDBG_MAP_ALLOC causes errors like this:

Compiling…

StdAfx.cpp

c:\program files\microsoft visual studio\vc98\include\malloc.h(105) : error C2059: syntax error : 'constant'

c:\program files\microsoft visual studio\vc98\include\malloc.h(105) : error C2733: second C linkage of overloaded function '_calloc_dbg' not allowed

c:\program files\microsoft visual studio\vc98\include\malloc.h(105) : see declaration of '_calloc_dbg'
Solution: remove the _CRTDBG_MAP_ALLOC from your Project Settings and add it to your stdafx.h after your ATL headers.

STL lower_bound: set vs. list

01-Oct-08

Commentor Paul asks why we would use a list instead of a set, since a set has an efficient lower_bound() method and has cheaper insertion/deletion than a sorted vector; set insertion also preserves iterators.

First, recall the the lower_bound() on a list problem is purely a hypothetical, and I’m not using this data structure anywhere.  The point of the series is that the STL lower_bound() algorithm is suboptimal when used with bidirectional and forward iterators, not that you should use lists to store sorted data.

But to answer Paul’s question: they’re different data structures.  I would certainly choose a set if it was the most appropriate data structure for the job: if there were many random access insertions and deletions, only one instance of each key, and the data not dense enough to justify an array or a bit vector.  But a set is not a drop-in replacement for a list.

STL’s set is implemented as a red-black tree.  So the cost of insertion is O( lg N ) to find the insertion point, and potentially another O( lg N ) to rebalance the tree.  With a list, in contrast, it costs O( N ) to find the insertion point and O( 1 ) for each element inserted.

Now you’d have to make a heck of a lot of insertions at one point to make that pay off in the list’s favor.  In general.  

But the situation changes if the data structure is a concurrent shared resource.  With a set, you have to lock the whole data structure for an insertion, because the tree rebalancing operation can touch quite a few nodes and the tree is in an inconsistent state until the rebalance completes.

But with a list, you can use hand-over-hand locking during the search and also during the insertion, locking only the two nodes where the insertion occurs.  So a list offers better behavior with multiple concurrent accesses.

Finally, even if you use a set, you can’t take advantage of its efficient lower_bound method while using generic programming.  You’d have to know the data structure was a set.  In code:


set s<int>;
list l<int>;
// populate data structures
s.lower_bound( 5 );  // this is fast
lower_bound( l.begin(), l.end(), 5 ); // this is slow

// this is slow too because it can't use s.lower_bound()
lower_bound( s.begin(), s.end(), 5 );

Obviously if you’re in the scope that s is declared, you know it’s a set.  But if you have a generic routine that takes an iterator pair and then uses lower_bound(), you don’t get any benefit from using a set.

In a dynamic language such as Ruby, it doesn’t matter how the lower_bound method gets called: it ultimately would dispatch to the class-specific one if it existed.  But in C++ / STL generic programming, where you would use the lower_bound() template function defined in <algorithm>, all STL knows is what type of iterator it is, by looking at the iterator tag.  Both list and set have bidirectional iterators.

I’m not even sure it would work to try to provide specializations for lower_bound specific to sets —

template< class T >
set<T>::iterator lower_bound(
  set<T>::iterator first, set<T>::iterator last, const T& v )
{// and so on, for set<T>::const_iterator
// and don't forget multiset and reverse iterators too!

Because to my relatively untutored C++ eyes, that looks like a partial template specialization of a function, and those aren’t allowed.  And even if they were, how do you get from the iterator to the parent object in order to call its lower_bound() method?

Snapback2: –link-dest arg does not exist: ../hourly.1

19-Sep-08

I added a new host to snapback a while ago and started getting these horrible messages mailed to me by cron every six hours.  The entire body of the email was:

linkdest arg does not exist: ../hourly.1

I finally fixed it today.

The problem?  Snapback was backing up an empty directory.  One solution would be to remove the directory from snapback.  I chose to instead create an empty file ($ touch empty) in the directory because I don’t want to have to come back and re-set up snapback for that directory if it ever gets populated.

I’ll see if its fixed in four more hours.

You’ve been coding for the Intel platform too long when…

03-Sep-08

You know you’ve been coding for the Intel platform too long when “eax” looks more like a word than “wax”.

I typed “wax”; it looked wrong, so I semi-consciously backspaced out and retyped it as “eax”. Which looked more correct. Then my conscious mind realized I really did want “wax” so I changed it back.

Problems with Redundant dhcpd/dns (part 2)

25-Aug-08

As I mentioned last time, I ran into some challengesincredibly annoying problems setting up a dhcpd failover server.

1. Dhcpd doesn’t record full data for foreign leases.

At first I thought it would be enough to run dhcpd with a failover peer and use djb_update.pl to create the host list for tinydns.  But then I discovered that when the dhcpd master grants a lease, it communicates to the dhcpd slave.  But this communication apparently consists only of the slave’s IP address and ethernet address, and not its hostname.  So it’s not possible to rely on the dhcpd failover protocol to keep the name-IP mappings synchronized on the dhcpd servers.

2. Dhcpd servers race to respond.

Like little virtual fire engines and paramedics, the dhcpd servers race to answer a poor IPless host’s cry for help.  It took me a while to see this one because taurus and draco were initially on the same segment.  Then we did a server move and got hung up halfway, with taurus in its new location but draco still in the old one.  Then all of a suddent a bunch of hosts disappeared from dns.

The problem appears to be that the nearest dhcpd server usually wins in assigning IP addresses, and the way dhcpd failover is implemented is a lot more like load-balancing than I would have thought.  In my opinion, the failover peer should be quiet unless its peer is down (which it is aware of as part of the failover protocol).  In any case, the failover server is live and ready to fill requests, and if it happens to be the nearest server to a given host, it gives out the IP address.

So it’s not enough to be able to get a host list from the failover peer if the main server is down.  At any time, some IPs may be assigned from the master host and some from the failover host.

3. DHcpd doesn’t put hardcoded hosts into its leases file.

This is arguably not a design defect in dhcpd — rather, I can’t imagine that this was an accident — but it sure is annoying.  When you assign a permanent IP to a host using this syntax:

host foobar { hardware ethernet de:ad:be:ef:ca:fe; fixed-address 10.1.1.15; }

then the host suddenly disappears from DNS.

So I managed to briefly lose taurus, my secondary dhcpd server.  Because taurus would be the backup source of internal DNS data, I decided to fix its IP address.  (dnscache can’t accept hostnames in its servers file, for the obvious reason.)  But I neglected to add an entry for taurus in the static host file, and when the old lease expired, dhcpd did not record a new one.  Even though taurus was still getting all its host configuration information from dhcp, including IP address, routing, name servers, etc., dhcpd did not consider it to have a lease since its address could not expire.  Since taurus did not have a valid lease in dhcpd.leases, the djb_update script didn’t put an entry for taurus into the dhcp file.

The most annoying thing about this one is that in order to fix the IP address of a host it’s not enough to change the dhcpd.conf; you also better remember to change the static host table.  And the penalty for error is a delayed failure, because the host won’t disappear from DNS until the old lease expires.

4. djb_update lets the last live lease it reads win.

This is a reasonable design decision since djb_update expects to run on one servers dhcpd.leases file, not on the concatenation of two servers’ files.  But it means that if a host gets a lease from the failover server and then the master recovers, if we manually renew the lease and now get a response from the master, there appear to be two live leases for the host.  (I’d call that a bug in the dhcp protocol, that the host can abandon the lease without notifying the failover server.)

Sinec djb_update doesn’t expect that to happen, it just runs through the leases file making a name -> IP hash using all the valid leases.  If it sees two valid leases for the same name, it uses whichever lease it sees last.  That means that order matters when concatenating the lease files.

A more robust solution would be to honor the most recently written active lease; to add another condition when updating the lease data that the start time of the lease under consideration must be greater than the start time of the saved lease.

This is one of those problems where it takes more time to complain about it than to fix it.  I plan to hack this into my already-hacked-up djb_update.pl script.

Next time: the file layout, Makefiles, scripts, duct tape and baling wire necssary to make this thing work.

Redundant DHCPd and Tinydns Setup (part 1)

22-Aug-08

Here are the before and after pictures. We used to have two independent single points of failure.

Old DHCPd Setup

If boris was down, nobody could get an IP address, and internal name resolution didn’t work because the cache couldn’t see the tinydns server (on boris) that is authoritative for the internal network.

If draco the firewall was down, not only could nobody connect to the internet (after all, that is the point of having a firewall), and external name resolution didn’t work. But internal name resolution didn’t work either, because the single caching name server (dnscache) lived on draco.

That’s a result of the “djb way,” where name servers are separate from caches. Tinydns won’t answer queries for a domain which it’s not authoritative over, and dnscache won’t ever give an authoritative answer, but only process requests from clients, get answers from authoritative servers, and cache the results.

New dhcpd setup

In the new setup, there are no single points of failure for dhcp or name service. (There’s still a single firewall host.) We moved all functions off of boris, which is slated for decommissioning.

Draco keeps a dnscache, since it has easy access to outside name servers. It picks up a dhcpd server and associated tinydns. I made it the dhcpd master, because it’s the least-frequently-rebooted server.

Since dnscache and tinydns use the same port (53/udp, the domain service), I had to configure an ethernet alias on draco to serve the tinydns off of. (I couldn’t use localhost, because it needs to be visible to the other dnscache server if the other tinydns server is down.)

Taurus, a new machine still figuring out its identity (external web/intranet/wiki/login/database/…?), gets the dhcpd slave server and an associated tinydns. At first I thought it made sense to separate the dhcpd from the tinydns server, but eventually I recognized that if one dhcpd server was down, it would not do any good to be trying to fetch its dhcpd.leases file to build a dns table. The functions are so closely associated — dhcp gives named hosts an IP address while tinydns reports what the IP address is.

Scorpio, the backup server (BackupPC/snapback), picks up an additional function as an internal name cache. That was a last-minute decision when I realized I needed a second name cache, and I’m not entirely happy about it. Since scorpio carries backup data, I’d like to be able to unplug it and take it offsite (e.g., if I’m reconstructing a crashed machine from its backups) without disrupting the network. So that’s going to have to change. I suppose there’s no deep reason why I can’t put the dnscache on taurus, but I’d have to configure another alias…

That’s the overall layout for redundant dhcpd and internal name service using tinydns and dnscache. Next post: “challenges” (aka annoying problems) I encountered making this work.

djbdns and DHCP server

19-Jul-08

There is a script for extracting DNS data from dhcpd.leases that I’ve been using for a few years.  It generates a tinydns data file which can be merged with the static file.

But I am not content with only one DHCP server.  I have implemented a backup/failover server.  Now I have twice the problems!

The ISC DHCP server only records the hostname in the dhcpd.leases file if it granted the lease. Although most leases will be written by the primary server, it would be a mistake to assume that the failover peer will never write a lease.

So the solution appears to be to run a dhcp-serving copy of axfrdns on each machine, and then pull that data together, collate it, and serve a single unified copy.
A picture might help, but I’m on a blackberry right now and can’t draw one.

I hope to implement this next week and report how it came out.

DHCP and djbdns

17-Jul-08

DHCP assigns IPs to hosts; DNS reports IPs by hostname. The popular dhcp server is even written by the folks who brought you BIND.

But in our shop, we run djbdns.

I am not a full convert to the djb way. (We run postfix, for example.). I have never met the man in person. I would probably find him infuriating. But I find his claims about BIND and DNS convincing. And I’ve been using his software for the last eight years without a major glitch.

I selected djbdns when I was setting up DNS for my first real website. I had bought the DNS and BIND O’Reilly book and was puzzling through the man pages and just feeling stupider and stupider the more time I spent on it. I figured there must be a better way, so I looked around and found djbdns. The setup instructions were clear and simple. I followed them and everything worked. I felt smart and competent.

Users like software that makes them feel good. There’s a lesson for us all there, I suspect.

Anyway, one of the quirks of djbdns is it doesn’t honor UPDATE requests. Unsecure, you see. Which makes it hard to do DHCP….