Sambal / STL lower_bound: set vs. list

Commentor Paul asks why we would use a list instead of a set, since a set has an efficient lower_bound() method and has cheaper insertion/deletion than a sorted vector; set insertion also preserves iterators.

First, recall the the lower_bound() on a list problem is purely a hypothetical, and I’m not using this data structure anywhere. Â The point of the series is that the STL lower_bound() algorithm is suboptimal when used with bidirectional and forward iterators, not that you should use lists to store sorted data.

But to answer Paul’s question: they’re different data structures. Â I would certainly choose a set if it was the most appropriate data structure for the job: if there were many random access insertions and deletions, only one instance of each key, and the data not dense enough to justify an array or a bit vector. Â But a set is not a drop-in replacement for a list.

STL’s set is implemented as a red-black tree. Â So the cost of insertion is O( lg N ) to find the insertion point, and potentially another O( lg N ) to rebalance the tree. Â With a list, in contrast, it costs O( N ) to find the insertion point and O( 1 ) for each element inserted.

Now you’d have to make a heck of a lot of insertions at one point to make that pay off in the list’s favor. Â In general. Â

But the situation changes if the data structure is a concurrent shared resource. Â With a set, you have to lock the whole data structure for an insertion, because the tree rebalancing operation can touch quite a few nodes and the tree is in an inconsistent state until the rebalance completes.

But with a list, you can use hand-over-hand locking during the search and also during the insertion, locking only the two nodes where the insertion occurs. Â So a list offers better behavior with multiple concurrent accesses.

Finally, even if you use a set, you can’t take advantage of its efficient lower_bound method while using generic programming. Â You’d have to know the data structure was a set. Â In code:


set s<int>;
list l<int>;
// populate data structures
s.lower_bound( 5 );  // this is fast
lower_bound( l.begin(), l.end(), 5 ); // this is slow

// this is slow too because it can't use s.lower_bound()
lower_bound( s.begin(), s.end(), 5 );

Obviously if you’re in the scope that s is declared, you know it’s a set. Â But if you have a generic routine that takes an iterator pair and then uses lower_bound(), you don’t get any benefit from using a set.

In a dynamic language such as Ruby, it doesn’t matter how the lower_bound method gets called: it ultimately would dispatch to the class-specific one if it existed. Â But in C++ / STL generic programming, where you would use the lower_bound() template function defined in <algorithm>, all STL knows is what type of iterator it is, by looking at the iterator tag. Â Both list and set have bidirectional iterators.

I’m not even sure it would work to try to provide specializations for lower_bound specific to sets —

template< class T >
set<T>::iterator lower_bound(
  set<T>::iterator first, set<T>::iterator last, const T& v )
{// and so on, for set<T>::const_iterator
// and don't forget multiset and reverse iterators too!

Because to my relatively untutored C++ eyes, that looks like a partial template specialization of a function, and those aren’t allowed. Â And even if they were, how do you get from the iterator to the parent object in order to call its lower_bound() method?

Sambal

STL lower_bound: set vs. list

5 Comments