Skip to content

STL lower_bound: set vs. list

Commentor Paul asks why we would use a list instead of a set, since a set has an efficient lower_bound() method and has cheaper insertion/deletion than a sorted vector; set insertion also preserves iterators.

First, recall the the lower_bound() on a list problem is purely a hypothetical, and I’m not using this data structure anywhere.  The point of the series is that the STL lower_bound() algorithm is suboptimal when used with bidirectional and forward iterators, not that you should use lists to store sorted data.

But to answer Paul’s question: they’re different data structures.  I would certainly choose a set if it was the most appropriate data structure for the job: if there were many random access insertions and deletions, only one instance of each key, and the data not dense enough to justify an array or a bit vector.  But a set is not a drop-in replacement for a list.

STL’s set is implemented as a red-black tree.  So the cost of insertion is O( lg N ) to find the insertion point, and potentially another O( lg N ) to rebalance the tree.  With a list, in contrast, it costs O( N ) to find the insertion point and O( 1 ) for each element inserted.

Now you’d have to make a heck of a lot of insertions at one point to make that pay off in the list’s favor.  In general.  

But the situation changes if the data structure is a concurrent shared resource.  With a set, you have to lock the whole data structure for an insertion, because the tree rebalancing operation can touch quite a few nodes and the tree is in an inconsistent state until the rebalance completes.

But with a list, you can use hand-over-hand locking during the search and also during the insertion, locking only the two nodes where the insertion occurs.  So a list offers better behavior with multiple concurrent accesses.

Finally, even if you use a set, you can’t take advantage of its efficient lower_bound method while using generic programming.  You’d have to know the data structure was a set.  In code:

set s<int>;
list l<int>;
// populate data structures
s.lower_bound( 5 );  // this is fast
lower_bound( l.begin(), l.end(), 5 ); // this is slow

// this is slow too because it can't use s.lower_bound()
lower_bound( s.begin(), s.end(), 5 );

Obviously if you’re in the scope that s is declared, you know it’s a set.  But if you have a generic routine that takes an iterator pair and then uses lower_bound(), you don’t get any benefit from using a set.

In a dynamic language such as Ruby, it doesn’t matter how the lower_bound method gets called: it ultimately would dispatch to the class-specific one if it existed.  But in C++ / STL generic programming, where you would use the lower_bound() template function defined in <algorithm>, all STL knows is what type of iterator it is, by looking at the iterator tag.  Both list and set have bidirectional iterators.

I’m not even sure it would work to try to provide specializations for lower_bound specific to sets —

template< class T >
set<T>::iterator lower_bound(
  set<T>::iterator first, set<T>::iterator last, const T& v )
{// and so on, for set<T>::const_iterator
// and don't forget multiset and reverse iterators too!

Because to my relatively untutored C++ eyes, that looks like a partial template specialization of a function, and those aren’t allowed.  And even if they were, how do you get from the iterator to the parent object in order to call its lower_bound() method?