TIP 217: Getting Sorted Indices out of Lsort

Login
Author:         James P. Salsman <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        26-Aug-2004
Post-History:   
Keywords:       Tcl,lsort,parallel lists
Tcl-Version:    8.5
Tcl-Ticket:     1017532

Abstract

An -indices option is proposed for the lsort command, returning the indices of the given list's elements in the order that they would have otherwise been sorted.

Rationale

When corresponding parallel lists must be simultaneously sorted or accessed in the order given by sorting them all according to one used as a list of keys, it is necessary to obtain the indices of the key list's elements in the order that they would be sorted, without actually sorting the list. For example, a list of first names and a corresponding list of last names can be displayed in side-by-side Tk listboxes, in which case we may want to sort both lists by either one used as the sorting key, or we may want to simultaneously iterate over both in either order. To do so, merely sorting a list is unhelpful; we need to obtain the indices of the key list in the order that its corresponding elements would be sorted.

Tk listboxes, database I/O, and statistics applications often involve heavy use of parallel lists. For this and other reasons, many programming languages starting at least as early as APL, up to present-day, numerics-oriented languages such as MATLAB, have included the ability to directly obtain the indices required to access a list (or "vector") in sorted order. As shown below, the fastest known pure Tcl solution to this problem takes about five times as long as the given reference implementation, which adds virtually no overhead when it is not invoked.

Specification

The lsort command may accept a new option, -indices. When lsort is invoked with this option, it will return a list of integer indices of the elements of the list given as the final argument to lsort, in the order that the elements would have been sorted had the -indices option not been specified.

This means an alternative (though less efficient for single lists) mechanism for producing a sorted list could be:

  set resultList [list]
  foreach idx [lsort -indices $sourceList] {
    lappend resultList [lindex $sourceList $idx]
  }

Reference Implementation

The reference implementation is available from SourceForge http://sourceforge.net/tracker/index.php?func=detail&aid=1017532&group_id=10894&atid=310894
It may need to be applied with patch -l or patch --ignore-whitespace or it may not apply entirely.

That reference implementation is a 109-line context diff, involving adding 20 lines of code to tclCmdIL.c, a single auto int of new variable memory overhead, and no more than three additional integer comparisons and one integer assignment per use of lsort if the new option is not invoked.

Compared to the following pure Tcl implementation, the reference implementation is 2.4 to 6.7 times faster. This very efficient Tcl implementation was provided by Lars Hellström:

  proc lsort-indices {itemL} {
    set pairL [list]
    foreach item $itemL {
      lappend pairL [list $item [llength $pairL]]
    }
    set indexL [list]
    foreach pair [lsort -index 0 -real $pairL] {
      lappend indexL [lindex $pair 1]
    }
    set indexL
  } 

The following timing data are the mean time returned from 20 different lists of random reals, with 10 iterations each:

  List size  Ref. Imp.  Pure Tcl  Speedup
  ---------  ---------  --------  -------   
      100       13.1      47.9      3.7
      200       33.9     224.1      6.6
      300       45.0     303.1      6.7
      400       62.0     360.6      5.8
      800      142.3     655.0      4.6
     1600      486.2    1150.0      2.4
     5000     1582.5    4847.6      3.1 

At present, the Reference Implementation does not file the -indices switch alphabetically in the C list of lsort switches, or the C switch statement that interprets them. This simple needs to be corrected before final check-in.

Suggested Documentation

In the lsort man page, under DESCRIPTION, change the first sentence:

"This command sorts the elements of list, returning a new list in sorted order."

... to read:

"This command sorts the elements of list, and returns a new list in sorted order, unless the -indices option is specified, in which case a list of integers is returned, corresponding to the indices of the given list's elements in the order that they otherwise would have been sorted."

Under EXAMPLES, at the end of the section, include the following lines:

 Obtaining ordered indices:

  % lsort -indices [list a c b]
  0 2 1
  % lsort -indices -unique -decreasing -real -index 0 \
          {{1.2 a} {34.5 b} {34.5 c} {5.6 d}}
  2 3 0

Tcl-core Discussion

Here are some highlights from the discussion of this TIP on the Tcl-core mailing list. No assurance is given that the discussion is either completely or impartially represented here.

Lars Hellström http://sourceforge.net/mailarchive/message.php?msg_id=9346824
described a pure Tcl solution virtually identical to the one shown above, "which could be complicated enough to warrent a special [lsort] option." He also suggested a -keycommand option for sorting on keys generated on-the-fly. Finally, he pointed out a flaw concerning the example in the Rationale from the original version of this TIP, which has since been corrected.

In reply to Lars, James Salsman http://sourceforge.net/mailarchive/message.php?msg_id=9348381 provided timing data and an efficient alternative to the -keycommand idea using this TIP's -indices proposal.

Acknowledgements

Thanks also to George Peter Staplin and Richard Suchenwirth for their kind help and good ideas at the genesis of this TIP.

Copyright

This document has been placed in the public domain.