More recent work is investigating how PIM-like ideas may port into quantum cellular array (QCA) and other nanotechnology logic, where instead of "Processing-In-Memory" we have opportunities for "Processing-In-Wire" and similar paradigm shifts. A key part of this work is the interplay among the underlying device physics, the design rules and metaphors best suited to using such devices, and how to recast "conventional" computing structures into such design metaphors in ways that optimize the overall system. One example of the former is the development of an FPGA-like basic cell that is well suited to QCA. An example of the latter is the development of a very dense memory model optimized for future molecular-level QCA devices, in which the memory has a recursive rather than array framework and there is no longer a separate "CPU," just traveling "memory requests" that have grown into complete lightweight threads.
While at IBM, one of his groups designed the first multi-processor PIM device with significant DRAM memory. This EXECUBE chip integrated 4 Mbits of DRAM with over 100K gates of logic to support, on a single chip, a complete 8-way binary hypercube parallel processor that could run in both SIMD and MIMD modes. He also designed and built the RTAIS parallel processor, which demonstrated a pure SIMD PIM-like architecture optimized for supporting a LINDA-like parallel processing model, with real-time scheduling included. Prior parallel machines included the IBM 3838 Array Processor, which for a time was the fastest single-precision floating-point processor marketed by IBM, and the Space Shuttle Input/Output Processor (IOP), which has flown on every Shuttle mission and probably represents the first true parallel processor to fly in space. The IOP also represents one of the earliest examples of multi-threaded architectures. His Ph.D. thesis on the parallel solution of recurrence equations was one of the early works on what are now called parallel prefix operations, and applications of those results are still acknowledged as defining the fastest possible implementations of circuits such as adders built from limited fan-in blocks (known as the Kogge-Stone adder).