diff --git a/Post-Scarcity-Hardware.md b/Post-Scarcity-Hardware.md
index 027fecd..617f96f 100644
--- a/Post-Scarcity-Hardware.md
+++ b/Post-Scarcity-Hardware.md
@@ -10,7 +10,7 @@ And I've been thinking, particularly, about one issue: process spawning on a new

# A map of the problem

-What got me thinking about this was watching the behaviour of the Clojure map function on my eight core desktop machine.
+What got me thinking about this was watching the behaviour of the [Clojure](http://clojure.org/) map function on my eight core desktop machine.

Mapping, in a language with immutable data, is inherently parallelisable. There is no possibility of side effects, so there is no particular reason for the computations to be run serially on the same processor. MicroWorld, being a cellular automaton, inherently involves repeatedly mapping a function across a two dimensional array. I was naively pleased that this could take advantage of my modern hardware - I thought - in a way in which similarly simple programs written in Java couldn't...

@@ -22,8 +22,9 @@ It turns out that Clojure's default *map* function simply serialises iterations

Except...

-Performance doesn't actually improve very much. Consider this function, which is the core function of the MicroWorld engine:
+Performance doesn't actually improve very much. Consider this function, which is the core function of the [MicroWorld](http://blog.journeyman.cc/2014/08/modelling-settlement-with-cellular.html) engine:
+
     (defn map-world
       "Apply this `function` to each cell in this `world` to produce a new world.
        The arguments to the function will be the world, the cell, and any
@@ -41,6 +42,7 @@ Performance doesn't actually improve very much. Consider this function, which is
                                      (cons world (cons % additional-args)))
                              row)))
                   world))))
+
As you can see, this maps across a two dimensional array, mapping over each of the rows of the array and, within each row, over each cell in the row. In this current version, I parallel map over the rows but serial map over the cells within a row.

@@ -80,7 +82,7 @@ Maxes out one single core, takes about 3.6 times as long as the hybrid version.

Now, I need to say a little more about this. It's obvious that there's a considerable set-up/tear-down cost for threads. The reason I'm using *pmap* for the outer mapping but serial *map* for the inner mapping, rather than the other way round, is to do more work in each thread.

-However, I'm still simple-mindedly parallelising the whole of one map operation and serialising the whole of the other. This particular array is 2048 cells square - so over four million cells in total. But, by parallelising the outer map operation, I'm actually asking the operating system for 2048 threads - far more than there are cores. I have tried to write a version of map using Runtime.getRuntime().availableProcessors() to find the number of processors I have available, and then partitioned the outer array into that number of partitions and ran the parallel map function over that partitioning:
+However, I'm still simple-mindedly parallelising the whole of one map operation and serialising the whole of the other. This particular array is 2048 cells square - so over four million cells in total. But, by parallelising the outer map operation, I'm actually asking the operating system for 2048 threads - far more than there are cores.
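The usual remedy for asking the operating system for far more threads than there are cores is to ask the runtime how many cores it can actually see, and split the work into that many large chunks, one thread per chunk. Purely as an illustrative sketch in Java (the language of the Stack Overflow discussion linked below; the class and method names here are my own invention, not MicroWorld code), the idea looks something like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class AdaptiveMap {

    // Chunk the input so there is roughly one chunk per available core,
    // then map over each chunk in its own thread: a few threads each doing
    // a lot of work, rather than one short-lived thread per element.
    static <A, B> List<B> adaptiveMap(Function<A, B> fn, List<A> items)
            throws InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        int chunkSize = Math.max(1, (items.size() + cores - 1) / cores);
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<List<B>>> futures = new ArrayList<>();
            for (int i = 0; i < items.size(); i += chunkSize) {
                final List<A> chunk =
                        items.subList(i, Math.min(i + chunkSize, items.size()));
                futures.add(pool.submit(() -> {
                    // Serial map within the chunk, as with the serial inner
                    // map in the hybrid Clojure version above.
                    List<B> mapped = new ArrayList<>(chunk.size());
                    for (A a : chunk) mapped.add(fn.apply(a));
                    return mapped;
                }));
            }
            List<B> result = new ArrayList<>(items.size());
            for (Future<List<B>> f : futures) result.addAll(f.get());
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> in = new ArrayList<>();
        for (int i = 1; i <= 8; i++) in.add(i);
        System.out.println(adaptiveMap(x -> x * x, in));
    }
}
```

Because the futures are collected in submission order, the result preserves the input order regardless of how many cores the work is spread across.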
I have tried to write a version of map using [Runtime.getRuntime().availableProcessors()](http://stackoverflow.com/questions/1980832/java-how-to-scale-threads-according-to-cpu-cores) to find the number of processors I have available, and then partitioned the outer array into that number of partitions and ran the parallel map function over that partitioning:

     (defn adaptive-map
       "An implementation of `map` which takes note of the number of available cores."
@@ -89,11 +91,11 @@ However, I'm still simple-mindedly parallelising the whole of one map operation
       parts (partition-all (/ (count list) cores) list)]
       (apply concat (pmap #(map fn %) parts))))

-Sadly, as A A Milne wrote, 'It's a good sort of brake But it hasn't worked yet.'
+Sadly, as [A A Milne wrote](http://licoricelaces.livejournal.com/234435.html), 'It's a good sort of brake But it hasn't worked yet.'

But that's not what I came to talk about. I came to talk about the draft...

-We are reaching the physical limits of the speed of switching a single processor. That's why our processors now have multiple cores. And they're soon going to have many more cores. Both Oracle (SPARC) and ARM are demoing chips with 32 cores, each 64 bits wide, on a single die. Intel and MIPS are talking about 48 core, 64 bit wide, chips. A company called Adapteva is shipping a 64 core by 64 bit chip, although I don't know what instruction set family it belongs to. Very soon we will have more; and, even if we don't have more cores on a physical die, we will have motherboards with multiple dies, scaling up the number of processors even further.
+We are reaching the physical limits of the speed of switching a single processor. That's why our processors now have multiple cores. And they're soon going to have many more cores.
Both Oracle ([SPARC](http://www.theregister.co.uk/2014/08/18/oracle_reveals_32core_10_beeellion_transistor_sparc_m7/)) and [ARM](http://www.enterprisetech.com/2014/05/08/arm-server-chips-scale-32-cores-beyond/) are demoing chips with 32 cores, each 64 bits wide, on a single die. [Intel and MIPS are talking about 48 core, 64 bit wide, chips](http://www.cpushack.com/2012/11/18/48-cores-and-beyond-why-more-cores/). A company called [Adapteva is shipping a 64 core by 64 bit chip](http://www.adapteva.com/products/silicon-devices/e64g401/), although I don't know what instruction set family it belongs to. Very soon we will have more; and, even if we don't have more cores on a physical die, we will have motherboards with multiple dies, scaling up the number of processors even further.

# The Challenge

@@ -101,15 +103,15 @@ The challenge for software designers - and, specifically, for runtime designers

## Looking for the future in the past, part one

-Thinking about this, I have been thinking about the Connection Machine. I've never really used a Connection Machine, but there was once one in a lab which also contained a Xerox Dandelion I was working on, so I know a little bit about them. A Connection Machine was a massively parallel computer having a very large number - up to 65,536 - of very simple processors (each processor had a register width of one bit). Each processor node had a single LED lamp; when in use, actively computing something, this lamp would be illuminated. So you could see visually how efficient your program was at exploiting the computing resource available.
+Thinking about this, I have been thinking about the [Connection Machine](http://en.wikipedia.org/wiki/Connection_Machine). I've never really used a Connection Machine, but there was once one in a lab which also contained a Xerox Dandelion I was working on, so I know a little bit about them.
A Connection Machine was a massively parallel computer having a very large number - up to 65,536 - of very simple processors (each processor had a register width of one bit). Each processor node had a single LED lamp; when in use, actively computing something, this lamp would be illuminated. So you could see visually how efficient your program was at exploiting the computing resource available.

-[Incidentally while reading up on the Connection Machine I came across this delightful essay on Richard Feynman's involvement in the project - it's of no relevance to my argument here, but nevertheless I commend it to you]
+\[Incidentally while reading up on the Connection Machine I came across this [delightful essay](http://longnow.org/essays/richard-feynman-connection-machine/) on Richard Feynman's involvement in the project - it's of no relevance to my argument here, but nevertheless I commend it to you\]

The machine was programmed in a pure-functional variant of Common Lisp. Unfortunately, I don't know the details of how this worked. As I understand it, each processor had its own local memory but there was also a pool of other memory known as 'main RAM'; I'm guessing that each processor's memory was preloaded with a memory image of the complete program to run, so that every processor had local access to all functions; but I don't know this to be true. I don't know how access to main memory was managed, and in particular how contention on access to main memory was managed.

What I do know from reading is that each processor was connected to twenty other processors in a fixed topology known as a hypercube. What I remember from my own observation was that a computation would start with just one or a small number of nodes lit, and flash across the machine as deeply recursive functions exploded from node to node. What I surmise from what I saw is that passing a computation to an unoccupied adjacent node was extremely cheap.
-A possibly related machine from the same period which may also be worth studying but about which I know less was the Meiko Computing Surface. The Computing Surface was based on the Transputer T4 processor, a 32 bit processor designed specifically for parallel processing. Each transputer node had its own local store, and very high speed serial links to its four nearest neighbours. As far as I know there was no shared store. The Computing Surface was designed to be programmed in a special purpose language, Occam. Although I know that Edinburgh University had at one time a Computing Surface with a significant number of nodes, I don't know how many 'a significant number' is. It may have been hundreds of nodes but I'm fairly sure it wasn't thousands. However, each node was of course significantly more powerful than the Connection Machine's one bit nodes.
+A possibly related machine from the same period which may also be worth studying but about which I know less was the [Meiko Computing Surface](http://www.new-npac.org/projects/cdroms/cewes-1999-06-vol1/nhse/hpccsurvey/orgs/meiko/meiko.html). The Computing Surface was based on the [Transputer T4](http://en.wikipedia.org/wiki/Transputer#T4:_32-bit) processor, a 32 bit processor designed specifically for parallel processing. Each transputer node had its own local store, and very high speed serial links to its four nearest neighbours. As far as I know there was no shared store. The Computing Surface was designed to be programmed in a special purpose language, [Occam](http://en.wikipedia.org/wiki/Occam_(programming_language)). Although I know that Edinburgh University had at one time a Computing Surface with a significant number of nodes, I don't know how many 'a significant number' is. It may have been hundreds of nodes but I'm fairly sure it wasn't thousands. However, each node was of course significantly more powerful than the Connection Machine's one bit nodes.
## A caveat

@@ -133,9 +135,9 @@ This comes down to topology. I'm not at all clear how you even manage to have tw

## Looking for the future in the past, part two

-In talking about the Connection Machine which lurked in the basement of Logica's central London offices, I mentioned that it lurked in a lab where one of the Xerox 1108 Dandelions I was employed to work on was also located. The Dandelion was an interesting machine in itself. In typical computers - typical modern computers, but also typical computers of thirty years ago - the microcode has virtually the status of hardware. While it may technically be software, it is encoded immutably into the chip when the chip is made, and can never be changed.
+In talking about the Connection Machine which lurked in the basement of Logica's central London offices, I mentioned that it lurked in a lab where one of the [Xerox 1108 Dandelions](http://en.wikipedia.org/wiki/Interlisp) I was employed to work on was also located. The Dandelion was an interesting machine in itself. In typical computers - typical modern computers, but also typical computers of thirty years ago - the microcode has virtually the status of hardware. While it may technically be software, it is encoded immutably into the chip when the chip is made, and can never be changed.

-The Dandelion and its related machines weren't like that. Physically, the Dandelion was identical to the Star workstations which Xerox then sold for very high end word processing. But it ran different microcode. You could load the microcode; you could even, if you were very daring, write your own microcode. In its Interlisp guise, it had all the core Lisp functions as single opcodes. It had object oriented message passing - with full multiple inheritance and dynamic selector-method resolution - as a single opcode. But it also had another very interesting instruction: BITBLT, or 'Bit Block Transfer'.
+The Dandelion and its related machines weren't like that.
Physically, the Dandelion was identical to the Star workstations which Xerox then sold for very high end word processing. But it ran different microcode. You could load the microcode; you could even, if you were very daring, write your own microcode. In its Interlisp guise, it had all the core Lisp functions as single opcodes. It had object oriented message passing - with full multiple inheritance and dynamic selector-method resolution - as a single opcode. But it also had another very interesting instruction: [BITBLT](http://en.wikipedia.org/wiki/Bit_blit), or 'Bit Block Transfer'.

This opcode derived from yet another set, that developed for an earlier version of the same processor on which Smalltalk was first implemented. It copied an arbitrary sized block of bits from one location in memory to another location in memory, without having to do any tedious and time consuming messing about with incrementing counters (yes, of course counters were being incremented underneath, but they were in registers only accessible to the microcode, and which ran, I think, significantly faster than the 'main' registers). This highly optimised block transfer routine allowed a rich and responsive WIMP interface on a large bitmapped display on what weren't, underneath it all, actually terribly powerful machines.

@@ -173,4 +175,4 @@ Obviously, there has to be some way for processor nodes to signal to the memory

But... I do think that somewhere in these ideas there are features which would enable us to build higher performance computers which we could actually program, with existing technology. I wouldn't be surprised to see systems fairly like what I'm describing here becoming commonplace within twenty years.

-[Note to self: when I come to rework this essay it would be good to reference Steele and Sussman, Design of LISP-based Processors.]
\ No newline at end of file
+\[Note to self: when I come to rework this essay it would be good to reference [Steele and Sussman, Design of LISP-based Processors](http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.pdf).\]
\ No newline at end of file