For Azul Systems, certainly, the name of the game is throughput: we appear to be generously over-provisioned with bandwidth. We can sustain 30G/sec allocation on 600G heaps with max pause times on the order of tens of milliseconds. Each of our 864 CPUs can sustain 2 cache-missing memory ops (plus a bunch of prefetches); a busy box will see 2300+ outstanding memory references at any time. We have a lite, microkernel-style OS; we can easily handle 100K runnable threads (not just blocked ones). Our JVM and GC scale easily to the whole box. In short: the bottleneck is NOT the platform. We need our users to be able to write scalable concurrent code.
In short, users don't write "TM-friendly" code. Neither do library writers. Many times a small rewrite to remove the conflict makes the HTM useful. But this breaks the "dusty deck" promise - people just want their old code to run faster. The hard part here is getting customers to accept that a code rewrite is needed. Once they are over that mental hump, once a code rewrite is "on the table", the customers go whole-hog. Why make the code xTM-friendly when they can make it lock-friendly as well, and have it run fine on all gear (not just HTM-enabled gear)? Also, locks have well-understood performance characteristics, unlike TMs, which generally rely on a complex and not-well-understood runtime portion (and indeed all the STMs out there have wildly varying "sweet spots", such that code which performs well on one STM might be unusably slow on another).
Really, what the customers want to know is: "which locks do I need to 'crack' to get performance?" Once they have that answer, they are ready and willing to write fine-grained locking code. And nearly always the fine-grained locking is a very simple step up in complexity over what they had before. It's not the case that they need to write some uber-hard-to-maintain code to get performance. Instead, it's the case that they have no clue which locks need to be "cracked" to get a speedup, and once that's pointed out the fixes are generally straightforward (e.g., replacing a synchronized HashMap with ConcurrentHashMap, striping a lock, reducing hold times (generally via caching), switching to AtomicXXX::increment, etc.).
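To make a couple of the fixes listed above concrete, here is a minimal sketch of "cracking" one hot lock: a per-key counter that would originally have been a plain HashMap guarded by a single synchronized block. The class name `HitCounter` and its methods are hypothetical, invented for illustration; only `ConcurrentHashMap` and `AtomicLong` are real JDK classes. It is one plausible shape for such a fix, not a description of any particular customer's code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class; the name and methods are illustrative only.
final class HitCounter {
    // Before (not shown): a HashMap<String, Long> with every read and write
    // inside synchronized(this), so all threads serialize on one lock.
    // After: ConcurrentHashMap lets threads touching different keys proceed
    // in parallel, and AtomicLong makes each per-key increment lock-free.
    private final ConcurrentHashMap<String, AtomicLong> hits = new ConcurrentHashMap<>();

    // Atomically creates the counter on first use, then increments it.
    long record(String key) {
        return hits.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    }

    // Returns the current count, or 0 if the key has never been recorded.
    long get(String key) {
        AtomicLong n = hits.get(key);
        return n == null ? 0L : n.get();
    }

    public static void main(String[] args) throws InterruptedException {
        HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    counter.record("page");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        // 8 threads x 10,000 increments each, with no lost updates.
        System.out.println(counter.get("page")); // prints 80000
    }
}
```

The calling code barely changes: callers still just record a hit and read a count, which is the sense in which the fine-grained version is only a small step up in complexity from the coarse-locked one.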
For those of you who don't know, Azul Systems is the company that made custom silicon to execute Java in hardware, and currently sells massively parallel Java machines with 300-800 cores.