Ben Cantrick (mackys) wrote,
Ben Cantrick

  • Music:

The how and why of NVidia's defective GPUs.

The defective parts appear to make up the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective, it is simply the failure rates of each line, with field reports on specific parts hitting up to 40 per cent early life failures. This is obviously not acceptable.

The part [of a chip] you don't see are the bumps, and they are the most critical part. This type of packaging is called flip-chip because the connectors between the die and the substrate are put on the bottom of the die, then the die is flipped over onto the substrate. The connectors are called bumps, and they are literally little balls of solder. A typical chip that is a little more than a centimetre on a side might have over 1000 bumps on it, so spacing is incredibly small and tolerances amazingly tight.

The bumps act like pins on a normal chip, they carry signals, power and ground to and from the die. They also are the primary attachment mechanism of the die to the substrate. The precision needed to put these things together should not be underestimated. Some bumps may pull a lot of Amps, others may pull very few, and this again changes over time and use. The bumps also have a limited current capacity each, too much and they melt or burn out, so there are far more than are strictly needed to supply the chip with power.

The Nvidia defective chips use a type of bump called high lead, and are now transitioning to a type called eutectic. Eutectic materials have two important properties, they have a low melting point, and all components crystallize at the same temperature. This means they are easier to work with, and form a good solid bond. Eutectic bumps may have lead in them, or they may not, some are gold/tin, other are lead based, it depends on what characteristics you want, and how much you want to pay. It is a property, not a formula.

Most if not all substrates use eutectic pads to attach the bumps to as well. If you use a eutectic pad with a eutectic bump, you get a much better connection than if you use a high-lead bump with a eutectic pad. This is reflected in much higher yields, lower assembly costs, and a physically stronger connection as well. At this time, we have no good explanation as to why Nvidia chose to go the high-lead bump on eutectic pad route.

Bumps overall are a multi-dimensional trade-off between cost, assembly yield, current capacity and mechanical resilience among other things. To call it a complex mess is being overly kind, package engineering is not for the faint of heart.

The problem is that the die gets hot, and heats the substrate secondarily. The silicon on the die has one rate of thermal expansion, the substrate has another, basically they get bigger at different rates. To complicate things further, remember the uneven and changing heating bit above? Parts of the die heat up and expand differently from other parts of the die. This changes quite quickly while things are in use.

Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending, and then it cools and bends back. Eventually this thermal cycling kills chips.

If you recall, high-lead bumps are stiffer than eutectic and more prone to stress fractures. The high-lead-to-eutectic substrate bond is also weaker than a eutectic-to-eutectic bond. What is happening to Nvidia is that the substrate to bump joint is cracking, and the chips die. High lead bumps are a poor choice to use in this application.
Tags: reddit
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.
  • 1 comment