This is an automated archive made by the Lemmit Bot.
The original was posted on /r/maliciouscompliance by /u/subwaysmoothie on 2025-03-17 22:21:31+00:00.
Recently stumbled across this subreddit and remembered a story I thought you guys might want to hear. Unfortunately, my industry is kind of specific, so I will have to change some details and make some things vague to remain anonymous - but the core of the story is all there.
TL;DR: Design engineering makes bonehead decision to force me to remove a critical half of the testing procedure for one of the products we build. That decision has wide-reaching effects and causes a different product to experience a 100% failure rate, which forces design engineering into firefighting mode for months trying to determine the cause.
The compliance:
Years ago I worked as a junior manufacturing engineer for a certain company building certain, relatively complex products, and one of the stations I was responsible for was the first test station. We got the core mechanism and the electronic assembly of the products right off the production line, and I performed white-box testing to ensure everything looked right and worked properly before sending it to a different station for assembly and black-box testing.
(I’m trying to avoid talking about the specific testing methodologies used, hence “white box” and “black box” because that’s the best way I can describe it without being more specific. White box refers to testing it with direct access to all the internal components, so you can measure all the different parts, verify that individual parts work correctly, etc. Black box refers to testing it after the entire thing has already been put into an enclosure and you no longer have access to the internals, so really, you’re just verifying that the thing works and does everything it’s designed to do.)
The product I’ll talk about today - let’s give it a codename of “azure” - belongs to the greater family of “blue” products, which were all very similar but may have had different configurations, slightly different parts, etc. “Azure” was a new product, and in the first few production runs, we saw failure rates of 20%+ at black-box testing (the station after mine). When the engineers dug into it, they found that one of the relays on the electronic control board was fused shut on those 20% of units that failed. For whatever reason, they turned to me (they love blaming my station) and said that my station was causing it.
For a bit more background, my white-box testing station had powered-off and powered-on testing. In powered-on testing, we turned newly built units on for the first time, which also meant that if anything was wrong with a unit that would make it fail when powered on, it would fail at my station. That’s why I was always careful to make sure that the initial powered-off testing was thorough enough to cover as many of my bases as possible, so that when I powered a unit on for the first time, it wouldn’t blow up.
The design engineering team apparently didn’t believe that. They were convinced that when I powered on “azure” units for the first time at my station, the initial power surge was sending a big current spike through the relays, which was causing them to fail. Their proposed solution was to simply eliminate powered-on testing during white-box testing.
This was a terrible idea, so I argued against it:
- The power supply in my test jig is set to match, as closely as possible, the actual power supplies we send out with these units into the field. That means if a power surge were causing the failures, it would be a design issue, and it would still occur with units out in the field even if I didn’t power them on during white-box.
- The design engineering team said that, no, it must be an issue with my tester, because they didn’t believe there could be anything wrong with their design.
- I pulled up the datasheet for that relay and showed that it was physically impossible for that relay to fuse, in the circuit configuration it was placed in, with the amount of voltage my test jig could supply.
- Apparently, the design engineers ignored that entire page of my report - they didn’t think a junior manufacturing engineer’s analysis was even worth looking at, and trusted their own assumptions more.
- I had yet to see a single unit that was proven to have a good relay before my station and a fused one after my station, which would have been the concrete evidence I needed to believe that my station was fusing the relays.
- Design engineering said, “we don’t need concrete evidence, we’re sure this is what’s happening”.
- If we disable powered-on testing, we’ll lose a lot of test coverage.
- The design engineers just went, “whatever, we’ll catch any issues at black-box”. (This was a bad idea because our black-box tester, while it could tell us that the unit wasn’t working, could not tell us what part in the assembly was causing it not to work. Units that failed white-box had a >70% successful repair rate; units that failed black-box had <10%, at least without going back through white-box.)
- Finally, I argued that we had other products in the “blue” family that went through the exact same test jig, using the exact same relay in the exact same circuit configuration, and hadn’t seen any issues before.
That last argument I made was a big mistake. What I said was, “We have other ‘blue’ family products going through that same jig with no problems.” What the design engineers apparently heard was, “We have other ‘blue’ family products going through that same jig, and they’re all killing relays, and subwaysmoothie hasn’t noticed yet because he’s incompetent”. They came back at me twice as hard.
I argued this as much as I could for two weeks, before the order finally came down from my direct manager: as per directives from the design engineering team, all powered-on testing was to be disabled from the test jig for all “blue” family products. Not just “azure”.
(For what it’s worth, my manager was on my side for most of this, and only gave me the order to avoid any unnecessary trouble when it looked like the company leadership was going to get involved.)
Well, fine. I went ahead and disabled powered-on testing. As I predicted, all of the “blue” family products - “cyan”, “turquoise”, “cerulean”, etc. - started seeing 3x the failure rate at black-box testing, and we were now stuck with a bunch of units that we didn’t know how to fix. But that’s beside the point - how about “azure”?
Same 20% failure rate. Nothing changed. As I called it, my station wasn’t killing the relays.
So the design engineers went and took another three months figuring out the actual cause of all of the relay failures. As it turns out, it was some flaw in the way the black-box test was being run, combined with some other part on the assembly that was under-spec (I dunno specifics, I wasn’t part of this conversation anymore). They spent a bunch of money and got it fixed, and never followed up with me saying “hey, looks like you were right, it wasn’t caused by powered-on testing at white-box” - which, crucially, also means that I never got a directive to re-enable powered-on testing.
So we ran like that for a few months, me licking my lips all the time, because I knew what was coming and it was delicious.
The fallout:
See, there was another product in the “blue” family that I’ll call “navy”. “Navy” was a bit of an oddball, because the client had some requirement that demanded microcontroller B be installed, as opposed to microcontroller A on all of the other “blue” family products. That was the only difference, which meant I used the same test jig for it.
We sourced microcontroller A from a vendor who also pre-loaded it with the firmware we needed flashed on it. Years ago, we had also apparently done the same with microcontroller B. But the vendor for B that could preprogram them for us had shut down, and we could not find a single other vendor who could preload the firmware for us. That’s when we turned to internal solutions. Someone found out that the test jig at my station (then managed by someone else) had direct access to the microcontroller’s programming interface, so they developed a way to flash the firmware from my test jig. That meant we could now buy blank B units directly from the manufacturer, then flash them with the firmware ourselves. This was a great solution because not only were blank B units cheaper, but flashing them during powered-on testing also wouldn’t add an extra step to our production process, since it would just be a part of the white-box testing step.
Of course, flashing the firmware required the unit to be powered on.
All of this happened years before I had joined the company, and before most of the current crop of design engineers were involved with this project. This was, in fact, documented, but all of these products had gone through hundreds of ECNs (basically formal engineering notifications that “something” has changed with the product) and nobody was reading through hundreds of them to familiarize themselves with the entire history of the product.
When they demanded I disable powered-on testing on all “blue” family products, this microcontroller programming step for “navy” was also disabled, meaning any “navy” units we built would make their way over to black box testing with a blank microcontroller. I knew, but that was only because I knew exactly what my test did. I also knew that this was documented in an ECN from 8 ...
Content cut off. Read original on https://old.reddit.com/r/MaliciousCompliance/comments/1jdpm2r/you_want_me_to_disable_half_of_my_entire_testing/