this post was submitted on 14 Dec 2023
198 points (98.1% liked)

Asklemmy

[–] dan@upvote.au 16 points 11 months ago* (last edited 11 months ago) (3 children)

I broke the home page of a big tech (FAANG) company.

I added a call to an API created by another team. I did an initial test with 2% of production traffic + 50% of employee traffic, and it worked fine. After a day or two, I rolled out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I put in the code and turned it off. They noticed the issue quicker than I did.

What I didn't realise was that only some of the methods of this class had Memcache caching. The method I was calling did not. It turns out it was running a database query on a DB with a single shard and only 4 replicas, which wasn't designed for production traffic. As soon as my code rolled out to 100% of users, the DBs immediately fell over from tens of thousands of simultaneous connections.
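To make the caching gap concrete, here is a minimal sketch of a read-through cache. It is not the code from the incident: `ReadThroughCache` and `loadFromDb` are hypothetical names, and an in-memory `Map` stands in for Memcache. It only shows why a cached method and an uncached one look identical at 2% of traffic and very different at 100%.

```javascript
// Hypothetical sketch: an in-memory Map stands in for Memcache.
class ReadThroughCache {
  constructor(loader, ttlMs) {
    this.loader = loader;     // function that performs the expensive lookup
    this.ttlMs = ttlMs;       // how long a cached value stays fresh
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  // `now` is injectable so the behaviour can be checked without waiting.
  get(key, now = Date.now()) {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > now) {
      return hit.value; // served from cache, no database call
    }
    const value = this.loader(key);
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    return value;
  }
}

// Count how often the stand-in "database" is actually queried.
let dbQueries = 0;
function loadFromDb(key) {
  dbQueries += 1;
  return `row-for-${key}`;
}

const cache = new ReadThroughCache(loadFromDb, 60000);
for (let i = 0; i < 1000; i++) {
  cache.get("home_page_module", 0);
}
console.log(dbQueries); // 1: only the first request reached the database
```

Without the cache, the same loop would call `loadFromDb` 1000 times, which is roughly what happened to the small replica set once every home page request hit it.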

Always use feature flags for risky work! It would have been broken for a lot longer if I didn't add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.

[–] jjjalljs@ttrpg.network 14 points 11 months ago

Always use feature flags for risky work! It would have been broken for a lot longer if I didn't add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins

This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.

[–] Vendetta9076@sh.itjust.works 5 points 11 months ago (1 children)

I work on a SOC team and we're really trying to hammer the usage of feature flags into our devs.

[–] WhyAUsername_1@lemmy.world 4 points 11 months ago (1 children)
[–] dan@upvote.au 7 points 11 months ago* (last edited 11 months ago) (1 children)

Feature flags are just checks that let you enable or disable code paths at runtime. For example, say you're rewriting the profile page for your app. Instead of just replacing the old code with the new code, you'd do something like:

if (featureIsEnabled('profile_v2')) {
  // new code
} else {
  // old code
}

Then you'd have some UI to enable or disable the flag. If anything goes wrong with the new page after launch, flip the flag and it'll switch back to the old version without having to modify the code or redeploy the site.
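As a rough sketch of what could sit behind `featureIsEnabled`, assuming a simple in-memory store: `FlagStore` and `renderProfile` are made-up names for illustration, and a real system would keep the flags in a database or config service so that every server sees the same value.

```javascript
// Hypothetical in-memory flag store (illustration only).
class FlagStore {
  constructor() {
    this.flags = new Map();
  }

  // This is what the admin UI would call when someone flips a flag.
  set(name, enabled) {
    this.flags.set(name, Boolean(enabled));
  }

  // Unknown flags count as off, so a typo in a flag name fails safe.
  isEnabled(name) {
    return this.flags.get(name) === true;
  }
}

const store = new FlagStore();
const featureIsEnabled = (name) => store.isEnabled(name);

function renderProfile(user) {
  if (featureIsEnabled("profile_v2")) {
    return `v2 profile for ${user}`; // new code
  }
  return `v1 profile for ${user}`;   // old code
}

console.log(renderProfile("dan")); // v1 profile for dan (flag never turned on)
store.set("profile_v2", true);     // launch
console.log(renderProfile("dan")); // v2 profile for dan
store.set("profile_v2", false);    // something broke: flip it back
console.log(renderProfile("dan")); // v1 profile for dan, with no redeploy
```

The flag is read on every call, not once at startup, which is what lets a flip take effect immediately.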

Fancier gating systems let you do things like roll out to a subset of users (e.g. a percentage of all users, 50% of a particular country, or 20% of people that use the site in English) and also let you create a control group in order to compare metrics between users in the test group and users in the control group.

Larger companies all have custom in-house systems for this, but I'm sure there are some libraries that make it easy too.
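One common way to do percentage rollouts is to hash the user ID into a stable bucket from 0 to 99. The sketch below assumes that approach and is not any particular company's system: `assignGroup`, the flag fields (`testPercent`, `controlPercent`, `country`) and the FNV-1a-style hash are all illustrative choices.

```javascript
// FNV-1a-style 32-bit hash over the string's char codes (no dependencies).
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash;
}

// Returns "test", "control" or "off" for a user. Hypothetical flag shape:
// { name, testPercent, controlPercent, country? }
function assignGroup(flag, user) {
  if (flag.country && user.country !== flag.country) {
    return "off"; // targeting rule: user is outside the targeted country
  }
  // Including the flag name means different flags get different buckets,
  // so the same users are not always the first to get every new feature.
  const bucket = fnv1a(`${flag.name}:${user.id}`) % 100;
  if (bucket < flag.testPercent) return "test";
  if (bucket < flag.testPercent + flag.controlPercent) return "control";
  return "off";
}

const flag = { name: "profile_v2", testPercent: 20, controlPercent: 20, country: "AU" };
const counts = { test: 0, control: 0, off: 0 };
for (let id = 0; id < 10000; id++) {
  counts[assignGroup(flag, { id, country: "AU" })] += 1;
}
console.log(counts); // roughly a 20/20/60 split; exact numbers depend on the hash
```

Because the bucket comes from a hash and not from `Math.random()`, a given user stays in the same group on every request, and raising `testPercent` only adds users to the test group instead of reshuffling it.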

At my workplace, we don't have any Git feature branches. Instead, all changes are merged directly to trunk/master, and new features are all gated using feature flags.

[–] WhyAUsername_1@lemmy.world 3 points 11 months ago (1 children)
[–] Vendetta9076@sh.itjust.works 2 points 11 months ago (1 children)

Everything Dan said and more. They're sometimes also called canaries, although that's not quite the same thing. There have been a ton of times where services have been down for hours instead of minutes because a dev never built in a feature flag.

[–] Hadriscus@lemm.ee 2 points 11 months ago (1 children)

Canaries, as in mine work?

[–] Vendetta9076@sh.itjust.works 3 points 11 months ago

That's where the term derives from, yes.

[–] cashews_best_nut@lemmy.world 2 points 11 months ago

What language? PHP, python?