Feature flags are a useful tool for running A/B experiments and rolling out changes in a controlled manner. To ensure that they don't frustrate users when a change causes a crash or degrades the user experience, Lyft created Safe Mode, a mechanism designed specifically to prevent crash loops on startup.
When enabling a feature flag or changing another remote configuration introduced a crash on startup, Lyft usually had to ship a hotfix to get users out of the resulting infinite crash loop, because it had no way to push configuration updates to an app that crashed so early in its life cycle.
The key to Safe Mode's implementation is Bugsnag, an app stability monitoring platform with dedicated support for feature flags and A/B experiments. Specifically, Bugsnag lets developers declare the list of feature flags in use, and that list is sent along with each crash report. Bugsnag can also identify crashes on launch, providing an API to mark when the app has finished launching.
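To make the two Bugsnag capabilities concrete, here is a minimal Python model of a crash report that carries the active feature flags and a "still launching" marker. It is a sketch under stated assumptions, not Bugsnag's data model or SDK: `CrashReport`, `Reporter`, and all method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrashReport:
    """Hypothetical crash report (not Bugsnag's real data model)."""
    error_class: str
    feature_flags: dict[str, str]  # flag name -> variant active at crash time
    is_launching: bool             # True if the crash happened before launch completed


class Reporter:
    """Hypothetical client: flags are declared as they are used and the
    current set is copied into every crash report."""

    def __init__(self) -> None:
        self._flags: dict[str, str] = {}
        self._launching = True
        self.sent: list[CrashReport] = []

    def add_feature_flag(self, name: str, variant: str) -> None:
        self._flags[name] = variant

    def mark_launch_completed(self) -> None:
        # After this call, crashes are no longer counted as launch crashes.
        self._launching = False

    def notify_crash(self, error_class: str) -> CrashReport:
        # Snapshot the flags so later changes don't alter an already-sent report.
        report = CrashReport(error_class, dict(self._flags), self._launching)
        self.sent.append(report)
        return report
```

Copying the flag dictionary into each report matters: the report must describe the flags active at crash time, not whatever the app declares afterwards.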
Lyft initializes Bugsnag very early in the app's lifecycle, right in its main function, and configures it so that an app launch is considered complete when applicationDidFinishLaunching returns on iOS and once the main screen is displayed on Android.
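Detecting a crash "before launch completed" requires state that survives the crash itself. The sketch below shows one way to do that with a persisted marker; the `LaunchTracker` class and its storage keys are hypothetical, and a plain `dict` stands in for on-disk storage. Bugsnag provides this detection itself, so this only illustrates the idea.

```python
class LaunchTracker:
    """Hypothetical sketch: persists a 'launch in progress' marker so the next
    session can tell whether the previous one crashed before launch completed."""

    KEY_LAUNCHING = "launch_in_progress"
    KEY_CRASHED_DURING_LAUNCH = "crashed_during_launch"

    def __init__(self, storage: dict) -> None:
        self._storage = storage  # stands in for storage that survives a crash
        # Read what the previous session left behind before resetting it.
        self.previous_crashed_during_launch = bool(
            storage.get(self.KEY_CRASHED_DURING_LAUNCH, False))
        storage[self.KEY_CRASHED_DURING_LAUNCH] = False
        storage[self.KEY_LAUNCHING] = True

    def on_crash(self) -> None:
        # Called from the crash handler: remember whether we were still launching.
        self._storage[self.KEY_CRASHED_DURING_LAUNCH] = self._storage[self.KEY_LAUNCHING]

    def mark_launch_completed(self) -> None:
        # iOS: when applicationDidFinishLaunching returns.
        # Android: once the main screen is displayed.
        self._storage[self.KEY_LAUNCHING] = False
```

A crash that happens after `mark_launch_completed` is still a crash, but it is not a launch crash. Only launch crashes can trap users in a loop, so only they should trigger safe mode.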
Safe mode also starts in main, right after Bugsnag is initialized, and asks Bugsnag whether the previous session crashed before the app fully launched. If it did, safe mode logs a safe_mode_engaged analytics event and enters a shadow state, in which it first detects which feature flags were used in the previous session and then locks their configurations to their local defaults.
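The startup check can be sketched as a single function. This is a hypothetical illustration: the function name, its parameters, and the assumption that every flag ships with a local default are not from Lyft's description. Only the `safe_mode_engaged` event name comes from the text.

```python
from typing import Callable


def engage_safe_mode_if_needed(
    previous_crashed_during_launch: bool,
    flags_used_last_session: set[str],
    local_defaults: dict[str, bool],
    log_event: Callable[[str, dict], None],
) -> dict[str, bool]:
    """Hypothetical startup check. Returns the flag values to lock for this
    session, or an empty dict when the previous launch did not crash."""
    if not previous_crashed_during_launch:
        return {}
    log_event("safe_mode_engaged", {"flags": sorted(flags_used_last_session)})
    # Assumes every flag ships with a local default; lock each flag the
    # crashing session touched to that default.
    return {name: local_defaults[name] for name in flags_used_last_session}
```

Locking every flag the crashing session touched is deliberately coarse. Without knowing which flag caused the crash, all of them are suspects.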
This puts the potentially problematic features and code paths in their default "safe" state and lets users keep using the app as they normally would, albeit with some functionality disabled.
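On the read side, locking a flag means its lookup ignores the remote value. Here is a minimal sketch of a flag store that respects such locks. `FeatureFlagStore` and its methods are hypothetical names for illustration.

```python
class FeatureFlagStore:
    """Hypothetical flag store: remote values win unless safe mode locked the flag."""

    def __init__(self, remote: dict[str, bool], defaults: dict[str, bool]) -> None:
        self._remote = remote      # values fetched from the remote config service
        self._defaults = defaults  # safe values shipped inside the app binary
        self._locked: set[str] = set()

    def lock_to_default(self, names: set[str]) -> None:
        self._locked |= names

    def is_enabled(self, name: str) -> bool:
        if name in self._locked:
            return self._defaults[name]  # safe mode: ignore the remote value
        return self._remote.get(name, self._defaults[name])
```

Because the rest of the app only calls `is_enabled`, the locks take effect without any feature code knowing safe mode exists.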
Once the app launches successfully after a crash, it refreshes its feature flags so they get another chance on the next launch. If the crashing feature has not been fixed by then, the app will crash again and safe mode will be engaged once more.
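The lock/unlock cycle across sessions can be modeled with a single persisted entry. This sketch is hypothetical: `SafeModeSession` and the storage key are made up, and a `dict` stands in for persistent storage. Locks last for one session, and a successful launch clears them.

```python
class SafeModeSession:
    """Hypothetical model of one app session in the safe mode cycle."""

    KEY = "flags_from_crashed_launch"

    def __init__(self, disk: dict) -> None:
        self._disk = disk
        # Flags used by a previous session that crashed during launch are
        # locked for this session only.
        self.locked: set[str] = set(disk.get(self.KEY, set()))
        self.flags_used: set[str] = set()

    @property
    def engaged(self) -> bool:
        return bool(self.locked)

    def use_flag(self, name: str) -> None:
        self.flags_used.add(name)

    def crash_during_launch(self) -> None:
        # Persist the suspects so the next session can lock them.
        self._disk[self.KEY] = set(self.flags_used)

    def launch_completed(self) -> None:
        # Refresh: give the flags another chance on the next launch.
        self._disk.pop(self.KEY, None)
```

If the bug behind a flag is still live, the session after a successful launch crashes again and the next one re-engages safe mode. Users alternate between a crash and a working launch instead of looping forever.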
A Grafana dashboard is fed with all crash-related events, so engineers can easily spot a spike and act quickly. Lyft engineers also made sure that safe mode itself would not become a source of instability: before rolling it out, they added a dedicated feature flag that intentionally crashed the app on launch and used it to test the whole approach in internal alpha builds of the app.
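The self-test can be simulated end to end. In this hypothetical sketch, `debug_crash_on_launch` stands in for the intentionally crashing flag, and an exception plays the role of the process dying. Every flag's local default is assumed to be "off". None of these names come from Lyft.

```python
CRASH_FLAG = "debug_crash_on_launch"  # hypothetical internal-only test flag


class SimulatedCrash(Exception):
    """Stands in for a real crash that would terminate the process."""


def launch_app(remote_flags: dict[str, bool], disk: dict) -> str:
    """Simulates one launch and returns "crashed", "safe_mode" or "normal"."""
    # Flags recorded by a previous launch crash are locked for this launch only;
    # popping them means a successful launch gives them another chance next time.
    locked: set[str] = disk.pop("locked", set())
    used: set[str] = set()

    def flag(name: str) -> bool:
        used.add(name)
        if name in locked:
            return False  # assumed local default: every flag is "off"
        return remote_flags.get(name, False)

    try:
        if flag(CRASH_FLAG):
            raise SimulatedCrash("intentional crash on launch")
    except SimulatedCrash:
        # A real crash handler would persist the flags used so far before dying.
        disk["locked"] = used
        return "crashed"
    return "safe_mode" if locked else "normal"


disk: dict = {}
outcomes = [launch_app({CRASH_FLAG: True}, disk) for _ in range(3)]
print(outcomes)  # ['crashed', 'safe_mode', 'crashed']
```

The expected pattern is crash, then a safe-mode launch that succeeds, then another crash once the flag is refreshed. Any other sequence would point to a problem in safe mode itself.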
According to Lyft, safe mode has proven effective at reducing the pain points associated with feature flags. Lyft plans to extend it in three ways: to handle failures that usually don't cause an app crash and are harder to detect, to pinpoint which specific feature flag caused a crash instead of disabling all of them, and to automatically disable feature flags that cause too many crashes.
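Lyft has not described how automatic disabling would work. One simple possibility, sketched here purely as an assumption, is to compute each flag's launch-crash rate across sessions and disable flags above a threshold once they have enough exposure. `Session`, `flags_to_disable`, and both thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    flags: frozenset[str]    # flags enabled during the session
    crashed_on_launch: bool


def flags_to_disable(sessions: list[Session], max_crash_rate: float,
                     min_sessions: int) -> set[str]:
    """Hypothetical heuristic: disable flags whose launch-crash rate exceeds
    max_crash_rate, ignoring flags seen in fewer than min_sessions sessions."""
    exposed: Counter = Counter()
    crashed: Counter = Counter()
    for s in sessions:
        for f in s.flags:
            exposed[f] += 1
            if s.crashed_on_launch:
                crashed[f] += 1
    return {f for f, n in exposed.items()
            if n >= min_sessions and crashed[f] / n > max_crash_rate}
```

Using a rate rather than a raw crash count avoids penalizing a flag simply because it is enabled everywhere. Co-occurrence is still not causation, though: a harmless flag that is always enabled alongside a faulty one shares its crash rate, which is why identifying the specific culprit remains a separate problem.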