Replies: 1 comment
-
A good article. I felt the same during development: time-consuming loaders and laggy animations are irritating to users.
Mobile app reliability is more nuanced than many of us realise. While we obsess over crash-free rates and ANRs, users face many other kinds of reliability issues that traditional metrics often miss.
Think about reliability from a user's perspective. When does an app feel unreliable? It's not just when it crashes. It's when:
- screens freeze or respond slowly to taps
- content takes seconds to load, with no error in sight
- animations stutter and frames drop during scrolling
- network requests fail silently, leaving stale or empty screens
Each of these scenarios represents a different type of reliability failure. Yet traditional monitoring often misses these "soft failures" entirely.
Moving Beyond Binary Metrics
Traditional crash monitoring is too simplistic. It only tells you if your app is running or crashed. But real app reliability is more complex.
Consider this: Your app might be "running" but taking 10 seconds to load content. No crash reported, but users are frustrated.
And that is just one example. Your app might have twenty such issues, a few of which are listed above.
Here's the real problem: These issues rarely affect the same users. While each problem might impact only 5% of your users, different groups experience different issues. The result? A much larger percentage of your user base is having a poor experience.
It's death by a thousand paper cuts – each issue seems small in isolation, but together they create a significant reliability problem.
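The "thousand paper cuts" effect is easy to quantify. The sketch below uses hypothetical numbers (five independent issues, each hitting a random 5% of a 10,000-user base) to show how small per-issue impact compounds into a much larger share of users having a bad experience:

```python
import random

random.seed(0)
USERS = 10_000

# Hypothetical: five independent issues, each affecting ~5% of users at random.
issues = [{u for u in range(USERS) if random.random() < 0.05} for _ in range(5)]

# Union of all affected users across every issue.
affected = set().union(*issues)
share = len(affected) / USERS
print(f"Users hit by at least one issue: {share:.1%}")
```

With independent issues the expected union is 1 - 0.95^5, roughly 23% of users, more than four times the impact of any single issue.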
This is where Service Level Indicators (SLIs) come in. Rather than tracking simple up/down states, effective SLIs measure the success rates of specific user interactions: the share of screen loads that complete within a latency budget, taps that get a timely response, scrolls that render without dropped frames.
The key insight? Track what users actually experience.
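An interaction-level SLI can be computed from raw events. In this sketch the event shape, the `screen_load` interaction, and the 2-second latency budget are all illustrative assumptions; the point is that a slow-but-successful interaction counts against the SLI just like a hard failure:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    name: str        # e.g. "screen_load" (hypothetical interaction name)
    duration_ms: int
    succeeded: bool

# Hypothetical events captured from clients.
events = [
    Interaction("screen_load", 420, True),
    Interaction("screen_load", 3100, True),   # slow: a "soft failure"
    Interaction("screen_load", 95, True),
    Interaction("screen_load", 500, False),   # hard failure
]

LATENCY_BUDGET_MS = 2000  # illustrative budget, not a prescribed value

def sli(events, budget_ms):
    """Fraction of interactions that both succeeded and met the latency budget."""
    good = sum(1 for e in events if e.succeeded and e.duration_ms <= budget_ms)
    return good / len(events)

print(f"screen_load SLI: {sli(events, LATENCY_BUDGET_MS):.0%}")  # 2 of 4 good -> 50%
```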
How do you do that? By connecting three critical data streams:
- performance telemetry (load times, frame rates, request latencies)
- device and OS context (model, OS version, network conditions)
- user interaction events (taps, scrolls, screen navigations)
When these data streams align, you get a complete picture of user experience. For instance, you might discover that users on specific Android devices experience frame drops during scroll animations, or iOS users on older devices face longer load times for image-heavy screens.
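Once performance samples carry device context, spotting device-specific patterns is a simple group-and-aggregate. A minimal sketch with made-up records (device names and frame counts are hypothetical):

```python
from collections import defaultdict

# Hypothetical joined records: one row per scroll session, with device context attached.
samples = [
    {"device": "Pixel 4a",   "os": "Android 13", "dropped_frames": 12},
    {"device": "Pixel 4a",   "os": "Android 13", "dropped_frames": 9},
    {"device": "Galaxy S23", "os": "Android 14", "dropped_frames": 1},
    {"device": "Galaxy S23", "os": "Android 14", "dropped_frames": 0},
]

# Group dropped-frame counts by device model.
drops_by_device = defaultdict(list)
for s in samples:
    drops_by_device[s["device"]].append(s["dropped_frames"])

averages = {d: sum(v) / len(v) for d, v in drops_by_device.items()}
worst = max(averages, key=averages.get)
print(f"Average dropped frames per scroll: {averages}")
print(f"Device to investigate first: {worst}")
```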
The challenge? Traditionally, this data lives in different tools and dashboards. Consider consolidating your monitoring stack to see these patterns more clearly. Modern monitoring solutions can capture this data automatically, helping you spot reliability issues before users report them.
Setting Meaningful Reliability Goals
Once we have meaningful measurements, we can set realistic Service Level Objectives (SLOs). Target metrics that actually matter: the percentage of sessions free of crashes and ANRs, the share of screen loads completing within your latency budget, and the fraction of interactions rendered without dropped frames.
These goals directly reflect user experience and give us actionable targets for improvement.
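An SLO becomes actionable through its error budget: the fraction of "bad" sessions you are willing to tolerate. The target and session counts below are hypothetical, but the arithmetic is the standard error-budget calculation:

```python
SLO_TARGET = 0.995  # hypothetical: 99.5% of sessions should be free of reliability failures

total_sessions = 200_000   # made-up numbers for illustration
bad_sessions = 1_400       # crashes, ANRs, and slow loads combined

sli = 1 - bad_sessions / total_sessions     # observed fraction of good sessions
error_budget = 1 - SLO_TARGET               # allowed fraction of bad sessions
budget_spent = (bad_sessions / total_sessions) / error_budget

print(f"SLI: {sli:.1%}, error budget spent: {budget_spent:.0%}")
```

Here 0.7% of sessions were bad against a 0.5% budget, so 140% of the budget is spent: a clear, numeric signal to prioritise reliability work.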
The Role of Client-Side Telemetry
Server logs can't tell us if a button press felt responsive or if an animation stuttered. Only client-side telemetry can capture these crucial user experience metrics.
The key areas to monitor: app startup time, screen load times, frame rendering during scrolls and animations, network request latency and failure rates, and crashes and ANRs.
Remember to respect user privacy when collecting this data – only gather what you need, and be transparent about it.
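Client-side timing can be very lightweight. The sketch below (class name and event shape are hypothetical) times a screen load on the client and records only the screen name and duration, in the privacy-minimal spirit described above; a real app would batch and upload the events:

```python
import time

class ScreenLoadTimer:
    """Minimal client-side timing sketch: measure how long a screen takes to load."""

    def __init__(self, screen_name, sink):
        self.screen_name = screen_name
        self.sink = sink        # anything with .append; a real client would batch/upload
        self._start = None

    def start(self):
        self._start = time.monotonic()

    def stop(self):
        elapsed_ms = (time.monotonic() - self._start) * 1000
        # Only the screen name and duration are collected -- no user identifiers.
        self.sink.append({"screen": self.screen_name, "load_ms": round(elapsed_ms)})

events = []
timer = ScreenLoadTimer("home", events)
timer.start()
time.sleep(0.01)  # stand-in for real loading work
timer.stop()
print(events[0])
```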
Making Data-Driven Reliability Decisions
The real power of good reliability metrics is in decision making. When should you pause feature development to focus on performance? When is reliability "good enough"? These decisions become much clearer with solid data.
If your SLOs consistently show high reliability, you might have room to move faster on features. If metrics are trending down, it might be time to invest in performance optimisation.
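That decision rule can be made explicit. This is a toy policy, not a recommended process: the SLO target, the four-period window, and the "strictly decreasing" trend test are all assumptions made for illustration:

```python
def next_sprint_focus(sli_history, slo_target=0.995):
    """Toy policy: prioritise reliability work when the SLI is below target
    or has been falling; otherwise keep shipping features."""
    recent = sli_history[-4:]                  # look at the last four periods
    below_target = recent[-1] < slo_target
    trending_down = all(a > b for a, b in zip(recent, recent[1:]))
    return "reliability" if below_target or trending_down else "features"

print(next_sprint_focus([0.998, 0.997, 0.996, 0.994]))  # reliability
print(next_sprint_focus([0.997, 0.998, 0.998, 0.999]))  # features
```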
What reliability metrics matter most for your app? How do you balance reliability work against new feature development?
⭐ If you like this post, please check out Measure. It's an open-source tool for monitoring mobile apps. It captures crashes, ANRs, navigation events, API requests, and much more to create detailed session timelines that help you find patterns and get to the root cause of issues. Check it out here and feel free to star it for updates!