The case for opt-out telemetry

2018-07-09 opensource, software, telemetry

As I got some interesting reactions out of my previous post for poking at ActivityPub, I feel compelled to turn this into a little series where I write about topics apparently everyone in the privacy-focused bubble is concerned about, but where I have a fundamentally different opinion than “most” people. Should be fun.

So, let us dive right in and have a chat about telemetry in software, why I think more projects should use it, and why I believe telemetry should be opt-out by default.

If you have ever talked to me about this topic before, you can safely skip the following 2100 words, as you will not learn anything new. Also, this post is targeted towards people who currently have a strong opinion against any form of telemetry in software, so if you already feel like telemetry is cool… cool.

Making decisions for projects

Let’s start with leading projects. If you are involved in a project’s leadership, you probably are interested in making your project accessible and interesting to as many people as possible by building the things your users want. Some people just go ahead and make a good guess at what users want to have. Ruling a project based on “good guesses” might work for smaller projects with no goal of building a large and stable user base, but for larger projects, this approach clearly is not a very good one. Some people are not happy with blind assumptions and would like their userbase to be able to contribute to the project development in a meaningful way.

Also, always realize that your opinion is just that: your opinion. You are unfortunately not the center of the universe. Sometimes, the majority of users might disagree with what you think is right for a given project. No matter how hard you try to make your dreams come true, they might never do so. And if they did, they might actually harm the project in ways you had not even considered.

Asking your community to provide feedback

So, when there is no group of people making all the decisions, there is a fairly standard approach, especially for open source projects: asking your users directly for feedback. In some projects, this happens by asking them to open GitHub issues when they have an idea. Other projects might set up dedicated discussion forums, mailing lists, or other means of communication.

This is a great way to exchange ideas with active parts of your community, get new people into contributing to your project, or have some in-depth discussions about individual issues and implementation details. However, as great and fun as this might be, it is essential to understand that these channels are very biased, and if you base all your decisions on what got discussed there, you might end up missing a lot of people.

Your communities are flawed, no matter how hard you try

This is a simple fact that most projects tend to forget. No matter how hard you try to make these discussion channels open and accessible to everyone, you will most likely end up with a very limited subset of users ever participating there, or even looking into these channels.

To illustrate, I will take one of my favorite projects as an example: diaspora*. We are a small project in the grand scale of things, but compared to most open source projects, we are relatively large. At diaspora*, we use Discourse for project discussions. We do everything there, from collecting feature ideas and holding technical discussions about implementation details to providing support for people using diaspora* and people who want to set up a new pod. We also use Discourse to get a general feel for the project’s direction, to see if people are interested in our ideas or not, and to judge how many people may be affected by a refactor, a feature removal, or any other structural change.

That’s fine, but how accurate is our opinion here? If we have a look at our user base¹, we have a total of roughly 700k accounts on the network right now. If we compare that to our Discourse instance, we can see that, as of right now, only 1,259 accounts have ever been created there. If we look at a more meaningful number, the people who were active in the last 6 months, diaspora* still shows 57,783 active users, while on Discourse, this number drops to a mere 117, so only about 0.2% of all active users took part in these discussions. More importantly, if we asked for feedback on a change with a potentially significant impact on the UX, only 0.2% of our users would actually participate.

Effectively, we are ignoring 99.8% of our users when making important project decisions.
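
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the percentages from the figures quoted above. The variable names and the `share` helper are made up for this illustration, and the 700k total is only approximate.

```python
# Figures quoted in the text; the total account number is approximate.
total_accounts = 700_000       # accounts on the diaspora* network
discourse_accounts = 1_259     # accounts ever created on Discourse
active_users = 57_783          # diaspora* users active in the last 6 months
active_on_discourse = 117      # Discourse users active in the last 6 months


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole


participating = share(active_on_discourse, active_users)
registered = share(discourse_accounts, total_accounts)

print(f"active users taking part in discussions: {participating:.1f}%")
print(f"active users we never hear from:         {100 - participating:.1f}%")
print(f"accounts that ever signed up on Discourse: {registered:.2f}%")
```

Even the more generous comparison, accounts ever created on both sides, stays well below one percent.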

That is already pretty bad, but it gets even worse. Looking at the people who contribute on these channels, they are almost all part of a group of people who are generally also interested in software development. The number of non-technical people in these channels is horribly small. And unfortunately, this isn’t just our problem: most projects are in the exact same situation.

Look at the pie, not just some crumbs

Sometimes, you really need to know what your users think. Everyone has limited resources, and you need to prioritize certain things above others. Nobody can do everything, unfortunately. So what happens if there is a feature that you feel really needs improvement? What happens if there is that one annoying feature you either have to rebuild from scratch or remove altogether? That is where you really want to get your users involved, to ask them what they think is more important. Ask them if that annoying feature is used at all, or if folks could live perfectly fine without it.

This is where developers turn to their community, create a new issue in GitHub, or open a thread in Discourse.

But as we have outlined earlier, the answers you get this way may or may not be representative. You simply don’t know. So if you ask your community “Hey, do you use this feature or am I free to remove it?”, you have multiple possible outcomes. In the best case, the opinion of your community matches up with what everyone else feels as well. However, in the worst case, you are voting out a feature that is highly unpopular in your community-bubble, but crucial to the other 99.8% out there. You thought your decision was based on your users’ wishes, but in the end, you made a lot of them really angry.

This is not fiction. It has happened before, and it will happen again.

Ask your users, without asking them

What if we could somehow collect feedback from everyone, without having them actually do anything special? Yes, you are right! Telemetry! Instead of actively going out to people and asking for their opinion, you can use data on which to base your decision.

If you consider removing a feature, but you have no idea if people depend on it, then well, count the usages! As soon as you have numbers on how people use your application, it is easy to base such decisions on that data. Maybe it turns out that you indeed can safely remove that one feature, but you discover that another small hidden feature is used a lot. Perhaps it’s worth spending some more time refining that instead. Or maybe you find out that something you felt very strongly about and put a lot of effort into is not used at all. That could be an inspiration to reconsider one’s efforts here. Why is the usage so low? Is there some bad UX? Is there just no need for it? You would be surprised about how much you can learn from your users.
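
To make "count the usages" a bit more concrete, here is a minimal sketch of what such a counter could look like. The `UsageCounter` class and the feature names are invented for this example and do not describe how any particular project does it. Note that it only stores a number per feature, not who used it.

```python
from collections import Counter


class UsageCounter:
    """Counts how often each feature is used, without recording who used it."""

    def __init__(self):
        self._counts = Counter()

    def record(self, feature):
        """Register a single use of the given feature."""
        self._counts[feature] += 1

    def report(self):
        """Return (feature, count) pairs, most used first."""
        return self._counts.most_common()

    def unused(self, known_features):
        """Return the known features that were never used, sorted by name."""
        return sorted(f for f in known_features if self._counts[f] == 0)


# Hypothetical feature names, purely for illustration.
usage = UsageCounter()
for feature in ["like", "like", "reshare", "like", "poll"]:
    usage.record(feature)

print(usage.report())
print(usage.unused(["like", "reshare", "poll", "location"]))
```

The second call is the interesting one for the "can I remove this?" question: a feature that never shows up in the counts is a candidate for a closer look, not an automatic removal.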

Using actual data instead of feedback from a limited group takes the “guessing” out of decision making. There is no need to think about how biased potential responses are, as you have statistics on how everyone is behaving.

But… privacy?!

There are some obvious privacy implications here, and there is a right and a wrong way of approaching them.

Changing your application’s terms of service and adding a paragraph saying “we are going to monitor what you click on and when you do it, and all this data is sent to us alongside a unique identifier that allows us to track you forever” is the wrong way.

Sadly, this is what most users these days think about when they hear about telemetry. They believe telemetry is this super-surveillance machine that tracks everything they do, stores the data forever, and invades their privacy. But it does not have to be this way.

When you collect telemetry data, there is no need to be able to identify individuals. For developing your product, it is irrelevant whether Alice or Bob clicked “like” 42 times; all that matters is the fact that the “like” button is a thing that is used quite a bit. It also does not matter that Alice spent 50 minutes on your site yesterday, but only 15 today. If you have enough data points, you will get a meaningful average anyway. Make sure your users’ privacy is not violated by collecting data that could be used to track individuals².
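
As a rough sketch of the idea, the following reduces raw events to per-day totals and drops user names and exact timestamps along the way. The event layout and the `aggregate` function are assumptions made up for this example. Aggregation like this is only a first step and not a privacy guarantee on its own, which is why the footnote points to further reading.

```python
from collections import Counter
from datetime import datetime


def aggregate(events):
    """Reduce raw events to per-day feature totals.

    User names and exact times are dropped; only (day, feature) counts remain.
    """
    totals = Counter()
    for event in events:
        day = event["at"].date().isoformat()
        totals[(day, event["feature"])] += 1
    return dict(totals)


# Hypothetical raw events as they might exist inside the application.
events = [
    {"user": "alice", "feature": "like", "at": datetime(2018, 7, 9, 10, 15)},
    {"user": "bob", "feature": "like", "at": datetime(2018, 7, 9, 18, 40)},
    {"user": "alice", "feature": "reshare", "at": datetime(2018, 7, 9, 11, 5)},
]

report = aggregate(events)
print(report)
```

The report says that "like" was used twice that day, which is all a project needs to know. Whether it was Alice or Bob is simply not part of the data anymore.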

For users, it is absolutely essential that when you collect data, they have a way of looking at what you gather. Build a nice dashboard within your application that presents the data in an easy-to-understand manner with some explanation. Add a button to get access to the raw data, so all the interested people have a chance to look at that as well. When making project decisions, include your data results in that. If you remove a feature because it showed low usage, make that public. If you decide to push something else because it actually got used a lot, make that public as well. It is crucial that people understand why you collect the data, and how it is used.

Okay got it. But make it opt-in!

This is where a lot of people will get really angry with me. They might understand why data-driven decision-making is attractive to projects, and they might also understand why collecting feedback the traditional way might be biased. You can come up with excellent privacy policies, and you can come up with any proof in the world showing your collected data is not linkable to any individual. But whatever you do, they think opt-out telemetry is evil, and everything should be opt-in.

I get it, I really do. Yes, it is uncomfortable to collect data without having the user explicitly click a “yes” button. The reality, though, is that you have to make this uncomfortable move if you are interested in actually meaningful data.

We already established that basing decisions on a limited, potentially biased subset of user feedback is dangerous. No matter how friendly your opt-in call is, most users will decline regardless. They do not reject opt-in telemetry because they disagree with collecting data; they merely refuse because accepting it would require more effort. If you just ask them to enable “telemetry,” they have no idea what data you collect, and they will decline because they feel spied on. If you write a lengthy text explaining everything, most people will also decline, because they do not want to read this text when all they want is to use your application. There is no way around that. Sorry.

So… allow users to look into the data you collect. Allow users to see how their data impacts your decisions. Allow users to turn data collection off if they really want to. But understand there is a good reason why telemetry in software sometimes is opt-out, not opt-in.
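
In code, the difference between opt-in and opt-out is often just a default value. Here is a small sketch under that assumption. `TelemetrySettings` and `Telemetry` are hypothetical names for this example, not an existing library.

```python
from dataclasses import dataclass


@dataclass
class TelemetrySettings:
    # Opt-out: collection is on unless the user switches it off.
    enabled: bool = True


class Telemetry:
    def __init__(self, settings):
        self.settings = settings
        self.pending = []  # events not yet sent; users can inspect this list

    def record(self, feature):
        """Queue a usage event, unless the user has opted out."""
        if not self.settings.enabled:
            return
        self.pending.append(feature)

    def opt_out(self):
        """Stop collecting and discard anything not yet sent."""
        self.settings.enabled = False
        self.pending.clear()


telemetry = Telemetry(TelemetrySettings())
telemetry.record("like")
print(telemetry.pending)   # the user can see exactly what would be sent

telemetry.opt_out()
telemetry.record("like")   # ignored from now on
print(telemetry.pending)
```

The important design choice is that opting out takes effect immediately and also throws away whatever was queued, so turning it off really means off.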

Please stop being angry at developers for using telemetry as another way of collecting feedback and driving projects. It is important. Most likely, they also have privacy concerns and work as hard as they can to ensure the data is safe. It is their users’ data, after all. They don’t want to invade people’s privacy; they want to build great projects.

¹ Funnily enough, this is actually very hard. While we have some statistics about how large our user base is, submitting this information is entirely optional, and there might be a large number of nodes simply not showing these numbers.
² This could be a topic for a post on its own, but I will not get into that. There is a lot of interesting literature out there, and you will find a lot if you ask your favorite search engine. Differential privacy is a good starting point for your research.