Yes, it is exactly as the title suggests. We'd like to start collecting usage statistics for our NannyML library.
What does that mean, usage statistics? Why do you need this? What will they be used for? How are you getting this information? Do I have to agree to this?
We aim to answer all these questions in this blog post. Our goal here is to inform you and - most importantly - learn how you feel about this. We believe this will greatly help us improve NannyML without any compromise to your privacy, but we’d like to hear about any concerns you might have first.
- We want to collect anonymous statistics about what functions you use in NannyML
- We won’t collect any information about your identity or your datasets
- We will use this information to improve NannyML and as a way to illustrate traction for our (future) investors
- You can easily opt out; we provide multiple ways to do so
- You can always reach out to us and request any data related to you be removed
Keep reading if you want the long answer!
What does that even mean, usage statistics?
We first need to explain what we consider to be usage statistics. From November 27th, we'll collect statistics about the general usage of the NannyML library. Every time one of our essential functions is used, a data package will be shipped to an external service.
The essential functions are the following:
- Fitting any calculator or estimator
- Calculating/estimating using a calculator/estimator
- Plotting results
- Writing results to filesystem, pickle, or database
- Running NannyML using the CLI
The data that is collected and shipped has three different parts:
- Environment data: tells us more about the computational environment NannyML is running in. This includes figuring out if NannyML is running in a Python application, a notebook, or a container. What version of Python is it being used with? What operating system? What version of NannyML is being used?
- Execution data: data that helps us understand what functionality of NannyML you're using and how it is performing. This is limited to the name of the key function you're running, the metrics or methods you're calculating and the time it took NannyML to finish that function call. We’ll also check if any errors occurred during that run.
- Identification data: a fingerprint is created based on the hardware present in your machine and used as a unique identifier. Running NannyML from the same machine twice should, in theory, yield the same unique identifier, although this doesn't apply to Docker. This allows us to detect repeat usage patterns without the need for personal identification.
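To make the three parts more concrete, here is a rough sketch of how such a data package could be assembled with the Python standard library. It is an illustration only, not NannyML's actual implementation: the function names, the dictionary keys, the notebook heuristic, and the use of a hashed `uuid.getnode()` value as a fingerprint are all assumptions.

```python
import hashlib
import platform
import sys
import uuid
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_data():
    """Environment data: Python version, operating system, library version."""
    return {
        "python_version": platform.python_version(),
        "os": platform.system(),
        "nannyml_version": package_version("nannyml"),
        # Crude heuristic (assumption): a loaded ipykernel suggests a notebook.
        "in_notebook": "ipykernel" in sys.modules,
    }


def machine_fingerprint():
    """Identification data: a hash of a hardware-derived number.

    uuid.getnode() is only a stand-in here; it can fall back to a random
    number when no hardware address is available.
    """
    return hashlib.sha256(str(uuid.getnode()).encode()).hexdigest()


def build_event(function_name, metrics, run_time_s, error=None):
    """Combine the three parts; no arguments or dataset details are included."""
    return {
        "environment": environment_data(),
        "execution": {
            "function": function_name,
            "metrics": list(metrics),
            "run_time_s": run_time_s,
            "error": type(error).__name__ if error is not None else None,
        },
        "identification": {"fingerprint": machine_fingerprint()},
    }
```

Hashing the hardware-derived value means the raw identifier itself never needs to leave the machine, while repeat usage from the same machine can still be recognised.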
The following snippet is an actual usage statistic data package sent from NannyML running in a container:
What about personal data?
Apart from the hardware ID, there is nothing to link back to your machine, let alone to your identity. You have our word on this: we will never collect any Personally Identifiable Information. And don't just take our word: verify it! We invite you to review the implementation.
What about my dataset?
We deliberately avoid logging any arguments when running a key function to minimize the risk of leaking unwanted and unnecessary information. One exception: the names of metrics and methods being used in the calculators or estimators.
We collect no information about the structure, size, or contents of your datasets. Your datasets remain yours only.
We have good reasons for wanting to collect these usage statistics.
Improving NannyML and prioritizing new features
It is an easy claim to make. We are serious about it though. Looking at the aggregate usage statistics can teach us what kind of functionality is used frequently and if there is functionality not used at all.
It can help us improve the user experience by looking at patterns within the usage events, tackle long processing times, and help prevent feature-breaking exceptions.
By distributing NannyML as a library to run on your system as opposed to a service hosted by us, we have no other way to gain these insights.
Surviving as a company
At NannyML, we care about the impact of ML models performing sub-optimally. It is our vision that the core functionality we build, i.e. the algorithms distributed as the NannyML library, should be available to everybody, for free, forever. This was the main driver for building an open-source library. But the world of tech startups has always been a tough one, and even more so in the last few years.
Because we work in open source, the NannyML library doesn't generate any revenue. We depend on external investors to provide us with the resources to continue our work, survive, and maybe even thrive.
We want to verify if NannyML is worth putting all our effort into, and investors want to verify if it is worth putting their resources into. Aggregate usage analytics provide the actual figures needed to secure funding, as well as motivation.
We'll give a very brief overview of how we've implemented usage analytics.
- We've created a usage_logging module within the library. It contains all the functionality related to usage analytics. Feel free to browse the source code.
- We instrument our library by adding a log_usage decorator to our key functions, sometimes also providing some additional data (e.g. metric names).
- Upon calling one of these key functions, the decorator will capture the required information. Our usage_logging module will then try to send it over to Segment, a third-party service provider specializing in customer data.
- The usage events are aggregated and turned into insights in Mixpanel, another third-party service provider specializing in self-service product analytics.
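The decorator approach described in the list above can be sketched roughly as follows. This is a simplified illustration and not the code in the usage_logging module: the `log_usage` signature, the event fields, and the `send_event` stand-in (which appends to a list instead of calling Segment) are assumptions.

```python
import functools
import time

SENT_EVENTS = []  # stand-in for the third-party analytics service


def send_event(event):
    """Hypothetical sender; a real one would perform a network call."""
    SENT_EVENTS.append(event)


def log_usage(event_name, metadata=None):
    """Record which key function ran, how long it took, and whether it failed."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = exc
                raise  # the caller still sees the original exception
            finally:
                event = {
                    "event": event_name,
                    "run_time_s": time.perf_counter() - start,
                    "error": type(error).__name__ if error is not None else None,
                    # Only explicitly supplied extras (e.g. metric names);
                    # the function's arguments are never recorded.
                    "metadata": dict(metadata or {}),
                }
                try:
                    send_event(event)
                except Exception:
                    pass  # analytics must never break the user's call

        return wrapper

    return decorator


@log_usage("calculator_fit", {"metrics": ["roc_auc"]})
def fit(reference_data):
    return len(reference_data)
```

Wrapping the send in its own try/except reflects the "try to send" wording above: a failing analytics call should not affect the result of the decorated function.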
To opt in or not to opt in, that's the question
Whilst our team at NannyML saw the need for usage analytics, we did have some deeper discussions about how to present this to you, the end user.
Do we disable usage analytics collection by default and have the end user explicitly opt in? Whilst it felt very intuitive and "correct" to do so, we asked ourselves the following question: "Would I go through the trouble of explicitly enabling this every time I use NannyML?" Our answer was no; we probably wouldn't bother. And if we wouldn't, it is only fair we don't expect you to.
We settled on opt-out behavior, meaning usage analytics will be enabled by default, for the following reasons:
- We don't collect any information that can identify our users
- We don't collect any information about the data NannyML is used on
- We provide an easy way to turn usage analytics off, without any limitations on the product
- We believe that if you keep using NannyML, you probably want us to survive as a company
How to disable usage logging
It should be easy to disable logging. We provide three ways of doing so. The first way - using environment variables - is universally applicable and easy to set up, so we recommend using that one.
Setting the NML_DISABLE_USER_ANALYTICS environment variable
You can set this variable before running NannyML as a script, CLI command, or container. Its value doesn’t matter, as long as the environment variable is present.
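If you launch NannyML from a Python script, one way to set the variable is through `os.environ` before any NannyML call is made. The snippet below is a small sketch: the variable name comes from this post, while `analytics_disabled` is a hypothetical helper that only mirrors the presence-only check described above.

```python
import os

# Any value works, even an empty string; only the variable's presence matters.
os.environ["NML_DISABLE_USER_ANALYTICS"] = "1"


def analytics_disabled(environ=os.environ):
    """Hypothetical check: True whenever the variable exists, whatever its value."""
    return "NML_DISABLE_USER_ANALYTICS" in environ
```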
Providing a .env file
NannyML will check for .env files, allowing you to provide environment variables without dealing with shells. Just create a .env file in the directory of your script and NannyML will pick it up automatically.
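As an illustration of what such a file contains and how it might be read, the sketch below writes a one-line .env file to a temporary directory and loads it with a minimal KEY=VALUE parser. The parser is an assumption made for this example; it is not how NannyML itself reads the file.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path, environ=os.environ):
    """Minimal KEY=VALUE loader; existing variables are left untouched."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed entries
        key, _, value = line.partition("=")
        environ.setdefault(key.strip(), value.strip())


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    # The whole file is this single line.
    env_file.write_text("NML_DISABLE_USER_ANALYTICS=1\n")

    demo_environ = {}  # a plain dict, so the real environment stays unchanged
    load_env_file(env_file, demo_environ)
```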
Turning off user analytics in code
If you don't like toying with environment variables, you can just disable (or enable) the usage analytics within your code before running anything. You can only do this when using NannyML as a library.
We'll be waiting for your comments and feedback for a couple of weeks. We’ll enable the functionality
We're open to all constructive remarks and feedback. Feel free to reach out to us in any way you feel comfortable with. Spark a discussion on our community Slack or reach out to us via direct message, book a slot to have a call, send us an e-mail, or create an issue on our GitHub repository.
Thanks for reading this far! We hope you've gained some insights into the motivations and reasoning behind this prickly topic.