Pkg.jl telemetry should be opt-in

mixologic · on July 5, 2020

I feel like many developers fail to understand the difference between the ethos of Free/Libre/Open source software, and the realities of running a networked service.

Services are not free (as in beer) - they always take time, money, and labor to provide. A PkgServer.jl is exactly the kind of thing that has to be sustained somehow.

It's not possible to use a networked service without exchanging some information with that service, which may or may not be useful for the service providers to collect, so that they can provide a better service (Read: make it cost less)

The idea that one should be entitled to use a service, for free, and at the same time ask that the service does not collect any data, or make it opt-in by default, is akin to demanding free beer that people can optionally pay for.

Caveat: My bias is from being a service provider for a packaging endpoint, a security updates endpoint, and a community CI service. Any telemetry data we can get our hands on to help us make informed decisions about what to support, and what to drop support for, is absolutely invaluable.

rnhmjoj · on July 5, 2020

As a user of free software I have the opposite bias. If a software is built around a community, which should be the case of free software, I think it's better if decisions are taken by asking the users, be it forum discussions, polls or whatever.

As you mentioned, reducing the costs usually amounts to taking something out: this can piss off users, particularly if they were not informed and no discussion took place. An example that comes to mind was the decision by Mozilla to stop supporting the ALSA driver in Firefox based on the telemetry showing little usage.

This example also shows that often data are biased and making a decision solely based on data is not ideal: ALSA is (was) the default choice on most GNU/Linux and BSD distributions, where Firefox is usually built and distributed by the maintainers with telemetry disabled.

> is akin to demanding free beer that people can optionally pay for.

This is similar to how donations work: with donations you can't force people to give you money but you can be very insistent and it can leave users with a bad taste in their mouth. Also, if you are implying the service providers are entitled to collect all the data they can, I think this must have limitations. Running code on the client side from which the provider only can (directly) benefit should require permission, because you grant users access to server and so should the users grant you access to their machine.

mixologic · on July 6, 2020

Gathering telemetry data is asking the users, and happens to be the most cost effective way to do so. Forum discussions do not scale and only satisfy the needs of the loudocracy, and polls are going to suffer from a similar participation bias.

Disabling telemetry by default ended up harming the users of those distributions by making their usage of ALSA invisible. Again, that choice by the distro maintainers isn't something that Mozilla had any control over. Voluntarily withholding telemetry data is similar to abstaining from an election. You don't get to complain about who got elected if you didn't vote, likewise you don't get to complain about lack of support for your use cases if you don't allow upstream projects to know that you have them.

storedbox · on July 6, 2020

> Gathering telemetry data is asking the users

No, it absolutely is not.

orf · on July 6, 2020

Firefox disabling ALSA after the people that use it most refused to tell Mozilla that they did in fact use it seems like a “leopards ate my face” moment.

Telemetry IS asking their users in a way that scales past putting a random poll in some random forum that only a specific subset of people will ever interact with.

You can have some idealistic view of how feedback should be holistically gathered by “just damn listening to people!”, but this entirely glosses over the specifics of how it will scale and how accurate it will be which is vitally important when you go past 10,000 users.

floatingatoll · on July 6, 2020

Listening to people biases the outcome towards those who feel confident speaking up, and have time and energy and ability to do so. Having those circumstances apply is a privilege, and does not implicitly grant those who have that privilege the right to control the outcome.

Telemetry provides unbiased data about the vast majority of users, especially including those not privileged enough to make their needs known (for any one of thousands of reasons) — with the notable exception of various Linux distributions who have decided that they do not wish their users to be represented in the data, and certain power users who simultaneously demand the right not to contribute unbiased data yet seem to feel entitled to have their needs met.

I didn’t know that Firefox disabled ALSA but it doesn’t surprise me at all that it impacted users whose telemetry is disabled for idealistic reasons. Open source is discovering that the economics of free and the burden of long tail support are not compatible goals, and there will be many more pain points in many more open source products along this road. In the past, the squeaky wheel would get the grease; now, sometimes, the car might be converted to a tricycle and the squeaky wheel disposed of as superfluous to the greater good.

Most of the time, I take for granted that this is where the rubber meets the road on their ideals: The distro’s ideals demand that their users not be represented via telemetry, and so their users pay that price in exchange for the ideals they (often unknowingly) signed up for with that distribution. If upstream fails to take them into account as a result, the distro already accepted on behalf of its users that upstream will make decisions that aren’t compatible with their users as a result of disabling telemetry. Idealism is not free of costs.

In my more cynical moments, I wonder if this is because they fear that unbiased data would show that their use cases are in the severe minority (like, 0.0001%), and so think there’s a higher chance of having their needs met by withholding telemetry and then banging the outrage drums loudly to make their numbers look larger than life. Game theory suggests it’s a possibility, anyways.

_7bxa · on July 6, 2020

It feels like as a user you can say whatever you want without considering concrete realities such as "Julia needs to know how many people use the software for funding".

The legal document also clearly indicates that Julia is collecting a very small amount of info, so there's no trickery either.

throwawaw · on July 5, 2020

This is an extraordinarily level-headed and well-reasoned version of the "how much telemetry" conversation, from both "sides". The Julia community comes off looking really good here.

pwdisswordfish2 · on July 5, 2020

   Moreover, at present, we have no idea how many people use each solver (and on which platform!). Knowing how many people installed which solver
   would allow us to prioritize support from our finite developer time.

Why not just let users vote on that? The support is for the users, no? Instead the developers want to minimise the amount of time they spend on maintenance based on the number of users who could potentially complain. The reason for this is (as we are about to be told) so they can spend more time working on platforms where they believe commercial solver developers could provide for-profit support services "(or $$)".

   This would also allow us to lobby the commercial solver developers to provide official support (or $$). To quote one company "We'll want to
   provide official support at some point, but it looks like the scales haven't tilted quite yet." It'd be nice to know whether 100, 1000, 10000, or
   100000 people per month use their software; that might change their mind.

The truth comes out. Collecting data via "frictionless" telemetry allows someone else, e.g., commercial solver developers, to make money. Nothing wrong with that if we let users know about these intentions, however when developers try to operate under the guise of "free", "non-profit", "open source", etc. while, truthfully, they have commercial motives, then it seems to me they are doing everything they can to avoid tipping users off that this aims to be a commercially-oriented project. Instead of just being transparent about their motives and letting users decide, they want to sneak something by (most) users. The issue raised here is not the collecting statistics (nothing wrong with that), it is the less transparent, opt-out nature of it: telemetry. Deceptiveness, stealth. The message coming from this discussion is "Don't tip (majority of) users off that we are collecting data." And why is that? Because the developers know this is something most users do not want.

   Finally, if it is opt-in, the vast majority of users will not opt-in. This leaves us no better off than we were before. Opt-out is a good
   compromise.

The discussion should have ended right here. If providing usage statistics is something that the Julia developers already know the vast majority of users do not want to do, then sneaking it by them via opt-out telemetry is wrong, and it tells us much about the people behind Julia. If users do not want it, and you know that, then why the heck are you doing it anyway? Anyone reading this will know why, but most users will probably never read what we are reading here.

The rest of this discussion devolves into "Everyone else is doing it". The lone dissenter finally gives in to peer pressure.

I remember when using download statistics was enough. Developers still maintained software. No "trade-offs" were needed.

mbauman · on July 6, 2020

This is quite the caricature of what's happening in that thread.

> The truth comes out.

The truth was never hidden. It's all laid out quite plainly here: https://julialang.org/legal/data/.

> If providing usage statistics is something that the Julia developers already know the vast majority of users do not want to do

There are three reasons someone might not opt-in: they don't want to, they don't know about it, or they simply don't care. To ignore the latter two is simply disingenuous.

> The rest of this discussion devolves into "Everyone else is doing it". The lone dissenter finally gives in to peer pressure.

That's certainly not my read of the 200+ post thread.

pwdisswordfish2 · on July 6, 2020

You are certainly entitled to your developer perspective.

What are the reasons users do not read the /legal/data page on the Julia website? What are the reasons users do not read 200+ posts from developers debating the use of opt-out telemetry? To ignore such reasons, assuming they exist, would also be disingenuous.

If you put the choice clearly before users and they knowingly, affirmatively choose to submit usage data, then no 200+ post thread is necessary. Instead, the choice is not left to users. It is made by developers, and the fact of the use of opt-out telemetry is found on a webpage that developers know users do not read.

spenczar5 · on July 5, 2020

> I remember when using download statistics was enough. Developers still maintained software. No "trade-offs" were needed.

Do you remember applying for grants to fund software? It's tough out there, right now. Harder than it once was - software is more expensive, and funding agencies are more critical.

mlubin · on July 5, 2020

> I remember when using download statistics was enough.

No download statistics are currently available for Julia packages. That's essentially the issue that the Pkg.jl telemetry is trying to address.

ViralBShah · on July 5, 2020

Julia packages are github repos, where all we get are the traffic stats for the last 2 weeks for clones. It doesn't even provide the number of downloads of released software (the tarballs), or even basic stats that you could get from webserver access logs.

jcsuggestions · on July 7, 2020

You could instead possibly:

- move the project and ecosystem completely off GitHub
- sell the project to a FAANG
- go with the MathWorks model and just straight up sell closed source proprietary code (they have 5000+ employees and growing, you have ~30, maybe 40 if you count the JuliaLab that essentially work for JC as well)

As much as people in the Julia community have dunked on Matlab in the past, at least MathWorks has their business model worked out, people understand the tradeoffs being made, and the dark pattern being used is just closed source software instead of exploiting PII.

dklend122 · on July 7, 2020

How are anonymous UUIDs PII and how is JC exploiting them?

They have no special access nor do I see a first order effect that benefits them. Just that the open source Julia ecosystem will benefit and that will feed back into JC's market.

thayne · on July 6, 2020

> If providing usage statistics is something that the Julia developers already know the vast majority of users do not want to do,

I think it is more likely the majority of users don't care, so do whatever the default is. If the vast majority didn't want telemetry enough to opt out anyway, then there wouldn't be any point in having it at all.

pwdisswordfish2 · on July 6, 2020

Edit: The issue of no download statistics is an issue for the developers, not an issue for users. Why is it an "issue" for developers of a "non-commercial" project in the first place? Because the developers actually have commercial ideas for the project. Users did not create this "issue", developers did. Instead of resolving the issue created by using a third party service for downloads by making a compromise, e.g., forgo using a third party to host downloads, or asking users to help them resolve it, e.g., by agreeing to submit usage stats, a new problem is passed on to users instead: opt-out, default telemetry. This is a user perspective. Compare and contrast this with a developer perspective.

parsimo2010 · on July 5, 2020

There are a few concerns I have as an occasional Julia user. When I update my packages is this going to be a silent change, or can we get a notification and a Y/N option to opt out when updating? How visible and easy is it to change this setting after updating if I change my mind?

I don’t have a specific concern about the Julia team using my data, but I have general concerns about companies collecting telemetry. Can’t they get a rough estimate of active users by counting unique IP addresses over the past X months which doesn’t require opting people in to telemetry?

Edit: I think I read the link incorrectly. This person is arguing that users should have to actively opt-in, not that they are opted in automatically. They are arguing for a change that would increase privacy, and I need to opt-out in my current installation. I didn’t know I was sending telemetry right now.

dnautics · on July 5, 2020

Downthread there's a comment by JohnMylesWhite that addresses your point:

> I think this is the crux of the issue: you’re already doing that across the Internet since your IP address is part of many (most?) normal HTTP requests. It’s not perfectly uniquely identifiable, but it’s not so far away from being that and it’s being submitted without even the possibility of opt-out in most cases / for most people.

> So I think the core issue this thread should resolve: would it be better for Julia to just do everything via logging IP addresses? That’s what everyone else in OSS is already doing (seemingly without almost any concerns), so perhaps the problem is just that Julia is talking about how to best do things rather than just doing them? That feels quite perverse to me, but it’s my big fear after reading this thread.

mirekrusin · on July 6, 2020

This whole thing is making noise about nothing.

Other package managers like npm already do this kind of telemetry, and without opt-out option, because they are centralised.

Julia packages are not, there is no central server.

They want to access primitive view of the package usage so they know where the whole ecosystem is, how to prioritise and help with getting funding – as they mention in the thread it's difficult to raise money if you can't present basic userbase numbers.

nwvg_7257 · on July 5, 2020

You are not sending telemetry right now. That is a feature which will be activated in the upcoming 1.5 release. It will display a notification.

KenoFischer · on July 5, 2020

While this is true of the feature mentioned, do note that packages are currently hosted on various third party hosting services that can and do track substantially similar information. In 1.5, we're moving to our own infrastructure for serving packages (which should give better performance and allow things like incremental updates). This thread is about what information gets sent along with those requests.

fiddlerwoaroof · on July 5, 2020

If you make an HTTP request, you are sending “telemetry” information in the form of endpoints, headers and IP information. The server may not track this information, but it’s exploitable

parsimo2010 · on July 5, 2020

My issue is that I’ve come to terms with the fact that the IP address of every connection can be tracked server side- I can use a VPN to get a little anonymity but can’t stop a server from logging connections and downloads. But telemetry adds data on top of that, and it seems like a lot of software wants to track me. I’d feel it was okay if I was required to register an account and log in before downloading/updating packages, that’s a noticeable action that lets my brain process the idea that I’m able to be tracked. But sending “anonymous” metadata with almost no action on my part rubs me the wrong way. Lots of devs try to optimize things so they are low friction for users, but I think the Julia user base is a little different than normal software and wouldn’t mind a little friction if it meant they had better control of their privacy.

fiddlerwoaroof · on July 5, 2020

As far as I can tell, this isn’t adding anything to IP sharing: the package manager just attaches a persistent UUID to every request. In fact, it is more private than IPs because it can’t be tied to an ISP or geographical region.
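To make the mechanism being discussed concrete, here is a minimal sketch of a client that generates a random UUID once, stores it, and attaches it to every request. This is only an illustration and not Pkg.jl's actual implementation: the header name `X-Client-UUID`, the storage location, and both function names are assumptions made up for the example.

```python
import tempfile
import uuid
from pathlib import Path


def load_or_create_client_id(path: Path) -> str:
    """Return the UUID stored at `path`, creating and saving one on first use."""
    if path.exists():
        return path.read_text().strip()
    # uuid4 is random: it is not derived from the hostname, MAC address, or user.
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id)
    return client_id


def request_headers(client_id: str) -> dict:
    """Headers sent with each package request (header name is hypothetical)."""
    return {"X-Client-UUID": client_id}


# Hypothetical storage location; a real client would use a config directory.
state_file = Path(tempfile.mkdtemp()) / "client-uuid"
first = load_or_create_client_id(state_file)
second = load_or_create_client_id(state_file)
print(first == second)  # the same ID is reused across runs
print(request_headers(first))
```

The identifier itself carries no ISP or location information, but because it persists, a server that also logs IPs can link requests made from different networks to the same installation.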

ninjin · on July 6, 2020

As someone that was active in the previous HN thread [1], this one, and in the Discourse one this position has popped up several times and it perplexes me. Attaching a persistent UUID on top of a protocol that carries your IP can not be more private as you are giving away additional information that would have to be inferred statistically from the IP alone. Now, we can argue other benefits of the UUID, but simply calling it a day by ignoring the fact that you are already giving away your IP is just baffling to me. Am I being thick here? What am I missing?

[1]: https://news.ycombinator.com/item?id=23706271

detaro · on July 6, 2020

> What am I missing?

I'm guessing there's an unspoken assumption that given a UUID the server-side would not log IPs. It then comes down to trust that they'd stick to that.

ninjin · on July 6, 2020

Thank you, that could be it. Then again, there would at least have to be a separate log somewhere on the same box with IPs to counter abuse. I think creating and using a UUID without explicit opt-in is still a red line for me, but I do concede that I could very well be too paranoid for the good of myself and the community as a whole.

I should probably get back into the Discourse thread to see if I can contribute constructively, but the amount of back and forth between mostly “My freedom!” and “Tū quoque!” [1] in the thread over the weekend – apart from me being far too busy to take the time to summarise it all – has kept me away, although it looks way better over the last few hours. With the little free time I have I would rather work on my Julia code. '^^

[1]: https://en.wikipedia.org/wiki/Tu_quoque

fiddlerwoaroof · on July 6, 2020

Yeah, it sounds like they’re designing a way for package authors to get usage stats: imo, this extra piece of data doesn’t really help the server owners de-anonymize because it’s less identifying than the data the server is already collecting as an http server (especially if it’s in an unlogged part of the request like a header or a post body). But, even if it is a privacy risk relative to the server owners, it’s preferable that data derived from this uuid be shared with package authors, rather than IP-based data, because it’s based on a less-identifying datasource, which means that even if someone were to breach the database, they’d have less ability to de-anonymize people.

Also, I find this whole discussion to be somewhat irrelevant when talking about a service serving up arbitrary code to be executed on your machine: if you don’t trust the server owners, you really shouldn’t be executing the code they serve up.

SweetestRug · on July 5, 2020

This strongly resonates with me. As someone who has considered Julia, I would be happy to sign up so that devs could use the information. Opt-out instead of opt-in is a dark pattern; I am uncomfortable seeing it used. Julia needs adopters to grow its community. This move frankly makes me less likely to use the language in the future. I am certain I am not the only person who feels this way.

Tarrosion · on July 5, 2020

The back-and-forth in that thread is a great discussion. One thing I hadn't realized is that many other popular languages are already doing something similar. See this post for a bit more detail: https://discourse.julialang.org/t/pkg-jl-telemetry-should-be...

KenoFischer · on July 5, 2020

Hi HN, please note that this is an active discussion thread in the Julia community. You are all more than welcome to chime in, but we do try to keep discussions as productive as possible, so if you do decide to comment, I'd ask that

1) You familiarize yourself with the actual proposal and the improvements that are currently underway and

2) Be kind

A number of people have put in an enormous amount of effort to try and get this right - please remember that they are indeed people.

papaf · on July 5, 2020

Is the telemetry available to users? I gladly opted into Syncthing telemetry after seeing this page: https://data.syncthing.net/

When the data is available to the community, just like the source code, it's a much easier sell.

KenoFischer · on July 5, 2020

The plan is to make aggregate usage data available publicly and potentially share more detailed usage data with individual package authors. The exact format is TBD since it'll depend on the quality of the data that we get (this is not active yet, except on the preview build). The raw logs will be accessible to core developers with a reasonable need to access (e.g. they're working on the infrastructure or running the analytics), but will not be public.

j88439h84 · on July 5, 2020

How about deleting the IP addresses within 48 hours like 1.1.1.1 and 8.8.8.8 do?

https://developers.google.com/speed/public-dns/privacy

staticfloat · on July 5, 2020

We do have a limited retention policy for the package server logs we keep (which include client IP addresses). It's not publicly stated anywhere right now, but one reason why we need to keep IP addresses is for abuse mitigation. We have been hit in the past by users that do things like download large (100MB+) files from our package cache servers multiple times a second for days on end. This is a particularly easy case to catch (since it easily pops to the top of any analysis you'd care to run, across any timespan) but there are more subtle forms that require a longer time window of analysis (e.g. users that download once per hour, all month) that would be lost in the noise without the ability to see what's going on.

This comment is not meant to serve as an official policy, just pointing out one of the reasons why we can't delete IP addresses like 1.1.1.1 and 8.8.8.8 do; because the abuse vectors for a server that serves the community large resources is very different from that of a DNS server.

Most of the "abuse" we see is not malicious in nature, but is instead users that have some kind of very poorly-configured autoinstaller on a cluster. In the case of a catastrophic issue like the one mentioned above, we null-routed the IP address, reached out to the abuse contact for that IP, and worked with the user to architect a better system. Everyone is happy now, and we can continue to provide a high quality service for the community without breaking the bank.
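The point about slow, steady downloaders needing a longer analysis window can be illustrated with a small sketch. The log format, addresses, and threshold below are invented for the example and do not describe the actual package server tooling.

```python
from collections import Counter

HOUR = 3600
DAY = 24 * HOUR


def heavy_clients(log, start, end, threshold):
    """Return addresses with at least `threshold` requests in [start, end)."""
    counts = Counter(addr for ts, addr in log if start <= ts < end)
    return {addr for addr, n in counts.items() if n >= threshold}


# Hypothetical log of (timestamp_seconds, client_address) entries:
# one client downloads once per hour for 30 days, another only 5 times.
log = [(h * HOUR, "192.0.2.10") for h in range(30 * 24)]
log += [(d * DAY, "192.0.2.20") for d in range(5)]

one_day = heavy_clients(log, 0, DAY, threshold=100)
one_month = heavy_clients(log, 0, 30 * DAY, threshold=100)
print(one_day)    # nothing stands out within a single day
print(one_month)  # the hourly downloader only shows up over the month
```

With a one-day window the hourly client makes 24 requests and stays below the threshold; over 30 days it makes 720 and is flagged, which is only possible if the addresses are retained that long.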

edw · on July 5, 2020

How about hashing IPs? You could still see if someone were on your abuse list if abusers.contains(hashfn(req.addr)).
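A minimal sketch of that suggestion, assuming an unsalted SHA-256 as `hashfn` and a plain set as the abuse list (both are choices made for this example, not anything the package servers are known to use):

```python
import hashlib


def hashfn(addr: str) -> str:
    """Hash an IP address string; unsalted SHA-256 is an assumption here."""
    return hashlib.sha256(addr.encode("ascii")).hexdigest()


# The blocklist stores only hashes, never the addresses themselves.
# 203.0.113.0/24 is a range reserved for documentation.
abusers = {hashfn("203.0.113.7")}


def is_blocked(addr: str) -> bool:
    """Check an incoming request's address against the hashed blocklist."""
    return hashfn(addr) in abusers


print(is_blocked("203.0.113.7"))
print(is_blocked("203.0.113.8"))
```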

KenoFischer · on July 5, 2020

Doesn't help for two reasons: 1) If the hash has enough bits to be useful for blocking, it's trivial to reverse. 2) Even if it did make the IPs anonymous, we want to be able to email the NOC at whoever is sending the abusive traffic, so they can go investigate.

j88439h84 · on July 5, 2020

> we want to be able to email the NOC at whoever is sending the abusive traffic, so they can go investigate

If you block their traffic with HTTP 429 Too Many Requests, they can email you instead.
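As a rough illustration of that approach, a per-address sliding-window limiter could decide when to answer with 429. The class name, limits, and the explicit `now` parameter (used so the example is deterministic) are all assumptions for this sketch.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `max_requests` per address within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # address -> timestamps of accepted requests

    def status_for(self, addr: str, now: float) -> int:
        """Return the HTTP status to send: 200, or 429 Too Many Requests."""
        hits = self._hits[addr]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429
        hits.append(now)
        return 200


limiter = RateLimiter(max_requests=3, window_seconds=1.0)
statuses = [limiter.status_for("198.51.100.9", now=0.1 * i) for i in range(5)]
print(statuses)
```

Note that this throttles every client behind the same address equally, which is the trade-off raised in the reply below.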

KenoFischer · on July 5, 2020

We prefer not to break researchers' workflow because the group next door misconfigured their server. Happens all the time. We only sinkhole IPs if the traffic is malicious or on track to exceed our budget.

codedokode · on July 5, 2020

Hash of IPv4 address can be easily reverted because there is a limited number of addresses.
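A small sketch of why this holds, assuming an unsalted SHA-256 of the dotted-quad string. The full IPv4 space has only 2^32 (about 4.3 billion) candidates; to keep the example fast it searches a single /16 (65,536 addresses), and the function names are made up for illustration.

```python
import hashlib
import ipaddress


def hashfn(addr: str) -> str:
    """Unsalted SHA-256 of the address string (an assumption for this sketch)."""
    return hashlib.sha256(addr.encode("ascii")).hexdigest()


def reverse_hash(target: str, network: str):
    """Brute-force every address in `network` until one hashes to `target`."""
    for ip in ipaddress.ip_network(network):
        if hashfn(str(ip)) == target:
            return str(ip)
    return None


leaked = hashfn("10.20.30.40")                    # what a hashed log would contain
recovered = reverse_hash(leaked, "10.20.0.0/16")  # search a 65,536-address range
print(recovered)
```

Adding a secret salt or key would block this particular search for anyone without the secret, but the operator holding the secret could still enumerate the address space the same way.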

KenoFischer · on July 5, 2020

I don't work on this particular thing, so I can't say precisely what the planned retention period is. I suspect 48 hours is too short, since people do take weekends ;). It'll probably become clear with experience what retention periods work. DNS servers are in a very unique position of course since they essentially get your browsing history.

ptx · on July 5, 2020

Since the data is not being made public, presumably it is judged to be sensitive to some extent?

So it follows then that users are right to be concerned and would have every reason to not opt in if they were presented with the choice.

KenoFischer · on July 5, 2020

What's sensitive or not depends very much on what other information the entity doing the analysis has available. Of course raw log records are more sensitive than aggregate data. For example, if somebody is wiretapping your internet connection, then even if the connection is encrypted raw logs would let them draw conclusions from timing. To some extent you're trusting the Julia project (or at least the people who have access) to not clandestinely be in the wiretapping business, but then again you're already trusting it with arbitrary code execution on your machine, so if it were in that business, you'd have bigger problems ;).

In any case that's why it's important to be transparent about what is sent, and for what purpose and who has access, so people can make informed decisions. Ironically, I think people are jumping on the authors of this particular piece of functionality precisely because they tried to be very transparent.

ptx · on July 5, 2020

Yes, they are transparent about deciding not to offer the user the choice in a straight-forward upfront way (i.e. opt in) because "the vast majority of users will not opt-in". In other words, deciding that what the users want is not as important as the marketing stats.

And, as you say, users trust the developers with access to their systems and data. Deciding unilaterally to sacrifice user privacy to benefit other interests might be seen as a breach of that trust.

orf · on July 6, 2020

It’s actually more that “the vast majority of users do not care” and so go with the defaults.

An opt-out is the way that makes sense.

dnautics · on July 5, 2020

It's a fascinating discussion! I don't use Julia much anymore due to job change, I hope all language package teams get to read the back and forth.

_7bxa · on July 6, 2020

If you download the software, it seems reasonable for it to get basic information.

Users need to consider the developer perspective. Julia is a product with millions of hours sunk into it. It needs to sustain itself, since it's open source.

I doubt the telemetry is being used for profit anyway, but anything we can do to help Julia is good. "Donations" aren't sustainable and can't fund a large software project.

Also it's not hidden, so I fail to see the issue. If the information is in a legal document and the source code, then you know exactly what's going on. There's no shady business.

kanonieer · on July 5, 2020

Telemetry deservedly has a terrible reputation due to its usage in proprietary software. In open source software, it's not a deal breaker for me as I have means to get rid of it.

But given the landscape of privacy issues, I wouldn't vote for an opt-out telemetry in any of the OS projects I'm involved with.

CyberDildonics · on July 5, 2020

I skimmed the link but still have the same question - is there really a justification for having any telemetry turned on by default? I think most people wouldn't want any network traffic unless they instigated it, let alone unique identifiers and package information.

KenoFischer · on July 5, 2020

Note that this is about metadata for package requests, so you're downloading something from a server already. The question is what information is in that request.

shcheklein · on July 5, 2020

It helps developing and prioritizing features faster. What is so harmful about it? Assuming it's anonymized properly, if no one resells it, if it's explicit (doesn't matter opt-in or opt-out).

systemvoltage · on July 5, 2020

Why telemetry at all? I don't expect a programming language to have telemetry as a default feature.

I want to hammer this rule into everyone regardless of the domain you're working in when it comes to privacy:

- Explicitly ask the user. Respect their privacy. Explain why you would like to collect data, maybe show past examples of what you've done with the data and don't deploy dark patterns or default behavior.

It is not that hard. No backlash. No problem at all if you ask the user. Sure, that would lead to less than optimal telemetry for the collecting party but there should not be any way around this. Want more data? Incentivize users, maybe give them a free subscription for helping out with the beta testing. Give them a discount. Treat data just like a commodity that costs money to obtain responsibly. Right now, everyone is a data-cartel trying to hoard as much as possible.

Why is this so hard to understand? This is opposite of "level-headed". I usually allow PyCharm to collect telemetry, I allow Apple to use Siri requests for improving it. It is because they do this as respectfully as possible without deceiving the user.

mbauman · on July 5, 2020

I encourage you to read https://julialang.org/legal/data

This is very minimal data that gets sent along with requests that you’re already making to a (user-selectable) package server.

wlesieutre · on July 5, 2020

> I allow Apple to use Siri requests for improving it. It is because they do this as respectfully as possible without deceiving the user.

Let's not forget that despite Apple's otherwise good privacy record, Siri was saving your recordings to be listened to and reviewed by 3rd party contractors without giving you any opportunity to opt out. It was only late last year after their competitors were called out for the same issues that Apple provided an opt-out option.

And given how proactive they were with privacy warnings about donating voicemail transcriptions to improve voicemail accuracy, it was pretty reasonable to think "Surely if they were saving the Siri input recordings and letting people listen to them, they would have warned me about that and asked me if it's OK."

https://www.cnbc.com/2019/10/28/ios-13point2-has-new-siri-pr...

It's possible they still had better privacy protections in place for handling the recordings once they have them on their servers (compared to Amazon and others), but even the contents of voice recordings can be enough to de-anonymize them depending on what you've said.

umvi · on July 5, 2020

Have any zealous "opt in" folks ever been in a position where they need to somehow obtain statistical information about their user base (to raise funding, to make business decisions, to know which features are used most, etc)? Opt-in is like hard mode and practically worthless: <1% of your user base will opt in.

ninjin · on July 6, 2020

Yes, as an academic and a co-creator of the de-facto annotation tool in my field [1] I certainly have (although having Google Analytics for the website is something I regret…). Now, we have the “luxury” of citation counts as a proxy for academic usage, but I know next to nothing about how our tool is/has been used in industry apart from what pops up on the mailing list. To be fair though, I suspect we could have had a bigger impact and maybe kept the project more “alive” with more effort on our part, and if we had decided to raise funds, additional user metrics could have helped. But I am not shedding tears over this.

[1]: http://brat.nlplab.org/

Now, I absolutely sympathise with any developers in this situation. But I think the underlying issue is that we lack a good way to give consent and are stuck with awful solutions like opt-out and ridiculous pop-ups. Is there not good work on this out there or are we forever going to have to endure sub-optimal solutions?

edarchimbaud · on July 7, 2020

Hi, I'm Kili's CTO. We have a free version of our annotation tool: https://kili-technology.com/. It's the most versatile tool on the market (text, image, video, voice), with native python integration, the ability to use ML to speed up annotation, and great support. Let me know if we can help. Edouard

ninjin · on July 8, 2020

Wow… Just wow… The Internet these days is just a cesspool at times due to this kind of behavior and I wish I could downvote this into oblivion. Your attempt at marketing disgusts me and know that I now have an awful impression of Kili as a company. Your need to drive customers to your company does not justify this kind of behavior, no matter the quality of your product.

memexy · on July 8, 2020

What are you talking about?

ninjin · on July 8, 2020

The comment by edarchimbaud that I replied to? Where he blatantly inserted himself into a discussion about telemetry, package development, funding based on concrete metrics, etc. Just to namedrop his company and tool?

memexy · on July 8, 2020

I didn't see anything blatant. What was blatant about it?

ninjin · on July 9, 2020

Imagine the following exchange:

A: “Recently I have been thinking about personal responsibility.”

B: “Why so?”

A: “I believe there is a strong correlation between a sense of personal responsibility and success later in life.”

C: “That is interesting, I think I read a study about this once. Here is a link!”

B: “Myself, I learned about personal responsibility – in particular financial – when I as a child ran my own little business. What I did was to deliver apples and later fruit for a small fee to the neighbourhood on my bike when I was about twelve. It did not make me rich of course, just enough to buy a video game in the end. But I do think it gave me solid experience in life. Later on in high school I started designing local webpages.”

D: “Hi! I am D, I am head of research at Foobar Corp and we have a new apple breed: http://foo.bar/baz It is the best apple on the market: crisp, juicy, and perfect for pies! Let me know if we can help!”

A, B, and C: “Eh?!”

Now, you are perfectly within your rights to disagree. But I think D is being a dick here and inserting themselves blatantly solely to attract attention to their product and adding nothing to the discussion or the community as a whole – possibly because D has signed up for some god-awful “business intelligence” tool that just scans various websites for mentions of “apples” so that they can insert a generic, re-usable message.

Regardless, I will not monitor this conversation further as I feel we are at this point deviating far far from the topic of this “dead” thread. If you still feel the need to discuss this matter, feel free to dig up my e-mail on my personal website. Trust me, I am fairly easy to locate with my username and a keyword or two from my profile – or just look at the about page using the link to the tool I mentioned earlier in this conversation.

rightbyte · on July 5, 2020

If you ask nicely and don't pretick the yes or no box I have a hard time believing <1% would opt-in.

This is just excuses. Usage statistics could be tied to downloads or public source code analysis. No need for tracking.

orf · on July 6, 2020

You can have a hard time believing it, but it’s still true. And the point still stands if it’s 5%.

Most people just don’t care, and will go with the defaults.

rightbyte · on July 6, 2020

My point is that if there is no default, no one will go with the default. Just force an active choice, e.g. the first time the package manager is invoked from the CLI, and assume opt-out for scripts running it so as not to break them.
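
A minimal sketch of that flow in Python, purely for illustration: the function name, the stored preference, and the prompt wording are all hypothetical and are not taken from Pkg.jl. It forces an active choice when a human is at the terminal and falls back to opt-out for scripts.

```python
from typing import Callable, Optional


def resolve_telemetry_consent(
    stored: Optional[bool],
    interactive: bool,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if the user has actively opted in.

    stored      -- a previously saved answer, or None if never asked
                   (hypothetical preference store)
    interactive -- whether a human is at the terminal
    ask         -- prompt function, injectable so the logic can be tested
    """
    if stored is not None:
        return stored  # respect the earlier explicit choice
    if not interactive:
        return False  # scripts/CI: assume opt-out and never block
    while True:  # no pre-ticked default: repeat until the answer is clear
        answer = ask("Send anonymous usage statistics? [yes/no]: ").strip().lower()
        if answer in ("yes", "y"):
            return True
        if answer in ("no", "n"):
            return False
```

In a real CLI the `interactive` flag could come from something like `sys.stdin.isatty()`, and the answer would be saved so the question is asked only once.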

swebs · on July 6, 2020

Yes. And that 1% is still plenty if you only need to know aggregate usage patterns necessary for the examples you gave.

Now if your goal was to collect personalized data in order to get targeted ad revenue...
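
As a rough check on the 1% figure, here is a back-of-the-envelope sketch in Python. The user base of 100,000 is a made-up number, and the formula is the standard normal-approximation margin of error for a proportion, which assumes a random sample. Opt-in respondents are self-selected, so this only bounds sampling noise, not selection bias.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


total_users = 100_000          # hypothetical user base
opted_in = total_users // 100  # 1% opt-in rate
moe = margin_of_error(opted_in)
print(f"n={opted_in}: +/-{moe:.1%}")
```

With 1,000 respondents the worst-case margin is roughly three percentage points, which is usually enough for coarse questions such as "which platforms are in common use".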

ssivark · on July 5, 2020

> Why telemetry at all? I don't expect a programming language to have telemetry as an default feature.

That expectation is incorrect if you’ve ever used a package server or pulled packages from some website, including GitHub (for ANY language). HTTP requests do communicate your IP address, and it is standard practice to store them and use them for analytics.

systemvoltage · on July 5, 2020

No problem if they do it on the server side. Don't pollute the user space with telemetry without asking.

If I download Julia binaries from their website, they can collect IP information if the local laws allow it. Once the software is in my possession, it is reprehensible for it to do anything without asking me first.

improbable22 · on July 5, 2020

In case this isn't clear, the telemetry being discussed is only doing anything when you ask the package manager to connect to a server, to download things.

systemvoltage · on July 5, 2020

It is clear. There is a UUID generated on the user-side to identify them.

It’s one thing to collect statistics of downloads on the server side and another thing to profile me. It’s pretty clear to me.

umvi · on July 5, 2020

How does a single random number "profile you" ...?

gnud · on July 5, 2020

With that identifier, an individual can be tracked across different networks. This might well make you identifiable.

(Not that I think Julia does this)
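
To illustrate the concern, here is a hypothetical sketch in Python. The log records, IDs, and addresses are invented (the IPs are documentation ranges), and it does not describe what the Julia package servers actually do. It only shows that a persistent client ID lets a server group requests that arrive from different networks.

```python
from collections import defaultdict

# Hypothetical server-side log records: (client_id, source_ip).
records = [
    ("3f2a", "192.0.2.10"),    # e.g. a home network
    ("3f2a", "198.51.100.7"),  # e.g. an office network
    ("9c41", "203.0.113.5"),
    ("3f2a", "203.0.113.99"),  # e.g. conference Wi-Fi
]


def networks_per_client(log):
    """Group the distinct source IPs seen for each persistent client ID."""
    seen = defaultdict(set)
    for client_id, ip in log:
        seen[client_id].add(ip)
    return dict(seen)


linked = networks_per_client(records)
for client_id, ips in sorted(linked.items()):
    print(client_id, "seen from", len(ips), "address(es)")
```

Without the ID, the three requests from `3f2a` would look like three unrelated visitors. With it, they collapse into one movement history.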

bencollier49 · on July 5, 2020

Wow, if this is done without prompting the user, then it's illegal in the EU and UK. IP addresses are considered PII.

KenoFischer · on July 5, 2020

As mentioned in the thread, the people who implemented these features obtained appropriate legal advice from lawyers specializing in this area and implemented their recommendations.

staticfloat · on July 5, 2020

The GDPR explicitly allows for the processing of personal information without consent in the event that such processing is required for ensuring network security and availability, see [1], [2] and [3] for more reading on this. Note that I am not a lawyer, and you should consult a lawyer (as we did) to ensure that all policies fall within GDPR laws.

That is precisely what the logged IP addresses are used for (an example: nginx access logs), and is one of the reasons why we would much rather use a random number generated by the client machine than an IP address; because the bits themselves have no meaning, unlike IP addresses.

As mentioned in the linked thread, NumFocus has worked with a legal team that specializes in this type of law, and this plan is fully in compliance with the GDPR.

[1] https://gdpr-info.eu/recitals/no-49/ (The actual GDPR text regarding security concerns)

[2] https://blogs.akamai.com/2018/08/dispelling-the-myths-surrou... (Akamai legal team confirming that this interpretation of logging IP addresses for security purposes is valid)

[3] https://law.stackexchange.com/a/28609 (Stack exchange post pointing out that even more exceptions exist beyond just security)

bencollier49 · on July 6, 2020

From the first paragraph of TFA:

'The goal is to answer the question “How many Julia users are there?”'

This is a commercial concern, nothing to do with security, and to my understanding at least, is not a valid reason for collecting PII. There doesn't appear to be a security argument for collecting this data without consent.

philzook · on July 5, 2020

I think the discussion is a bit more nuanced than that. They do not appear to be recording IPs. They directly reference carefully complying with the GDPR.

chrispeel · on July 5, 2020

Yes, IP addresses will be logged https://discourse.julialang.org/t/pkg-jl-telemetry-should-be...

seemslegit · on July 5, 2020

This is an inherently bad-faith practice that should be punishable for open-source and commercial vendors alike.