I’ll note too that even absent Heritage Foundation threats, this can be useful to spur development of the project (e.g. for people who don’t want a permanent account but don’t feel comfortable having their IP permanently, publicly attached to edits). Probably the reason it hasn’t been done in the past is that it almost certainly makes it easier for bad actors to fly under the radar. Before, you either had to show your IP address (which can reveal your location and will usually uniquely identify who edited something, at least for a little while; you also can’t use a VPN without special permission) or you had to register a single account (where if you created multiple, a sockpuppet investigation would often find out).
So there’s an inherent trade-off, but I think right-wing threats of stochastic terrorism really tipped the scales.
TL;DR: Wikipedia has been doxing its own editors since inception.
Doesn’t Wiki still have the data? So a bad actor’s behavior pattern can be seen in aggregate behind the scenes?
There are only 846 administrators on the English Wikipedia. This is across 7 million articles, 118,000 active registered users, about two edits per second, about a million files just on Wikipedia (most of them are hosted on Wikipedia’s sister project, Wikimedia Commons), and over 60 million total pages (articles, talk pages, user pages, redirects, help pages, templates, etc.). So although they have this data, it’s not useful if somebody doesn’t notice and investigate it. Administrators are stretched thin with administrative functions, and that’s not even accounting for many of them participating as normal editors too (tangent: besides obvious violations of policies, administrators have no more say over Wikipedia’s content than any other editor).
Contrary to the idea that new editors sometimes get of Wikipedia as a suffocating police state run by the administrators, usually when edits get reverted it’s because regular editors notice this and revert it citing policies or guidelines without any administrator involvement (every editor has this power). If an administrator intervenes, it’s usually because a non-admin noticed and reported (what they perceive as) bad behavior to an admin, two editors are locked in a stalemate, or there’s some routine clerical issue to be resolved.
Sockpuppeting, copyright violations, etc. are often (even usually) found by regular editors who notice something amiss and decide to dig a bit deeper. Even with automated tools that will flag an edit that replaces the article with the n-word 500 times in a row, and even given that some non-admin editors have tools which let them detect some issues, there’s only so much that 850-ish people can find on a website that massive. For example, one time a few years back, I just randomly stumbled across an editor who was changing articles about obscure historic battles between India and Pakistan to have wildly pro-Pakistan slants – where treacherous India was the aggressor, but brilliant, strong, and courageous Pakistan stood their ground and sent pathetic India home crying with shit in their diapers. The bias was oozing from the page (with poor, if any, citations to match), and I can imagine this would fly under the radar for a while on a handful of articles that collectively get maybe 30 pageviews a day.
TL;DR: Too few admins.
Well, you say you can use a VPN, but you’ll often find that you’re not able to edit from a VPN IP if that IP range has been used for vandalism in the past. So you’d potentially have to resort to a coffee shop or library, which would still reveal your location.
Point of clarification: I said that you can’t use a VPN, and that’s because those IPs are blocked. As noted, you need to ask for a special exception, which for most people isn’t navigable and may not even be granted without a good stated reason and/or trust built up through good edits.
Oh whoops, my bad I must have been reading too quickly. Thanks for clarifying!
I was surprised I was blocked from editing even after logging in. They do hard-block some IP ranges.
Make a list of necessary changes then go to your local cafe.
Sounds like a nice plan.
I might have to go look up their implementation. I feel like a good way of addressing your concern would be a secure hash of the IP address combined with a persistent random number.
The same IP would always map to the same output, and you wouldn’t be able to just pre-compute it and bypass everything.
What’s the persisted random number? Sounds like a salt, but usually each user has their own salt right? I assume we are not talking about logged in users here? Or are we?
Since the goal is to create a correlation ID that maintains privacy, you need the result to be consistent. Hashing all four billion IPv4 addresses might take a little while, but it’s fundamentally doable in a reasonable time, so a bare hash could simply be precomputed and reversed.
By mixing in a much larger value that you keep secret, you’re basically padding the input to make the search space large enough that it’s not realistically able to be enumerated.
Normally each user would have their own salt so that if two users have the same password, they hash to different values. In this case, you would want two users with the same IP to map to the same value, and simply for that value to not lead to an actual IP address.
So you just use one salt for all IP addresses, but you keep it secret.
Essentially.
I’m sure there’s other ways to accomplish the goal but that’s the first one that came to mind.
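The idea in this thread can be sketched in a few lines. This is a hypothetical illustration, not Wikipedia’s actual implementation: a keyed hash (HMAC-SHA256) plays the role of the single secret “salt,” mapping every IP to a stable pseudonymous ID that can’t be reversed by enumerating all of IPv4 without the key.

```python
import hmac
import hashlib
import secrets

# Hypothetical secret key (the single server-side "salt" discussed above).
# In practice it would be generated once and stored securely; generated
# here only for the demo.
SECRET_KEY = secrets.token_bytes(32)

def correlation_id(ip: str) -> str:
    """Map an IP address to a stable pseudonymous correlation ID.

    The same IP always yields the same ID, so behavior can still be
    correlated across edits. But because the output depends on a 256-bit
    secret key, an attacker cannot precompute the hashes of all ~4 billion
    IPv4 addresses to reverse the mapping.
    """
    digest = hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated purely for display

# Consistency: the same IP always maps to the same ID.
assert correlation_id("203.0.113.7") == correlation_id("203.0.113.7")
# Distinct IPs get distinct IDs (with overwhelming probability).
assert correlation_id("203.0.113.7") != correlation_id("198.51.100.9")
```

An HMAC is used here rather than plain `sha256(key + ip)` because it’s the standard construction for keyed hashing, but any scheme where the secret is long and unguessable achieves the same goal.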