Proposal for making a public dataset

I have a very specific thing I want to discuss with other admins, but it probably isn’t of general interest.

I have written a proposal for making a public dataset, which could be queried to get a gauge for the “state of the community”. I’d like feedback from admins on how to do this in a way that respects our communities: Toot Dataset - Google Docs

Specifically, is the pseudonymization I propose going far enough?
What are some of your concerns about having a queryable dataset of toots?

3 Likes

I split this out from the other thread, as it wasn’t along the lines of the actual question there, or at least not what I wanted to convey. Think I need to edit my post. :wink:

1 Like

I can’t remember whether account IDs can be traced back to the user publicly, or whether that would require admin access on the machine in question – but either way it may eventually be identity-compromising, and if you’re not looking to compare datasets specific to individuals, it wouldn’t provide useful information anyway.

It might be a good idea to systematically replace account_ids with generated UUIDs instead; the data would still show reply-tos and such, and would allow comparing individual-level datasets, while sanitizing the data a little further.

(this is in a comment on the Gdoc as well – but I can remember missing a few hundred of those from my editor sometimes, so I figured it was worth duplicating here, esp. since admins might have more light to shed on the privacy implications of the account ids)

1 Like

Thanks for the feedback. I’ve updated the proposal to remove IDs.

Using a UUID is a good idea, or possibly a cryptographic hash with a secret salt. A hash would avoid the need to store the mapping between real IDs and UUIDs somewhere.
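A keyed hash does this nicely: the mapping is deterministic (so reply-to relationships survive) but can’t be reversed without the salt. A minimal sketch in Python – the salt value and function name here are illustrative, not from the proposal:

```python
import hashlib
import hmac

# Hypothetical secret salt. In practice this would be generated once,
# kept private, and never published alongside the dataset.
SECRET_SALT = b"replace-with-a-long-random-secret"

def pseudonymize(account_id: str) -> str:
    """Map a real account ID to a stable pseudonym.

    HMAC-SHA256 with a secret key is deterministic, so the same account
    always gets the same pseudonym, but without the salt the mapping
    cannot be recovered from the public dataset.
    """
    digest = hmac.new(SECRET_SALT, account_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:32]  # truncated for readability
```

The upside over UUIDs is that there is no lookup table to store (or leak); rotating the salt between dataset releases would even prevent linking pseudonyms across releases, if that were desired.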

1 Like

Any admins reading this thread: Please let me know if you are interested in having your instance’s public timeline included and/or you would like early access to get some analytics about how people are using the Mastodon network.

1 Like

I have only the barest dev experience, so I figured you’d probably have better ideas about how to implement it. :slight_smile: It has occurred to me that doing so may defeat the purpose if the public dataset is still reasonably easy to match up to the actual posted toots, though. I have no idea whether you were hoping to do ‘individual’ comparisons at all, but I think the public nature of the actual posts might make it prudent to lag the publicly available dataset behind real time.

I suppose the best analogy I have is the satellite imagery in Gmaps – close to real-time may have some dreadful unintended consequences, so they don’t use that, and let public tools have access to delayed-but-still-useful satellite imagery, while still having some ‘closed’ areas that don’t have real updates at all.

I’d think single-user instances (or low-user-count instances – double digits or fewer might be the necessary cutoff) could be those ‘closed’ areas: the toot data maps one-to-one to the streams involved, so there’d be little to deter someone from using it to match toots to those users. So maybe you’d want to set up a ‘singles’ or a ‘low’ element in the data as if it were an instance, pack all those toots into it, and just show it as if it were another ‘instance’?
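That pooling idea could be sketched as a relabeling pass over the collected rows – the threshold, field names, and pseudo-instance label below are all hypothetical, just to make the shape concrete:

```python
# Hypothetical cutoff: instances with fewer users than this get pooled.
LOW_USER_THRESHOLD = 100

def pool_small_instances(toots, user_counts):
    """Relabel toots from low-user-count instances under one synthetic
    'low' pseudo-instance, so no small instance can be singled out.

    toots       -- list of dicts, each with an "instance" key
    user_counts -- dict mapping instance name to its user count
    """
    pooled = []
    for toot in toots:
        instance = toot["instance"]
        if user_counts.get(instance, 0) < LOW_USER_THRESHOLD:
            toot = {**toot, "instance": "low"}  # copy, don't mutate input
        pooled.append(toot)
    return pooled
```

One trade-off to note: the pooled bucket still reveals that *some* small instance posted each toot, so the per-toot content itself would still need the usual scrubbing.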

I’m spitballing, though. :slight_smile:

1 Like

I hadn’t considered a delay to be desirable, but I think you’re right. About one day seems like a reasonable lag: enough to get recent trends when desired, but not too immediate.
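A one-day lag could be as simple as filtering on toot age at publication time – a sketch, with the delay length and field names assumed rather than taken from the proposal:

```python
from datetime import datetime, timedelta, timezone

PUBLISH_DELAY = timedelta(days=1)  # assumed one-day lag

def publishable(toots, now=None):
    """Keep only toots older than PUBLISH_DELAY, so the public
    dataset lags real time by about a day.

    toots -- list of dicts with a timezone-aware "created_at" datetime
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - PUBLISH_DELAY
    return [t for t in toots if t["created_at"] <= cutoff]
```

Running the filter at export time (rather than deleting fresh rows) keeps the full archive intact internally while only the delayed slice is ever published.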

Regarding smaller instances: I think crawling the public timelines of the largest instances will get pretty good coverage, especially with how federation works. I believe with my proposed schema, you might be able to tell which toots come from smaller instances (they have is_local == false in every row), but not which instance they came from. That seems like a good balance to me.

1 Like

One day might be workable. Another option would be a stripped-down ‘trend output’ for up-to-date stuff, with a longer delay for the more detailed dataset – but there are a lot of options, and they all have trade-offs. I sometimes overfocus on potential abusability; it comes with my general paranoia/hypervigilance issues, and I know development has to take a wider view into consideration. I’m really just glad it’s an element you’re thinking about; so many developers don’t. :smile:

you might be able to tell which toots come from smaller instances (they have is_local == false in every row), but not which instance they came from

Aha! That would make sense as a trade-off. :slight_smile: I hadn’t put those pieces together yet.

Thank you for listening to my feedback about this idea. I’m glad you’re doing it, TBH – I think Twitter’s ‘trending’ stuff is poorly designed and suffers from root Twitter flaws more than anything else – and Masto could use a better mousetrap for this particular mouse. :smile:

I am highly interested in building a dataset that’s useful, but also protects the users of Mastodon. I don’t run an instance. Rather, I collect information from the fediverse and store it locally to be later available publicly.

I would very much like an RFC document on handling collected data from Mastodon. I don’t know too much about the back-end or the risks involved, so documented expertise here would help me immensely.

2 Likes

Upside: Did you take a look at the Google Doc I posted? I’d love some feedback if anything could be improved. I make some recommendations there, though I could do better at stating the reasons why I’m suggesting changing/removing properties from the reported API data.

Some of the general principles I followed in suggesting the schema:

  • Keep user names & instance names pseudonymized. To be useful for network analysis, users need identifiers, but they should be very different from the originals so that someone can’t easily use the dataset to find an individual user. Since instances can contain very few people, the same applies to them.
  • Remove URLs everywhere (even in content). Links often point to users or instances (to media, for example).
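The URL-removal principle might look like this in practice – a rough sketch only; real toot content is HTML, so a production pipeline would want a proper HTML parser rather than a bare regex:

```python
import re

# Rough URL matcher for illustration. It catches plain http(s) links;
# a real scrubber would also strip <a> tags and media attachments.
URL_RE = re.compile(r"https?://\S+")

def scrub_content(text: str) -> str:
    """Replace every URL in toot content with a neutral placeholder,
    since links often point back to users, instances, or media."""
    return URL_RE.sub("[link]", text)
```

Replacing with a placeholder (rather than deleting outright) keeps sentence structure intact for anyone doing text analysis on the content.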

I haven’t started with data collection yet, but I hope to start soon.

Yes, it’s the first thing I read, and it’s awesome work! I advise all readers of this thread to really go through it; it’s helping me a lot :wink:. I’m not knowledgeable enough to offer critique, though. Some strikethroughs disappoint me, because I really want to use those fields (which is not a valid gripe!).

[Note: emphasis below is for readability, not because I’m shouting]

Mostly what I want is general guidance (from anyone, not necessarily you): the principles one employs to arrive at the conclusions you reached. You seem to have your head screwed on straight in this matter, and I don’t, so I’d like to develop a decision-making strategy like yours for handling the data I collect.

The reason I want a narrated, more general guide is because:

  • General principles can guide us even when information changes
  • I collect info beyond the API (about/more pages, reachability, admin handles, etc.)
  • I need to break some rules (like instance anonymization for my instance picker), and want to know what I’m getting into.
  • Mastodon has a unique ethic regarding privacy, which I wish to honor (but don’t know what it could be!!)

For example, I hadn’t considered low user count to be a factor in identifying people until you brought it up. I’d like to learn things like that before being told.

Not that I’m asking you personally to write an essay. It’s just a general note to the Mastodon community — some of us do want to be careful with toot/instance data, but we feel lost about how to do it…

1 Like

I think it is great to be able to find other users/accounts and toots more easily, but I can also understand the privacy issues. And the latter should always take precedence.

Friendica and Hubzilla have a public user directory. Hubzilla especially is interesting, because its directory is decentralized. Just for some inspiration: GitHub - redmatrix/hubzilla: build community websites that can interact with one another

1 Like