
A reminder that you can obtain the majority of Reddit posts/comments via BigQuery (via Pushshift). No need to write your own scraper.

https://console.cloud.google.com/bigquery?p=fh-bigquery&d=re...

https://console.cloud.google.com/bigquery?p=fh-bigquery&d=re...

It appears to be roughly up to August 2019 for posts, October 2019 for comments.
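For anyone who wants to try this from code rather than the console, here is a minimal sketch of building such a query. Everything specific in it is an assumption and not taken from the links above: the `fh-bigquery` project with monthly `reddit_comments.YYYY_MM` tables, the column names (`author`, `body`, `score`, `created_utc`, `subreddit`), and the helper names `monthly_table`, `build_comment_query` and `run_query`, which are made up for illustration. Running it needs the third-party `google-cloud-bigquery` package and Google Cloud credentials, so check the table layout in the console first.

```python
def monthly_table(dataset: str, year: int, month: int,
                  project: str = "fh-bigquery") -> str:
    """Return a backtick-quoted table reference for one month.

    Assumption: tables are named <project>.<dataset>.YYYY_MM.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return f"`{project}.{dataset}.{year}_{month:02d}`"


def build_comment_query(year: int, month: int, limit: int = 100) -> str:
    """Build a query for the top-scoring comments of one subreddit.

    The subreddit is passed as a named query parameter (@subreddit)
    so that it is never interpolated into the SQL text.
    Column names are assumptions about the table schema.
    """
    table = monthly_table("reddit_comments", year, month)
    return (
        "SELECT author, body, score, created_utc\n"
        f"FROM {table}\n"
        "WHERE subreddit = @subreddit\n"
        "ORDER BY score DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str, subreddit: str) -> list:
    """Run the query; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # third-party, imported lazily

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("subreddit", "STRING", subreddit)
        ]
    )
    return list(client.query(sql, job_config=job_config).result())


if __name__ == "__main__":
    # Only prints the SQL; call run_query(...) yourself once billing
    # and credentials are set up.
    print(build_comment_query(2019, 10, limit=10))
```

Querying one monthly table at a time keeps the scanned bytes (and therefore the cost) down compared with a wildcard over every month.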



That's interesting!

Did Facebook ask permission to create derivative works (the bot) from Reddit posts, I wonder, or does this fall under web-scraping law?

If I recall, Reddit users still retain rights to their posts unless Reddit the company provides some sort of broad grant?

If they did not, this is an interesting example of a company potentially making a great deal of money (if the bot is sold as something) from content that legally belongs to users, without compensating them. It's one thing if it abides by a site user agreement and users understand that once they post, it's gone, but to see it happen from a Reddit corpus seems odd.

Shorter version: source data has value and users should share in any value derived from their data if they have the rights to it.


Legally, https://towardsdatascience.com/the-most-important-supreme-co... gives a good example of how transformational machine learning classifiers generally fall under fair use. It does raise a good point that generative machine learning, like this, has not been explored legally yet.

This is still research, which will likely provide public good if/when they publish results and methods. They'll probably use a different dataset for any commercial work, given the profanity problem highlighted in the article.


Making or not making money is such a weird way for people to see things. That's part of why I love the Free Software movement so much and abhor the CC-*-NC licences.

Fortunately, Reddit has the exception where they can give out access to anyone they want. But I still think StackOverflow is the gold standard: CC-BY-SA. No restriction on making money. Maybe a platinum standard would be CC-BY.


The point is not about the money - the point is using data contributed by users without the proper license to create something that might yield revenue which will then not be shared or paid forward in any way to the contributors. We have all worked hard to create the data used by companies to sell ads to us and make massive amounts of money. I guess I got a couple gigs of free email? Cool...

I also understand that most apps make us sign our lives away, but if I don't (as in the Reddit case) and I actually have rights to the data I sure as heck don't want that data used ANYWAY to power more of this stuff.

Probably a gross overreaction, but it seems like an externality that we've kinda just accepted as society that I'd like to see change a bit.


In Reddit's case, that's the deal. You get a website to share things on with other people, and the value exchange involves you giving full licence to Reddit and giving relicense rights to Reddit.

Personally, I find that a very fair deal and clearly other people do as well. I think it actually yields positive externalities because we get things that wouldn't exist otherwise because the transaction costs outweigh the value, but the transaction costs are an inherent cost and I don't want to levy them. Fortunately, Reddit gives me the ability to not levy them and to guarantee that I won't levy them.

In fact, this is part of the magic of Free Software: true freedom to use. Yes, Google can use so much work which was done and it doesn't have to pay any of it back to Torvalds or Greg Kroah-Hartman or even me for the minor changes I made to libraries. This is freedom. I prefer it. And fortunately the world is aligned in this direction.


That makes sense and is well argued.

I want to agree with you 100%, but something is nagging at me a bit. Just as free software can end up in a paid product, with the company then winning or settling in court because it has more resources to use the judicial system, when we apply this directly as a societal value it starts to break down in practice.

The freedom you are talking about ends up justifying (in practice) a situation that only provides real freedom for a small few that happened to take advantage early and use other asymmetries in society to consolidate control. Sure, if we fix those, we're all set! (maybe?)

But until then perhaps we can agree that as a society we expect (and might ask for, by law) a little something extra from companies that have benefitted to help ensure others after them have a chance to use this freedom as well.

My argument is not as well thought out at this point, I grant you. Thanks for providing me with a lot to think about.


> clearly other people do as well

I don't think most people understand how the content they post to Reddit is licensed.


There's one for HN too, or at least used to be:

https://bigquery.cloud.google.com/dataset/bigquery-public-da...


Gentle reminder to those who may not know - you can remove your Reddit comments but you're not able to remove your HN comments.

Food for thought!


Yes and no. You're right that there's no button you can push to delete an entire account history, but wrong that there's no way to remove HN comments. We take care of deletion requests for people every day. We don't want anyone to get in trouble from anything they posted to HN, there's nearly always something we can do, and we don't send people away empty-handed. I can only think of one or two cases where we weren't able to make a user happy, and neither of those cases had to do with identifying information being left up on the site.

The reason we don't delete entire account histories wholesale is that it would gut the threads that the account had participated in, which is not fair to the users who replied, nor to readers who are trying to follow discussion. There are unfortunately a lot of ways to abuse deletion as well. Our goal is to find a good balance between the competing concerns, which definitely includes users' needs to be protected from their past posts on the site. I don't want anyone to have the impression that we don't care about that; we spend many hours on it.


I get the reasoning, but I don't see this applied on some platforms. Reddit and Discord allow you to both delete and edit older comments, and there are no limits on how far back you can go (so you can, if you want, edit or delete your entire history).

Under the GDPR a subject is allowed full erasure rights. If I say I want you to delete my content from x date to y date, or a particular post, or everything entirely then that shouldn't be an issue. A request may be bothersome, but that's what happens when you don't offer that functionality natively.

I noticed a few days back you didn't like it when a user made a new account, except with the internet these days and how everything is archived for all time, throwaways are the only option. Building a comment history is extremely dangerous, especially when you might forget what details you may have posted or how metadata can leak through (such as what subs you post in, any details you posted that could identify you, etc.).

You can't have it both ways: no to multiple accounts and also no to control over your data. I might have 50 accounts, dislike it? Give me proper control over my comments. (to be honest, it may just be worth making a new account for every comment for maximum privacy, it's extreme, but it's a viable option).

If I want to delete them, that's my choice to freely make. Your thoughts or concerns are not relevant to me; thankfully, the GDPR agrees.


> I noticed a few days back you didn't like it when a user made a new account

I think you must have misunderstood whatever the moderation comment was. There's no prohibition on throwaway or multiple accounts, just on using them to violate the site guidelines, which is a different thing.


That's the correct URL. The `full` table appears to be up-to-date as of today.

That reminds me that I need to train a new Hacker News AI at some point. :)



