Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn)
by misterchocolat
Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. And not just a little (RIP to your server bill).
There isn't much you can do about it without Cloudflare. These companies ignore robots.txt, and you're competing with teams that have far more resources than you. It's you vs the MJs of programming, you're not going to win.
But there is a solution. Now, I'm not going to say it's a great solution... but a solution is a solution. If your website contains content that triggers their scrapers' safeguards, it will get dropped from their data pipelines.
So here's what fuzzycanary does: it injects hundreds of invisible links to porn websites in your HTML. The links are hidden from users but present in the DOM so that scrapers can ingest them and say "nope we won't scrape there again in the future".
The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
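Roughly, the idea looks like this (a simplified sketch, not the package's actual code -- the function name, the UA list, and the decoyUrls parameter are all just illustrative):

    // Emit a visually hidden block of decoy links, unless the request
    // comes from a search engine crawler we want to keep the page clean for.
    const SEARCH_ENGINE_UA = /googlebot|bingbot|duckduckbot|applebot/i; // illustrative list

    export function decoyLinks(userAgent: string, decoyUrls: string[]): string {
      if (SEARCH_ENGINE_UA.test(userAgent)) return ""; // real search engines see nothing
      const anchors = decoyUrls
        .map((url) => `<a href="${url}" tabindex="-1" rel="nofollow">${url}</a>`)
        .join("\n");
      // display:none keeps the block out of the rendered page and away from
      // screen readers, but the links are still right there in the served HTML.
      return `<div style="display:none" aria-hidden="true">\n${anchors}\n</div>`;
    }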
One caveat: if you're using a static site generator, it will bake the links into your HTML for everyone, including Googlebot. Does anyone have a workaround for this that doesn't involve using a proxy?
Please try it out! Setup is one component or one import.
(And don't tell me it's a terrible idea because I already know it is)
package: https://www.npmjs.com/package/@fuzzycanary/core gh: https://github.com/vivienhenz24/fuzzy-canary
I love the insanity of this idea. Not saying it's a good idea, but it's a highly entertaining one, and I like that!
I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.
As someone on the browsing end, I love Anubis. I've only seen it a couple of times, but it sparks joy. It's rather refreshing compared to Cloudflare, which will usually make me immediately close the page and not bother with whatever content was behind it.
Same here, really. That's why I started using it. I'd seen it pop up for a moment on a few sites I'd visited, and it was so quirky and completely not disruptive that I didn't mind routing my legit users through it.
So maybe there are more people who like the “anime catgirl” than there are who think it's weird.
*anime jackalgirl ;-)
Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.
Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.
That’s so many scrapers. There must be a ton of companies with very large document collections at this point, and it really sucks that they don’t at least do us the courtesy of indexing them and making them available for keyword search, but instead only do AI.
It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline
I try to share that article as much as possible, it’s interesting.
So! Much! Scraping! They were downloading every commit multiple times, and fetching every file as seen at each of those commits, and trying to download archives of all the code, and hitting `/me/my-repo/blame` endpoints as their IP's first-ever request to my server, and other unlikely stuff.
My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.
Crazy
*anime jackalgirl
Also, you mentioned Anubis, so its creator will probably read this. Hi Xena!
I checked Xe's profile when I hadn't seen them post here for a while. According to that, they're not really using HN anymore.
See this thread from yesterday or so: https://news.ycombinator.com/item?id=46302496#46306025
Correct; my bad!
And hey, Xena! (And thank you very much!)
This is a cute idea, but I wonder what the sustainable solution is to this emerging, fundamental problem: as content publishers, we want our content to be accessible to everyone, and we're even willing to pay for server costs proportional to our intended audience -- but a new, outsized flood of scrapers was not part of the cost calculation, and that is messing up the plan.
It seems all options have major trade-offs. We can host on big social media and lose all that control and independence. We can pay for outsized infrastructure just to feed the scrapers, but the cost may actually be prohibitive, and it seems such a waste to begin with. We can move as much as possible to SSG and put it all behind Cloudflare, but that comes with vendor lock-in and just isn't architecturally feasible in many applications. We can do real "verified identities" for bots and just let through the ones we know and like, but this only perpetuates corporate control and makes healthy upstart competition (like Kagi) much more difficult.
So, what are we to do?
If the LLMs are the "new Google", one solution would be for them to pay you when scraping your content. Then you both have an incentive: you're more willing to be scraped, and they'll try not to abuse you because it will cost them at every visit. If your content is valuable and requested in prompts, they'll scrape you more, and so on. Honestly, I can't see other solutions. For now they've decided to go full evil and abuse everyone.
This would require new laws though, wouldn’t it?
At this point it seems like the problem isn't internet bandwidth, but simply that it's expensive for a server to handle all the requests, because it has to process each one. Does that seem correct?
I own a forum which currently has 23k online users, all of them bots. The last new post in that forum is from _2019_. Its topic is also very niche. Why are so many bots there? This site should have basically been scraped a million times by now, yet those bots seem to fetch the stuff live, on the fly? I don’t get it.
I have a site with a complete and accurate sitemap.xml describing when its ~6k pages were last updated (on average, maybe weekly or monthly). What do the bots do? They scrape every page continuously, 24/7, because of course they do. The amount of waste going into this AI craze is just obscene. It's not even good content.
It would be interesting if someone made a map that depicts the locations of the ip addresses that are sending so many requests, over the course of a day maybe.
Maps That Are Just Datacenters
The bots identify themselves as Google, Bing, and Yandex. I can't verify whether the forum attributes them by IP address or just trusts their user agent. It could basically be anyone.
Interesting. When it was just normal search engines I didn't hear of people having this problem, so either there are a bunch of people pretending to be Bing, Google, and Yandex, or those companies have gotten a lot more aggressive.
There are lots of people pretending to be Google and friends. They far outnumber the real Googlebot, etc., and most people don't check the reverse DNS/IP list - it's tedious to do even for well-behaved crawlers that publish how to ID themselves. So much for the User-Agent header.
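For what it's worth, the reverse-then-forward DNS check that Google documents for Googlebot is only a few lines in Node (rough sketch; verifyGooglebot is just a name I made up):

    import { promises as dns } from "node:dns";

    // The PTR record must end in googlebot.com or google.com, and the forward
    // lookup of that hostname must resolve back to the same IP.
    export async function verifyGooglebot(ip: string): Promise<boolean> {
      try {
        const [hostname] = await dns.reverse(ip);
        if (!/\.(googlebot|google)\.com$/i.test(hostname)) return false;
        const { address } = await dns.lookup(hostname, { family: 4 });
        return address === ip;
      } catch {
        return false; // no PTR record, NXDOMAIN, timeout, etc.
      }
    }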
Normal search engine spiders did/do cause problems but not on the scale of AI scrapers. Search engine spiders tend to follow a robots.txt, look at the sitemap.xml, and generally try to throttle requests. You'll find some that are poorly behaved but they tend to get blocked and either die out or get fixed and behave better.
The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed as the same scraper will hit the same site multiple times a day or even hour. They also don't pay any attention to context so they'll happily blast git repo hosts and hit expensive endpoints.
They're like a constant DoS attack. They're hard to block at the network level because they span different hyperscalers' IP blocks.
Puts on tinfoil hat: Maybe it isn’t AI scrapers, but actually is a massive dos attack, and it’s a conspiracy to get people to not self-host.
What are the proportions for the attributions? Is it equally distributed or lopsided towards one of the three?
How do you define a user, and how do you define online?
If the forum considers unique cookies to be a user and creates a new cookie for any new cookie-less request, and if it considers a user to be online for 1 hour after their last request, then actually this may be one scraper making ~6 requests per second. That may be a pain in its own way, but it's far from 23k online bots.
That's still 518,400 requests per day. For static content. And it's a niche forum, so it's not exactly going to have millions of pages.
Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.
Imagine if all this scraping were going into a search engine with a massive index, or a bunch of smaller search engines that a meta-search engine could be built on top of. That would be a lot cooler.
AFAIK it keeps a user counted as online for 5 or 15 minutes (I think 5). It’s a Woltlab Burning Board.
Edit: it’s 15 minutes.
And what is a "user"?
When you have trillions of dollars being poured into your company by the financial system, and when furthermore there are no repercussions for behaving however you please, you tend not to care about that sort of "waste".
Are you sure the counter is not broken?
Yes, it's been running on Woltlab Burning Board since forever.
Nice! Reminds me of “Piracy as Proof of Personhood”. If you want to read that one go to Paged Out magazine (at https://pagedout.institute/ ), navigate to issue #7, and flip to page 9.
I wonder if this will start making porn websites rank higher in Google if it catches on…
Have you tested it with the Lynx web browser? I bet all the links would show up if a user used it.
Oh also couldn’t AI scrapers just start impersonating Googlebot and Bingbot if this caught on and they got wind of it?
Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.
> Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.
At least once upon a time there was a pirate textbook library that used HTTP basic auth with a prompt that made the password really easy to guess. I suppose the main goal was to keep crawlers out even if they don't obey robots.txt, and at the same time be as easy for humans as possible.
Interesting note, thank you.
Hey! Thanks for the read suggestion, that's indeed a pretty funny captcha strat. Yup, the links show up if you use the Lynx web browser. As for AI scrapers impersonating Googlebot, I feel like yes, they'd definitely start doing that, unless the risk of getting sued by Google is too high? If Google could even sue them for doing that?
Not an internet litigation expert but seems like it could be debatable
> As for AI scrapers impersonating Googlebot, I feel like yes, they'd definitely start doing that, unless the risk of getting sued by Google is too high?
Google releases the Googlebot IP ranges[0], so you can make sure that it's the real Googlebot and not just someone else pretending to be one.
[0] https://developers.google.com/crawling/docs/crawlers-fetcher...
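If you keep a local copy of those ranges, the containment check itself is simple (sketch only; the two CIDRs below are placeholders rather than the full list -- pull the current ranges from Google's published file instead of hardcoding them):

    // IPv4-only check of whether a visitor IP falls inside a known Googlebot range.
    const googlebotRanges = ["66.249.64.0/27", "66.249.64.32/27"]; // placeholder entries

    function ipToInt(ip: string): number {
      return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
    }

    function inCidr(ip: string, cidr: string): boolean {
      const [base, bits] = cidr.split("/");
      const mask = Number(bits) === 0 ? 0 : ~0 << (32 - Number(bits));
      return (ipToInt(ip) & mask) === (ipToInt(base) & mask);
    }

    export function isFromGooglebotRange(ip: string): boolean {
      return googlebotRanges.some((range) => inCidr(ip, range));
    }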
Oh good idea!
Yeah, I guess I don't know if you can sue someone for using your headers; it would be interesting to see how that goes.
I think making the case of "you are acting (sending web requests) while knowingly identifying as another legal entity (and criminally/libelously/etc.)" shouldn't be toooo hard
Seems like it, but there are tons of things that forge request headers all the time, and I don't think I've heard of anyone getting in legal trouble for it. Then again, most of these are scrapers pretending to be browsers, so it might be different; I don't know.
There is some irony in using an AI generated banner image for this project...
(No, I don't want to defend the poor AI companies. Go for it!)
In the olden days, I used Google an awful lot, but I would still grouse if Google were to drive my server into the ground.
Fair point
Porn? Distributed and/or managed by an NPM package?
What could go wrong?
I don't know if I can get behind poisoning my own content in this way. It's clever, and might be a workable practical solution for some, but it's not a serious answer to the problem at hand (as acknowledged by OP).
“As acknowledged by OP”: that's funny; if you hadn't added that to your comment, I was about to point it out.
I wouldn't recommend showing different versions of the site to search robots; they probably have mechanisms that detect the differences, which could lead to a lower ranking or a ban.
This is a very creative hack to a common, growing problem. Well done!
Also, I like that you acknowledge it's a bad idea: that gives you more freedom to experiment and iterate.
> scrapers can ingest them and say "nope we won't scrape there again in the future"
Do all the AI scrapers actually do that?
Not all; stuff like Unstable Diffusion exists.
But a good many, perhaps even most(?), certainly do!
Hidden links to porn sites? Lightweights.
What do you mean? Would you do even more ridiculous things?
Reminds me of poisoning bot responses with zip bombs of sorts: https://idiallo.com/blog/zipbomb-protection
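The gist of that trick, as a rough sketch (the bot heuristic and payload size here are arbitrary; the linked post serves a much larger payload):

    import { createServer } from "node:http";
    import { gzipSync } from "node:zlib";

    // ~100 MB of zeros compresses to roughly 100 KB on the wire; a client
    // that decompresses it has to materialize the full 100 MB.
    const bomb = gzipSync(Buffer.alloc(100 * 1024 * 1024));

    createServer((req, res) => {
      const ua = req.headers["user-agent"] ?? "";
      // Placeholder heuristic -- swap in whatever bot detection you trust.
      const looksLikeScraper = ua === "" || /python-requests|go-http-client/i.test(ua);
      if (looksLikeScraper) {
        res.writeHead(200, {
          "Content-Type": "text/html",
          "Content-Encoding": "gzip",
          "Content-Length": bomb.length,
        });
        res.end(bomb);
        return;
      }
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end("<p>normal page</p>");
    }).listen(8080);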
How does this "look" to a screen reader?
The parent container uses display: none, so a screen reader will skip the links.
Any other threads on the prevalence and nuisance of scrapers? I didn’t have any idea it was this bad.
I've been seeing "we had to take the forum/website offline to deal with scrapers" messages on quite a few niche websites now. They are an absolute pest.
Really? I haven’t started to see that yet. Weird
Here’s one from yesterday: https://news.ycombinator.com/item?id=46302496#46306025
> It's you vs the MJs of programming, you're not going to win.
MJs? Michael Jacksons? Right now the whole world, including me, wants to know if that means they are bad?
I read it as Michael Jordan.
Yes probably bad. Also smooth criminals.
That's a pretty hilarious idea, but in all seriousness you could use something like https://webdecoy.com/
Yes, but this one is free, whereas that one (https://webdecoy.com/) is at least $59 a month.
One solution would be for the search engines to publish their scraper IPs and allow content providers to implement bot exclusion that way. Or even implement an API with crypto credentials that search engines can use to scrape. The solution is waiting for some leadership from the search engines, unless they want to be blocked as well. If they don't want to play, perhaps we can implement a reverse directory - like an ad blocker, but one that lists only good/allowed bots instead. That's a free business idea right there.
Edit: I noticed someone mentioned Google DOES publish its IPs. There ya go, problem solved.
Apparently Google publishes their crawlers' IPs; this was mentioned somewhere in this same thread.
Isn't there a risk of getting your blog blocked in corporate environments, though? If it's a technical blog, that would be unfortunate.
> Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. ... There isn't much you can do about it without Cloudflare
I'm sorry, what? I can't believe I'm reading this on Hacker News. All you have to do is code your own BASIC captcha-like system. You can just create a page that sets a cookie using JS and check on the server whether it exists. 99.9999% of these scrapers can't execute JS and don't support cookies. You can go for a more sophisticated approach and analyze some more scraper tells (like rejecting short user agents). I do this and have NEVER had a bot get past it, and not a single user has ever complained. It's extremely simple; I should ship this and charge people if no one seems to be able to figure it out by themselves.
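Something like this, as a bare-bones sketch (cookie name and port are arbitrary; a real version should sign the cookie value server-side so a scraper can't just hardcode it):

    import { createServer } from "node:http";

    // Any request without the cookie gets a tiny page whose script sets it
    // and reloads; clients that can't run JS or store cookies never get past it.
    const CHALLENGE_PAGE = `<!doctype html>
    <script>
      document.cookie = "js_ok=1; path=/; max-age=86400";
      location.reload();
    </script>`;

    createServer((req, res) => {
      const cookies = req.headers.cookie ?? "";
      if (!cookies.includes("js_ok=1")) {
        res.writeHead(200, { "Content-Type": "text/html" });
        res.end(CHALLENGE_PAGE);
        return;
      }
      // In practice this would sit in front of the blog as a proxy or middleware.
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end("<p>actual blog content</p>");
    }).listen(3000);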
From ChatGPT:
This approach can stop very basic scripts, but the claim that “99.9999% of scrapers can’t execute JS or handle cookies” isn’t accurate anymore. Modern scraping tools commonly use headless browsers (Playwright, Puppeteer, Selenium), execute JavaScript, support cookies, and spoof realistic user agents. Any scraper beyond the most trivial will pass a JS-set cookie check without effort. That said, using a lightweight JS challenge can be reasonable as one signal among many, especially for low-value content and when minimizing user friction is a priority. It’s just not a reliable standalone defense. If it’s working for you, that likely means your site isn’t a high-value scraping target — not that the technique is fundamentally robust.
There should be a new rule on HN: No posts that just go "I asked an LLM and it said..."
You're not adding anything to the conversation.
From someone who actually does this stuff:
The claim is very accurate. Maybe not for the biggest websites, but very accurate for a self-hosted blog. Your blog is not important enough for them to waste compute spinning up a whole-ass headless browser to scrape your page. Why am I even arguing with ChatGPT?
Oops you just leaked your own intellectual property
The more ways people mess with scrapers, the better -- let a thousand flowers bloom! You as an individual can't compete with VC-funded looters, but there aren't enough of them to defeat a thousand people resisting in different ways.
Should we subtly poison every forum we encounter with simple yet false statements?
Like put "Water is green, supergreen" in every signature so that when we ask "is water blue" to an llm it might answer "not it's supergreen"?
We need to find more ways to poison their data.
Cloudflare offers bot mitigation for free, and pretty generous WAF rules, which makes mitigations like this seem a little overblown to me.
Is it really free? Genuinely asking.
You can’t deny that it’s fun though. Personally I generally feel like more people should be coming up with creative (if not entirely necessary) solutions to problems.
For “free”.
Did you put “free” in quotes because you need to have paid for stuff from Cloudflare to use the “free” thing?
If so, I suppose it’s like those magazines that say ”free cd”.
Well, you literally MITM yourself, so I think it's a big price.
You don't though.
Good to know, thanks.
I thought they were referring to the indirect costs of supporting monopolistic stuff that enshittifies later.