Currently AI bots are crawling websites all around the world. For a
website hosting git content this adds a lot of extra load and traffic:
The site has lots of sections, repositories have a lot of files,
branches, tags, commit ids, etc...
Multiply that and you have a nearly unlimited number of unique URLs. The
bots try to get each and every one of these.
To speed up the learning process on their side a swarm of hundreds,
thousands or more IP addresses is active at the same time, ultimately
DDoS'ing the website and making it inaccessible. 😳🤬
Well, there is one single file all of these AI bots are not interested
in: robots.txt 🤬🤬
On top of that, some use random user agent strings, making filtering impossible.
🤬🤬🤬
For a short-term solution I deploy the repository content as static
files, hopefully making it accessible at least. We will see.
This reverts commit 8231c3e833.
Turned out this workaround was not sufficient, see the follow-up in
commit 191cc1b952 for details.
But possibly the second one does it on its own? Reverting this for
a test run.
Turns out the workaround in $WaitForFile (commit 8231c3e833) is not
sufficient. It helps sometimes, but not always. Possibly it depends on
CPU speed and bandwidth of the internet connection... Who knows!? 🤪
But! Reading the file returns more data than the known file size.
That's suspicious and indicates this exact issue. So add a delay, and
keep reading until the sizes are equal.
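A rough sketch of the idea in RouterOS scripting - the file name and
the delay are just placeholders, this is not the actual $WaitForFile
code:
:local FileName "tmpfs/example.rsc";
:local Content [ /file/get $FileName contents ];
# re-read with a delay until the content length matches the reported size
:while ([ :len $Content ] != [ /file/get $FileName size ]) do={
  :delay 100ms;
  :set Content [ /file/get $FileName contents ];
}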
This used to require a key=value store with entries separated by
commas. An example for `netwatch-notify` is:
/tool/netwatch/add comment="notify, name=example.com" host=93.184.215.14;
Now JSON is supported as well, so you could use:
/tool/netwatch/add comment="{\"notify\":true,\"name\":\"example.com\"}" host=93.184.215.14;
Looks more clumsy here, but may be of help in more complex setups...
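For what it's worth: on recent RouterOS (7.13 or later, if I recall
correctly) such a JSON comment can be turned back into an array with
:deserialize. Just a sketch, not necessarily what `netwatch-notify`
does internally:
:local Comment [ /tool/netwatch/get [ find where host=93.184.215.14 ] comment ];
# parse the JSON string into a key-value array
:local Settings [ :deserialize from=json value=$Comment ];
:put ($Settings->"name");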
Well, turns out that waiting for the existence of a file is not
sufficient. Chances are that a file is only partly available, so wait
until the size no longer changes... Let's hope that works as expected. 🤞
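A minimal sketch of that polling loop in RouterOS scripting - file name
and delay are placeholders, not the exact code used in $WaitForFile:
:local FileName "tmpfs/example.rsc";
# wait for the file to show up at all
:while ([ :len [ /file/find where name=$FileName ] ] = 0) do={ :delay 100ms; }
# then wait until two size checks in a row return the same value
:local Size [ /file/get $FileName size ];
:delay 100ms;
:while ($Size != [ /file/get $FileName size ]) do={
  :set Size [ /file/get $FileName size ];
  :delay 100ms;
}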