30 votes
5 answers

How to set robots.txt globally in nginx for all virtual hosts

I am trying to set robots.txt for all virtual hosts under nginx http server. I was able to do it in Apache by putting the following in main httpd.conf: <Location "/robots.txt"> SetHandler ...
23 votes
5 answers

How Can I Encourage Google to Read New robots.txt File?

I just updated my robots.txt file on a new site; Google Webmaster Tools reports it read my robots.txt 10 minutes before my last update. Is there any way I can encourage Google to re-read my robots....
14 votes
5 answers

Which bots and spiders should I block in robots.txt?

In order to: increase security of my website; reduce bandwidth requirements; prevent email address harvesting
10 votes
4 answers

How to create robots.txt file for all domains on Apache server

We have a XAMPP Apache development web server set up with virtual hosts and want to stop search engines from crawling all our sites. This is easily done with a robots.txt file. However, we'd rather not ...
8 votes
3 answers

How do I use robots.txt to disallow crawling for only my subdomains?

If I want my main website to be on search engines, but none of the subdomains to be, should I just put the "disallow all" robots.txt in the directories of the subdomains? If I do, will my main domain ...
6 votes
6 answers

What happens if a website does not have a robots.txt file?

If the robots.txt file is missing in the root directory of a website, how is the site treated: the site is not indexed at all, or the site is indexed without any restrictions? It should logically be ...
6 votes
4 answers

How do you create a single robots.txt file for all sites on an IIS instance

I want to create a single robots.txt file and have it served for all sites on my IIS (7 in this case) instance. I do not want to have to configure anything on any individual site. How can I do this?
5 votes
6 answers

Blocking bot

I want to block all requests from a search bot. It is very traffic-intensive (2GB/day). I first blocked one class C IP range, but it seems this bot appears from different IP ranges. For example:...
5 votes
1 answer

Nginx robots.txt configuration

I can't seem to properly configure nginx to return robots.txt content. Ideally, I don't need the file and just want to serve text content configured directly in nginx. Here's my config: server { ...
3 votes
3 answers

How to prevent discovery of a secure URL?

If I have a url that is used for getting messages and I create it like so: and this URL allows for other services to POST messages to. Is it possible ...
3 votes
3 answers

robots.txt is redirecting to default page

Hullo, Typically, if I type into my address bar, "", any browser will display the content of robots.txt. As you can see, this is pretty standard behaviour. I have just ...
3 votes
2 answers

Is it good idea to ban [closed]

Sites are crawled by an anonymous bot hosted on Amazon EC2. This robot doesn't respect robots.txt and creates high load on the web server, so I added a check whether the reverse IP for the request ends with "" ...
3 votes
2 answers

Meaning of Disallow: /*? in robots.txt

Yahoo's robots.txt contains: User-agent: * Disallow: /p/ Disallow: /r/ Disallow: /*? What does the last line mean? ("Disallow: /*?")
3 votes
2 answers

Robots.txt - no follow, no index

Please can someone explain to me the difference between setting allow and disallow in a robots.txt file and creating nofollow, noindex meta tags? Is it possible to set nofollow and noindex within ...
3 votes
1 answer

Baidu Spider causing 3Gb of traffic a day - but I do business in China

I'm in a difficult situation: the Baidu spider is hitting my site, causing about 3Gb a day worth of bandwidth. At the same time, I do business in China, so I don't want to just block it. Has anyone else ...
3 votes
1 answer

Why is googlebot requesting robots.txt from my SSH server?

I run ossec on my server and periodically I receive a warning like this: Received From: myserver->/var/log/auth.log Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version ...
2 votes
3 answers

robots.txt and other .txt returning 404 on IIS?

We have an IIS site running Dotnetnuke that we took over from another group. We have added a robots.txt file to the root but it returns a 404. Actually any txt file in the root seems to return 404. ...
2 votes
2 answers

Rewrite robots.txt based on host with htaccess

I'm trying to rewrite a filename based on the server's domain. This code below is wrong / not working, but illustrates the desired effect. <If "req('Host') != '*'"> ...
2 votes
1 answer

What's with random-character queries coming from googlebot, e.g., vvytnoxvontwusz.html?

One of my sites has been getting queries from googlebot, on the order of: example-log: - - [06/Apr/2016:15:36:56 -0700] "GET /vvytnoxvontwusz.html HTTP/1.1" 404 15136 "-" "Mozilla/5.0 (...
2 votes
3 answers

Block Offline Browsers

Is there a way to block offline browsers (like Teleport Pro, Webzip, etc...) that are shown in the logs as "Mozilla"? Example: Webzip is shown in my site logs as "Mozilla/4.0 (compatible; MSIE 8.0; ...
2 votes
1 answer

Use robots.txt to prevent crawlers from getting old versions of Trac pages

Looking at my Apache access.log, I see that crawlers tend to get old versions of pages and documents, like: - - [10/Jun/2011:10:36:31 +0200] "GET /wiki/News?version=14 HTTP/1.1" 200 6073 ...
2 votes
1 answer

Google-Bot fell in love with my 404-page

Every day my access-log looks kind of like this: - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +
2 votes
1 answer

Ideal robots.txt for a gitweb installation? [closed]

I host a few git repositories at using gitweb (and gitolite). Occasionally, a search engine spider comes along and begins to hammer the interface. While I generally do want my git ...
2 votes
3 answers

How much HDD space would I need to cache the web while respecting robot.txts? [closed]

I want to experiment with creating a web crawler. I'll start with indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. ...
2 votes
1 answer

Remove ?=collcc from url

Google Webmaster Tools has notified me about too many duplicated URLs. Some parameters have been added that I don't know about and I need to remove them, for example:
2 votes
2 answers

IIS Spikes in anonymous users - crippling my server

I have a server running Windows Server 2008 R2. Recently my websites have been becoming unresponsive at least once a day, seemingly at random intervals. I have installed some monitoring software and ...
1 vote
2 answers

Blocking bad bots

I found this script and was wondering if this is just overkill and even worth using? Is it better for me to just use mod_security? # Generated using services # Begin ...
1 vote
2 answers

How to create a global robots.txt that gets appended to each domain's own robots.txt on Apache?

I know I can create ONE robots.txt file for all domains on an Apache server*, but I want to append to each domain's (if pre-existing) robots.txt. I want some general rules in place for all domains, but ...
1 vote
3 answers

Should I ban spiders?

A Rails template script that I've been looking at automatically adds User-Agent: and Disallow: in robots.txt, thereby banning all spiders from the site. What are the benefits of banning spiders and why ...
1 vote
7 answers

Robots.txt command

I have a bunch of files at (A,B,C change around, NAME is static) and I basically want to add a command in robots.txt so crawlers don't follow any such links that have NAME ...
1 vote
2 answers

Thousands of robots.txt 404 errors from bots trying to crawl old multisite

Current situation is that we are getting thousands and thousands of 404 errors from bots looking for robots.txt in different places on our site due to domain redirects. Our old website was a ...
1 vote
3 answers

Dynamic robots.txt based on hostname

Is there a way to swap out a robots.txt file in nginx based on hostname? I currently have and pointing at the same nginx server, but I don't want Google indexing ...
1 vote
1 answer

Does a forward web proxy exist that checks and obeys robots.txt on remote domains?

Does there exist a forward proxy server that will look up and obey robots.txt files on remote internet domains and enforce them on behalf of requesters going via the proxy? e.g. Imagine a website at ...
1 vote
1 answer

Googlebot cant access my site webmaster tools reply Unreachable robots.txt

When I try to fetch my site as Googlebot in Webmaster Tools, it returns "Unreachable robots.txt". After investigating, I understood that Googlebot can see my server: tcpdump | grep google It returns that ...
1 vote
2 answers

Is there a way to disallow robots crawling through IIS Management Console for entire site

Can I do the same as robots.txt through IIS settings? Telling User-agent: * Disallow: / in host header or through web.config?
1 vote
2 answers

Weird entry in access.log on Apache 2.2

I'm running Apache 2.2, and my server runs well. Noticed this weird anomaly in my access.log file, how should I prevent it? robots.txt doesn't seem to be working. - - [17/Apr/2011:12:17:00 +...
1 vote
1 answer

Tons of Access from Google Proxy

I frequently get a lot of accesses from a Google proxy. It says it is the Google Favicon bot, and I've checked it with the host command. The User-agent is as follows. "Mozilla/5.0 (X11; Linux x86_64) ...
1 vote
1 answer

Redirect "robots.txt" on specific domain

I want to redirect all requests on "robots.txt" if the domain contains "". It should be server-wide, because when we develop a website and publish it over our test-domain, ...
1 vote
1 answer

High no of hits by facebook crawler on server

There are about 3000 or more 404 hits daily from the Facebook crawler. The log looks like: X.X.X.X Y.Y.Y.Y - - [24/May/2017:03:43:35 +0000] "GET /health-and-medicine/trumps-2018-budget-cuts-funding-for-cancer-...
1 vote
1 answer

How to Disallow Particular Path in robots.txt

I want to disallow /path but also want to allow /path/another-path in robots.txt. I already tried: Disallow: /path Or: Disallow: /path$ But it doesn't work; I mean it blocked /path/another-path too. ...
1 vote
1 answer

If denying crawlers access to a directory via robots.txt, will it still index a file in that directory if I direct link?

I am denying indexing to a folder called pdf via robots.txt. However, I do direct link to a few files that exist in that directory. Will search engines such as Google index those files, or ignore ...
1 vote
2 answers

robots.txt file with more restrictive rules for certain user agents

I'm a bit vague on the precise syntax of robots.txt, but what I'm trying to achieve is: tell all user agents not to crawl certain pages; tell certain user agents not to crawl anything (basically, ...
1 vote
3 answers

Does GoogleBot respect User-agent: *

I blocked a page in robots.txt under User-agent: *, and tried to do a manual removal of that URL from Google's cache in the webmasters tools. Google said it wasn't being blocked in my robots.txt, so I ...
1 vote
0 answers

Traefik, docker swarm and portainer. Serving robots.txt file

I'm playing around with my homelab and I'm trying to include a robots.txt file. I'm launching Traefik and Portainer using this docker_compose file. This is using Docker swarm mode version: "3.3...
1 vote
0 answers

What are they trying to get with "GET /public-projects"

I haven't even shared my website with anyone yet and I have already started seeing attempts to GET /public-projects. However, I couldn't get any information about it, what are they trying to get? The ...
1 vote
0 answers

How to block bad url path that is not part of my site from showing in google search?

I have got a site that is running on Node.js (Express) and Apache httpd. Hundreds of requests are coming in from malicious IPs, which I'm proactively blocking. (I have a script that looks at the ...
1 vote
1 answer

robots.txt route requires a backslash when behind an Application Load Balancer

I have a rails site using an AWS ALB and all routes appear to work except one, robots.txt. I am getting the error "ERR_TOO_MANY_REDIRECTS", link to example: ...
1 vote
1 answer

Custom robots.txt being overwritten in Azure IIS 8 by something

We have a custom robots.txt in the root of our IIS cloud service Azure website that does not display correctly when navigating to . A “different” robots.txt file displays ...
1 vote
0 answers

How to block fake google spider and fake web browser access?

Recently I found that some guys are trying to mirror my website. They are doing this in two ways: pretending to be Google spiders. Access logs are as follows: - - [05/May/2015:20:23:16 +...
1 vote
1 answer

apache robots.txt with SSL

I have an .htaccess file with a rewrite rule to get a redirect of every HTTP request to HTTPS. But now I have a problem that my robots.txt is not recognized by some online checker. If I remove the ...
