Questions tagged [robots.txt]

Convention for telling web crawlers which parts of your website they should not crawl.

30 votes
5 answers
81k views

How to set robots.txt globally in nginx for all virtual hosts

I am trying to set robots.txt for all virtual hosts under nginx http server. I was able to do it in Apache by putting the following in main httpd.conf: <Location "/robots.txt"> SetHandler ...
anup's user avatar
  • 727
23 votes
5 answers
22k views

How Can I Encourage Google to Read New robots.txt File?

I just updated my robots.txt file on a new site; Google Webmaster Tools reports it read my robots.txt 10 minutes before my last update. Is there any way I can encourage Google to re-read my robots....
qxotk's user avatar
  • 1,436
14 votes
5 answers
2k views

Which bots and spiders should I block in robots.txt?

In order to: increase the security of my website, reduce bandwidth requirements, and prevent email address harvesting
DaveC's user avatar
  • 243
10 votes
4 answers
23k views

How to create robots.txt file for all domains on Apache server

We have a XAMPP Apache development web server set up with virtual hosts and want to stop search engines from crawling all our sites. This is easily done with a robots.txt file. However, we'd rather not ...
Mike B's user avatar
  • 203
8 votes
3 answers
11k views

How do I use robots.txt to disallow crawling for only my subdomains?

If I want my main website to be on search engines, but none of the subdomains to be, should I just put the "disallow all" robots.txt in the directories of the subdomains? If I do, will my main domain ...
tkbx's user avatar
  • 201
6 votes
6 answers
31k views

What happens if a website does not have a robots.txt file?

If the robots.txt file is missing from the root directory of a website, how is the site treated: is it not indexed at all, or is it indexed without any restrictions? It should logically be ...
Lazer's user avatar
  • 435
6 votes
4 answers
14k views

How do you create a single robots.txt file for all sites on an IIS instance

I want to create a single robots.txt file and have it served for all sites on my IIS (7 in this case) instance. I do not want to have to configure anything on any individual site. How can I do this?
Tim Erickson's user avatar
5 votes
6 answers
13k views

Blocking yandex.ru bot

I want to block all requests from the yandex.ru search bot. It is very traffic intensive (2GB/day). I first blocked one class C IP range, but it seems this bot appears from different IP ranges. For example:...
Ross's user avatar
  • 268
5 votes
1 answer
10k views

Nginx robots.txt configuration

I can't seem to properly configure nginx to return robots.txt content. Ideally, I don't need the file and just want to serve text content configured directly in nginx. Here's my config: server { ...
Denys S.'s user avatar
  • 225
3 votes
3 answers
224 views

How to prevent discovery of a secure URL?

If I have a URL that is used for receiving messages, created like so: http://www.mydomain.com/somelonghash123456etcetc, and this URL allows other services to POST messages to it, is it possible ...
lamp_scaler's user avatar
3 votes
3 answers
1k views

robots.txt is redirecting to default page

Hullo, Typically, if I type into my address bar, "oneofmysites.com/robots.txt", any browser will display the content of robots.txt. As you can see, this is pretty standard behaviour. I have just ...
Parapluie's user avatar
  • 165
3 votes
2 answers
6k views

Is it a good idea to ban amazonaws.com? [closed]

My site is being crawled by an anonymous bot hosted on Amazon EC2. This robot doesn't respect robots.txt and creates high load on the web server, so I added a check for whether the reverse DNS name of the request's IP ends with "amazonaws.com" ...
valodzka's user avatar
  • 187
3 votes
2 answers
44k views

Meaning of Disallow: /*? in robots.txt

Yahoo's robots.txt contains:
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?
What does the last line mean? ("Disallow: /*?")
3 votes
2 answers
2k views

Robots.txt - no follow, no index

Can someone please explain the difference between setting Allow and Disallow in a robots.txt file and creating nofollow and noindex meta tags? Is it possible to set nofollow and noindex within ...
3 votes
1 answer
846 views

Baidu Spider causing 3Gb of traffic a day - but I do business in China

I'm in a difficult situation: the Baidu spider is hitting my site, consuming about 3Gb of bandwidth a day. At the same time, I do business in China, so I don't want to just block it. Has anyone else ...
d.lanza38's user avatar
  • 387
3 votes
1 answer
319 views

Why is googlebot requesting robots.txt from my SSH server?

I run ossec on my server and periodically I receive a warning like this: Received From: myserver->/var/log/auth.log Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version ...
Brian's user avatar
  • 806
2 votes
3 answers
6k views

robots.txt and other .txt returning 404 on IIS?

We have an IIS site running DotNetNuke that we took over from another group. We have added a robots.txt file to the root but it returns a 404. Actually, any .txt file in the root seems to return 404. ...
schooner2000's user avatar
2 votes
2 answers
4k views

Rewrite robots.txt based on host with htaccess

I'm trying to rewrite a filename based on the server's domain. This code below is wrong / not working, but illustrates the desired effect. <If "req('Host') != '*.mydevserver.com'"> ...
Jay's user avatar
  • 157
2 votes
1 answer
525 views

What's with random-character queries coming from googlebot, e.g., vvytnoxvontwusz.html?

One of my sites has been getting queries from googlebot, on the order of: example-log:66.249.79.216 - - [06/Apr/2016:15:36:56 -0700] "GET /vvytnoxvontwusz.html HTTP/1.1" 404 15136 "-" "Mozilla/5.0 (...
Jim Miller's user avatar
2 votes
3 answers
432 views

Block Offline Browsers

Is there a way to block offline browsers (like Teleport Pro, Webzip, etc.) that are shown in the logs as "Mozilla"? Example: Webzip is shown in my site logs as "Mozilla/4.0 (compatible; MSIE 8.0; ...
Alex's user avatar
  • 21
2 votes
1 answer
94 views

Use robots.txt to prevent crawlers from getting old versions of Trac pages

Looking at my Apache access.log, I see that crawlers tend to get old versions of pages and documents, like: 119.63.196.86 - - [10/Jun/2011:10:36:31 +0200] "GET /wiki/News?version=14 HTTP/1.1" 200 6073 ...
Andrea Spadaccini's user avatar
2 votes
1 answer
1k views

Google-Bot fell in love with my 404-page

Every day my access log looks kind of like this: 66.249.78.140 - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot....
32bitfloat's user avatar
2 votes
1 answer
263 views

Ideal robots.txt for a gitweb installation? [closed]

I host a few git repositories at git.nomeata.de using gitweb (and gitolite). Occasionally, a search engine spider comes along and begins to hammer the interface. While I generally do want my git ...
Joachim Breitner's user avatar
2 votes
3 answers
340 views

How much HDD space would I need to cache the web while respecting robots.txt files? [closed]

I want to experiment with creating a web crawler. I'll start with indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. ...
2 votes
1 answer
1k views

Remove ?=collcc from url

Google Webmaster Tools has notified me about too many duplicated URLs. Some parameters that I don't know about have been added and I need to remove them, for example: http://example.com/5454/my-utr....
user994461's user avatar
2 votes
2 answers
1k views

IIS Spikes in anonymous users - crippling my server

I have a server running Windows Server 2008 R2. Recently my websites have been becoming unresponsive at least once a day, seemingly at random intervals. I have installed some monitoring software and ...
Paul Hinett's user avatar
  • 1,205
1 vote
2 answers
3k views

Blocking bad bots

I found this script and was wondering if this is just overkill and even worth using? Is it better for me to just use mod_security? # Generated using http://solidshellsecurity.com services # Begin ...
Tiffany Walker's user avatar
1 vote
2 answers
2k views

How to create a global robots.txt that gets appended to each domain's own robots.txt on Apache?

I know I can create ONE robots.txt file for all domains on an Apache server*, but I want to append to each domain's (if pre-existing) robots.txt. I want some general rules in place for all domains, but ...
Gaia's user avatar
  • 1,935
1 vote
3 answers
733 views

Should I ban spiders?

A Rails template script that I've been looking at automatically adds User-Agent: and Disallow: in robots.txt, thereby banning all spiders from the site. What are the benefits of banning spiders and why ...
marflar's user avatar
  • 397
1 vote
7 answers
212 views

Robots.txt command

I have a bunch of files at www.example.com/A/B/C/NAME (A,B,C change around, NAME is static) and I basically want to add a command in robots.txt so crawlers don't follow any such links that have NAME ...
1 vote
2 answers
2k views

Thousands of robots.txt 404 errors from bots trying to crawl old multisite

Current situation is that we are getting thousands and thousands of 404 errors from bots looking for robots.txt in different places on our site due to domain redirects. Our old website was a ...
Beatchef's user avatar
1 vote
3 answers
3k views

Dynamic robots.txt based on hostname

Is there a way to swap out a robots.txt file in nginx based on hostname? I currently have www.domain.com and backup.domain.com pointing at the same nginx server, but I don't want Google indexing ...
Noodles's user avatar
  • 1,396
1 vote
1 answer
713 views

Does a forward web proxy exist that checks and obeys robots.txt on remote domains?

Does there exist a forward proxy server that will look up and obey robots.txt files on remote internet domains and enforce them on behalf of requesters going via the proxy? e.g. Imagine a website at ...
wodow's user avatar
  • 590
1 vote
1 answer
316 views

Googlebot can't access my site; Webmaster Tools replies "Unreachable robots.txt"

When I try to fetch my site as Googlebot in Webmaster Tools, it returns "Unreachable robots.txt". After investigating, I understood that Googlebot can see my server: tcpdump | grep google It returns that ...
Ahmad Ahmadi's user avatar
1 vote
2 answers
2k views

Is there a way to disallow robots crawling through IIS Management Console for entire site

Can I do the same as robots.txt through IIS settings? Telling User-agent: * Disallow: / in host header or through web.config?
jpkeisala's user avatar
  • 166
1 vote
2 answers
471 views

Weird entry in access.log on Apache 2.2

I'm running Apache 2.2, and my server runs well. Noticed this weird anomaly in my access.log file, how should I prevent it? robots.txt doesn't seem to be working. 127.0.0.1 - - [17/Apr/2011:12:17:00 +...
subarufan86's user avatar
1 vote
1 answer
417 views

Tons of Access from Google Proxy

I frequently get a lot of requests from a Google proxy. It says it is the Google Favicon bot, and I've checked this with the host command. The User-agent looks like the following: "Mozilla/5.0 (X11; Linux x86_64) ...
sasa's user avatar
  • 11
1 vote
1 answer
584 views

Redirect "robots.txt" on specific domain

I want to redirect all requests for "robots.txt" if the domain contains ".our-internal-devel-domain.de". It should be server-wide, because when we develop a website and publish it over our test-domain, ...
chmod777's user avatar
1 vote
1 answer
369 views

High number of hits by Facebook crawler on server

There are about 3000 or more 404 hits daily from the Facebook crawler. The log looks like: X.X.X.X Y.Y.Y.Y - - [24/May/2017:03:43:35 +0000] "GET /health-and-medicine/trumps-2018-budget-cuts-funding-for-cancer-...
YATIN GUPTA's user avatar
1 vote
1 answer
624 views

How to Disallow Particular Path in robots.txt

I want to disallow /path but also want to allow /path/another-path in robots.txt. I already tried: Disallow: /path Or: Disallow: /path$ But that doesn't work; I mean, it blocked /path/another-path too. ...
1 vote
1 answer
71 views

If denying crawlers access to a directory via robots.txt, will it still index a file in that directory if I direct link?

I am denying indexing to a folder called pdf via robots.txt. However, I do direct link to a few files that exist in that directory. Will search engines such as Google index those files, or ignore ...
kylex's user avatar
  • 1,443
1 vote
2 answers
152 views

robots.txt file with more restrictive rules for certain user agents

I'm a bit vague on the precise syntax of robots.txt, but what I'm trying to achieve is: (1) tell all user agents not to crawl certain pages; (2) tell certain user agents not to crawl anything (basically, ...
Carson63000's user avatar
1 vote
3 answers
409 views

Does GoogleBot respect User-agent: *

I blocked a page in robots.txt under User-agent: *, and tried to do a manual removal of that URL from Google's cache in Webmaster Tools. Google said it wasn't being blocked in my robots.txt, so I ...
user40696's user avatar
  • 113
1 vote
0 answers
136 views

Traefik, docker swarm and portainer. Serving robots.txt file

I'm playing around with my homelab and I'm trying to include a robots.txt file. I'm launching traefik and portainer using this docker_compose file. This is using Docker swarm mode version: "3.3" ...
Adam Radomski's user avatar
1 vote
0 answers
28 views

What are they trying to get with "GET /public-projects"

I haven't even shared my website with anyone yet and I have already started seeing attempts to GET /public-projects. However, I couldn't get any information about it, what are they trying to get? The ...
fersarr's user avatar
  • 111
1 vote
0 answers
50 views

How to block bad url path that is not part of my site from showing in google search?

I have a site that is running on Node.js (Express) and Apache httpd. Hundreds of requests are coming in from malicious IPs, which I'm proactively blocking. (I have a script that looks at the ...
xDG's user avatar
  • 123
1 vote
1 answer
1k views

robots.txt route requires a backslash when behind an Application Load Balancer

I have a rails site using an AWS ALB and all routes appear to work except one, robots.txt. I am getting the error "ERR_TOO_MANY_REDIRECTS", link to example: https://www.mamapedia.com/robots.txt ...
6557457iD9e's user avatar
1 vote
1 answer
897 views

Custom robots.txt being overwritten in Azure IIS 8 by something

We have a custom robots.txt in the root of our IIS cloud service Azure website that does not display correctly when navigating to www.oursite.com/robots.txt . A “different” robots.txt file displays ...
Brian's user avatar
  • 11
1 vote
0 answers
2k views

How to block fake google spider and fake web browser access?

Recently I found that some people are trying to mirror my website. They are doing this in two ways: Pretending to be Google spiders. Access logs are as follows: 89.85.93.235 - - [05/May/2015:20:23:16 +...
Meteor's user avatar
  • 151
1 vote
1 answer
1k views

apache robots.txt with SSL

I have an .htaccess file with a rewrite rule to get a redirect of every HTTP request to HTTPS. But now I have a problem that my robots.txt is not recognized by some online checker. If I remove the ...
user224013's user avatar