Questions tagged [robots.txt]
A convention for telling web crawlers which parts of a website they may not crawl or index.
88
questions
30
votes
5
answers
81k
views
How to set robots.txt globally in nginx for all virtual hosts
I am trying to set robots.txt for all virtual hosts under the nginx HTTP server.
I was able to do it in Apache by putting the following in the main httpd.conf:
<Location "/robots.txt">
SetHandler ...
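An editorial sketch of the nginx side: nginx has no global location context, so one common approach is to define the robots.txt location once in a snippet file and include it from every server block. The paths below are assumptions, not the asker's setup:
# /etc/nginx/snippets/robots.conf (assumed path)
location = /robots.txt {
    alias /var/www/shared/robots.txt;  # one shared file for all vhosts (assumed path)
}
Each virtual host then adds "include /etc/nginx/snippets/robots.conf;" inside its server block.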
23
votes
5
answers
22k
views
How Can I Encourage Google to Read a New robots.txt File?
I just updated my robots.txt file on a new site; Google Webmaster Tools reports it read my robots.txt 10 minutes before my last update.
Is there any way I can encourage Google to re-read my robots....
14
votes
5
answers
2k
views
Which bots and spiders should I block in robots.txt?
In order to:
Increase security of my website
Reduce bandwidth requirements
Prevent email address harvesting
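For compliant crawlers, a robots.txt group per bot is the standard mechanism; note that the abusive bots this question targets typically ignore robots.txt, so this is only a sketch of the syntax (the user-agent token is a placeholder):
User-agent: SomeBadBot
Disallow: /
Non-compliant bots need server-side blocking (by IP, user agent, or a module like mod_security) instead.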
10
votes
4
answers
23k
views
How to create robots.txt file for all domains on Apache server
We have a XAMPP Apache development web server set up with virtual hosts and want to stop search engines from crawling all our sites. This is easily done with a robots.txt file. However, we'd rather not ...
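A minimal sketch of one common approach, assuming Apache 2.4 and a shared file at a made-up path: an Alias in the main server config is inherited by every virtual host that doesn't define its own:
# in httpd.conf, outside any <VirtualHost>
Alias /robots.txt /var/www/shared/robots.txt
<Directory "/var/www/shared">
    Require all granted
</Directory>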
8
votes
3
answers
11k
views
How do I use robots.txt to disallow crawling for only my subdomains?
If I want my main website to appear on search engines, but none of the subdomains to, should I just put the "disallow all" robots.txt in the directories of the subdomains? If I do, will my main domain ...
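That is the usual pattern, assuming each subdomain is served from its own document root: robots.txt is always fetched per hostname, so a disallow-all file at a subdomain's root does not affect the main domain:
# robots.txt in each subdomain's document root
User-agent: *
Disallow: /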
6
votes
6
answers
31k
views
What happens if a website does not have a robots.txt file?
If the robots.txt file is missing from the root directory of a website, which of these is the result:
the site is not indexed at all
the site is indexed without any restrictions
It should logically be ...
6
votes
4
answers
14k
views
How do you create a single robots.txt file for all sites on an IIS instance
I want to create a single robots.txt file and have it served for all sites on my IIS (7 in this case) instance.
I do not want to have to configure anything on any individual site.
How can I do this?
5
votes
6
answers
13k
views
Blocking yandex.ru bot
I want to block all requests from the yandex.ru search bot. It is very traffic intensive (2 GB/day).
I first blocked one class C IP range, but it seems this bot appears from different IP ranges.
For example:...
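Yandex documents that its crawler honors robots.txt under the "Yandex" user-agent token, so a sketch for the compliant case looks like this; a bot that ignores it needs firewall or web server rules instead:
User-agent: Yandex
Disallow: /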
5
votes
1
answer
10k
views
Nginx robots.txt configuration
I can't seem to properly configure nginx to return robots.txt content. Ideally, I don't need the file and just want to serve text content configured directly in nginx. Here's my config:
server {
...
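A minimal sketch for serving the content without a file, which is what the question asks for (the rules themselves are placeholders):
location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}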
3
votes
3
answers
224
views
How to prevent discovery of a secure URL?
If I have a URL that is used for receiving messages, and I create it like so: http://www.mydomain.com/somelonghash123456etcetc, and this URL allows other services to POST messages to it, is it possible ...
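One caveat worth noting: listing the hash in robots.txt would itself disclose it to anyone who reads the file. A response header avoids that; a sketch in nginx syntax, assuming a hypothetical /inbox/ prefix for such endpoints:
location /inbox/ {
    add_header X-Robots-Tag "noindex, nofollow" always;
}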
3
votes
3
answers
1k
views
robots.txt is redirecting to default page
Hullo,
Typically, if I type into my address bar, "oneofmysites.com/robots.txt", any browser will display the content of robots.txt. As you can see, this is pretty standard behaviour.
I have just ...
3
votes
2
answers
6k
views
Is it a good idea to ban amazonaws.com? [closed]
My site is being crawled by an anonymous bot hosted on Amazon EC2. This bot doesn't respect robots.txt and creates a high load on the web server, so I added a check for whether the reverse IP of a request ends with "amazonaws.com" ...
3
votes
2
answers
44k
views
Meaning of Disallow: /*? in robots.txt
Yahoo's robots.txt contains:
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?
What does the last line mean? ("Disallow: /*?")
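For reference, in the wildcard syntax supported by the major engines, * matches any sequence of characters, so the pattern matches any URL containing a query string:
Disallow: /*?
# blocks e.g. /search?q=foo or /page?id=1, but not /search or /page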
3
votes
2
answers
2k
views
Robots.txt - no follow, no index
Please can someone explain to me the difference between setting allow and disallow in a robots.txt file and creating nofollow, noindex meta tags?
Is it possible to set nofollow and noindex within ...
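For contrast with the robots.txt directives, the meta tag version goes in each page's <head>; robots.txt controls crawling, while noindex/nofollow control indexing and link-following of a page the crawler has already fetched:
<meta name="robots" content="noindex, nofollow">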
3
votes
1
answer
846
views
Baidu Spider causing 3Gb of traffic a day - but I do business in China
I'm in a difficult situation: the Baidu spider is hitting my site, using about 3 GB a day of bandwidth. At the same time, I do business in China, so I don't want to just block it.
Has anyone else ...
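A hedged sketch of the usual middle ground: a Crawl-delay directive under Baidu's documented user-agent token. Crawl-delay support varies by crawler, and this assumes Baiduspider honors it, which is not guaranteed:
User-agent: Baiduspider
Crawl-delay: 10   # seconds between requests, for crawlers that honor it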
3
votes
1
answer
319
views
Why is googlebot requesting robots.txt from my SSH server?
I run ossec on my server and periodically I receive a warning like this:
Received From: myserver->/var/log/auth.log
Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version ...
2
votes
3
answers
6k
views
robots.txt and other .txt returning 404 on IIS?
We have an IIS site running DotNetNuke that we took over from another group. We added a robots.txt file to the root, but it returns a 404. Actually, any .txt file in the root seems to return a 404.
...
2
votes
2
answers
4k
views
Rewrite robots.txt based on host with htaccess
I'm trying to rewrite a filename based on the server's domain.
The code below is wrong and not working, but it illustrates the desired effect.
<If "req('Host') != '*.mydevserver.com'">
...
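A mod_rewrite sketch of the intended effect, assuming a second file named robots-dev.txt alongside the real one (both the hostname and the file name are placeholders):
RewriteEngine On
RewriteCond %{HTTP_HOST} \.mydevserver\.com$ [NC]
RewriteRule ^robots\.txt$ robots-dev.txt [L]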
2
votes
1
answer
525
views
What's with random-character queries coming from googlebot, e.g., vvytnoxvontwusz.html?
One of my sites has been getting queries from googlebot, on the order of:
example-log:66.249.79.216 - - [06/Apr/2016:15:36:56 -0700] "GET /vvytnoxvontwusz.html HTTP/1.1" 404 15136 "-" "Mozilla/5.0 (...
2
votes
3
answers
432
views
Block Offline Browsers
Is there a way to block offline browsers (like Teleport Pro, Webzip, etc.) that show up in the logs as "Mozilla"?
Example:
Webzip shows up in my site logs as "Mozilla/4.0 (compatible; MSIE 8.0; ...
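Since these tools often send a generic Mozilla user agent, matching can only catch the ones that identify themselves; a mod_rewrite sketch for the self-identifying cases (the tool names are examples):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Teleport|WebZIP|HTTrack) [NC]
RewriteRule . - [F,L]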
2
votes
1
answer
94
views
Use robots.txt to prevent crawlers from getting old versions of Trac pages
Looking at my Apache access.log, I see that crawlers tend to get old versions of pages and documents, like:
119.63.196.86 - - [10/Jun/2011:10:36:31 +0200] "GET /wiki/News?version=14 HTTP/1.1" 200 6073 ...
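A sketch using the wildcard extension supported by the major crawlers (it is not part of the original robots.txt standard), matching the version query parameter shown in the log:
User-agent: *
Disallow: /wiki/*?version=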
2
votes
1
answer
1k
views
Google-Bot fell in love with my 404-page
Every day my access log looks kind of like this:
66.249.78.140 - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot....
2
votes
1
answer
263
views
Ideal robots.txt for a gitweb installation? [closed]
I host a few git repositories at git.nomeata.de using gitweb (and gitolite). Occasionally, a search engine spider comes along and begins to hammer the interface. While I generally do want my git ...
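A sketch of the usual shape for gitweb, whose expensive views are selected by the a= query parameter; the action names below exist in stock gitweb, but the list is illustrative, not exhaustive:
User-agent: *
Disallow: /*a=search
Disallow: /*a=snapshot
Disallow: /*a=blobdiff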
2
votes
3
answers
340
views
How much HDD space would I need to cache the web while respecting robots.txt? [closed]
I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. ...
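A rough back-of-envelope for scoping this (every number here is an assumption): 10^9 pages × 100 KB of HTML per page ≈ 10^14 bytes ≈ 100 TB, before images, scripts, or recrawls.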
2
votes
1
answer
1k
views
Remove ?=collcc from url
Google Webmaster Tools has notified me about too many duplicated URLs. Some parameters have been added that I don't know about, and I need to remove them, for example:
http://example.com/5454/my-utr....
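One common approach is a 301 that strips the query string when the parameter appears; a mod_rewrite sketch, assuming Apache and that collcc is the parameter to drop (this discards the entire query string, so it only fits URLs where collcc appears alone):
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)collcc= [NC]
RewriteRule ^ %{REQUEST_URI}? [R=301,L]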
2
votes
2
answers
1k
views
IIS Spikes in anonymous users - crippling my server
I have a server running Windows Server 2008 R2; recently my websites have been becoming unresponsive at least once a day, seemingly at random intervals.
I have installed some monitoring software and ...
1
vote
2
answers
3k
views
Blocking bad bots
I found this script and was wondering: is it just overkill, and is it even worth using?
Would it be better for me to just use mod_security?
# Generated using http://solidshellsecurity.com services
# Begin ...
1
vote
2
answers
2k
views
How to create a global robots.txt that gets appended to each domain's own robots.txt on Apache?
I know I can create ONE robots.txt file for all domains on an Apache server*, but I want to append to each domain's (if pre-existing) robots.txt. I want some general rules in place for all domains, but ...
1
vote
3
answers
733
views
Should I ban spiders?
A Rails template script that I've been looking at automatically adds User-Agent: and Disallow: in robots.txt, thereby banning all spiders from the site.
What are the benefits of banning spiders and why ...
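One detail worth checking in that template: the value matters. An empty Disallow permits everything, while "Disallow: /" is what actually bans all compliant spiders:
User-agent: *
Disallow: /
# versus: an empty "Disallow:" line allows the entire site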
1
vote
7
answers
212
views
Robots.txt command
I have a bunch of files at www.example.com/A/B/C/NAME (A, B, C change around; NAME is static) and I basically want to add a rule in robots.txt so crawlers don't follow any such links that have NAME ...
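With the wildcard and end-anchor extensions the major engines support (not part of the original robots.txt standard), a sketch would be:
User-agent: *
Disallow: /*/NAME$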
1
vote
2
answers
2k
views
Thousands of robots.txt 404 errors from bots trying to crawl old multisite
Current situation is that we are getting thousands and thousands of 404 errors from bots looking for robots.txt in different places on our site due to domain redirects.
Our old website was a ...
1
vote
3
answers
3k
views
Dynamic robots.txt based on hostname
Is there a way to swap out a robots.txt file in nginx based on hostname? I currently have www.domain.com and backup.domain.com pointing at the same nginx server, but I don't want Google indexing ...
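A sketch of the usual pattern: give the backup hostname its own server block (or just its own robots.txt location) and answer with disallow-all there only:
server {
    server_name backup.domain.com;
    location = /robots.txt {
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
    }
    # ... rest of the backup host configuration
}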
1
vote
1
answer
713
views
Does a forward web proxy exist that checks and obeys robots.txt on remote domains?
Does there exist a forward proxy server that will lookup and obey robots.txt files on remote internet domains and enforce them on behalf of requesters going via the proxy?
e.g. Imagine a website at ...
1
vote
1
answer
316
views
Googlebot can't access my site; Webmaster Tools replies "Unreachable robots.txt"
When I try to fetch my site as Googlebot in Webmaster Tools, it returns "Unreachable robots.txt". After investigating, I understood that the Google bot can see my server:
tcpdump | grep google
It returns that ...
1
vote
2
answers
2k
views
Is there a way to disallow robot crawling through the IIS Management Console for an entire site
Can I do the same as robots.txt through IIS settings? That is, deliver
User-agent: * Disallow: /
via a host header or through web.config?
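robots.txt itself has no direct IIS-settings equivalent, but a crawler directive can be sent as a response header from web.config; a sketch using the X-Robots-Tag header, which controls indexing rather than crawling:
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <add name="X-Robots-Tag" value="noindex, nofollow" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>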
1
vote
2
answers
471
views
Weird entry in access.log on Apache 2.2
I'm running Apache 2.2, and my server runs well. I noticed this weird anomaly in my access.log file; how should I prevent it? robots.txt doesn't seem to be working.
127.0.0.1 - - [17/Apr/2011:12:17:00 +...
1
vote
1
answer
417
views
Tons of Access from Google Proxy
I frequently get a lot of accesses from a Google proxy. It says it is the Google Favicon bot, and I've verified this with the host command. The user agent looks like the following:
"Mozilla/5.0 (X11; Linux x86_64) ...
1
vote
1
answer
584
views
Redirect "robots.txt" on specific domain
I want to redirect all requests for "robots.txt" if the domain contains ".our-internal-devel-domain.de". It should be server-wide, because when we develop a website and publish it on our test domain, ...
1
vote
1
answer
369
views
High number of hits on the server from the Facebook crawler
There are about 3,000 or more 404 hits daily from the Facebook crawler. The log looks like:
X.X.X.X Y.Y.Y.Y - - [24/May/2017:03:43:35 +0000] "GET /health-and-medicine/trumps-2018-budget-cuts-funding-for-cancer-...
1
vote
1
answer
624
views
How to Disallow a Particular Path in robots.txt
I want to disallow /path but also want to allow /path/another-path in robots.txt. I already tried:
Disallow: /path
Or:
Disallow: /path$
But that doesn't work; it blocks /path/another-path too. ...
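For crawlers that support the Allow extension with longest-match precedence (Google and Bing do), the usual sketch is to pair the two rules; the more specific Allow then wins for the subpath:
User-agent: *
Allow: /path/another-path
Disallow: /path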
1
vote
1
answer
71
views
If I deny crawlers access to a directory via robots.txt, will they still index a file in that directory if I link to it directly?
I am denying crawling of a folder called pdf via robots.txt. However, I do link directly to a few files that exist in that directory.
Will search engines such as Google index those files, or ignore ...
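Yes, a disallowed URL can still be indexed (without its content) from external links. Google documents the X-Robots-Tag response header for this case; an Apache sketch, assuming mod_headers is enabled, and noting the files must remain crawlable for the header to be seen:
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>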
1
vote
2
answers
152
views
robots.txt file with more restrictive rules for certain user agents
I'm a bit vague on the precise syntax of robots.txt, but what I'm trying to achieve is:
Tell all user agents not to crawl certain pages
Tell certain user agents not to crawl anything
(basically, ...
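A sketch of the shape this usually takes; note that a compliant crawler obeys only the most specific group that matches it, so a bot named in its own group ignores the * group entirely:
User-agent: *
Disallow: /private/

User-agent: SomeBlockedBot
Disallow: /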
1
vote
3
answers
409
views
Does GoogleBot respect User-agent: *?
I blocked a page in robots.txt under User-agent: *, and tried to do a manual removal of that URL from Google's cache in the webmasters tools. Google said it wasn't being blocked in my robots.txt, so I ...
1
vote
0
answers
136
views
Traefik, docker swarm and portainer. Serving robots.txt file
I'm playing around with my homelab and I'm trying to include a robots.txt file.
I'm launching Traefik and Portainer using this docker-compose file, in Docker swarm mode:
version: "3.3" ...
1
vote
0
answers
28
views
What are they trying to get with "GET /public-projects"
I haven't even shared my website with anyone yet, and I have already started seeing attempts to GET /public-projects. However, I couldn't find any information about it. What are they trying to get? The ...
1
vote
0
answers
50
views
How to block a bad URL path that is not part of my site from showing in Google search?
I have a site running on Node.js (Express) and Apache httpd.
Hundreds of requests are coming in from malicious IPs, which I'm proactively blocking. (I have a script that looks at the ...
1
vote
1
answer
1k
views
robots.txt route requires a backslash when behind an Application Load Balancer
I have a Rails site using an AWS ALB, and all routes appear to work except one: robots.txt.
I am getting the error "ERR_TOO_MANY_REDIRECTS", link to example: https://www.mamapedia.com/robots.txt
...
1
vote
1
answer
897
views
Custom robots.txt being overwritten in Azure IIS 8 by something
We have a custom robots.txt in the root of our IIS cloud service Azure website that does not display correctly when navigating to www.oursite.com/robots.txt. A “different” robots.txt file displays ...
1
vote
0
answers
2k
views
How to block fake google spider and fake web browser access?
Recently I found that some people are trying to mirror my website. They are doing this in two ways:
Pretending to be Google spiders. Access logs are as follows:
89.85.93.235 - - [05/May/2015:20:23:16 +...
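Google's documented way to separate real Googlebot traffic from fakes is a reverse-then-forward DNS check; a sketch with the host command (the IP is the example from Google's own help page):
host 66.249.66.1
# expected: pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
# expected: has address 66.249.66.1
A fake spider fails one of the two lookups and can then be blocked by IP.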
1
vote
1
answer
1k
views
apache robots.txt with SSL
I have an .htaccess file with a rewrite rule that redirects every HTTP request to HTTPS.
But now I have a problem: my robots.txt is not recognized by some online checkers.
If I remove the ...
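A common fix is to exempt robots.txt from the HTTP-to-HTTPS rewrite so checkers that only speak HTTP can still fetch it; a mod_rewrite sketch for .htaccess:
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]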