Free Republic
Browse · Search
News/Activism
Topics · Post Article

To: Danae
He is specifically speaking about a robots.txt file with a User-agent entry and a Disallow rule set, so that archiving sites (the ones that actually obey the protocol) do not crawl and archive the site/domain/URL in question. The robots.txt file is not itself a robot; it is a message telling crawling bots to ignore the site (not all bots will obey it).
See the following: Block or remove pages using a robots.txt file

I would also refer you to the following: robotstxt.org's About /robots.txt
Or this other informational FAQ directly from archive.org: How can I have my site's pages excluded from the Wayback Machine?
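To make the mechanism concrete, here is a minimal sketch of such a file. This assumes the goal is to block the Internet Archive's crawler (which has historically identified itself as "ia_archiver") as well as all other compliant crawlers; the exact agent names a site targets are up to its owner:

```
# Block the Internet Archive's crawler specifically
User-agent: ia_archiver
Disallow: /

# Block all other compliant crawlers from the entire site
User-agent: *
Disallow: /
```

A compliant crawler fetches /robots.txt before crawling, finds the record matching its user-agent string, and skips any path covered by a Disallow rule; "Disallow: /" covers the whole site.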

You could always look at the Google cache for the site(s)/URL(s) in question, because Google's cache sometimes retains pages even when robots.txt later changes. You could also search Google for robots.txt disallow, or use that term in whatever search engine you prefer (if you eschew Google for whatever reason), to get the same info.
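For anyone who wants to test how a given robots.txt is interpreted, Python's standard library ships a parser for the protocol. This is only an illustrative sketch with a made-up rule set (the "ia_archiver" entry mirrors the exclusion discussed above); it is not tied to any particular site:

```python
# Sketch: check whether a crawler may fetch a URL under a given
# robots.txt, using Python's standard-library parser.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows the archive.org crawler only.
robots_txt = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The archiving bot is disallowed everywhere under this rule set...
print(parser.can_fetch("ia_archiver", "/any/page.html"))  # False
# ...while agents with no matching record default to allowed.
print(parser.can_fetch("Googlebot", "/any/page.html"))    # True
```

Note that this only tells you what a *compliant* crawler would do; as the post above says, a bot is free to ignore the file entirely.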

Regardless, thanks for the article; it is very interesting!

135 posted on 10/20/2011 10:10:47 PM PDT by jurroppi1


To: jurroppi1

http://obamareleaseyourrecords.blogspot.com/2011/10/new-york-state-board-of-elections.html

New York State Board of Elections Website Blocking Access To Natural Born Citizen Requirements


142 posted on 10/20/2011 11:52:30 PM PDT by rolling_stone



FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson