Robots.io is a Java library designed to make parsing a website's 'robots.txt' file easy.
The RobotsParser class provides all the functionality needed to use Robots.io.
The Javadoc for Robots.io can be found here.

To parse the robots.txt for Google with the User-Agent string "test":
RobotsParser robotsParser = new RobotsParser("test");
robotsParser.connect("http://google.com");Alternatively, to parse with no User-Agent, simply leave the constructor blank.
You can also pass a domain with a path.
robotsParser.connect("http://google.com/example.htm"); //This would also be validNote: Domains can either be passed in string form or as a URL object to all methods.
To check if a URL is allowed:
robotsParser.isAllowed("http://google.com/test"); // Returns true if allowed

Or, to get all the rules parsed from the file:
robotsParser.getDisallowedPaths(); // This will return an ArrayList of Strings

The results parsed are cached in the robotsParser object until the connect() method is called again, overwriting the previously parsed data.
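For instance, a sketch of this caching behaviour (the second domain is purely illustrative):

robotsParser.connect("http://google.com");
robotsParser.getDisallowedPaths(); // rules parsed from google.com

robotsParser.connect("http://example.com"); // overwrites the previously cached rules
robotsParser.getDisallowedPaths(); // now returns the rules parsed from example.com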
In the event that all access is denied, a RobotsDisallowedException will be thrown.
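A sketch of handling this case, assuming RobotsDisallowedException is a checked exception thrown by connect():

try {
    robotsParser.connect("http://google.com");
} catch (RobotsDisallowedException e) {
    // robots.txt denies all access for this User-Agent; skip the domain.
    System.err.println("Crawling disallowed: " + e.getMessage());
}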
Domains passed to RobotsParser are normalised to always end in a forward slash. Disallowed paths returned will never begin with a forward slash. This is so that URLs can easily be constructed. For example:
robotsParser.getDomain() + robotsParser.getDisallowedPaths().get(0); // http://google.com/example.htm

Robots.io is distributed under the GPL.