Web Robot Detection Requirements
By Roger Hsueh
Introduction
Search engines use web robots to periodically retrieve pages
from sites for indexing. However, robots cannot reach areas that
require users to log in, even though those areas often hold database
content that should be open to public searches. The site
administrator can set up a dedicated area on the site to serve
content to robots; robot-detection is then responsible for sending
robots to the right place.
Vision Statement
Without search engines, people would be lost on the Internet.
However, personalized systems like ACS keep much of their content
hidden behind login pages, which makes it inaccessible to the
software robots that crawl the web for search engines. To increase a
site's visibility, site owners need a tool that identifies visiting
robots and presents them with content to be indexed. The Web Robot
Detection package fulfills that role.
Web Robot Detection Overview
Web Robot Detection is an application package that defines a
data model and some code to handle traffic from search-engine
robots. It has the following components:
A data model for storing information about known search-engine
robots on the web.
A mechanism to maintain the list of robots and keep it in
sync with the database.
A mechanism, based on the ACS Kernel, for specifying the paths from
which robots should be redirected and the target they should be
sent to.
Code that makes use of the request-processor filter provided by the
ACS 4.0 Kernel (a registration sketch follows this list).
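As an illustration of the last component, the fragment below sketches how a preauthentication filter might be registered with the ACS 4.0 request processor, assuming the kernel's ad_register_filter API. It is only a sketch: the filter procedure name (robot_detection_filter) and the filtered path (/members/*) are placeholder assumptions, not the package's actual code.

    # Hypothetical registration of a preauthentication (preauth) filter
    # with the ACS 4.0 request processor. The filter runs before
    # authentication, so robots are caught before they hit a login page.
    ad_register_filter preauth GET  "/members/*" robot_detection_filter
    ad_register_filter preauth HEAD "/members/*" robot_detection_filter

In practice the filtered paths would come from a package parameter rather than being hard-coded, so the site-wide administrator could change them without touching code.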
Use-cases and User-scenarios
The Web Robot Detection package is not meant to be used by
regular users. Instead, the site-wide administrator is responsible
for mapping directories not accessible to search engines to a
"robot heaven", which has been set up to provide content suitable
for indexing.
The site-wide administrator would typically download the
robot-detection package, install it with the APM (ArsDigita Package
Manager), set up its parameters, check the administration page to
review the current parameters and the list of identifiable robots,
build the "robot heaven", and verify that the whole setup works.
Afterward, this package requires no additional maintenance.
A software robot making an HTTP request to a part of the site that
requires login is automatically redirected to the "robot heaven".
Related Links
Test Cases
Requirements: Data Model
10.10.0 Store information about robots
10.10.5 A primary key to identify each individual robot
10.10.7 Fields for the UI: the robot's name and a URL where more information about the robot can be found
10.10.10 The User-Agent header value, needed to determine whether a connection is coming from a robot
10.10.15 An insertion date, to keep track of when the robots table was last refreshed (a storage sketch follows this list)
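The fragment below sketches how a single robot record covering these fields might be stored through the ACS Tcl database API. The table and column names (robots, robot_id, robot_name, robot_details_url, robot_useragent, robot_insertion_date) are illustrative assumptions chosen to mirror requirements 10.10.5 through 10.10.15, not the package's actual schema, and sysdate assumes an Oracle backend.

    # Hypothetical example data for one well-known robot.
    set robot_id          1
    set robot_name        "Googlebot"
    set robot_details_url "http://www.google.com/bot.html"
    set robot_useragent   "Googlebot"

    # Insert the record; bind variables (:robot_id etc.) are picked up
    # from the Tcl variables above by the ACS database API.
    db_dml insert_robot {
        insert into robots
            (robot_id, robot_name, robot_details_url,
             robot_useragent, robot_insertion_date)
        values
            (:robot_id, :robot_name, :robot_details_url,
             :robot_useragent, sysdate)
    }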
Requirements: API
20.10.10 On server restart, check when the robot information was last refreshed; if it is older than the interval specified by a package parameter, run the refresh procedure (20.20.0) to update the robot information in the database
20.10.15 On server restart, if the robot information is not present in the database, run the refresh procedure (20.20.0) to obtain it
20.20.0 A way to automatically gather information about web robots from a website that maintains such data
20.30.0 A way to detect robots based on the User-Agent field of the robot's HTTP request header
20.40.0 A way to redirect an identified robot to another path on the same site (see the filter sketch after this list)
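Requirements 20.30.0 and 20.40.0 together suggest a request-processor filter along the lines of the sketch below. This is a rough sketch only: the procedure name, the robots table and its columns, and the hard-coded /robot-heaven/ target are assumptions (the target would really come from a package parameter), and the matching rule shown is a simple prefix comparison.

    # Hypothetical preauth filter: if the request's User-Agent header
    # matches a known robot, send the request to the "robot heaven".
    ad_proc robot_detection_filter { why } {
        Redirect identified robots to the robot heaven.
    } {
        set useragent [ns_set iget [ns_conn headers] "User-Agent"]

        # Count rows in the (assumed) robots table whose stored
        # User-Agent string is a prefix of the incoming header.
        set robot_count [db_string robot_check {
            select count(*)
            from robots
            where lower(:useragent) like lower(robot_useragent) || '%'
        }]

        if { $robot_count > 0 } {
            ns_returnredirect "/robot-heaven/"
            # Tell the request processor the response is complete.
            return filter_return
        }

        # Not a robot: let the request continue as usual.
        return filter_ok
    }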
Requirements: Site Administrator Interface
30.10.0 Display the current package parameters
30.20.0 Display the list of robots known to the system (see the listing sketch after this list)
30.30.0 A way to refresh the robot list on demand by calling the procedure (20.20.0) that refreshes the robot information in the database
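A fragment like the one below could produce the robot listing for 30.20.0, again assuming the hypothetical robots table and columns used in the earlier sketches.

    # Hypothetical admin-page fragment: build an HTML table listing the
    # robots currently known to the system.
    set rows ""
    db_foreach robot_list {
        select robot_name, robot_details_url, robot_useragent
        from robots
        order by robot_name
    } {
        append rows "<tr><td>$robot_name</td>"
        append rows "<td><a href=\"$robot_details_url\">details</a></td>"
        append rows "<td>$robot_useragent</td></tr>\n"
    }
    set robot_table "<table>$rows</table>"

The listing would normally be wrapped in the site's standard admin page template; only the query loop is shown here.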
Revision History
Revision   Action Taken, Notes                                                   When          By Whom
0.4        Added more detailed requirements, based on suggestions from Kai Wu    2001-01-23    Roger Hsueh
0.3        Revised document based on comments from Kevin Scaldeferri             2000-12-13    Roger Hsueh
0.2        Revised document based on comments from Michael Bryzek                2000-12-07    Roger Hsueh
0.1        Created initial version                                               2000-12-06    Roger Hsueh
Last modified: $Date: 2001/04/20 20:51:22 $