Web Robot Detection Design Documentation
By Roger Hsueh
Essentials
ACS Administrator directory:
/robot-detection/admin
Tcl API
robot-detection-procs.tcl
robot-detection-init.tcl
PL/SQL file
robot-detection.create.sql
Introduction
The goal of this package is to expose to search engines parts of the site
that they normally cannot index. The package
accomplishes this by performing the following functions:
Storing and maintaining information about known robots on the
web (found at the Web Robots Database) in a robots table.
Using a postauth ad_register_filter to identify robots
connecting to areas covered by robot detection by matching the
User-Agent header against the known robots (a minimal sketch
follows this list).
Redirecting the robots to a special area on the site.
This special area can be either a static page or a dynamic script
that generates text from the database suitable for
search-engine indexing.
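Here is a minimal sketch of that postauth filter, assuming the documented
RedirectURL parameter and the package's robot_p procedure; it is not the
package's actual source, and the parameter lookup is shown schematically.

    # Minimal sketch of the redirect filter (not the actual package code).
    proc ad_robot_filter_sketch { conn args why } {
        # Identify the client by its User-Agent request header.
        set useragent [ns_set iget [ns_conn headers] User-Agent]
        if { [robot_p $useragent] } {
            # Known robot: send it to "robot heaven" and stop
            # further processing of this request.
            ad_returnredirect [ad_parameter RedirectURL robot-detection]
            return filter_return
        }
        # Not a robot: let the request proceed normally.
        return filter_ok
    }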
Historical Considerations
Previous versions of the robot detection package relied on the
AOLserver procedure ns_register_filter, which is
deprecated as of ACS 4. The robot detection package now uses the
analogous ad_register_filter provided by the request
processor in ACS 4.
The queries have been rewritten to use the ACS 4 Database Access API.
Other than that, this version is a straightforward port.
Competitive Analysis
On Apache, it's possible to
accomplish more or less the same thing with mod_rewrite,
but that's a low-level module, and it takes a fair amount of
configuration to make it behave like the robot detection package.
Design Tradeoffs
Only suitable for ACS administrator use.
What the web robots from search engines index depends on the
content in the directory specified by the RedirectURL
parameter ("robot heaven"). Right now there's no automatic way to
generate the static (or pseudo-static) files in "robot heaven." The
administrator should decide what sort of content to expose to
search engines and generate the files accordingly.
To keep things simple, admin page permissions and parameter
settings depend on the site-mapper. In particular, setting multiple
values for the FilterPattern parameter can be kind of clumsy.
Because ad_parameter only allows one value per key at this time, I
chose to use a CSV (comma-separated values) string to
represent the list of paths to be filtered. Fortunately, once
robot-detection is set up, it doesn't require further
administration.
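To illustrate, the CSV value might be unpacked and handed to
ad_register_filter along these lines; the package key given to
ad_parameter here is an assumption, and the actual registration in
robot-detection-init.tcl may differ.

    # Sketch: split the FilterPattern CSV into individual paths and
    # register a postauth filter for each one.
    set csv [ad_parameter FilterPattern robot-detection]
    foreach pattern [split $csv ","] {
        ad_register_filter postauth GET [string trim $pattern] ad_robot_filter
    }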
Refreshing the robots table entails dropping all rows from
the robots table and then re-populating it from scratch (see the
sketch at the end of this section). This
ensures synchronization with the data from the Web Robots
Database.
The raw data comes from one static file that must be retrieved
in its entirety over HTTP. However, building and maintaining a real
RDBMS-backed registry of web robots is not worth the effort, because
the number of robots is unlikely to grow very much.
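In outline, the refresh might look like the sketch below, assuming
AOLserver's ns_httpget for the fetch. Wrapping the delete and the inserts
in a single transaction keeps readers from ever seeing a half-empty
table. parse_robots_sketch is a hypothetical helper; a sketch of it
appears after the API list below.

    # Sketch only, not the package's actual ad_replicate_web_robots_db.
    proc robot_refresh_sketch {} {
        # Fetch the entire flat file from the Web Robots Database.
        set raw [ns_httpget [ad_parameter WebRobotsDB robot-detection]]
        db_transaction {
            # Drop all rows, then re-populate from scratch.
            db_dml delete_robots "delete from robots"
            foreach record [parse_robots_sketch $raw] {
                array unset fields
                array set fields $record
                set robot_id          $fields(robot-id)
                set robot_name        $fields(robot-name)
                set robot_details_url $fields(robot-details-url)
                set robot_useragent   $fields(robot-useragent)
                db_dml insert_robot {
                    insert into robots
                        (robot_id, robot_name, robot_details_url, robot_useragent)
                    values
                        (:robot_id, :robot_name, :robot_details_url, :robot_useragent)
                }
            }
        }
    }

A record missing one of these fields would raise a Tcl error here, which
matches the documented behavior of letting the caller catch errors from
the replication proc.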
API
The API for robot detection is quite simple.
(You can also use the API Browser to view it.)
ad_cache_robot_useragents
Caches "User-Agent" values for known robots.
ad_replicate_web_robots_db
Replicates data from the Web Robots Database into a table in the
database. The data is
published on the Web as a flat file, whose format is specified in
http://info.webcrawler.com/mak/projects/robots/active/schema.txt.
Basically, each non-blank line of the database corresponds to one
field (name-value pair) of a record that defines the
characteristics of a registered robot. Each record has a "robot-id"
field as a unique identifier. (There are many fields in the schema,
but, for now, the only ones we care about are robot-id,
robot-name, robot-details-url, and robot-useragent.) Returns the
number of rows replicated. May raise a Tcl error that should be
caught by the caller. (A parsing sketch appears after this list.)
ad_robot_filter conn args why
A filter to redirect any recognized robot to a specified page.
ad_update_robot_list
Updates the robots table if it is empty or if the number of
days since it was last updated exceeds the
RefreshIntervalDays configuration parameter.
robot_exists_p robot_id
Returns 1 if a row already exists in the robots table with the
specified "robot_id", 0 otherwise.
robot_p useragent
Returns 1 if the useragent is recognized as a search engine, 0
otherwise.
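The flat-file format described under ad_replicate_web_robots_db suggests
a parser along the lines of the sketch below; this is the hypothetical
parse_robots_sketch helper referenced earlier, not the package's actual
implementation. It assumes blank lines separate records and each
non-blank line holds one "field-name: value" pair.

    # Sketch: turn the raw flat file into a list of {name value ...}
    # pairs, one list element per robot record.
    proc parse_robots_sketch { raw } {
        set records [list]
        foreach line [split $raw "\n"] {
            set line [string trim $line]
            if { $line == "" } {
                # Blank line: the current record (if any) is complete.
                if { [info exists current] && [array size current] > 0 } {
                    lappend records [array get current]
                    array unset current
                }
            } elseif { [regexp {^([^:]+):(.*)$} $line match name value] } {
                set current([string trim $name]) [string trim $value]
            }
        }
        # Don't lose a final record that isn't followed by a blank line.
        if { [info exists current] && [array size current] > 0 } {
            lappend records [array get current]
        }
        return $records
    }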
Data Model Discussion
The data model has one table: robots. It mirrors a
few critical fields of the schema laid out in the
Web Robots Database Schema. Only robot_id,
robot_name, robot_details_url,
and robot_useragent are used in the current
implementation.
robot_id is not generated from a sequence in
Oracle; rather, it is the robot-id assigned by the
Web Robots Database to each individual robot.
robot_name and robot_details_url are
used to display the list of robots on the
Robot Detection Administration page.
robot_useragent is the most important
field of the bunch. An incoming HTTP request is identified as a robot if
its User-Agent header matches one of the
robot_useragent values in the robots table.
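For concreteness, the table might look roughly like the following; the
real definition lives in robot-detection.create.sql, and the column
types and sizes below are assumptions.

    -- Sketch only; see robot-detection.create.sql for the real definition.
    create table robots (
        -- identifier assigned by the Web Robots Database, not a sequence
        robot_id          varchar2(30)
                          constraint robots_pk primary key,
        -- shown on the Robot Detection Administration page
        robot_name        varchar2(200),
        robot_details_url varchar2(200),
        -- matched against the User-Agent header of incoming requests
        robot_useragent   varchar2(200) not null
    );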
Transactions
On server restart, the robots table is refreshed with
the raw data from the Web Robots Database if the table's age exceeds
the number of days specified in the
RefreshIntervalDays parameter. ACS Administrators can
also refresh the robots table manually from the
Robot Detection Administration page.
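A sketch of the corresponding startup logic, as
robot-detection-init.tcl plausibly arranges it (the file's actual
contents may differ):

    # Sketch: run once at server startup.
    # Refresh the robots table only if it is empty or older than
    # RefreshIntervalDays, then cache the known User-Agent values.
    ad_update_robot_list
    ad_cache_robot_useragents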
User Interface
There is one page for ACS Administrators. It displays the
current parameters and a list of robots, plus a link to manually
refresh the robots table.
Configuration/Parameters
Name: WebRobotsDB
Description: the URL of the Web Robots DB text file
Default: http://info.webcrawler.com/mak/projects/robots/active/all.txt

Name: FilterPattern
Description: a CSV string containing the URLs for ad_robot_filter to check
Default: /members-only-stuff/*

Name: RedirectURL
Description: the URL where robots should be sent
Default: /robot-heaven/

Name: RefreshIntervalDays
Description: how frequently (in days) the robots table should be refreshed from the Web Robots DB
Default: 30
Future Improvements/Areas of Likely Change
Authors
System creator: Michael Yoon
System owner: Roger Hsueh
Documentation author: Roger Hsueh
Revision History
Document Revision #  Action Taken, Notes                                         When?       By Whom?
0.3                  Revised document based on comments from Kevin Scaldeferri   2000-12-13  Roger Hsueh
0.2                  Revised document based on comments from Michael Bryzek      2000-12-07  Roger Hsueh
0.1                  Created initial version                                     2000-12-06  Roger Hsueh