Search

Content Repository : Developer Guide

The content repository provides a consistent sitewide interface for searching content. It uses Intermedia to index the content column of cr_revisions, as well as all the attribute columns for each content type.

Searching Content

The content column in cr_revisions may contain data in any text or binary format. To accommodate searches across multiple file types, the content repository uses an Intermedia index with the INSO filtering option. The INSO filter automatically detects the file type of a binary object and extracts text from it for indexing. Most common file types are supported, including PDF, Microsoft Word, Excel, and PowerPoint.

Searching for content requires the same syntax as any text index:

select
  score(1), revision_id, item_id
from
  cr_revisions r
where
  contains(content, 'company', 1) > 0

The above query may be useful for an administrative interface where you wish to search across all revisions, but in most cases you only want to search live revisions:

select
  score(1), revision_id, item_id, content_item.get_path(item_id) url, title
from
  cr_revisions
where
  contains(content, 'company', 1) > 0
and
  revision_id = content_item.get_live_revision(item_id)

The URL and title may be used to construct a hyperlink directly to the item.

You may implement any number of variants on this basic query to place additional constraints on the results, such as publication date, content type, subject heading or a particular attribute (see below).
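As an illustration of one such variant, the following Java sketch assembles the live-revision query with an optional content type constraint and collects the bind values for a JDBC PreparedStatement. The class LiveSearchQuerySketch, its fields, and the content_type column on cr_items are assumptions made for this example; check them against your data model before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the content repository API.
class LiveSearchQuerySketch {
    final String sql;          // SQL text with ? placeholders
    final List<Object> binds;  // bind values, in placeholder order

    private LiveSearchQuerySketch(String sql, List<Object> binds) {
        this.sql = sql;
        this.binds = binds;
    }

    // Builds the live-revision search; contentType may be null for "any type".
    static LiveSearchQuerySketch build(String searchTerm, String contentType) {
        List<Object> binds = new ArrayList<>();
        StringBuilder sql = new StringBuilder(
            "select score(1), r.revision_id, r.item_id, "
          + "content_item.get_path(r.item_id) url, r.title "
          + "from cr_revisions r, cr_items i "
          + "where contains(r.content, ?, 1) > 0 "
          + "and i.item_id = r.item_id "
          + "and r.revision_id = content_item.get_live_revision(r.item_id)");
        binds.add(searchTerm);
        if (contentType != null) {
            // Assumption: cr_items carries the item's content type in content_type.
            sql.append(" and i.content_type = ?");
            binds.add(contentType);
        }
        sql.append(" order by score(1) desc");
        return new LiveSearchQuerySketch(sql.toString(), binds);
    }

    public static void main(String[] args) {
        LiveSearchQuerySketch q = build("company", "press_release");
        System.out.println(q.sql);
        System.out.println(q.binds);
    }
}
```

Keeping the search term and constraints as bind values rather than concatenating them into the SQL text avoids quoting problems when users type the query themselves.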

Some limitations of the current implementation include:

Searching Attributes

This task is primarily handled by two Intermedia indices:

Providing a generic mechanism for searching attributes is complicated by the fact that the attributes for each content type are different. The content repository takes advantage of the XML features in Oracle 8.1.6 to address this:

  1. After creating a new revision and inserting attributes into the storage table for the content type and all its ancestors, you must execute the content_revision.index_attributes procedure. (Note that this cannot be called automatically by content_revision.new, since the attributes in all extended storage tables must be inserted first).

  2. This procedure creates a row in the cr_revision_attributes table, and writes an XML document including all attributes into this row. A Java stored procedure using the Oracle XML Parser for Java v2 is used to actually generate the XML document.

  3. A special Intermedia index configured to parse XML documents is built on the column containing the XML documents for all revisions.
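To make step 2 more concrete, here is a minimal Java sketch of what a flat attribute document could look like. It uses the JDK's built-in DOM and Transformer classes rather than the Oracle XML Parser for Java v2, and the class name AttributeXmlSketch, the root element name revision, and the sample attributes are all hypothetical. The actual stored procedure may structure its output differently.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch, not the repository's actual stored procedure.
class AttributeXmlSketch {

    // Builds a flat XML document: one child element per attribute name.
    static String toXml(String rootName, Map<String, String> attributes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement(rootName);
            doc.appendChild(root);
            for (Map.Entry<String, String> e : attributes.entrySet()) {
                Element child = doc.createElement(e.getKey());
                child.setTextContent(e.getValue()); // DOM escapes &, < on output
                root.appendChild(child);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("could not build attribute XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("title", "Company Overview");
        attrs.put("description", "Profile of the company");
        System.out.println(toXml("revision", attrs));
    }
}
```

Because each attribute becomes its own element, an XML-aware text index can treat the element name as a section, which is what makes per-attribute searches possible.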

The Intermedia index allows you to use the WITHIN operator to search on individual attributes if desired.

select
  revision_id, score(1)
from
  cr_revision_attributes
where
  contains(attributes, 'company WITHIN title', 1) > 0

Some limitations of the current implementation include:

  1. A USER_DATASTORE associated with each row of the cr_items table, which feeds Intermedia the contents of the content column (a BLOB) for the live revision of an item. This should theoretically be more efficient for searching live content, especially in production environments where content is revised often.
  2. A second USER_DATASTORE associated with each row of the cr_items table, which feeds Intermedia the XML document representing all attributes for the live revision of an item (from cr_revision_attributes).
  3. The default XML document handler for the content repository simply provides a flat file of all attributes. Content types should also be able to implement custom handlers, to allow the XML document to reflect one-to-many relationships or special formatting of attributes as well. The handler should specify a Java class and method, which a dispatch method can call by reflection.
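The custom handler idea in the last item could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing API: HandlerDispatchSketch, PressReleaseHandler, the toXml method signature, and the keywords attribute are all hypothetical names chosen for the example. The dispatch method looks up a handler by class and method name and invokes it by reflection.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of reflective handler dispatch.
class HandlerDispatchSketch {

    // Hypothetical custom handler: expands a comma-separated "keywords"
    // attribute into repeated elements (a one-to-many relationship).
    // Expects "title" and "keywords" to be present; values are written
    // without XML escaping to keep the sketch short.
    public static class PressReleaseHandler {
        public static String toXml(Map<String, String> attributes) {
            StringBuilder xml = new StringBuilder("<revision>");
            xml.append("<title>").append(attributes.get("title")).append("</title>");
            xml.append("<keywords>");
            for (String k : attributes.get("keywords").split(",")) {
                xml.append("<keyword>").append(k.trim()).append("</keyword>");
            }
            xml.append("</keywords></revision>");
            return xml.toString();
        }
    }

    // Calls a public static method (Map) -> String named by class and method.
    static String dispatch(String className, String methodName,
                           Map<String, String> attributes) {
        try {
            Class<?> handler = Class.forName(className);
            Method method = handler.getMethod(methodName, Map.class);
            return (String) method.invoke(null, attributes); // null: static call
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "cannot invoke " + className + "." + methodName, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("title", "Q3 Results");
        attrs.put("keywords", "earnings, finance");
        System.out.println(
            dispatch(PressReleaseHandler.class.getName(), "toXml", attrs));
    }
}
```

Storing only the class and method names with each content type keeps the dispatcher generic: new handlers can be added without changing the dispatch code.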

karlg@arsdigita.com
Last Modified: $Id: search.html,v 1.1 2002/07/09 17:34:57 rmello Exp $