Working With Intermediapart of ACS lessons by Michael Bryzek |
If you're like me, you love the idea of Intermedia - full text indexing, decent performance, tight integration with sql. But, if you're like me, Intermedia is the part of Oracle that doesn't let you sleep at night.While working on GuideStar with a lot of other really good programmers, we decided to use Intermedia to add a keyword search box to our site. Following Phil's recommendation, we built a Site-wide Search, that we called search. We were going to use Intermedia to index a columm, keywords, in that table. Sounds simple, and it would have been, except that our data set contained approximately 700,000 rows, 3,000-5,000 of which were updated daily. (Note: Alpha, one of the guys who wrote Intermedia, says 700,000 rows is nothing for Intermedia. I, however, disagree.)
Creating Intermedia Indexes
Once we had populated the table search with a pl/sql procedure, we ranSQL> create index search_im_keyword_idx on search(keywords) indextype is ctxsys.context;to create an intermedia index on search(keywords). Note that the "im" in the name of the index is just there to help me identify Intermedia indexes more easily. About 2 1/2 hours later (on a Sun E450 with 4 processors and 4gb of RAM) we saw:Index created.And we were in business. This one line sql command was enough to give us the performance we wanted:SQL> set timing on SQL> select count(1) from search where contains(keywords, 'guidestar', 1) > 1 order by score(1); COUNT(1) ---------- 3 Elapsed: 00:00:00.27Updating Intermedia Indexes
Unlike normal oracle indexes, intermedia indexes are not updated after the execution of dml statements. This means that we manually have to update the Intermedia indexes we are using. There are at least two ways to update these indexes:Since the context server is deprecated, we use the second method. From start to finish, the update of our Intermedia index looks like:
- Run the context server to automatically update your indexes. To start context server, get yourself a unix shell and run:
> ctxsrv -user ctxsys/ctxsyspwd -personality M -log $ORACLE_HOME/ctx/ctxsys.log &where ctxsyspwd is the password for the oracle user ctxsys. As it good as it sounds, running the context server is not the recommend method to keep the indexes up-to-date and the context server is already deprecated.- Manually update the index on a regular basis. To update an intermedia index, open a sqlplus session as the user who created the index. Then:
SQL> alter index search_im_keyword_idx rebuild online parameters('sync memory 45M'); Index altered.The parameters of this call:
- rebuild online: Rebuild the index incrementally so that it is still accessible. If you forget to specify "online," users searching the website would not be able to access the index.
- sync: Just update the index (rather than rebuild it entirely)
- memory 45M: Specify the amount of memory, in megabytes, that Intermedia can use while rebuilding the index. Using more memory decreases the amount of disk I/O and thus increases the speed with which the index can be rebuilt. The default is 12 megabytes and the acceptable range of inputs for memory is 1 KB to about 50 MB. To increase the maximum allowable index memory:
SQL> exec ctx_adm.set_parameter('max_index_memory', '1G');- More information available in the Oracle Documentation
When manually altering intermedia indexes, you have to make sure that no more than 1 of these alter index statements is running at any given time. More than 1 statement could result in a deadlock inside intermedia that can only be solved by dropping and recreating the index.
- We have triggers on all the tables that feed into the site wide index. Whenever a row in one of these tables is updated, the "dirty bit" in the site wide search table is set
- Every hour we run a tcl procedure that updates the dirty rows in the search table by calling a pl/sql procedure. The aolserver call looks like:
ns_ora exec_plsql $db "begin :1 := 1; keyword_datastore_gasket('$ein'); end;"where the :1 := 1 is just a hack to work around exec_plsql behavior- Every hour we also run a tcl procedure that executes an "alter index" statement to update the intermedia indexes:
ns_db dml $db "alter index search_im_keyword_idx rebuild online parameters('sync memory 45M')"
Useful Intermedia Trivia
If you ever wanted to know how many rows are waiting to be synced with each of your Intermedia indexes, you can do:SQL> select PND_INDEX_NAME, count(1) from ctx_user_pending group by PND_INDEX_NAME;You can also look at Intermedia errors to find out what went wrong by opening a sqlplus session as the ctxsys user and running:SQL> select err_timestamp, err_index_name, err_text from ctx_index_errors order by err_timestamp;to get a list of all the errors on intermedia indexes, sorted by date.
Where the Problems Start
A few times the update of the intermedia index failed. One night as I was headed to bed, I decided to update the index overnight.SQL> alter index search_im_keyword_idx rebuild online parameters('sync memory 45M');About two hours into my dreams, the phone rang to wake me up - all searches were failing on the website. Users were simply getting back a page with zero results, regardless of their query. About fifteen minutes later, I'm back at the computer where I ran the update index statement. I saw that my index was in a FAILED/LOADING state. Had I been smart, I would have logged what was going on in context by executing:SQL> exec ctx_output.start_log('better_days');before running my query to log its progress. This starts context logging to $ORACLE_HOME/ctx/log/better_days.Whenever an index update fails, try a resume statement that should finish up the work from before:
SQL> alter index search_im_keyword_idx rebuild online parameters('resume memory 45M');The resume statement has never worked for me but maybe you'll have better luck. Probably, you'll just see the same error message you had before. Somehow the index has become corrupted:SQL> drop index search_im_keyword_idx force; SQL> create index search_im_keyword_idx on search(keywords) indextype is ctxsys.context;And 2 1/2 hours later, the index was re-created and back on-line. Meanwhile, our website was not serving search results for over three hours.What Went Wrong?
The hourly procedure running inside AOLServer to update the Intermedia index started up in parallel to the one I manually executed in sqlplus. These two threads competed for some resource, and ended up in a deadlock.
I have also seen the same error while running context server and executing the an alter index statement in sqlplus. Alpha tells me that the deadlock shouldn't happen, and I agree with him. However, the reality of the deadlock gives us the following Intermedia law: Use exactly one method to update Intermedia indexes. It is also worth mentioning that an uncommitted sql session can keep the alter index statement from completing.
Note that the two threads were able to become deadlocked simply because we had a lot of data. On a small data set, it is much less likely that two threads would be running at once since each thread would complete much more quickly.
Using Intermedia on the web
The queries humans write will most certainly not be Intermedia friendly. The following procedure provided by some folks at Intermedia takes text written by a human and returns a string suitable for feeding into Intermedia:- -- We will assume that the users are not going to use any intermedia text syntax - -- So all intermedia text keywords will be braced to escape them create or replace function im_convert( query in varchar2 default null ) return varchar2 is i number :=0; len number :=0; char varchar2(1); minusString varchar2(256); plusString varchar2(256); mainString varchar2(256); mainAboutString varchar2(256); finalString varchar2(256); hasMain number :=0; hasPlus number :=0; hasMinus number :=0; token varchar2(256); tokenStart number :=1; tokenFinish number :=0; inPhrase number :=0; inPlus number :=0; inWord number :=0; inMinus number :=0; completePhrase number :=0; completeWord number :=0; code number :=0; begin len := length(query); - -- we iterate over the string to find special web operators for i in 1..len loop char := substr(query,i,1); if(char = '"') then if(inPhrase = 0) then inPhrase := 1; tokenStart := i; else inPhrase := 0; completePhrase := 1; tokenFinish := i-1; end if; elsif(char = ' ') then if(inPhrase = 0) then completeWord := 1; tokenFinish := i-1; end if; elsif(char = '+') then inPlus := 1; tokenStart := i+1; elsif((char = '-') and (i = tokenStart)) then inMinus :=1; tokenStart := i+1; end if; if(completeWord=1) then token := '{ '||substr(query,tokenStart,tokenFinish-tokenStart+1)||' }'; if(inPlus=1) then plusString := plusString||','||token||'*10'; hasPlus :=1; elsif(inMinus=1) then minusString := minusString||'OR '||token||' '; hasMinus :=1; else mainString := mainString||' NEAR '||token; mainAboutString := mainAboutString||' '||token; hasMain :=1; end if; tokenStart :=i+1; tokenFinish :=0; inPlus := 0; inMinus :=0; end if; completePhrase := 0; completeWord :=0; end loop; -- find the last token token := '{ '||substr(query,tokenStart,len-tokenStart+1)||' }'; if(inPlus=1) then plusString := plusString||','||token||'*10'; hasPlus :=1; elsif(inMinus=1) then minusString := minusString||'OR '||token||' '; hasMinus :=1; else mainString := mainString||' NEAR '||token; mainAboutString := mainAboutString||' '||token; hasMain :=1; end if; mainString := substr(mainString,6,length(mainString)-5); mainAboutString := replace(mainAboutString,'{',' '); mainAboutString := replace(mainAboutString,'}',' '); plusString := substr(plusString,2,length(plusString)-1); minusString := substr(minusString,4,length(minusString)-4); -- we find the components present and then process them based on the specific combinations code := hasMain*4+hasPlus*2+hasMinus; if(code = 7) then finalString := '('||plusString||','||mainString||'*0.1,about('||mainAboutString||')*0.5) NOT ('||minusString||')'; elsif (code = 6) then finalString := plusString||','||mainString||'*0.1'||',about('||mainAboutString||')*0.5'; elsif (code = 5) then finalString := '('||mainString||',about('||mainAboutString||')) NOT ('||minusString||')'; elsif (code = 4) then finalString := mainString; finalString := replace(finalString,'*1,',NULL); finalString := '('||finalString||')*0.1,about('||mainAboutString||')'; elsif (code = 3) then finalString := '('||plusString||') NOT ('||minusString||')'; elsif (code = 2) then finalString := plusString; elsif (code = 1) then -- not is a binary operator for intermedia text finalString := 'totallyImpossibleString'||' NOT ('||minusString||')'; elsif (code = 0) then finalString := ''; end if; return finalString; end; /Words of Caution: Oracle 8.1.5.0 has a bug in the parser for the about clause. The bug seems to incorrectly map multi word queries to their corresponding tokens in the Intermedia tables. For example, about('massachusetts bay'), which is tokenized as 'Massachusetts Bay', is parsed as 'Massachusetts bay', not matching any row in the Intermedia tokens table and thereby throwing an error. If you find yourself in this situation, simply get rid of the about clauses in the above pl/sql function.Optimizing Intermedia
As you continue to update your intermedia index, the tables Intermedia maintains will become more and more fragmented. Once in awhile, you might want to optimize your index by executing the following from sqlplus:SQL> ALTER INDEX newsindex rebuild parameters('optimize fast');Optimization is an extremely slow process and it's not clear that the performance gain of an optimized index will be noticeable. If you are going to optimize, you should limit the amount of time that the optimization will take (180 = 180 minutes):SQL> ALTER INDEX newsindex rebuild parameters('optimize full maxtime 180');The Intermedia documentation has more information about optimizing indexes.Optimizing Web Searches with Intermedia
Unlike "normal" SQL queries, Intermedia actually handles the FIRST_ROWS hint very well. This means that you can sort large data sets incrementally (as opposed to sorting the entire data set to return the first 25 rows in sorted order):SQL> set timing on SQL> select /*+ FIRST_ROWS */ org_name from search where contains(keywords, 'philanthropic', 1) > 0 order by score(1); 250 rows selected. Elapsed: 00:00:00.42The search is optimized to return the first rows from the cursor. We could now display multi-page search results for organization names that matched the query for "philanthropic"... in .42 seconds, including the time my sqlplus session took to print out the org names. It is important to know that Intermedia does a better job with FIRST_ROWS than Oracle which leads us to second Intermedia law: When sorting through a large dataset, order by score(n) rather than a column not indexed by Intermedia.