Oracle® Secure Enterprise Search Administrator's Guide 10g Release 1 (10.1.6) Part Number B19002-02
In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, Oracle Secure Enterprise Search (SES) can also be easily monitored through the following URL: http://<host>:<port>/monitor/check.jsp. The URL should return the following message: Oracle Secure Enterprise Search instance is up.
Note: This message is not translated to other languages, because system monitoring tools may need to byte-compare this string.
If Oracle SES is not available, then the URL returns either a connection error or the HTTP error code 503.
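A monitoring script can apply the same rule. The following Python sketch is only an illustration: the function names are hypothetical, and it checks for the message as a substring (without the trailing period) rather than doing a strict byte comparison. Connection errors and HTTP errors such as 503 are both reported as "down".

```python
import urllib.error
import urllib.request

# Message text from the monitoring page. The trailing period is left out on
# purpose so that the substring check works either way (an assumption).
EXPECTED_MESSAGE = "Oracle Secure Enterprise Search instance is up"


def is_ses_up(status: int, body: str) -> bool:
    """Return True only for HTTP 200 with the expected message in the body."""
    return status == 200 and EXPECTED_MESSAGE in body


def check_ses(host: str, port: int, timeout: float = 5.0) -> bool:
    """Fetch /monitor/check.jsp and report whether the instance is up."""
    url = f"http://{host}:{port}/monitor/check.jsp"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_ses_up(response.status, body)
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and HTTP error codes such as 503.
        return False
```

A load balancer health check or a scheduled job could call check_ses with the host and port of the Oracle SES middle tier and raise an alert when it returns False.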
Debug mode is useful for troubleshooting purposes. To turn on debug mode for the Oracle SES administration tool, update the search.properties file located in the $ORACLE_HOME/search/webapp/config directory. Set debug=true, and restart the Oracle SES middle tier. Set debug=false when you are done troubleshooting.
Note: $ORACLE_HOME represents the directory where Oracle SES was installed.
Debug information can be found in the OC4J log file:
Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.
However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.
This section contains the most common things to consider to improve crawl performance:
By default, Oracle SES is configured to crawl Web sites in the intranet. In other words, crawling internal Web sites requires no additional configuration. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information. See the Global Settings - Proxy Settings page. If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.
The seed URL you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule that only URLs containing the string www.example.com will be crawled.
However, suppose that the example Web site includes URLs starting with www.exa-mple.com or ones that start with example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.
Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.
In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.
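The effect of a set of inclusion rules can be checked before crawling with a small script. The following Python sketch assumes that a simple inclusion rule matches whenever the URL contains the rule string, as described above. The function name is_included is hypothetical and is not part of Oracle SES.

```python
from typing import Iterable


def is_included(url: str, inclusion_rules: Iterable[str]) -> bool:
    """Return True if the URL contains at least one inclusion rule string."""
    return any(rule in url for rule in inclusion_rules)


# Rule created automatically from the seed URL.
seed_only = ["www.example.com"]
# Rules widened by the administrator.
widened = ["www.example.com", "www.exa-mple.com", "investor.example.com"]

url = "http://investor.example.com/report.html"
print(is_included(url, seed_only))  # False
print(is_included(url, widened))    # True
```

With only the rule derived from the seed URL, the investor pages are excluded. The single rule example would include all three host names, but it would also match any other URL that contains that string.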
To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want the crawler to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules altogether. Do so carefully, because this could lead the crawler into a very large number of sites.
For file sources, if no boundary rule is specified, then crawling is limited to the underlying file system access privileges. Files accessible from the specified seed file URL will be crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl will pick up all files and directories under user_a with access privileges. It will not crawl any documents in the directory /home/user_a/level1/level2 because of the depth limit: the documents in the /home/user_a/level1/level2 directory are at level 3.
The file URL can be of UNC (universal naming convention) format. The UNC file URL has the following format: file://localhost///<LocalMachineName>/<SharedFolderName>.
For example, \\stcisfcr\docs\spec.htm should be specified as file://localhost///stcisfcr/docs/spec.htm.
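The conversion from a UNC path to this URL format is mechanical. The following Python sketch shows one way to do it. The helper name unc_to_file_url is hypothetical, and the sketch does not percent-encode spaces or non-ASCII characters.

```python
def unc_to_file_url(unc_path: str) -> str:
    """Convert a UNC path such as \\\\host\\share\\file to a file://localhost/ URL."""
    if not unc_path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + unc_path)
    # The two leading backslashes become "//", which follows "file://localhost/".
    return "file://localhost/" + unc_path.replace("\\", "/")


print(unc_to_file_url(r"\\stcisfcr\docs\spec.htm"))
# file://localhost///stcisfcr/docs/spec.htm
```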
On some machines, the path or file name may contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character may be encoded as %E3%81%82.
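The encoding described here is standard URL percent-encoding of the UTF-8 bytes. As an illustration only, the following Python sketch reproduces both examples with the standard library. The helper name encode_path is hypothetical, and the multibyte character used is U+3042, whose UTF-8 bytes are E3 81 82.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode a path as UTF-8, leaving '/' separators unchanged."""
    return quote(path, safe="/")


print(encode_path("/docs/Home Alone"))  # /docs/Home%20Alone
print(encode_path("\u3042"))            # %E3%81%82
```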
For file sources, spaces can be entered in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. If (Home Alone) is specified, then internally it is stored as (Home%20Alone). Oracle SES does this encoding for the following:
File source simple boundary rules
Test URL strings
File source seed URLs
Note: Oracle SES does not alter the rule if it is a regular expression rule. It is the administrator's responsibility to make sure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.
Indexing dynamic pages can generate an excessive number of URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling identical pages.
Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a crawling depth of 20 will probably crawl the whole WWW from most locations.
You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.
The following sample /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/, /tmp/, or /foo.html:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
If the Web site is under the user's control, then a specific robots rule can be tailored for the crawler by specifying the Oracle SES crawler plug-in name "User-agent: Oracle Secure Enterprise Search." For example:
User-agent: Oracle Secure Enterprise Search
Disallow: /tmp/
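To see how a robots.txt policy applies to specific URLs before crawling, the rules can be evaluated offline. The following Python sketch uses the standard urllib.robotparser module on the sample file shown earlier. It illustrates the robots exclusion convention in general and is not the Oracle SES crawler's own implementation. The user agent name ExampleCrawler is made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text lets user_agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/index.html", "/tmp/scratch.html", "/foo.html"):
    url = "http://www.example.com" + path
    print(path, allowed(ROBOTS_TXT, "ExampleCrawler", url))
```

Only /index.html is reported as allowed. The other two paths start with a disallowed prefix.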
The robots meta tag can tell the crawler whether to index a Web page and whether to follow the links within it. For example, the following tag tells the crawler to do neither:
<meta name="robots" content="noindex,nofollow">
If Oracle SES thinks a page is identical to one it has seen before, then it will not index it. If the page is reached through a URL that Oracle SES has already processed, then it will not index that either.
When a page redirects to another page, the crawler crawls only the page it is redirected to. For example, a Web site might have JavaScript redirecting users to another site with the same title. Only the redirected site is indexed.
Check for inclusion rules from redirects. How a redirect is handled depends on its type. There are three kinds of redirects defined in EQ$URL:
Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302 or 307). Temporary redirection means that the original URL should still be used in the future. Temporary redirects cannot be identified directly in the EQ$URL table; the only way to find them is to filter the other kinds of redirect out of the log file.
Permanent Redirect: For permanent redirection (HTTP status code 301), the redirected URL is subject to boundary rules. Permanent redirection means the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.
Meta Redirect: Metatag redirection is treated as a permanent redirect. A meta redirect has status code 954. It is always checked against boundary rules.
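The three cases can be summarized as a small decision function. The following Python sketch is a reading aid only: the function name is hypothetical, and it assumes that boundary rules are simple inclusion strings that match when the URL contains them. It does not reproduce the crawler's actual code.

```python
from typing import Iterable, Optional

TEMPORARY_STATUSES = {302, 307}
PERMANENT_STATUSES = {301}


def redirect_target_allowed(
    target_url: str,
    inclusion_rules: Iterable[str],
    http_status: Optional[int] = None,
    meta_redirect: bool = False,
) -> bool:
    """Decide whether the target of a redirect may be crawled."""
    if not meta_redirect and http_status in TEMPORARY_STATUSES:
        # Temporary redirects are always allowed.
        return True
    if meta_redirect or http_status in PERMANENT_STATUSES:
        # Permanent and meta redirects are checked against the boundary rules.
        return any(rule in target_url for rule in inclusion_rules)
    raise ValueError("not a recognized redirect")


rules = ["www.example.com"]
print(redirect_target_allowed("http://other.example.org/", rules, http_status=302))  # True
print(redirect_target_allowed("http://other.example.org/", rules, http_status=301))  # False
```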
URL looping refers to the scenario where, for some reason, a large number of unique URLs all point to the same document. One particularly difficult situation is where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily, this would not be a problem, because the crawler eventually analyzes all documents in the site.
However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.
For example, http://example.com/somedocument.html?p_origin_page=10 might refer to the same document as http://example.com/somedocument.html?p_origin_page=13 but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
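When analyzing a crawler log for this pattern, it can help to strip the tracking parameter and count how many distinct documents remain. The following Python sketch does this with the standard library. The function name and the choice of parameters to ignore are assumptions for illustration. This is an analysis aid, not a feature of the Oracle SES crawler.

```python
from typing import Iterable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, ignored: Iterable[str]) -> str:
    """Return the URL with the named query parameters removed."""
    ignored_names = set(ignored)
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored_names
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


urls = [
    "http://example.com/somedocument.html?p_origin_page=10",
    "http://example.com/somedocument.html?p_origin_page=13",
]
distinct = {strip_params(u, ["p_origin_page"]) for u in urls}
print(distinct)  # {'http://example.com/somedocument.html'}
```

Two crawled URLs collapse into one document, which is the signature of URL looping.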
Monitor the crawler statistics in the Oracle SES administration tool to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:
Exclude the Web Server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)
Reduce the Crawling Depth: This limits the number of levels of referred links the crawler will follow. If you are observing URL looping effects on a particular host, then you should take a visual survey of the site to find out an estimate of the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.
Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.
If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:
Check the crawler log file (there's a link on the schedule page, and the location of the full log on the schedule-status page).
Create a search source group (Search - Source Groups - Create New Source Group). Put only one source in the group. From the Search page, search that group. (Click the group name above the search box.) Or, from the Search page, click Browse Search Groups. Click the group name for a hierarchy. You could also click the number next to the group name for a list of the pages crawled.
This section contains suggestions on how to improve the response time and throughput performance of Oracle SES.
This section contains the most common things to consider to improve search performance:
Optimizing the index reduces fragmentation, and it may significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Make sure index optimization is scheduled during off-peak hours. Optimization of a very large index may take several hours.
See the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the administration tool.
The data in the cache directory continues to accumulate until it reaches the indexing batch size. When the size is reached, the data is indexed. The bigger the batch size, the less fragmentation in the index. However, the bigger the batch size, the longer it will take to index each batch. Only indexed data can be searched: data in the cache cannot be searched.
Set the indexing batch size on the Global Settings - Crawler Configuration page in the administration tool.
See the Home - Statistics page in the administration tool for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:
Refer users to a particular Web site for failed queries on the Search - Suggested Links page.
Fix common errors that users make in searching on the Search - Alternate Words page.
Make important documents easier to find on the Search - Relevancy Boosting page.
Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:
For a highly popular search, direct users to the best results
For a search that returns no results, direct users to some results
For a search that has no click-throughs, direct users to better results
In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes, there are documents that you know are highly relevant to some search. For example, your company Web site may have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for "XML". You would boost the score of that home page (http://example.com/XML-is-great.htm) to 100 for an "XML" search.
There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.
Note: The document still has a score computed if you enter a search that is not one of the boosted queries.
With relevancy boosting, comparison of the user's query against the boosted queries uses exact string matching. This means that the comparison is case-sensitive and space-aware. Therefore, a document with a boosted score for "Enterprise Search" is not boosted when you enter "search".
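The consequence of exact string matching can be illustrated with a plain dictionary lookup. The following Python sketch uses a made-up data structure (a mapping from boosted query to URL and score). It is not how Oracle SES stores boosted queries.

```python
from typing import Dict, Optional

# Hypothetical boost table: query string -> {URL: boosted score}.
BOOSTS: Dict[str, Dict[str, int]] = {
    "XML": {"http://example.com/XML-is-great.htm": 100},
}


def boosted_score(query: str, url: str) -> Optional[int]:
    """Return the boosted score, or None if the query is not an exact match."""
    return BOOSTS.get(query, {}).get(url)


url = "http://example.com/XML-is-great.htm"
print(boosted_score("XML", url))   # 100
print(boosted_score("xml", url))   # None (different case)
print(boosted_score("XML ", url))  # None (trailing space)
```

To cover common variants of a query, each variant has to be boosted separately.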
With federated search, the search is performed by a separate entity. That entity may be another Oracle SES application, or it may be a completely separate application. For example, a third-party content management system may contain its own text indexes. Instead of letting Oracle SES crawl the data in the third-party system, you can let it do the searching. The results are merged with any results from other sources that are searched at the same time and presented in the Oracle SES result list.
When using federated search on secure content, use special care to set up a secure federated search environment.
Note: Oracle SES supports 2-tier federated search. Federation of 3-tier or more is not currently supported.
Federated search characteristics:
Federated search can improve performance by distributing query and indexing processing on multiple machines. It can be an efficient way to scale up search service by adding a cluster of Oracle SES instances.
Federated search can be used to integrate with existing applications' search solutions.
The federated search query performance depends on the network topology and throughput of the entire federated Oracle SES environment.
Federated search is subject to the following limitations.
There is a size limit of 200KB for the cached documents existing on the remote Oracle SES instance to be displayed on the master node.
For infosource browse, if the source hierarchies for both local and federated sources under one source group start with the same top level folder, then only one of the hierarchies is available for browse.
There is no direct access to the documents on the remote server through the display URL in the search hitlist, except for the Web source documents. For all other sources, only the cached version of the document is accessible, if available.
This section provides an example of how to use federated search. The setup involves the following steps:
Step 1: Deploy the Oracle Secure Enterprise Search Federator
Step 2: Deploy and Set Up an Oracle Secure Enterprise Search Federated Source
Note: All steps should be performed on the instance where the federated source will be created. That is, the federator and the searchlet should be deployed on the master instance.
Update $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/config/application.xml to add the following library paths. Only one federator needs to be deployed for all Oracle SES searchlets (federated sources). These library paths must be added as child elements of the orion-application element:
<library path="../../../../search/lib/searchlet.jar"/>
<library path="../../../../search/lib/search_query.jar"/>
<library path="../../../../search/lib/search_midtier.jar"/>
<library path="../../../../search/lib/searchctl.jar"/>
<library path="../../../jlib/ldapjclnt10.jar"/>
<library path="../../../../jlib/ldapjclnt10.jar"/>
<library path="../../../../search/webapp/config"/>
<library path="../../../../jlib/orai18n.jar"/>
<library path="../../../../jlib/orai18n-mapping.jar"/>
<library path="../../../../jlib/orai18n-translation.jar"/>
<library path="../../../../oc4j/j2ee/home/jazn.jar"/>
<library path="../../../../oc4j/j2ee/home/jazncore.jar"/>
<library path="../../../../jlib/uix2.jar"/>
<library path="../../../../jlib/commons-el.jar" />
<library path="../../../../jlib/oracle-el.jar" />
<library path="../../../../jlib/jsp-el-api.jar" />
<library path="../../../../jlib/regexp.jar"/>
<library path="../../../../jlib/share.jar"/>
<library path="../../../../jlib/ohw.jar" />
<library path="../../../../sysman/jlib/ohw.jar"/>
Before deploying the Oracle SES federator, check the RMI port used in the middle tier. You can find the RMI port number in the $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/config/rmi.xml file as the value of the port attribute in the rmi-server element.
Note: The RMI port number is subject to change.
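Because the port can change, a deployment script may prefer to read it from rmi.xml each time rather than hard-code it. The following Python sketch extracts the port attribute of the rmi-server element from XML text. The sample XML and the port value in it are made up for the example, and the function name is hypothetical. A real rmi.xml file may contain additional elements.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the rmi-server element and its port attribute matter here.
SAMPLE_RMI_XML = """\
<rmi-server port="23791">
</rmi-server>
"""


def rmi_port_from_xml(xml_text: str) -> int:
    """Return the port attribute of the rmi-server element as an integer."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == "rmi-server" else root.find(".//rmi-server")
    if node is None or node.get("port") is None:
        raise ValueError("no rmi-server element with a port attribute")
    return int(node.get("port"))


print(rmi_port_from_xml(SAMPLE_RMI_XML))  # 23791
```

To use it against an installation, read the contents of $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/config/rmi.xml and pass the text to the function.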
The federator should be deployed on the instance where the federated source will be created. Run the following single command from a command shell on the Oracle SES host to deploy the federator:
$ORACLE_HOME/jdk/bin/java -jar $ORACLE_HOME/oc4j/j2ee/home/admin.jar ormi://localhost:<rmi_port_number> admin <admin_password> -deployconnector -file $ORACLE_HOME/search/adapter/federator_searchlet.rar -name Federator
Where:
<rmi_port_number> is the RMI port number
<admin_password> is the Oracle SES administrator password
The last output from the command should include the text "Connector Module Deployer for Federator COMPLETES" to signal success.
Note: $ORACLE_HOME represents the directory where Oracle SES was installed.
To set up an Oracle Secure Enterprise Search federated source, follow these steps:
Repeat these steps to set up more Oracle SES federated sources.
The searchlet should be deployed on the master instance, where the federated source will be created. Run the following single command from a command shell on the Oracle SES host to deploy the searchlet:
$ORACLE_HOME/jdk/bin/java -jar $ORACLE_HOME/oc4j/j2ee/home/admin.jar ormi://localhost:<rmi_port_number> admin <admin_password> -deployconnector -file $ORACLE_HOME/search/adapter/search_searchlet.rar -name <searchlet_name>
Where:
<rmi_port_number> is the RMI port number
<admin_password> is the Oracle SES administrator password
<searchlet_name> is any name you choose to use to identify the searchlet
The last output from the command should include the text "Connector Module Deployer for <searchlet_name> COMPLETES" to signal success.
Each Oracle SES searchlet must point to its federated slave instance. The host name and port of the slave Oracle SES instance are needed.
Update the $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/application-deployments/default/<searchlet_name>/oc4j-ra.xml file with the following:
Location of the connector-factory: eis/oracle/oracleSearch/<searchlet_name>
Value of the config-property webServiceURL: http://<slave_ses_host_name>:<port>/search/query/OracleSearch
Value of the config-property appsURLPath: http://<slave_ses_host_name>:<port>/search/query/
Note: If the slave is SSL-enabled, then the values for webServiceURL and appsURLPath should start with https instead of http.
Navigate to the Oracle SES administration tool Home - Sources page. Create a federated type source. Specify the source name and its JNDI name.
You can find the JNDI name from the Oracle SES searchlet resource adapter file $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/application-deployments/default/<searchlet_name>/oc4j-ra.xml.
The JNDI name is the value of the location attribute of the connector-factory element.
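The JNDI name and the two configured URLs can be read back from the file to confirm the searchlet setup. The following Python sketch parses XML text with the standard library. The sample XML is an assumption about the general shape of oc4j-ra.xml (a connector-factory element with a location attribute and config-property children carrying name and value attributes). The host name, port, and searchlet name in it are made up.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

# Assumed file shape, with made-up host, port, and searchlet name.
SAMPLE_OC4J_RA_XML = """\
<oc4j-connector-factories>
  <connector-factory location="eis/oracle/oracleSearch/mySearchlet" connector-name="mySearchlet">
    <config-property name="webServiceURL" value="http://slave.example.com:7777/search/query/OracleSearch"/>
    <config-property name="appsURLPath" value="http://slave.example.com:7777/search/query/"/>
  </connector-factory>
</oc4j-connector-factories>
"""


def read_searchlet_config(xml_text: str) -> Tuple[str, Dict[Optional[str], Optional[str]]]:
    """Return (JNDI name, config properties) from the first connector-factory."""
    root = ET.fromstring(xml_text)
    factory = next(root.iter("connector-factory"), None)
    if factory is None or factory.get("location") is None:
        raise ValueError("no connector-factory element with a location attribute")
    properties = {
        prop.get("name"): prop.get("value")
        for prop in factory.iter("config-property")
    }
    return factory.get("location"), properties


jndi_name, properties = read_searchlet_config(SAMPLE_OC4J_RA_XML)
print(jndi_name)                    # eis/oracle/oracleSearch/mySearchlet
print(properties["webServiceURL"])
```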
Click Create after entering the source name and JNDI name.
From the administration tool Home - Sources page, edit the federated source. Create the following federated search attribute mappings in the Home - Sources - Edit for the federated source:
Table 5-1 Attribute Mappings for the Oracle SES Federated Source
Federated Source Document Attribute | Attribute Type | Federated Search Attribute |
---|---|---|
URL | String | URL |
Description_STRING_KWIC | String | Description |
CONTENT LENGTH | Number | Content Length |
SCORE | Number | Score |
APPS URL PATH | String | Apps URL Path |
Signature | Number | Signature |
Excerpt | String | Kwic |
HasDuplicate | Number | HasDuplicate |
IsDuplicate | Number | IsDuplicate |
ID | Number | Id |
fedId | String | fedId |
After you restart the OC4J middle tier (with searchctl restart), you can run federated search from the Oracle SES search application.
For the Oracle SES federated search environment to perform secure search, the federated master instance must be registered in Oracle Internet Directory as a trusted application to the slave instance.
To do this, use oidadmin to add the master instance's application entity (DN) to the trusted application's group under the slave instance's application entity entry. For example, add:
orclApplicationCommonName=oesEntity_<master_name>,cn=OES,cn=Products,cn=OracleContext,dc=us,dc=oracle,dc=com
to the uniqueMember attribute of
cn=TrustedApplications,orclApplicationCommonName=oesEntity_<slave_name>,cn=OES,cn=Products,cn=OracleContext,dc=us,dc=oracle,dc=com
where <master_name> is the search server name of the master Oracle SES instance, and <slave_name> is the name of the slave instance.
The federated slave instance cannot have single sign-on set up. However, the master instance can be protected by single sign-on.
Note: The master and slave instances should be connected to the same Oracle Internet Directory server. If you disconnect and then reconnect the slave instance to Oracle Internet Directory, or if you switch Oracle Internet Directory server for both the master and slave instances, then you must add the master instance's DN to the trusted application's group under the slave instance's application entity entry again.
Secure federated search enables searching secure content across distributed Oracle SES instances. An end user is authenticated to the federated Oracle SES master instance and enters a query. Along with querying the secure content in its own index, the master instance federates the query to each of the slave Oracle SES instances, on behalf of the authenticated end user. To each federated slave instance, the master instance sends search queries using the Web Services API posing as a search user. This mechanism necessitates propagation of user identity between the Oracle SES instances.
In building a secure federated search environment, an important consideration is the secure propagation of user identities between the federated instances. This section explains how Oracle SES performs secure federation in a service-to-service manner without sending end user passwords across the network.
Secure Oracle HTTP Server-Oracle SES channel: Because any Oracle HTTP Server can potentially connect to the AJP13 port on the Oracle SES instances and masquerade as a specific person, the channel between the Oracle HTTP Server and the Oracle SES instance must be SSL-enabled, or the entire Oracle HTTP Server and Oracle SES instance machines must be protected by firewall.
See Also: Chapter 4, "Security in Oracle Secure Enterprise Search" for more information about setting up single sign-on
On the slave instance, requests from the master instance are authenticated through Web Services API calls. The slave instance must trust the master instance, so that the master instance can impersonate (act as a proxy for) users when sending queries to the slave instance.
To establish the trust relationship between the master and slave instance, the master's application entity must be registered under the Trusted Applications Group of the slave's application entity in the Oracle Internet Directory.
The master instance invokes a Web service method, passing its application entity, password and the end user GUID to the slave. The slave then validates application entity credentials with the Oracle Internet Directory server and checks if this application entity is in its Trusted Group. If the Oracle Internet Directory checks are successful, then the slave switches the user of the current query session to the end user GUID passed in.
Because the application entity password is passed through the proxyLogin() method call across the network, the channel between the master and slaves must be SSL-enabled. The following graphic illustrates this.
Oracle Secure Enterprise Search provides a plug-in to integrate with Google Desktop for Enterprise (GDfE). You can include Google Desktop results in your Oracle SES hitlist. You can also link to Oracle SES from the GDfE interface.
See Also: Google Desktop for Enterprise Readme at http://host:port/search/query/gdfe/gdfe_readme.html for details about how to integrate with GDfE
A backup is a copy of configuration data that can be used to recover your configuration settings after a hardware failure. When a backup is performed on the Global Settings - Configuration Backup and Recovery page, Oracle SES copies the data to the binary metaData.bkp file. The location of that file is provided on the Global Settings - Configuration Data Backup and Recovery page. When the backup successfully completes, you must copy this file to a different host. You should back up after making configuration data changes, such as creating or editing sources.
Recovery can only be performed on a fresh installation. When the installation completes, copy the metaData.bkp file to the location provided in the administration tool. Sources need to be crawled again to see search results.
Some notes about backup and recovery:
You must stop all running schedules before doing the backup. Also, if secure search is enabled in the backup instance, then you must re-register the Oracle Internet Directory after performing the recovery steps.
If you have file or table sources residing on the same machine as the one running Oracle SES, and if you intend to use a different machine for recovery, then you must use the actual host name (not localhost) when creating the sources.
For database table sources, confirm that the remote tables exist.
For file sources, confirm that files and paths are valid after recovery.
For secure searches, Oracle Internet Directory connections must be set up again, and secure search must be re-enabled, after recovery.
During recovery, the mail archive directory settings for existing mailing list and e-mail sources are changed. After recovery, the location will be <cache-dir>/mail, which is the default for new e-mail and mailing list sources. Any customized directory locations prior to recovery will be lost.
To crawl non-Oracle databases, you must create a view in an Oracle database on the remote non-Oracle table. Then create the table source on the Oracle view. Oracle SES accesses remote databases using database links. Only one table or view can be specified for each table source. If data from more than one table or view is required, then first create a single view that encompasses all required data.
The following datatypes are supported for table sources: BLOB, CLOB, CHAR, VARCHAR, VARCHAR2. Datatypes are associated with a specific storage format, constraints, and a valid range of values. A datatype is specified for each column in a table.
For file sources to successfully crawl and display multibyte environments, the locale of the machine that starts the Oracle SES server must be the same as the target file system. This way, the Oracle SES crawler can "see" the multibyte files and paths.
If the locale is different in the installation environment, then Oracle SES should be restarted from the environment with the correct locale. For example, for a Korean environment, either set LC_ALL to ko_KR or set both LC_LANG and LANG to ko_KR.KSC5601. Then run searchctl restartall from either a command prompt on Windows or an xterm on Linux.
When crawling file sources on Linux, the crawler will resolve any symbolic link to its true directory path and enforce the boundary rule on it. For example, suppose directory /tmp/A has two children, B and C, where C is a link to /tmp2/beta. The crawl will have the following URLs:
/tmp/A
/tmp/A/B
/tmp2/beta
/tmp/A/C
If the boundary rule is /tmp/A, then /tmp2/beta will be excluded. The seed URL is treated as is.
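The same check can be reproduced outside Oracle SES to predict which directories a boundary rule will exclude. The following Python sketch resolves symbolic links with os.path.realpath and then applies a directory prefix test. It recreates the example layout under a temporary directory, so the paths it uses are stand-ins for /tmp/A and /tmp2/beta. The function name is hypothetical, and the sketch assumes a platform that supports symbolic links, such as Linux.

```python
import os
import tempfile


def within_boundary(path: str, boundary: str) -> bool:
    """Resolve symbolic links in path, then test it against the boundary directory."""
    real = os.path.realpath(path)
    return real == boundary or real.startswith(boundary.rstrip(os.sep) + os.sep)


# Recreate the example layout inside a temporary directory.
root = os.path.realpath(tempfile.mkdtemp())
dir_a = os.path.join(root, "tmp", "A")
beta = os.path.join(root, "tmp2", "beta")
os.makedirs(os.path.join(dir_a, "B"))
os.makedirs(beta)
os.symlink(beta, os.path.join(dir_a, "C"))  # C is a link to tmp2/beta

for child in ("B", "C"):
    print(child, within_boundary(os.path.join(dir_a, child), dir_a))
```

B is reported as inside the boundary. C is reported as outside, because it resolves to the beta directory, which is not under A.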
If a file URL is to be used "as is", without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...
"As is" means that when a user clicks on the search link of the document, the browser will try to use the specified file URL on the client machine to retrieve the file. Without that, Oracle SES uses this file URL on the server machine and sends the document through HTTP to the client machine.
If the plug-in is to return file URLs to the crawler, then the file URLs must be fully qualified. For example, file://localhost/.
Also, if a file URL is to be used "as is", without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...
The tool for starting and stopping the search engine is searchctl. To restart Oracle SES (for example, after rebooting the host machine), navigate to the bin directory and run searchctl startall.
Note: Users are prompted for a password when running searchctl commands on Linux platforms. No password is required on Windows platforms. This is because Oracle SES installation on Windows requires a user with administrator privileges. When running commands to start or stop the search engine, no password is required as long as the user is a member of the administrator group.
See Also: Startup / Shutdown lesson in the Oracle SES tutorial: http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm