4. SIETS API Specification

This section describes the SIETS API specification. The SIETS API is implemented in XML.

This section contains the following topics:

4.1. Overview

This section contains the following topics:

4.1.1. Submitting SIETS Commands and Receiving Replies

XML request and reply messages are exchanged between the application and the SIETS storage over HTTP, using port 80 by default.

As mentioned earlier, SIETS commands can be sent to the SIETS server and replies received as XML messages. It is also possible to submit commands as HTTP GET parameters and receive formatted replies.

Both options are described in the following sections:

4.1.1.1. Exchanging XML Messages Directly

The following figure illustrates submitting SIETS commands and receiving replies via XML messages directly:

Figure 17: Exchanging XML messages directly

A request is sent as a POST method.

The HTTP resource must be identified by the URL http://host/cgi-bin/siets/api.cgi, where host is the SIETS server host name.
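
The following is a minimal sketch of preparing such a POST request with the Python standard library. The function name build_post_request and the Content-Type header value are assumptions made for illustration; the specification does not name a content type. The request is only constructed here, not sent.

```python
import urllib.request


def build_post_request(host: str, xml_message: str, encoding: str = "utf-8") -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST that carries one SIETS XML request."""
    url = f"http://{host}/cgi-bin/siets/api.cgi"
    return urllib.request.Request(
        url,
        data=xml_message.encode(encoding),
        # Assumed header value; the specification does not state which content type to use.
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


request = build_post_request("host", '<siets:request xmlns:siets="www.siets.net"/>')
# Sending is left to the caller, for example:
# with urllib.request.urlopen(request) as response:
#     reply_xml = response.read()
```

Keeping construction separate from sending makes it possible to inspect the message before it reaches the SIETS server.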

4.1.1.2. Submitting Parameters and Receiving Formatted Replies

The following figure illustrates submitting SIETS commands as HTTP GET parameters and receiving formatted XML replies:

Figure 18: Submitting parameters and receiving formatted replies

A request is sent as a GET or POST method.

The HTTP resource must be identified by the URL http://host/cgi-bin/siets/api.cgi, where host is the SIETS server host name. Command-specific parameters must be included in the query string or passed as POST data.
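
As an illustration, a GET URL of this form can be assembled as sketched below. The helper name build_get_url is hypothetical; the parameter names are taken from the examples in this specification.

```python
from urllib.parse import urlencode


def build_get_url(host: str, **params) -> str:
    """Build a SIETS API URL with the command parameters in the query string."""
    query = urlencode(params)  # escapes characters that are not URL-safe
    return f"http://{host}/cgi-bin/siets/api.cgi?{query}"


url = build_get_url("host", command="insert", storage="test", id=1, title="Doc1")
print(url)
```

The same encoded parameter string could instead be passed as POST data.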

4.1.2. XML Message Structure

As described previously, each XML message contains a command name, content data that is specific to the command, and other information, such as the user name and request identifier, which is common to all XML messages and is included in the so-called XML message envelope.

For more information on the XML message envelope, see SIETS XML Message Envelope.

The following figure illustrates the common part for all XML messages and content part that is specific for each command:

Figure 19: XML message: common part and content part

Description of SIETS API commands is organized so that the common part is described in SIETS XML Message Envelope, and only the content parts are described for each command in separate sections named after the command.

XML elements are presented as they appear in messages, and each XML element is described within its tags.

The command syntax consists of an XML request and an XML reply, and as mentioned, XML requests can be submitted as HTTP GET or POST parameters. To describe XML request, XML reply, and HTTP GET parameters syntax, each section contains the following subsections:

Subsection

Description

XML Request

Lists all XML request elements that are specific to the command as they appear in XML request messages. Each element is described within its tags. If the element is mandatory, the description within the tags ends with an asterisk *.

HTTP GET Parameters

Describes HTTP GET parameter syntax in the form of an example. The example looks as follows:

http://host/cgi-bin/siets/api.cgi?param1=value&param2=value

where:

  • host is the SIETS server IP address or host name,

  • param1, param2, and so on are HTTP GET parameters,

  • value is a parameter’s value.

Note: The examples use HTTP GET parameters; however, the same parameters can also be submitted as POST parameters.

XML Reply

Lists all XML reply elements that are specific for the command as they appear in XML reply messages. Each element is described within its tags.

Some elements in XML requests, and thus the respective parameters when the request is submitted as HTTP GET parameters, are mandatory; others are optional. The mandatory elements are marked with an asterisk * in the XML request description.

However, some XML request elements are mandatory only when submitted in an XML request and are not mandatory when submitted as HTTP GET parameters. Such parameters must first be defined in the SIETS Web server module configuration file; after that, they do not have to be submitted each time a command is sent. Parameters that can be defined in the SIETS Web server module configuration file are the following:

For more information on the SIETS Web server module configuration file, see the SIETS Administration and Configuration Guide.

4.2. SIETS XML Message Envelope

This section describes the common parts of the XML request and reply for all SIETS API commands.

4.2.1.1. XML Request

<?xml version="1.0" encoding="REQUEST-ENCODING"?>

<siets:request xmlns:siets="www.siets.net">

<siets:storage>storage name*</siets:storage>

<siets:command>command name*</siets:command>

<siets:timestamp>message creation date and time</siets:timestamp>

<siets:requestid>message number</siets:requestid>

<siets:application>creator of message</siets:application>

<siets:user>user name*</siets:user>

<siets:password>user password*</siets:password>

<siets:timeout>function timeout period</siets:timeout>

<siets:reply_encoding>reply encoding</siets:reply_encoding>

<siets:content>command specific data</siets:content>

</siets:request>
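
The request envelope can be generated programmatically. The sketch below uses Python's xml.etree.ElementTree; the function name build_request and the sample values are hypothetical. It assumes that elements should appear in the order listed above and omits the XML declaration.

```python
import xml.etree.ElementTree as ET

SIETS_NS = "www.siets.net"
ET.register_namespace("siets", SIETS_NS)

# Envelope elements in the order in which this specification lists them.
ENVELOPE_ORDER = ("storage", "command", "timestamp", "requestid", "application",
                  "user", "password", "timeout", "reply_encoding")


def build_request(storage, command, user, password, content=None, **optional):
    """Return a SIETS request envelope as a string (without the XML declaration)."""
    values = {"storage": storage, "command": command,
              "user": user, "password": password, **optional}
    root = ET.Element(f"{{{SIETS_NS}}}request")
    for name in ENVELOPE_ORDER:
        if name in values:
            ET.SubElement(root, f"{{{SIETS_NS}}}{name}").text = str(values[name])
    content_el = ET.SubElement(root, f"{{{SIETS_NS}}}content")
    if content is not None:
        content_el.append(content)  # command-specific data, e.g. a <document> element
    return ET.tostring(root, encoding="unicode")


# Hypothetical usage: a delete request for the document with ID 1.
document = ET.Element("document")
ET.SubElement(document, "id").text = "1"
print(build_request("test", "delete", "user", "secret", content=document, requestid=1))
```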

4.2.1.2. XML Reply

<?xml version="1.0" encoding="REPLY-ENCODING"?>

<siets:reply xmlns:siets="www.siets.net">

<siets:storage>storage name</siets:storage>

<siets:timestamp>reply creation date and time</siets:timestamp>

<siets:content>command specific data</siets:content>

<siets:command>command name for which the reply is created</siets:command>

<siets:requestid>message number for which the reply is created</siets:requestid>

<siets:seconds>time period for the reply creation</siets:seconds>

<siets:replyid>unique message id created by the SIETS server</siets:replyid>

</siets:reply>
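
A reply envelope can be read with a namespace-aware parser. The following is a sketch only: the SAMPLE_REPLY text and its values are invented for illustration and are not actual server output, and parse_reply is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"siets": "www.siets.net"}

# Invented sample reply; real replies come from the SIETS server.
SAMPLE_REPLY = """<siets:reply xmlns:siets="www.siets.net">
<siets:storage>test</siets:storage>
<siets:content><found>1</found></siets:content>
<siets:command>lookup</siets:command>
<siets:requestid>42</siets:requestid>
<siets:seconds>0.01</siets:seconds>
</siets:reply>"""


def parse_reply(xml_text):
    """Extract the common envelope fields and the command-specific content element."""
    root = ET.fromstring(xml_text)
    return {
        "command": root.findtext("siets:command", namespaces=NS),
        "requestid": root.findtext("siets:requestid", namespaces=NS),
        "content": root.find("siets:content", NS),
    }


reply = parse_reply(SAMPLE_REPLY)
print(reply["command"], reply["requestid"], reply["content"].findtext("found"))
```

Matching the requestid of a reply against the requestid of the request is one way for an application to pair replies with requests.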

4.3. Data Manipulation

This section describes the following data manipulation commands:

4.3.1. Insert, Update, and Replace

The insert command adds a document to the SIETS storage. If a document with the same ID already exists, the command returns an error.

The update command updates a document if a document with the same ID already exists in the SIETS storage; otherwise, it adds the document to the SIETS storage.

The replace command replaces the contents of a document in the SIETS storage. If no document with the given ID exists in the SIETS storage, the command returns an error.

4.3.1.1. XML Request

<siets:content>

<document>document content</document>

</siets:content>

Where the document content consists of document structure elements. The default SIETS document structure is as follows:

<document>

<id> document id * </id>

<title> document title </title>

<rate> document rate </rate>

<domain> document domain </domain>

<info> meta data </info>

<text> textual information, which is used for indexing </text>

<hidden> textual information, which is used for indexing, but which is not shown</hidden>

</document>

For more information on the default SIETS document structure, see Creating Document Structure with Application.

4.3.1.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=insert&storage=test&id=1&title=Doc1

http://host/cgi-bin/siets/api.cgi?command=update&storage=test&id=1&title=Doc1

http://host/cgi-bin/siets/api.cgi?command=replace&storage=test&id=1&title=Doc1

4.3.1.3. XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.3.1.4. Binary Files Conversion

Note: The binary files conversion functionality is available only starting from the SIETS server version 3.2.8.

Binary files conversion is an integrated feature of the insert, update, and replace commands. It converts binary file contents into plain text. Thus, it is possible to add Microsoft Office files and other binary files to the SIETS storage and perform full text search on them.

The following table lists extensions of binary files that can be added to the SIETS storage:

Extension    Description
DOC          Microsoft Word document.
XLS          Microsoft Excel document.
PPT          Microsoft PowerPoint document.
RTF          Rich text format document.
PDF          Adobe portable document format document.
PS           PostScript document.

To use the binary files conversion functionality, in the XML request, in the place of the text tag, use the file tag in the following format:

<file store="yes/no"> <!-- If store="yes", the original document is stored in the SIETS storage and returned when retrieved. The default value is "no". -->

<ext> extension of binary file </ext>

<data> data of binary file converted to the base64 encoding </data>

</file>

As described in the data tag, binary file contents must first be converted to the base64 encoding, because XML does not support storing raw binary data within a tag.
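
The base64 conversion can be sketched as follows. The helper name build_file_element and the sample bytes are hypothetical, and writing the extension in upper case is an assumption based on the table of supported extensions.

```python
import base64
import xml.etree.ElementTree as ET


def build_file_element(data: bytes, extension: str, store: bool = False) -> ET.Element:
    """Wrap binary file contents in a <file> element with base64-encoded data."""
    file_el = ET.Element("file", {"store": "yes" if store else "no"})
    ET.SubElement(file_el, "ext").text = extension
    # base64 output is plain ASCII, so it can be placed safely inside an XML element.
    ET.SubElement(file_el, "data").text = base64.b64encode(data).decode("ascii")
    return file_el


element = build_file_element(b"%PDF-1.4 example bytes", "PDF", store=True)
print(ET.tostring(element, encoding="unicode"))
```

In a real request, the bytes would be read from the file, and the resulting element would take the place of the text tag inside the document element.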

4.3.2. Delete

The delete command deletes a document from the SIETS storage. If a document with such ID is not in the SIETS storage, the command returns an error.

4.3.2.1. XML Request

<siets:content>

<document>

<id>document id *</id>

</document>

</siets:content>

4.3.2.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=delete&storage=test&id=1

4.3.2.3. XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.3.3. Index

After documents are inserted, updated, replaced, or deleted in the SIETS storage, the SIETS server must permanently save the changes to the inverted index. The SIETS server can decide on its own when to start saving the changes. However, to optimize performance with large amounts of data, it is recommended to inform the server when a portion of documents has been loaded and no more documents are to be loaded in the near future, so that the SIETS server can allocate all resources to the indexing process.

The index command tells the SIETS server to start the process of indexing.

4.3.3.1. XML Request

The <siets:content> element does not contain any command specific data.

4.3.3.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=index

4.3.3.3. XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.3.4. Clear

The clear command deletes all documents from the SIETS storage. This command should be used only when a complete re-indexing of the SIETS storage is necessary.

4.3.4.1. XML Request

The <siets:content> element does not contain any command specific data.

4.3.4.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=clear

4.3.4.3. XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.3.5. Get_scheme

The get_scheme command retrieves the document structure definition, in other words, scheme, from the SIETS storage.

Note: It is also possible to review and edit the document policy scheme from SIETS Enterprise Manager. For information on SIETS Enterprise Manager, see the SIETS Administration and Configuration Guide, Configuring SIETS Storage.

4.3.5.1. XML Request

The <siets:content> element does not contain any command specific data.

4.3.5.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=get_scheme

4.3.5.3. XML Reply

<siets:content>

<scheme>

<part>

<location>location of the document part in XPath notation</location>

<policy>policy for the document part, policy=value </policy>

</part>

</scheme>

</siets:content>

4.3.6. Set_scheme

The set_scheme command sets the document structure definition, in other words, scheme, to the SIETS storage.

Note: If you modify the scheme, it applies to all documents that are to be imported to the SIETS storage. However, it does not automatically modify the document structure for documents that already are imported to the SIETS storage.

Note: It is also possible to review and edit the document policy scheme from SIETS Enterprise Manager. For information on SIETS Enterprise Manager, see the SIETS Administration and Configuration Guide.

4.3.6.1. XML Request

<siets:content>

<scheme>

<part>

<location>location of the document part in XPath notation</location>

<policy>policy for the document part, policy=value</policy>

</part>

</scheme>

</siets:content>

For more information on document policies, see Importing XML Structured Data.

4.3.6.2. HTTP GET Parameters

The set_scheme command cannot be submitted as HTTP GET parameters.

4.3.6.3. XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.4. Status Monitoring

This section describes the status command.

4.4.1. Status

The status command returns status information of the SIETS server instance. The status information includes:

4.4.1.1. XML Request

The <siets:content> element does not contain any command specific data.

4.4.1.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=status

4.4.1.3. XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<siets:content>

<status>

<ctrld>

<started> date and time, when the SIETS server was started </started>

<age>time period the SIETS server is working since it was started</age>

<total_time_elapsed>total time spent by the SIETS server executing commands</total_time_elapsed>

<transactions> <!-- This element contains information about executed commands. -->

<total> total number of commands executed</total>

<successful>number of commands that were successfully executed</successful>

<failed> number of commands that were executed unsuccessfully </failed>

<requests command="command name">number of times the command was executed</requests> <!-- This element is repeated for every command that was executed. -->

</transactions>

<last_modified> date and time, when modifications in SIETS storage occurred last time </last_modified>

<queue> number of commands executed simultaneously </queue>

<version> SIETS version number</version>

</ctrld>

<mtxd> <!-- This element contains information about the inverted index. -->

<journal>

<usage> indexing memory cache usage in percent</usage>

</journal>

<pool_state> index state: normal, expanding, or collapsing</pool_state>

</mtxd>

<wordd> <!-- This element contains information about the vocabulary. -->

<unique_words>unique words in the SIETS storage</unique_words>

<total_words>total number of all words</total_words>

</wordd>

<docd>

<documents>total number of documents</documents>

<domains> number of distinct domains of documents</domains>

</docd>

</status>

</siets:content>

When importing data to the SIETS storage:

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.5. Data Retrieval

This section describes the following data retrieval commands:

4.5.1. Lookup and Retrieve

The lookup command searches for a document in the SIETS storage and returns information on whether a document with the given ID exists in the SIETS storage.

The retrieve command returns a document from the SIETS storage. If a document with such ID is not in the SIETS storage, the command returns an error.

4.5.1.1. XML Request

<siets:content>

<document>

<id>document id *</id>

</document>

</siets:content>

4.5.1.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=lookup&storage=test&id=1

http://host/cgi-bin/siets/api.cgi?command=retrieve&storage=test&id=1

4.5.1.3. XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<siets:content>

<found>1 if the document is found, 0 if it is not</found>

<results>

<document>

meta data for the lookup command

textual information for the retrieve command

</document>

</results>

</siets:content>

Meta data for the lookup command is the information included in tags for which the list policy is set to YES. By default, these are the id, title, and rate tags.

For more information on policies, see Importing XML Structured Data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.5.2. Search

The search command performs full text search (FTS) in the SIETS storage.

4.5.2.1. XML Request

<siets:content>

<query> search query *</query>

<docs> number of documents in the result set </docs>

<offset> offset (indent) from the beginning of the result set </offset>

<case_sensitive> Boolean type parameter: YES to enable case sensitivity of the first letter of words when performing the search, NO not to enable case sensitivity </case_sensitive>

<relevance> Boolean type parameter: YES to order results by relevance, NO not to order results by relevance </relevance>

<max_from_domain> Maximum number of documents from one domain. Results from one domain are grouped together within one result page. If the parameter is not set, the default value is 0, which implies that no grouping by domains is performed and no limit is set. </max_from_domain>

<rate_from> searching documents within a rate range: the FROM value </rate_from>

<rate_to> searching documents within a rate range: the TO value </rate_to>

<wildcards> <!-- This element contains parameters for configuring wildcard patterns support. Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->

<allow> Indicates whether wildcard pattern search is enabled. Values: “yes” or “no”.</allow>

<cover_factor> When wildcard patterns are used to define a class of words to be searched, only a limited number of the statistically most frequent words are searched for, to ensure higher performance. This element defines the limit as a percentage of the total number of occurrences in the SIETS storage of all words that match the wildcard pattern.</cover_factor>

<min_expand> The minimum size, in absolute numbers, of the set of words from the SIETS storage vocabulary that match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if only 2 words fall within the cover_factor, but min_expand is 4, then 4 words are used in the search.</min_expand>

<max_expand> The maximum size, in absolute numbers, of the set of words from the SIETS storage vocabulary that match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if 20 words fall within the cover_factor, but max_expand is 16, then only 16 words are used in the search.</max_expand>

</wildcards>

</siets:content>

If values for the wildcards tag are not defined, corresponding parameters set in the SIETS storage configuration file are used.

For more information on configuring the SIETS storage, see the SIETS Administration and Configuration Guide.
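
The content part of a search request can be assembled as sketched below. The function name build_search_content and the sample values are hypothetical. The sketch handles plain text queries only; a query that searches within markup would need child elements rather than text.

```python
import xml.etree.ElementTree as ET

SIETS_NS = "www.siets.net"
ET.register_namespace("siets", SIETS_NS)


def build_search_content(query, wildcards=None, **options):
    """Build the <siets:content> element of a search request."""
    content = ET.Element(f"{{{SIETS_NS}}}content")
    ET.SubElement(content, "query").text = query
    # Optional parameters such as docs, offset, relevance, rate_from, rate_to.
    for name, value in options.items():
        ET.SubElement(content, name).text = str(value)
    if wildcards:
        wildcards_el = ET.SubElement(content, "wildcards")
        for name, value in wildcards.items():
            ET.SubElement(wildcards_el, name).text = str(value)
    return content


content = build_search_content(
    "Joh* Smith", docs=20, offset=0, relevance="YES",
    wildcards={"allow": "yes", "max_expand": 16},
)
print(ET.tostring(content, encoding="unicode"))
```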

SIETS provides several mechanisms for specifying your search query. Each mechanism has a definite syntax, which is described in the following subsections. For a better understanding, each subsection also contains an example of the mechanism described and an explanation about what the example search query returns.

This section contains the following topics:

To search for documents that contain a single search term, the search term must be entered as is.

Example:

John returns documents that contain the word “John”.

AND

To search for documents that contain all of several terms, which are not necessarily next to each other, the search terms must be separated by the space character.

Example:

John Smith returns documents that contain the word “John” and the word “Smith”.

Phrase Search

To search for documents that contain an exact phrase, the search phrase must be enclosed in quotation marks.

Example:

"John Smith" returns documents that contain the exact phrase “John Smith”.

OR

To search for documents that contain any of the search terms, the search terms must be enclosed in { } and separated with the space character.

Example:

{John Smith} returns documents that contain either the word “John” or the word “Smith”.

NOT

To search for documents that do not contain the search term, the search term must be preceded with ~.

Example:

~John returns documents that do not contain the word “John”.

Boolean Expressions

AND, OR, and NOT logical connectives can be combined in more complex search expressions using the brackets ( ), which allows you to build any Boolean expression.

Example:

{(John Smith) (Abby Brown)} returns documents that contain either the word “John” and the word “Smith”, or the word “Abby” and the word “Brown”.

{(A B ~C) "D E"} is parsed in the expression tree as follows:

Figure 20: Search query expression tree
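
One way to read the query {(A B ~C) "D E"} as a tree is sketched below with nested tuples. This is an illustration of the syntax described above, not the parser used by the SIETS server; the node names AND, OR, NOT, and PHRASE and the function render are hypothetical.

```python
def render(node):
    """Render a nested-tuple expression tree as a SIETS search query string."""
    if isinstance(node, str):
        return node  # a single search term
    op, *args = node
    if op == "AND":  # terms separated by spaces, grouped with ( )
        return "(" + " ".join(render(arg) for arg in args) + ")"
    if op == "OR":  # terms separated by spaces, grouped with { }
        return "{" + " ".join(render(arg) for arg in args) + "}"
    if op == "NOT":  # term preceded by ~
        return "~" + render(args[0])
    if op == "PHRASE":  # exact phrase in quotation marks
        return '"' + " ".join(args) + '"'
    raise ValueError(f"unknown node type: {op}")


tree = ("OR", ("AND", "A", "B", ("NOT", "C")), ("PHRASE", "D", "E"))
print(render(tree))  # {(A B ~C) "D E"}
```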

Wildcard Patterns

To search for documents that contain any word from a class of words, the search term must be written as a wildcard pattern:

Example:

ca? returns documents that contain the word “car”, “cat”, “cap”, “can”, and so on.

Joh* returns documents that contain the word “John”, “Johnson”, “Johnny”, and so on.

ca[pt] returns documents that contain only the word “cap” or “cat”.

c?[au]* returns documents that contain the word “counter”, “club”, “chapter”, “country”, “change”, “chat”, “council”, “class”, “cpu”, “challenge”, “church”, “couple”, “championship”, and so on.
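
The examples above behave like shell-style patterns, so their expansion against a vocabulary can be illustrated with Python's fnmatch module. This is a sketch under that assumption; the actual matching rules of the SIETS server may differ, and the vocabulary and the function name expand are invented.

```python
from fnmatch import fnmatchcase

# Invented vocabulary for illustration.
vocabulary = ["car", "cat", "cap", "can", "cart", "John", "Johnson", "club", "cpu", "copy"]


def expand(pattern, words):
    """Return the words that match a shell-style wildcard pattern, case-sensitively."""
    return [word for word in words if fnmatchcase(word, pattern)]


print(expand("ca?", vocabulary))      # ['car', 'cat', 'cap', 'can']
print(expand("Joh*", vocabulary))     # ['John', 'Johnson']
print(expand("ca[pt]", vocabulary))   # ['cat', 'cap']
print(expand("c?[au]*", vocabulary))  # ['club', 'cpu']
```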

Ignored Words

By default, SIETS ignores common words such as “and”, “where”, and “how”, as well as certain single characters and single letters, because they tend to slow down the search without improving the search results. Such common words and characters are called ignored words.

The SIETS server detects words that appear in the SIETS storage most often and adds them to the ignored words list. It is possible to edit the limit of the ignored words list. For more information on managing the ignored word list limit, see the SIETS Administration and Configuration Guide.

If a common word or a character is essential to getting the results you want, you can include it by preceding it with a plus sign +.

Example:

John +and Abby returns documents that contain all three words: “John”, “and”, and “Abby”.

Stemming

It is possible to include a word and its declinations, for example, “go” and “going”, in one search request.

This feature is especially useful for so-called synthetic languages, in which syntactic relations within sentences are expressed by the change in the form of a word that indicates distinctions of tense, person, gender, number, mood, voice, and case, for example, German and Latin.

To enable the declination search, a shared library must be implemented, which exports a function that extracts word roots.

For information on installing the shared library for the SIETS server, see the SIETS Administration and Configuration Guide.

To search for documents that contain words in declinations, a word or a phrase must be enclosed in the dollar signs $ $.

Example:

$John$ returns documents that contain the word “John” and “John’s”.

Search within Markup

To search for documents that contain the search term in a specific tag, the search term must be enclosed in the appropriate tags.

Note: Searching within markup can be performed only if the index policy with the value xml or all is used. For the default document structure, the index policy is set to xml by default. For more information on policies, see Importing XML Structured Data.

Example:

<person>John Smith</person> returns documents that contain the word “John” in the <person> tag and the word “Smith” in the <person> tag.

{<person>John</person> <address>"New York"</address>} returns documents that contain either the word “John” in the <person> tag or the phrase “New York” in the <address> tag.

Proximity Search

It is possible to define the maximum number of words that may appear between certain search terms. These search terms are also defined in the search query. This feature is called proximity search.

To use the proximity search feature, the search term must be as follows:

@ N term1 term2 @,

where N is the maximum count of words between the search terms, and term1 and term2 are search terms. There can be any number of search terms included in the proximity search.

If N is 1, then the search is exactly the same as if the phrase search was used.

For more information on the phrase search, see Phrase Search.

Example:

@ 3 street city @ returns documents that contain the words “street” and “city” not further than 3 words from each other.

Numeric Search

Note: The numeric search functionality is available only starting from the SIETS server version 3.3.

Because the SIETS server indexes numeric information as well as text, it is possible to perform numeric search. Numeric search finds documents that contain numeric values within a given numeric interval.

For example, suppose each document contains information about an object, including its geographic coordinates. In that case, numeric search can be used to retrieve all objects within a definite range of geographic coordinates. Thus, SIETS can be used in online maps, where people can find information on different objects in a definite area.

The numeric search can be performed only together with a text search.

Numeric values in documents are indexed and stored as floating point numbers, regardless of whether they are integers or floating point numbers in the original documents.

The fractional part is stored with up to six digits.

To use the numeric search functionality, the search term must be as follows:

It does not matter if textual search term is entered before or after numeric search term.

Example:

Document content:

<document>

<id>32423</id>

<title>John’s profile</title>

<text>

<name>John Smith</name>

<age>32</age>

</text>

</document>

Search query that matches the document:

<query>

<name>John</name> <age>30 .. 40</age>

</query>

<numeric_ordering>center</numeric_ordering>

Note: When performing numeric search in one document tag, such as the <age> tag in the previous example, only one numeric interval can be entered. If you enter more than one numeric interval for one tag, nothing is returned, because numeric intervals are joined with the AND logical operation.

For information on performing numeric search in more than one tag, see Numeric Search in More Than One Tag.

The <numeric_ordering> tag in the example denotes the order in which search results must be returned.

Possible values for numeric ordering are the following:

Value        Description
none         No numeric ordering is applied.
center       Results that are closer to the mean value of the numeric search interval are returned first. This value is allowed only for numeric search within a range of two numeric values.
ascending    Numeric search results are returned in ascending order.
descending   Numeric search results are returned in descending order.

Numeric Search in More Than One Tag

It is possible to perform numeric search in more than one tag. This means that a numeric search range can be specified for each tag that contains numeric information.

Example:

Document content:

<document>

<id>32423</id>

<title>John’s profile</title>

<text>

<name>John Smith</name>

<age>32</age>

<children>2</children>

</text>

</document>

Search query that matches the document:

<query>

<name>John</name> <age>30 .. 40</age> <children>&lt; 2</children>

</query>

<numeric_ordering>center</numeric_ordering>

Numeric search in more than one tag is especially useful and necessary for geographic coordinate searching, where it is necessary to search for an object by its longitude and latitude.

For numeric search in more than one tag, result ordering is combined into a single ordering across all tags.

The following table describes how result ordering is combined:

Ordering type   Description
ascending       Results are ordered ascending by the sum of all numeric values from tags in which the numeric search is performed.
descending      Results are ordered descending by the sum of all numeric values from tags in which the numeric search is performed.
center          Ordered by shortest distance to the center of intervals in multi-dimensional space where each dimension represents a tag in which the numeric search is performed.

Distance to the center of intervals in multi-dimensional space is calculated by the following formula:

((x-xc)/xr)^2 + ((y-yc)/yr)^2 + … + ((z-zc)/zr)^2,

where

x, y, z are the numeric values in the tags in which the numeric search is performed,

xc, yc, zc are the centers of the respective numeric search intervals,

xr, yr, zr are half of the range of the respective numeric search intervals.
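
The formula can be written out as a short function. This is a sketch of the calculation as described above, with one (low, high) interval per searched tag; the function name center_distance is hypothetical.

```python
def center_distance(values, intervals):
    """Sum of squared, range-normalized distances from each value to its interval center."""
    total = 0.0
    for value, (low, high) in zip(values, intervals):
        center = (low + high) / 2      # xc
        half_range = (high - low) / 2  # xr
        total += ((value - center) / half_range) ** 2
    return total


# Age 32 searched in the interval 30 .. 40: center 35, half range 5.
print(center_distance([32], [(30, 40)]))
```

A document whose values lie exactly at the centers of all intervals has distance 0 and is returned first under center ordering.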

Numeric search in several tags, that is, in several dimensions, has an additional feature that allows returning numeric search results that match:

For example, if geographic coordinates of ATMs in a city are indexed, it is possible to search for an ATM that is not farther than 1 kilometer from a definite location. That is, you need to retrieve only those ATMs that match the circle (a hypersphere with 2 dimensions in this case) with a radius of 1 kilometer.

If the default numeric search is performed in the previous example, results that match a square with a side length of 2 kilometers are returned. This means that ATMs located in the corners of the square, up to the square root of 2 (approximately 1.41) kilometers away, are also returned.

As said before, the default value for the multidimensional shape feature is a hypercube. The value for the multidimensional shape feature is defined in the <md_shape> tag, which is included in the siets command syntax.

Possible values for the <md_shape> tag are the following:

Value     Description
cube      Results that match a hypercube are returned.
sphere    Results that match a hypersphere are returned.

Example:

Document content:

<document>

<id>32425</id>

<title>ATM’s profile</title>

<text>

<name>ATM</name>

<x>1.2</x>

<y>3.7</y>

</text>

</document>

Search query that matches the document and finds ATMs within the distance of 1 kilometer from point (2.0, 4.0):

<query>

<name>ATM</name> <x>1.0 .. 3.0</x> <y>3.0 .. 5.0</y>

</query>

<numeric_ordering>center</numeric_ordering>

<md_shape>sphere</md_shape>
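
The difference between the two shapes can be sketched as follows. The sketch assumes that the hypersphere is the set of points whose range-normalized squared distance to the interval centers does not exceed 1; the function name matches_shape and the second test point are hypothetical.

```python
def matches_shape(point, intervals, shape="cube"):
    """Check whether a point falls inside the hypercube or hypersphere given by the intervals."""
    if shape == "cube":
        return all(low <= value <= high for value, (low, high) in zip(point, intervals))
    if shape == "sphere":
        # Assumed definition: normalized squared distance to the centers is at most 1.
        squared = sum(
            ((value - (low + high) / 2) / ((high - low) / 2)) ** 2
            for value, (low, high) in zip(point, intervals)
        )
        return squared <= 1.0
    raise ValueError(f"unknown shape: {shape}")


intervals = [(1.0, 3.0), (3.0, 5.0)]  # x and y ranges around the point (2.0, 4.0)
atm = (1.2, 3.7)                      # the ATM from the example document
corner = (1.1, 3.1)                   # a point near a corner of the square

print(matches_shape(atm, intervals, "sphere"))     # True
print(matches_shape(corner, intervals, "cube"))    # True
print(matches_shape(corner, intervals, "sphere"))  # False
```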

Case Sensitivity for Proper Names

It is possible to perform case sensitive search for proper names, which means that case sensitivity is applied for the first letter of a search term.

The case sensitivity feature is switched on or off by setting the <case_sensitive> parameter in the search command’s XML request.

For more information on the search command’s XML request, see XML Request.

Example:

If the <case_sensitive> parameter is set to YES and the search query contains “Bank”, the search command returns documents in which the word “Bank” begins with a capital letter. Note that documents in which the word is written in all capitals, “BANK”, are also returned, since case sensitivity is applied only to the first letter of a search term.

Grouping Results by Domain

It is possible to set the maximum number of documents in a search result that are returned from one domain. If this feature is used, documents from one domain are grouped together within one result page in the search result.

The grouping results by domain feature is defined by setting the <max_from_domain> parameter in the search command’s XML request larger than 0.

If the parameter is not set, the default value is 0, which implies that no grouping by domains is performed and no limit is set.

For more information on the search command’s XML request, see XML Request.

Filtering Results by Rate

It is possible to filter search results by document rate by setting the minimum and maximum of the rate range within which the rate of a document must be to appear in the search result.

Document rate is of the integer type. However, any date and time can be converted into an integer using the UNIX timestamp, which expresses a date and time as the number of seconds elapsed since 01/01/1970. Thus, it is possible to set a date and time as the document rate and to search for documents within a certain time interval.

The filtering results by rate feature is defined by setting the <rate_from> and <rate_to> parameters in the search command’s XML request.

For more information on the search command’s XML request, see XML Request.
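
As a minimal sketch of the timestamp approach, the following Python code converts calendar dates to UNIX timestamps and places them in <rate_from> and <rate_to> elements. The helper names are hypothetical, the dates are interpreted as UTC by assumption, and whether the range boundaries are inclusive is not stated in this section.

```python
from datetime import datetime, timezone

def to_rate(year, month, day):
    """Convert a UTC calendar date to a UNIX timestamp usable as a rate."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

def rate_filter(start, end):
    """Return the <rate_from> and <rate_to> elements for two timestamps."""
    return f"<rate_from>{start}</rate_from>\n<rate_to>{end}</rate_to>"

# Documents rated with a timestamp between 1 January and 1 February 2024
print(rate_filter(to_rate(2024, 1, 1), to_rate(2024, 2, 1)))
```

For this to work, the same conversion has to be applied when the rate is stored with each document.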

Web Friendly Result Navigation

SIETS is designed with Web applications in mind. In many cases, results are displayed on the Web using paging. Paging implies that the search result records are divided into parts, where each part is displayed on its own page and contains a fixed number of records.

The Web friendly result navigation feature is defined by setting the <docs> and <offset> parameters in the search command’s XML request.

For more information on the search command’s XML request, see XML Request.

Example:

If the <docs> parameter is set to 20, and the <offset> parameter is set to 40, the search command returns results 40 through 59.
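
The relationship between a page number and the <docs> and <offset> values can be sketched as follows. The function name is hypothetical, and zero-based page numbering is an assumption chosen to agree with the example above.

```python
def page_window(page, docs=20):
    """Return (offset, last_index) for a zero-based page number."""
    if page < 0 or docs < 1:
        raise ValueError("page must be >= 0 and docs must be >= 1")
    offset = page * docs
    return offset, offset + docs - 1

print(page_window(2))  # third page with 20 records per page: (40, 59)
```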

XML Drilldown

XML drilldown is a feature that allows grouping documents into a hierarchical structure and searching in this structure. Using this feature, you can create catalogues, index files, directories, and more.

Setting ‘classify’ policy

The classify policy must be set for those tags for which menus are to be generated.

<part>

<location>//document/spectags</location>

<policy>index=classify</policy>

</part>

The policy schema must be set before indexing data.

Document import

Once you have set the policy schema, you can import documents into the storage.

<document>

<id>3049223</id>

<title>Article</title>

<text>This is article</text>

<spectags>

<type>News_item<type>Comment</type></type>

<author>John_Smith</author>

</spectags>

</document>

Note that subtags of type or author can also be included to enable multi-level navigation.

Menu generation

Within the content tag of a search request, supply a menu tag that identifies the XPath location of the tag for which to generate a menu with hit distribution. The location is relative to the tag for which the classify policy is defined. Note that the menu is generated along with the search query and represents the number of hits.

Example request (single level drilldown):

<siets:command>search</siets:command>

<siets:content>

<query>article</query>

<menu>/type</menu>

</siets:content>

Example response (single level drilldown):

<siets:content>

<menu>

<item hits="25">News_item</item>

<item hits="1">Comment</item>

<item hits="7">Question</item>

<item hits="7">Reply</item>

</menu>

</siets:content>

For multi-level drilldown, pass the correspondingly deeper XPath location. Be sure to add “=<selected value>” to each parent category, or you will receive invalid hits.

Example request (multi level drilldown):

<siets:command>search</siets:command>

<siets:content>

<query>article</query>

<menu>/type=News_item/type</menu>

</siets:content>

Example response (multi level drilldown):

<siets:content>

<menu>

<item hits="10">Comment</item>

<item hits="1">Question</item>

<item hits="3">Reply</item>

</menu>

</siets:content>
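
The menu paths and replies shown above can be handled on the client side roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, every level is assumed to use the same tag name (type, as in the example), and only the <menu> element of the reply is parsed.

```python
import xml.etree.ElementTree as ET

def menu_path(selected, tag="type"):
    """Build a <menu> value: one '/tag=value' step per parent, then '/tag'."""
    steps = "".join(f"/{tag}={value}" for value in selected)
    return f"{steps}/{tag}"

def parse_menu(menu_xml):
    """Return {item text: hits} from a <menu> element given as a string."""
    root = ET.fromstring(menu_xml)
    return {item.text: int(item.get("hits")) for item in root.findall("item")}

reply_menu = """<menu>
<item hits="10">Comment</item>
<item hits="1">Question</item>
<item hits="3">Reply</item>
</menu>"""

print(menu_path([]))             # single level drilldown
print(menu_path(["News_item"]))  # second level below News_item
print(parse_menu(reply_menu))
```

Keeping the list of selected parent values in the application makes it straightforward to add the “=<selected value>” part for every parent level.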

4.5.2.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=search&storage=test&query=Jhon
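
A request URL of this form can be assembled with the Python standard library, which also takes care of escaping special characters in the query. The helper name is hypothetical, and any additional parameters the server may require are not covered here.

```python
from urllib.parse import urlencode

def api_url(host, command, storage, **params):
    """Build a SIETS GET request URL with a properly encoded query string."""
    query = urlencode({"command": command, "storage": storage, **params})
    return f"http://{host}/cgi-bin/siets/api.cgi?{query}"

print(api_url("host", "search", "test", query="Jhon"))
```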

4.5.2.3. XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<siets:content>

<ignored> common words that are ignored when performing the search </ignored>

<realquery> real query that was used to perform the search, including the words derived from wildcard usage and excluding the ignored words </realquery>

<found> number of documents found </found>

<hits> approximate total amount of results that match the search query </hits>

<more> number that indicates how many more documents matching the search query were found but are not yet returned to the result set; the number is exact if given in the form =N, and a lower bound if given in the form >N </more>

<from> documents in the result set within a numerical range: the FROM value </from>

<to> documents in the result set within a numerical range: the TO value </to>

<results>

<document> meta data of the document found </document>

</results>

</siets:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
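
The value of the <more> element combines a marker and a number, so a client typically has to split it before use. The following sketch shows one way to do this; the function name is hypothetical, and the input is assumed to be the element text after XML parsing.

```python
def parse_more(value):
    """Split a <more> value into (count, is_exact)."""
    value = value.strip()
    if value[:1] not in ("=", ">"):
        raise ValueError(f"unexpected <more> value: {value!r}")
    return int(value[1:]), value[0] == "="

print(parse_more("=12"))   # exactly 12 more documents
print(parse_more(">100"))  # at least 100 more documents
```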

4.5.3. Select

Note: The select command is available only starting from the SIETS server version 3.2.6.

The select command searches for documents by their identifiers. It is possible to select one document by a precisely entered document identifier, or to use a wildcard pattern to select all documents whose identifiers match the pattern. For example, if only the asterisk * is entered, identifiers of all documents in the SIETS storage are returned.

The default number of document identifiers returned in the result set is 1024, but this number can be changed by entering a different number in the <docs> tag.

4.5.3.1. XML Request

<siets:content>

<document>

<id>document id *</id>

</document>

<docs> number of document identifiers in the result set </docs>

<offset> offset from the beginning of the result set </offset>

</siets:content>
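
The command-specific content of a select request can be generated programmatically, for example as in the following sketch. The function name is hypothetical, the message envelope is omitted, and the docs and offset elements are left out when no value is given so that the server defaults apply.

```python
import xml.etree.ElementTree as ET

def select_content(id_pattern, docs=None, offset=None):
    """Build the <siets:content> part of a select request as a string."""
    content = ET.Element("siets:content")
    document = ET.SubElement(content, "document")
    ET.SubElement(document, "id").text = id_pattern
    if docs is not None:
        ET.SubElement(content, "docs").text = str(docs)
    if offset is not None:
        ET.SubElement(content, "offset").text = str(offset)
    return ET.tostring(content, encoding="unicode")

# Identifiers of all documents, 50 at a time, starting from the beginning
print(select_content("*", docs=50, offset=0))
```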

4.5.3.2. XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<siets:content>

<found> number of document identifiers matched </found>

<from> document identifiers in the result set within a numerical range: the FROM value </from>

<to> document identifiers in the result set within a numerical range: the TO value </to>

<results>

<id> meta data of the document found </id>

</results>

</siets:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.5.4. Similar

The similar command searches the SIETS storage for documents that are similar to textual information, which is either given directly or contained in a document. The textual information for which similar documents are searched is also referred to as the input text.

The algorithm that searches for similar documents uses statistical information about how often the words contained in the input text, the so-called keywords, appear in documents, and finds documents similar to the input text fragment or to the document with the given ID.

You must take into account that the algorithm uses statistical information about words and does not know their meaning. Therefore, similar documents might not be semantically alike. However, practice with large text collections that contain moderately large documents shows that the algorithm works well.

4.5.4.1. XML Request

<siets:content>

<id> document id to which similar documents must be searched for ** </id>

<text> textual information to which similar documents must be searched for ** </text>

<len> number of keywords in the input text * </len>

<quota> minimum number of keywords that must be found in documents that are returned in the search result * </quota>

<docs> number of documents to be returned in the result set </docs>

<offset> offset from the beginning of the result set </offset>

</siets:content>

For large text collections in the SIETS storage, practice shows that setting the len element to 20 and the quota element to 4 gives the best results. However, you can experiment to find the best values for your specific text collection.

The two asterisks ** mean that exactly one of the two elements must be entered; in other words, the relationship between these two elements is XOR.

4.5.4.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=similar&storage=test&id=Doc1&len=20&quota=4

http://host/cgi-bin/siets/api.cgi?command=similar&storage=test&text=Jhon&len=20&quota=4
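
Because exactly one of id and text may be passed, a client can enforce this rule before sending the request. The following sketch does so while building the GET URL; the function and argument names are hypothetical, and the defaults of 20 and 4 are simply the values used in the examples above.

```python
from urllib.parse import urlencode

def similar_url(host, storage, doc_id=None, text=None, length=20, quota=4):
    """Build a similar request URL from either a document ID or a text."""
    if (doc_id is None) == (text is None):
        raise ValueError("pass exactly one of doc_id or text")
    source = {"id": doc_id} if doc_id is not None else {"text": text}
    query = urlencode({"command": "similar", "storage": storage,
                       **source, "len": length, "quota": quota})
    return f"http://{host}/cgi-bin/siets/api.cgi?{query}"

print(similar_url("host", "test", doc_id="Doc1"))
print(similar_url("host", "test", text="Jhon"))
```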

4.5.4.3. XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<siets:content>

<found> number of documents found </found>

<hits> approximate total amount of results that match the search query </hits>

<more> number that indicates how many more documents matching the search query were found but are not yet returned to the result set; the number is exact if given in the form =N, and a lower bound if given in the form >N </more>

<from> documents in the result set within a numerical range: the FROM value </from>

<to> documents in the result set within a numerical range: the TO value </to>

<results>

<document> meta data of the document found </document>

</results>

</siets:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.5.5. Alternatives

If an alternatives search is performed, the system returns a set of alternative words from the SIETS storage vocabulary that are similar in spelling or are a different declension of the search term. For example, if you enter “bote”, then “bite” and “byte” are offered for searching. Note that only words from the SIETS storage are returned.

This feature can be used for fuzzy searches and for spelling error corrections.

Alternative words are returned from the vocabulary, which ensures that the alternative words are actual words that have been imported into the SIETS storage. When searching for alternative words, the alternatives command considers the statistical information about the occurrence of the alternative word in the vocabulary and the similarity of the alternative word to the search term. In other words, alternatives that occur in the SIETS storage more often and that are more similar to the search term are returned.

4.5.5.1. XML Request

<siets:content>

<query> search query * </query>

<cr> Minimum ratio between the occurrence of the alternative and the occurrence of the search term that is required to include the alternative in the search query. If you increase this parameter, fewer results are returned to the result set, but performance improves.</cr><!-- Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->

<idif> Maximum value of the number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference. If you increase this parameter, more results are returned to the result set, but performance is reduced.</idif><!-- Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->

<h> Minimum value of the number that gives an overall estimation of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value. If you increase this parameter, fewer results are returned to the result set, but performance improves.</h><!-- Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->

</siets:content>

If values for the cr, idif, or h tags are not defined, corresponding parameters set in the SIETS storage configuration file are used.

For more information on configuring the SIETS storage, see the SIETS Administration and Configuration Guide.
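
The effect of the three thresholds can be illustrated with a small filter. This sketch is not the SIETS algorithm: the candidate values are made up, the way SIETS computes cr, idif, and h is not described here, and treating the thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Alternative:
    word: str
    cr: float    # occurrence of the alternative / occurrence of the search term
    idif: float  # difference from the search term
    h: float     # overall quality estimate

def keep(alt, min_cr, max_idif, min_h):
    """Apply cr and h as minimum thresholds and idif as a maximum."""
    return alt.cr >= min_cr and alt.idif <= max_idif and alt.h >= min_h

candidates = [  # hypothetical values for illustration only
    Alternative("bite", cr=4.0, idif=1.0, h=3.5),
    Alternative("byte", cr=2.5, idif=1.0, h=2.0),
    Alternative("boat", cr=0.1, idif=2.0, h=0.2),
]
kept = [a.word for a in candidates
        if keep(a, min_cr=1.0, max_idif=1.5, min_h=1.0)]
print(kept)
```

Raising min_cr or min_h, or lowering max_idif, shrinks the list of alternatives, which matches the trade-off described for the request tags.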

4.5.5.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=alternatives&storage=test&query=Jhon

4.5.5.3. XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<siets:content>

<alternatives_list>

<alternatives>

<to> alternative search term </to>

<count> number of times the alternative search term occurs in the SIETS storage</count>

<word count="number of times the alternative occurs in the SIETS storage" cr="ratio between the occurrence of the alternative and the occurrence of the search term" idif="number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference" h="number that gives an overall estimation of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value"> alternative </word>

</alternatives>

</alternatives_list>

</siets:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
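
A client can rank the returned alternatives by the attributes of the <word> elements, for example as sketched below. The sample fragment and its values are hypothetical, it is trimmed to the <word> elements, and the attributes are assumed to hold plain numbers.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply fragment with made-up attribute values
sample = """<alternatives_list>
<alternatives>
<word count="75" cr="25.0" idif="1" h="6.0">byte</word>
<word count="120" cr="40.0" idif="1" h="8.5">bite</word>
</alternatives>
</alternatives_list>"""

def words_by_quality(reply_fragment):
    """Return (word, count) pairs ordered by descending h value."""
    root = ET.fromstring(reply_fragment)
    words = sorted(root.iter("word"),
                   key=lambda w: float(w.get("h")), reverse=True)
    return [(w.text.strip(), int(w.get("count"))) for w in words]

print(words_by_quality(sample))
```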

4.5.6. List-last

Note: The list-last command is available only starting from the SIETS server version 3.2.9.

The list-last command searches for the documents in the SIETS storage that have most recently been inserted, updated, or replaced using the insert, update, or replace commands, respectively.

4.5.6.1. XML Request

<siets:content>

<docs> number of documents in the result set </docs>

<offset> offset from the beginning of the result set </offset>

</siets:content>

4.5.6.2. HTTP GET Parameters

http://host/cgi-bin/siets/api.cgi?command=list-last&storage=test&docs=10&offset=100

4.5.6.3. XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<siets:content>

<found>number of documents returned to the result set</found>

<results>

<document> meta data for the list-last command</document>

</results>

</siets:content>

Meta data for the list-last command is information included in tags, for which the policy list is set to YES. By default, these are id, title, and rate tags.

For more information on policies, see Importing XML Structured Data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.6. Alerts

SIETS Server provides alerting functionality. Alerts are defined as search queries that can be performed against a storage inside the server. Alerts are not triggered automatically; a special command must be used. This gives the user application more flexibility in alert handling.

Alerting API commands are sent to the server using standard SIETS XML messaging.

4.6.1. Adding trigger

This command adds a trigger, identified by the supplied ID, that matches documents against the query supplied in the filter tag.

<siets:command>add_trigger</siets:command>

<siets:content>

<id>Trigger id</id>

<filter>Trigger filter query</filter>

<recipient>Recipient of notification</recipient>

</siets:content>

4.6.2. Removing trigger

This command removes the trigger with the specified ID.

<siets:command>remove_trigger</siets:command>

<siets:content>

<id>Trigger id</id>

</siets:content>

4.6.3. Clear triggers

This command clears all triggers.

<siets:command>clear_triggers</siets:command>

4.6.4. Examining document against triggers

This command tests the document whose ID is supplied against all triggers. If the notify parameter is set to yes, the shell script is executed for each trigger that matches the document. In addition, the reply to this command contains the list of IDs of the triggers that matched the document.

<siets:command>examine</siets:command>

<siets:content>

<document>

<id>document id to examine</id>

</document>

<notify>yes or no: whether to send the notification</notify>

</siets:content>

4.6.5. Configuration parameters

The storage configuration can be used to specify the shell script that is executed when a trigger matches a document.

<config>

<alerts>

<action>Shell script to execute</action>

</alerts>

</config>

4.7. Error Handling

If a command sent to the SIETS server is not executed successfully, an error is returned in the following XML reply message:

<?xml version="1.0"?>

<siets:reply>

<siets:timestamp> date and time </siets:timestamp>

<siets:storage> storage name </siets:storage>

<siets:requestid>XML request ID</siets:requestid>

<siets:error>

<code>error code</code>

<text> error textual message</text>

<level> error severity</level>

<source>subsystem in which the error occurred</source>

</siets:error>

<siets:seconds> time period in which the XML reply is returned </siets:seconds>

</siets:reply>

The error severity can be one of the following:

Title | Description
Warning | Returned when the command is executed successfully, but there are some indications of problems.
Failed | Returned when the input data is incorrect.
Error | Returned when an error occurs during the command execution.
Fatal | Returned when the system is not functioning.
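
A client can extract the error fields from a reply as sketched below. The sample reply, including its error code and the namespace URI bound to the siets prefix, is hypothetical; tags are matched by local name so that the sketch does not depend on the real namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply; the namespace URI is a placeholder, not the real one.
SAMPLE = """<?xml version="1.0"?>
<siets:reply xmlns:siets="urn:example:siets">
<siets:storage>test</siets:storage>
<siets:requestid>42</siets:requestid>
<siets:error>
<code>1234</code>
<text>Example error text</text>
<level>Failed</level>
<source>example subsystem</source>
</siets:error>
</siets:reply>"""

def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts before a name."""
    return tag.rsplit("}", 1)[-1]

def extract_error(reply_xml):
    """Return the error fields as a dict, or None if there is no error."""
    root = ET.fromstring(reply_xml)
    for element in root.iter():
        if local_name(element.tag) == "error":
            return {local_name(child.tag): (child.text or "").strip()
                    for child in element}
    return None

error = extract_error(SAMPLE)
print(error["level"], error["code"], error["text"])
```

The level field can then be compared against the severities listed above to decide how the application reacts.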

The purpose of the error severity is to inform the system:

SIETS is a transaction-based system, which means that commands have a predefined timeout period. If a command is not executed within this timeout period, the command returns an error.

It is possible to define a timeout period for the request, or configure it for the SIETS server.

For more information on configuring timeout periods for the SIETS server, see the SIETS Administration and Configuration Guide.

