Appendix C: Frequently Asked Questions

This section contains the following frequently asked questions:

How can I import binary data to SIETS?

To import binary data, such as MS Word or PDF document files, to the SIETS storage, enter them in the info part of the document.

Note: Data in the info part are not available for FTS. If you want your data to be available for FTS, they must be stored as plain text.

Binary data usually do not comply with the XML formatting standard, but everything imported to the SIETS storage must comply with it. Therefore, before importing binary data to the SIETS storage, you must encode them as text, using base64 or another text-safe encoding.
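As a rough illustration of the encoding step, the following Python sketch base64-encodes raw bytes and places the result inside an XML element. The element names document and info, and both helper names, are assumptions made for this example; they are not the confirmed SIETS schema, which is described in Understanding SIETS Document Structure.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary_for_xml(raw: bytes) -> str:
    """Return an XML fragment that carries `raw` as base64 text.

    The <document>/<info> element names are placeholders chosen for
    this sketch, not the confirmed SIETS document layout.
    """
    encoded = base64.b64encode(raw).decode("ascii")  # plain ASCII, safe in XML
    doc = ET.Element("document")
    info = ET.SubElement(doc, "info")
    info.text = encoded
    return ET.tostring(doc, encoding="unicode")


def unwrap_binary_from_xml(fragment: str) -> bytes:
    """Reverse of wrap_binary_for_xml: recover the original bytes."""
    info = ET.fromstring(fragment).find("info")
    # An empty element parses with text None, so fall back to "".
    return base64.b64decode(info.text or "")
```

Base64 output uses only letters, digits, "+", "/" and "=", so it needs no further XML escaping. The cost is that the encoded text is about one third larger than the original binary data.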

For more information on document parts, see Understanding SIETS Document Structure.

How can I make SIETS automatically ignore common words when performing FTS?

The SIETS server automatically detects words that appear in the SIETS storage most often and adds them to the ignored words list. These words are considered to be common words that are ignored during FTS.

It is possible to change the limit of the ignored words list. For more information on managing this limit, see the SIETS Administration and Configuration Guide.

For more information on ignored words, see Ignored Words.

How can I see the actual query that is used for FTS?

The actual query that is used for FTS often differs from the one you entered as a search query. Reasons for this can be the following:

To see the actual query used for FTS, use the <real_query> tag of the XML reply to the search command.
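A minimal Python sketch of reading that tag from a reply is shown below. Only the real_query tag name comes from this guide; the surrounding reply and content elements in the sample, and the function name, are hypothetical. The sketch also assumes that the reply does not use XML namespaces.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def extract_real_query(reply_xml: str) -> Optional[str]:
    """Return the text of the first <real_query> element, or None."""
    root = ET.fromstring(reply_xml)
    # iter() searches the whole tree, so the exact nesting does not matter.
    for element in root.iter("real_query"):
        return element.text
    return None


# Hypothetical reply; the wrapper elements are invented for this example.
sample_reply = """
<reply>
  <content>
    <real_query>example query</real_query>
  </content>
</reply>
"""

print(extract_real_query(sample_reply))
```

Searching the whole tree rather than a fixed path keeps the sketch independent of the reply layout, which is documented in Search.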

For more information on the search command, see Search.

How can I export the vocabulary with a SIETS API command?

The vocabulary is a list of all unique words in the SIETS storage. Unique words are found in documents and added to the vocabulary while storing these documents to the SIETS storage. Each SIETS storage has its own vocabulary.

Unfortunately, it is not possible to export the vocabulary with any of the SIETS API commands. However, on the file level the vocabulary is stored in a text file, where each line contains one word. You can copy this text file and view it.
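Once you have a copy of the vocabulary text file, it can be read with a few lines of code. The sketch below assumes only what is stated above, namely one word per line. The function name, the file path passed to it, and the UTF-8 default encoding are assumptions; check the actual file location and encoding in the SIETS Administration and Configuration Guide.

```python
from pathlib import Path
from typing import List


def load_vocabulary(path: str, encoding: str = "utf-8") -> List[str]:
    """Read a copied vocabulary file that holds one word per line.

    `path` and `encoding` are supplied by the caller; this guide does
    not specify them, so treat the defaults as assumptions.
    """
    words = []
    with Path(path).open(encoding=encoding) as handle:
        for line in handle:
            word = line.strip()
            if word:  # skip blank lines
                words.append(word)
    return words
```

Work on a copy of the file rather than on the file that the SIETS server is using.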

For information on the vocabulary text file, see the SIETS Administration and Configuration Guide.

For more information on vocabulary, see Understanding Storing Information in SIETS.

Why do I get an error: connection failed when importing data to the SIETS server from a Windows NT 4.0 or Windows 2000 environment?

Importing data to the SIETS server, just like any other operation with the SIETS server, is performed by transporting XML requests and replies via HTTP.

When importing a large amount of data to the SIETS server, many TCP/IP connections are opened. After the connections are closed, they remain in the TIME_WAIT state for a fixed time period.

By default, in the Windows NT 4.0 or Windows 2000 environment, the connection limit is rather small and the TIME_WAIT period is rather long.

Therefore, because many new connections can be created per second and each closed connection remains in the TIME_WAIT state for some time, the number of connections can reach the limit very quickly.

In that case, the system does not allow a new connection to be created and the error is returned.
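The reasoning above can be turned into a rough estimate: in a steady state, the number of connections waiting in TIME_WAIT is approximately the connection rate multiplied by the TIME_WAIT period. The Python sketch below is only a back-of-the-envelope model. The function names are hypothetical, and the figures in the usage lines (3976 available ports and a 240-second period) are assumed example values, not values taken from this guide; check the actual settings on your system.

```python
def time_wait_sockets(connections_per_second: float, time_wait_seconds: float) -> float:
    """Approximate number of closed connections still held in TIME_WAIT."""
    return connections_per_second * time_wait_seconds


def max_sustainable_rate(available_ports: int, time_wait_seconds: float) -> float:
    """Highest connection rate that does not exhaust the available ports."""
    if time_wait_seconds <= 0:
        raise ValueError("time_wait_seconds must be positive")
    return available_ports / time_wait_seconds


# Assumed example figures, for illustration only.
assumed_ports = 3976
for assumed_delay in (240, 30):
    rate = max_sustainable_rate(assumed_ports, assumed_delay)
    print(f"TIME_WAIT {assumed_delay} s: about {rate:.1f} new connections per second")
```

The model shows why shortening the TIME_WAIT period helps: the sustainable connection rate grows in inverse proportion to the period.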

To adjust the connection limit and the TIME_WAIT period, edit the following key in the Windows NT 4.0 or Windows 2000 registry:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]

"TcpTimedWaitDelay"=dword:00000015

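where the value is the TIME_WAIT period in seconds. In registry file (.reg) notation, a dword value is written in hexadecimal, so 00000015 corresponds to 21 seconds.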
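Converting between a number of seconds and the dword notation is easy to get wrong by hand. The two helpers below are a small sketch of that conversion; their names are hypothetical. They treat the digits after dword: as hexadecimal, which is how registry (.reg) files store dword values.

```python
def seconds_to_reg_dword(seconds: int) -> str:
    """Format a number of seconds as a .reg dword (8 hexadecimal digits)."""
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("value does not fit in a 32-bit dword")
    return f"dword:{seconds:08x}"


def reg_dword_to_seconds(text: str) -> int:
    """Parse a .reg dword such as 'dword:0000001e' into an integer."""
    prefix = "dword:"
    if not text.startswith(prefix):
        raise ValueError("expected a value starting with 'dword:'")
    return int(text[len(prefix):], 16)


print(seconds_to_reg_dword(30))                # dword:0000001e
print(reg_dword_to_seconds("dword:00000015"))  # 21
```

Check the range of values that your Windows version accepts for TcpTimedWaitDelay before applying a new setting.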


    

Is it possible to return more than 1000 documents to the result set?

By default, at most 1000 documents are returned to the result set. It is possible to increase this limit. However, sorting search results by relevance or domain is designed for a result set of at most 1000 documents.

If you increase the limit of documents in the result set, the new limit applies to all functions with this one exception: when search results are sorted by relevance or domain, only 1000 documents are returned to the result set.

If you increase the limit of documents in the result set, transactions in the SIETS server take longer to complete. Therefore, you should also increase the timeout period of functions.

For more information on configuring the limit of documents in the result set, see the SIETS Administration and Configuration Guide.

For more information on the relevance, see Relevance.

For more information on grouping documents by a domain, see Search.

When I import a large amount of data to the SIETS storage, why are the data not available for FTS for a while?

When importing data to the SIETS storage, if the memory reserved for the memory cache is not sufficient for the amount of data being imported, the following happens:

  1. The data being imported are written to another cache, which is stored on the disk. During this time, the index state is expanding.

  2. When the import is complete, the SIETS server commits the data written to the disk to the inverted index. During this time, the index state is collapsing.

While the index state is expanding or collapsing, the data written to the disk are not available for FTS. The data become available for FTS only after they are added to the inverted index.

For example, if the amount of data to be imported is tens of GB, these data will not be available for FTS for a few hours.
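A client that needs to know when imported data become searchable can poll the index state until it is no longer expanding or collapsing. The Python sketch below shows only the waiting logic. The get_index_state callable is a hypothetical function that you would implement on top of the status command, and the assumption that the state is reported as the literal words "expanding" and "collapsing" should be checked against Status.

```python
import time
from typing import Callable

# State names taken from the text above; the exact strings are an assumption.
BUSY_STATES = {"expanding", "collapsing"}


def wait_until_searchable(
    get_index_state: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the index is neither expanding nor collapsing.

    Returns True when the index leaves the busy states, or False if
    `timeout_seconds` of waiting elapses first.
    """
    waited = 0.0
    while True:
        if get_index_state().strip().lower() not in BUSY_STATES:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

A long polling interval is usually sufficient, because committing a large import can take hours.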

For more information on the index state, see Status.

