5. Configuring SIETS Storage

When a SIETS storage is added, its configuration XML file is created automatically, with default parameter values set for each SIETS storage instance.

SIETS storage is configured at the instance level, because each instance can be located on different hardware and contain a different amount of data.

The SIETS storage configuration file includes parameters for SIETS storage users, indexing options, and performance tuning.

You may want to keep the default values of the performance tuning parameters: they are tuned for the most common sizes and types of data, and they can be difficult to understand. Other parameters are specific to your system, and you must configure them yourself. These include users, user passwords, optional dictionary settings, and so on.

If you nevertheless feel that performance tuning is necessary but are not sure which parameters to change, contact the SIETS support team for assistance.

For information on contacting the SIETS support team, see Getting Help.

SIETS storage configuration is performed in textual mode. After you select the SIETS storage configuration control, the SIETS storage configuration XML file is loaded in SIETS Enterprise Manager. Configuring SIETS storage means editing configuration parameters in the tags of the XML file.

If any of the configuration parameter tags are deleted from the SIETS storage configuration XML file, the default value of the parameter is used when running the SIETS storage.
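The fallback to default values can be pictured with a short Python sketch. It is only an illustration of the rule above, not how the SIETS storage itself reads the file: the root element name `config`, the helper `read_general`, and the sample content are assumptions, and the defaults are the ones listed for the `<general>` element in this chapter.

```python
import xml.etree.ElementTree as ET

# Defaults as listed for <general> in this chapter.
GENERAL_DEFAULTS = {
    "max_resultset": "1000",
    "timeout": "60",
    "log_rotate": "60",
    "dump": "no",
}


def read_general(config_xml: str) -> dict:
    """Return <general> settings, using defaults for tags that are missing."""
    root = ET.fromstring(config_xml)
    settings = dict(GENERAL_DEFAULTS)
    general = root.find("general")
    if general is not None:
        for child in general:
            if child.text is not None:
                settings[child.tag] = child.text.strip()
    return settings


# Hypothetical file content in which <timeout> and <dump> have been deleted.
# The root element name "config" is an assumption made for this sketch.
sample = """
<config>
  <general>
    <storage>Newspapers</storage>
    <max_resultset>500</max_resultset>
  </general>
</config>
"""

settings = read_general(sample)
print(settings["max_resultset"])  # 500, taken from the file
print(settings["timeout"])        # 60, the default for the missing tag
```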

The changes made to the SIETS storage configuration file become effective after the instance is stopped, if running, and started again with the new configuration.

For more information on stopping and starting SIETS storage instances, see Running SIETS Storages.

Because opening, editing, saving, and closing the SIETS storage configuration file is the same for all configuration parameters, it is described in a separate section. Each parameter or parameter group is also described in a separate section, which shows its representation in the SIETS storage configuration file and explains what it means.

5.1. Working with SIETS Storage Configuration File

To open, edit, save, and close the SIETS storage configuration file from SIETS Enterprise Manager, proceed as follows:

  1. After you have logged in to SIETS Enterprise Manager, in Main Menu, select SIETS Storages.

  2. In the SIETS Storages window, in the Name column, select the SIETS storage to be configured.

  3. To configure an instance, select Configuration below the instance.

  4. Select Instance Configuration.

  5. To open the configuration file for editing, click Edit.

  6. Edit the configuration parameter values in tags as described in the following sections.

  7. To save the changes made, click Save.

  8. To discard the changes and close the window, click Cancel.

5.2. SIETS Storage Configuration File Parameters

This section describes the SIETS storage configuration file parameters. Each subsection contains a description of a parameter and its child parameters, and an example of how they appear in the SIETS storage configuration file.

5.2.1. General

The following table describes general SIETS storage configuration parameters:

| Second level element | Third level element | Description | Default |
|---|---|---|---|
| <general> | | General information about the SIETS storage. | |
| | <storage> | SIETS storage name, entered when adding a new SIETS storage. | |
| | <port> | SIETS storage port, entered when adding a new SIETS storage. | |
| | <max_resultset> | Maximum number of documents returned to the result set. | 1000 |
| | <timeout> | Function timeout period in seconds. If a command is not executed within this period, the command returns an error. | 60 |
| | <log_path> | Path, relative to the SIETS storage directory, where all log files are stored. | |
| | <log_rotate> | Number of days after which all log files are deleted, so that the disk is not flooded with log information. | 60 |
| | <dump> | Whether a dump is created. Possible values: no (no dump), error (only commands that cause errors are dumped), all (all commands are dumped). | no |
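The value constraints in this table can be checked before a file is saved. The following Python sketch is one possible check, not part of SIETS: the function name `check_general` is hypothetical, and the rule that the numeric parameters must be whole numbers is an assumption (the table only gives their units).

```python
import xml.etree.ElementTree as ET

# Values the table lists for <dump>.
ALLOWED_DUMP = {"no", "error", "all"}


def check_general(general_xml: str) -> list:
    """Return a list of problems found in a <general> element."""
    general = ET.fromstring(general_xml)
    problems = []
    # Assumption: these parameters must be whole numbers.
    for tag in ("port", "max_resultset", "timeout", "log_rotate"):
        text = general.findtext(tag)
        # A missing tag is not a problem: the default value is used instead.
        if text is not None and not text.strip().isdigit():
            problems.append(f"<{tag}> must be a whole number, got {text!r}")
    dump = general.findtext("dump")
    if dump is not None and dump.strip() not in ALLOWED_DUMP:
        problems.append(f"<dump> must be one of {sorted(ALLOWED_DUMP)}, got {dump!r}")
    return problems


good = "<general><port>90</port><timeout>60</timeout><dump>no</dump></general>"
bad = "<general><timeout>one minute</timeout><dump>yes</dump></general>"

print(check_general(good))       # []
print(len(check_general(bad)))   # 2
```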

Example:

<general>
  <storage>Newspapers</storage>
  <port>90</port>
  <max_resultset>1000</max_resultset>
  <timeout>60</timeout>
  <log_path>./logs</log_path>
  <log_rotate>60</log_rotate>
  <dump>no</dump>
</general>

5.2.2. Users

The following table describes user management SIETS storage configuration parameters:

| Second level element | Third level element | Fourth level element | Description |
|---|---|---|---|
| <users> | | | This element contains a list of SIETS storage users. |
| | <user> | | This element contains a user name and password. It is repeated for each user. |
| | | <name> | User name. |
| | | <pass> | User password. |

Example:

<users>
  <user>
    <name>John</name>
    <pass>unbreakable_password</pass>
  </user>
</users>

SIETS storage users must not be confused with SIETS Enterprise Manager user accounts. For information on SIETS Enterprise Manager users, see Administering SIETS Enterprise Manager User Accounts.
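When a `<users>` element is generated by a script rather than typed by hand, an XML library takes care of escaping characters such as `&` and `<` in passwords. The sketch below uses Python's standard library; the function name `build_users` and the sample account are hypothetical, and it assumes that the storage reads the file with a standard XML parser.

```python
import xml.etree.ElementTree as ET


def build_users(accounts: dict) -> str:
    """Build a <users> element from a mapping of user name to password."""
    users = ET.Element("users")
    for name, password in accounts.items():
        user = ET.SubElement(users, "user")
        ET.SubElement(user, "name").text = name
        ET.SubElement(user, "pass").text = password
    return ET.tostring(users, encoding="unicode")


# Hypothetical account whose password contains XML special characters.
users_xml = build_users({"John": "p&ss<1>"})
print(users_xml)
```

The output can be pasted in place of the `<users>` element in the configuration file.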

5.2.3. Dictionary

The dictionary element, which is a second level element, contains several parameters and parameter groups for configuring how search queries are defined. Each of the following sections describes one of these options.

5.2.3.1. Special Symbols

The following table describes the parameter for configuring special symbols:

| Third level element | Description | Default |
|---|---|---|
| <specsymbols> | Letters and numbers are regular symbols that form words. Unless listed in this element, all other symbols, such as ! _ % & * . , are treated as word separators. This element lists the special symbols that are treated as regular symbols in addition to letters and numbers. The default value _ means that, for example, sleep_walk is treated as one word. To specify several special symbols, enter them without spaces or any other separator. | _ |
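The effect of the special symbols list on word splitting can be sketched in Python. This is an illustration of the rule described above and not the tokenizer used by the SIETS storage; the function name `split_words` is hypothetical.

```python
import re


def split_words(text: str, specsymbols: str = "_") -> list:
    """Split text into words made of letters, digits, and special symbols."""
    # [^\W_] matches any letter or digit (word characters except "_").
    if specsymbols:
        pattern = rf"(?:[^\W_]|[{re.escape(specsymbols)}])+"
    else:
        pattern = r"[^\W_]+"
    return re.findall(pattern, text)


print(split_words("sleep_walk at 9pm!"))       # ['sleep_walk', 'at', '9pm']
print(split_words("sleep_walk at 9pm!", ""))   # ['sleep', 'walk', 'at', '9pm']
print(split_words("R&D e-mail", "&-"))         # ['R&D', 'e-mail']
```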

5.2.3.2. Wildcard Patterns

The following table describes parameters for configuring wildcard patterns support:

| Third level element | Fourth level element | Description | Default |
|---|---|---|---|
| <wildcards> | | This element contains parameters for configuring wildcard pattern support. | |
| | <allow> | Whether wildcard pattern search is enabled. | yes |
| | <cover_factor> | When a wildcard pattern defines a class of words to be searched, only a limited number of the statistically most frequent words are searched for, to ensure higher performance. This element defines the limit as a percentage of the total number of occurrences in the SIETS storage of all words matching the wildcard pattern. See the example below the table. | 95 |
| | <min_expand> | Minimum number of words from the SIETS storage vocabulary that are used to match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if only 2 words fall within the cover_factor but min_expand is 4, then 4 words are used in the search. | 4 |
| | <max_expand> | Maximum number of words from the SIETS storage vocabulary that are used to match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if 20 words fall within the cover_factor but max_expand is 16, then only 16 words are used in the search. | 16 |

Example for <cover_factor>:

Search query: ca?

All words: "car", "cat", "cap", "can", and "cab"

Number of times each word appears in the SIETS storage, and the share of all occurrences that each word covers when the words are ordered by frequency:

| Word | Occurrences | Share of all occurrences covered |
|---|---|---|
| car | 4 | 10% to 40% |
| cat | 3 | 50% to 70% |
| cap | 2 | 80% to 90% |
| can | 1 | 100% |

A cover factor of 60% means that the words "car" and "cat" are searched and returned. Note that the word "cat" is searched because it is enough that at least one of the occurrences of "cat" falls within the 60%.
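The interaction of cover_factor, min_expand, and max_expand can be sketched in Python. This is a reading of the description above, not the SIETS implementation; the function name `select_words` is hypothetical, and cover_factor is taken as a percentage, as in the table.

```python
def select_words(counts: dict, cover_factor: float = 95,
                 min_expand: int = 4, max_expand: int = 16) -> list:
    """Pick the words to search for from the words that match a pattern.

    counts maps each matching word to its number of occurrences.
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    limit = sum(counts.values()) * cover_factor / 100
    selected = []
    covered = 0
    for word in ranked:
        # Stop once the words taken so far already cover the limit. A word
        # is taken if at least one of its occurrences falls within the limit.
        if covered >= limit:
            break
        selected.append(word)
        covered += counts[word]
    # min_expand and max_expand override the cover factor.
    wanted = min(max(len(selected), min_expand), max_expand)
    return ranked[:wanted]


wildcard_counts = {"car": 4, "cat": 3, "cap": 2, "can": 1}

# Cover factor alone (min_expand switched off for the illustration).
print(select_words(wildcard_counts, cover_factor=60, min_expand=0))
# ['car', 'cat']

# With the default min_expand of 4, two more words are added.
print(select_words(wildcard_counts, cover_factor=60))
# ['car', 'cat', 'cap', 'can']
```

The same selection applies to stemming: with the counts car 7, cars 2, car's 1 and a cover factor of 80%, the sketch selects "car" and "cars".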

5.2.3.3. Stemming

The following table describes parameters for configuring stemming:

| Third level element | Fourth level element | Description | Default |
|---|---|---|---|
| <stemming> | | This element contains parameters for configuring stemming. | |
| | <allow> | Whether language declinations search is enabled. | yes |
| | <cover_factor> | When language declinations define a class of words to be searched, only a limited number of the statistically most frequent words are searched for, to ensure higher performance. This element defines the limit as a percentage of the total number of occurrences in the SIETS storage of all words created from the language declinations. See the example below the table. | 95 |
| | <min_expand> | Minimum number of words from the SIETS storage vocabulary that are used to match the language declinations. This parameter overrides the cover_factor parameter. For example, if only 2 words fall within the cover_factor but min_expand is 4, then 4 words are used in the search. | 4 |
| | <max_expand> | Maximum number of words from the SIETS storage vocabulary that are used to match the language declinations. This parameter overrides the cover_factor parameter. For example, if 20 words fall within the cover_factor but max_expand is 16, then only 16 words are used in the search. | 16 |

Example for <cover_factor>:

Search query: $car$

All words: "car", "cars", and "car's"

Number of times each word appears in the SIETS storage, and the share of all occurrences that each word covers when the words are ordered by frequency:

| Word | Occurrences | Share of all occurrences covered |
|---|---|---|
| car | 7 | 10% to 70% |
| cars | 2 | 80% to 90% |
| car's | 1 | 100% |

A cover factor of 80% means that only the words "car" and "cars" are searched and returned. Note that the word "cars" is searched because it is enough that at least one of the occurrences of "cars" falls within the 80%.

5.2.3.4. Alternatives Support

If an alternatives search is performed, the system returns a set of alternative words from the SIETS storage vocabulary that are similar in spelling or have a different language declination. For example, if you enter "bote", then "bite" and "byte" are offered for searching. Note that only words from the SIETS storage are returned.

This feature can be used for fuzzy searches and for spelling error corrections.

The following table describes parameters for configuring alternatives support:

| Third level element | Fourth level element | Description | Default |
|---|---|---|---|
| <alternatives> | | This element contains parameters for configuring alternatives support limits. When searching for alternative words, the alternatives command considers statistical information about the occurrence of the alternative word in the vocabulary and the similarity of the alternative word to the search term. Although the parameters for calculating similarity and occurrence are defined when the alternatives command is performed, the limit values for these parameters can be configured in the SIETS storage configuration file. | |
| | <cr> | Minimum ratio between the occurrence of the alternative and the occurrence of the search term for the alternative to be included in the search query. If you increase this parameter, fewer results are returned to the result set, but performance improves. | 2.0 |
| | <idif> | Maximum value of the number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference. If you increase this parameter, more results are returned to the result set, but performance is reduced. | 3.0 |
| | <h> | Minimum value of the number that gives an overall estimate of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value. If you increase this parameter, fewer results are returned to the result set, but performance improves. | 2.5 |
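How the three limits narrow down a list of candidate words can be sketched in Python. The sketch assumes that cr, idif, and h have already been calculated for each candidate, because this chapter does not give the formulas; the class `Alternative`, the function `within_limits`, and all candidate values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    word: str
    cr: float    # ratio of occurrences: alternative to search term
    idif: float  # difference from the search term
    h: float     # overall quality estimate


def within_limits(alt: Alternative, min_cr: float = 2.0,
                  max_idif: float = 3.0, min_h: float = 2.5) -> bool:
    """Check a candidate against the configured <cr>, <idif>, and <h> limits."""
    return alt.cr >= min_cr and alt.idif <= max_idif and alt.h >= min_h


# Made-up candidates for the search term "bote".
candidates = [
    Alternative("bite", cr=3.0, idif=1.0, h=3.2),
    Alternative("byte", cr=2.4, idif=2.0, h=2.6),
    Alternative("boat", cr=5.0, idif=3.5, h=2.8),   # differs too much
    Alternative("bota", cr=0.5, idif=1.0, h=1.0),   # too rare
]

accepted = [alt.word for alt in candidates if within_limits(alt)]
print(accepted)  # ['bite', 'byte']
```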

5.2.3.5. Ignored Words in Search Queries

The following table describes parameters for configuring ignored words options:

| Third level element | Fourth level element | Description | Default |
|---|---|---|---|
| <ignore> | | This element contains parameters for detecting ignored words. | |
| | <word_freq> | Ratio between the number of all words in the SIETS storage and the number of occurrences of the word to be ignored. If this ratio for a word is less than this number, the word is added to the ignored word list. | 500 |
| | <word_len> | Maximum length of the word to be ignored. | 5 |

Note: It is possible to include ignored words in the search by using the “+” sign in front of the ignored word. Full text index contains all words, including ignored words. The ignored words feature is used only for filtering out common words such as “and”, “but”, “is”.
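The two ignored-word parameters can be read as a simple test, sketched here in Python. The function name `is_ignored` and the word counts are hypothetical, and combining the two conditions with "and" is an assumption based on the parameter descriptions.

```python
def is_ignored(word: str, occurrences: int, total_words: int,
               word_freq: float = 500, word_len: int = 5) -> bool:
    """Decide whether a word would be placed on the ignored word list."""
    if occurrences == 0:
        return False
    # Ratio between all words in the storage and this word's occurrences.
    ratio = total_words / occurrences
    return ratio < word_freq and len(word) <= word_len


total = 1_000_000  # made-up number of words in the storage

print(is_ignored("and", 30_000, total))      # True: ratio is about 33
print(is_ignored("however", 5_000, total))   # False: longer than 5 letters
print(is_ignored("byte", 100, total))        # False: ratio is 10000
```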

5.2.3.6. Example

The following is an example of the whole dictionary element:

<dictionary>
  <specsymbols>_</specsymbols>
  <wildcards>
    <allow>yes</allow>
    <cover_factor>0.95</cover_factor>
    <min_expand>4</min_expand>
    <max_expand>16</max_expand>
  </wildcards>
  <national>
    <cover_factor>0.95</cover_factor>
    <min_expand>4</min_expand>
    <max_expand>16</max_expand>
  </national>
  <alternatives>
    <cr>2.0</cr>
    <idif>3.0</idif>
    <h>2.5</h>
  </alternatives>
  <ignore>
    <word_freq>500</word_freq>
    <word_len>5</word_len>
  </ignore>
</dictionary>

5.2.4. Repository

The following table describes SIETS storage repository configuration parameters:

| Second level element | Third level element | Fourth level element | Fifth level element | Description | Default |
|---|---|---|---|---|---|
| <repository> | | | | This element contains the repository configuration parameters. | |
| | <highlight> | | | This element contains parameters for highlighting the matching search terms in the search result. | |
| | | <open_mark> | | Highlight open mark. | &lt;b&gt; |
| | | <close_mark> | | Highlight close mark. | &lt;/b&gt; |
| | <tag_compression> | | | Enables or disables tag compression (values: on, off). Tag compression can reduce the size of the storage on disk if documents are small or tag intensive. For large text documents with few tags, it has no effect on storage size, but performance could be affected. | off |
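The role of the open and close marks can be illustrated with a short Python sketch. It is not the SIETS highlighting routine: the function name `highlight` is hypothetical, and matching whole words without regard to case is an assumption.

```python
import html
import re


def highlight(text: str, terms: list, open_mark: str = "<b>",
              close_mark: str = "</b>") -> str:
    """Wrap each whole-word occurrence of a search term in the marks."""
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: open_mark + m.group(1) + close_mark, text)


# In the configuration file the marks are XML-escaped (&lt;b&gt;).
open_mark = html.unescape("&lt;b&gt;")
close_mark = html.unescape("&lt;/b&gt;")

result = highlight("Fast cars and a car park", ["car"], open_mark, close_mark)
print(result)  # Fast cars and a <b>car</b> park
```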

Example:

<repository>
  <highlight>
    <open_mark>&lt;b&gt;</open_mark>
    <close_mark>&lt;/b&gt;</close_mark>
  </highlight>
</repository>

5.2.5. Index

The following table describes SIETS storage indexing configuration parameters:

| Second level element | Third level element | Fourth level element | Description | Default |
|---|---|---|---|---|
| <index> | | | This element contains the indexing configuration parameters. | |
| | <cache> | | | |
| | | <size> | Indexing cache size in megabytes, from 50 to 150 MB. If you enter a number outside this interval: if it is less than 50, performance is very low; if it is more than 150, performance is not affected. Note that the indexing daemon uses more RAM than this number, because it also performs other operations. If you import large amounts of data, several GB in size, the whole cache is used. | 80 |
| | | <usage_idle> | Minimum amount of data in the cache, in percent, for indexing. Indexing starts only if this minimum is exceeded. If the amount of data in the cache is less than the minimum, background indexing is not performed. Leave this parameter unchanged unless advised by the SIETS technical support team. | 10 |
| | | <usage_critical> | Maximum amount of data in the cache, in percent, for indexing. If this maximum is exceeded, all CPU and I/O resources are used for indexing. If the amount of data in the cache is less than the maximum, CPU and I/O resources are used for indexing in proportion to the amount of data in the cache. Leave this parameter unchanged unless advised by the SIETS technical support team. | 90 |
| | <background_indexing> | | Whether background indexing is performed. If not, indexing is performed only when the index command is sent. Leave this parameter unchanged unless advised by the SIETS technical support team. | yes |
| | <optimize_to> | | Number of search results to be optimized according to relevance. Search results after this number are sorted by rating. It is suggested to set this number to the maximum number of documents returned to the result set. The greater the number, the more relevant the search results; the smaller the number, the higher the performance. | 1000 |
| | <weight_threshold> | | Relevance weight threshold at which a document is considered very relevant. For example, if 100 is the maximum relevance weight, then 90 is very close to the maximum, yet documents with such relevance are still likely to exist in practice. Therefore, this weight is considered very relevant. | 90 |
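The relationship between cache usage and the resources given to indexing, as described for usage_idle and usage_critical, can be sketched in Python. The actual scheduling inside the indexing process is not documented here, so the function `indexing_share` and its linear formula are assumptions that only follow the wording of the table.

```python
def indexing_share(cache_used_percent: float, usage_idle: float = 10,
                   usage_critical: float = 90) -> float:
    """Return the share (0.0 to 1.0) of CPU and I/O resources for indexing."""
    if cache_used_percent <= usage_idle:
        return 0.0  # minimum not exceeded: no background indexing
    if cache_used_percent > usage_critical:
        return 1.0  # maximum exceeded: all resources go to indexing
    # In between, resources are proportional to the amount of data in the cache.
    return cache_used_percent / 100


for used in (5, 50, 95):
    print(used, indexing_share(used))
```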

Example:

<index>
  <cache>
    <size>80</size>
    <usage_idle>10</usage_idle>
    <usage_critical>90</usage_critical>
  </cache>
  <background_indexing>yes</background_indexing>
  <optimize_to>1000</optimize_to>
  <weight_threshold>90</weight_threshold>
</index>

