SIETS Features:

Querying, Functions, Performance

Natural language based search API

Siets Server software supports a rich set of linguistic search options using only natural language words or phrases in queries. Below are some examples of typical search query syntax for a Siets Server XML database when using the API command "SEARCH".


Free context search as in web search engines (default)

Free text natural language terms, case insensitive (default querying mode):

a little john in london

would return all documents that contain a combination of the relevant words "little", "John" and "London", while ignoring the words "a" and "in".

Siets Server search engine software can process data in 160 languages in the same XML document store. Multiple language content can be stored in one XML document formatted with UTF-8 character set encoding.

Search queries need to be in the same languages that are actually present in the customer data.


Capitalizing the first letter in a word for case-sensitive search

An intuitive rule of thumb for making case-sensitive free text queries is to capitalize the first letter of a word:

windows

would find both words "windows" and "Windows". However,

Windows

would find only capitalized words "Windows".
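The rule of thumb above can be pictured as a small matching function. This is a minimal Python sketch, assuming (as the examples suggest) that a query term starting with a capital letter must match exactly, while a lower-case term matches any casing. The helper `term_matches` is hypothetical and not part of the Siets API.

```python
def term_matches(query_term: str, word: str) -> bool:
    """Mimic the capitalization rule of thumb (assumed semantics, not Siets code)."""
    if query_term[:1].isupper():
        # Capitalized query term: compare case-sensitively.
        return query_term == word
    # Lower-case query term: ignore case.
    return query_term.lower() == word.lower()


print(term_matches("windows", "Windows"))  # True
print(term_matches("Windows", "windows"))  # False
```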


Free context search, case sensitive

Free text natural language terms, case sensitive:

a Little John in London

would return all documents that contain a combination of the relevant capitalized words "Little", "John" and "London", ignoring "a" and "in", and also ignoring the lower-case words "little", "john" and "london".


Context relevance is King

Siets Server performs natural language text analytics during every search query and surfaces the contextually best matches first, prioritizing natural language analytics over purely algorithmic data sorting.

For the previous example free text query:

a little john in london

a document with the following text content:

Little John went to the nearest underground station in London.

would be sorted ahead of (surfaced above) a document with this text content:

John went to the nearest underground station. It was a little more distance away from his hotel. It was time to go to London.

The second document will also be found and will be present in Siets search results. Yet it will be listed only after the first result, as it is a less contextually meaningful document from the natural language user's point of view.

One can say that the first document is more relevant than the second document for the particular query context.
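One way to picture why the first document is the better context match is to measure how closely the query terms sit together. The sketch below is an illustration only, not the Siets ranking algorithm: it assumes a simplified whitespace tokenizer and scores each text by the smallest window of words that contains all relevant terms. The function names are hypothetical.

```python
from itertools import product
from typing import List, Optional


def tokenize(text: str) -> List[str]:
    """Lower-case the text and strip surrounding punctuation from each word."""
    return [w.strip(".,!?\"'").lower() for w in text.split()]


def smallest_span(text: str, terms: List[str]) -> Optional[int]:
    """Size in words of the smallest window containing every term, or None."""
    words = tokenize(text)
    positions = [[i for i, w in enumerate(words) if w == t.lower()] for t in terms]
    if any(not p for p in positions):
        return None  # at least one term is missing from the text
    return min(max(c) - min(c) + 1 for c in product(*positions))


doc1 = "Little John went to the nearest underground station in London."
doc2 = ("John went to the nearest underground station. It was a little more "
        "distance away from his hotel. It was time to go to London.")
terms = ["little", "john", "london"]  # "a" and "in" are treated as ignored words

# A tighter span means the terms appear in a denser context.
ranked = sorted([doc1, doc2], key=lambda d: smallest_span(d, terms))
print(smallest_span(doc1, terms), smallest_span(doc2, terms))  # 10 24
```

Word density is only one of the criteria Siets Server combines, so a real ranking may differ from this single-factor ordering.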

Standard rules for Siets Server language analytics used for each query are based on fast linguistic data pattern matching. The software takes into account language-based factors that determine the human relevance of search results. Among the pattern matching criteria used by Siets Server are:

  • relevant words making the query context
  • irrelevant or ignored words making the query context
  • matching word location positions per document
  • matching related patterns of word forms such as with stemming, synonyms etc
  • matching word actual frequency per entire database
  • matching word actual frequency per document
  • estimating density of matching words in the textual content, where query terms are found close to each other
  • ranking of the particular data content part matching the query
  • ranking of the document against other documents in search results
  • matching trainable ranking thresholds applied on the data model by customer

The most relevant document context matching the search query terms by combined above text analytics criteria will be surfaced by Siets Server, sorting the results by decreasing contextual relevance and providing small fragments of text snippets from documents where the best context matches were found.

Additionally, Siets Server uses transactional pagination of results to avoid information overload in web browsers, limiting the number of results per page (e.g. 20, 50 or 100). All pagination parameters are configurable for every Siets API query.
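As a rough illustration of the pagination arithmetic, the sketch below converts a page number and page size into a starting document number and a result count. It assumes zero-based document numbering, which is an assumption made for this example. The helper `page_window` is hypothetical, and the actual Siets API parameter names are not shown here.

```python
from typing import Tuple


def page_window(page: int, per_page: int = 20) -> Tuple[int, int]:
    """Return (starting document number, documents per page) for a 1-based page."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    # Assumption: documents in a result set are numbered from 0.
    return (page - 1) * per_page, per_page


print(page_window(1))      # (0, 20)
print(page_window(3, 50))  # (100, 50)
```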

Siets Server enforces natural language content relevance sorting rules for each free text query, unless specified otherwise. That makes it possible to surface the most relevant results upfront, based on the best matching linguistic information in the textual content.

Contextual relevance ranking rules can also be flexibly further customized by the owner for different content surfacing rules, custom defined for each Siets Server database through its search index ranking policy. Please kindly read other sections in this website about Siets unique information ranking model and methods how to apply ranking weights to sort, group and order search results.

Finally, search results based on standard text analytics sorting by context relevance can be combined with other information ordering rules through Siets API, for example, enabling results sorting by numeric values or dates like in classic SQL databases.

There are potentially millions of sort orders possible in a single Siets Server database. When the software is responding to free text ad hoc queries coming from millions of users, it is prioritizing results by the best ranked context match first, and only then by other sort orders.


Free natural language phrase search, case insensitive

Free text natural language phrase, case insensitive:

"a little john in london"

would return all documents that contain the context phrase "a little John in London" with all words exactly in this order, including "a" and "in", and matching all required words in either upper or lower case, so that "john" and "John", "london" and "London" etc. are equal matches.


Free natural language phrase search, case-sensitive

Free text natural language phrase, case-sensitive:

"A Little John in London"

would return all documents that contain the context phrase "A Little John in London" with all words exactly in this order, including "A" and "in", with words starting with a capital letter matched case-sensitively: matching "Little" but not "little", "John" but not "john" etc.


Free text search with enforced all words matches

Use the plus symbol + as a word prefix to require a strict match of that word, regardless of whether it is present in the ignored words list.

Free text with enforced all search terms matches in a natural language query:

+A little John +in London

would return all documents that contain a combination of the relevant words "little", "John" and "London" together with the usually ignored words "A" and "in".


Free text search with stemming for some natural language words

Enclose a word in dollar signs $ to request substitution of that word with a list of word forms derived from stemming, combined with Boolean OR.

Free text with natural-language grammar rules stemming:

little $john$

would return all documents that contain "little" in combination with either the word "John" or any of its stemmed forms such as "John's".


Free text search with synonyms for some natural language words

Enclose a word in percent signs % to request substitution of that word with a list of its synonyms, combined with Boolean OR.

Free text with synonyms taken from a simple synonym list file pre-loaded for each XML database, combined with Boolean OR logic in a query:

little john %country%

would find terms "little" and "john" in combination with either word "country", or word "region", or word "area", provided that all 3 synonym words were listed in the synonym file line, comma-delimited, that started with word "country": "country, region, area".
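The substitution described above can be sketched in a few lines. The example assumes the synonym file holds one comma-delimited line per head word, as in the "country, region, area" example, and rewrites each %word% into an OR group using the { } syntax. The functions `load_synonyms` and `expand_synonyms` are hypothetical helpers for illustration. The server performs this substitution itself, so client code does not need to do it.

```python
import re
from typing import Dict, Iterable, List


def load_synonyms(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map the first word of each comma-delimited line to all words on that line."""
    table: Dict[str, List[str]] = {}
    for line in lines:
        words = [w.strip() for w in line.split(",") if w.strip()]
        if words:
            table[words[0]] = words
    return table


def expand_synonyms(query: str, table: Dict[str, List[str]]) -> str:
    """Replace each %word% with an OR group {w1 w2 ...} of its listed synonyms."""
    def repl(match: "re.Match[str]") -> str:
        word = match.group(1)
        # A word without a synonym line expands to itself.
        return "{" + " ".join(table.get(word, [word])) + "}"
    return re.sub(r"%(\w+)%", repl, query)


table = load_synonyms(["country, region, area"])
print(expand_synonyms("little john %country%", table))
# little john {country region area}
```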


Free text search with different word suffixes

Use the asterisk symbol * as a wildcard template for words.

Free text with a wildcard template for the word part in suffix:

little Jon*

would find all combination matches of word "little" with either "Jones", or "Jonson", or "Jonathan" etc.


Free text search with different word prefixes

Free text with a wildcard template for the word part in the prefix:

little *athan

would find all combination matches of word "little" with either "Jonathan", "Bathan", "Rathan" etc.


Free text search with middle-word wildcard

Free text with a wildcard template for the part in the middle of word:

little jo*n

would find all combination matches of word "little" with either "Jonathan", "John", "join" etc.


Free text search with letter positioning template

Use a question mark ? symbol in a free text query as a specific word letter wildcard for designated letter positions per word (or per any character string that is not a delimiter symbol):

little J?s?n

would find all combination matches of word "little" with either "Jason", "Josin", "Jasin", "Jeson" etc.


Free text search with letter selector template for the position

Use square brackets [ ] in a free text query to specify allowed letters in that letter position in the word.

Free text with a selector for only specified letters template in a particular position in the word:

Little Jonn[iy]

would find all combination matches of case-sensitive word "Little" with either "Jonny" or "Jonni" etc, but not "Jonna", "Jonne" etc.
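Python's standard `fnmatch` module happens to use the same three template symbols (`*`, `?` and `[ ]`), so it offers a convenient way to try out which words a template would accept. This is only an analogy for experimenting with templates, not the Siets matching engine. The `case_sensitive` switch is an assumption added for this sketch.

```python
from fnmatch import fnmatchcase


def template_matches(template: str, word: str, case_sensitive: bool = False) -> bool:
    """Check a single word against a wildcard template using fnmatch rules."""
    if not case_sensitive:
        template, word = template.lower(), word.lower()
    return fnmatchcase(word, template)


for template, word in [
    ("Jon*", "Jonathan"),    # suffix wildcard
    ("*athan", "Jonathan"),  # prefix wildcard
    ("jo*n", "join"),        # middle-word wildcard
    ("J?s?n", "Jason"),      # single-letter positions
    ("Jonn[iy]", "Jonna"),   # letter selector rejects "a"
]:
    print(template, word, template_matches(template, word))
```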


Free text search with wildcards in phrases

Natural language phrases can use wildcards for individual phrase terms:

"software develop*"

would find either phrase "software developer", or phrase "software developers" or phrase "software development" etc.


Ignored words detection and use in natural language queries

Siets Server automatically detects ignored words above a certain threshold in natural language queries.

By default, Siets ignores common words and characters such as "a", "the", "and", "where", "how" etc., as well as certain single characters and single letters, because they tend to slow down the search without improving the search results.

The SIETS server detects words that appear in the SIETS storage most often and gradually adds them to the ignored words list when loading and indexing large amounts of data.

It is possible to place reasonable restrictions (percentage thresholds, word lengths etc) on ignored words list for each Siets Server data store collection so that limitations best fit the application business logic, user search requirements and type of data content.
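The idea of a percentage threshold can be illustrated with a toy detector. The sketch below assumes the threshold is the share of documents a word appears in and that an optional maximum word length applies. Both are assumptions for illustration, and `detect_ignored_words` is a hypothetical helper rather than the server's actual procedure.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def detect_ignored_words(documents: Iterable[str], max_share: float = 0.5,
                         max_length: Optional[int] = None) -> Set[str]:
    """Return words that occur in more than max_share of the documents."""
    docs = list(documents)
    doc_counts: Counter = Counter()
    for text in docs:
        # Count each word once per document.
        doc_counts.update(set(text.lower().split()))
    ignored = set()
    for word, count in doc_counts.items():
        if count / len(docs) > max_share:
            if max_length is None or len(word) <= max_length:
                ignored.add(word)
    return ignored


docs = ["a little john in london", "a walk in the park", "john and anna in a cafe"]
print(sorted(detect_ignored_words(docs, max_share=0.9)))  # ['a', 'in']
```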

Please note that SIETS server still creates the full-text index with all ignored words and their positions. Ignored words are actually being included into the index exactly as they occur in natural language text content, so that a user can request search queries with normally ignored word matches, if necessary.

If a common word or a character was ignored during search query by Siets Server, yet it is essential to getting the results you want, you can include it by preceding it with a plus sign +:

John +and Anna

will find only results with all three words present: "John", "and" and "Anna".


Proximity search of words at certain distance from each other

Enclose two or more words in @ symbols, with a number after the opening @ specifying the maximum distance, in words, within which matches are searched for.

Proximity search for two or more words within nearby textual distance:

@ 4 John Smith @

would find documents with the content "John Armitage Smith" and "John Henry Armitage Smith", but would not find a document with the content "John went to the nearest underground station. Smith was not there...".
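The proximity rule can be tried out with a sliding window over the words of a text. The sketch assumes that the number counts the size of the window, in words, that must contain all terms. That reading is consistent with the three examples above, but the exact counting rule used by Siets Server is not specified here. The helper `within_distance` is hypothetical.

```python
from typing import List


def within_distance(text: str, terms: List[str], max_words: int) -> bool:
    """True if some window of max_words consecutive words contains every term."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    wanted = {t.lower() for t in terms}
    for start in range(len(words)):
        window = set(words[start:start + max_words])
        if wanted <= window:
            return True
    return False


for text in [
    "John Armitage Smith",
    "John Henry Armitage Smith",
    "John went to the nearest underground station. Smith was not there...",
]:
    print(within_distance(text, ["John", "Smith"], 4))
```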


Boolean search AND, OR, NOT in natural language queries

All of the natural language search options described above can be combined in a single search query using ( ) as AND, { } as OR and ~ as NOT.

Boolean AND: use ( ) brackets

John Smith
(John Smith)

would return documents that contain both the word "John" and the word "Smith", in any order.

Boolean AND with ( ) is the default free text search criterion, so the brackets can be omitted in simple one-level queries with no nested terms.

Boolean OR: use { } curly brackets

{John Smith}

would return documents that contain either the word “John” or the word “Smith”.

Boolean NOT: use symbol ~ tilde

John ~Smith

would return documents that contain the word "John", but do not contain the word "Smith".


Search only specific XML data fields using free text natural language terms

Combined meta-data search using free text natural language context in specific data fields only:

<name>john</name> <address>piccadilly london</address>

would find all documents with "John" in the <name> field whose <address> field contains both text terms "Piccadilly" and "London", in any order.
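An application that collects search terms from a form could assemble such a fielded query with a small helper. The sketch below is a hypothetical client-side convenience, not part of the Siets API. Escaping reserved XML characters in the terms is an assumption made here for safety and should be checked against the developer documentation.

```python
from xml.sax.saxutils import escape


def field_query(**fields: str) -> str:
    """Build a fielded query string such as <name>john</name> from keyword arguments."""
    return " ".join(
        f"<{tag}>{escape(terms)}</{tag}>" for tag, terms in fields.items()
    )


print(field_query(name="john", address="piccadilly london"))
# <name>john</name> <address>piccadilly london</address>
```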

Siets customers can freely choose to provide this type of easy to grasp programmable XML data filtering, slicing and pivoting into different result subsets for their intelligent end-users for powerful search driven analytics and reporting.


Search across all XML data fields using free text natural language terms

Siets Server can be instructed with "index=all" policy to index all content from all individual XML data fields into a single common full-text index.

The following query example would also work to provide reasonably good search results, if XML database has been indexed to perform search across all unstructured text and through the entire XML data model.

piccadilly london john

Above example would probably be less precise than content search per specific XML field only, still it will yield the list of results for review to user with text snippets showing context and allowing user to decide which result is most useful.

Search across all XML text content can be combined with meta data search in XML data fields.


Sanitizing natural language query options

Siets customers can flexibly sanitize all FTS queries for greater safety, leaving users only the options needed to perform plain text and phrase queries.

Then anyone with basic web search skills, even without knowing the underlying XML data model, can easily search the database, retrieving results in the default sort order and limited by pre-programmed access logic.


Multi-level nesting of search queries with Boolean logic to build complex analytics

Siets software Boolean AND ( ), OR { } and NOT ~ query syntax can be nested in multiple levels to make different logical combinations of FTS queries with XML field level context queries.

Complex business logic, if required by customer application, can be easily created. A simple example:

{(John Smith) (Abby Brown)}

would return documents that contain either the word "John" and the word "Smith", or the word "Abby" and the word "Brown".
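When queries are built programmatically, nesting is easier to get right with small composing functions than with manual string concatenation. The helpers below (`all_of`, `any_of`, `none_of`) are hypothetical names chosen for this sketch. They only produce the bracket syntax described above.

```python
def all_of(*parts: str) -> str:
    """Boolean AND: wrap the parts in round brackets."""
    return "(" + " ".join(parts) + ")"


def any_of(*parts: str) -> str:
    """Boolean OR: wrap the parts in curly brackets."""
    return "{" + " ".join(parts) + "}"


def none_of(term: str) -> str:
    """Boolean NOT: prefix the term with a tilde."""
    return "~" + term


query = any_of(all_of("John", "Smith"), all_of("Abby", "Brown"))
print(query)  # {(John Smith) (Abby Brown)}
```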


Similar text content search in natural language to another content

Additionally, unstructured text analytics is possible by similarity of a given text to other text content, with tunable thresholds for the number of significant word occurrences and co-occurrences, for the best results in large data sets with billions of objects.


Spell-checking support with 'Did you mean that?' option

Another helpful text analytics feature is the 'Did you mean that?' spell-check correction function, which returns a list of similarly spelled words from your actual XML database index.
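The behavior can be approximated on a small scale with Python's standard `difflib` module, which ranks candidate words by string similarity. This is an illustration of the concept using a hand-written vocabulary. It is not the algorithm Siets Server uses, and `did_you_mean` is a hypothetical helper.

```python
from difflib import get_close_matches
from typing import Iterable, List


def did_you_mean(word: str, vocabulary: Iterable[str], limit: int = 3) -> List[str]:
    """Suggest similarly spelled words from the vocabulary, best match first."""
    lowered = {v.lower(): v for v in vocabulary}
    hits = get_close_matches(word.lower(), list(lowered), n=limit, cutoff=0.6)
    # Return suggestions in their original spelling.
    return [lowered[h] for h in hits]


suggestions = did_you_mean("jonh", ["John", "Jones", "London"])
print(suggestions)
```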


Multi-language data store

Users can store and process data in multiple languages within the same XML document, avoiding tons of localization efforts in their application software code.


Fast character set transformations in server code

An additional 'perk' is fast server-side XML data conversion between national ISO character sets and the UTF-8 data store, if requested through the client API.

All data is always stored in standard UTF-8 data store on Siets Server and can be queried either using UTF-8 or a national ISO character set.
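For readers unfamiliar with character set conversion, the sketch below shows the kind of transformation involved, using Python's built-in codecs. It is a client-side illustration only. The choice of ISO-8859-1 as the default and both function names are assumptions for this example.

```python
def iso_to_utf8(data: bytes, charset: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from a national ISO character set to UTF-8."""
    return data.decode(charset).encode("utf-8")


def utf8_to_iso(data: bytes, charset: str = "iso-8859-1") -> bytes:
    """Re-encode UTF-8 bytes to a national ISO character set."""
    return data.decode("utf-8").encode(charset)


latin1_bytes = "café".encode("iso-8859-1")  # one byte per character
utf8_bytes = iso_to_utf8(latin1_bytes)      # "é" becomes two bytes
print(len(latin1_bytes), len(utf8_bytes))   # 4 5
```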

Nearly effortless linguistic search experience

Natural language based querying and search paradigm is the holistic approach of Siets Server software design.

Customer software applications powered by Siets Server do not require end-users to learn query syntax forms more complex than those described above.

It enables Siets Server software customers and application developers to start building applications where users need only basic skills in searching for relevant information.

Most users are already familiar with the basics from their web and corporate database search experience.

Default linguistic search options in Siets Server search query syntax will work well enough just using plain text natural language terms.

Siets Server customers can start providing easy, intuitive, fast and relevant user search experience in their own databases similar to customer satisfaction when using the world's leading web search engines.

Siets users are free to use plain text words or phrases for relevant information search in data, being the most intuitive query terms based on everyone's language knowledge.

Relevant ranking for natural language content

Siets Server provides a unique policy tool for ranking the customer database search index through a trainable system of ranking weights, applied to the data fields of the custom XML data model and to the natural language text content that those fields contain.

Most web and mobile users can instantly query and analyze large data volumes in back-end application service systems run by Siets Server, using just a web browser and free text natural language query terms: a few words or a known context phrase. If necessary, users can expand the NLP-only query with a few additional and very easy to learn syntax options described above.

There is no need to learn SQL or similar complex querying languages in order to retrieve information from vast data volumes and rearrange it into grouped, sorted and ordered way, even up to the precise positioning of individual search result entries.

This ranking engine feature of Siets makes it possible to replace more complex SQL queries for combined full-text and structured search, which in SQL syntax would typically look like this:

SELECT ... LIKE ... GROUP BY ... ORDER BY ... JOIN

with NLP terms and a Siets Server ranking policy (a set of relative weightings for the XML data model). The policy instructs Siets Server to index all customer XML data in an application-specific order of desired relevance, so that complex information sorting, grouping and ordering (relevance of search results) is performed automatically when users search with free text terms in natural language.

Essentially the index ranking policy enables Siets Server customers to "merge" contextual search and sorted structured data search into one single "most relevant result set" from a user point of view, when "information relevance" is being estimated in relation to the application data model and specific business need.

Using the policy tool for index ranking, Siets Server customers can flexibly govern search behavior, aligning it with the desired free text search relevance when data is queried and analyzed in plain natural language terms. Siets Server automatically enforces ranking policy changes during data modification and makes content sorting rules uniformly available to all applications, without the need to change software code in every application that uses the Siets platform.

This relevance ranking engine feature, in combination with natural language processing (NLP) at search time, is probably one of the most powerful Siets Server capabilities. It enables Siets customers to build massively scalable distributed data stores with fast and relevant search in any type of customer database with text-rich language content.

Please see more details about all NLP syntax search options, indexing and relevance "policy" methods in Developer documentation.

Read more about SIETS API specification in Developer Guide: Developer Guide / Siets API

Search-driven analytics and reporting at blazing-fast speed

Indexing XML content for razor sharp field-level results

An outstanding functional feature of Siets is that for any XML-based data Siets Server additionally builds meta-data for each XML field and creates a "virtual search index", which can be queried separately using simple XML syntax in a query, e.g.:

attorney office <city>"new york"</city>

will return only documents containing the words 'attorney' and 'office' somewhere in the text, and whose XML field <city> contains the exact two-word phrase "new york".

This technique of searching data by exact known or expected phrases is a very powerful and intuitive method by which users without special knowledge of SQL or XML coding can perform basic analytical queries on any Siets Server database.

Automatic discovery of dates and numeric values for column-type indexing

Another analytical feature of Siets Server is that it can automatically discover and index all numeric and date values within document text parts, allowing full text searches to be combined with numeric range searches using classic column-type data or numeric indexes.

Siets is using simple syntax for that:

value1..value2

Numeric or dates interval queries for fast analytics and reporting

Siets Server automatically recognizes use of double dots '..' in API queries and invokes columnar type of data or numeric index for specific range analytics or reporting filter instead of using full text index.

Policy based indexing of date and numeric values in XML fields

Siets Server also supports policy-based indexing of all numeric and date values within only specified XML document text parts, allowing full text searches to be combined with numeric range searches in those classic column-type data or numeric indexes.

For instance, a query such as:

attorney office <year>1980..2000</year>

will return only documents whose XML field <year> contains values from 1980 to 2000, effectively narrowing search results to the required numeric interval.

Near real-time search-driven analytics and reporting

Siets Server makes it possible to combine any full text, XML-based fielded or numeric search options to fully exploit Siets top speed performance and capacity.

Siets Server performs nearly instant retrieval of relevant ad hoc query data, filtering it from millions of documents per single server, when using columnar numeric or dates indexes. Since columnar indexes are cached and stored in RAM in Siets Server, software performance at those tasks is stellar.

Example query:

attorney office 1980..2000

will return any documents where the terms 'attorney office' occur together with any of the numbers 1980, 1981, ..., 1999, 2000 anywhere in the document.

If a free text search query is used in combination with the numeric range option '..', the analytical result set can additionally be sorted by Siets Server on-the-fly in the descending or ascending sort order requested in the API call, just as SQL databases do.
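The effect of the '..' range term can be reproduced on a handful of strings. The sketch below assumes integer values and inclusive bounds, so that 1980..2000 covers 1980 through 2000. `parse_range` and `matches_range` are hypothetical helpers, and the server itself answers such queries from its numeric indexes rather than by scanning text.

```python
import re
from typing import Optional, Tuple


def parse_range(term: str) -> Optional[Tuple[int, int]]:
    """Split 'value1..value2' into (low, high), or return None if not a range."""
    match = re.fullmatch(r"(-?\d+)\.\.(-?\d+)", term)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def matches_range(text: str, term: str) -> bool:
    """True if any number found in the text lies within the inclusive range."""
    bounds = parse_range(term)
    if bounds is None:
        return False
    low, high = bounds
    return any(low <= int(n) <= high for n in re.findall(r"\d+", text))


print(parse_range("1980..2000"))                                   # (1980, 2000)
print(matches_range("Attorney office founded 1995", "1980..2000"))  # True
print(matches_range("Attorney office opened 2002", "1980..2000"))   # False
```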

In contrast to SQL databases, where analytics usually require reading, combining and sorting of tons of data every time an SQL query is executed, Siets Server does not need SQL query optimizers or other complex techniques to speed up analytical querying operations.

Context narrowing by text search terms efficiently slices Siets Server's internal workload down to analytical processing of only the very small resulting data subset that needs to be sorted.

Siets Server performance for search-query driven analytics and reporting functionality could deliver near real-time, sub-second response times in most common use cases, even if used in very large distributed cluster databases with billions of records.

Read more about SIETS search options in Developer Guide: Developer Guide / Search

Classifying data for XML drill-down and pivoting by facets

Application developers can invoke Siets Server data classifying (grouping) functionality on any XML tag values. It counts the number of search result occurrences per facet in a query and returns all facets found.

Developers can use this type of facets-generating analytics to build powerful advanced search applications with categorized by facets result sets to narrow (drill-down), expand or combine unstructured data and XML-field based search criteria as needed for a particular business logic or process.

This XML-drill down feature is enabled in the document policy file by marking an XML data field to be indexed as an analytical facet index:

index="classify"

Siets Server will match all categories to the actual query results and count the totals of matching results per category into the predefined meta tag <menu>, returned along with the set of search results to the user application.

This simplicity of data faceting can be used by a customer application to build easy to use, multi-level navigation trees of faceted links for drill-down or expand-up browsing of search result subsets per menu category, without asking end-users to type in those filters manually.

This type of analytical content classification feature among found facets can substantially improve customer satisfaction with web site search functionality. It also offers plenty of convenient navigation programming choices to Siets Server software developer.

In particular, fast-responding e-commerce shop entries for catalogs and sub-catalogs of goods sold can be built dynamically, depending on what the user searches for and what data must be presented in the facet-matching result set.

For example, if the end user issues a query and receives 1000 hits containing a full text term 'car', with just first 100 results shown per page, and car brand categories have been indexed as index="classify" items, then Siets Server API will return also in resulting XML all matching <menu> value items with numeric counters how many documents could be found per each category in the total results set per entire database, e.g.,

Ford (154)
Mercedes (12)
Toyota (40)

This type of query-driven faceted analytical information, returned by Siets API to client application software, can then be used by the user application to build web navigation links that instantly expand or narrow search queries beyond the current subset of results visible on the limited screen.

In our example above, using menu items in clickable web links as the next search query filters for Siets Server will instantly narrow down or expand end-user choices of car brand information to the respective manufacturers only, without the need to browse through all 1000 results.

These navigation-by-results driven categories can be similarly applied to other classified XML field navigation values, e.g., a fine-tuned navigation can be built by narrowing end user choice by clicking on car model, color, car engine type, type of fuel etc.
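The per-category counters from the car example can be pictured with a small sketch that counts the values of one field across a result set. The hit dictionaries and the `facet_counts` helper are made up for illustration. The layout of the actual <menu> element returned by Siets Server is not reproduced here.

```python
from collections import Counter
from typing import Dict, Iterable


def facet_counts(hits: Iterable[Dict[str, str]], field: str) -> Counter:
    """Count how many hits carry each distinct value of the given field."""
    return Counter(hit[field] for hit in hits if field in hit)


# Hypothetical search hits reduced to their classified fields.
hits = [
    {"brand": "Ford", "color": "red"},
    {"brand": "Toyota", "color": "blue"},
    {"brand": "Ford", "color": "blue"},
    {"brand": "Mercedes"},
]

for brand, count in sorted(facet_counts(hits, "brand").items()):
    print(f"{brand} ({count})")
```

Each printed line could back one clickable navigation link that re-runs the search with that brand as an extra filter.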

Please note, that Siets Server supports building and using hierarchy of multi-level XML-drill down. If top category XML data field contains subfields in XML with additional values, above indexing option 'index="classify"' would return also hierarchy of XML tag with all subfields and actual search hits matching also in subfields.

In this way Siets Server engine can help to organize and build a query specific top-to-bottom XML-based navigation trees called XML-drill down which contain only those categories and subcategories where some matching results are actually present.

For end-users it is extremely powerful and convenient option which does not require entering more specific search keywords: users can just click on the itemized categories returned from Siets Server within <menu> tag values to launch the next relevant search.

End-user will also be informed about expected number of results, when looking at faceted navigation links, telling him if the search query terms might be improved if too many results seem to be generated. This could help avoiding unnecessarily broad term transactions, saving end user time and reducing workload on back-office server resources.

It also helps to avoid information overload perception by users if a listing of the complete catalog with hundreds or even thousands of all categories is required for review by user, when the end user is interested just into the small subset from all catalog categories.

To summarize, above described analytical data classification feature by simple XML-drill down option and on-the-fly generated facets within Siets Server is a remarkably useful mechanism how to improve user experience for modern web applications.

I am using this faceted search feature extensively in all my own search-driven web projects.

Read more about faceted search here: Siets XML drill-down

General platform architecture features

  • Distributed database and search engine architecture (clustering)
  • High performance full text search engine
  • Native XML data storage - can store any text or arbitrary XML-formatted data
  • Client-server application development architecture - works as a server with open API
  • Collection of data from Web sites (HTTP), FTP servers and file servers with built-in crawler
  • Easy to use web administration for centralized management of all servers and search indexes
  • Portable API based on XML message exchange via HTTP/HTTPS using GET/POST requests

Full text search engine features

  • Free text word and phrase search using natural language terms;
  • Free text word and phrase search in selected XML fields;
  • Free text search excluding frequently occurring stop-words and articles;
  • Boolean AND, OR, NOT conditionals for advanced analytical search;
  • Combining free text search with free text search in selected XML fields;
  • Synonym support in search queries using customizable vocabulary;
  • Stemming support for languages with inflections;
  • Multi-level Boolean expressions using word and phrase search, by using brackets
  • User friendly text wildcards * and letter wildcards ? in queries;
  • Option to configure wildcard expansion coverage for performance needs
  • Word stemming support for multi-lingual data
  • Option to configure stemming expansion coverage for performance needs
  • Proximity search of terms within N words of another word
  • Case support for search for proper names
  • Stop word detection and exclusion
  • Use of any special symbols in search terms if not specified as word separators
  • Results grouping for domain (zone search)
  • Numeric range search between any integer, date or float values for full text queries
  • Geospatial search by 3D coordinates in XML fields x,y,z;
  • Results sorting by distance for GPS-based location search applications
  • Ordering of results for numeric range search in ascending, descending, rate or relevance order
  • Search within specific document areas using XML markup for META data indexing
  • Field searching in XML structured and indexed data
  • XML-drill down feature, allowing to narrow/expand search results for any categorized navigation
  • Multi-level nested boolean combination of all of the above in a single query;
  • Return of multi-level values and counters for on-the-fly result sets using XML-drill down
  • Misspelling detection and correction using alternative words from vocabulary
  • Option to configure fuzziness level and expansion coverage of alternatives
  • Results ordering by pre-assigned document rating or date
  • Results ordering by text relevance with flexible relevance definition schemas for indexing
  • Search within interval of pre-assigned document ratings or dates
  • Search only within document title, content or other parts defined by relevance ranking
  • Text snippet composition with search query matching context from documents
  • Search term highlighting in text snippets
  • Search term highlighting in cached original Web documents
  • Option to configure highlighting with specific start and end tags for better display
  • XML formatting of query and search results
  • Results customizable by XSLT for output to Web, mobile or other media
  • Support for more than 160 language encodings in ISO character sets and UTF-8
  • Results can be returned in any language specific encoding including UTF-8
  • Similar document search by content (related content search)
  • Define maximum number of documents per result page in every search query
  • Define starting document number for multi-page results sets in search query
  • Calculate and return expected total number of hits in every page of multi-page result sets
  • Return total query search time spent by the engine
  • Restrict maximum number of documents in any search result for performance needs
  • Extensively documented API provides excellent search results customization options
  • Search API commands can be performed from Web based administration tool
  • Automatic logging of 100% of search queries and Siets Server API requests
  • Support for alert events triggered by content updates using full text filtering expressions
  • Support for definition, modification and removal of alert filters from API
  • Support for re-checking all content against specified alert filters
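
The highlighting options in the list above (search term highlighting in snippets, with configurable start and end tags) can be illustrated with a small client-side sketch. This is not Siets Server code: the function name `highlight_terms` and its parameters are hypothetical, and the engine performs its own highlighting on the server according to its configuration.

```python
import re


def highlight_terms(snippet, terms, start_tag="<b>", end_tag="</b>"):
    """Wrap whole-word, case-insensitive matches of the terms in the given tags."""
    if not terms:
        return snippet
    # Longest terms first so that longer alternatives win over shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: start_tag + m.group(1) + end_tag, snippet)


text = "Little John went to the nearest underground station in London."
print(highlight_terms(text, ["little", "john", "london"], "<em>", "</em>"))
```

Passing the tags as parameters mirrors the idea of choosing start and end tags that suit the display medium, for example HTML tags for a web page or plain brackets for a text console.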

Crawler and indexing software features

  • Crawler for Web site robot-type spider data collection and indexing
  • Fast URL spidering using specified number of parallel crawling processes
  • Support for most common document types: HTML, Word, Excel, PowerPoint, PDF, Macromedia Flash, TXT, RTF and XML
  • Plug in option for crawling other 200+ file formats using third party file conversion software such as Stellent filters
  • Option to store original data file as part of a Siets Server document for later retrieval using an application viewer (like Word for DOC files)
  • Multilingual content support: data are stored natively in UTF-8 encoding
  • Automatically converts documents from more than 160 language encodings to UTF-8 data
  • Crawling of Web servers, FTP servers, local and mounted file systems and file servers
  • Optional: Global Crawler application for automatic spidering of the Web and discovery of content by following Web links, and for building large scale distributed indexes on a server farm.
  • Multiple independently scheduled crawler tasks (supports regular, exact timing and manually started tasks).
  • Advanced crawled URL and domain options (can specify file types, maximum page number and download time limits, URL and file type exclusions etc.)
  • Use of regular expressions for URL and file notations for the crawler tasks
  • Able to follow 'robots.txt' rules if required
  • Supports cookies for Web sites using them
  • Removal of duplicate Web pages based on content comparison
  • Crawling of password protected areas and SSL support
  • View and delete URLs from the index
  • Detailed crawler activities logging and error reporting
  • Separate crawler logs for downloaded files, detected duplicate files and errors
  • Option to perform crawler tasks with full or incremental index updates
  • Well documented indexing API gives flexibility to index any content from virtually any existing application using customized indexing data compilation rules performed by user application
  • Supports multiple search indexes on the same server computer
  • Supports true real-time updates and index modifications while searching
  • Supports real-time removal of documents from the index while searching
  • Does not limit the maximum size of an index (index data is stored in 50 MB container files). One storage can span hundreds of gigabytes on a single computer.
  • Scales to terabytes if data is distributed among cluster of N computers.
  • Flexible relevance definitions assigning different relevance levels for specific parts of a document text for later search results sorting according to the custom relevance
  • Flexible configuration of special symbols for separation of words in index
  • Meta data support using XML markup syntax for creating special tags
  • Fielded indexes using meta tags can be indexed as long strings (up to 256 symbols)
  • Can automatically index XML structured data for field searching
  • Supports hidden document content. Can store document specific hidden content (special tags) for indexing, which is excluded from search result text snippets and is not part of the original document content
  • Supports exclusion of text from the search index. Can store document specific info content (info tags) for later result formatting or other needs, which is not indexed and is not part of the original document content
  • Supports binary data storage as part of a Siets document. Can store full binary encoded files as part of the Siets document into the storage and return it as saved and unmodified original data cache content
  • Fast retrieval of any indexed Siets document using known unique document identifier (no full text search is needed)
  • Document identifier can be any URL, file name, database primary key, custom sequence number or other unique string value.
  • Can specify custom indexing cache size and memory usage limits for each document collection for performance tuning needs
  • Option to upload massive data portions as batch files for server's background indexing using Siets API commands
  • Index update and delete commands can be performed directly from the Web based administration tool by less technical people
  • Real time status controls of indexing using Web management or through Siets API commands
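
The crawler list above mentions removal of duplicate Web pages based on content comparison. The sketch below shows one common way such a comparison can be done: fingerprinting normalized page text. How the Siets crawler actually compares content is not described here, so the hashing approach, the normalization rules and the function names are all assumptions made for illustration.

```python
import hashlib


def content_fingerprint(text):
    """Return a SHA-256 digest of the text with whitespace collapsed and case folded."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the first URL seen for each distinct content fingerprint.

    `pages` is an iterable of (url, text) pairs; returns the list of kept URLs.
    """
    seen = set()
    kept = []
    for url, text in pages:
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(url)
    return kept
```

Comparing fixed-size digests instead of full page bodies keeps the duplicate check cheap even when many pages have been crawled.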

Administration and management features

  • Web-based administration using standard browsers
  • Centralized management of all Siets servers across the corporate network
  • Management of multiple document collections per server
  • Management of clustered configuration of a data storage on multiple servers
  • Remote status control, startup and shutdown of servers
  • Management of regular or time specific scheduled crawler tasks
  • User administration with multiple administrator accounts with different access rights to Siets server storages
  • Error handling using XML messaging
  • Separate configuration options for every document storage
  • Configuration files stored in easy to edit XML format
  • Each storage configuration and data is separated in its own disk directory
  • Web and Console command line tools for performing individual SIETS API commands
  • Web server module included for simple connection with user applications
  • Chronological viewing and filtering of engine's search and indexing log files
  • SNMP agent for Siets Server status checks using network management systems

Security, access control and logging features

  • User authorization with user name and password
  • Restrict user access for specific storages within corporate network
  • Scheduled crawler tasks with specified user authorization data
  • Each user access can be limited in every storage for only specific Siets API commands
  • Option to encrypt traffic between client (application) and server using SSL
  • Engine based denial-of-service attack filter for sustaining heavy query workloads - new
  • All search queries and indexing transaction results and errors are logged for auditing
  • Use of unique identifiers and timestamps for debugging and tracking of transactions
  • Dump of full indexing XML messages can be activated and used for incoming data backup copying or for better problems diagnostics
  • Automatic rotation of log files to prevent data loss because of a too large log file size
  • All log files are organized by dates to ease backup, debugging and administration tasks
  • Built-in automatic data integrity controls to prevent data loss in case of an unexpected server shutdown

Data store content management features

Siets Server developers can operate the platform as a distributed XML data store to create, retrieve or update XML documents.

Among its functionality is:

  • Create, Retrieve, Update, Delete per document or in bulk uploads;
  • Up to 18000 search transactions per minute (up to 300 per second) using memory cached data and up to 1800 search transactions per minute with disk access
  • Ingenious use of computer RAM memory for search index to minimize disk usage
  • Distributed search in parts of the same database (shards) stored on different cluster nodes
  • Flexible workload sharing among cluster servers operating as shards and replicas
  • Scales for low-latency sub-second search in billions of documents (in cluster mode)
  • Scales to petabytes in a large cluster/grid using a fleet of PC servers;
  • Real-time and consistent updates for data store and full text index;
  • Flexible indexing policy using scalable ranking for linguistic text data;
  • Alerting triggers on unstructured content changes using custom filter queries.
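
The throughput figures in the list above are stated per minute and per second. The short sketch below just does the unit conversion for the two quoted rates; the helper names are hypothetical and no new measurements are implied.

```python
def per_second(transactions_per_minute):
    """Convert a per-minute transaction rate to a per-second rate."""
    return transactions_per_minute / 60


def mean_interval_ms(transactions_per_minute):
    """Average time between completed transactions, in milliseconds."""
    return 60_000 / transactions_per_minute


for label, tpm in (("memory cached", 18000), ("with disk access", 1800)):
    rate = per_second(tpm)
    interval = mean_interval_ms(tpm)
    print(f"{label}: {rate:.0f} per second, one transaction every {interval:.2f} ms")
```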

Multi-language support features

  • Siets Server search engine software handles more than 160 languages in the same XML document store and its search index.
  • Users can store and process data in multiple languages within the same XML document, avoiding tons of localization efforts in their application software code.
  • Fast server-side XML data conversion between national ISO character sets and UTF-8 data store if requested by client API.
  • All data always stored in standard UTF-8 data store on Siets Server that can be queried using either UTF-8 or a national ISO character set.

Online developer documentation resources

    • Installation Guide
    • Administration and Configuration Guide
    • Developers Guide
    • Web site and data search tutorials
    • Frequently asked questions, with problem solving tips and technical support recommendations
    • Code samples with Siets API client source code examples for C, PHP, Java, .NET, Perl, Delphi

    Download and installation packages

    • Installation package suitable for any Linux distribution, tested on most popular Linux distributions: RedHat, SuSE, Slackware, Debian, Mandrake.
    • Optional: custom installation service for other Linux distributions available
    • Out-of-the-box CD installation with OS (ISO image), i.e., Siets search appliance software package.
    • Free demo download of fully functional evaluation software from Internet
    • Optional: turn-key solution: installation of Siets Search Appliance on a customer hardware - customer's choice of Linux distribution, Siets Server software and our technical support, including remote installation services, configuration and problem solving over Internet
    • Optional: Licensing of the source code of the engine. It was developed in C and C++ for portability across different operating systems and to achieve maximum speeds on low-end hardware
    • Optional upgrade from the Linux platform to other operating systems, such as FreeBSD, Windows NT and Windows 2003. OS migration is simple: it does not require changing any Siets Server configuration or data files. All Siets document storage data, index and log files can simply be copied onto the server running the new operating system.

    System requirements and basic hardware needs

    • Minimum: 512MB RAM
    • Recommended for smaller data sets: 2GHz CPU, 1-2GB RAM
    • Recommended for large datasets (>50GB): 4GB RAM, SCSI RAID
    • Recommended for very large datasets (>500GB): distributed cluster setup running on multiple servers in Siets Server cluster configuration.
    • Nodes of less expensive low-end server equipment can be used for better cost controls.

    Search and indexing performance benchmarks

    Siets Server is one of the fastest full text search engines on the market, with query response times below 0.005 seconds when data is served from RAM and below 0.05 seconds when disk access is needed.

    Test conditions for a single server

    All tests are performed on a single server where not stated otherwise. For performance benchmarks of Siets software used in a cluster configuration running a distributed test database on several servers please see Search Performance in a Cluster Configuration. Test results should be comparable to those of Siets Server installed on equivalent hardware:

    • 2.8GHz P4 CPU
    • 1GB RAM
    • 2-disk S-ATA RAID-0
    • OS: Slackware 10.0 with Linux kernel 2.6

    In all tests the number of search transactions per minute is measured on the server side, that is, excluding transport time of the result set over the network.

    Search performance on collections of different size

    [Figure: Search performance on collections of different size]

    As the results show and practice confirms, when a data collection is small enough to be cached entirely in primary memory, search performance is above 15000 transactions per minute. When the collection is larger and at least one disk seek is required, search performance is about 1600 to 1700 transactions per minute.

    Distribution of performance across the number of search terms per query

    [Figure: Performance benchmarks for number of search terms per query]

    Data is presented for 2000 (memory cached) and 25000 document collections respectively.

    Both diagrams illustrate how Siets Server can speed up search queries when running in a so-called 'main-memory' database configuration. Contrary to what common logic suggests, queries with more search terms can run faster. Many business applications can benefit from this Siets advantage if truly high speed query support is needed for time sensitive business or industrial applications.

    Test of Siets Server on different hardware

    Note that the second custom system is quite limited and below the minimal requirements of Siets Server, so we discourage use of such hardware. However, even in this environment Siets performance is outstanding.

    Please also note that, unlike the Siets Appliance, the custom environments in this test run Linux kernel version 2.4.

    Indexing performance on collections of different size

    Test results of Siets Search Appliance indexing speed on document collections of different number of documents and different sizes (total loading and indexing time in hours:minutes). Indexing performance test methodology:

    • all test documents were of different content.
    • average size of a document was 10 Kilobytes.
    • documents contained text with randomly selected words.
    • words were distributed according to their natural language frequency in normal text.
    • there were no special performance tuning actions for better test results done at the Linux operating system level. All system configuration parameters during tests were left as default.

    Note that indexing performance can be significantly improved (2-4 times) if testing is done on an enterprise level server with dual processors and a SCSI RAID multi-disk array. Also note that very large data sets can be split among multiple Siets Server hardware nodes, and the reduction of total indexing time is proportional to the number of cluster nodes.
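
    The scaling claim above can be written as simple arithmetic. The sketch below assumes ideal linear scaling across nodes and treats the quoted 2-4 times hardware gain as a multiplier; the function name and the 12-hour baseline in the usage lines are hypothetical and are not taken from the benchmark results.

```python
def estimated_indexing_hours(baseline_hours, nodes=1, hardware_speedup=1.0):
    """Estimate indexing time assuming ideal linear scaling across cluster nodes."""
    if nodes < 1 or hardware_speedup <= 0:
        raise ValueError("nodes must be >= 1 and hardware_speedup must be positive")
    return baseline_hours / (nodes * hardware_speedup)


# Hypothetical baseline: a data set that takes 12 hours on the single test server.
print(estimated_indexing_hours(12, nodes=4))
print(estimated_indexing_hours(12, nodes=4, hardware_speedup=2.0))
```

    Real clusters rarely scale perfectly, so figures produced this way are best read as a lower bound on indexing time.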

    Test conditions for a cluster configuration

    In this section performance tests were performed on a single server and compared to a 3-server cluster configuration.

    Test equipment for a single server (used for the reference data):

    • Double processor 1.8GHz Xeon CPU
    • 2GB RAM
    • 2x18 HDD SCSI RAID

    Test equipment for a 3-server cluster node:

    • Pentium 4 1.8GHz CPU
    • 1GB RAM
    • 80 HDD IDE

    In all tests response times for search transactions in seconds are measured on server side, that is, excluding transport time of the result set over network.

    A database of 2.1 million different full text newspaper articles in three languages was used as the test content, in an environment as close as possible to real life applications.

    Search performance in a cluster configuration

    One of the most competitive advantages of the Siets search engine is that search speeds are almost the same for relevance-based and rate-based searches. This enables a greater variety of end-user applications where search results should be sorted differently depending on the application's business logic.

    Search performance in a cluster configuration

    [Figure: Response time, searching in cluster by relevance]

    As the test results show and practice confirms, cluster configurations can be highly scalable (even to hundreds or thousands of nodes) and maintain almost the same very high search speed, which is largely independent of the total size of data.

    Search response time, searching by rate

    [Figure: Response time, searching in cluster by rate]

    Using Siets in a cluster configuration benefits both methods of search and maintains practically the same performance levels. Most other search engine products on the market cannot meet these two performance goals simultaneously.

    This capability of the Siets system has been achieved through optimization of the inverted index structure and smart algorithms that use PC memory effectively to cache disk data.

    In-memory database to supercharge web services

    Siets Server uses advanced optimizations that keep large portions of the index in main memory.

    If RAM is large enough, Siets Server can operate as an in-memory database, caching almost all data in RAM.

    Siets Server operating as an in-memory database can serve more than 250 queries per second on a single hardware server.

    Multi-threading, supporting multiple CPU and CPU cores

    All Siets Server control, indexing and query tasks are effectively separated in multi-threaded architecture.

    All transactions are threaded as internal operating system processes. In this way multiple searches can be performed at the same time running as separate processes.

    All indexing and update tasks are performed in background with lower priority than search queries avoiding slowdown of search queries.

    Siets Server software does not require special optimization for multi-processor hardware, because the engine's generic multi-threading already supports it.

    Simplicity of API using XML over standard http/https

    Siets API is based on the simple XML 1.0 standard and on exchanging XML messages over the common http protocol. This makes Siets Server programming open to any application and programming language. Development for Siets Server is even simpler than for Web services: Siets API does not use any document type definition schemes of the kind .NET requires.

    Your corporate knowledge and skills can be well used to develop new applications in your favorite in house programming language. Your legacy applications can be improved by adding search functionality matching world class Internet search engines in speed and quality of results.

    No proprietary client software needed - just follow Siets API documentation and use your existing tools.
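
    To make the "XML messages over http" idea concrete, here is a minimal sketch that builds an XML message and wraps it in an HTTP POST using only the Python standard library. The element names (`request`, `command`, `query`, `docs`, `offset`), the URL and the helper names are all hypothetical placeholders; the actual message format is defined in the Siets API documentation, not here.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_search_message(query, docs_per_page=10, offset=0):
    """Serialize a search request as UTF-8 XML bytes (hypothetical element names)."""
    root = ET.Element("request")
    ET.SubElement(root, "command").text = "search"
    ET.SubElement(root, "query").text = query
    ET.SubElement(root, "docs").text = str(docs_per_page)
    ET.SubElement(root, "offset").text = str(offset)
    return ET.tostring(root, encoding="utf-8")


def prepare_http_request(url, payload):
    """Wrap the XML payload in an HTTP POST request object; nothing is sent here."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


payload = build_search_message("a little john in london")
# Placeholder address, not a real Siets endpoint.
request = prepare_http_request("http://siets.example.internal/api", payload)
# Against a real server one would send it with urllib.request.urlopen(request).
print(payload.decode("utf-8"))
```

    Nothing beyond an XML serializer and an HTTP client is involved, which is the point the section makes about not needing proprietary client software.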

    Solid security and easy data protection at application level

    Siets Server API protocol is network and firewall friendly. It uses standard Internet 'http' protocol for messaging between Siets Server and client application.

    The message stream over 'http' can be verified and protected against malware through application level firewalls and malware filtering proxies.

    Geospatial search queries by distance sorting of results

    Siets engine's built-in numeric search functionality supports additional functionality often used by many e-commerce and mapping applications: sorting of results by geospatial coordinates.

    This feature can greatly speed up and at the same time also greatly simplify development of many applications for the location search.

    For example, GPS coordinates can be fed to Siets Server as longitude and latitude, and Siets Server will return all matching results sorted by shortest distance from the chosen reference point.

    For example, in Siets Server one can nearly instantly query for all coffee shops in Nevada within 1, 3, 5 or 25 miles, sorted by closest distance from your GPS navigation enabled car. Siets Server returns results in less than 0.2 seconds from a database containing millions of records running on standard PC server hardware.
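
    The radius-and-sort behaviour described above can be sketched in a few lines. This is an illustration of the concept only: the haversine great-circle formula, the Earth radius constant, the function names and the sample coordinates are assumptions, and the text does not say how the engine itself computes distance.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_within(places, center, radius_miles):
    """Return (distance, name) pairs inside the radius, closest first.

    `places` is an iterable of (name, lat, lon); `center` is (lat, lon).
    """
    hits = []
    for name, lat, lon in places:
        distance = haversine_miles(center[0], center[1], lat, lon)
        if distance <= radius_miles:
            hits.append((distance, name))
    return sorted(hits)


# Made-up sample records; coordinates are illustrative only.
shops = [("Shop A", 36.12, -115.17), ("Shop B", 36.17, -115.14), ("Shop C", 39.53, -119.81)]
for distance, name in nearest_within(shops, center=(36.17, -115.14), radius_miles=5):
    print(f"{name}: {distance:.1f} miles")
```

    Doing this filtering and sorting inside the search engine, as the section describes, spares the application from fetching every candidate record and computing distances itself.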

    OLTP reliability with real-time full-text index updates

    Many legacy search tools do not support real-time full text index updates needed for many business applications. People have to do complex integration between their database, typically some SQL, and an external search tool, programming for index consistency checks.

    Siets Server supports real-time index updates for all types of indexes: full text inverted indexes, columnar type data and numeric indexes, and meta-structure XML indexes. This makes it possible to use Siets Server as a searchable OLTP database for XML document type data.

    One can add, modify or delete any XML data document in a Siets Server database in real time and from any location over the Internet using TCP/IP networking and the Siets API protocol. All per document updates are available for full text search immediately after the data store modification command.

    For large volume batch data updates Siets Server performs automatic background indexing and informs the client application about index status through the API command 'status', so that one can always verify whether bulk loading of data and its indexing have finished properly.
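
    A client that submits a bulk load typically polls the status until background indexing is done. The sketch below shows such a loop in a transport-agnostic way: `fetch_status` stands for whatever code issues the 'status' command, and `is_done` decides when the returned value means indexing has finished. Both callables, the function name and the status strings in the usage lines are hypothetical, since the format of the 'status' response is not described here.

```python
import time


def wait_until_indexed(fetch_status, is_done, poll_seconds=1.0,
                       timeout_seconds=60.0, sleep=time.sleep):
    """Poll fetch_status() until is_done(status) is true or the timeout elapses.

    Returns the last status seen; raises TimeoutError if indexing did not finish.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("background indexing did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status source with made-up status strings.
statuses = iter(["indexing", "indexing", "ready"])
final = wait_until_indexed(lambda: next(statuses), lambda s: s == "ready",
                           sleep=lambda seconds: None)
print(final)
```

    Injecting the `sleep` function keeps the loop easy to test without real delays.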

    Automatic index consistency checks and recovery

    Siets Server checks index reliability and automatically restores index integrity if unexpected shutdowns or equipment failures occur.

    Internally the data store and search engine software maintains tables of checksums for all major control data structures.

    The engine also keeps a backup log of all updates.

    Finally, system administrators can use the 'recover' command as the last option if all other methods fail to ensure index integrity, e.g., after sudden hardware failures or unplanned loss of power without time for a proper shutdown.

    This makes it possible to complete most of the last updates correctly after unscheduled hardware crashes: upon restart, Siets Server immediately tries to fix any detected index inconsistencies.
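
    The idea of keeping a table of checksums for control data structures can be sketched as follows. This is a generic illustration, not the engine's implementation: CRC-32 is chosen here only because it is available in the Python standard library, and the structure names and function names are hypothetical.

```python
import zlib


def build_checksum_table(structures):
    """Map each structure name to the CRC-32 of its serialized bytes."""
    return {name: zlib.crc32(data) for name, data in structures.items()}


def find_corrupted(structures, checksum_table):
    """Return the sorted names whose current CRC-32 differs from the recorded one."""
    return sorted(
        name
        for name, data in structures.items()
        if checksum_table.get(name) != zlib.crc32(data)
    )


# Hypothetical control structures, serialized as bytes.
structures = {"term_dictionary": b"alpha,beta", "doc_table": b"1,2,3"}
table = build_checksum_table(structures)
structures["doc_table"] = b"1,2,X"  # simulate damage after an unclean shutdown
print(find_corrupted(structures, table))
```

    A check like this only detects damage; repairing it requires a second source of truth, which is the role the section assigns to the backup log of updates.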

    Denial-of-service (DoS) protection filter on engine

    For mission critical business applications, SIETS Server transaction control daemons are protected by a filter that stops denial-of-service attacks.

    Integrated solutions usually process incoming queries in a totally unprotected way and are bound to run into overload problems in an uncontrolled Internet environment.

    Siets Server, being an all-in-one platform for data storage, search and reliable processing, was designed to be protected against this common Internet risk.

    The Siets Server software engine can be configured to run only a specified number of parallel queries. When the total volume of transactions grows too large in a very short time period, all other search transactions are put in a queue until workload levels even out.

    This feature is indispensable for many Internet search applications where a sudden peak of activity can overload your servers in a few minutes and make them unavailable or even crash them.
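
    The queueing behaviour described above (a fixed number of parallel queries, with the rest waiting their turn) can be illustrated with a bounded worker pool. This sketch uses Python's standard thread pool as a stand-in; the function names are hypothetical and the engine's own scheduling is not documented here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(queries, handler, max_parallel):
    """Run handler(query) for every query, never more than max_parallel at a time.

    Extra queries wait in the executor's internal queue. Returns (results, peak),
    where peak is the highest number of handlers observed running at once.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def guarded(query):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return handler(query)
        finally:
            with lock:
                running -= 1

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(guarded, queries))
    return results, peak


def fake_search(query):
    time.sleep(0.01)  # stand-in for real query work
    return query.upper()


results, peak = run_with_limit(["a", "b", "c", "d", "e", "f"], fake_search, max_parallel=2)
print(results, "peak parallel queries:", peak)
```

    Capping concurrency trades a little latency during spikes for predictable resource use, which is the protection the section describes.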

    Alerting on content updates in real-time

    Siets Server API supports implementation of context sensitive triggers, which can be activated upon incoming or recent data updates by different user monitoring applications, by scheduled software agents and by reporting tools doing periodical checks on data.

    Application developers can easily create triggers for filtering Siets database content by specific keywords, phrases or Boolean expressions matching standard query syntax of Siets API.

    Each new trigger created will have its own unique ID code in Siets database. Monitoring and software agent applications can periodically examine any XML documents against select or all established filters for specific Siets storage.

    All context matches found are returned as filter IDs to the user monitoring application.

    In case of filter matching events Siets Server engine can execute the predefined script for event logging or messaging on the server side.

    Using Siets alerting feature a new set of applications can be developed such as subscription services for software agents, which periodically check recent document updates and send notification messages if the document updates contain words or phrases which match filters.

    Developers can add, modify and delete triggers and examine any document against established triggers in the new API set. This gives them great flexibility to activate monitoring functionality as frequently as necessary, or to check for context triggers only upon new or recent updates.

    The advantage of processing context filters on the Siets Server engine is dramatically better performance. Siets context filters are processed in real time using the engine's generic full text index data, which yields roughly a 10 to 100 times performance increase compared to context filtering done on an application server that queries a database with a separate SQL transaction for every content filter check.

    This makes it possible to examine any document against tens of thousands of filters in sub-second time.
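
    One way to see why indexed filters can be checked much faster than one-by-one queries is to index the filters themselves. The sketch below handles only simple "all of these words" filters; the class name, its methods and the anchoring trick are assumptions for illustration and do not describe how the Siets engine evaluates filters.

```python
import re
from collections import defaultdict


class FilterIndex:
    """Index keyword filters so one document can be checked against many filters."""

    def __init__(self):
        self._terms = {}                    # filter_id -> frozenset of required words
        self._by_anchor = defaultdict(set)  # one required word -> filter ids

    def add(self, filter_id, words):
        terms = frozenset(word.lower() for word in words)
        if not terms:
            raise ValueError("a filter needs at least one word")
        self._terms[filter_id] = terms
        # Any required word can serve as the anchor; min() keeps it deterministic.
        self._by_anchor[min(terms)].add(filter_id)

    def remove(self, filter_id):
        terms = self._terms.pop(filter_id)
        self._by_anchor[min(terms)].discard(filter_id)

    def match(self, text):
        """Return sorted ids of filters whose words all occur in the text."""
        words = set(re.findall(r"\w+", text.lower()))
        hits = []
        for word in words:
            for filter_id in self._by_anchor.get(word, ()):
                if self._terms[filter_id] <= words:
                    hits.append(filter_id)
        return sorted(hits)
```

    Only filters anchored on a word that actually occurs in the document are examined, so the cost depends on the document's vocabulary rather than on the total number of registered filters.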

    This performance improvement becomes very important in large scale enterprise applications or Internet applications with tens of thousands of users having different individual needs for data monitoring.

    Alert triggered events can be emails, scripts writing alert messages into database files, SMS tools or any other messaging system that sends an alert signal to other applications. This feature helps many businesses subscribe to context agents and stay informed when content update events of interest happen on a Siets engine.

    For example, using context alerting, users can subscribe to a news agency's press releases or to technical documentation alerts by specifying keywords or phrases for full text search matches.

    SIETS alerting functionality is described in section: Siets API Alerting Functions

    Mirroring of full database replicas in many active copies

    Generic mirroring (replication) functionality is supported by Siets Server through its software architecture design. Siets software customers do not need to invest in high-end solutions just for this functionality. It's built in.

    Replication enables Siets customers to implement highly redundant operational environments where multiple Siets servers are being run in parallel on different hardware servers, load-balancing query workload among multiple copies of the entire database.

    Any Siets Server deployment listed in the mirroring list in the configuration file will be sent the same data update by the mirror server that receives the update query.

    In this way developers do not need to complicate their application logic: Siets Server software will do automatic mirroring of updates on all servers configured as 'mirrors'.
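
    The update flow described above, where whichever mirror receives an update passes it on to the others, can be modelled in memory as follows. The class and function names are hypothetical, and the sketch leaves out everything a real deployment needs, such as networking, failure handling and the configuration file format.

```python
class MirrorNode:
    """In-memory stand-in for one server in a mirroring list."""

    def __init__(self, name):
        self.name = name
        self.mirrors = []    # other nodes configured as mirrors of this one
        self.documents = {}

    def apply(self, doc_id, content):
        """Store the update locally without forwarding it."""
        self.documents[doc_id] = content

    def update(self, doc_id, content):
        """Handle a client update: apply it locally, then forward it to every mirror."""
        self.apply(doc_id, content)
        for mirror in self.mirrors:
            mirror.apply(doc_id, content)


def link_mirrors(nodes):
    """Make every node list all the others as its mirrors."""
    for node in nodes:
        node.mirrors = [other for other in nodes if other is not node]


nodes = [MirrorNode(name) for name in ("a", "b", "c")]
link_mirrors(nodes)
nodes[1].update("doc1", "<doc>hello</doc>")  # the client talks to one mirror only
print([sorted(node.documents) for node in nodes])
```

    Forwarded updates use `apply` rather than `update`, so a mirror that receives a forwarded change does not forward it again.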

    Read more about mirroring here: Siets Cluster Mirroring

    Unlimited number of documents

    Siets Server does not limit number of documents.

    If a single server's RAM and disk storage space is too small to accommodate all customer XML documents, Siets Server capacity can be doubled by installing a second hardware server of the same configuration and then splitting the data corpus in half.

    Each half of the database will then be served by a separate hardware server, with both servers acting as one large "virtual" database storage.

    The process of splitting data into more and more parts (shards) can be repeated periodically as more hardware servers are added.

    This would effectively scale out even a giant size database, for instance, to build Internet search index with billions of searchable documents.
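
    Splitting a corpus into shards needs some rule for deciding which server holds which document. The text does not say how Siets Server assigns documents to shards, so the sketch below uses one common assumption, a stable hash of the document identifier; the function names are hypothetical.

```python
import zlib


def shard_for(doc_id, shard_count):
    """Pick a shard index from a stable hash of the document identifier."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


def split_corpus(doc_ids, shard_count):
    """Group document identifiers into shard_count lists."""
    shards = [[] for _ in range(shard_count)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, shard_count)].append(doc_id)
    return shards


doc_ids = [f"doc-{number}" for number in range(1000)]
for index, shard in enumerate(split_corpus(doc_ids, 2)):
    print(f"shard {index}: {len(shard)} documents")
```

    A stable hash means any client can compute where a document lives without a lookup table, although simple modulo assignment moves many documents when the shard count changes.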

    Unlimited number of storages

    Siets Server can be configured to run multiple searchable storages (data stores with own XML document collection) per one hardware server.

    Each storage runs as its own Siets Server instance in hardware memory, in an isolated OS daemon process, and is using its own RAM and its own local disk storage folder to process and store data.

    Customers can start and stop individual Siets Server storages manually through Siets Enterprise Manager GUI, giving Siets Server administrators secure and full control over Siets Server database availability and security at any time.

    The total number of Siets storages per server is limited only by available local RAM and disk storage space.

    With Siets you can create as many storages per server as you like. This is convenient when you need a testing platform or want to prove a new application concept on a copy of your data.

    Typically customers can run some 15-20 storages per single hardware server without running into local RAM and disk storage space limitation problems.

    Unlimited number of users

    Siets Server does not limit number of users per Siets Server.

    Typically Siets Server is used in corporate environments to serve only the customer's internal application software, which takes care of all end-user authentication and authorization. In a practical setup that means at most a few, or some tens of, API users per Siets Server in a typical corporate deployment, where the applications in turn serve thousands or even millions of end-user web queries.

    Each Siets API client application user could be safely restricted for use of certain Siets Server storages only or for certain Siets API commands only for internal safety partitioning among developers or production system network administrators.

    One can create as many internal SIETS API users as necessary for developer or testing groups to work with a single Siets Server deployment per organization.

    For performance reasons, there is no need to encrypt internal Siets API http messaging between Siets Server and client application software unless the business requires extra security. Encryption in a well-secured environment would just unnecessarily slow down communication between Siets Server and the customer application software accessing it through Siets API.

    For the same reason, Siets Server does not maintain a more complex and slower user-session based authentication system; it provides basic password based authentication for each Siets API call.

    It is assumed that the client application will work as middleware in a 3-tier computing system and will not allow external users to access Siets Server directly from outside the corporate firewall.

    It is recommended to operate Siets Server only behind corporate firewalls and even without public IP address access so that no one can get direct unauthorized access to Siets Server hardware.

    This allows significant savings for any growing business that adds new users and new applications every day.

    For more customer benefits please visit section: Solutions