Searching in the TextGrid Repository

The syntax of Apache Lucene

The TextGrid Repository uses the query syntax of Apache Lucene 2.9.4. The following paragraphs explain how this syntax works; the explanation largely follows the summary on the Apache Lucene website.

In this syntax, search queries are built from search terms and so-called operators. Search terms can be narrowed down by means of fields.

Search terms and search fields

You can search for single words as well as for phrases, which are marked as belonging together by enclosing them in quotation marks, e.g. "TextGrid Repository". Search terms can be restricted to certain fields:

field-name:search-term

or

field-name:"multipart phrase"

The TextGrid Repository provides the following fields:

  • “title” for the title of the work
  • “edition.agent.value” for the author
  • “language” for the language of the work
  • “notes” for notes on the text
  • “genre” for the genre
  • “rightsHolder” for the rights holder of the digital version of the text
  • “work.dateOfCreation.date”, “work.dateOfCreation.notBefore” and “work.dateOfCreation.notAfter” for dates of the work
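The field syntax above can be assembled programmatically. The following is a minimal sketch, not part of TextGrid: the helper name field_query is hypothetical and the search values are purely illustrative. It quotes the term whenever it contains whitespace and does not escape special characters.

```python
def field_query(field: str, term: str) -> str:
    """Build a field-restricted query clause (hypothetical helper).

    A term containing whitespace is treated as a phrase and wrapped
    in straight double quotes, as in field-name:"multipart phrase".
    """
    if any(ch.isspace() for ch in term):
        return f'{field}:"{term}"'
    return f"{field}:{term}"


# Illustrative values only; any term can be used.
print(field_query("title", "Faust"))
print(field_query("edition.agent.value", "Thomas Mann"))
```

The resulting strings can be typed into the search field as they are.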

The “Advanced Search” lets you select the metadata fields to search in directly and combine them with operators to build search queries.

Search queries can be modified in several ways: with wildcards, fuzzy search, proximity search (specifying the distance between words), range search, and boosting (assigning different relevance weights to search terms).

  • Wildcards: Within single words, ? replaces one character and * stands for any number of characters, e.g. Text?rid or *xtgrid.
  • Fuzzy search: Appending a ~ to a word makes the search fuzzy, based on the Levenshtein distance. The ~ can be followed by a value between 0 and 1; the closer the value is to 1, the greater the required similarity. The default value is 0.5.
  • Proximity search: When searching for phrases, appending a ~ and a number to the phrase specifies the distance between the words of the phrase, e.g. "TextGrid Repository"~10. The number indicates how many words may lie between the words of the phrase. The “Advanced Search” lets you enter this number directly in the search form.
  • Ranges: Connecting two search values with TO finds all values in the field that lie between them. This works for numerical values as well as for words; for words, alphabetical order applies. Ranges that include the given search values are written in [], ranges that exclude them in {}. E.g. edition.agent.value:[Aristophanes TO Zuckmayer] searches for all author names between “Aristophanes” and “Zuckmayer”, including those two names.
  • Boosting: Appending a ^ and a number to a search term or phrase marks it as more relevant, e.g. TextGrid^5 Repository. The default value is 1.
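The modifiers listed above are plain suffixes and brackets, so they can be generated with simple string helpers. The sketch below assumes nothing about TextGrid beyond the syntax described here; the function names fuzzy, proximity, range_query and boost are hypothetical.

```python
def fuzzy(term: str, similarity: float = 0.5) -> str:
    """Append ~ and a similarity between 0 and 1 to a single word."""
    if not 0 <= similarity <= 1:
        raise ValueError("similarity must lie between 0 and 1")
    return f"{term}~{similarity}"


def proximity(phrase: str, distance: int) -> str:
    """Quote a phrase and append ~ and the allowed word distance."""
    return f'"{phrase}"~{distance}'


def range_query(field: str, lower: str, upper: str, inclusive: bool = True) -> str:
    """Build field:[lower TO upper] (inclusive) or field:{lower TO upper} (exclusive)."""
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{opening}{lower} TO {upper}{closing}"


def boost(term_or_phrase: str, factor: int) -> str:
    """Append ^ and a boost factor to a term or an already quoted phrase."""
    return f"{term_or_phrase}^{factor}"


print(fuzzy("TextGrid", 0.8))
print(proximity("TextGrid Repository", 10))
print(range_query("edition.agent.value", "Aristophanes", "Zuckmayer"))
print(boost("TextGrid", 5) + " Repository")
```

The last three calls reproduce the examples given in the list above.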

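The following characters have a special meaning in the syntax and must be escaped with a preceding \ if they are to be searched for literally: + - && || ! ( ) { } [ ] ^ " ~ * ? : \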
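Escaping by hand is error-prone when a search term comes from user input. The following is a minimal sketch of an escaping helper; the name escape_term is hypothetical, and it escapes every single & and | rather than only the pairs && and ||, which is a simplifying assumption.

```python
# Characters that are prefixed with a backslash. Escaping single
# & and | (not just && and ||) is an assumption made for simplicity.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_term(text: str) -> str:
    """Prefix every special character in text with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_term("(1+1):2"))  # prints \(1\+1\)\:2
```

Escaping should be applied to the raw term only, before any operators, wildcards or field names are added, since those would otherwise be escaped as well.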

Operators

Lucene uses Boolean operators to combine search terms and phrases. The default operator is OR, which is equivalent to ||. Boolean operators must be written in capital letters.

  • AND (equivalent to &&): Finds texts that contain all of the search terms.
  • +: The search term that follows must be contained in the text.
  • NOT (equivalent to ! or -): The search term that follows must not be contained in the text. Using this operator at the beginning of a search query can slow down the search.

Lucene supports grouping with parentheses to combine Boolean operators, e.g. TextGrid AND (Laboratory OR Repository) finds all texts that contain the word “TextGrid” and, in addition, the word “Laboratory” or the word “Repository”. Grouping can be used with fields as well.
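Nested combinations are easier to keep correct when every group adds its own parentheses. The sketch below shows one way to do this; the helper names all_of and any_of are hypothetical, and the field values in the second example are illustrative only.

```python
def all_of(*clauses: str) -> str:
    """Join clauses with AND and wrap the group in parentheses."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


# The example from the text; the outer parentheses are redundant but harmless.
print(all_of("TextGrid", any_of("Laboratory", "Repository")))

# The same mechanism with fields (illustrative values).
print(all_of("title:Faust", any_of("edition.agent.value:Goethe",
                                   "edition.agent.value:Marlowe")))
```

Always parenthesizing each group avoids having to reason about how AND and OR bind when they are mixed in one query.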