Configuring Your Data Model

The following sections describe the configuration files that govern the internal management of data records. The system searches for the files in the directories specified by the profilePath setting in the zebra.cfg file.

The Abstract Syntax

The abstract syntax definition (also known as an Abstract Record Structure, or ARS) is the focal point of the record schema description. For a given schema, the ABS file may state any or all of the following:

Several of the entries above simply refer to other files, which describe the given objects.

The Configuration Files

This section describes the syntax and use of the various tables which are used by the retrieval module.

The number of different file types may appear daunting at first, but each type corresponds fairly clearly to a single aspect of the Z39.50 retrieval facilities. Further, the average database administrator, who is simply reusing an existing profile for which tables already exist, shouldn't have to worry too much about the contents of these tables.

Generally, the files are simple ASCII files, which can be maintained using any text editor. Blank lines and lines beginning with a hash sign (#) are ignored, as is everything that follows a (#) on a line. All other lines contain directives, which provide some setting or value to the system. Generally, a setting consists of a single keyword, identifying the setting, followed by a number of parameters. Some settings are repeatable (r), while others may occur only once in a file. Some settings are optional (o), while others are mandatory (m).
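The general line format can be illustrated with a short Python sketch. It is not part of Zebra; the function name parse_directives is hypothetical, and the sketch assumes that a (#) always starts a comment and that parameters are separated by whitespace. The real parser may treat some file types differently.

```python
def parse_directives(text):
    """Return a list of (keyword, parameters) pairs, one per directive line."""
    directives = []
    for raw in text.splitlines():
        # Drop everything from the first '#' onward, then trim whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        keyword, *params = line.split()
        directives.append((keyword, params))
    return directives


sample = """\
# Element set names
esetname VARIANT gils-variant.est  # for WAIS-compliance
esetname F @

name gils
"""

for keyword, params in parse_directives(sample):
    print(keyword, params)
```

Whether a directive is repeatable or mandatory is not checked here; that depends on the file type and would be validated in a later step.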

The Abstract Syntax (.abs) Files

The name of this file type is slightly misleading in Z39.50 terms, since, apart from the actual abstract syntax of the profile, it also includes most of the other definitions that go into a database profile.

When a record in the canonical, SGML-like format is read from a file or from the database, the first tag of the file should reference the profile that governs the layout of the record. If the first tag of the record is, say, <gils>, the system will look for the profile definition in the file gils.abs. Profile definitions are cached, so they only have to be read once during the lifespan of the current process.
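The lookup described above can be sketched in Python as follows. This is only an illustration of the idea, not Zebra's implementation: the names profile_name and load_profile are hypothetical, the directory list stands in for whatever profilePath resolves to, and the cache simply keeps the file text in memory for the life of the process.

```python
import os
import re
from functools import lru_cache

# First opening tag in the record; declarations such as <?xml ...?> are skipped.
FIRST_TAG = re.compile(r"<([A-Za-z_][\w.-]*)")


def profile_name(record_text):
    """Return the name of the first tag of the record, e.g. 'gils'."""
    match = FIRST_TAG.search(record_text)
    if match is None:
        raise ValueError("record does not contain an opening tag")
    return match.group(1)


@lru_cache(maxsize=None)
def load_profile(name, profile_dirs):
    """Return the text of <name>.abs from the first directory that has it.

    profile_dirs must be a tuple (hashable) so that results can be cached.
    """
    for directory in profile_dirs:
        candidate = os.path.join(directory, name + ".abs")
        if os.path.isfile(candidate):
            with open(candidate, encoding="latin-1") as handle:
                return handle.read()
    raise FileNotFoundError(name + ".abs not found in " + repr(profile_dirs))
```

A record starting with <gils> would thus lead to a single read of gils.abs; later records with the same first tag are served from the cache.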

When writing your own input filters, the record-begin command introduces the profile, and should always be called first thing when introducing a new record.

The file may contain the following directives:

name symbolic-name

(m) This provides a shorthand name or description for the profile. Mostly useful for diagnostic purposes.

reference OID-name

(m) The reference name of the OID for the profile. The reference names can be found in the util module of YAZ.

attset filename

(m) The attribute set that is used for indexing and searching records belonging to this profile.

tagset filename

(o) The tag set (if any) that describes the fields of the records.

varset filename

(o) The variant set used in the profile.

maptab filename

(o,r) This points to a conversion table that might be used if the client asks for the record in a different schema from the native one.

marc filename

(o) Points to a file containing parameters for representing the record contents in the ISO2709 syntax. Read the description of the MARC representation facility below.

esetname name filename

(o,r) Associates the given element set name with an element selection file. If an (@) is given in place of the filename, this corresponds to a null mapping for the given element set name.

any tags

(o) This directive specifies a list of attributes which should be appended to the attribute list given for each element. The effect is to make every single element in the abstract syntax searchable by way of the given attributes. This directive provides an efficient way of supporting free-text searching across all elements. However, it does increase the size of the index significantly. The attributes can be qualified with a structure, as in the elm directive below.

elm path name attributes

(o,r) Adds an element to the abstract record syntax of the schema. The path follows the syntax suggested by the Z39.50 document, that is, a sequence of tags separated by slashes (/). Each tag is given as a comma-separated pair of tag type and tag value, surrounded by parentheses. The name is the name of the element, and the attributes parameter is a comma-separated list specifying which attributes to use when indexing the element. A ! in place of the attribute name is equivalent to specifying an attribute name identical to the element name. A - in place of the attribute name specifies that no indexing is to take place for the given element. The attributes can be qualified with field types to specify which character set should govern the indexing procedure for that field. The same data element may be indexed into several different fields, using different character set definitions. See the section called Field Structure and Character Sets. The default field type is w for word.
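To make the path and attribute syntax concrete, here is a small Python sketch that takes the three parameters of an elm directive apart. It is an illustration only: parse_path and parse_attributes are hypothetical names, the tag value is kept as a string on the assumption that it need not be numeric, and a field type is assumed to be separated from the attribute name by a colon, as in w:! in the GILS profile.

```python
import re

# One tag: "(type,value)" with an integer type.
TAG = re.compile(r"\((\d+),([^()/,]+)\)")


def parse_path(path):
    """Split '(4,70)/(4,90)' into [(4, '70'), (4, '90')]."""
    tags = []
    for part in path.split("/"):
        match = TAG.fullmatch(part)
        if match is None:
            raise ValueError("malformed tag: " + repr(part))
        tags.append((int(match.group(1)), match.group(2)))
    return tags


def parse_attributes(spec, element_name, default_type="w"):
    """Expand an attribute list into (field_type, attribute_name) pairs."""
    if spec == "-":
        return []  # no indexing for this element
    pairs = []
    for item in spec.split(","):
        field_type, colon, attribute = item.partition(":")
        if not colon:
            # No explicit field type: fall back to the default (w for word).
            field_type, attribute = default_type, item
        if attribute == "!":
            attribute = element_name  # '!' means "same as the element name"
        pairs.append((field_type, attribute))
    return pairs


print(parse_path("(4,70)/(4,90)/(2,7)"))
print(parse_attributes("w:!,p:!", "title"))
```

Under these assumptions, the line elm (2,1) title w:!,p:! indexes the title element twice under the attribute name title, once with field type w and once with field type p.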

xelm xpath attributes

Specifies indexing for the record nodes selected by xpath. Unlike the elm directive, this directive allows you to index attribute contents. The xpath uses a syntax similar to XPath. The attributes have the same syntax and meaning as in the elm directive, except that ! refers to the nodes selected by xpath.

encoding encodingname

This directive specifies the character encoding of external records. It is ignored for records, such as XML, that specify their encoding within the file via a header. If neither this directive is given nor an encoding is set within the external records, ISO-8859-1 encoding is assumed.

xpath enable/disable

If this directive is followed by enable, then extra indexing is performed to allow for XPath-like queries. If this directive is not specified - equivalent to disable - no extra XPath-indexing is performed.

Note: The mechanism for controlling indexing is not adequate for complex databases, and will probably be moved into a separate configuration table eventually.

The following is an excerpt from the abstract syntax file for the GILS profile.


      name gils
      reference GILS-schema
      attset gils.att
      tagset gils.tag
      varset var1.var

      maptab gils-usmarc.map

      # Element set names

      esetname VARIANT gils-variant.est  # for WAIS-compliance
      esetname B gils-b.est
      esetname G gils-g.est
      esetname F @

      elm (1,10)              rank                        -
      elm (1,12)              url                         -
      elm (1,14)              localControlNumber     Local-number
      elm (1,16)              dateOfLastModification Date/time-last-modified
      elm (2,1)               title                       w:!,p:!
      elm (4,1)               controlIdentifier      Identifier-standard
      elm (2,6)               abstract               Abstract
      elm (4,51)              purpose                     !
      elm (4,52)              originator                  - 
      elm (4,53)              accessConstraints           !
      elm (4,54)              useConstraints              !
      elm (4,70)              availability                -
      elm (4,70)/(4,90)       distributor                 -
      elm (4,70)/(4,90)/(2,7) distributorName             !
      elm (4,70)/(4,90)/(2,10) distributorOrganization    !
      elm (4,70)/(4,90)/(4,2) distributorStreetAddress    !
      elm (4,70)/(4,90)/(4,3) distributorCity             !
     

The Attribute Set (.att) Files

This file type describes the Use elements of an attribute set. It contains the following directives.

This is an excerpt from the GILS attribute set definition. Notice how the file describing the bib-1 attribute set is referenced.


      name gils
      reference GILS-attset
      include bib1.att

      att 2001		distributorName
      att 2002		indextermsControlled
      att 2003		purpose
      att 2004		accessConstraints
      att 2005		useConstraints
     

The Tag Set (.tag) Files

This file type defines the tagset of the profile, possibly by referencing other tag sets (most tag sets, for instance, will include tagsetG and tagsetM from the Z39.50 specification). The file may contain the following directives.

The following is an excerpt from the TagsetG definition file.


      name tagsetg
      reference TagsetG
      type 2

      tag	1	title		string
      tag	2	author		string
      tag	3	publicationPlace string
      tag	4	publicationDate	string
      tag	5	documentId	string
      tag	6	abstract	string
      tag	7	name		string
      tag	8	date		generalizedtime
      tag	9	bodyOfDisplay	string
      tag	10	organization	string
     

The Variant Set (.var) Files

The variant set file is a straightforward representation of the variant set definitions associated with the protocol. At present, only the Variant-1 set is known.

These are the directives allowed in the file.

The following is an excerpt from the file describing the variant set Variant-1.


      name variant-1
      reference Variant-1

      class 1 variantId

      type	1	variantId		octetstring

      class 2 body

      type	1	iana			string
      type	2	z39.50			string
      type	3	other			string
     

The Element Set (.est) Files

The element set specification files describe a selection of a subset of the elements of a database record. The element selection mechanism is equivalent to the one supplied by the Espec-1 syntax of the Z39.50 specification. In fact, the internal representation of an element set specification is identical to the Espec-1 structure, and we'll refer you to the description of that structure for most of the detailed semantics of the directives below.

The directives available in the element set file are as follows:

defaultVariantSetId OID-name

(o) If variants are used in the following, this should provide the name of the variant set used (it's not currently possible to specify a different set in the individual variant request). In almost all cases (certainly all profiles known to us), the name Variant-1 should be given here.

defaultVariantRequest variant-request

(o) This directive provides a default variant request for use when the individual element requests (see below) do not contain a variant request. Variant requests consist of a blank-separated list of variant components. A variant component is a comma-separated, parenthesized triple of variant class, type, and value (the two former values being represented as integers). The value can currently only be entered as a string (this will change to depend on the definition of the variant in question). The special value (@) is interpreted as a null value, however.
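The shape of a variant request can be illustrated with a short Python sketch. The function name parse_variant_request and the sample values are hypothetical; the sketch assumes that a value contains neither blanks nor parentheses, and it represents the null value (@) as None.

```python
import re

# One component: "(class,type,value)" with integer class and type.
COMPONENT = re.compile(r"\((\d+),(\d+),([^()]*)\)")


def parse_variant_request(text):
    """Return a list of (variant_class, variant_type, value) triples."""
    components = []
    for token in text.split():  # components are blank-separated
        match = COMPONENT.fullmatch(token)
        if match is None:
            raise ValueError("malformed variant component: " + repr(token))
        value = match.group(3)
        components.append(
            (int(match.group(1)), int(match.group(2)), None if value == "@" else value)
        )
    return components


# Illustrative input only: class 2 is 'body' and type 1 is 'iana' in Variant-1.
print(parse_variant_request("(1,1,@) (2,1,text/plain)"))
```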

simpleElement path ['variant' variant-request]

(o,r) This corresponds to a simple element request in Espec-1. The path consists of a sequence of tag-selectors, where each of these can consist of either:

The occurrences-specification can be either the string all, the string last, or an explicit value-range. The value-range is represented as an integer (the starting point), possibly followed by a plus (+) and a second integer (the number of elements, default being one).
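As a sketch of how an occurrences-specification might be interpreted, the following Python function returns either one of the two keywords or a (start, count) pair. The function name is hypothetical and the return format is chosen only for illustration.

```python
def parse_occurrences(spec):
    """Interpret 'all', 'last', 'N' or 'N+M' as described in the text."""
    if spec in ("all", "last"):
        return spec
    start, plus, count = spec.partition("+")
    # A missing '+M' part means a single element.
    return (int(start), int(count) if plus else 1)


print(parse_occurrences("all"))
print(parse_occurrences("3"))
print(parse_occurrences("3+2"))
```

Malformed input such as "3+" raises ValueError from int(), which is enough for a sketch.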

The variant-request has the same syntax as the defaultVariantRequest above. Note that it may sometimes be useful to give an empty variant request, simply to disable the default for a specific set of fields (we aren't certain if this is proper Espec-1, but it works in this implementation).

The following is an example of an element specification belonging to the GILS profile.


      simpleelement (1,10)
      simpleelement (1,12)
      simpleelement (2,1)
      simpleelement (1,14)
      simpleelement (4,1)
      simpleelement (4,52)
     

The Schema Mapping (.map) Files

Sometimes, the client might want to receive a database record in a schema that differs from the native schema of the record. For instance, a client might only know how to process WAIS records, while the database record is represented in a more specific schema, such as GILS. In this module, a mapping of data to one of the MARC formats is also thought of as a schema mapping (mapping the elements of the record into fields consistent with the given MARC specification, prior to actually converting the data to the ISO2709 format). This use of the object identifier for USMARC as a schema identifier represents an overloading of the OID which might not be entirely proper. However, it reflects the dual role of schema and record syntax which is assumed by the MARC family in Z39.50.

These are the directives of the schema mapping file format:

The MARC (ISO2709) Representation (.mar) Files

This file provides rules for representing a record in the ISO2709 format. The rules pertain mostly to the values of the constant-length header of the record.

Field Structure and Character Sets

In order to provide a flexible approach to national character set handling, Zebra allows the administrator to set up the system to handle any 8-bit character set, including sets that require multi-octet diacritics or other multi-octet characters. The definition of a character set includes a specification of the permissible values, their sort order (this affects the display in the SCAN function), and relationships between upper- and lowercase characters. Finally, the definition includes the specification of space characters for the set.

The operator can define different character sets for different fields, typical examples being standard text fields, numerical fields, and special-purpose fields such as WWW-style linkages (URx).

The field types, and hence character sets, are associated with data elements by the .abs files (see above). The file default.idx provides the association between field type codes (as used in the .abs files) and the character map files (with the .chr suffix). The format of the .idx file is as follows:

index field type code

This directive introduces a new search index code. The argument is a one-character code to be used in the .abs files to select this particular index type. An index, roughly, corresponds to a particular structure attribute during search. Refer to the Section called Search in Chapter 7.

sort field code type

This directive introduces a sort index. The argument is a one-character code to be used in the .abs file to select this particular index type. The corresponding use attribute must be used in the sort request to refer to this particular sort index. The corresponding character map (see below) is used in the sort process.

completeness boolean

This directive enables or disables complete field indexing. The value of the boolean should be 0 (disable) or 1 (enable). If completeness is enabled, the index entry will contain the complete contents of the field (up to a limit), with words (non-space characters) separated by single space characters (normalized to " " on display). When completeness is disabled, each word is indexed as a separate entry. Complete subfield indexing is most useful for fields which are typically browsed (e.g. titles, authors, or subjects), or instances where a match on a complete subfield is essential (e.g. exact title searching). For fields where completeness is disabled, the search engine will interpret a search containing space characters as a word proximity search.
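The difference between the two modes can be shown with a minimal Python sketch. It is a simplification made for illustration: the function name is hypothetical, ordinary whitespace stands in for the space characters of the character map, and neither the length limit nor any case mapping is modelled.

```python
def index_entries(field_text, complete):
    """Return the index entries produced for one field value."""
    words = field_text.split()  # runs of space characters separate words
    if not words:
        return []
    if complete:
        # One entry holding the whole field, words joined by single spaces.
        return [" ".join(words)]
    # One entry per word.
    return words


print(index_entries("The  Old Man   and the Sea", complete=True))
print(index_entries("The  Old Man   and the Sea", complete=False))
```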

charmap filename

This is the filename of the character map to be used for this index (field type).
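The following Python sketch shows one way the directives above could be grouped when reading an .idx file. It assumes, as the descriptions suggest, that each index or sort directive opens a new entry and that the completeness and charmap directives which follow belong to it. The function name parse_idx, the sample content, and the file name string.chr are hypothetical and serve only as an illustration.

```python
def parse_idx(text):
    """Group the directives of an .idx file into one dict per index."""
    entries = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # strip comments, split on blanks
        if not fields:
            continue
        keyword, args = fields[0], fields[1:]
        if keyword in ("index", "sort"):
            # A new index or sort index starts here.
            entries.append({"kind": keyword, "code": args[0]})
        elif not entries:
            raise ValueError(repr(keyword) + " appears before any index or sort directive")
        elif keyword == "completeness":
            entries[-1]["completeness"] = args[0] == "1"
        elif keyword == "charmap":
            entries[-1]["charmap"] = args[0]
        else:
            entries[-1][keyword] = args  # keep unknown directives untouched
    return entries


sample = """\
index w              # word index
completeness 0
charmap string.chr

sort s               # sort index
completeness 1
charmap string.chr
"""

for entry in parse_idx(sample):
    print(entry)
```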

The contents of the character map files are structured as follows:

lowercase value-set

This directive introduces the basic value set of the field type. The format is an ordered list (without spaces) of the characters which may occur in "words" of the given type. The order of the entries in the list determines the sort order of the index. In addition to single characters, the following combinations are legal:

  • Backslashes may be used to introduce three-digit octal, or two-digit hex representations of single characters (the latter preceded by x). In addition, the combinations \\, \r, \n, \t, and \s (space; remember that real space characters may not occur in the value definition) are recognized, with their usual interpretation.

  • Curly braces {} may be used to enclose ranges of single characters (possibly using the escape convention described in the preceding point), e.g. {a-z} to introduce the standard range of lowercase ASCII letters. Note that the interpretation of such a range depends on the concrete representation in your local, physical character set.

  • Parentheses () may be used to enclose multi-byte characters, e.g. diacritics or special national combinations (e.g. Spanish "ll"). When found in the input stream (or a search term), these characters are viewed and sorted as a single character, with a sorting value depending on the position of the group in the value statement.
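The three conventions above can be illustrated with a Python sketch that expands a value set into its ordered list of entries. It is not Zebra's parser: the function names are hypothetical, escapes are assumed to be written with a single backslash (\r, \n, \t, \s, \xHH, and three octal digits), any other escaped character is taken literally, and escaped parentheses inside a group are not handled.

```python
SIMPLE_ESCAPES = {"\\": "\\", "r": "\r", "n": "\n", "t": "\t", "s": " "}


def read_char(text, i):
    """Read one (possibly escaped) character at text[i]; return (char, next_i)."""
    if text[i] != "\\":
        return text[i], i + 1
    nxt = text[i + 1]
    if nxt == "x":  # \xHH: two hex digits
        return chr(int(text[i + 2:i + 4], 16)), i + 4
    if nxt.isdigit():  # \OOO: three octal digits
        return chr(int(text[i + 1:i + 4], 8)), i + 4
    return SIMPLE_ESCAPES.get(nxt, nxt), i + 2


def expand_value_set(text):
    """Return the ordered entries of a value set; list order is sort order."""
    entries, i = [], 0
    while i < len(text):
        if text[i] == "{":  # {a-z}: a range of single characters
            low, i = read_char(text, i + 1)
            if text[i] != "-":
                raise ValueError("expected '-' in range")
            high, i = read_char(text, i + 1)
            if text[i] != "}":
                raise ValueError("expected '}' after range")
            entries.extend(chr(c) for c in range(ord(low), ord(high) + 1))
            i += 1
        elif text[i] == "(":  # (ll): several characters treated as one
            end = text.index(")", i)
            group, j = [], i + 1
            while j < end:
                char, j = read_char(text, j)
                group.append(char)
            entries.append("".join(group))
            i = end + 1
        else:
            char, i = read_char(text, i)
            entries.append(char)
    return entries


print(expand_value_set(r"{a-c}(ll)\x41\101\s"))
```

Because the position of an entry determines its sort value, a lookup such as {entry: rank for rank, entry in enumerate(entries)} would give the collation key for each entry.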

uppercase value-set

This directive introduces the uppercase equivalents of the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive.

space value-set

This directive introduces the characters which separate words in the input stream. Depending on the completeness mode of the field in question, these characters either terminate an index entry, or delimit individual "words" in the input stream. The order of the elements is not significant; otherwise the representation is the same as for the uppercase and lowercase directives.

map value-set target

This directive introduces a mapping from each of the members of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character representations to their natural form, etc.
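How the lowercase, uppercase, and map directives fit together can be sketched in Python as a single lookup table from input sequences to normalized entries. This is an illustration under simplifying assumptions: build_table is a hypothetical name, the value sets are given as already-expanded lists of entries, and the sample characters are invented for the example.

```python
def build_table(lowercase, uppercase, mappings):
    """Map every recognized input sequence to an entry of the lowercase set."""
    if len(lowercase) != len(uppercase):
        raise ValueError("lowercase and uppercase lists differ in length")
    # Each lowercase entry maps to itself ...
    table = {entry: entry for entry in lowercase}
    # ... and each uppercase entry to the lowercase entry at the same position.
    table.update(zip(uppercase, lowercase))
    for sources, target in mappings:
        if target not in lowercase:
            raise ValueError("map target " + repr(target) + " is not in the value set")
        for source in sources:
            table[source] = target
    return table


lowercase = ["a", "e", "ll"]
uppercase = ["A", "E", "LL"]
# One map directive: a diacritic and an HTML-style entity both map to 'e'.
mappings = [(["é", "&eacute;"], "e")]

table = build_table(lowercase, uppercase, mappings)
print(table["E"], table["LL"], table["&eacute;"])
```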