The Flamenco Project

Flamenco Documentation

Preparing Your Data

For Flamenco to load your collection, the metadata about the collection has to be provided in tab-delimited files (also known as TSV files, with a ".tsv" extension). TSV files can be easily manipulated using OpenOffice or Microsoft Excel. A sample collection, containing the winners of the Nobel Prize from 1901 to 2004, is provided in the example directory of the Flamenco distribution. You can load this collection into Flamenco and browse it, and you can examine the TSV files in the example directory to see how the data needs to be formatted.

A Flamenco collection is a set of items that are all the same kind (for example, all items are books, or all items are songs, and so on). The metadata about any given item consists of its facet values and attribute values. The first step in preparing your collection is to decide which information will be in facets and which will be in attributes. Facet values are used to organize items into categories, whereas attribute values are only displayed with individual items.

In the sample collection, for instance, prize is a facet indicating the type of Nobel Prize won, whereas name is an attribute for the name of the winner. That's because it makes sense to group Nobel Prize winners into categories by the type of prize, but not by their names.

Facet values are associated with ID numbers, whereas attribute values are strings. When an item belongs to a category, and the category belongs to a particular facet, the item has that category term as a value for that facet. "Facet value" and "category term" mean the same thing. For example, since Mother Teresa won the Nobel Peace Prize, Mother Teresa has one value in the prize facet, the prize category named "peace". The value of the name attribute for Mother Teresa is the string "Mother Teresa".

The TSV files you need to provide are:

attrs.tsv

attrs.tsv gives the list of attributes. Each line in this file represents one attribute. The tab-separated fields in the line should be as follows.

Field 1                 Field 2
attribute identifier    displayable name

The attribute identifier should be a short, unique name containing only letters or underscores (no spaces or punctuation). The displayable name is what will be shown in the user interface. The example below gives three attributes.

Example
name         Full Name
birthyear    Year of Birth
deathyear    Year of Death

facets.tsv

facets.tsv gives the list of facets. Each line in this file represents one facet. The tab-separated fields in the line should be as follows.

Field 1             Field 2             Field 3
facet identifier    displayable name    long description

The facet identifier should be a short, unique name containing only letters or underscores. (Facet and attribute identifiers must be unique among both facets and attributes.) The displayable name is what will be shown in the user interface. The long description gives a more detailed description of the facet. The example below gives four facets.

Example
gender         Gender         gender
affiliation    Affiliation    affiliation at the time of the award
prize          Prize          type of the Nobel Prize won
year           Year           year that the Nobel Prize was won
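The identifier rules for attrs.tsv and facets.tsv can be checked before loading. The following is a minimal sketch, not part of Flamenco: the function names are hypothetical, and it assumes that "letters or underscores" means ASCII letters and "_" only.

```python
import re

# Assumption: identifiers may contain only ASCII letters and underscores.
IDENT_RE = re.compile(r"[A-Za-z_]+")

def parse_tsv(text):
    """Split TSV text into rows of fields, skipping empty lines."""
    return [line.split("\t") for line in text.splitlines() if line]

def check_identifiers(attr_rows, facet_rows):
    """Return a list of problems found in attrs.tsv and facets.tsv rows."""
    problems = []
    seen = set()  # shared, because identifiers must be unique across both files
    for kind, rows, width in (("attribute", attr_rows, 2), ("facet", facet_rows, 3)):
        for row in rows:
            ident = row[0]
            if len(row) != width:
                problems.append(f"{kind} {ident!r}: expected {width} fields, got {len(row)}")
            if not IDENT_RE.fullmatch(ident):
                problems.append(f"{kind} {ident!r}: only letters and underscores are allowed")
            if ident in seen:
                problems.append(f"{kind} {ident!r}: identifier already used")
            seen.add(ident)
    return problems

attrs = parse_tsv(
    "name\tFull Name\n"
    "birthyear\tYear of Birth\n"
    "deathyear\tYear of Death\n"
)
facets = parse_tsv(
    "gender\tGender\tgender\n"
    "affiliation\tAffiliation\taffiliation at the time of the award\n"
    "prize\tPrize\ttype of the Nobel Prize won\n"
    "year\tYear\tyear that the Nobel Prize was won\n"
)
print(check_identifiers(attrs, facets))  # prints []
```

Sharing one set of seen identifiers between the two files is what enforces uniqueness among both facets and attributes.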

items.tsv

items.tsv gives the IDs and attribute values for all the items. Each line of the file represents one item. If there are n attributes, then each line should have n + 1 fields, as follows.

Field 1            Field 2                  Field 3                  ...    Field n + 1
item identifier    value for attribute 1    value for attribute 2    ...    value for attribute n

Each item must have a unique identifier, which can be any number or string. It's best to use identifiers that are fairly short (fewer than 30 characters). The item identifier is followed by the values for each attribute, in the order that the attributes were given in attrs.tsv. The example below shows five items excerpted from a longer file, each with three attributes as given in the attrs.tsv example above.

Example
.
.
.
237    Alfred Werner       1866    1919
238    Marie Curie         1867    1934
239    Jody Williams       1950
240    Jack Steinberger    1921
241    Linus Pauling       1901    1994
.
.
.

It's fine to leave any of the attribute values blank, but note that each line still must have exactly n + 1 fields (that is, there must be exactly n tab characters). In this example, the lines for items 239 and 240 would each end in a tab character.
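The field-count rule is easy to break when a spreadsheet or editor drops trailing tabs. The sketch below shows one way to catch that, assuming the file has already been read into a string. The function name check_items is hypothetical and is not part of Flamenco.

```python
def check_items(items_text, n_attrs):
    """Return (line number, message) pairs for malformed items.tsv lines."""
    problems = []
    seen_ids = set()
    expected = n_attrs + 1  # item identifier plus one field per attribute
    for number, line in enumerate(items_text.splitlines(), start=1):
        if not line:
            continue
        # Do not strip the line: a trailing tab marks a blank final attribute.
        fields = line.split("\t")
        if len(fields) != expected:
            problems.append((number, f"expected {expected} fields, got {len(fields)}"))
        if fields[0] in seen_ids:
            problems.append((number, f"duplicate item identifier {fields[0]!r}"))
        seen_ids.add(fields[0])
    return problems

sample = (
    "238\tMarie Curie\t1867\t1934\n"
    "239\tJody Williams\t1950\t\n"    # blank deathyear, trailing tab kept
    "240\tJack Steinberger\t1921\n"   # trailing tab missing: only 3 fields
)
print(check_items(sample, 3))  # prints [(3, 'expected 4 fields, got 3')]
```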

facet_terms.tsv

For each facet, the file named facet_terms.tsv (where facet is replaced by the facet identifier as specified in the first column of facets.tsv; for example, prize_terms.tsv) gives the tree of category terms in the facet. This is the only file where each line can have a different number of fields. Each line represents one category, and gives the entire chain of ancestor categories leading down to that category. If the category is d levels deep, then the line has d + 1 fields.

Field 1            Field 2           ...    Field d - 1         Field d        Field d + 1
term identifier    top-level term    ...    grandparent term    parent term    category term

The term identifier must be a number unique within the facet. The tree structure is inferred by matching the category terms, so if two terms are subcategories of the same parent, make sure the parent term matches exactly.

prize is an example of a flat facet (disjoint categories with no subcategories). The prize_terms.tsv file might look like this.

Example
1    chemistry
2    economics
3    literature
4    medicine
5    peace
6    physics

affiliation is a hierarchical facet in the sample collection, arranging each Nobel Prize winner's affiliated organizations under the cities and countries to which they belong. Some of the lines in the affiliation_terms.tsv file might look like this.

Example
.
.
.
82    Switzerland
83    Switzerland    Geneva
84    Switzerland    Geneva    CERN
85    Switzerland    Zurich
86    Switzerland    Zurich    University of Zurich
.
.
.

As this example shows, categories at different levels are all distinct, and items can be assigned to them at any level. Also, two different categories can have the same category name as long as their parent categories are different.

Separate lines for each parent category (such as 82 and 83 in this example) are allowed but not required. If they are not present, Flamenco will automatically generate identifiers for the parent categories (for example, the CERN category will generate three nested categories, Switzerland, Geneva, and CERN).
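To illustrate how a tree can be inferred from the term paths, here is a rough sketch that keys every category by its full chain of terms and fills in any ancestors that have no line of their own. The function name and the numbering of the generated identifiers (fresh numbers above the largest explicit one) are assumptions made for this example. This documentation does not describe how Flamenco itself numbers generated categories.

```python
def build_term_tree(terms_text):
    """Map each category path (a tuple of terms) to a numeric identifier."""
    ids = {}
    for line in terms_text.splitlines():
        if not line:
            continue
        term_id, *path = line.split("\t")
        ids[tuple(path)] = int(term_id)
    # Assumption: missing ancestors get fresh identifiers above the largest
    # explicit one. Flamenco's real numbering scheme may differ.
    next_id = max(ids.values(), default=0) + 1
    for path in sorted(ids):  # sorted copy, so numbering is deterministic
        for depth in range(1, len(path)):
            ancestor = path[:depth]
            if ancestor not in ids:
                ids[ancestor] = next_id
                next_id += 1
    return ids

terms = (
    "84\tSwitzerland\tGeneva\tCERN\n"
    "86\tSwitzerland\tZurich\tUniversity of Zurich\n"
)
tree = build_term_tree(terms)
for path, term_id in sorted(tree.items(), key=lambda entry: entry[1]):
    print(term_id, " > ".join(path))
```

Keying on the whole path rather than the term name alone is what lets two categories share a name under different parents, and it is also why a parent term must be spelled identically on every line.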

facet_map.tsv

For each facet, the file named facet_map.tsv (where facet is replaced by the facet identifier as specified in the first column of facets.tsv; for example, prize_map.tsv) assigns items to the category terms for that facet. Each line in this file has two fields.

Field 1            Field 2
item identifier    term identifier

The following example puts Alfred Werner in the category for the University of Zurich and Jack Steinberger in the category for CERN.

Example
237    82
237    86
240    84

The first line of this example is redundant but harmless. Whether or not the first line is present, item 237 (Alfred Werner) will automatically be assigned to category 82 (Switzerland), because Switzerland is an ancestor of category 86 (University of Zurich). The same item identifier can appear in multiple lines, which assigns the item to multiple categories in the facet.
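The automatic assignment to ancestor categories can be pictured with a short sketch. It is an illustration only, under the assumption that every ancestor has its own line in the terms file. The function name expand_assignments is hypothetical, and Flamenco's internal implementation is not described here.

```python
def expand_assignments(terms_text, map_text):
    """Return {item id: set of term ids}, including all ancestor categories."""
    paths = {}  # term id -> chain of terms from the top level down
    for line in terms_text.splitlines():
        if line:
            term_id, *path = line.split("\t")
            paths[term_id] = tuple(path)
    id_of = {path: term_id for term_id, path in paths.items()}

    assigned = {}
    for line in map_text.splitlines():
        if not line:
            continue
        item_id, term_id = line.split("\t")
        path = paths[term_id]
        # Every prefix of the path is an ancestor (the full path is the term itself).
        for depth in range(1, len(path) + 1):
            ancestor_id = id_of.get(path[:depth])
            if ancestor_id is not None:  # ancestors without a line are skipped
                assigned.setdefault(item_id, set()).add(ancestor_id)
    return assigned

terms = (
    "82\tSwitzerland\n"
    "83\tSwitzerland\tGeneva\n"
    "84\tSwitzerland\tGeneva\tCERN\n"
    "85\tSwitzerland\tZurich\n"
    "86\tSwitzerland\tZurich\tUniversity of Zurich\n"
)
mapping = "237\t82\n237\t86\n240\t84\n"
result = expand_assignments(terms, mapping)
print({item: sorted(ids) for item, ids in result.items()})
# prints {'237': ['82', '85', '86'], '240': ['82', '83', '84']}
```

Dropping the redundant "237, 82" line from the mapping gives the same result, since 82 is reached through the path of term 86.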

sortkeys.tsv

sortkeys.tsv indicates which facets or attributes are to be used for sorting result lists. This file is optional. If it is present, each line corresponds to one sort key (either a facet or an attribute).

Field 1                          Field 2
facet or attribute identifier    description

The first field is the identifier of an attribute or facet, as given in the first column of attrs.tsv or facets.tsv. The second field is the text that will be used for the link that the user selects in order to sort by that attribute or facet.

Example
name         name
birthyear    year of birth
country      country
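Because each sort key must name an existing attribute or facet, a quick cross-check against the identifier columns of attrs.tsv and facets.tsv can catch typos. This is a small sketch with a hypothetical function name, and the misspelled key "birthyr" is invented for the demonstration.

```python
def unknown_sort_keys(sortkeys_text, attr_ids, facet_ids):
    """Return sort key identifiers that match no attribute or facet."""
    known = set(attr_ids) | set(facet_ids)
    unknown = []
    for line in sortkeys_text.splitlines():
        if line:
            ident = line.split("\t")[0]
            if ident not in known:
                unknown.append(ident)
    return unknown

attr_ids = ["name", "birthyear", "deathyear"]
facet_ids = ["gender", "affiliation", "prize", "year"]
sortkeys = "name\tname\nbirthyr\tyear of birth\nprize\tprize\n"
print(unknown_sort_keys(sortkeys, attr_ids, facet_ids))  # prints ['birthyr']
```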

text.tsv

text.tsv supports the text search feature of Flamenco. This file is optional. If it is present, each line corresponds to one item and provides the searchable text for the item.

Field 1            Field 2
item identifier    searchable text keywords

The following example shows some possible text keywords for the items in the items.tsv example above.

Example
.
.
.
237    professor chemistry molecule structure
238    professor sorbonne polonium radium
239    campaign to ban landmines
240    professor neutrino muon pion
241    professor chemistry molecule protein antibody
.
.
.

Searching on the term "professor" would then yield items 237 (Alfred Werner), 238 (Marie Curie), 240 (Jack Steinberger), and 241 (Linus Pauling).
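The lookup described above can be mimicked with a few lines of Python to preview what a given text.tsv would match. This sketch assumes case-insensitive matching on whole, space-separated keywords, which may not be how Flamenco's own search behaves. The function name search is hypothetical.

```python
def search(text_tsv, term):
    """Return the identifiers of items whose keywords include the term."""
    needle = term.lower()
    hits = []
    for line in text_tsv.splitlines():
        if not line:
            continue
        # Only the first tab separates the identifier from the keywords.
        item_id, _, keywords = line.partition("\t")
        if needle in keywords.lower().split():
            hits.append(item_id)
    return hits

text_tsv = (
    "237\tprofessor chemistry molecule structure\n"
    "238\tprofessor sorbonne polonium radium\n"
    "239\tcampaign to ban landmines\n"
    "240\tprofessor neutrino muon pion\n"
    "241\tprofessor chemistry molecule protein antibody\n"
)
print(search(text_tsv, "professor"))  # prints ['237', '238', '240', '241']
```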

Continue to the next section: Installing Flamenco.