Most syntactic treebanks annotate variants of either phrase structure (left) or dependency structure (right).

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

Etymology

The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seedbank or bloodbank. This is because both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees.

Construction

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.

Example phrase structure tree for John loves Mary
Hybrid constituency/dependency tree from the Quranic Arabic Corpus

Some treebanks follow a specific linguistic theory in their syntactic annotation (for example, HPSG), but most try to be less theory-specific. Even so, two main groups can be distinguished: treebanks that annotate phrase structure and treebanks that annotate dependency structure.
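As a rough illustration of the two groups, the sketch below encodes the sentence John loves Mary both ways in Python. The category labels (S, NP, VP), the relation names (nsubj, obj, root) and the helper function names are assumptions chosen for this example, not the annotation scheme of any particular treebank.

```python
# Phrase structure: nested (label, children) tuples; leaves are plain words.
# The category labels are illustrative assumptions.
PHRASE_TREE = ("S", [
    ("NP", ["John"]),
    ("VP", ["loves", ("NP", ["Mary"])]),
])

# Dependency structure: one (word, head_index, relation) triple per word.
# head_index is 1-based and 0 marks the root; relation names are assumptions.
DEPENDENCIES = [
    ("John", 2, "nsubj"),
    ("loves", 0, "root"),
    ("Mary", 2, "obj"),
]


def is_valid_dependency_tree(deps):
    """Check that deps has exactly one root and every word reaches it."""
    roots = [i for i, (_, head, _) in enumerate(deps, start=1) if head == 0]
    if len(roots) != 1:
        return False
    for start in range(1, len(deps) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen or not 1 <= node <= len(deps):
                return False  # cycle, or head index out of range
            seen.add(node)
            node = deps[node - 1][1]  # move to this word's head
    return True


def dependents_of(deps, head_index):
    """Return the words whose head is the word at position head_index."""
    return [word for word, head, _ in deps if head == head_index]
```

In the phrase structure encoding the words are grouped under labelled constituents, while in the dependency encoding every word points directly at its head, so dependents_of(DEPENDENCIES, 2) lists the words attached to loves.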

It is important to distinguish the formal representation from the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar, and the same grammar may be implemented by different file formats. For example, the syntactic analysis for John loves Mary, shown in the figure, may be represented by simple labelled brackets in a text file.

This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
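The labelled-bracket representation can be illustrated with a small Python sketch. The bracketed string below is an assumed, simplified rendering of the John loves Mary analysis, with illustrative category labels; the function names are hypothetical and the reader handles only well-formed input.

```python
def parse_brackets(text):
    """Parse a labelled-bracket string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]  # the first token after "(" is the node label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    words = []
    for child in tree[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Assumed example bracketing; the labels are illustrative only.
EXAMPLE = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"
tree = parse_brackets(EXAMPLE)
```

Because the brackets fully determine the tree, a reader of this size is enough to recover both the structure and the original word order, which is part of why the format is easy to work with without dedicated tools.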

Applications

From a computational linguistics perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems. Most computational systems rely on gold-standard (human-verified) treebank data. However, an automatically parsed corpus that has not been corrected by human linguists can still be useful: applying a parser to large amounts of text and gathering rule frequencies provides evidence that can be used to improve the parser. Only by correcting and completing a corpus by hand, however, is it possible to identify rules that are absent from the parser's knowledge base. Frequencies drawn from a hand-corrected corpus are also likely to be more accurate.
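The idea of gathering rule frequencies from parsed text can be sketched as follows. This is a minimal illustration, assuming trees are stored as nested (label, children) tuples; the two toy trees, the category labels and the function names are assumptions made for the example, and the relative-frequency estimate is one simple choice rather than a method attributed to any specific parser.

```python
from collections import Counter

# Two toy trees as nested (label, children) tuples; leaves are plain words.
TREES = [
    ("S", [("NP", [("NNP", ["John"])]),
           ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])])]),
    ("S", [("NP", [("NNP", ["Mary"])]),
           ("VP", [("VBZ", ["sleeps"])])]),
]


def count_rules(tree, counts):
    """Add one count for the rule at this node, then recurse into children."""
    label, children = tree
    # Right-hand side: child labels for constituents, the word itself for leaves.
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def rule_frequencies(trees):
    """Count how often each rule (lhs, rhs) occurs across all trees."""
    counts = Counter()
    for tree in trees:
        count_rules(tree, counts)
    return counts


def relative_frequencies(counts):
    """Divide each rule count by the total count of its left-hand side."""
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


counts = rule_frequencies(TREES)
probs = relative_frequencies(counts)
```

With these two trees, the rule VP -> VBZ NP accounts for half of all VP expansions. Counts taken from uncorrected parser output would inherit the parser's errors, which is why hand-corrected data is expected to give more accurate frequencies.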

In corpus linguistics, treebanks are used to study syntactic phenomena (for example, diachronic corpora can be used to study the time course of syntactic change). Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.

Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.

In syntactic research, annotated treebank data has been used to test linguistic theories of sentence structure against large quantities of naturally occurring examples.

Semantic treebanks

A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure, and they vary in the depth of that representation. A notable example of deep semantic annotation is a treebank developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which annotates verbal propositions and their arguments without attempting to represent every word in the corpus in logical form.

Language | Treebank | Semantic Formalism | Distribution / License
Chinese | | PropBank semantics | CC BY-NC-SA 3.0 US
English | Abstract Meaning Representation (AMR) Bank | Deep semantics | ?
English | FrameNet | Shallow semantics | ?
English | Universal Conceptual Cognitive Annotation (UCCA) | Deep semantics | ?
English | | Deep semantics | ?
English | | Deep semantics | different licenses
English | | Deep semantics | different licenses
Dutch | | Deep semantics | different licenses
German | | Deep semantics | different licenses
Italian | | Deep semantics | different licenses
English | | Deep semantics | ?
English | | Deep semantics | ?
English | | Deep semantics | ?
English | | Deep semantics | ?
English | | PropBank semantics | different licenses
Finnish | | PropBank semantics | CC BY-NC-SA 3.0 US
Finnish | | PropBank semantics | CC BY-SA 4.0
French | | PropBank semantics | CC BY-NC-SA 3.0 US
German | | PropBank semantics | CC BY-NC-SA 3.0 US
Italian | | PropBank semantics | CC BY-NC-SA 3.0 US
Portuguese | | PropBank semantics | ?
Portuguese | | PropBank semantics | CC BY-NC-SA 3.0 US
Spanish | | PropBank semantics | CC BY-NC-SA 3.0 US
Turkish | | PropBank semantics | CC BY-NC-SA 4.0

Syntactic treebanks

Many syntactic treebanks have been developed for a wide variety of languages:

Language | Treebank | Syntactic Formalism | Distribution / License
Abaza | ATB | Dependency | CC BY-SA
Afrikaans | AfriBooms | Dependency | CC BY-SA
Akkadian | PISANDUB | Dependency | CC BY-SA
Albanian | TSA | Dependency | CC BY-SA
Amharic | ATT | Dependency | CC BY-SA
Ancient Greek | Perseus | Dependency | CC BY-NC-SA
Ancient Greek | PROIEL | Dependency | CC BY-NC-SA
Greek (ancient) | | Dependency | Open source (Creative Commons license)
Greek (ancient) | | Dependency | Open source (Creative Commons license)
Arabic | | Dependency | Linguistic Data Consortium
Arabic | | Dependency | Linguistic Data Consortium
Arabic | NYUAD | Dependency | CC BY-SA
Arabic | PADT | Dependency | CC BY-NC-SA
Arabic | PUD | Dependency | CC BY-SA
Arabic | | Phrase structure | Linguistic Data Consortium
Armenian | ArmTDP | Dependency | CC BY-SA
Assyrian (Neo-Aramaic) | AS | Dependency | CC BY-SA
Bambara | CRB | Dependency | CC BY-SA
Basque | BDT | Dependency | CC BY-NC-SA
Belarusian | HSE | Dependency | CC BY-SA
Bhojpuri | BhEn | Dependency | CC BY-SA
Bhojpuri | BHTB | Dependency | CC BY-SA
Breton | KEB | Dependency | CC BY-SA
Bulgarian | BTB | Dependency | CC BY-NC-SA
Bulgarian | | HPSG | Freely available for research
Buryat | BDT | Dependency | CC BY-SA
Cantonese | HK | Dependency | CC BY-SA
Catalan | | Phrase structure | Freely available for research
Catalan | AnCora | Dependency | GPL
Chinese | | Case grammar | Not freely available
Chinese | CFL | Dependency | CC BY-SA
Chinese | GSD | Dependency | CC BY-SA
Chinese | GSDSimp | Dependency | CC BY-SA
Chinese | HK | Dependency | CC BY-SA
Chinese | PUD | Dependency | CC BY-SA
Chinese | | Phrase structure | Linguistic Data Consortium
Chinese | | Dependency | Linguistic Data Consortium
Arabic (classical) | (Quranic Arabic Corpus) | Dependency | Open source (GNU general public license)
Classical Armenian | | Dependency | Open source (Creative Commons license)
Coptic | Coptic Scriptorium | Dependency | CC BY
Croatian | | Dependency | Open source (Creative Commons license)
Croatian | SET | Dependency | CC BY-SA
Czech | | Dependency | Open source (Creative Commons license)
Czech | CAC | Dependency | CC BY-SA
Czech | CLTT | Dependency | CC BY-SA
Czech | FicTree | Dependency | CC BY-NC-SA
Czech | PDT | Dependency | CC BY-NC-SA
Czech | PUD | Dependency | CC BY-SA
Danish | | Dependency | Open source (GNU general public license)
Danish | | Phrase structure | License fee
Danish | DDT | Dependency | CC BY-SA
Danish | DTB | Dependency | CC BY-SA
Dutch | | Phrase structure | License fee
Dutch | Alpino | Dependency | CC BY-SA
Dutch | LassySmall | Dependency | CC BY-SA
Dutch | | Dependency | License fee
Dutch | | Dependency | Open source (GNU general public license)
Egyptian | Pre-Coptic (PC) | Dependency | CC BY-SA
English | | Combinatory categorial grammar | Linguistic Data Consortium
English | | HPSG | ?
English | | Phrase structure | ?
English | | Dependency | Linguistic Data Consortium
English | BhEn | Dependency | CC BY-SA
English | ESL | Dependency | CC BY-SA
English | EWT | Dependency | CC BY-SA
English | GUM | Dependency | CC BY-NC-SA
English | GUMReddit | Dependency | CC BY
English | LinES | Dependency | CC BY-NC-SA
English | ParTUT | Dependency | CC BY-NC-SA
English | Pronouns | Dependency | CC BY-SA
English | PUD | Dependency | CC BY-SA
English | | Phrase structure | Open source (Creative Commons license)
English | | Phrase structure | Freely available for research
English | | Phrase structure | Freely available for research
English | | Phrase structure | Freely available for research
English | | Phrase structure | Linguistic Data Consortium
English | | HPSG | Freely available for research
English | | Phrase structure |
English | | Phrase structure |
English | | Dependency | ?
English | | Dependency | Freely available for research
English | | Phrase structure | Linguistic Data Consortium
English | | Phrase structure | Available online for comparison purposes
English | | Dependency | Open source (Creative Commons license)
English | | Phrase structure | Freely available for research
Erzya | JR | Dependency | CC BY-SA
Estonian | | Phrase structure | ?
Estonian | | Dependency | Freely available for research
Estonian | EDT | Dependency | CC BY-NC-SA
Estonian | EWT | Dependency | CC BY-NC-SA
Faroese | FarPaHC | Dependency | CC BY-SA
Faroese | OFT | Dependency | CC BY-SA
Finnish | | Dependency | Open source (Creative Commons license)
Finnish | FTB | Dependency | CC BY
Finnish | PUD | Dependency | CC BY-SA
Finnish | TDT | Dependency | CC BY-SA
French (spoken) | | Dependency and macrosyntactic annotation | Open source (Creative Commons license)
French | | Phrase structure | ?
French | CrapBank | Dependency | CC BY-SA
French | FQB | Dependency | GPL
French | FTB | Dependency | GPL
French | GSD | Dependency | CC BY-SA
French | ParTUT | Dependency | CC BY-NC-SA
French | PUD | Dependency | CC BY-SA
French | Sequoia | Dependency | GPL
French | Spoken | Dependency | CC BY-SA
French | | Phrase structure | Freely available for research
French | | Phrase structure | Open Source license LGPL-LR
French | | Phrase structure & Dependency | Open Source license LGPL-LR
Galician | CTG | Dependency | CC BY-NC-SA
Galician | TreeGal | Dependency | GPL
German | | Dependency | Freely available for research
German | GSD | Dependency | CC BY-SA
German | LIT | Dependency | CC BY-NC-SA
German | PUD | Dependency | CC BY-SA
German | | Phrase structure | Freely available for research
German | | Phrase structure | Freely available for research
German | | Phrase structure | Freely available for research
German | | Phrase structure | Freely available for research
German | | Phrase structure | Freely available for research
German | | Phrase structure | License fee
Gothic | | Dependency | Open source (Creative Commons license)
Gothic | PROIEL | Dependency | CC BY-NC-SA
Greek | | Dependency | Not freely available
Greek | GDT | Dependency | CC BY-NC-SA
Hebrew | HTB | Dependency | CC BY-NC-SA
Hebrew | | Dependency | Open source (GNU general public license)
Hindi English | HIENCS | Dependency | CC BY-SA
Hindi | HDTB | Dependency | CC BY-NC-SA
Hindi | PUD | Dependency | CC BY-SA
Hindi | | Dependency | ?
English (historical) | | Phrase structure | Linguistic Data Consortium (as of April 2020)
English (historical) | | Phrase structure | Freely available for research
French (historical) | | Phrase structure | Freely available for research
Portuguese (historical) | | Phrase structure | ?
Hungarian | Szeged | Dependency | CC BY-NC-SA
Hungarian | | Phrase structure | ?
Icelandic | | Phrase structure | Open source (GNU Lesser General Public License)
Icelandic | IcePaHC | Dependency | CC BY-SA
Icelandic | PUD | Dependency | CC BY-SA
Indonesian | GSD | Dependency | CC BY-SA
Indonesian | PUD | Dependency | CC BY-SA
Indonesian | | Phrase structure | ?
Irish | IDT | Dependency | CC BY-SA
Italian | | Phrase structure and dependency | License fee
Italian | | Dependency | Freely available for research
Italian | | Phrase structure and dependency | License fee
Italian | ISDT | Dependency | CC BY-NC-SA
Italian | ParTUT | Dependency | CC BY-NC-SA
Italian | PoSTWITA | Dependency | CC BY-NC-SA
Italian | PUD | Dependency | CC BY-SA
Italian | TWITTIRO | Dependency | CC BY-SA
Italian | VIT | Dependency | CC BY-NC-SA
Italian | | Dependency | Freely available for research
Italian | | ? | ?
Italian | | Dependency | Open source (Creative Commons license)
Italian | | Dependency | Freely available for research
Japanese | | ? | ?
Japanese | BCCWJ | Dependency | CC BY-NC-SA
Japanese | GSD | Dependency | CC BY-SA
Japanese | KTC | Dependency | CC BY-SA
Japanese | Modern | Dependency | CC BY-NC-ND
Japanese | PUD | Dependency | CC BY-SA
Japanese | | Phrase structure | Open source (Creative Commons license)
Japanese | | Phrase structure | Freely available for research
Japanese | | Dependency | ?
Karelian | KKPP | Dependency | CC BY-SA
Kazakh | KTB | Dependency | CC BY-SA
Komi Permyak | UH | Dependency | CC BY-SA
Komi Zyrian | IKDP | Dependency | CC BY-SA
Komi Zyrian | Lattice | Dependency | CC BY-SA
Korean | GSD | Dependency | CC BY-SA
Korean | Kaist | Dependency | CC BY-SA
Korean | Penn | Dependency | CC BY-SA
Korean | PUD | Dependency | CC BY-SA
Korean | Sejong | Dependency | CC BY-SA
Korean | | Phrase structure | Linguistic Data Consortium
Kurmanji | MG | Dependency | CC BY-SA
Latin | ITTB | Dependency | CC BY-NC-SA
Latin | LLCT | Dependency | CC BY-SA
Latin | Perseus | Dependency | CC BY-NC-SA
Latin | PROIEL | Dependency | CC BY-NC-SA
Latin | | Dependency | Open source (Creative Commons license)
Latin | | Dependency | Open source (Creative Commons license)
Latin | | Dependency | Open source (Creative Commons license)
Latvian | LVTB | Dependency | CC BY-SA
Lithuanian | ALKSNIS | Dependency | CC BY-SA
Lithuanian | HSE | Dependency | CC BY-SA
Livvi | KKPP | Dependency | CC BY-SA
Magahi | MGTB | Dependency | CC BY-SA
Maltese | MUDT | Dependency | CC BY-SA
Marathi | UFAL | Dependency | CC BY-SA
Mbya Guarani | Dooley | Dependency | CC BY-NC-SA
Mbya Guarani | Thomas | Dependency | CC BY-NC-SA
Middle Irish | CritMITB | Dependency | CC BY-SA
Middle Irish | DipMITB | Dependency | CC BY-SA
Moksha | JR | Dependency | CC BY-SA
Naija | NSC | Dependency | CC BY-SA
North Sami | Giella | Dependency | CC BY-SA
Norwegian | | LFG | ?
Norwegian | Bokmaal | Dependency | CC BY-SA
Norwegian | Nynorsk | Dependency | CC BY-SA
Norwegian | NynorskLIA | Dependency | CC BY-SA
Old Church Slavonic | PROIEL | Dependency | CC BY-NC-SA
Old Church Slavonic | | Dependency | Open source (Creative Commons license)
Old French | SRCMF | Dependency | CC BY-NC-SA
Old Russian | RNC | Dependency | CC BY-SA
Old Russian | TOROT | Dependency | CC BY-NC-SA
Old Russian | | Dependency | Open source (Creative Commons license)
Persian | | Dependency | Freely available for research
Persian | | HPSG | Freely available for research
Persian | Seraji | Dependency | CC BY-SA
Polish | | HPSG | ?
Polish | LFG | Dependency | GPL
Polish | PDB | Dependency | CC BY-NC-SA
Polish | PUD | Dependency | CC BY-SA
Polish | | Phrase structure and Dependency | Open source (GNU general public license)
Portuguese | Bosque | Dependency | CC BY-SA
Portuguese | GSD | Dependency | CC BY-SA
Portuguese | PUD | Dependency | CC BY-SA
Portuguese | | Dependency, Phrase structure | Open source (GNU general public license)
Romanian | | Dependency | ?
Romanian | Nonstandard | Dependency | CC BY-SA
Romanian | RRT | Dependency | CC BY-SA
Romanian | SiMoNERo | Dependency | CC BY-SA
Russian | GSD | Dependency | CC BY-SA
Russian | PUD | Dependency | CC BY-SA
Russian | SynTagRus | Dependency | CC BY-NC-SA
Russian | Taiga | Dependency | CC BY-SA
Russian | SynTagRus Dependency Treebank (Russian National Corpus) | Dependency | Freely available for research
Sanskrit | UFAL | Dependency | CC BY-SA
Sanskrit | Vedic | Dependency | CC BY-SA
Scottish Gaelic | ARCOSG | Dependency | CC BY-SA
Serbian | SET | Dependency | CC BY-SA
Sindhi | MazharDootio | Dependency | CC BY-SA
Skolt Sami | Giellagas | Dependency | CC BY-SA
Slovak | SNK | Dependency | CC BY-SA
Slovene | | Dependency | Freely available for research
Slovenian | SSJ | Dependency | CC BY-NC-SA
Slovenian | SST | Dependency | CC BY-NC-SA
Spanish | | Phrase structure and dependency | Freely available for research
Spanish | AnCora | Dependency | GPL
Spanish | GSD | Dependency | CC BY-SA
Spanish | PUD | Dependency | CC BY-SA
Spanish | | Phrase structure | Freely available for research
Swedish | | Phrase structure and dependency | Freely available for research
Swedish | | Phrase structure | Freely available for research
Swedish | LinES | Dependency | CC BY-NC-SA
Swedish | PUD | Dependency | CC BY-SA
Swedish | Talbanken | Dependency | CC BY-SA
Swedish | | Phrase structure | Freely available for research
Swedish Sign Language | SSLC | Dependency | CC BY-SA
Swiss German | UZH | Dependency | CC BY-SA
Tagalog | TRG | Dependency | CC BY-SA
Tagalog | Ugnayan | Dependency | CC BY-NC-SA
Tamil | TTB | Dependency | CC BY-NC-SA
Telugu | MTG | Dependency | CC BY-SA
Thai | | Dependency | Open source (GNU general public license)
Thai | PUD | Dependency | CC BY-SA
Thai | | Phrase structure | CC BY 4.0
Turkish | | Dependency | Freely available for research
Turkish | BOUN | Dependency | CC BY-SA
Turkish | GB | Dependency | CC BY-SA
Turkish | IMST | Dependency | CC BY-NC-SA
Turkish | PUD | Dependency | CC BY-SA
Ukrainian | | Dependency | Open source (Creative Commons license)
Ukrainian | IU | Dependency | CC BY-NC-SA
Upper Sorbian | UFAL | Dependency | CC BY-SA
Urdu | | Phrase structure | Contact at Computational Learning Strategies & Practices
Urdu | | Phrase and Hyper Dependency Structure | Contact at Computational Learning Strategies & Practices
Urdu | UDTB | Dependency | CC BY-NC-SA
Uyghur | UDT | Dependency | CC BY-SA
Vietnamese | VTB | Dependency | CC BY-SA
Vietnamese | | Phrase structure | Freely available for research
Vietnamese | | Dependency | Freely available for research
Warlpiri | UFAL | Dependency | CC BY-SA
Welsh | CCG | Dependency | CC BY-SA
Wolof | WTB | Dependency | CC BY-SA
Yoruba | YTB | Dependency | CC BY-SA

To support research on multilingual tasks, some researchers have proposed universal annotation schemes that apply across languages, with the aim of combining the strengths of different treebank corpora. Such schemes have been proposed both for dependency treebanks and for phrase structure treebanks.

Search tools

One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis (2008) discusses the principles of searching treebanks in detail and reviews the state of the art around that time.
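A very small expression-style query can be sketched in Python to show what searching a parsed corpus involves. This is only an illustration, assuming trees are stored as nested (label, children) tuples; the example tree, its labels and the function names are assumptions, and real search tools offer far richer query languages tied to their annotation schemes.

```python
# Toy tree as nested (label, children) tuples; leaves are plain words.
TREE = ("S", [
    ("NP", [("NNP", ["John"])]),
    ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
])


def iter_nodes(tree):
    """Yield every constituent of the tree, parents before children."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from iter_nodes(child)


def find(tree, label, child_label=None):
    """Find nodes with the given label, optionally requiring a direct child
    constituent labelled child_label."""
    matches = []
    for node in iter_nodes(tree):
        if node[0] != label:
            continue
        has_child = any(
            isinstance(c, tuple) and c[0] == child_label for c in node[1]
        )
        if child_label is None or has_child:
            matches.append(node)
    return matches


def words(node):
    """Return the words covered by a node, in left-to-right order."""
    result = []
    for child in node[1]:
        if isinstance(child, tuple):
            result.extend(words(child))
        else:
            result.append(child)
    return result


noun_phrases = find(TREE, "NP")
transitive_vps = find(TREE, "VP", child_label="NP")
```

Here find(TREE, "VP", child_label="NP") retrieves verb phrases that directly contain a noun phrase. The query is written in terms of node labels, which reflects the point that search tools depend on the annotation scheme applied to the corpus.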

See also