Irish/Using Nua-Chorpas na hÉireann

From Celtic Languages

Revision as of 18:58, 5 November 2022

The New Corpus for Ireland (Nua-Chorpas na hÉireann) is a very useful tool for checking how things are phrased in Irish and which expressions native speakers do and do not use. Unfortunately, the corpus’s help page is not accessible and the UI isn’t very user-friendly. Some documentation exists for the software the corpus runs on, but it isn’t specific to this corpus, so it is of limited help here.

This page isn’t meant to be comprehensive documentation of the corpus, but rather a list of hints that should make your work with it a bit more efficient. For more comprehensive documentation, see External documentation below.

First steps

To use the corpus, you first have to create an account using the registration form. Registration is free, but you will have to wait until your account is accepted before you’ll be able to log in and use the corpus.

Old and new interface

When you log in, you’ll see the old Sketch Engine web interface. You can use it, but you can also access the new interface by logging into focloir.sketchengine.eu instead. The new interface is generally much more user-friendly (and compatible with the official Sketch Engine documentation), but beware: some features don’t work in it (for example, word sketches work in the old interface but not in the new one).

You can follow this guide in the old interface, unless it refers explicitly to the new one.

Simple querying

When you log into the corpus, you’ll see the ⟨Home⟩ (⟨Leathanach Tosaigh⟩) screen with an input form for performing a simple search. As the prompt says, you can type words or phrases there. If you type the lemma form of a word (i.e. the base form that you’d find in a dictionary), the search finds every occurrence of that word in any of its forms. Every word in a phrase is treated this way.

This means that if you type bí madra ag (‘to be, dog, at’), you’ll see results such as:

  • bhí madra agamsa ‘I had a dog’,
  • tá madraí aige siúd ‘that one has dogs’, etc.

You’ll also see the number of all results at the top (Hits: 31 or Amas: 31).

If a word you type is not a lemma form, only sentences that match that form exactly will be found. So if you type madraí ag (‘dogs, at’), you’ll get results like:

  • Beidh madraí ag Waterloo ‘there will be dogs at Waterloo’,
  • tá madraí aige siúd ‘that one has dogs’,

but no instances of singular madra (and you’ll see that the number of results drops to 8).

The default search is case-insensitive: you can type either madraí or MADRAÍ and you’ll get the same set of results.
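The behaviour described above can be mimicked with a small toy model. This is only a sketch of the matching rules as this page describes them, not the corpus’s actual implementation: the three sentences, their hand-made lemma tags, and the names SENTENCES, token_matches and search are all invented for illustration.

```python
# Toy model of the simple search: each token is a (word form, lemma) pair.
# The sentences and their lemma tags are hand-made for illustration only.
SENTENCES = [
    [("bhí", "bí"), ("madra", "madra"), ("agamsa", "ag")],
    [("tá", "bí"), ("madraí", "madra"), ("aige", "ag"), ("siúd", "siúd")],
    [("Beidh", "bí"), ("madraí", "madra"), ("ag", "ag"), ("Waterloo", "Waterloo")],
]

# Every lemma known to this toy corpus, lower-cased for case-insensitive lookup.
LEMMAS = {lemma.lower() for sent in SENTENCES for _, lemma in sent}


def token_matches(query_word, token):
    """A lemma query matches any form of that lemma; any other query must match the form exactly."""
    form, lemma = token
    q = query_word.lower()
    if q in LEMMAS:
        return lemma.lower() == q
    return form.lower() == q


def search(query):
    """Return the sentences that contain the query words as consecutive tokens."""
    words = query.split()
    hits = []
    for sent in SENTENCES:
        for start in range(len(sent) - len(words) + 1):
            window = sent[start:start + len(words)]
            if all(token_matches(w, t) for w, t in zip(words, window)):
                hits.append(" ".join(form for form, _ in sent))
                break  # count each sentence once
    return hits
```

In this model, search("bí madra ag") returns all three sentences, while search("madraí ag") and search("MADRAÍ ag") both return only the two sentences with the plural form.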

Note! The tagging of words in the corpus with their lemma form and part of speech is not perfect. Sometimes you’ll have to type the exact form you’re looking for: non-standard historical and dialectal forms often can’t be found through the lemma form. For example, dative plural forms like Gaelaibh, fearaibh, cosaibh (and their lenited counterparts) are interpreted by the corpus as their own singular lemmata.

Wildcards (new interface only)

If you use the new interface, you can perform simple searches in the ⟨Concordance⟩ tab with the ⟨Simple⟩ query type chosen. You can use wildcard characters:

  • * standing for any number of characters,
  • ? for any single character in your queries,
  • | meaning or, allowing you to list various words or forms,
  • and some more.

Thus you can, for example, type bí * ag and get all occurrences of the verb bí in any of its forms (tá, raibh, beidh, etc.) followed by any word, followed by any form of the preposition ag. The result list will contain:

  • tá feidhm ag na fóralach cosúlacha…,
  • … nach raibh feidhm aige…,
  • Ceangaltais a bheidh déanta ag an gComhphobal, etc.

You can also type just a part of a word, e.g. feoil* will find occurrences of every word that starts with feoil or whose lemma starts with feoil, so the list will include: feoil, feola, feoilséantóir, mhuicfheoil (it’s lemmatized as feoil), feoilmhian, etc. If you type Ga?l you’ll get results for both Gael and Gall (and also gaol and gail).

You can use | to list multiple words or phrases that your query should match. For example, if you type snámh|léamh you’ll find all instances of the verbal nouns snámh ‘swimming’ and léamh ‘reading’. If you type bí snámh ag|bí léamh ag you’ll find all instances of the ⟨verbal noun⟩ agam ‘I can ⟨verb⟩’ construction with the verbal nouns for ‘swim’ and ‘read’, regardless of tense or grammatical person.
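If you are used to regular expressions, it may help to see how these wildcards map onto them. The following is a minimal sketch that works on single word forms only; it does not model lemma matching or a standalone * standing for a whole word, and the function names wildcard_to_regex and matches are made up for this example rather than taken from Sketch Engine.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern using *, ? and | into a compiled, case-insensitive regex."""
    alternatives = []
    for alt in pattern.split("|"):  # | separates alternatives
        parts = []
        for ch in alt:
            if ch == "*":
                parts.append(".*")  # any number of characters
            elif ch == "?":
                parts.append(".")  # exactly one character
            else:
                parts.append(re.escape(ch))  # everything else is literal
        alternatives.append("".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")", re.IGNORECASE)


def matches(pattern, word):
    """True if the whole word form matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

Here matches("Ga?l", "gaol") and matches("feoil*", "feoilséantóir") are true, while matches("feoil*", "mhuicfheoil") is false: the corpus finds that form only because its lemma is feoil, which this form-only sketch does not know about.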

Filtering the results

If you want to filter the results using criteria like texts written only by native speakers or only Munster Irish, you need to enter the Concordance screen. To do that you need to click ⟨>> More⟩ (⟨>> Tuilleadh⟩) under the results list, then in the menu on the left click ⟨Filter⟩ (⟨Scagaire⟩), and that will bring you to a screen where you can select your filtering criteria and confirm them by clicking ⟨Filter Concordance⟩ (⟨Déan an Comhchordacht a Scagadh⟩). This will take you to the concordance results screen with the results filtered.

CQL

The Corpus Query Language (CQL) allows you to make complex regex-like queries, including things like looking for phrases containing specific parts of speech or inflectional forms – that’s possible because every word in the corpus is tagged with information about its part-of-speech and inflectional form. Using CQL is more complex than simple searching for words, but it enables you to be much more flexible in your searches.
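To give a rough idea of what CQL looks like: each token is written in square brackets as attribute="value", and an empty pair of brackets [] stands for any token. The helper below is only a sketch that rewrites a simple phrase query into that shape. The function name to_cql is invented here, and it assumes that this corpus exposes the usual Sketch Engine attribute names lemma and word, which you should check against the list of tags.

```python
def to_cql(simple_query, attribute="lemma"):
    """Build a CQL string from space-separated words; a lone * becomes the any-token pattern []."""
    tokens = []
    for word in simple_query.split():
        if word == "*":
            tokens.append("[]")  # unconstrained token
        else:
            # Escape backslashes and double quotes so the value stays inside its quotes.
            escaped = word.replace("\\", "\\\\").replace('"', '\\"')
            tokens.append(f'[{attribute}="{escaped}"]')
    return " ".join(tokens)
```

For example, to_cql("bí * ag") gives [lemma="bí"] [] [lemma="ag"], and to_cql("madraí ag", attribute="word") gives [word="madraí"] [word="ag"], which constrains the exact word forms instead of the lemmata.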

TODO

External documentation

  • list of tags available in the corpus
  • Sketch Engine User Guide – a guide to a newer version of the software the Corpas is using. The graphical interface presented in the guide is completely different from what you’ll find on corpas.focloir.ie, but the principles described there will generally be valid for the Corpas too. Among things you’ll find there are: