    ARCADE II Evaluation Package

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 28, 2007
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). ARCADE II Evaluation Package [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-E0018/
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_EVALUATION.pdf

    Description

    The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ARCADE II project made it possible to run an evaluation campaign in the field of multilingual alignment, with more ambitious objectives than the ARCADE I project (part of the AUPELF campaigns, Actions de recherche Concertées, 1996-1999): finer-grained alignment and coverage of many more languages, including languages distant from French. ARCADE II is therefore not only an extension of ARCADE I but also has innovative and exploratory aspects, for instance the integration of French-distant languages such as Arabic, Russian and Chinese.

    This package contains the material used or produced during the ARCADE II evaluation campaign: resources, protocols, scoring tools, results of the campaign, etc. The aim of these evaluation packages is to enable external players to evaluate their own systems.

    The campaign is divided into two actions:
    1) Sentence alignment: evaluating the alignment of French with Latin-script languages on the one hand, and with non-Latin-script languages on the other.
    2) Translation of named entities: identifying, in the parallel Arabic corpus, the translations of the named-entity phrases annotated in the French corpus.

    The ARCADE II evaluation package contains the following data and tools:
    1) The JOC Corpus (Official Journal of the European Community), covering Latin-script languages (English, French, German, Italian, Spanish), contains 1 million words per language (5 million words in all). The texts are aligned at the sentence level and provided in XML, encoded in UTF-8.
    2) The MD Corpus (Le Monde Diplomatique), covering non-Latin-script languages (Arabic, Chinese, Greek, Japanese, Persian, Russian), contains texts manually aligned at the sentence level, encoded in XML and UTF-8. The size of each part varies with the language pair. A subset of the Arabic-French part was manually annotated with named entities. Sizes in words were calculated on the French part. Word counts are computed differently depending on the language (in Arabic, for example, many clitics are agglutinated, which reduces the number of words) and are sometimes impossible to compute (in Chinese, for example, there is no graphical separation between words):

                          Arabic-French  Chinese-Fr  Greek-Fr  Japanese-Fr  Persian-Fr  Russian-Fr
    Number of articles    150 x 2        59 x 2      50 x 2    52 x 2       53 x 2      50 x 2
    Nu...


