LTFW (Linguistic Tools For Weblicht)

Linguistic services.

This is the Java port of the Perl-based tokenizer developed within the OpeNER project.

The Java port is available on GitHub and Docker.

Current software version: 0.2 (released on 04/10/2017).

ILC4CLARIN provides three distinct sets of web services that tokenize text in the following languages:

  • ita (or it)
  • fra (or fr)
  • deu (or de)
  • eng (or en)
  • esp (or es)
  • nld (or nl)

The application raises an Unsupported Language Exception if the language provided is not in the list.
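
To avoid a round trip that ends in an Unsupported Language Exception, a client can check the code before calling the service. The following Python sketch is not part of LTFW: the name is_supported is hypothetical, the table is copied by hand from the list above, "de" is assumed to be the two-letter code for German, and the comparison is an exact, case-sensitive match because this page does not say how the service treats case.

```python
# Three-letter codes from the list above, mapped to their two-letter form.
# Copied from this page, not fetched from the service; "de" is an assumption.
SUPPORTED_LANGUAGES = {
    "ita": "it",
    "fra": "fr",
    "deu": "de",
    "eng": "en",
    "esp": "es",
    "nld": "nl",
}


def is_supported(lang: str) -> bool:
    """Return True if lang is one of the listed three- or two-letter codes."""
    return lang in SUPPORTED_LANGUAGES or lang in SUPPORTED_LANGUAGES.values()
```

With this table, is_supported("ita") and is_supported("nl") are true, while a code such as "por" is rejected locally.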

All the services perform the same operation (tokenization); depending on the endpoint, the output is a valid TCF document, a valid KAF document or a tabbed file.

The service that produces TCF can read either plain text or a valid TCF document. The MIME type is set accordingly.

How to invoke the offered services

The endpoints are the following:

  • wl/tokenizer/plain (POST service to tokenize plain text and to produce a valid TCF document)
  • wl/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid TCF document)
  • wl/tokenizer/tcf (POST service to tokenize a TCF document and to produce a valid TCF document)
  • kaf/tokenizer/plain (POST service to tokenize plain text and to produce a valid KAF document)
  • kaf/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid KAF document)
  • tab/tokenizer/plain (POST service to tokenize plain text and to produce a tabbed document)
  • tab/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a tabbed document)

The language is provided as a parameter:

  • wl/tokenizer/plain?lang=iso_3_or_2_codes_lang
  • kaf/tokenizer/plain?lang=iso_3_or_2_codes_lang
  • tab/tokenizer/plain?lang=iso_3_or_2_codes_lang

Please note the following call: when a TCF document is sent as input to the TCF endpoint, no language is provided as a parameter.

  • wl/tokenizer/tcf
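
The calling convention above can be sketched in a few lines of Python. This is only an illustration: the helper names plain_endpoint_url and tcf_endpoint_url are hypothetical, and the base URL is the one used in the test commands later on this page.

```python
from urllib.parse import urlencode

# Base URL as it appears in the curl/wget test commands on this page.
BASE_URL = "http://ilc4clarin.ilc.cnr.it/services/ltfw"

# Output formats that have a .../tokenizer/plain endpoint.
PLAIN_FORMATS = ("wl", "kaf", "tab")


def plain_endpoint_url(output_format: str, lang: str) -> str:
    """Build the URL of a plain-text POST endpoint; lang is mandatory here."""
    if output_format not in PLAIN_FORMATS:
        raise ValueError(f"unknown output format: {output_format!r}")
    return f"{BASE_URL}/{output_format}/tokenizer/plain?" + urlencode({"lang": lang})


def tcf_endpoint_url() -> str:
    """Build the URL of the TCF-input endpoint, which takes no lang parameter."""
    return f"{BASE_URL}/wl/tokenizer/tcf"
```

For instance, plain_endpoint_url("kaf", "ita") yields the KAF endpoint with lang=ita appended, while tcf_endpoint_url() returns a URL without any query string.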

For the Language Resource Switchboard we added three further endpoints (please note the lrs in the path).

The endpoints are the following:

  • wl/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid TCF document)
  • kaf/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid KAF document)
  • tab/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a tabbed document)

Both the language and the url are provided as parameters:

  • wl/tokenizer/lrs?lang=iso_3_or_2_codes_lang&url=URL
  • kaf/tokenizer/lrs?lang=iso_3_or_2_codes_lang&url=URL
  • tab/tokenizer/lrs?lang=iso_3_or_2_codes_lang&url=URL

This is because integrating a service into the Language Resource Switchboard requires the URL to be passed as an input parameter.
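
As an illustration of the lrs calls, the Python sketch below builds such a URL with both parameters. The function name lrs_url is hypothetical. It percent-encodes the url value, which is the standard way to embed one URL inside another; that the service decodes it like any ordinary query parameter is an assumption, since the examples on this page pass the URL unencoded.

```python
from urllib.parse import urlencode

# Base URL as it appears in the curl/wget test commands on this page.
BASE_URL = "http://ilc4clarin.ilc.cnr.it/services/ltfw"


def lrs_url(output_format: str, lang: str, source_url: str) -> str:
    """Build a GET URL for one of the three lrs endpoints (wl, kaf, tab)."""
    if output_format not in ("wl", "kaf", "tab"):
        raise ValueError(f"unknown output format: {output_format!r}")
    # urlencode escapes ':' and '/' in source_url so it survives as one value.
    query = urlencode({"lang": lang, "url": source_url})
    return f"{BASE_URL}/{output_format}/tokenizer/lrs?{query}"
```

Encoding the inner URL matters mostly when it carries its own query string: an unescaped "&" in it would otherwise be read as the start of a new parameter.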

How to test the services

You can test the service endpoints using curl or wget as follows:

Send the input file to the endpoints for processing:

  • with curl:

curl -H 'Content-Type: text/plain' --data-binary @plain-file.txt -X POST "http://ilc4clarin.ilc.cnr.it/services/ltfw/wl/tokenizer/plain?lang=ita"

curl -H 'Content-Type: text/tcf+xml' --data-binary @tcf-file.txt -X POST "http://ilc4clarin.ilc.cnr.it/services/ltfw/wl/tokenizer/tcf"

curl -H 'Content-Type: text/plain' --data-binary @plain-file.txt -X POST "http://ilc4clarin.ilc.cnr.it/services/ltfw/kaf/tokenizer/plain?lang=ita"

curl -H 'Content-Type: text/plain' --data-binary @plain-file.txt -X POST "http://ilc4clarin.ilc.cnr.it/services/ltfw/tab/tokenizer/plain?lang=ita"

  • with wget:

wget --post-file=plain-file.txt --header='Content-Type: text/plain' "http://ilc4clarin.ilc.cnr.it/services/ltfw/wl/tokenizer/plain?lang=ita"

wget --post-file=tcf-file.txt --header='Content-Type: text/tcf+xml' "http://ilc4clarin.ilc.cnr.it/services/ltfw/wl/tokenizer/tcf"

wget --post-file=plain-file.txt --header='Content-Type: text/plain' "http://ilc4clarin.ilc.cnr.it/services/ltfw/kaf/tokenizer/plain?lang=ita"

wget --post-file=plain-file.txt --header='Content-Type: text/plain' "http://ilc4clarin.ilc.cnr.it/services/ltfw/tab/tokenizer/plain?lang=ita"
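
The same POST call can also be made from a program. The following Python sketch uses only the standard library and mirrors the first curl command above. The function names are hypothetical, and sending the body and reading the response as UTF-8 are assumptions that this page does not confirm.

```python
import urllib.request

# Endpoint taken from the first curl command above.
PLAIN_TO_TCF = "http://ilc4clarin.ilc.cnr.it/services/ltfw/wl/tokenizer/plain"


def build_plain_request(text: str, lang: str = "ita") -> urllib.request.Request:
    """Prepare (but do not send) a POST of plain text to the TCF-producing endpoint."""
    return urllib.request.Request(
        f"{PLAIN_TO_TCF}?lang={lang}",
        data=text.encode("utf-8"),  # assumption: the service expects UTF-8
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")  # assumption: UTF-8 response


# Usage (performs a network call, so it is left commented out):
# print(send(build_plain_request("Mi chiamo Riccardo. Abito a Roma")))
```

Keeping request construction separate from sending makes it possible to inspect the method, headers and body without contacting the server.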

To test the services for the Language Resource Switchboard:

  • with curl:

curl -X GET "http://ilc4clarin.ilc.cnr.it/services/ltfw/wl/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt"

curl -X GET "http://ilc4clarin.ilc.cnr.it/services/ltfw/kaf/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt"

curl -X GET "http://ilc4clarin.ilc.cnr.it/services/ltfw/tab/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt"

  • with wget:

wget "http://ilc4clarin.ilc.cnr.it/services/ltfw/wl/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt" [-O out_file]

wget "http://ilc4clarin.ilc.cnr.it/services/ltfw/kaf/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt" [-O out_file]

wget "http://ilc4clarin.ilc.cnr.it/services/ltfw/tab/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt" [-O out_file]

Please note that the services designed for the Language Resource Switchboard also work on their own: they can be invoked directly with the commands above.

As sample plain text input you can use:

Mi chiamo Riccardo. Abito a Roma

As sample TCF input you can use:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc" type="application/relax-ng-compact-syntax"?>
    <D-Spin xmlns="http://www.dspin.de/data" version="0.4">
        <md:MetaData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cmd="http://www.clarin.eu/cmd/" 
            xmlns:md="http://www.dspin.de/data/metadata" 
            xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629623/xsd">
        </md:MetaData>
            <tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">
                <tc:text>
                    Mi chiamo Alfredo. Abito a Roma.
                </tc:text>
            </tc:TextCorpus>
    </D-Spin>
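
To check a TCF input before sending it, the language and the text can be read back with a few lines of Python. This is only a sketch: read_tcf_text is a hypothetical helper, and SAMPLE is a trimmed version of the document above (metadata and processing instructions left out), not a schema-validated TCF file.

```python
import xml.etree.ElementTree as ET

# Namespace of the TextCorpus layer, as declared in the sample above.
TC_NS = "http://www.dspin.de/data/textcorpus"

# Trimmed version of the sample document, kept only for illustration.
SAMPLE = (
    '<D-Spin xmlns="http://www.dspin.de/data" version="0.4">'
    '<tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">'
    "<tc:text> Mi chiamo Alfredo. Abito a Roma. </tc:text>"
    "</tc:TextCorpus>"
    "</D-Spin>"
)


def read_tcf_text(tcf_xml: str):
    """Return (lang, text) from the TextCorpus element of a TCF document."""
    root = ET.fromstring(tcf_xml)
    corpus = root.find(f"{{{TC_NS}}}TextCorpus")
    if corpus is None:
        raise ValueError("no TextCorpus element found")
    text = corpus.findtext(f"{{{TC_NS}}}text", default="")
    # Surrounding whitespace comes from indentation, so it is stripped.
    return corpus.get("lang"), text.strip()
```

Here the lang attribute is read from the TextCorpus element, which is where the sample above declares it.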


Contacts

In case of problems, please write an e-mail to the ILC4CLARIN Technical Staff with all the information needed to solve the issues, including the version number.