ILC4CLARIN Opener tokenizer wrapper

Current software version: 1.0, released on 15/04/2019. It is available on the CNR-ILC GitHub.

This page describes how to use the ILC4CLARIN opener tokenizer wrapper. Official information and description can be found here

The service always performs the same operation (a simple tokenization) but, depending on the input parameters, it processes files in different formats and returns the result in different formats as well.
The input parameters are supplied as follows:
{
    "language":"one_of_the_above_in_either_2_or_3_codes",
    "iformat":"raw_or_kaf",
    "oformat":"one_of_the_above"
}
The input format is one of the following: The output format is one of the following
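As a minimal sketch, the form value could be assembled in a shell script as shown here. The helper name make_form is hypothetical and not part of the service. The check on the input format only reflects the raw_or_kaf placeholder in the template above; the language and output format are passed through unchecked, because their allowed values are not listed here.

```shell
# make_form LANGUAGE IFORMAT OFORMAT
# Prints the JSON value for the "form" field (hypothetical helper).
make_form() {
    case "$2" in
        raw|kaf) ;;   # the only input formats named in the template above
        *) echo "iformat must be raw or kaf" >&2; return 1 ;;
    esac
    printf '{"language":"%s","iformat":"%s","oformat":"%s"}\n' "$1" "$2" "$3"
}

make_form it raw tab
# prints {"language":"it","iformat":"raw","oformat":"tab"}
```

The printed string can then be passed to curl as -F "form=$(make_form it raw tab)".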

POST endpoints are the following:

Similarly, GET endpoints have been set up for possible integration into the LRS.

Examples

In each set of examples:
Calls in group (1) take plain text and produce a valid TAB, TCF or KAF document;
Calls in group (2) take a TCF document and produce a valid TAB, TCF or KAF document;
Calls in group (3) take a KAF document and produce a valid TAB, TCF or KAF document.

POST

The input format, the language and the file must be supplied as parameters. The output format is optional and only needs to be supplied when a format other than KAF is requested.

CURL:

Some examples:
    • curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"tab"}' -X POST tokenizer/runservice
    • curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"tcf"}' -X POST tokenizer/runservice
    • curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"kaf"}' -X POST tokenizer/runservice
    • curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"tab"}' -X POST tokenizer/tcf/runservice
    • curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"tcf"}' -X POST tokenizer/tcf/runservice
    • curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"kaf"}' -X POST tokenizer/tcf/runservice
    • *** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"tab"}' -X POST tokenizer/runservice
    • *** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"tcf"}' -X POST tokenizer/runservice
    • *** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"kaf"}' -X POST tokenizer/runservice
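The nine calls above differ only in the endpoint, the input format and the output format, so they could be wrapped in a small script. The following is a sketch under stated assumptions, not an official client: the function names, the TOKENIZER_BASE variable and the output file naming are all hypothetical, and TOKENIZER_BASE stands for the service root that the examples write simply as "tokenizer".

```shell
# Assumption: TOKENIZER_BASE stands for the service root, which the examples
# above write simply as "tokenizer". Set it to the real address before use.
TOKENIZER_BASE="${TOKENIZER_BASE:-tokenizer}"

# post_endpoint KIND
# Prints the POST endpoint used above for plain text (raw), TCF or KAF input.
post_endpoint() {
    case "$1" in
        raw|kaf) printf '%s/runservice\n' "$TOKENIZER_BASE" ;;
        tcf)     printf '%s/tcf/runservice\n' "$TOKENIZER_BASE" ;;
        *)       echo "unknown input kind: $1" >&2; return 1 ;;
    esac
}

# tokenize_all FILE KIND LANGUAGE
# Requests every output format for one input file and saves each result
# as FILE.tab, FILE.tcf and FILE.kaf (hypothetical naming).
tokenize_all() {
    file=$1; kind=$2; language=$3
    endpoint=$(post_endpoint "$kind") || return 1
    # The TCF calls above send "iformat":"raw"; this sketch does the same.
    if [ "$kind" = kaf ]; then iformat=kaf; else iformat=raw; fi
    for oformat in tab tcf kaf; do
        form=$(printf '{"language":"%s","iformat":"%s","oformat":"%s"}' \
            "$language" "$iformat" "$oformat")
        curl -H 'Content-Type: multipart/form-data' \
             -F "file=@$file" -F "form=$form" \
             -X POST "$endpoint" -o "$file.$oformat" || return 1
    done
}
```

For instance, tokenize_all myfile.tcf tcf it would issue the three TCF calls listed above, one per output format.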

GET

GET endpoints are designed mainly for LRS purposes and can be called with either curl or wget.
The endpoint name convention is the following:
In both cases, the url parameter gives the URL of the text to analyze. The language and the format must also be supplied as parameters:

CURL

Some examples:
    • curl -X GET "tokenizer/lrs?format=tab&lang=ita&url=my_url"
    • curl -X GET "tokenizer/lrs?format=kaf&lang=ita&url=my_url"
    • curl -X GET "tokenizer/lrs?format=tcf&lang=ita&url=my_url"
    • curl -X GET "tokenizer/tcf/lrs?format=tab&lang=ita&url=my_url"
    • curl -X GET "tokenizer/tcf/lrs?format=kaf&lang=ita&url=my_url"
    • curl -X GET "tokenizer/tcf/lrs?format=tcf&lang=ita&url=my_url"
    • curl -X GET "tokenizer/kaf/lrs?format=tab&lang=ita&url=my_url"
    • curl -X GET "tokenizer/kaf/lrs?format=kaf&lang=ita&url=my_url"
    • curl -X GET "tokenizer/kaf/lrs?format=tcf&lang=ita&url=my_url"
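When the url value itself contains characters such as & or ?, pasting it into the query string by hand can break the request. One possible way around this, sketched below, is to let curl encode the parameters with its -G and --data-urlencode options. The function names and the TOKENIZER_BASE variable are hypothetical, and whether the service expects an encoded url value is an assumption that should be checked against the real endpoint.

```shell
# Assumption: TOKENIZER_BASE stands for the service root, which the examples
# above write simply as "tokenizer".
TOKENIZER_BASE="${TOKENIZER_BASE:-tokenizer}"

# lrs_endpoint KIND
# Prints the GET endpoint used above for plain text (raw), TCF or KAF input.
lrs_endpoint() {
    case "$1" in
        raw)     printf '%s/lrs\n' "$TOKENIZER_BASE" ;;
        tcf|kaf) printf '%s/%s/lrs\n' "$TOKENIZER_BASE" "$1" ;;
        *)       echo "unknown input kind: $1" >&2; return 1 ;;
    esac
}

# lrs_get KIND FORMAT LANG URL
# Sends the GET request with all three parameters URL-encoded by curl.
lrs_get() {
    endpoint=$(lrs_endpoint "$1") || return 1
    # -G makes curl append the --data-urlencode pairs to the endpoint
    # as a query string instead of sending them in a request body.
    curl -G "$endpoint" \
         --data-urlencode "format=$2" \
         --data-urlencode "lang=$3" \
         --data-urlencode "url=$4"
}
```

A call such as lrs_get tcf tab ita my_url corresponds to the fourth example above.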

WGET

Some examples:
    • wget "tokenizer/lrs?format=tab&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/lrs?format=kaf&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/lrs?format=tcf&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/tcf/lrs?format=tab&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/tcf/lrs?format=kaf&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/tcf/lrs?format=tcf&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/kaf/lrs?format=tab&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/kaf/lrs?format=kaf&lang=ita&url=my_url" [-O out_file]
    • wget "tokenizer/kaf/lrs?format=tcf&lang=ita&url=my_url" [-O out_file]

Example texts

For plain text you can use:

 Mi chiamo Riccardo. Abito a Roma

For TCF text you can use:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc" type="application/relax-ng-compact-syntax"?>
    <D-Spin xmlns="http://www.dspin.de/data" version="0.4">
        <md:MetaData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cmd="http://www.clarin.eu/cmd/" 
            xmlns:md="http://www.dspin.de/data/metadata" 
            xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629623/xsd">
        </md:MetaData>
            <tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">
                <tc:text>
                    Mi chiamo Alfredo. Abito a Roma.
                </tc:text>
            </tc:TextCorpus>
    </D-Spin>
            

For KAF text, the calls marked *** can use:

<KAF xml:lang="it" version="1.0"><raw>Questo è un testo italiano</raw></KAF>
            

or, for the calls marked **, the following:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<KAF xml:lang="it" version="1.0">
	<kafHeader>
		<fileDesc />
		<linguisticProcessors layer="text">
			<lp name="it.cnr.ilc.panacea.service.impl.FreelingIt" version="1.0" timestamp="2019-04-12T15:04:32.096Z"/>
		</linguisticProcessors>
	</kafHeader>
	<text>
			<wf wid="w1" sent="1" para="1" offset="0" length="2"><![CDATA[Mi]]></wf>
			<wf wid="w2" sent="1" para="1" offset="3" length="6"><![CDATA[chiamo]]></wf>
			<wf wid="w3" sent="1" para="1" offset="10" length="8"><![CDATA[Riccardo]]></wf>
			<wf wid="w4" sent="1" para="1" offset="18" length="1"><![CDATA[.]]></wf>
			<wf wid="w5" sent="1" para="1" offset="19" length="5"><![CDATA[Abito]]></wf>
			<wf wid="w6" sent="1" para="1" offset="25" length="1"><![CDATA[a]]></wf>
			<wf wid="w8" sent="2" para="1" offset="27" length="4"><![CDATA[Roma]]></wf>
	</text>
</KAF>

            

As a URL you may use:

https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt
Please note that:

Contacts

In case of problems, write an email to the ILC4CLARIN technical staff with all the information needed to solve the issue, including the version number.