ILC4CLARIN Tokenizer service

The service always performs the same operation (a simple tokenization) but, depending on the input parameters, it processes files in different formats and returns the result in different formats as well.
The input parameters have the following form:

{
                    "language":"one_of_the_above_in_either_2_or_3_codes",
                    "iformat":"raw_or_kaf",
                    "oformat":"one_of_the_above"
                }

The input format is one of the following:

iformat=raw. Use this format if you provide plain text, such as "My name is Riccardo. I live in Pisa."

iformat=kaf. Use this format if you provide KAF texts, such as:

                        <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
                        <KAF xml:lang="en" version="1.0">
                            <raw>this is an English text</raw>
                        </KAF>
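
A raw KAF wrapper like the one above can also be generated programmatically. The following is a minimal Python sketch using only the standard library; the helper name make_raw_kaf is illustrative and not part of the service.

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace; ElementTree serializes it with the "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_raw_kaf(text, lang):
    """Wrap plain text in a minimal KAF document with a single <raw> element.

    Illustrative helper (an assumption, not part of the service): it only
    mirrors the structure of the KAF example shown in this README.
    """
    root = ET.Element("KAF", {f"{{{XML_NS}}}lang": lang, "version": "1.0"})
    ET.SubElement(root, "raw").text = text
    # Returns UTF-8 encoded bytes, including the XML declaration.
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    print(make_raw_kaf("this is an English text", "en").decode("utf-8"))
```

The returned bytes can be written to a file and sent with iformat=kaf.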

The output format is one of the following:

TCF
a tabbed file (TAB)
KAF. This format is selected automatically if the output format parameter is not sent.
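
The format rules above (two input formats, three output formats, KAF as the default output) can be checked on the client before a request is sent. This is a hedged sketch: the function name normalize_params is hypothetical, and the service itself may validate its parameters differently.

```python
INPUT_FORMATS = ("raw", "kaf")
OUTPUT_FORMATS = ("tcf", "tab", "kaf")
DEFAULT_OUTPUT_FORMAT = "kaf"  # what the README says is used when oformat is not sent


def normalize_params(params):
    """Return a copy of params with the output format made explicit.

    Client-side check only (an assumption about how a caller might
    guard its requests); it does not contact the service.
    """
    iformat = params.get("iformat")
    if iformat not in INPUT_FORMATS:
        raise ValueError(f"iformat must be one of {INPUT_FORMATS}, got {iformat!r}")
    oformat = params.get("oformat", DEFAULT_OUTPUT_FORMAT)
    if oformat not in OUTPUT_FORMATS:
        raise ValueError(f"oformat must be one of {OUTPUT_FORMATS}, got {oformat!r}")
    return {**params, "oformat": oformat}


if __name__ == "__main__":
    print(normalize_params({"language": "it", "iformat": "raw"}))
```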

POST endpoints are the following:

tokenizer/runservice (POST service to analyze plain texts and to produce TCF, TAB, KAF valid documents, according to the input and output format parameters)
tokenizer/tcf/runservice (POST service to analyze TCF documents and to produce TCF, TAB, KAF valid documents, according to the input and output format parameters)
tokenizer/runservice*** (POST service to analyze KAF (with raw element) documents and to produce TCF, TAB, KAF valid documents, according to the input and output format parameters)

Similarly, GET endpoints have been set up for possible integration into the LRS:

tokenizer/lrs (GET service to analyze plain texts and to produce TCF, TAB, KAF valid documents, according to the format parameter)
tokenizer/kaf/lrs (GET service to analyze KAF documents and to produce TCF, TAB, KAF valid documents, according to the format parameter)
tokenizer/tcf/lrs (GET service to analyze TCF documents and to produce TCF, TAB, KAF valid documents, according to the format parameter)

Examples

In each list of examples the calls come in three groups of three:
(1) the first three calls process plain texts to produce a valid TAB, TCF or KAF document;
(2) the next three calls process TCF documents to produce a valid TAB, TCF or KAF document;
(3) the last three calls process KAF documents to produce a valid TAB, TCF or KAF document.

POST

The input format, language and file must be supplied as parameters, while the output format needs to be supplied only if a format other than KAF is requested.

file: the file containing your data
language: one of en[g], es[p], fr[a], it[a], de[u], nl[d]
input format: one of raw, kaf
output format: one of tab, tcf, kaf (optional)

CURL:

Some examples:

- curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"tab"}' -X POST tokenizer/runservice
- curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"tcf"}' -X POST tokenizer/runservice
- curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"kaf"}' -X POST tokenizer/runservice
- curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"tab"}' -X POST tokenizer/tcf/runservice
- curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"tcf"}' -X POST tokenizer/tcf/runservice
- curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"kaf"}' -X POST tokenizer/tcf/runservice
- *** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"tab"}' -X POST tokenizer/runservice
- *** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"tcf"}' -X POST tokenizer/runservice
- *** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"kaf"}' -X POST tokenizer/runservice
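
The curl calls above differ only in the file, the endpoint and the JSON in the form field, so the command can be assembled programmatically to avoid quoting mistakes. This is a sketch under assumptions: curl_post_argv is a hypothetical helper, and the endpoint is passed through exactly as written in this README (a host prefix would have to be added by the caller).

```python
import json
import shlex


def curl_post_argv(endpoint, file_path, language, iformat, oformat=None):
    """Build the argument list for a POST call in the style shown above.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    form = {"language": language, "iformat": iformat}
    if oformat is not None:
        # oformat is optional; the README states that KAF is used when it is absent.
        form["oformat"] = oformat
    return [
        "curl",
        "-H", "Content-Type: multipart/form-data",
        "-F", f"file=@{file_path}",
        "-F", "form=" + json.dumps(form, separators=(",", ":")),
        "-X", "POST",
        endpoint,
    ]


if __name__ == "__main__":
    argv = curl_post_argv("tokenizer/runservice", "myfile", "it", "raw", "tab")
    # shlex.join produces a shell-safe command line for copy and paste.
    print(shlex.join(argv))
```

The same list can be handed to subprocess.run once the endpoint is a complete URL.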

GET

GET endpoints are designed mainly for LRS purposes and can be called with either curl or wget.
The parameters are the following:

lang: the language, one of en[g], es[p], fr[a], it[a], de[u], nl[d]
format: one of tab, tcf, kaf
url: the URL where the text to analyze is found

With both curl and wget, all three parameters must be supplied. The endpoints are the following:

tokenizer/lrs to analyze plain text*
tokenizer/tcf/lrs to analyze TCF text
tokenizer/kaf/lrs to analyze KAF text**
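
A GET request URL built from these parameters can be sketched as follows. The helper name lrs_url and the base URL in the usage line are placeholders, since this README does not state the host. Percent-encoding the url value is standard practice for query strings and is assumed to be accepted by the service.

```python
from urllib.parse import urlencode

# Endpoint paths as listed in this README, keyed by the kind of input text.
LRS_PATHS = {
    "plain": "tokenizer/lrs",
    "tcf": "tokenizer/tcf/lrs",
    "kaf": "tokenizer/kaf/lrs",
}


def lrs_url(base_url, input_kind, fmt, lang, text_url):
    """Return the full GET URL for one of the lrs endpoints.

    base_url is a placeholder for wherever the service is deployed.
    """
    # urlencode percent-encodes the nested URL so it survives as one parameter.
    query = urlencode({"format": fmt, "lang": lang, "url": text_url})
    return f"{base_url.rstrip('/')}/{LRS_PATHS[input_kind]}?{query}"


if __name__ == "__main__":
    # "http://localhost:8080/" is a made-up base URL for illustration only.
    print(lrs_url("http://localhost:8080/", "plain", "tab", "ita", "https://example.org/t.txt"))
```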

CURL

Some examples:

- curl -X GET "tokenizer/lrs?format=tab&lang=ita&url=my_url"
- curl -X GET "tokenizer/lrs?format=kaf&lang=ita&url=my_url"
- curl -X GET "tokenizer/lrs?format=tcf&lang=ita&url=my_url"
- curl -X GET "tokenizer/tcf/lrs?format=tab&lang=ita&url=my_url"
- curl -X GET "tokenizer/tcf/lrs?format=kaf&lang=ita&url=my_url"
- curl -X GET "tokenizer/tcf/lrs?format=tcf&lang=ita&url=my_url"
- curl -X GET "tokenizer/kaf/lrs?format=tab&lang=ita&url=my_url"
- curl -X GET "tokenizer/kaf/lrs?format=kaf&lang=ita&url=my_url"
- curl -X GET "tokenizer/kaf/lrs?format=tcf&lang=ita&url=my_url"

WGET

Some examples:

- wget "tokenizer/lrs?format=tab&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/lrs?format=kaf&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/lrs?format=tcf&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/tcf/lrs?format=tab&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/tcf/lrs?format=kaf&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/tcf/lrs?format=tcf&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/kaf/lrs?format=tab&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/kaf/lrs?format=kaf&lang=ita&url=my_url" [-O out_file]
- wget "tokenizer/kaf/lrs?format=tcf&lang=ita&url=my_url" [-O out_file]
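
For callers who prefer not to shell out, the effect of wget with -O can be approximated in Python with the standard library. This is a minimal sketch, assuming the request URL is already complete; fetch_to_file is an illustrative name and no service-specific error handling is attempted.

```python
import shutil
import urllib.request


def fetch_to_file(request_url, out_file):
    """Send a GET request and save the response body to out_file.

    Illustrative equivalent of `wget "<request_url>" -O out_file`.
    """
    with urllib.request.urlopen(request_url) as response, open(out_file, "wb") as out:
        # Stream the body to disk instead of loading it all into memory.
        shutil.copyfileobj(response, out)
    return out_file
```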

Example texts

For plain text you can use:

 Mi chiamo Riccardo. Abito a Roma

For TCF text you can use:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc" type="application/relax-ng-compact-syntax"?>
    <D-Spin xmlns="http://www.dspin.de/data" version="0.4">
        <md:MetaData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cmd="http://www.clarin.eu/cmd/" 
            xmlns:md="http://www.dspin.de/data/metadata" 
            xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629623/xsd">
        </md:MetaData>
            <tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">
                <tc:text>
                    Mi chiamo Alfredo. Abito a Roma.
                </tc:text>
            </tc:TextCorpus>
    </D-Spin>
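
To check what a TCF file contains before sending it, the language and text can be read back with the standard library. This is a sketch that assumes the same element names and namespace as the sample above; tcf_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace of the tc: prefix, copied from the sample document above.
TC_NS = "http://www.dspin.de/data/textcorpus"


def tcf_text(tcf_xml):
    """Return (lang, text) from a TCF document given as str or bytes."""
    root = ET.fromstring(tcf_xml)
    corpus = root.find(f"{{{TC_NS}}}TextCorpus")
    if corpus is None:
        raise ValueError("no tc:TextCorpus element found")
    text = corpus.findtext(f"{{{TC_NS}}}text", default="")
    # The sample indents the text, so surrounding whitespace is stripped.
    return corpus.get("lang"), text.strip()
```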

For KAF text you can use the following for the calls marked ***:

<KAF xml:lang="it" version="1.0"><raw>Questo è un testo italiano</raw></KAF>

or the following for the endpoint marked **:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<KAF xml:lang="it" version="1.0">
	<kafHeader>
		<fileDesc />
		<linguisticProcessors layer="text">
			<lp name="it.cnr.ilc.panacea.service.impl.FreelingIt" version="1.0" timestamp="2019-04-12T15:04:32.096Z"/>
		</linguisticProcessors>
	</kafHeader>
	<text>
			<wf wid="w1" sent="1" para="1" offset="0" length="2"><![CDATA[Mi]]></wf>
			<wf wid="w2" sent="1" para="1" offset="3" length="6"><![CDATA[chiamo]]></wf>
			<wf wid="w3" sent="1" para="1" offset="10" length="8"><![CDATA[Riccardo]]></wf>
			<wf wid="w4" sent="1" para="1" offset="18" length="1"><![CDATA[.]]></wf>
			<wf wid="w5" sent="1" para="1" offset="19" length="5"><![CDATA[Abito]]></wf>
			<wf wid="w6" sent="1" para="1" offset="25" length="1"><![CDATA[a]]></wf>
			<wf wid="w8" sent="2" para="1" offset="27" length="4"><![CDATA[Roma]]></wf>
	</text>
</KAF>
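
The token layer of a KAF document such as the one above can be read into plain Python structures. This is a minimal sketch assuming the wf attributes shown in the sample (wid, sent, offset, length); kaf_tokens is an illustrative name.

```python
import xml.etree.ElementTree as ET


def kaf_tokens(kaf_xml):
    """Return the <wf> elements of a KAF document as a list of dicts."""
    root = ET.fromstring(kaf_xml)
    tokens = []
    for wf in root.iter("wf"):
        tokens.append({
            "wid": wf.get("wid"),
            "sent": int(wf.get("sent")),
            "offset": int(wf.get("offset")),
            "length": int(wf.get("length")),
            # CDATA content is exposed as ordinary element text.
            "form": wf.text or "",
        })
    return tokens
```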

As the URL you may use:

https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt
