The current software version is 1.0, released on 15/04/2019. It is available on the CNR-ILC GitHub.
This page describes how to use the ILC4CLARIN opener tokenizer wrapper. Official information and description can be found here
{ "language":"one_of_the_above_in_either_2_or_3_codes", "iformat":"raw_or_kaf", "oformat":"one_of_the_above" }
The input format is one of the following:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<KAF xml:lang="it" version="1.0">
  <raw>this is an English text</raw>
</KAF>
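As a rough illustration of the two pieces above, the following Python sketch builds the JSON form value and wraps a plain string in a KAF document with a single raw element. The helper names build_form and wrap_raw_as_kaf are hypothetical and not part of the service. Only the field names (language, iformat, oformat) and the KAF layout are taken from this page.

```python
import json
import xml.etree.ElementTree as ET

# Namespace URI bound to the reserved "xml:" prefix (fixed by the XML specification).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_form(language: str, iformat: str, oformat: str) -> str:
    """Serialize the three form fields described above as compact JSON."""
    return json.dumps(
        {"language": language, "iformat": iformat, "oformat": oformat},
        separators=(",", ":"),
    )


def wrap_raw_as_kaf(text: str, language: str) -> str:
    """Wrap plain text in a KAF document holding a single <raw> element."""
    kaf = ET.Element("KAF", {f"{{{XML_NS}}}lang": language, "version": "1.0"})
    ET.SubElement(kaf, "raw").text = text  # special characters are escaped on output
    return ET.tostring(kaf, encoding="unicode")


print(build_form("it", "kaf", "tab"))
print(wrap_raw_as_kaf("Questo è un testo italiano", "it"))
```

Building the document through ElementTree rather than string concatenation keeps characters such as & and < in the input text properly escaped.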
POST endpoints are the following:
tokenizer/runservice
(POST service to analyze plain text and produce valid TCF, TAB, or KAF documents, according to the input and output format parameters)
tokenizer/tcf/runservice
(POST service to analyze TCF documents and produce valid TCF, TAB, or KAF documents, according to the input and output format parameters)
tokenizer/runservice***
(POST service to analyze KAF documents with a raw element and produce valid TCF, TAB, or KAF documents, according to the input and output format parameters)
Similarly, GET endpoints have been set up for eventual integration into LRS:
tokenizer/lrs
(GET service to analyze plain text and produce valid TCF, TAB, or KAF documents, according to the format parameter)
tokenizer/kaf/lrs
(GET service to analyze KAF documents and produce valid TCF, TAB, or KAF documents, according to the format parameter)
tokenizer/tcf/lrs
(GET service to analyze TCF documents and produce valid TCF, TAB, or KAF documents, according to the format parameter)
For the POST endpoints, the input format, language, and file parameters must be supplied, while the output format can be supplied if a format other than KAF is requested.
CURL:
Some examples:
curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"tab"}' -X POST tokenizer/runservice
curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"tcf"}' -X POST tokenizer/runservice
curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile' -F 'form={"language":"it","iformat":"raw","oformat":"kaf"}' -X POST tokenizer/runservice
curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"tab"}' -X POST tokenizer/tcf/runservice
curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"tcf"}' -X POST tokenizer/tcf/runservice
curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.tcf' -F 'form={"language":"it","iformat":"raw","oformat":"kaf"}' -X POST tokenizer/tcf/runservice
*** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"tab"}' -X POST tokenizer/runservice
*** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"tcf"}' -X POST tokenizer/runservice
*** curl -H 'Content-Type: multipart/form-data' -F 'file=@myfile.kaf' -F 'form={"language":"it","iformat":"kaf","oformat":"kaf"}' -X POST tokenizer/runservice
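For callers who prefer a script to the command line, the curl calls above can be approximated in Python with the standard library only. This is a minimal sketch, not the service's official client: BASE_URL is a hypothetical placeholder (this page gives only relative endpoint paths), and the helper names are made up for illustration. The part names file and form are taken from the curl commands.

```python
import json
import urllib.request
import uuid
from typing import Tuple

# Hypothetical base URL: replace with the address where the service is deployed.
BASE_URL = "http://localhost:8080/"


def build_multipart(file_name: str, file_bytes: bytes, form: dict) -> Tuple[bytes, str]:
    """Encode a 'file' part and a JSON 'form' part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    form_json = json.dumps(form, separators=(",", ":")).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    form_header = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="form"\r\n\r\n'
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = file_header + file_bytes + form_header + form_json + closing
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(endpoint: str, file_name: str, file_bytes: bytes, form: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for one of the runservice endpoints."""
    body, content_type = build_multipart(file_name, file_bytes, form)
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_request(
    "tokenizer/runservice",
    "myfile",
    "Mi chiamo Riccardo. Abito a Roma".encode("utf-8"),
    {"language": "it", "iformat": "raw", "oformat": "tab"},
)
print(request.get_method(), request.full_url)
# To send it against a running instance: urllib.request.urlopen(request).read()
```

The request is only prepared here, so the sketch runs without a server. Sending it requires a reachable deployment at the assumed base URL.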
For the GET endpoints, the url parameter indicates the URL where the text to analyze is found. The language and the format must also be supplied as parameters:
tokenizer/lrs
to analyze plain text*
tokenizer/tcf/lrs
to analyze TCF text
tokenizer/kaf/lrs
to analyze KAF text**
CURL
Some examples:
WGET
Some examples:
As for plain text you can use
Mi chiamo Riccardo. Abito a Roma
As for TCF text you can use
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc" type="application/relax-ng-compact-syntax"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <md:MetaData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cmd="http://www.clarin.eu/cmd/" xmlns:md="http://www.dspin.de/data/metadata" xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629623/xsd">
  </md:MetaData>
  <tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">
    <tc:text> Mi chiamo Alfredo. Abito a Roma. </tc:text>
  </tc:TextCorpus>
</D-Spin>
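To check a TCF file before submitting it, the language and the text can be read back with a few lines of Python. This is only a sketch under the assumption that the document follows the layout of the sample above (a TextCorpus element directly under D-Spin, with a single text child). The function name read_tcf_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the TextCorpus layer, as declared in the TCF sample.
NS = {"tc": "http://www.dspin.de/data/textcorpus"}

# Shortened copy of the TCF sample (schema hints and metadata attributes omitted).
TCF_SAMPLE = """<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <md:MetaData xmlns:md="http://www.dspin.de/data/metadata"></md:MetaData>
  <tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">
    <tc:text> Mi chiamo Alfredo. Abito a Roma. </tc:text>
  </tc:TextCorpus>
</D-Spin>"""


def read_tcf_text(tcf_xml: str):
    """Return (language, text) from a TCF document shaped like the sample."""
    root = ET.fromstring(tcf_xml)
    corpus = root.find("tc:TextCorpus", NS)
    if corpus is None:
        raise ValueError("no TextCorpus element found")
    text = corpus.findtext("tc:text", default="", namespaces=NS)
    return corpus.get("lang"), text.strip()


language, text = read_tcf_text(TCF_SAMPLE)
print(language, "|", text)
```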
As for KAF text you can use the following for ***
<KAF xml:lang="it" version="1.0"><raw>Questo è un testo italiano</raw></KAF>
or the following for **
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<KAF xml:lang="it" version="1.0">
  <kafHeader>
    <fileDesc />
    <linguisticProcessors layer="text">
      <lp name="it.cnr.ilc.panacea.service.impl.FreelingIt" version="1.0" timestamp="2019-04-12T15:04:32.096Z"/>
    </linguisticProcessors>
  </kafHeader>
  <text>
    <wf wid="w1" sent="1" para="1" offset="0" length="2"><![CDATA[Mi]]></wf>
    <wf wid="w2" sent="1" para="1" offset="3" length="6"><![CDATA[chiamo]]></wf>
    <wf wid="w3" sent="1" para="1" offset="10" length="8"><![CDATA[Riccardo]]></wf>
    <wf wid="w4" sent="1" para="1" offset="18" length="1"><![CDATA[.]]></wf>
    <wf wid="w5" sent="1" para="1" offset="19" length="5"><![CDATA[Abito]]></wf>
    <wf wid="w6" sent="1" para="1" offset="25" length="1"><![CDATA[a]]></wf>
    <wf wid="w8" sent="2" para="1" offset="27" length="4"><![CDATA[Roma]]></wf>
  </text>
</KAF>
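A tokenized KAF document like the one above can be read back into a simple token list. The following Python sketch assumes that every token is a wf element carrying wid, sent, offset, and length attributes, as in the sample. The function name read_tokens is hypothetical, and the embedded document is a shortened copy of the sample.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the tokenized KAF sample (header omitted, first four tokens only).
KAF_SAMPLE = """<KAF xml:lang="it" version="1.0">
  <text>
    <wf wid="w1" sent="1" para="1" offset="0" length="2"><![CDATA[Mi]]></wf>
    <wf wid="w2" sent="1" para="1" offset="3" length="6"><![CDATA[chiamo]]></wf>
    <wf wid="w3" sent="1" para="1" offset="10" length="8"><![CDATA[Riccardo]]></wf>
    <wf wid="w4" sent="1" para="1" offset="18" length="1"><![CDATA[.]]></wf>
  </text>
</KAF>"""


def read_tokens(kaf_xml: str) -> list:
    """Collect the word forms of a KAF document as a list of dictionaries."""
    root = ET.fromstring(kaf_xml)
    tokens = []
    for wf in root.iter("wf"):
        tokens.append(
            {
                "wid": wf.get("wid"),
                "sent": int(wf.get("sent")),
                "offset": int(wf.get("offset")),
                "length": int(wf.get("length")),
                "text": wf.text or "",  # CDATA content is exposed as plain text
            }
        )
    return tokens


for token in read_tokens(KAF_SAMPLE):
    print(token["wid"], token["sent"], token["offset"], token["text"])
```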
As a URL you may use:
https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt
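A GET request to one of the lrs endpoints needs the sample URL passed as a properly encoded query parameter. The Python sketch below shows one way to assemble such a request URL. It rests on assumptions: BASE_URL is a hypothetical placeholder, and the query keys language and format are guessed from the parameter names used on this page (only url is named explicitly), so check them against the deployed service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL: replace with the address where the service is deployed.
BASE_URL = "http://localhost:8080/"

SAMPLE_TEXT_URL = (
    "https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/"
    "master/samples/resources/txt/hermes-it.txt"
)


def build_lrs_url(endpoint: str, text_url: str, language: str, fmt: str) -> str:
    """Build the GET URL for an lrs endpoint, percent-encoding the parameters."""
    # "language" and "format" are assumed key names; "url" is named on this page.
    query = urlencode({"url": text_url, "language": language, "format": fmt})
    return f"{BASE_URL}{endpoint}?{query}"


request_url = build_lrs_url("tokenizer/lrs", SAMPLE_TEXT_URL, "it", "tab")
print(request_url)
# To fetch it against a running instance:
#   import urllib.request; urllib.request.urlopen(request_url).read()
```

Encoding through urlencode matters because the text URL itself contains characters such as : and / that must not be left raw inside a query string.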