This is the Java port of the Perl-based tokenizer developed within the OpeNER project and available here.
Current software version: 0.2, released on 04/10/2017.
ILC4CLARIN provides three sets of distinct web services to perform tokenization on texts for the following languages:
The offered services all perform the same operation (tokenization) but, depending on the endpoint, produce valid TCF, KAF, or tabbed files.
The service that produces TCF can read either plain text or a valid TCF document. The MIME type is set accordingly.
This page explains how to invoke the offered services.
The endpoints are the following:
wl/tokenizer/plain (POST service to tokenize plain text and to produce a valid TCF document)
wl/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid TCF document)
wl/tokenizer/tcf (POST service to tokenize a TCF document and to produce a valid TCF document)
kaf/tokenizer/plain (POST service to tokenize plain text and to produce a valid KAF document)
kaf/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid KAF document)
tab/tokenizer/plain (POST service to tokenize plain text and to produce a tabbed document)
tab/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a tabbed document)
The language is provided as a parameter:
wl/tokenizer/plain?lang=iso_3_or_2_codes_lang
kaf/tokenizer/plain?lang=iso_3_or_2_codes_lang
tab/tokenizer/plain?lang=iso_3_or_2_codes_lang
Please note: when a TCF document is sent as input, no language is provided as a parameter:
wl/tokenizer/tcf
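The parameter rule above (a lang parameter for plain-text input, none for TCF input) can be sketched in Python as a small URL builder. This is only an illustration: BASE_URL is a hypothetical placeholder, since this page lists relative endpoint paths, and tokenizer_url is a helper name invented for the example. Treating lang as mandatory for plain text is also an assumption.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical base URL: this page lists only relative endpoint paths.
BASE_URL = "http://localhost:8080/"

# Output format -> path prefix, as in the endpoint list above.
PREFIXES = {"tcf": "wl", "kaf": "kaf", "tab": "tab"}


def tokenizer_url(output: str, source: str, lang: Optional[str] = None) -> str:
    """Build the URL of a POST endpoint.

    output is "tcf", "kaf" or "tab"; source is "plain" or "tcf".
    """
    if source == "tcf":
        # Only wl/tokenizer/tcf reads TCF input, and it takes no lang parameter.
        if output != "tcf":
            raise ValueError("TCF input is only accepted by wl/tokenizer/tcf")
        if lang is not None:
            raise ValueError("no lang parameter when posting a TCF document")
        return f"{BASE_URL}wl/tokenizer/tcf"
    if source != "plain":
        raise ValueError(f"unknown source: {source}")
    if lang is None:
        # Assumption: the plain-text endpoints require the language.
        raise ValueError("plain-text endpoints need a lang parameter")
    query = urlencode({"lang": lang})
    return f"{BASE_URL}{PREFIXES[output]}/tokenizer/plain?{query}"


print(tokenizer_url("kaf", "plain", lang="ita"))
print(tokenizer_url("tcf", "tcf"))
```

Rejecting a lang value for TCF input mirrors the note above; the language of a TCF document is presumably taken from the document itself.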
For the Language Resource Switchboard (note the lrs in the path), we added three additional endpoints.
The endpoints are the following:
wl/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid TCF document)
kaf/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid KAF document)
tab/tokenizer/lrs (GET service to tokenize a text retrieved from a URL and to produce a tabbed document)
Both the language and the URL are provided as parameters:
wl/tokenizer/lrs?lang=iso_3_or_2_codes_lang&url=URL
kaf/tokenizer/lrs?lang=iso_3_or_2_codes_lang&url=URL
tab/tokenizer/lrs?lang=iso_3_or_2_codes_lang&url=URL
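Because the url parameter itself contains characters such as ":" and "/", it can help to percent-encode it when building the query string. The sketch below does this with Python's standard library. BASE_URL and lrs_url are hypothetical names introduced for illustration, and it is an assumption that the service decodes percent-encoded query values in the usual way.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL: this page lists only relative endpoint paths.
BASE_URL = "http://localhost:8080/"


def lrs_url(prefix: str, lang: str, resource_url: str) -> str:
    """Build a GET URL for an lrs endpoint; prefix is "wl", "kaf" or "tab"."""
    # urlencode percent-encodes both values, so the nested URL stays one parameter.
    query = urlencode({"lang": lang, "url": resource_url})
    return f"{BASE_URL}{prefix}/tokenizer/lrs?{query}"


sample = (
    "https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/"
    "master/samples/resources/txt/hermes-it.txt"
)
built = lrs_url("kaf", "ita", sample)
print(built)

# Decoding the query gives back the original values.
print(parse_qs(urlsplit(built).query))
```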
You can test the service endpoints using curl or wget as follows:
curl -H 'content-type: text/plain' --data-binary @plain-file.txt -X POST wl/tokenizer/plain?lang=ita
curl -H 'content-type: text/tcf+xml' --data-binary @tcf-file.txt -X POST wl/tokenizer/tcf
curl -H 'content-type: text/plain' --data-binary @plain-file.txt -X POST kaf/tokenizer/plain?lang=ita
curl -H 'content-type: text/plain' --data-binary @plain-file.txt -X POST tab/tokenizer/plain?lang=ita
wget --post-file=plain-file.txt --header='Content-Type: text/plain' wl/tokenizer/plain?lang=ita
wget --post-file=tcf-file.txt --header='Content-Type: text/tcf+xml' wl/tokenizer/tcf
wget --post-file=plain-file.txt --header='Content-Type: text/plain' kaf/tokenizer/plain?lang=ita
wget --post-file=plain-file.txt --header='Content-Type: text/plain' tab/tokenizer/plain?lang=ita
curl -X GET "wl/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt"
curl -X GET "kaf/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt"
curl -X GET "tab/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt"
wget "wl/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt" [-O out_file]
wget "kaf/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt" [-O out_file]
wget "tab/tokenizer/lrs?lang=ita&url=https://raw.githubusercontent.com/clarin-eric/LRS-Hackathon/master/samples/resources/txt/hermes-it.txt" [-O out_file]
Please note that the services designed for the Language Resource Switchboard also work on their own when invoked with the commands above.
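The POST commands above can also be expressed with Python's standard library. The following is a minimal sketch that only builds the request object and does not send it. BASE_URL and build_post are hypothetical names, and encoding the body as UTF-8 is an assumption about what the service expects.

```python
from urllib.request import Request

# Hypothetical base URL: this page lists only relative endpoint paths.
BASE_URL = "http://localhost:8080/"


def build_post(path: str, body: str, content_type: str) -> Request:
    """Prepare a POST request mirroring the curl examples (nothing is sent)."""
    return Request(
        BASE_URL + path,
        data=body.encode("utf-8"),  # assumption: UTF-8 request body
        headers={"Content-Type": content_type},
        method="POST",
    )


req = build_post(
    "wl/tokenizer/plain?lang=ita",
    "Mi chiamo Riccardo. Abito a Roma",
    "text/plain",
)
print(req.get_method(), req.full_url)

# To actually call the service, pass the request to urllib.request.urlopen:
#   with urllib.request.urlopen(req) as response:
#       print(response.read().decode("utf-8"))
```

For TCF input, the same helper would be called with the path wl/tokenizer/tcf and the content type text/tcf+xml, as in the curl example.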
For plain text, you can use:
Mi chiamo Riccardo. Abito a Roma
For TCF text, you can use:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc" type="application/relax-ng-compact-syntax"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <md:MetaData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cmd="http://www.clarin.eu/cmd/" xmlns:md="http://www.dspin.de/data/metadata" xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629623/xsd"> </md:MetaData>
  <tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">
    <tc:text> Mi chiamo Alfredo. Abito a Roma. </tc:text>
  </tc:TextCorpus>
</D-Spin>
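To check what a TCF input document carries before sending it, the language and text can be read back with Python's standard XML parser. This is a sketch only: read_tcf is a hypothetical helper, and SAMPLE is a trimmed copy of the document above (metadata omitted) used purely to exercise the parser, not a schema-valid TCF file.

```python
import xml.etree.ElementTree as ET

# Namespace of the TextCorpus layer, taken from the sample document.
TC_NS = "http://www.dspin.de/data/textcorpus"

# Trimmed copy of the sample above (metadata omitted), for parsing only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <tc:TextCorpus xmlns:tc="http://www.dspin.de/data/textcorpus" lang="it">
    <tc:text> Mi chiamo Alfredo. Abito a Roma. </tc:text>
  </tc:TextCorpus>
</D-Spin>"""


def read_tcf(doc: bytes):
    """Return (lang, text) from the TextCorpus element of a TCF document."""
    root = ET.fromstring(doc)
    corpus = root.find(f"{{{TC_NS}}}TextCorpus")
    if corpus is None:
        raise ValueError("no TextCorpus element found")
    text = corpus.findtext(f"{{{TC_NS}}}text", default="")
    # The lang attribute is unqualified; surrounding whitespace is stripped.
    return corpus.get("lang"), text.strip()


print(read_tcf(SAMPLE))
```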