{"id":1061,"date":"2018-09-14T12:35:00","date_gmt":"2018-09-14T10:35:00","guid":{"rendered":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/?p=1061"},"modified":"2024-03-20T20:24:09","modified_gmt":"2024-03-20T19:24:09","slug":"ltfw","status":"publish","type":"post","link":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/ltfw\/","title":{"rendered":"LTFW (Linguistic Tools For Weblicht)"},"content":{"rendered":"<p style=\"text-align: right;\"><a href=\"https:\/\/ilc4clarin.ilc.cnr.it\/en\/ltfw\/\" title=\"EN\" class=\"current_language\"><img decoding=\"async\" src=\"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-content\/plugins\/multisite-language-switcher\/flags\/gb.png\" alt=\"en_GB\"\/><\/a><a href=\"https:\/\/ilc4clarin.ilc.cnr.it\/\" title=\"IT\"><img decoding=\"async\" src=\"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-content\/plugins\/multisite-language-switcher\/flags\/it.png\" alt=\"it_IT\"\/><\/a><\/p>\n<p>Linguistic services.<\/p>\n<p>This is the Java port of the Perl-based tokenizer developed within the <a href=\"https:\/\/github.com\/opener-project\/tokenizer-base\">OpeNER<\/a> project.<\/p>\n<p>Our Java port is available on <a href=\"https:\/\/github.com\/cnr-ilc\/linguistic-tools-for-weblicht\">GitHub<\/a> and <a href=\"https:\/\/hub.docker.com\/r\/cnrilc\/ltfw\">Docker<\/a>.<\/p>\n<p>Current software version: 0.2 (released on 04\/10\/2017).<\/p>\n<p>ILC4CLARIN provides three sets of distinct web services to perform tokenization on texts for the following languages:<\/p>\n<ul>\n<li>ita (or it)<\/li>\n<li>fra (or fr)<\/li>\n<li>deu (or de)<\/li>\n<li>eng (or en)<\/li>\n<li>esp (or es)<\/li>\n<li>nld (or nl)<\/li>\n<\/ul>\n<p>The application raises an Unsupported Language Exception if the language provided is not in the list.<\/p>\n<p>The services offered perform the same operation (tokenization) but, depending on the endpoint, valid <a href=\"https:\/\/weblicht.sfs.uni-tuebingen.de\/weblichtwiki\/index.php\/The_TCF_Format\">TCF<\/a>, <a 
href=\"https:\/\/github.com\/opener-project\/kaf\/wiki\/KAF-structure-overview\">KAF<\/a> or tabbed files can be produced.<\/p>\n<p>The service that produces TCF can read either plain text or a valid TCF document. The MIME type is set accordingly.<\/p>\n<h3>How to invoke the offered services<\/h3>\n<p>The endpoints are the following:<\/p>\n<ul>\n<li>wl\/tokenizer\/plain (POST service to tokenize plain text and to produce a valid TCF document)<\/li>\n<li>wl\/tokenizer\/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid TCF document)<\/li>\n<li>wl\/tokenizer\/tcf (POST service to tokenize a TCF document and to produce a valid TCF document)<\/li>\n<li>kaf\/tokenizer\/plain (POST service to tokenize plain text and to produce a valid KAF document)<\/li>\n<li>kaf\/tokenizer\/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid KAF document)<\/li>\n<li>tab\/tokenizer\/plain (POST service to tokenize plain text and to produce a tabbed document)<\/li>\n<li>tab\/tokenizer\/lrs (GET service to tokenize a text retrieved from a URL and to produce a tabbed document)<\/li>\n<\/ul>\n<p>The language is provided as a parameter:<\/p>\n<ul>\n<li>wl\/tokenizer\/plain?lang=iso_3_or_2_codes_lang<\/li>\n<li>kaf\/tokenizer\/plain?lang=iso_3_or_2_codes_lang<\/li>\n<li>tab\/tokenizer\/plain?lang=iso_3_or_2_codes_lang<\/li>\n<\/ul>\n<p>PLEASE NOTE THIS CALL. 
For TCF, when a TCF document is sent as input, NO LANGUAGE IS PROVIDED AS A PARAMETER.<\/p>\n<ul>\n<li>wl\/tokenizer\/tcf<\/li>\n<\/ul>\n<p>For the Language Resource Switchboard (please note the lrs in the path) we added three additional endpoints.<\/p>\n<p>The endpoints are the following:<\/p>\n<ul>\n<li>wl\/tokenizer\/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid TCF document)<\/li>\n<li>kaf\/tokenizer\/lrs (GET service to tokenize a text retrieved from a URL and to produce a valid KAF document)<\/li>\n<li>tab\/tokenizer\/lrs (GET service to tokenize a text retrieved from a URL and to produce a tabbed document)<\/li>\n<\/ul>\n<p>Both the language and the URL are provided as parameters:<\/p>\n<ul>\n<li>wl\/tokenizer\/lrs?lang=iso_3_or_2_codes_lang&amp;url=URL<\/li>\n<li>kaf\/tokenizer\/lrs?lang=iso_3_or_2_codes_lang&amp;url=URL<\/li>\n<li>tab\/tokenizer\/lrs?lang=iso_3_or_2_codes_lang&amp;url=URL<\/li>\n<\/ul>\n<p>This is because the integration of services in the Language Resource Switchboard requires the URL to be passed as an input parameter.<\/p>\n<h3>How to test the services<\/h3>\n<p>You can test the service endpoints using curl or wget as follows:<\/p>\n<p>Send the input file to the endpoints for processing:<\/p>\n<ul>\n<li>with curl:<\/li>\n<\/ul>\n<p style=\"padding-left: 30px; text-align: left;\">curl -H 'content-type: text\/plain' --data-binary @plain-file.txt -X POST http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/wl\/tokenizer\/plain?lang=ita<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">curl -H 'content-type: text\/tcf+xml' --data-binary @tcf-file.txt -X POST http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/wl\/tokenizer\/tcf<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">curl -H 'content-type: text\/plain' --data-binary @plain-file.txt -X POST http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/kaf\/tokenizer\/plain?lang=ita<\/p>\n<p style=\"padding-left: 30px; 
text-align: left;\">curl -H &#8216;content-type: text\/plain&#8217; &#8211;data-binary @plain-file.txt -X POST http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/tab\/tokenizer\/plain?lang=ita<\/p>\n<ul>\n<li>with wget:<\/li>\n<\/ul>\n<p style=\"padding-left: 30px; text-align: left;\">wget &#8211;post-file=plain-file.txt &#8211;header=&#8217;Content-Type: text\/plain&#8217; http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/wl\/tokenizer\/plain?lang=ita<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">wget &#8211;post-file=tcf-file.txt &#8211;header=&#8217;Content-Type: text\/tcf+xml&#8217; http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/wl\/tokenizer\/tcf?lang=ita<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">wget &#8211;post-file=plain-file.txt &#8211;header=&#8217;Content-Type: text\/plain&#8217; http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/kaf\/tokenizer\/plain?lang=ita<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">wget &#8211;post-file=plain-file.txt &#8211;header=&#8217;Content-Type: text\/plain&#8217; http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/tab\/tokenizer\/plain?lang=ita<\/p>\n<p>To test the services for the Language Resource Switchboard:<\/p>\n<ul>\n<li>with curl:<\/li>\n<\/ul>\n<p style=\"padding-left: 30px; text-align: left;\">curl -X GET &#8220;http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/wl\/tokenizer\/lrs?lang=ita&amp;url=https:\/\/raw.githubusercontent.com\/clarin-eric\/LRS-Hackathon\/master\/samples\/resources\/txt\/hermes-it.txt&#8221;<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">curl -X GET &#8220;http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/kaf\/tokenizer\/lrs?lang=ita&amp;url=https:\/\/raw.githubusercontent.com\/clarin-eric\/LRS-Hackathon\/master\/samples\/resources\/txt\/hermes-it.txt&#8221;<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">curl -X GET 
&quot;http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/tab\/tokenizer\/lrs?lang=ita&amp;url=https:\/\/raw.githubusercontent.com\/clarin-eric\/LRS-Hackathon\/master\/samples\/resources\/txt\/hermes-it.txt&quot;<\/p>\n<ul>\n<li>with wget:<\/li>\n<\/ul>\n<p style=\"padding-left: 30px; text-align: left;\">wget &quot;http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/wl\/tokenizer\/lrs?lang=ita&amp;url=https:\/\/raw.githubusercontent.com\/clarin-eric\/LRS-Hackathon\/master\/samples\/resources\/txt\/hermes-it.txt&quot; [-O out_file]<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">wget &quot;http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/kaf\/tokenizer\/lrs?lang=ita&amp;url=https:\/\/raw.githubusercontent.com\/clarin-eric\/LRS-Hackathon\/master\/samples\/resources\/txt\/hermes-it.txt&quot; [-O out_file]<\/p>\n<p style=\"padding-left: 30px; text-align: left;\">wget &quot;http:\/\/ilc4clarin.ilc.cnr.it\/services\/ltfw\/tab\/tokenizer\/lrs?lang=ita&amp;url=https:\/\/raw.githubusercontent.com\/clarin-eric\/LRS-Hackathon\/master\/samples\/resources\/txt\/hermes-it.txt&quot; [-O out_file]<\/p>\n<p>Please note that the services designed for the Language Resource Switchboard also work on their own when invoked with the commands above.<\/p>\n<p>For plain text, you can use:<\/p>\n<pre style=\"font-size: 10px;\">Mi chiamo Riccardo. 
Abito a Roma<\/pre>\n<p>For TCF input, you can use:<\/p>\n<pre style=\"font-size: 10px;\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;?xml-model href=\"http:\/\/de.clarin.eu\/images\/weblicht-tutorials\/resources\/tcf-04\/schemas\/latest\/d-spin_0_4.rnc\" type=\"application\/relax-ng-compact-syntax\"?&gt;\n    &lt;D-Spin xmlns=\"http:\/\/www.dspin.de\/data\" version=\"0.4\"&gt;\n        &lt;md:MetaData xmlns:xsi=\"http:\/\/www.w3.org\/2001\/XMLSchema-instance\" xmlns:cmd=\"http:\/\/www.clarin.eu\/cmd\/\" \n            xmlns:md=\"http:\/\/www.dspin.de\/data\/metadata\" \n            xsi:schemaLocation=\"http:\/\/www.clarin.eu\/cmd\/ http:\/\/catalog.clarin.eu\/ds\/ComponentRegistry\/rest\/registry\/profiles\/clarin.eu:cr1:p_1320657629623\/xsd\"&gt;\n        &lt;\/md:MetaData&gt;\n            &lt;tc:TextCorpus xmlns:tc=\"http:\/\/www.dspin.de\/data\/textcorpus\" lang=\"it\"&gt;\n                &lt;tc:text&gt;\n                    Mi chiamo Alfredo. Abito a Roma.\n                &lt;\/tc:text&gt;\n            &lt;\/tc:TextCorpus&gt;\n    &lt;\/D-Spin&gt;\n<\/pre>\n<h3>Contacts<\/h3>\n<p>In case of problems, please write an e-mail to the <a href=\"mailto:ILC-Clarin-tech-staff@ilc.cnr.it\">ILC4CLARIN Technical Staff<\/a> with all the information needed to resolve the issue, including the version number.<\/p>\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Linguistic services. This is the Java port of the Perl-based tokenizer developed within the OpeNER project. Our Java port is available on GitHub and Docker. Current software version: 0.2 (released on 04\/10\/2017). 
ILC4CLARIN provides three sets of distinct web services to perform tokenization on texts for the following languages: ita (or it) fra (or fr) &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/ilc4clarin.ilc.cnr.it\/en\/ltfw\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;LTFW (Linguistic Tools For Weblicht)&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[],"class_list":["post-1061","post","type-post","status-publish","format-standard","hentry","category-services"],"_links":{"self":[{"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/posts\/1061","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/comments?post=1061"}],"version-history":[{"count":4,"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/posts\/1061\/revisions"}],"predecessor-version":[{"id":1086,"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/posts\/1061\/revisions\/1086"}],"wp:attachment":[{"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/media?parent=1061"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/categories?post=1061"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ilc4clarin.ilc.cnr.it\/en\/wp-json\/wp\/v2\/tags?post=1061"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}