Multiword Extractor – ILC4CLARIN

Language-independent multiword extractor.

This is a PANACEA service. It extracts all candidate multiwords from POS-tagged text in CoNLL format, starting from a pair of POS tags (those of the first and last words of the pattern) within a given window size. The user must know the tagset used in the data in order to set the parameters properly.
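
The extraction idea described above can be sketched as follows. This is only an illustration, not the service's actual code: the function name extract_candidates, the (form, POS) token representation, and the example tags are assumptions made for the sketch. It collects every span that starts with a token tagged apos, ends with a token tagged bpos, and is at most window words long.

```python
from collections import Counter


def extract_candidates(sentence, apos, bpos, window):
    """Return all spans of 2..window tokens that start with a token
    tagged `apos` and end with a token tagged `bpos`.

    `sentence` is a list of (form, pos) pairs (hypothetical layout).
    """
    candidates = []
    for start, (_, pos) in enumerate(sentence):
        if pos != apos:
            continue
        # The span may not exceed `window` tokens or run past the sentence.
        last = min(start + window, len(sentence))
        for end in range(start + 1, last):
            if sentence[end][1] == bpos:
                span = sentence[start:end + 1]
                candidates.append(tuple(form for form, _ in span))
    return candidates


# Example tags are illustrative; use the tagset of your own data.
sentence = [
    ("the", "DET"),
    ("rate", "NOUN"),
    ("of", "ADP"),
    ("unemployment", "NOUN"),
    ("rose", "VERB"),
]

# Raw frequency of each candidate (here a single sentence).
counts = Counter(extract_candidates(sentence, "NOUN", "NOUN", 3))
```

With a window of 3 the sketch finds the span "rate of unemployment"; with a window of 2 it finds nothing, because no two adjacent tokens are both tagged NOUN.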

Input: a CoNLL-07 POS-tagged text file (dependency analysis is not required, but dependency-annotated texts are accepted)
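
As a rough guide to the expected input, the sketch below reads tab-separated, CoNLL-style text with one token per line and a blank line between sentences. The function name read_conll is hypothetical, and the column positions (word form in the second column, POS tag in the fifth) are assumptions based on the usual ten-column layout; which POS column the service actually reads is not stated here.

```python
def read_conll(text, form_col=1, pos_col=4):
    """Parse CoNLL-style text into sentences of (form, pos) pairs.

    Column indices are zero-based and are assumptions for this sketch.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[form_col], fields[pos_col]))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "1\tthe\tthe\tD\tDET\t_\t_\t_\t_\t_\n"
    "2\trate\trate\tN\tNOUN\t_\t_\t_\t_\t_\n"
    "\n"
    "1\tprices\tprice\tN\tNOUN\t_\t_\t_\t_\t_\n"
)
parsed = read_conll(sample)
```

The dependency columns are simply ignored, which is why both plain POS-tagged and dependency-annotated files can be handled the same way in this sketch.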

Output options:
TSV: tabular text format
XML: LMF-XML lexicon data

Optional parameters:
apos = POS of the first word of the search space
bpos = POS of the last word of the search space
domain = Label for the thematic or technical domain of the corpus (for instance: LABOUR, NEWS etc.)
filtering_type = type of filtering for the full multiword (First, Overmean, Sigma)
max_entry_num = the total number of candidates to be shown in the results or inserted in the output lexicon (by default the service prints all candidate multiwords that pass the filter thresholds)
order_by: sets the order in which the candidates are displayed: by raw frequency (frequency), relative frequency (frelative), log-likelihood (ll), or pointwise mutual information (mi)
output_type: tsv or lmf
prefiltering_type: a filter based on the statistics of the word pairs, i.e. applied before the full multiword expressions are extracted. Possible options: average frequency (averagef), maximum frequency (maxf)
property_file: the user may set all these parameters in a single text file to be passed to the service
window: a digit indicating the size of the window for the search space; i.e. the maximum size in terms of words for the candidate expressions to be extracted (for instance: 3)
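
To make the order_by and max_entry_num options more concrete, here is a minimal sketch of ranking candidates by one of the listed statistics. It uses the textbook definition of pointwise mutual information for a word pair; how the service itself computes mi (or ll) is not documented here, so the formula, the function names pmi and order_candidates, the row layout, and the descending sort are all assumptions.

```python
import math


def pmi(pair_count, first_count, second_count, total):
    """Textbook pointwise mutual information (base 2) of a word pair:
    log2( P(a, b) / (P(a) * P(b)) ), estimated from raw counts."""
    return math.log2((pair_count * total) / (first_count * second_count))


def order_candidates(rows, order_by="frequency", max_entry_num=None):
    """Sort candidate rows by the chosen statistic, highest first
    (assumed direction), and optionally keep only the top entries."""
    ranked = sorted(rows, key=lambda row: row[order_by], reverse=True)
    return ranked if max_entry_num is None else ranked[:max_entry_num]


# Hypothetical candidate rows; the keys mirror the order_by values.
rows = [
    {"mw": "rate of unemployment", "frequency": 3, "mi": pmi(4, 4, 4, 16)},
    {"mw": "labour market", "frequency": 5, "mi": pmi(8, 16, 16, 32)},
]

by_frequency = order_candidates(rows, order_by="frequency")
top_by_mi = order_candidates(rows, order_by="mi", max_entry_num=1)
```

Ordering by frequency puts "labour market" first, while ordering by mi puts "rate of unemployment" first, which shows why the choice of order_by can change which candidates survive a max_entry_num cut.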

Please note: in principle the service works on windows of any size n; however, it has only been tested with values up to 5.

The tool functionalities and filtering methods are detailed here.
The code is available here.

URL: SCF Extractor (lang indip) (WSDL)
