Search results

Showing 1 - 10 of 11
  • ASAPP: alinhamento semântico automático de palavras aplicado ao português
    Publication . Alves, Ana Oliveira; Rodrigues, Ricardo
    We present two distinct approaches to the ASSIN shared evaluation task in which, given a collection of sentence pairs written in Portuguese, two goals are set for each pair: (a) compute the semantic similarity between the two sentences; and (b) determine whether one sentence of the pair is a paraphrase or an entailment of the other. The first approach, dubbed Reciclagem, relies exclusively on heuristics over semantic networks for the Portuguese language. The second approach, dubbed ASAPP, is based on supervised machine learning. Above all, the results of the Reciclagem approach allow an indirect comparison of a set of semantic networks through their performance on this task. These rather modest results were then used as features of the ASAPP approach, together with additional lexical and syntactic features. After comparison with the gold collection, the ASAPP approach is found to outperform the Reciclagem approach consistently. This holds for both European and Brazilian Portuguese, where performance reaches an accuracy of 80.28% ± 0.019 for textual entailment, while the correlation between the values assigned for semantic similarity and those assigned by humans is 66.5% ± 0.021.
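The similarity result above is reported as a correlation between system scores and human scores. As a minimal sketch of how such a figure can be computed, assuming Pearson correlation (the abstract does not name the coefficient) and using invented scores that are not taken from ASSIN:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical similarity scores, invented purely for illustration.
system_scores = [1.0, 2.5, 3.0, 4.5, 5.0]
human_scores = [1.5, 2.0, 3.5, 4.0, 5.0]
r = pearson(system_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the system ranks and spaces the sentence pairs much as the human annotators did.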
  • Rapport: a fact-based question answering system for Portuguese
    Publication . Rodrigues, Ricardo; Gomes, Paulo Jorge de Sousa; Machado, Fernando Jorge Penousal Martins
    Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces for computer systems can be considered more common these days, the same still does not hold for access to specific textual information. Any full-text search engine can easily retrieve documents containing user-specified or closely related terms; however, it is typically unable to answer user questions with small passages or short answers. The problem with question answering is that text is hard to process, due to its syntactic structure and, to a higher degree, to its semantic contents. At the sentence level, although the syntactic aspects of natural language have well-known rules, the size and complexity of a sentence may make it difficult to analyze its structure. Furthermore, semantic aspects are still arduous to address, with text ambiguity being one of the hardest problems to handle. There is also the need to correctly process the question in order to define its target, and then select and process the answers found in a text. Additionally, the selected text that may yield the answer to a given question must be further processed in order to present just a passage instead of the full text. These issues also take longer to address in languages other than English, such as Portuguese, which have far fewer people working on them. This work focuses on question answering for Portuguese. In other words, our field of interest is the presentation of short answers, passages, and possibly full sentences, but not whole documents, in response to questions formulated in natural language. For that purpose, we have developed a system, RAPPORT, built upon open information extraction techniques for extracting triples, so-called facts, that characterize the information in text files, and then storing and using them to answer user queries posed in natural language.
    These facts, in the form of subject, predicate and object, alongside other metadata, constitute the basis of the answers presented by the system. Facts work both by storing short and direct information found in a text, typically entity-related information, and by containing in themselves the answers to the questions, already in the form of small passages. As for the results, although there is margin for improvement, they are tangible proof of the adequacy of our approach and its different modules for storing information and retrieving answers in question answering systems. In the process, in addition to contributing a new approach to question answering for Portuguese, and validating the application of open information extraction to question answering, we have developed a set of tools that has been used in other natural language processing work, such as a lemmatizer, LEMPORT, which was built from scratch and has a high accuracy. Many of these tools result from improving those found in the Apache OpenNLP toolkit, by pre-processing their input, post-processing their output, or both, and by training models for use in those tools or in others, such as MaltParser. Other tools include interfaces for other resources containing, for example, synonyms, hypernyms, and hyponyms, or rule-based lists of, for instance, relations between verbs and agents.
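To make the idea of facts as subject, predicate and object triples with metadata more concrete, here is a minimal sketch of storing and querying them. The `Fact` and `FactStore` names, the `source` field, and the term-overlap ranking are all assumptions made for illustration; they are not RAPPORT's actual data model or retrieval method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A hypothetical fact record: a triple plus one piece of metadata."""
    subject: str
    predicate: str
    obj: str
    source: str  # assumed metadata: where the fact was extracted from


class FactStore:
    """Toy in-memory store that ranks facts by query-term overlap."""

    def __init__(self):
        self._facts = []

    def add(self, fact):
        self._facts.append(fact)

    def search(self, terms):
        """Return facts containing at least one term, best match first."""
        wanted = {t.lower() for t in terms}
        scored = []
        for fact in self._facts:
            text = f"{fact.subject} {fact.predicate} {fact.obj}".lower()
            overlap = len(wanted & set(text.split()))
            if overlap:
                scored.append((overlap, fact))
        # Stable sort: facts with equal overlap keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored]


store = FactStore()
store.add(Fact("Coimbra", "is a city in", "Portugal", "doc2"))
store.add(Fact("Lisbon", "is the capital of", "Portugal", "doc1"))
answers = store.search(["capital", "Portugal"])
print(answers[0].subject)
```

Because each fact already reads as a short passage, the best-ranked triple can be shown to the user directly as the answer.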
  • NLPyPort: Named Entity Recognition with CRF and Rule-Based Relation Extraction
    Publication . Ferreira, João; Oliveira, Hugo Gonçalo; Rodrigues, Ricardo
    This paper describes the application of the NLPyPort pipeline to Named Entity Recognition (NER) and Relation Extraction in Portuguese, more precisely in the scope of the IberLEF-2019 evaluation task on the topic. NER was tackled with CRF, based on several features, and trained on the HAREM collection, but results were low. This was partly caused by an issue with the submitted model, which had been trained on lowercase text, but apparently also by the training data used, which highlights the different natures of HAREM, the source of the majority of the testing corpus, and SIGARRA. Relations were extracted with a set of rules bootstrapped from the examples provided by the organisation. Despite an F1-score of 0.72, we were the only participants in this task. We also express our doubts concerning the utility of the extracted relations.
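The lowercase issue mentioned above is easier to see with an example of the kind of per-token features a CRF tagger might consume. The sketch below is a hypothetical feature set chosen for illustration, not the one used in NLPyPort; it only shows that capitalisation cues disappear once the text is lowercased.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] (hypothetical feature set)."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_title": token.istitle(),   # e.g. "Coimbra"
        "is_upper": token.isupper(),   # e.g. "UC"
        "is_digit": token.isdigit(),
        "suffix3": token[-3:].lower(),
    }
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


cased = ["A", "Universidade", "de", "Coimbra"]
lowered = [t.lower() for t in cased]

print(token_features(cased, 3)["is_title"])
print(token_features(lowered, 3)["is_title"])
```

With the cased input the last token is flagged as title-cased, a strong hint for a named entity; with the lowercased input that signal is lost.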
  • Using Lucene for Developing a Question-Answering Agent in Portuguese
    Publication . Oliveira, Hugo Gonçalo; Filipe, Ricardo; Rodrigues, Ricardo; Alves, Ana
    Given the limitations of available platforms for creating conversational agents, and that a question-answering agent suffices in many scenarios, we take advantage of the Information Retrieval library Lucene for developing such an agent for Portuguese. The solution described answers natural language questions based on an indexed list of FAQs. Its adaptation to different domains is a matter of changing the underlying list. Different configurations of this solution, mostly at the language analysis level, resulted in different search strategies, which were tested for answering questions about economic activity in Portugal. In addition to comparing the different search strategies, we concluded that, to obtain better answers, it is fruitful to combine the results of different strategies with a voting method.
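The abstract does not say which voting method was used. As one possible reading, the sketch below applies plurality voting over the top hit of each search strategy, with ties going to the strategy listed first. The function name and the FAQ identifiers are hypothetical.

```python
from collections import Counter


def plurality_vote(top_hits):
    """Pick the FAQ id returned as top hit by the most strategies.

    top_hits holds one FAQ id per strategy, in strategy order.
    Ties are broken in favour of the id that appears earliest.
    """
    if not top_hits:
        raise ValueError("at least one strategy result is required")
    counts = Counter(top_hits)
    first_seen = {}
    for index, hit in enumerate(top_hits):
        first_seen.setdefault(hit, index)
    return max(counts, key=lambda hit: (counts[hit], -first_seen[hit]))


# Hypothetical top hits from three differently configured strategies.
winner = plurality_vote(["faq7", "faq3", "faq7"])
print(winner)
```

Other schemes, such as weighting each strategy by its score or voting over ranked lists, would fit the same description; this is only the simplest variant.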
  • AIA-BDE: A Corpus of FAQs in Portuguese and their Variations
    Publication . Oliveira, Hugo Gonçalo; Ferreira, João; Santos, José; Fialho, Pedro; Rodrigues, Ricardo; Coheur, Luísa; Alves, Ana
    We present AIA-BDE, a corpus of 380 domain-oriented FAQs in Portuguese and their variations, i.e., paraphrases or entailed questions, created manually, by humans, or automatically, with Google Translate. It aims to be used as a benchmark for FAQ retrieval and automatic question-answering, but may be useful in other contexts, such as the development of task-oriented dialogue systems, or models for natural language inference in an interrogative context. We also report on two experiments. Matching variations with their original questions was not trivial with a set of unsupervised baselines, especially for manually created variations. Besides the high performance obtained with ELMo and BERT embeddings, an Information Retrieval system was surprisingly competitive when considering only the first hit. In the second experiment, text classifiers were trained with the original questions, and tested when assigning each variation to one of three possible sources, or assigning them as out-of-domain. Here, the difference between manual and automatic variations was not so significant.
  • ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese
    Publication . Alves, Ana; Oliveira, Hugo Gonçalo; Rodrigues, Ricardo; Encarnação, Rui
    Semantic Textual Similarity (STS) aims at computing the proximity of meaning transmitted by two sentences. In 2016, the ASSIN shared task targeted STS in Portuguese and released training and test collections. This paper describes the development of ASAPP, a system that participated in ASSIN, but has been improved since then, and now achieves the best results in this task. ASAPP learns an STS function from a broad range of lexical, syntactic, semantic and distributional features. This paper describes the features used in the current version of ASAPP, and how they are exploited in a regression algorithm to achieve the best published results for ASSIN to date, in both European and Brazilian Portuguese.
  • LemPORT: a High-Accuracy Cross-Platform Lemmatizer for Portuguese
    Publication . Rodrigues, Ricardo; Oliveira, Hugo Gonçalo; Gomes, Paulo
    Although lemmatization is a very common subtask in many natural language processing tasks, there is a lack of available, truly cross-platform lemmatization tools specifically targeted at Portuguese, namely for integration in projects developed in Java. To address this issue, we have developed a lemmatizer, initially just for our own use, but which we have decided to make publicly available. The lemmatizer, presented in this document, yields an overall accuracy of over 98% when compared against a manually revised corpus.
  • NLPPort: A Pipeline for Portuguese NLP
    Publication . Rodrigues, Ricardo; Oliveira, Hugo Gonçalo; Gomes, Paulo
    Although there are tools for some of the most common natural language processing tasks in Portuguese, there is a lack of available cross-platform tools specifically targeted at Portuguese, from end to end, namely for integration in projects developed in Java. To address this issue, we have developed and tweaked, over the last half-dozen years, NLPPort, a set of tools that can be used in a pipelined fashion, which we have made publicly available. In this paper, we present the major features of this set of tools.
  • Assessing Factoid Question-Answer Generation for Portuguese
    Publication . Ferreira, João; Rodrigues, Ricardo; Oliveira, Hugo Gonçalo
    We present work on the automatic generation of question-answer pairs in Portuguese, useful, for instance, for populating the knowledge base of question-answering systems. This includes: (i) a new corpus of close to 600 factoid sentences, manually created from an existing corpus of questions and answers, used as our benchmark; (ii) two approaches for the automatic generation of question-answer pairs, which can be seen as baselines; (iii) results of those approaches on the corpus.