Modern embedding-based metrics for evaluation of generated text generally fall into one of two paradigms: discriminative metrics that are trained to directly predict which outputs are of higher quality according to supervised human annotations, and generative metrics that are trained to evaluate text based on the probabilities of a generative model. Both have their advantages; discriminative metrics are able to directly optimize for the problem of distinguishing between good and bad outputs, while generative metrics can be trained using abundant raw text. In this paper, we present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone. We perform an extensive empirical comparison with other existing metrics on 5 datasets, 19 languages and 280 systems, demonstrating the utility of our method. Experimental results show that T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level. We release our code and models at https://github.com/qinyiwei/T5Score.
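The generative paradigm described above scores a hypothesis by how probable a generative model finds it. A minimal sketch of that idea, using a length-normalized log-likelihood over per-token probabilities (the function name and the dummy probabilities are illustrative, not T5Score's actual implementation, which uses mT5 to produce these probabilities):

```python
import math

def generative_score(token_probs):
    """Length-normalized log-likelihood of a hypothesis, given the
    per-token probabilities assigned by some generative model."""
    # Sum log-probabilities, then normalize by length so longer
    # hypotheses are not penalized merely for having more tokens.
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# A fluent hypothesis (high per-token probabilities) receives a
# higher score than a disfluent one (low per-token probabilities).
fluent = generative_score([0.9, 0.8, 0.85])
disfluent = generative_score([0.2, 0.1, 0.3])
```

In practice the probabilities come from a seq2seq model conditioned on the source or reference text, which is what lets such metrics be trained on abundant raw text without human quality annotations.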