D3code

This repository contains data resources for the following paper:

GeniL: A Multilingual Dataset on Generalizing Language (under review).

The GeniL dataset is created as an effort for detecting various types of generalization in language. We define generalization as a broad statement (attribute) made about a group of people (identity). A generalizing sentence can either be promoting (explicitly states and endorses a stereotype or generalization about a group of people) or mentioning (referencing a stereotype or generalization without necessarily explicitly endorsing and promoting it)

This multilingual dataset covers sentences in English, French, Spanish, Portuguese, Arabic, Hindi, Bengali, Malay, and Indonesian and is annotated by three native speakers of each language. Each sentence is collected from a public corpora of language (mC4; multilingual Common Crawl) and contains at least one identity group name and an attribute (identity and attribute pairs are selected from the Multilingual SeeGULL dataset.)

Dataset Description

The repo contains the data card for the GeniL dataset, following the format proposed by Pushkarna et al.. The data card includes details of the dataset such as intended usage, field names and meanings, annotator recruitment and payments. Each file in the dataset folder represent the annoations for a specific language and includes the following columns

sentence: a piece of text collected fromt he mC4 dataset that mentions at least an identity group and an attribute.
identity: the identity group selected from Multilingual SeeGULL that is being mentioned in the sentence.
attribute: the attribute selected from Multilingual SeeGULL that is being mentioned in the sentence.
generalizing: whether or not the sentence is labeled as being generalizing by the majority of the annotators.
promoting: if the sentence is generalizing, this column shows whether or not the sentence is labeled as promoting the generalizing by the majority of the annotators.

Citation

@misc{davani2024genil,
      title={GeniL: A Multilingual Dataset on Generalizing Language},
      author={Aida Mostafazadeh Davani and Sagar Gubbi and Sunipa Dev and Shachi Dave and Vinodkumar Prabhakaran},
      year={2024},
      eprint={2404.05866},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
dataset		dataset
GeniL Data Card.pdf		GeniL Data Card.pdf
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

D3code

Dataset Description

Citation

About

Releases

Packages

License

google-research-datasets/GeniL

Folders and files

Latest commit

History

Repository files navigation

D3code

Dataset Description

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages