PAIR: Perspective-Aligned Information Retrieval


This repository contains data and code for the paper Measuring and Addressing Indexical Bias in Information Retrieval. For more information, please reach out to the authors:


Caleb Ziems

William Held

Jane Dwivedi-Yu

Diyi Yang

What is PAIR?

🧑‍🤝‍🧑 PAIR is designed to help you identify and mitigate indexical biases in your IR systems. 🧑‍🤝‍🧑 PAIR includes a set of evaluation metrics, data resources, and human-subjects study interfaces that help you measure and experimentally understand the Search Engine Manipulation Effect (SEME).

Setup

From Source

$ git clone https://github.com/SALT-NLP/pair.git
$ cd pair
$ conda create -n pair python=3.9.16
$ conda activate pair
$ pip install -r requirements.txt

Quick Example

You can run this example in the Demo.ipynb Jupyter notebook.

from src.metrics.duo import Duo, get_relevant_corpus, get_relevant_corpus_retrieved, get_relevant_ranking
from src.utils import load_wiki_balance
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval.models import SentenceBERT
from beir.retrieval.evaluation import EvaluateRetrieval

# ----- RETRIEVAL -----
## load the WikiBias_Natural retrieval corpus
corpus, queries, qrels = load_wiki_balance(subset='natural')
## load an IR model from BEIR
retriever = EvaluateRetrieval(DRES(SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16))
## retrieve documents
retrieved = retriever.retrieve(corpus, queries)

# ----- INDEXICAL BIAS EVALUATION -----
## initialize the metric 
d = Duo(embedding_model="sentence-t5-xl", step_size=1, random_state=7)

## load the synthetic corpus for fitting the Duo metric
fit_corpus, fit_queries, fit_qrels = load_wiki_balance(subset='synthetic')

## evaluate on the first query
query_idx = list(retrieved.keys())[0]

## embed documents to polarization scores
d.embed(transform_docs=get_relevant_corpus_retrieved(corpus, retrieved, query_idx, qrels), 
        fit_docs=get_relevant_corpus(fit_corpus, query_idx, fit_qrels),
       )

## compute the DUO score
duo_score = d.Duo(ranking=get_relevant_ranking(retrieved, query_idx, qrels))
print(duo_score)
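
The snippet above scores a single query. To report a corpus-level number, you can average DUO over every query in the retrieval results. The loop below is a minimal sketch that only reuses the objects defined above; the scores accumulator and the per-query re-initialization of the metric are our own additions, not part of the repo's documented API.

## ----- AGGREGATE EVALUATION (sketch) -----
## average the DUO score over every retrieved query, re-fitting the
## metric per query with its matching synthetic documents
scores = []
for query_idx in retrieved:
    d = Duo(embedding_model="sentence-t5-xl", step_size=1, random_state=7)
    d.embed(transform_docs=get_relevant_corpus_retrieved(corpus, retrieved, query_idx, qrels),
            fit_docs=get_relevant_corpus(fit_corpus, query_idx, fit_qrels),
           )
    scores.append(d.Duo(ranking=get_relevant_ranking(retrieved, query_idx, qrels)))

print(f"mean DUO over {len(scores)} queries: {sum(scores) / len(scores):.4f}")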

Datasets

You can view the WikiBalance datasets on Hugging Face.

| Dataset               | Huggingface Name                | Gold Labels Type | Topics | Queries | Documents |
|-----------------------|---------------------------------|------------------|--------|---------|-----------|
| WikiBalance Synthetic | SALT-NLP/wiki-balance-synthetic | test             | 1.4k   | 4k      | 31.5k     |
| WikiBalance Natural   | SALT-NLP/wiki-balance-natural   | test             | 288    | 452     | 4.6k      |
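
The load_wiki_balance helper used in the Quick Example returns BEIR-style corpus, queries, and qrels dictionaries. If you want the raw data instead, the datasets can also be pulled directly from the Hugging Face Hub. A minimal sketch, assuming only the dataset IDs from the table above (inspect the returned objects for the actual splits and columns):

## sketch: load WikiBalance directly from the Hugging Face Hub;
## only the dataset IDs come from the table above
from datasets import load_dataset

synthetic = load_dataset("SALT-NLP/wiki-balance-synthetic")
natural = load_dataset("SALT-NLP/wiki-balance-natural")

## print the DatasetDicts to see the available splits and features
print(synthetic)
print(natural)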

System Audits

You can replicate all system audits from Tables 4 and 5 in the paper by running the following script:

bash run_audit.sh

Only BM-25 and ColBERT require special setup to run. To set up ColBERT, follow the [BEIR demo instructions here](https://github.com/beir-cellar/beir/tree/main/examples/retrieval/evaluation/late-interaction). To run BM-25, use the following steps:

On Mac

  1. Download elasticsearch.zip and unpack it locally: elastic.co/downloads/elasticsearch
  2. Edit config/elasticsearch.yml to disable security features, setting xpack.security.enabled, xpack.security.http.ssl.enabled, and xpack.security.transport.ssl.enabled to false
  3. Move to the elasticsearch directory and start the server with bin/elasticsearch
  4. Run the audit with python -m src.modeling.run_bm25 --dataset "idea/wiki" --model "bm25" (a BEIR-based retrieval sketch follows below)

On Linux

Follow these instructions: linuxize.com/post/how-to-install-elasticsearch-on-ubuntu-18-04
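
With Elasticsearch running on localhost, BM-25 retrieval can also be driven directly through BEIR's lexical search wrapper. The following is a minimal sketch, not the repo's run_bm25 script: the index name "pair-demo" is arbitrary, and we reuse the load_wiki_balance loader from the Quick Example.

from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval
from src.utils import load_wiki_balance

## load the WikiBalance_Natural retrieval corpus (as in the Quick Example)
corpus, queries, qrels = load_wiki_balance(subset='natural')

## point BEIR at the local Elasticsearch instance; "pair-demo" is an
## arbitrary index name, and initialize=True (re)builds the index
model = BM25(index_name="pair-demo", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(model)

## retrieve documents with BM25
retrieved = retriever.retrieve(corpus, queries)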

Validations and Additional Experiments

  1. To print the summary tables from the paper, run print_tables.py from the main directory.
  2. To replicate our metric validations in Table 2 (as well as Tables 6 and 7 in the Appendix), run python -m src.experiments.metric_validation
  3. To replicate the SEME experiments, you can do the following:
     a. Re-run the experiments with your own participants using the HIT interface, hit/seme/hit_pair_seme.html, OR
     b. Download the experimental data from [this Drive link](https://drive.google.com/file/d/1TXKZueZFo_VbzMyui-V5YkQVvixysQuA/view?usp=drive_link) and place it in the hit/seme directory.
     c. Run python -m src.experiments.seme_experiment
