tomszar/PopStruct

Population structure analysis

Here is the pipeline for the population structure analysis done for the samples from Shriver's lab. A list of the samples used can be seen here.

Pipeline summary

We first ran a QC procedure on each dataset. After harmonizing the datasets, we merged them with each other, and finally merged the result with the reference samples from 1000 Genomes (1000G) and the Human Genome Diversity Project (HGDP). The pipeline for merging those two reference samples can be found here.

We then applied LD pruning to generate the input files for ADMIXTURE and PCA.
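As a rough illustration of the LD pruning step, the sketch below builds the two plink 1.9 calls usually involved: one to compute the list of approximately independent SNPs, and one to extract them. The helper name `ld_prune_commands`, the file prefixes, and the window, step, and r² values (50 SNPs, 5 SNPs, 0.2) are assumptions for the example; the values actually used are in the pipeline scripts, not stated here.

```python
def ld_prune_commands(bfile, out, window=50, step=5, r2=0.2):
    """Build two plink 1.9 calls: compute the prune list, then extract it.

    The pruning parameters are illustrative defaults, not the pipeline's.
    """
    prune_prefix = f"{out}_prune"
    # First call writes <prune_prefix>.prune.in and <prune_prefix>.prune.out
    find_independent = [
        "plink", "--bfile", bfile,
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", prune_prefix,
    ]
    # Second call keeps only the SNPs listed in the .prune.in file
    keep_independent = [
        "plink", "--bfile", bfile,
        "--extract", f"{prune_prefix}.prune.in",
        "--make-bed", "--out", out,
    ]
    return [find_independent, keep_independent]


# Hypothetical file prefixes, printed rather than executed
for cmd in ld_prune_commands("merged_qc", "merged_pruned"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where plink 1.9 is installed.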

QC procedure

Our QC procedure was done using plink 1.9 for each dataset, both before and after merging across platforms, in the following order:

  1. Remove non-founders, that is, individuals with at least one parent in the dataset, and retain only autosomal chromosomes
  2. Remove SNPs with missing call rates higher than 0.1
  3. Remove SNPs with minor allele frequencies below 0.05
  4. Remove SNPs with Hardy-Weinberg equilibrium p-values less than 1e-50
  5. Remove samples with missing call rates higher than 0.1
  6. Remove one arbitrary individual from each pair with PI_HAT >= 0.25 in an IBD estimation run after LD pruning
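The relatedness filter in the last step needs a little logic beyond plink flags. The sketch below assumes the IBD estimates come from plink's `--genome` output (a whitespace-delimited table with `FID1`, `IID1`, `FID2`, `IID2`, and `PI_HAT` columns) and uses a simple greedy rule: for each pair at or above the threshold, drop the second individual unless one of the two has already been dropped. The function name and the greedy rule are assumptions for illustration; the pipeline scripts may choose individuals differently.

```python
def individuals_to_remove(genome_lines, threshold=0.25):
    """Pick one individual per related pair from plink .genome lines."""
    rows = iter(genome_lines)
    header = next(rows).split()
    col = {name: i for i, name in enumerate(header)}
    removed = []   # (FID, IID) tuples, in the order they were chosen
    dropped = set()
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if float(fields[col["PI_HAT"]]) < threshold:
            continue  # pair is not considered related
        first = (fields[col["FID1"]], fields[col["IID1"]])
        second = (fields[col["FID2"]], fields[col["IID2"]])
        if first in dropped or second in dropped:
            continue  # this pair is already broken up
        dropped.add(second)
        removed.append(second)
    return removed


# Made-up rows in the .genome column layout
example = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
    "F1 A F2 B UN NA 0.0 1.0 0.0 0.50 -1 0.8 1.0 NA",
    "F2 B F3 C UN NA 0.5 0.5 0.0 0.25 -1 0.8 1.0 NA",
    "F1 A F3 C UN NA 1.0 0.0 0.0 0.00 -1 0.7 0.5 2.0",
    "F3 C F4 D UN NA 0.4 0.6 0.0 0.30 -1 0.8 1.0 NA",
]
for fid, iid in individuals_to_remove(example):
    print(fid, iid)
```

Writing the resulting `FID IID` pairs to a text file, one pair per line, gives a list that plink's `--remove` option accepts.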

After merging platforms, the QC procedure was repeated from steps 2 to 5.
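The first five steps map onto standard plink 1.9 filters. Because plink applies filters in its own fixed order when several are given in one call, the sketch below issues one call per step so that the order listed above is preserved. The helper name `qc_commands`, the `first_step` parameter, and the intermediate file prefixes are assumptions for the example, not names taken from the pipeline scripts.

```python
# plink 1.9 flags for QC steps 1 to 5, in the order given in the list above
QC_STEPS = {
    1: ["--filter-founders", "--autosome"],  # keep founders, autosomes only
    2: ["--geno", "0.1"],                    # SNP missing call rate
    3: ["--maf", "0.05"],                    # minor allele frequency
    4: ["--hwe", "1e-50"],                   # Hardy-Weinberg p-value
    5: ["--mind", "0.1"],                    # sample missing call rate
}


def qc_commands(bfile, out, first_step=1):
    """Chain one plink call per QC step, each reading the previous output."""
    commands = []
    current = bfile
    for step in range(first_step, 6):
        step_out = f"{out}_step{step}"
        commands.append(
            ["plink", "--bfile", current]
            + QC_STEPS[step]
            + ["--make-bed", "--out", step_out]
        )
        current = step_out
    return commands


# Per-dataset QC (steps 1 to 5), then the post-merge repeat (steps 2 to 5).
# File prefixes are hypothetical; commands are printed, not executed.
for cmd in qc_commands("dataset_raw", "dataset_qc"):
    print(" ".join(cmd))
for cmd in qc_commands("merged", "merged_qc", first_step=2):
    print(" ".join(cmd))
```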

Merging platforms

Because our datasets were genotyped on different platforms, we harmonized them against the 1000 Genomes Phase 3 (1000G) reference before attempting to merge, to increase the chances of a successful merge. Harmonization resolved unknown strand issues, updated variant IDs, and updated the reference alleles. We kept all SNPs from each dataset at this stage and removed problematic SNPs during the merging steps.
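To make the strand and reference-allele issues concrete, the sketch below classifies a single biallelic SNP by comparing its two alleles with the reference pair. It is only an illustration of the underlying logic, not the harmonization tool used in the pipeline; the function name and category labels are assumptions. A/T and C/G SNPs are only flagged as ambiguous here, since alleles alone cannot tell which strand they are on.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_snp(alleles, ref_alleles):
    """Compare a SNP's (A1, A2) alleles with the reference (A1, A2) pair."""
    a = tuple(x.upper() for x in alleles)
    r = tuple(x.upper() for x in ref_alleles)
    if any(x not in COMPLEMENT for x in a + r):
        return "mismatch"   # indels or missing alleles are not handled here
    if COMPLEMENT[a[0]] == a[1]:
        return "ambiguous"  # A/T or C/G: strand cannot be read from alleles
    if a == r:
        return "match"      # same strand, same allele order
    if a == r[::-1]:
        return "swap"       # same strand, reference allele differs
    flipped = tuple(COMPLEMENT[x] for x in a)
    if flipped == r:
        return "flip"       # opposite strand
    if flipped == r[::-1]:
        return "flip_swap"  # opposite strand and reference allele differs
    return "mismatch"       # alleles disagree with the reference


print(classify_snp(("T", "C"), ("A", "G")))
```

SNPs classified as `mismatch`, and ambiguous SNPs that cannot be resolved some other way, are natural candidates for the problematic SNPs dropped at merge time.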

Pipeline scripts

  1. Initial dataset split and QC
  2. Harmonize genotypes, uploaded to Penn State HPC infrastructure
  3. Merging datasets and QC
  4. Phasing genotypes, using Penn State HPC infrastructure
  5. FineStructure, using Penn State HPC infrastructure
