---
title: "Processing 1000 Genomes reference data for ancestry estimation"
author: "Hannah Meyer"
date: "`r Sys.Date()`"
output:
    pdf_document:
        fig_caption: yes
        toc: true
        toc_depth: 2
        keep_tex: true
        highlight: pygments
bibliography: references.bib
csl: plos-genetics.csl
vignette: >
  %\VignetteIndexEntry{1000Genomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup knitr, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
# Introduction
Genotype quality control for genetic association studies often includes the need
for selecting samples of the same ethnic background.
To identify individuals of divergent ancestry based on genotypes, the genotypes
of the study population can be combined with genotypes of a reference dataset
consisting of individuals from known ethnicities. 
Principal component analysis (PCA) on this combined genotype panel can then be
used to detect population structure down to the level of the reference dataset.

The following vignette shows the processing steps required to use samples of the
1000 Genomes study [@a1000Genomes2015],[@b1000Genomes2015] as a reference
dataset. Using the 1000 Genomes reference, population structure down to
large-scale continental ancestry can be detected. A step-by-step instruction on
how to train a random forest classifier for population structure is found 
[here](AncestryCheck.html).


# Workflow
## Set-up
We will first set up some bash variables and create directories needed; storing
the names and directories of the reference will make it easy to use 
updated versions of the reference in the future. 
Is is also useful to keep the PLINK log-files for future reference. In order to
keep the data directory tidy, we'll create a directory for the log files and
move them to the log directory here after each analysis step.

```{bash setup, eval=FALSE}
refdir=~/reference
mkdir -p $refdir/plink_log
```

## PLINK software
In addition to [PLINK v1.9](https://www.cog-genomics.org/plink/1.9/), which is a
requirment for the `plinkQC` package, we will also need
[PLINK v2](https://www.cog-genomics.org/plink/2.0/) for processing the
downloaded the dataset. In the following, when `plink` is invoked, this
corresponds to v1.9, whereas `plink2` corresponds to v2.

## Download and decompress 1000 Genomes phase 3 data
1000 Genomes phase III (1000GenomesIII) is available in
[PLINK 2 binary format](https://www.cog-genomics.org/plink/2.0/input#pgen) at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3.
In addition, a sample file with information about the individuals' ancestry
is available. 
The following code chunk downloads and decompresses the data.
The genome build of
these files is the same as the original release of the 1000GenomesIII, namely
CGRCh37. 

NB: CGRCh38 positions in vcf format can be found [here](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/).
The remainder of this vignette will however look at the data processing required
for the 1000GenomesIII available in
[PLINK 2 binary format](https://www.cog-genomics.org/plink/2.0/input#pgen).

NB: the links to the files below are the three boldfaced links on this page:
https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3. The dropbox links
have been updated in the past.
Please refer to the original site and open an issue on github if you notice
a change. 
```{bash download, eval=FALSE}
cd $refdir

pgen=https://www.dropbox.com/s/j72j6uciq5zuzii/all_hg38.pgen.zst?dl=1
pvar=https://www.dropbox.com/s/vx09262b4k1kszy/all_hg38.pvar.zst?dl=1
sample=https://www.dropbox.com/s/2e87z6nc4qexjjm/all_hg38.psam?dl=1

wget $pgen
mv 'all_hg38.pgen.zst?dl=1' all_hg38.pgen.zst
plink2 --zst-decompress all_hg38.pgen.zst > all_hg38.pgen

wget $pvar
mv 'all_hg38.pvar.zst?dl=1' all_hg38.pvar.zst

wget $sample
mv 'all_hg38.psam?dl=1' all_hg38.psam
```

After these steps, the 1000 Genomes dataset can be used 
for training a random forest classifier as described in the corresponding
[vignette](AncestryCheck.html).

## Convert 1000 Genomes phase 3 data to plink 1 binary format
We then convert the
[PLINK 2 binary format](https://www.cog-genomics.org/plink/2.0/input#pgen) to the
(at the moment) more standardly used
[PLINK 1 binary format](https://www.cog-genomics.org/plink/1.9/input#ped).
```{bash convert, eval=FALSE}
plink2 --pfile $refdir/all_hg38 vzs\
      --max-alleles 2 \
      --make-bed \
      --allow-extra-chr 0 \
      --out $refdir/all_hg38
mv $refdir/all_hg38.log $refdir/plink_log/all_hg38_convert.log
```

 
# References