EigenstratFormat

EigenstratFormat._alleles - Method
_alleles(bytes::Vector{UInt8}, idx::Int64)

Return the number of variant alleles for a specified individual.

bytes: Row of bytes that encode an SNP for all individuals.

idx: Index of the individual.

EigenstratFormat._bitpair - Method
_bitpair(byte::UInt8, pos::Int64)

Extract 2 bits (one bit pair) from a byte. Bit pair positions start at 0, so the allowed values for pos are 0, 1, 2 and 3.
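The bit pair lookup can be illustrated with a minimal sketch. The names `bitpair_sketch` and `alleles_sketch` are hypothetical and are not the package's internal functions. The sketch assumes that position 0 is the most significant bit pair and that a row stores four individuals per byte in that order; the package's own ordering may differ.

```julia
# Hypothetical helper: return the 2-bit value at pair position `pos` (0..3).
# Assumption: position 0 is the most significant bit pair of the byte.
function bitpair_sketch(byte::UInt8, pos::Int)
    0 <= pos <= 3 || throw(ArgumentError("pos must be in 0:3"))
    shift = 2 * (3 - pos)          # 6, 4, 2 or 0 bits
    return (byte >> shift) & 0x03  # keep only the lowest two bits
end

# Hypothetical helper: 2-bit value for the individual at 1-based index `idx`
# in a packed row that stores four individuals per byte (an assumption).
function alleles_sketch(bytes::Vector{UInt8}, idx::Int)
    byte = bytes[div(idx - 1, 4) + 1]       # byte that holds this individual
    return bitpair_sketch(byte, (idx - 1) % 4)
end
```

Under these assumptions, the byte `0b00011011` holds the values 0, 1, 2 and 3 at positions 0 to 3, so `alleles_sketch(UInt8[0b00011011], 2)` yields 1.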

EigenstratFormat._encode - Method
_encode(
    genotype::Tuple{Char, Char},
    byte::UInt8,
    position::Int64,
    reference_snp::Tuple{Char, Char}
)

Encode the given genotype in a byte at the given bitpair position (0..3).

Each SNP is characterized by a Tuple of two alleles. The reference_snp contains the reference allele and the derived allele.
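A rough sketch of such an encoding step follows. The function `encode_sketch` is hypothetical. It assumes that the stored value counts copies of the derived allele, that a genotype containing any other character is stored as 3 (missing), and that position 0 is the most significant bit pair. None of these details are confirmed by the description above.

```julia
# Hypothetical sketch of writing one genotype into a packed byte.
# Assumptions: the stored value counts copies of the derived allele,
# 3 marks a genotype that cannot be expressed with the two given alleles,
# and position 0 is the most significant bit pair.
function encode_sketch(genotype::Tuple{Char,Char}, byte::UInt8,
                       position::Int, reference_snp::Tuple{Char,Char})
    0 <= position <= 3 || throw(ArgumentError("position must be in 0:3"))
    ref, derived = reference_snp
    if all(a -> a == ref || a == derived, genotype)
        value = UInt8(count(==(derived), genotype))  # 0, 1 or 2
    else
        value = 0x03                                 # treat as missing
    end
    shift = 2 * (3 - position)
    cleared = byte & ~(0x03 << shift)   # zero the target bit pair
    return cleared | (value << shift)   # write the new value
end
```

The byte is returned rather than modified in place, so a caller would fold the function over the four genotypes that share one byte.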

EigenstratFormat._hashit - Method
_hashit(sequence::String)

Calculate the hashsum of a String.

This is basically the same method as in Nick Patterson's original C code.

However, the original C version uses 32 bit integer values and integer overflows which may result in undefined behavior on some machines. This method uses 64 bit integers to circumvent this problem.

This method will probably fail if the sequence contains non-ASCII Unicode characters. I do not know if this is defined in any way.

EigenstratFormat._hashsum - Method
_hashsum(sequences::Vector{<:AbstractString})

Calculate the hashsum of a Vector of Strings using the _hashit() method.

This is the same method as the hasharr function in Nick Patterson's C code.
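The idea of replacing 32 bit overflow with explicit wraparound in 64 bit integers can be sketched as follows. The names `hashit_sketch` and `hashsum_sketch` are hypothetical, and the multipliers 23 and 17 are assumptions chosen for illustration; they should be checked against the EIG source before relying on the results.

```julia
# Mask that keeps the lowest 32 bits of a 64-bit integer.
const MASK32 = Int64(0xffffffff)

# Hypothetical string hash: multiply-and-add over the bytes of the string,
# wrapping to 32 bits after every step. The multiplier is an assumption.
function hashit_sketch(sequence::AbstractString; mult::Integer = 23)
    h = Int64(0)
    for c in codeunits(sequence)        # iterate over bytes, not characters
        h = (h * mult + Int64(c)) & MASK32
    end
    return h
end

# Hypothetical combination of per-string hashes into one value for a vector.
# The multiplier and the use of xor are assumptions.
function hashsum_sketch(sequences::Vector{<:AbstractString}; mult::Integer = 17)
    h = Int64(0)
    for s in sequences
        h = xor((h * mult) & MASK32, hashit_sketch(s))
    end
    return h
end
```

Because the intermediate value never exceeds 32 bits before it is multiplied by a small constant, the 64 bit arithmetic cannot overflow. Iterating over code units also gives non-ASCII input a defined result, although that result need not match what the C code produces.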

EigenstratFormat._hashsum - Method
_hashsum(sequences::Vector{<:AbstractString})

Calculate the hashsum of a Vector of Strings using the _hashit() method.

Again this is basically the same method as in the software from David Reich's laboratory: https://github.com/DReichLab/EIG

EigenstratFormat.add_individual - Method
add_individual(
    inprefix::AbstractString,
    outprefix::AbstractString,
    ind_snp_file::AbstractString,
    id::AbstractString;
    gender = "U",
    status = "Control"
)

Add an individual to a database in Eigenstrat format.

The SNPs in the database remain untouched. SNPs of the individual that are not listed in the database, or that are multiallelic, are removed.

inprefix: Prefix of the input database.

outprefix: Prefix of the output database.

ind_snp_file: File containing SNP results for the individual. This should work with files from Family Tree DNA Family Finder, MyHeritage, LivingDNA and 23andMe.

id: ID of the individual. For living persons I recommend the name.

gender: U, F or M (Unknown, Female or Male)

status: Control, Case or a population label.

EigenstratFormat.hash_ids - Method
hash_ids(filename::AbstractString)

Create the hashsum of .snp and .ind files. SNP and Individual files contain an ID in the first column. This method uses those IDs to calculate a hashsum. The sums are needed to store genotype data in packed format.

EigenstratFormat.read_eigenstrat_geno - Method
read_eigenstrat_geno(
    genofile::AbstractString,
    nsnp::Int64,
    nind::Int64;
    ind_idx::Vector{Int64} = [i for i = 1:nind]
)

Read a genofile in PackedAncestryMap format. The file must be unzipped.

genofile: filename

nsnp: number of SNPs listed in .snp file.

nind: number of individuals listed in .ind file.

ind_idx: Indices of individuals that should be read from the file.

XXX Check for comment lines in .snp and .ind files.

File description: The file header starts with GENO or TGENO (transposed GENO). So far, files in the AADR archive appear to use GENO, so this method does not support the transposed TGENO format.

The text format contains one line per genotype:

SNP_ID Sample_ID Number_of_variant_alleles

The packed format:

Each genotype entry occupies 2 bits. The values 0, 1 and 2 give the number of variant alleles and 3 marks a missing genotype, as described by David Reich's laboratory.
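The packed layout can be made more concrete with a small sketch of the record size and header. Both functions below are hypothetical. They assume that every record is at least 48 bytes long and that the header record is ASCII text of the form `GENO <nind> <nsnp> <ind_hash> <snp_hash>` with the hashes in hexadecimal, padded with zero bytes. This reflects a common description of the format, not the package's code.

```julia
# Record length in bytes: 2 bits per individual, rounded up to whole bytes.
# Assumption: records are at least 48 bytes long so that the header fits.
record_length_sketch(nind::Integer) = max(48, cld(nind, 4))

# Hypothetical header parser for the first record of a packed file.
# Assumed layout: "GENO <nind> <nsnp> <ind_hash> <snp_hash>", zero padded.
function parse_header_sketch(record::Vector{UInt8})
    text = String(filter(!=(0x00), record))   # drop the zero padding
    fields = split(text)
    fields[1] == "GENO" || error("unsupported header: $(fields[1])")
    return (nind = parse(Int, fields[2]),
            nsnp = parse(Int, fields[3]),
            ind_hash = parse(Int, fields[4]; base = 16),
            snp_hash = parse(Int, fields[5]; base = 16))
end
```

With these assumptions, a file with `nsnp` SNPs would contain `nsnp + 1` records of `record_length_sketch(nind)` bytes each, which gives a quick consistency check against the file size.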

EigenstratFormat.read_eigenstrat_ind - Method
read_eigenstrat_ind(indfile::AbstractString)

Read individuals from an Eigenstrat .ind file. The IND file contains information about each individual in the database.

Return a DataFrame consisting of the columns:

ID, Gender, Status

where

Gender: M (male), F (female) or U (unknown).

Status: Case, Control or population label.
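As an illustration of the three-column layout, here is a minimal parser sketch that avoids any package dependency by returning a vector of NamedTuples instead of a DataFrame. The name `parse_ind_sketch` is hypothetical, and skipping blank lines and lines starting with `#` is an assumption rather than documented behaviour.

```julia
# Hypothetical sketch: parse whitespace-separated .ind lines into rows with
# the fields ID, Gender and Status.
# Assumption: blank lines and lines starting with '#' are ignored.
function parse_ind_sketch(io::IO)
    rows = NamedTuple{(:ID, :Gender, :Status), NTuple{3, String}}[]
    for line in eachline(io)
        s = strip(line)
        (isempty(s) || startswith(s, "#")) && continue
        fields = split(s)
        length(fields) >= 3 || error("expected 3 columns, got: $s")
        push!(rows, (ID = String(fields[1]),
                     Gender = String(fields[2]),
                     Status = String(fields[3])))
    end
    return rows
end

# Made-up example input, not taken from a real dataset.
rows = parse_ind_sketch(IOBuffer("I0001 M Control\n   I0002   F   Sardinian\n"))
```

A vector of NamedTuples satisfies the Tables.jl row interface, so it could be passed to a DataFrame constructor if a DataFrame is needed.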

EigenstratFormat.read_eigenstrat_snp - Method
read_eigenstrat_snp(snpfile::AbstractString)

Read Eigenstrat .snp file. The SNP file contains information about each SNP.

Return a DataFrame containing the columns:

chromosome, rsid, cM, position, allele1, allele2

EigenstratFormat.read_snp_file - Method
read_snp_file(filename::AbstractString)

Read a file with autosomal results from FTDNA Family Finder, MyHeritage or LivingDNA. It should also work with 23andMe files, but this has not been tested.

Return a DataFrame containing the columns:

rsid chromosome position genotype

EigenstratFormat.write_23andMe - Method
write_23andMe(filename::AbstractString, snptable)

Write a table of SNPs in 23andMe file format.

This method is included to satisfy users who use Plink. Plink supports 23andMe files.

The snp table must satisfy the Tables.jl interface.

The table must contain the columns:

rsid, chromosome, position, genotype

However, exact spelling is not mandatory.
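The 23andMe raw data layout is tab-separated text with `#` comment lines, which can be sketched as below. The function `write_23andme_sketch` is hypothetical and writes to any IO object; it assumes the rows expose the fields `rsid`, `chromosome`, `position` and `genotype` with exactly these names, which is stricter than the method described above.

```julia
# Hypothetical sketch: write rows as tab-separated text, preceded by a
# '#' comment line that names the columns.
# Assumption: every row has the fields rsid, chromosome, position, genotype.
function write_23andme_sketch(io::IO, rows)
    println(io, "# rsid\tchromosome\tposition\tgenotype")
    for r in rows
        println(io, join((r.rsid, r.chromosome, r.position, r.genotype), '\t'))
    end
end

# Made-up example row, not real genotype data.
io = IOBuffer()
write_23andme_sketch(io, [(rsid = "rs123", chromosome = "1",
                           position = 1000, genotype = "AG")])
out = String(take!(io))
```

Writing to an IO object rather than a file name keeps the sketch easy to test; a file-based wrapper would call it inside `open(filename, "w") do io ... end`.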

EigenstratFormat.write_eigenstrat_geno - Method
write_eigenstrat_geno(
    genofile::AbstractString,
    genomatrix::Matrix{UInt8};
    ind_hash::Int64 = 0,
    snp_hash::Int64 = 0
)

Write a genotype matrix to file in PackedAncestryMap format.

ind_hash: Hashsum of .ind file.

snp_hash: Hashsum of .snp file.

EigenstratFormat.write_eigenstrat_ind - Method
write_eigenstrat_ind(filename::AbstractString, inds::AbstractDataFrame)

Write information about each individual to an .ind file.

The parameter inds contains information about each individual.

The DataFrame must have the columns

ID, Gender, Status

EigenstratFormat.write_eigenstrat_snp - Method
write_eigenstrat_snp(filename::AbstractString, snps::DataFrame)

Write .snp file in Eigenstrat format. The SNP file contains information about each SNP. The SNPs are provided as a DataFrame in parameter snps.

The DataFrame must consist of the following columns:

chromosome, rsid, cM, position, allele1, allele2
