EigenstratFormat
EigenstratFormat.EigenstratFormat — Module
Methods for the Eigenstrat format that is commonly used in genetics.
EigenstratFormat._alleles — Method
_alleles(bytes::Vector{UInt8}, idx::Int64)
Return the number of variant alleles for a specified individual.
bytes: Row of bytes that encode an SNP for all individuals.
idx: Index of the individual.
EigenstratFormat._bitpair — Method
_bitpair(byte::UInt8, pos::Int64)
Extract 2 bits from a byte. The bit pair positions start at 0. Allowed positions are 0, 1, 2 and 3.
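The following is a minimal sketch of how a bit pair lookup like the one described for _bitpair and _alleles could work. It is not the package's implementation. The names bitpair_sketch and alleles_sketch are hypothetical, and the sketch assumes that position 0 is stored in the two most significant bits of a byte and that idx is a 1-based index.

```julia
# Hypothetical helper, not the package's `_bitpair`.
# Assumption: bit pair 0 occupies the two most significant bits of the byte.
function bitpair_sketch(byte::UInt8, pos::Integer)
    0 <= pos <= 3 || throw(ArgumentError("pos must be 0, 1, 2 or 3"))
    shift = 6 - 2 * pos            # pos 0 -> bits 7..6, pos 3 -> bits 1..0
    return (byte >> shift) & 0x03  # keep only the selected 2 bits
end

# Hypothetical helper, not the package's `_alleles`.
# Assumption: `idx` is 1-based and four individuals share one byte.
function alleles_sketch(bytes::Vector{UInt8}, idx::Integer)
    byte = bytes[div(idx - 1, 4) + 1]          # byte that holds the individual
    return bitpair_sketch(byte, (idx - 1) % 4) # bit pair inside that byte
end
```

With these assumptions, the byte 0b00011011 holds the values 0, 1, 2 and 3 at positions 0 to 3.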
EigenstratFormat._encode — Method
_encode(
genotype::Tuple{Char, Char},
byte::UInt8,
position::Int64,
reference_snp::Tuple{Char, Char}
)
Encode the given genotype in a byte at the given bitpair position (0..3).
Each SNP is characterized by a Tuple of two alleles. The reference_snp contains the reference allele and the derived allele.
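A sketch of such an encoding step is shown here. It is an illustration only and not the package's _encode. The name encode_sketch is hypothetical. The sketch assumes that the stored value is the number of copies of the first allele in reference_snp, that a genotype containing any other allele is stored as 3 (missing), and that position 0 is the most significant bit pair. Which allele is counted should be checked against the package source.

```julia
# Hypothetical helper, not the package's `_encode`.
# Assumptions: the code counts copies of reference_snp[1]; genotypes with an
# allele outside reference_snp are stored as 3 (missing); bit pair 0 is the
# most significant pair of the byte.
function encode_sketch(
    genotype::Tuple{Char, Char},
    byte::UInt8,
    position::Integer,
    reference_snp::Tuple{Char, Char}
)
    0 <= position <= 3 || throw(ArgumentError("position must be 0, 1, 2 or 3"))
    code = if all(a -> a in reference_snp, genotype)
        UInt8(count(==(reference_snp[1]), genotype))
    else
        0x03
    end
    shift = 6 - 2 * position
    mask = ~(0x03 << shift)                # clears the target bit pair
    return (byte & mask) | (code << shift) # writes the new 2-bit code
end
```

The other three bit pairs of the byte are left unchanged, so a row can be filled one individual at a time.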
EigenstratFormat._hashit — Method
_hashit(sequence::String)
Calculate the hashsum of a String.
This is basically the same method as in Nick Patterson's original C code.
However, the original C version uses 32-bit integer values and relies on integer overflow, which may result in undefined behavior on some machines. This method uses 64-bit integers to circumvent this problem.
This method will probably fail if the sequence contains non-ASCII Unicode characters. I do not know if this is defined in any way.
EigenstratFormat._hashsum — Method
_hashsum(sequences::Vector{<:AbstractString})
Calculate the hashsum of a Vector of Strings using the _hashit() method.
This is the same method as the hasharr function in Nick Patterson's C code.
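The idea of doing the arithmetic in 64-bit integers and keeping only the low 32 bits can be sketched as follows. This is not the package's code. The names hashit_sketch and hashsum_sketch are hypothetical, and the multipliers 23 and 17 are assumptions that should be checked against the EIG source before relying on the resulting values.

```julia
# Mask that keeps the low 32 bits of a 64-bit intermediate result.
const MASK32 = Int64(0xffffffff)

# Hypothetical helper, not the package's `_hashit`.
# Assumption: a multiplicative hash over the bytes of the string with
# multiplier 23 (to be verified against the original C code).
function hashit_sketch(sequence::AbstractString; multiplier::Int64 = 23)
    h = Int64(0)
    for c in codeunits(sequence)
        h = (h * multiplier + Int64(c)) & MASK32  # emulate 32-bit wraparound
    end
    return h
end

# Hypothetical helper, not the package's `_hashsum`.
# Assumption: the per-string hashes are folded with multiplier 17 and XOR.
function hashsum_sketch(sequences::Vector{<:AbstractString}; multiplier::Int64 = 17)
    h = Int64(0)
    for s in sequences
        h = xor((h * multiplier) & MASK32, hashit_sketch(s))
    end
    return h
end
```

Because every intermediate value is masked, the result never exceeds 32 bits and no overflow of the 64-bit type occurs.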
EigenstratFormat._hashsum — Method
_hashsum(sequences::Vector{<:AbstractString})
Calculate the hashsum of a Vector of Strings using the _hashit() method.
Again, this is basically the same method as in the software from David Reich's laboratory: https://github.com/DReichLab/EIG
EigenstratFormat.add_individual — Method
add_individual(
inprefix::AbstractString,
outprefix::AbstractString,
ind_snp_file::AbstractString,
id::AbstractString;
gender = "U",
status = "Control"
)
Add an individual to a database in Eigenstrat format.
The SNPs in the database remain untouched. If the individual displays SNPs that are not listed in the database, or multiallelic SNPs, those SNPs are removed.
inprefix: Prefix of the input database.
outprefix: Prefix of the output database.
ind_snp_file: File containing SNP results for the individual. This should work with files from Family Tree DNA Family Finder, MyHeritage, LivingDNA and 23andMe.
id: ID of the individual. For living persons I recommend the name.
gender: U, F or M (Unknown, Female or Male)
status: Control, Case or a population label.
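The filtering rule described above can be illustrated with a small sketch. It is not the code used by add_individual. The function filter_genotypes_sketch and its Dict-based inputs are hypothetical, and "multiallelic" is interpreted here as a genotype that contains an allele other than the two listed in the database.

```julia
# Hypothetical helper that illustrates the filtering rule only.
# db:    rsid => (allele1, allele2) as listed in the database
# calls: rsid => genotype string of the individual, e.g. "AG"
function filter_genotypes_sketch(
    db::Dict{String, Tuple{Char, Char}},
    calls::Dict{String, String}
)
    kept = Dict{String, Tuple{Char, Char}}()
    for (rsid, genotype) in calls
        haskey(db, rsid) || continue              # SNP not listed in the database
        chars = collect(genotype)
        length(chars) == 2 || continue            # skip incomplete calls
        pair = (chars[1], chars[2])
        all(a -> a in db[rsid], pair) || continue # allele unknown to the database
        kept[rsid] = pair
    end
    return kept
end
```

For example, with a database entry rs1 => ('A', 'G'), the call "AG" is kept, while a call "CA" for an entry ('C', 'T') is dropped.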
EigenstratFormat.hash_ids — Method
hash_ids(filename::AbstractString)
Create the hashsum of .snp and .ind files. SNP and individual files contain an ID in the first column. This method uses those IDs to calculate a hashsum. The sums are needed to store genotype data in packed format.
EigenstratFormat.read_eigenstrat_geno — Method
read_eigenstrat_geno(
genofile::AbstractString,
nsnp::Int64,
nind::Int64;
ind_idx::Vector{Int64} = [i for i = 1:nind]
)
Read a genofile in PackedAncestryMap format. The file must be unzipped.
genofile: filename
nsnp: number of SNPs listed in .snp file.
nind: number of individuals listed in .ind file.
ind_idx: Indices of individuals that should be read from the file.
TODO: Check for comment lines in .snp and .ind files.
File description: The file header starts with GENO or TGENO (transposed GENO). So far, files in the AADR archive seem to be GENO, so this method does not support the transposed TGENO format.
The text format contains one line per genotype:
SNP_ID Sample_ID Number_of_variant_alleles
The packed format:
Each genotype entry has 2 bits. The values 0, 1 and 2 denote the number of variant alleles and 3 denotes a missing value, as described by David Reich's laboratory.
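The layout of a packed GENO file can be sketched as follows. This is a reading aid and not the package's reader. All function names are hypothetical. The sketch assumes that every record, including the header, is max(48, ceil(nind / 4)) bytes long, that the header text has the form "GENO nind nsnp ind_hash snp_hash" with hexadecimal hashes, and that the header is padded with NUL bytes. These details should be verified against the EIG source.

```julia
# Assumption: four individuals per byte and a minimum record length of 48 bytes.
record_length_sketch(nind::Integer) = max(48, cld(nind, 4))

# 0-based byte offset of the record for the 1-based SNP index j.
# The header occupies record 0, so SNP 1 starts one record length into the file.
snp_offset_sketch(j::Integer, nind::Integer) = j * record_length_sketch(nind)

# Hypothetical header parser under the assumptions stated above.
function parse_header_sketch(header::AbstractString)
    text = first(split(header, '\0'))  # drop the NUL padding
    fields = split(text)
    fields[1] == "GENO" || error("unsupported header: $(fields[1])")
    return (
        nind = parse(Int, fields[2]),
        nsnp = parse(Int, fields[3]),
        ind_hash = parse(Int64, fields[4]; base = 16),
        snp_hash = parse(Int64, fields[5]; base = 16),
    )
end
```

Comparing nind and nsnp from the header with the number of lines in the .ind and .snp files is a cheap consistency check before decoding any genotypes.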
EigenstratFormat.read_eigenstrat_ind — Method
read_eigenstrat_ind(indfile::AbstractString)
Read individuals from an Eigenstrat .ind file. The IND file contains information about each individual in the database.
Return a DataFrame consisting of the columns:
ID, Gender, Status
where
Gender: M (male), F (female) or U (unknown).
Status: Case, Control or population label.
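A minimal sketch of parsing a single line of an .ind file into these three fields is given below. It is not the package's reader, which returns a DataFrame. The name parse_ind_line_sketch is hypothetical, and the sketch assumes whitespace-separated fields without embedded spaces.

```julia
# Hypothetical helper, not part of the package.
# Assumption: each line holds ID, Gender and Status separated by whitespace.
function parse_ind_line_sketch(line::AbstractString)
    fields = split(line)  # splits on any run of whitespace
    length(fields) >= 3 || throw(ArgumentError("expected ID, Gender and Status"))
    return (
        ID = String(fields[1]),
        Gender = String(fields[2]),
        Status = String(fields[3]),
    )
end
```

Leading whitespace, which is common in .ind files, is ignored by split.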
EigenstratFormat.read_eigenstrat_snp — Method
read_eigenstrat_snp(snpfile::AbstractString)
Read Eigenstrat .snp file. The SNP file contains information about each SNP.
Return a DataFrame containing the columns:
chromosome, rsid, cM, position, allele1, allele2
EigenstratFormat.read_snp_file — Method
read_snp_file(filename::AbstractString)
Read a file with autosomal results from FTDNA Family Finder, MyHeritage or LivingDNA. This should also work with 23andMe files, but that has not been tested.
Return a DataFrame containing the columns:
rsid, chromosome, position, genotype
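The four columns can be illustrated with a sketch that parses one line of a tab-separated raw data file. This is not the package's parser. The name parse_raw_line_sketch is hypothetical, and the sketch assumes the tab-separated layout with '#' comment lines. Comma-separated files with quoted fields would need different handling.

```julia
# Hypothetical helper, not part of the package.
# Assumption: tab-separated fields in the order rsid, chromosome, position,
# genotype; lines starting with '#' are comments.
function parse_raw_line_sketch(line::AbstractString)
    stripped = strip(line)
    (isempty(stripped) || startswith(stripped, "#")) && return nothing
    fields = split(stripped, '\t')
    length(fields) == 4 || return nothing  # skip malformed lines
    return (
        rsid = String(fields[1]),
        chromosome = String(fields[2]),    # kept as text because of X, Y and MT
        position = parse(Int, fields[3]),
        genotype = String(fields[4]),
    )
end
```

Returning nothing for comments and malformed lines lets a caller filter the parsed lines before building a table.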
EigenstratFormat.write_23andMe — Method
write_23andMe(filename::AbstractString, snptable)
Write a table of SNPs in 23andMe file format.
This method is included to satisfy users who use Plink. Plink supports 23andMe files.
The snp table must satisfy the Tables.jl interface.
The table must contain the columns:
rsid, chromosome, position, genotype
However, exact spelling is not mandatory.
EigenstratFormat.write_eigenstrat_geno — Method
write_eigenstrat_geno(
genofile::AbstractString,
genomatrix::Matrix{UInt8};
ind_hash::Int64 = 0,
snp_hash::Int64 = 0
)
Write a genotype matrix to file in PackedAncestryMap format.
ind_hash: Hashsum of .ind file.
snp_hash: Hashsum of .snp file.
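Packing the genotype values of one SNP into a record can be sketched as follows. This is not the package's writer. The name pack_row_sketch is hypothetical. The sketch assumes four individuals per byte with the first individual in the two most significant bits, zero-filled padding, and a minimum record length of 48 bytes. These are assumptions to check against the EIG source.

```julia
# Hypothetical helper, not the package's writer.
# counts: one value per individual, each 0, 1, 2 or 3 (missing).
function pack_row_sketch(counts::Vector{UInt8}; min_length::Integer = 48)
    all(<=(0x03), counts) || throw(ArgumentError("values must be 0, 1, 2 or 3"))
    row = zeros(UInt8, max(min_length, cld(length(counts), 4)))
    for (i, c) in enumerate(counts)
        shift = 6 - 2 * ((i - 1) % 4)          # first individual -> bits 7..6
        row[div(i - 1, 4) + 1] |= c << shift   # place the 2-bit value
    end
    return row
end
```

Under these assumptions, five individuals with the values 0, 1, 2, 3 and 2 occupy the first two bytes of the record and the remaining bytes stay zero.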
EigenstratFormat.write_eigenstrat_ind — Method
write_eigenstrat_ind(filename::AbstractString, inds::AbstractDataFrame)
Write information about each individual to an .ind file.
The parameter inds contains information about each individual.
The DataFrame must have the columns
ID, Gender, Status
EigenstratFormat.write_eigenstrat_snp — Method
write_eigenstrat_snp(filename::AbstractString, snps::DataFrame)
Write .snp file in Eigenstrat format. The SNP file contains information about each SNP. The SNPs are provided as a DataFrame in parameter snps.
The DataFrame must consist of the following columns:
chromosome, rsid, cM, position, allele1, allele2
Index
EigenstratFormat.EigenstratFormat
EigenstratFormat._alleles
EigenstratFormat._bitpair
EigenstratFormat._encode
EigenstratFormat._hashit
EigenstratFormat._hashsum
EigenstratFormat._hashsum
EigenstratFormat.add_individual
EigenstratFormat.hash_ids
EigenstratFormat.read_eigenstrat_geno
EigenstratFormat.read_eigenstrat_ind
EigenstratFormat.read_eigenstrat_snp
EigenstratFormat.read_snp_file
EigenstratFormat.write_23andMe
EigenstratFormat.write_eigenstrat_geno
EigenstratFormat.write_eigenstrat_ind
EigenstratFormat.write_eigenstrat_snp