EGAS00001000644 GoNL aligned sequence data in BAM format.

We mapped the data to the UCSC human reference genome build 37 using BWA 0.5.9-r16. We first aligned the two reads of each pair separately using bwa aln, then used bwa sampe to map the paired reads together, producing a BAM file. The BAM file was then sorted by genomic position and indexed using PicardTools-1.32 SortSam. To prevent PCR artifacts from influencing downstream analysis, we used Picard to mark duplicate reads, which were then ignored in downstream analysis.

We ran the GATK IndelRealigner on our data around known indels (from 1KG Pilot). The IndelRealigner creates all possible read alignments based on the indel source and computes the likelihood of the data containing the indel based on the read pileup. Whenever the maximum-likelihood alignment contains an indel, the reads are realigned accordingly.

Each base is associated with a Phred-scaled base quality score. Calibrating these scores is crucial because some of the downstream analysis models use them. We used GATK to recalibrate the base qualities with respect to (i) the base cycle, (ii) the original quality score, and (iii) the dinucleotide context.

To minimize issues stemming from mapping problems around indels, we performed a second round of indel realignment with the GATK IndelRealigner, this time by family rather than by individual. For this second round, we considered two sources of possible indels: 1KG Phase 1 indels and indels aligned by BWA in the GoNL data.
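The idea behind the base quality recalibration step described above can be illustrated with a minimal Python sketch. It is not the GATK implementation: the function names, the input tuple layout, and the add-one smoothing are assumptions made for illustration. It converts between Phred scores and error probabilities (Q = -10 log10 P) and derives an empirical quality for each combination of cycle, original quality score, and dinucleotide context from observed mismatches.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred-scaled quality Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def build_recalibration_table(observations):
    """Estimate an empirical quality per covariate bin.

    `observations` is a hypothetical input: an iterable of
    (cycle, reported_quality, dinucleotide, is_mismatch) tuples, one per
    sequenced base. The sketch assumes that bases at known variant sites
    were excluded beforehand, so mismatches can be treated as errors.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for cycle, reported_q, dinucleotide, is_mismatch in observations:
        key = (cycle, reported_q, dinucleotide)
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption of this sketch) keeps the
        # estimate finite for bins with no observed mismatches.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = round(error_prob_to_phred(error_rate))
    return table


if __name__ == "__main__":
    # 98 bases reported as Q30 at cycle 5 in context "AC", one of them a mismatch.
    obs = [(5, 30, "AC", False)] * 97 + [(5, 30, "AC", True)]
    print(build_recalibration_table(obs))
```

In this toy input the bases were reported as Q30, but the observed mismatch rate corresponds to a much lower empirical quality, which is the kind of discrepancy recalibration is meant to correct.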

Data and Resources

This dataset has no data

Additional Info

Field Value
Title EGAS00001000644 GoNL aligned sequence data in BAM format.
Keywords
Contact points
Contact point 1
URI http://umcgresearchdatacatalogue.nl/catalogue_rdf/api/rdf/Contacts/firstName=Gert-Jan&lastName=van%20de%20Geijn&resource=EGAS00001000644
Name Gert-Jan van de Geijn
Name (translations)
Email gonl@bbmri.nl
Identifier
URL
Publisher
Publisher 1
URI http://umcgresearchdatacatalogue.nl/catalogue_rdf/api/rdf/Agents/id=UMCG&resource=EGAS00001000644
Name University Medical Center Groningen
Name (translations)
Email researchdatacatalogue@umcg.nl
URL https://www.umcg.nl/
Type
Identifier https://ror.org/03cv38k47
Creator
Creator 1
URI http://umcgresearchdatacatalogue.nl/catalogue_rdf/api/rdf/Agents/id=UMCG&resource=EGAS00001000644
Name University Medical Center Groningen
Name (translations)
Email researchdatacatalogue@umcg.nl
URL https://www.umcg.nl/
Type
Identifier https://ror.org/03cv38k47
Landing page
Release date
Modification date
Temporal start date
Temporal end date
In Series
    Version
    Version notes
    Identifier https://ega-archive.org/datasets/EGAD00001001038
    Frequency
    Provenance
    Type
    Temporal coverage
    Temporal resolution
    Spatial coverage
    Spatial resolution in meters
    Access rights http://publications.europa.eu/resource/authority/access-right/NON_PUBLIC
    Other identifier
    Theme
    1. http://publications.europa.eu/resource/authority/data-theme/HEAL
    Language
    Documentation
    Conforms to
    Is referenced by
    Analytics
    Applicable legislation
    1. http://data.europa.eu/eli/reg/2022/868/oj
    Has version
    Code values
    Coding system
    Purpose
    Health category
    Health theme
    Legal basis
    Minimum typical age
    Maximum typical age
    Number of records
    Number of records for unique individuals.
    Personal data
    Publisher note
    Publisher type
    Trusted Data Holder
    Population coverage
    Retention period
    Health data access body
    Qualified relation
    Provenance activity
    Qualified attribution
    Quality annotations
    URI http://umcgresearchdatacatalogue.nl/catalogue_rdf/api/rdf/CollectionEvents/name=EGAS00001000644%20GoNL%20aligned%20sequence%20data%20in%20BAM%20format.&resource=EGAS00001000644