Computing GC Content

Problem

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample dataset:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Thinking process

  1. Identify each ID (name) and the DNA string that belongs to it.

  2. Build a mapping from each ID to its string.

  3. Calculate the GC-content percentage of each string.

  4. Compare the GC-contents and output the highest one.
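
The steps above can be sketched with a plain line-by-line reader before reaching for anything more elaborate. This is only a minimal sketch under assumptions: `parse_fasta`, `gc_content`, and the toy records `id_a` and `id_b` are hypothetical names that are not part of the Rosalind data, and the reader assumes every header line starts with `>`.

```python
def parse_fasta(text):
    """Steps 1 and 2: map each ID to its sequence, joining wrapped lines."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            current_id = line[1:]         # header line: start a new record
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line   # sequence line: extend the record
    return records


def gc_content(seq):
    """Step 3: percentage of G and C symbols in seq."""
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


# Toy input (hypothetical IDs), wrapped over two lines like real FASTA.
toy = ">id_a\nGGCC\nAT\n>id_b\nATAT\nGC\n"
records = parse_fasta(toy)

# Step 4: pick the ID whose sequence has the highest GC-content.
best = max(records, key=lambda name: gc_content(records[name]))
print(best)
print(gc_content(records[best]))
```

Because wrapped lines are joined while reading, no newline characters ever reach the length calculation.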

Solving process

  1. Observe the sample dataset: every record starts with >, so we can use it to split. Recall Python's .split:

    dataset = """>Rosalind_6404
    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
    TCCCACTAATAATTCTGAGG
    >Rosalind_5959
    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
    ATATCCATTTGTCAGCAGACACGC
    >Rosalind_0808
    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
    TGGGAACCTGCGGGCAGTAGGTGGAAT"""
    
    string_split = dataset.split('>')
    
    print(string_split)
    ['', 'Rosalind_6404\nCCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\nTCCCACTAATAATTCTGAGG\n', 'Rosalind_5959\nCCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT\nATATCCATTTGTCAGCAGACACGC\n', 'Rosalind_0808\nCCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC\nTGGGAACCTGCGGGCAGTAGGTGGAAT']
  • This method has some problems: there is an empty string '' at the front of the list, and each record still needs a further split on \n.
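
For comparison, the split-based approach can still be finished by hand. The following is a hedged sketch, not the method used in the rest of this walkthrough: it skips the empty first chunk and uses `str.partition` to separate the header from the sequence lines. The toy input and the names `toy` and `records` are assumptions made for illustration.

```python
# Toy input with hypothetical IDs; real Rosalind IDs look like Rosalind_0808.
toy = ">id_a\nGGCC\nAT\n>id_b\nATAT\nGC"

records = {}
for chunk in toy.split(">"):
    if not chunk:
        continue                                # the empty string before the first '>'
    header, _, body = chunk.partition("\n")     # header is everything up to the first newline
    records[header.strip()] = body.replace("\n", "")  # join the wrapped sequence lines

print(records)
```

This works, but it needs two extra clean-up steps per record, which is the motivation for trying a single pattern next.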
  2. Using regular expressions may be more convenient.
  • Observe the pattern

    1. Each record starts with an ID such as >Rosalind_0000.
    • regular expression: >(Rosalind_\d{4})
    import re
    
    refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n',''), re.S)
  • After this step, each match is an (ID, sequence) pair, so we can use the ID as the key and the sequence as the value, calculate the GC-content of each value, and then output the percentage.

import re

refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n',''), re.S)

for DNA_string_key, DNA_string_value in refine_dataset:
  print(DNA_string_key, DNA_string_value)
Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
  • Next, we can think about how to calculate the GC-content.

Sample Output:

Rosalind_0808
60.919540

example_DNA_string_value = """CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

example_DNA_string_value_1 = """CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT"""

GC_percent_1 = (example_DNA_string_value.count('G') + example_DNA_string_value.count('C'))/len(example_DNA_string_value) *100
print(round(GC_percent_1, 4))


GC_percent_2 = (example_DNA_string_value_1.count('G') + example_DNA_string_value_1.count('C'))/len(example_DNA_string_value_1) *100
print(round(GC_percent_2, 5))
60.2273
60.91954
  • Explanation: len() also counts the \n in the first string, so its length is 88 instead of 87, which lowers the result.
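
One way to guard against this is to remove all whitespace inside the calculation itself, so the result no longer depends on how the sequence was wrapped. This is a minimal sketch; the helper name `gc_content` is an assumption and does not appear in the original notebook.

```python
def gc_content(seq):
    """GC percentage of seq, ignoring newlines and any other whitespace."""
    cleaned = "".join(seq.split())   # str.split() with no argument splits on all whitespace
    gc = cleaned.count("G") + cleaned.count("C")
    return 100 * gc / len(cleaned)


# The Rosalind_0808 record from the sample dataset, wrapped and unwrapped.
wrapped = """CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""
unwrapped = wrapped.replace("\n", "")

print(gc_content(wrapped))
print(gc_content(unwrapped))
```

Both calls give the same value, which agrees with the sample output within Rosalind's 0.001 tolerance.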
import re

dataset = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n',''), re.S)

# Map each ID to its GC-content percentage.
GC_results = {}
for DNA_key, DNA_value in refine_dataset:
  GC_percent = (DNA_value.count('G') + DNA_value.count('C'))/len(DNA_value) *100
  GC_results[DNA_key] = GC_percent

# Pick the ID with the highest GC-content.
max_key = max(GC_results, key = GC_results.get)
print(max_key)
print(round(GC_results[max_key], 6))
Rosalind_0808
60.91954
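
The printed value has five decimals while the sample output shows six (60.919540). round() returns a float, and print() drops the trailing zero. Both are well within Rosalind's 0.001 tolerance, but a format specification can be used if the six-decimal form is wanted. This is a small sketch; the counts 53 and 87 are the G/C total and length of the Rosalind_0808 sample string, obtained by counting, and the variable names are assumptions.

```python
# Rosalind_0808 in the sample: 53 G/C symbols out of 87 (counted from the sample string).
gc_percent = 53 / 87 * 100

rounded = round(gc_percent, 6)       # still a float, so the trailing zero is not printed
formatted = f"{gc_percent:.6f}"      # a string with exactly six decimals

print(rounded)     # 60.91954
print(formatted)   # 60.919540

# Absolute error check against the sample answer, using Rosalind's default tolerance.
expected = 60.919540
print(abs(gc_percent - expected) <= 0.001)   # True
```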

Final Test

import re

# Read the whole FASTA file; the with block closes it automatically.
with open('../file_inputs/rosalind_gc.txt', 'r') as f:
    dataset = f.read()

refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n',''), re.S)

# Map each ID to its GC-content percentage.
GC_results = {}
for DNA_key, DNA_value in refine_dataset:
  GC_percent = (DNA_value.count('G') + DNA_value.count('C'))/len(DNA_value) *100
  GC_results[DNA_key] = GC_percent

# Pick the ID with the highest GC-content.
max_key = max(GC_results, key = GC_results.get)
print(max_key)
print(round(GC_results[max_key], 6))
Rosalind_0277
51.444444