Problem
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample dataset:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Thinking process
Identify each sequence name (ID) and the DNA string that belongs to it.
Build a mapping from each name to its string.
Calculate the GC-content percentage of each string.
Compare the GC-contents and output the highest one.
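The steps above can be sketched end to end as follows. This is only a minimal sketch that reads the records line by line; the helper names `parse_fasta` and `gc_content` are hypothetical and introduced here for illustration.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its sequence joined across lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A header line starts a new record
            name = line[1:]
            records[name] = ''
        elif name is not None:
            # A sequence line is appended to the current record
            records[name] += line
    return records


def gc_content(seq):
    """Return the percentage of G and C characters in seq."""
    return (seq.count('G') + seq.count('C')) / len(seq) * 100


sample = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

records = parse_fasta(sample)
# Pick the ID whose sequence has the highest GC-content
best_id = max(records, key=lambda name: gc_content(records[name]))
print(best_id)
print(round(gc_content(records[best_id]), 6))
```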
Solving process
Looking at the sample dataset, every record starts with >, so we can use it as a separator. Recall Python's .split method.
dataset = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""
string_split = dataset.split('>')
print(string_split)
['', 'Rosalind_6404\nCCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\nTCCCACTAATAATTCTGAGG\n', 'Rosalind_5959\nCCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT\nATATCCATTTGTCAGCAGACACGC\n', 'Rosalind_0808\nCCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC\nTGGGAACCTGCGGGCAGTAGGTGGAAT']
This method has some problems: the result starts with an empty string '', and each record still needs a further split on \n to separate the name from the sequence.
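Both problems can be worked around without leaving split. A minimal sketch, assuming every record has its name on the first line; the variable names `records`, `header`, and `seq_lines` are illustrative only.

```python
dataset = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

records = {}
for chunk in dataset.split('>'):
    if not chunk.strip():
        continue  # skip the empty piece before the first '>'
    # First line is the name; the remaining lines are the sequence
    header, *seq_lines = chunk.splitlines()
    records[header] = ''.join(seq_lines)

for name, seq in records.items():
    print(name, len(seq))
```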
Using regular expressions may be more convenient.
Observe the pattern:
Each record starts with a header such as >Rosalind_0000, that is, >Rosalind_ followed by four digits.
Regular expression for the header: >(Rosalind_\d{4})
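As a quick check of the header pattern on its own, the sketch below applies it to the sample dataset with `re.findall`, which returns the text captured by the single group. The variable name `ids` is illustrative.

```python
import re

dataset = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

# The parentheses capture only the name, without the leading '>'
ids = re.findall(r'>(Rosalind_\d{4})', dataset)
print(ids)  # ['Rosalind_6404', 'Rosalind_5959', 'Rosalind_0808']
```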
import re
refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n', ''), re.S)
After this step, each match is a (name, sequence) pair that we can treat as a key and a value. We can then use them to calculate the GC-content and output the percentage.
import re
refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n', ''), re.S)
for DNA_string_key, DNA_string_value in refine_dataset:
    print(DNA_string_key, DNA_string_value)
Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
Next, we can think about how to calculate the GC-content.
Sample output:
Rosalind_0808
60.919540
example_DNA_string_value = """CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""
example_DNA_string_value_1 = """CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT"""
GC_percent_1 = (example_DNA_string_value.count('G') + example_DNA_string_value.count('C')) / len(example_DNA_string_value) * 100
print(round(GC_percent_1, 4))
GC_percent_2 = (example_DNA_string_value_1.count('G') + example_DNA_string_value_1.count('C')) / len(example_DNA_string_value_1) * 100
print(round(GC_percent_2, 5))
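Explanation: len() also counts the \n character in the first string, so its denominator is one too large and the first result comes out too low. The line breaks must be removed before counting.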
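One way to guard against this is to strip all whitespace before counting. A minimal sketch, where `gc_percent` is a hypothetical helper name rather than part of the solution in this walkthrough:

```python
def gc_percent(seq):
    """GC-content in percent, ignoring line breaks and other whitespace."""
    cleaned = ''.join(seq.split())
    return (cleaned.count('G') + cleaned.count('C')) / len(cleaned) * 100


with_newline = """CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

# Naive version: the denominator includes the '\n' character
naive = (with_newline.count('G') + with_newline.count('C')) / len(with_newline) * 100
print(round(naive, 4))
# Whitespace-safe version: the denominator counts nucleotides only
print(round(gc_percent(with_newline), 4))
```

The two printed values differ only because of the single line break in the input.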
import re
dataset = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""
refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n', ''), re.S)
GC_results = {}
for DNA_key, DNA_value in refine_dataset:
    # GC-content as a percentage of the sequence length
    GC_percent = (DNA_value.count('G') + DNA_value.count('C')) / len(DNA_value) * 100
    GC_results[DNA_key] = GC_percent
# ID of the string with the highest GC-content
max_key = max(GC_results, key=GC_results.get)
print(max_key)
print(round(GC_results[max_key], 6))
Final Test
import re
# Read the whole input file; the with block closes it automatically
with open('../file_inputs/rosalind_gc.txt', 'r') as f:
    dataset = f.read()
refine_dataset = re.findall(r'>(Rosalind_\d{4})(.+?)(?=>Rosalind_|\Z)', dataset.replace('\n', ''), re.S)
GC_results = {}
for DNA_key, DNA_value in refine_dataset:
    # GC-content as a percentage of the sequence length
    GC_percent = (DNA_value.count('G') + DNA_value.count('C')) / len(DNA_value) * 100
    GC_results[DNA_key] = GC_percent
# ID of the string with the highest GC-content
max_key = max(GC_results, key=GC_results.get)
print(max_key)
print(round(GC_results[max_key], 6))