The smallest substring that can be replaced so that the string has the same number of each character

I am trying to solve a problem that is almost like that. In particular, I am assigned a string s such that s.Length % 4 == 0 and each s[i] is one of 'A' , 'C' , 'T' or 'G' . I want to find the smallest substring that I can replace so that each of 'A' , 'C' , 'T' and 'G' displayed exactly s.Length / 4 times.

For example, with s="GAAATAAA" one optimal solution is to replace the substring "AAATA" with "TTCCG" , which leads to "GTTCCGAA" .

I described my approach in the comments below, and I wonder if it will fundamentally change that it will return me the correct answer.

 using System; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; class Solution { static string ReplacementForSteadiness(string s) { var counter = new Dictionary<char,int>() { { 'A', 0 }, { 'C', 0 }, { 'G', 0 }, { 'T', 0 } }; for(int i = 0; i < s.Length; ++i) counter[s[i]] += 1; int div = s.Length / 4; var pairs = counter.ToList(); if(pairs.All(p => p.Value == div)) return ""; // If here, that means there is an even count of characters in s. For example, if // s = "AAATGTTCTTGCGGGG", then counter = { A -> 3, T -> 5, C -> 2, G -> 6 }, // div = 4, and we know that we need to increase the number of As by 1, decrease // the number of Ts by 1, increase the number of Cs by 2 and decrease the number // of Gs by 2. // The smallest strings to replace will have 1 T and 2 Gs, to be replaced with 1 A and // 2 Cs (The order of characters in the replacement string doesn't matter). // "TGG" --> "ACC" // "GTG" --> "ACC" // "GGT" --> "ACC" // None of those strings exist in s. The next smallest strings that could be replaced // would have 1 T and 3Gs, to be replaced with 1 A and 2 of the Gs to be replaced with // Cs. Or, 2 Ts and 2Gs, 1 of the Ts to be replaced by an A and both the Gs to be replaced // by Cs. // "TGGG" --> "AGCC" // "GTGG" --> "AGCC" // "GGTG" --> "AGCC" // "GGGT" --> "AGCC" // "TTGG" --> "ATCC" // "TGTG" --> "ATCC" // "GTGT" --> "ATCC" // "GGTT" --> "ATCC" // None of those strings exist in s. Etc. string r; // ... return r; } static void Main(String[] args) { Console.ReadLine(); // n string str = Console.ReadLine(); string replacement = ReplacementForSteadiness(str); Console.WriteLine(replacement.Length); } } 
+5
source share
3 answers

If the string already has a balanced character set, then you are done and should not do anything.

Otherwise, you can always solve the problem by replacing null characters that are minimal. You do this by adding all the characters. For example, to take your test case:

GAAATAAA

The character with most occurrences is A with 6. You need 5 extra Gs, 5 extra Ts and 6 extra C. So, replace one A with the necessary characters, including A itself:

GAAATAA [AGGGGGTTTTTCCCCCC]

Since the original A is replaced by A, you have actually replaced the null characters that are the smallest possible.

0
source

I think your solution will work, but its complexity is too high.
Here is an alternative solution
If counting characters in your string returns {'A', 4}, {'C', 6}, {'G', 6}, {'T', 4}, the substring that must start with C or G ends with C or G and have a length> = 2
So, what we need to do is take each line that checks this condition, check whether it contains “bad characters” in our case, one C and one G. If its length = 2, we win, otherwise we save the temporary variable and continue our test

  using System; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; class Solution { static void Main(String[] args) { string[] inputs = { "GAAATAAA", "CACCGCTACCGC", "CAGCTAGC", "AAAAAAAA", "GAAAAAAA", "GATGAATAACCA", "ACGT" }; List<string> replacement = new List<string>(); foreach (var item in inputs) { replacement.Add(StringThatHasToBeReplaced(item)); } } static string StringThatHasToBeReplaced(string s) { var counter = new Dictionary<char, int>() { { 'A', 0 }, { 'C', 0 }, { 'G', 0 }, { 'T', 0 } }; for (int i = 0; i < s.Length; ++i) counter[s[i]] += 1; int div = s.Length / 4; var pairs = counter.ToList(); if (pairs.Where(p => p.Value != div).Count() == 0) { return null; } List<char> surplusCharacter = pairs.Where(p => p.Value > div).Select(p => p.Key).ToList(); int minLength = pairs.Where(p => p.Value > div).Sum(p => p.Value - div); string result = s; for (int i = 0; i < s.Length - minLength + 1; i++) // i is the start index { if (surplusCharacter.Contains(s[i])) { if (minLength == 1) return s[i].ToString(); for (int j = i + minLength - 1; j < s.Length; j++) // j is the end index { if (surplusCharacter.Contains(s[j])) { var substring = s.Substring(i, j - i); if (substring.Length >= result.Length) { break; } // we test if substring can be the string that need to be replaced var isValid = true; foreach (var c in surplusCharacter) { if (substring.Count(f => f == c) < counter[c] - div) { isValid = false; break; } } if (isValid) result = substring; } } } } return result; } } 

I made some changes to handle borderline cases. Here are some test cases and the result I get looks great enter image description here

0
source

Thoughts? Sorry for the messy code + python solution. At first I started writing this on my phone and felt lazy.

 import re from itertools import permutations def find_min(s): freq = {ch:0 for ch in 'ATGC'} for ch in s: freq[ch] += 1 desired_len = int(len(s)/4) fixes = {ch:desired_len-freq[ch] for ch in 'ATGC'} replacement = '' for ch in fixes: adj = fixes[ch] if adj < 0: replacement += ch*(-1*adj) perms = set(permutations(replacement)) m = len(s) to_replace = '' for rep in perms: regex = '.*?'.join([ch for ch in rep]) finds = re.findall(regex,s) if finds: x = sorted(finds, key=lambda x:len(x))[0] if m >= len(x): m = len(x) to_replace = x print_replacement(s, to_replace, fixes) def print_replacement(inp, to_replace, fixes): replacement = '' for ch in to_replace: if fixes[ch] > 0: replacement += ch for ch in fixes: if fixes[ch] > 0: replacement += ch*fixes[ch] print('{0}\t\t- Replace {1} with {2} (min length: {3})'.format(inp ,to_replace, replacement, len(replacement))) def main(): inputs = ["GAAATAAA", "CACCGCTACCGC", "CAGCTAGC", "AAAAAAAA", "GAAAAAAA", "GATGAATAACCA", "ACGT"] for inp in inputs: find_min(inp) if __name__ == '__main__': main() 

Thanks @AnotherGeek for test inputs! There were exits.

 GAAATAAA - Replace AAATA with TCCGT (min length: 5) CACCGCTACCGC - Replace CACCGC with AGAGTT (min length: 6) CAGCTAGC - Replace C with T (min length: 1) AAAAAAAA - Replace AAAAAA with CCGGTT (min length: 6) GAAAAAAA - Replace AAAAA with CCGTT (min length: 5) GATGAATAACCA - Replace ATGAA with TGCGT (min length: 5) ACGT - Replace with (min length: 0) 

I understand that this is damn inefficient. Any suggestions for improvement?

0
source

All Articles