I am looking for a Python solution to extract multiple sequences from a FASTA file into multiple files, based on matches against a list of header IDs kept in a separate file.
This is a slightly more complex version of the problem posted at "Extract sequences from a FASTA file based on entries in a separate file" and https://www.biostars.org/p/2822/, both of which write only a single output file for all matches.
I am new to Python and am trying to find a way to:
- Take a file of strings that occur in the FASTA headers
- Write all records that match a given string to a separate FASTA file
The header_ID_strings file looks like this:
CAP357_2030_09WPI, CAP357_2040_11WPI, CAP357_2050_13WPI, etc.
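A minimal sketch of reading such an ID list, assuming the strings are separated by commas and/or newlines. The function name `read_id_strings` is hypothetical and only serves this illustration.

```python
def read_id_strings(text):
    """Split comma- and/or newline-separated ID strings, dropping blanks and repeats."""
    ids = []
    for chunk in text.replace("\n", ",").split(","):
        chunk = chunk.strip()
        if chunk and chunk not in ids:
            ids.append(chunk)
    return ids


# Example input in the same layout as the ID file shown above.
example = "CAP357_2030_09WPI, CAP357_2040_11WPI, CAP357_2050_13WPI\n"
print(read_id_strings(example))
```

For a real file the text would come from something like `open(id_file).read()`. Keeping the IDs in a list preserves the order in which they appear in the file.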
A sample of my FASTA file looks like this:
> CAP357_2030_009wpi_v1v3
056_00002_000.4 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGG
> CAP357_2040_011wpi_v1v3
008_00006_001.1 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3
030_00002_000.4 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3
004_00001_000.2 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2050_013wpi_v1v3
047_00002_000.4 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
Expected output:
file1: CAP357_2030_009wpi_v1v3.fasta
> CAP357_2030_009wpi_v1v3
056_00002_000.4 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGG
file2: CAP357_2040_011wpi_v1v3.fasta
> CAP357_2040_011wpi_v1v3
008_00006_001.1 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3
030_00002_000.4 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3
004_00001_000.2 GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
etc.
The code below is taken from the link above, but I want:
* matches to be written to separate output files
* not to have to specify each output file by hand, if possible (I will have up to 30 output files)
#!/usr/bin/env python
import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

# Take the first whitespace-delimited token of each line as a wanted ID.
with open(id_file) as handle:
    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in handle)
print("Found %i unique identifiers in %s" % (len(wanted), id_file))

# Index the FASTA file by record ID, then write all wanted records to one file.
index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")
assert count == len(wanted)
print("Saved %i records from %s to %s" % (count, input_file, output_file))
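The two wishes above (one output file per ID string, with no output file named by hand) could be sketched roughly as follows. This is only an illustration under several assumptions: each ID string appears verbatim somewhere in the matching headers (differences in case or zero padding would need normalizing first), each output file is named after its ID string, and a small plain-Python parser stands in for `SeqIO.parse` so the sketch has no dependencies. The names `parse_fasta` and `split_by_id`, and the sample headers, are hypothetical.

```python
import os
import tempfile
from collections import defaultdict


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_id(records, id_strings, out_dir):
    """Write records whose header contains an ID string to <out_dir>/<ID>.fasta."""
    groups = defaultdict(list)
    for header, seq in records:
        for id_string in id_strings:
            if id_string in header:
                groups[id_string].append((header, seq))
                break  # the first matching ID string wins
    counts = {}
    for id_string, matched in groups.items():
        path = os.path.join(out_dir, id_string + ".fasta")
        with open(path, "w") as handle:
            for header, seq in matched:
                handle.write(">%s\n%s\n" % (header, seq))
        counts[id_string] = len(matched)
    return counts


# Hypothetical records and ID strings, written to a throwaway directory.
sample = [">sampleA_rec1", "ACGT", ">sampleB_rec1", "GGTT", ">sampleA_rec2", "CCAA"]
with tempfile.TemporaryDirectory() as out_dir:
    print(split_by_id(parse_fasta(sample), ["sampleA", "sampleB"], out_dir))
```

Because the file names are derived from the ID strings, the number of output files follows from the ID list rather than from the command line. The sketch holds all matched records in memory before writing, which should be acceptable for around 30 groups but may not suit very large inputs.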
So far this is what I have come up with (script below), but I don't know how to avoid specifying all the files and variables by hand (I have included only three here):
from Bio import SeqIO
import pandas as pd
import sys

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file2020 = sys.argv[3]
output_file2030 = sys.argv[4]
output_file2040 = sys.argv[5]

# The ID file has one column of header IDs per output file.
colnames = ["2020", "2030", "2040"]
headerlist = pd.read_csv(id_file, names=colnames, header=None)
infile = list(SeqIO.parse(input_file, "fasta"))

# Python names cannot start with a digit, so use seq_2020 rather than 2020_seq.
seq_2020 = tuple(headerlist["2020"])
seq_2030 = tuple(headerlist["2030"])
seq_2040 = tuple(headerlist["2040"])

count2020 = 0
count2030 = 0
count2040 = 0

# Open each output once; passing a file name to SeqIO.write inside the loop
# would overwrite the file on every call.
with open(output_file2020, "w") as out2020, \
        open(output_file2030, "w") as out2030, \
        open(output_file2040, "w") as out2040:
    for record in infile:
        if record.id in seq_2020:
            SeqIO.write([record], out2020, "fasta")
            count2020 += 1
        elif record.id in seq_2030:
            SeqIO.write([record], out2030, "fasta")
            count2030 += 1
        elif record.id in seq_2040:
            SeqIO.write([record], out2040, "fasta")
            count2040 += 1
        else:
            print("no matches found")

print("number of 2020 is", count2020)
print("number of 2030 is", count2030)
print("number of 2040 is", count2040)
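One way the hand-written variables and the if/elif chain might be generalized is a single dictionary that maps each header ID to its group, with a `Counter` in place of the separate count variables. This is a sketch under the assumption that header IDs match exactly, as in the script above. The names `build_lookup` and `count_by_group` and the sample IDs are hypothetical.

```python
from collections import Counter


def build_lookup(groups):
    """Invert {group_name: [header IDs]} into {header ID: group_name}."""
    lookup = {}
    for group_name, ids in groups.items():
        for header_id in ids:
            lookup[header_id] = group_name
    return lookup


def count_by_group(record_ids, lookup):
    """Count how many record IDs fall into each group; unmatched IDs are skipped."""
    return Counter(lookup[rid] for rid in record_ids if rid in lookup)


# Hypothetical data standing in for the columns of the ID file.
groups = {"2020": ["seqA", "seqB"], "2030": ["seqC"], "2040": ["seqD"]}
lookup = build_lookup(groups)
counts = count_by_group(["seqA", "seqC", "seqB", "seqX"], lookup)
for group_name in groups:
    print("number of", group_name, "is", counts[group_name])
```

With this layout, adding a group means adding one entry to `groups` rather than a new variable, counter, and `elif` branch. The same `lookup` could also pick the output file for each record.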