
getting the nth term out of a sequence

SANGBIN LEE on 29 Feb 2024
Edited: John D'Errico on 29 Feb 2024
% Define the input and output file names
inputFileName = 'KIF11.txt';
outputFileName = 'CDS.txt';
% Read the sequence from the input file
fid = fopen(inputFileName, 'r');
sequence = fscanf(fid, '%c');
fclose(fid);
% Define the start and end positions of the CDS
cdsStart = 155;
cdsEnd = 3358;
% Extract the CDS from the sequence
cdsSequence = sequence(cdsStart:cdsEnd);
% Write the CDS sequence to a new file
fid = fopen(outputFileName, 'w');
fprintf(fid, '%s', cdsSequence);
fclose(fid);
The code above is supposed to pull out the 155th through the 3358th characters of the sequence in my text file. For some reason, when I run it, the output is the 153rd through the 3356th characters instead. Is something wrong with the code?
  3 comments
SANGBIN LEE on 29 Feb 2024
thank you
Walter Roberson on 29 Feb 2024
sequence = fscanf(fid, '%c');
Beware: the characters returned in sequence will include any end-of-line characters present in the file (possibly carriage return and line feed). Linear indexing into that result is unreliable, because the positions shift depending on whether carriage returns are present.
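A quick way to check which end-of-line characters were read is to count them. The sketch below uses a small made-up string in place of the real file contents, so the variable values are assumptions for illustration only:

```matlab
% Hypothetical stand-in for the file contents: two 4-base lines ending in CR+LF
sequence = ['ACGT' char(13) char(10) 'TTGA' char(13) char(10)];

numCR = sum(sequence == char(13));   % number of carriage returns (code 13)
numLF = sum(sequence == char(10));   % number of line feeds (code 10)

fprintf('CR: %d, LF: %d, total length: %d\n', numCR, numLF, numel(sequence));
```

If either count is nonzero, positions in sequence do not correspond one-to-one with positions in the biological sequence.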


Answers (1)

Dyuman Joshi on 29 Feb 2024
Edited: Dyuman Joshi on 29 Feb 2024
As @Walter has warned, a carriage return character (\r) is being read along with the data -
% Define the input and output file names
inputFileName = 'KIF11.txt';
outputFileName = 'CDS.txt';
% Read the sequence from the input file
fid = fopen(inputFileName, 'r');
sequence = fscanf(fid, '%c');
fclose(fid);
size(sequence)
ans = 1×2
1 3736
% Expected: the last character of the 1st line and the first character of the 2nd line
% The actual output differs: the second character is a carriage return (code 13)
y = sequence(70:71)
y =
'T '
double(y)
ans = 1×2
84 13
Alternatively, you can use textscan here -
Fid = fopen(inputFileName, 'r');
out = textscan(Fid, '%c')
out = 1×1 cell array
{3682×1 char}
seq = out{1};
y = seq(70:71)
y = 2×1 char array
'T'
'G'
% Define the start and end positions of the CDS
cdsStart = 155;
cdsEnd = 3358;
% Extract the CDS from the cleaned sequence (seq, read via textscan, has no line-ending characters)
cdsSequence = seq(cdsStart:cdsEnd);
% Write the CDS sequence to a new file
fid = fopen(outputFileName, 'w');
fprintf(fid, '%s', cdsSequence);
fclose(fid);
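Another option, not used in the answer above, is to read the whole file with fileread and drop every whitespace character before indexing. This is only a sketch: the temporary file, its contents, and the start and end positions are made up for illustration and stand in for KIF11.txt and the real CDS coordinates.

```matlab
% Write a small hypothetical test file (stand-in for KIF11.txt)
tmpFile = [tempname() '.txt'];
fid = fopen(tmpFile, 'w');
fprintf(fid, 'ACGTAC\r\nGTTGCA\r\n');   % two 6-base lines with CR+LF endings
fclose(fid);

% Read the whole file as one char vector, then drop all whitespace
% (carriage returns, line feeds, spaces, tabs)
raw = fileread(tmpFile);
seq = raw(~isspace(raw));

% Positions now count bases only
cdsStart = 5;   % hypothetical positions for this toy input
cdsEnd = 8;
cdsSequence = seq(cdsStart:cdsEnd);
disp(cdsSequence)   % displays ACGT

delete(tmpFile);    % clean up the temporary file
```

Filtering with isspace does not depend on whether the file uses CR, LF, or both as line endings.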
  1 comment
John D'Errico on 29 Feb 2024
Edited: John D'Errico on 29 Feb 2024
+1. I was going to point this out:
find(~ismember(sequence,'CAGT'))
ans =
Columns 1 through 8
71 142 213 284 355 426 497 568
Columns 9 through 16
639 710 781 852 923 994 1065 1136
Columns 17 through 24
1207 1278 1349 1420 1491 1562 1633 1704
Columns 25 through 32
1775 1846 1917 1988 2059 2130 2201 2272
Columns 33 through 40
2343 2414 2485 2556 2627 2698 2769 2840
Columns 41 through 48
2911 2982 3053 3124 3195 3266 3337 3408
Columns 49 through 54
3479 3550 3621 3692 3735 3736
So there are two invisible characters before position 155, at indices 71 and 142, exactly where line-ending characters such as carriage returns would sit. That explains why the extracted sequence appears to be off by exactly 2 characters.
So if you delete those elements first, indexing into the repaired string will work.
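A minimal sketch of that repair, using a short made-up string in place of the data read from KIF11.txt (the string and the index range are assumptions for illustration):

```matlab
% Hypothetical stand-in for the result of fscanf(fid, '%c')
sequence = ['ACGT' char(13) 'TTGA' char(13) 'CCAG' char(13)];

% Locate every character that is not one of the four bases
bad = find(~ismember(sequence, 'CAGT'));

% Delete those elements, leaving only bases
repaired = sequence;
repaired(bad) = [];

% Indexing into the repaired string now counts bases only
sub = repaired(4:6);
disp(sub)   % displays TTT
```

Note that this mask removes anything outside C, A, G, T, so it would also drop ambiguity codes such as N or lowercase bases if the file contained them.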


Release

R2023b
