How to read a UTF-8 encoded text file as a single character vector including white spaces and unicode special characters?
I am trying to read a UTF-8 encoded .txt file, "data.txt", containing sample content like this:
<title>
Fate/kaleid liner Prisma☆Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>
If I try:
data = fileread('data.txt');
Sample read data:
<title>
Fate/kaleid liner Prisma☆Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>
With fileread I lose the UTF-8 encoded special characters: '☆' is misread and comes back as garbled characters instead of '☆'.
If I try:
file = fopen('data.txt','r','n','UTF-8');
data = fscanf(file, '%s');
fclose(file);
Sample read data:
<title>Fate/kaleidlinerPrisma☆Illya(Fate/KaleidLinerPrismaIllya)-MyAnimeList.net</title>
This retains the Unicode characters but loses all the white space characters.
If I try:
file = fopen('data.txt','r','n','UTF-8');
data = textscan(file, '%s');
fclose(file);
Sample read data:
11×1 cell array
{'<title>' }
{'Fate/kaleid' }
{'liner' }
{'Prisma☆Illya' }
{'(Fate/Kaleid' }
......
The result is a cell array split at every white space, even though all the Unicode characters are read correctly.
Is there a way to read the whole file as a single character vector that keeps both the white space and the Unicode characters?
Accepted Answer
More Answers (2)
I wrote the readfile function for this purpose. It returns a cell array with one element per line, but you can concatenate the elements back into one long char array if you prefer.
data=cell2mat(readfile('data.txt'));
Note: this removes all newlines. You can replace them with spaces like this:
data=readfile('data.txt');
data(2,:)={' '};
data=data(:)';
data=cell2mat(data);
6 Comments
Deepu George Kurian
on 14 May 2020
Rik
on 14 May 2020
What error are you getting? If you have a text file for which the reading fails, can you attach it?
You can include this function with any other code you need to run on other machines, so you don't have to worry about portability.
Deepu George Kurian
on 14 May 2020
Rik
on 14 May 2020
I have tested both files and I don't get any error. They are pretty long files, so the number of lines seems to match the array size you mention.
What is your issue with this code?
Deepu George Kurian
on 14 May 2020
Deepu George Kurian
on 15 May 2020
MathWorks Support Team
on 19 Feb 2021
As of MATLAB R2020a, fileread accomplishes the desired task.
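For releases before R2020a, one route that may work with built-ins only is to open the file with an explicit encoding and read it with fread using the '*char' precision, which should keep white space and newlines. The following is a minimal sketch, not a tested recipe for every release; the file name data_sample.txt and the sample text are made up for illustration.

```matlab
% Hypothetical sample file, written here only so the sketch is self-contained.
filename = 'data_sample.txt';
sample = sprintf('<title>\nPrisma☆Illya - MyAnimeList.net\n</title>');
fid = fopen(filename, 'w', 'n', 'UTF-8');
fprintf(fid, '%s', sample);
fclose(fid);

% Read the whole file back as one row char vector.
fid = fopen(filename, 'r', 'n', 'UTF-8');
if fid == -1
    error('Could not open %s', filename);
end
% With a char precision, fread decodes using the encoding passed to fopen;
% [1 Inf] asks for a single row containing everything up to end of file.
data = fread(fid, [1 Inf], '*char');
fclose(fid);
```

Here data is a 1-by-N char vector rather than a cell array, so no concatenation step is needed afterwards.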
1 Comment
This is not quite true, as it doesn't work on all Unicode characters:
% Write an emoji (a character outside the Basic Multilingual Plane) as UTF-8
fid=fopen('foo.txt','w','n','UTF-8');
fprintf(fid,'%s','😀');
fclose(fid);
% Display the raw bytes
fid=fopen('foo.txt','rb');fread(fid).',fclose(fid);
% Show the fileread result
fileread('foo.txt')
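To check what a given release does with such a character, one option is to read the raw bytes and decode them explicitly with native2unicode, then inspect the resulting code units. This is only a sketch: the bytes below are the UTF-8 encoding of U+1F600 written directly with fwrite, and how the decoded result displays may depend on the release and the font in use.

```matlab
% Write the four UTF-8 bytes of U+1F600 directly, independent of any encoding setting.
fid = fopen('foo.txt', 'w');
fwrite(fid, uint8([240 159 152 128]));
fclose(fid);

% Read the file back as raw bytes, without any character decoding.
fid = fopen('foo.txt', 'r');
bytes = fread(fid, [1 Inf], '*uint8');
fclose(fid);

% Decode the bytes explicitly as UTF-8.
str = native2unicode(bytes, 'UTF-8');

% MATLAB stores char data as UTF-16 code units, so a character outside the
% Basic Multilingual Plane is expected to occupy more than one element.
codeUnits = double(str);
disp(codeUnits);
```

Comparing codeUnits with double(fileread('foo.txt')) shows whether the two routes agree on a particular installation.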