How to read a UTF-8 encoded text file as a single character vector including white spaces and unicode special characters?

I am trying to read a UTF-8 encoded text file, "data.txt", containing sample content like this:
<title>
Fate/kaleid liner Prisma☆Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>
If I try:
data = fileread('data.txt');
Sample read data:
<title>
Fate/kaleid liner Prisma☆Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>
I lose the UTF-8 encoded special characters. Here, '☆' comes back garbled instead of as a single character.
If I try:
file = fopen('data.txt','r','n','UTF-8');
data = fscanf(file, '%s');
fclose(file);
Sample read data:
<title>Fate/kaleidlinerPrismaIllya(Fate/KaleidLinerPrismaIllya)-MyAnimeList.net</title>
This retains the Unicode characters but loses all the white space characters.
If I try:
file = fopen('data.txt','r','n','UTF-8');
data = textscan(file, '%s');
fclose(file);
Sample read data:
11×1 cell array
{'<title>' }
{'Fate/kaleid' }
{'liner' }
{'Prisma☆Illya' }
{'(Fate/Kaleid' }
......
The result is a cell array split at white space, even though all the Unicode characters are read correctly.
Can you suggest a way to overcome this issue?

Accepted Answer

file = fopen('data.txt','r','n','UTF-8');
data = fread(file, [1 inf], '*char');
fclose(file);
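
For anyone who wants to try this end to end, here is a minimal sketch of the same fopen/fread approach with a round-trip check. The file name utf8_demo.txt and the sample text are made up for illustration, and the onCleanup guard is an addition for safety rather than part of the answer above.

```matlab
% Hypothetical demo file and sample text, written out so the sketch is self-contained.
fname = 'utf8_demo.txt';
expected = sprintf('<title>\nPrisma%sIllya - caf%s\n</title>', char(9734), char(233));

% Write the text as UTF-8.
fid = fopen(fname, 'w', 'n', 'UTF-8');
assert(fid ~= -1, 'Could not open %s for writing.', fname);
fprintf(fid, '%s', expected);
fclose(fid);

% Read the whole file back as one 1-by-N char vector, decoding it as UTF-8.
fid = fopen(fname, 'r', 'n', 'UTF-8');
assert(fid ~= -1, 'Could not open %s for reading.', fname);
guard = onCleanup(@() fclose(fid));     % closes the file even if fread errors
data = fread(fid, [1 inf], '*char');    % white space and newlines are kept
clear guard                             % triggers fclose now

% True when the star, the accented letter, and all white space survived.
disp(isequal(data, expected))
```

The [1 inf] size argument is what makes the result a row vector; without it fread returns a column.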

3 Comments

Thanks
This solved the issue of white spaces and most of those unicode characters well.
But now characters like é, à are read as �
Also, my next step is to extract data from this text file, turn it into a table, and store it as a .csv file using writetable(). I checked just now, and the problem with these Unicode characters shows up again when saving with writetable().
Any suggestions?
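
On the writetable() side, one thing worth checking is the encoding of the output file. A hedged sketch follows, assuming your release's writetable accepts the 'Encoding' name-value argument (check the documentation for your version). The table contents and the file name titles_demo.csv are invented for illustration.

```matlab
% Hypothetical table with non-ASCII text (char(233) is e-acute, char(176) is the degree sign).
titles = {['caf' char(233)]; ['Gintama' char(176)]};
T = table(titles, [1; 2], 'VariableNames', {'Title', 'Id'});

% Ask for UTF-8 explicitly instead of relying on the platform default.
% Assumption: this release supports the 'Encoding' option for text files.
outName = 'titles_demo.csv';
writetable(T, outName, 'Encoding', 'UTF-8');

% Verify by reading the raw bytes back and decoding them as UTF-8.
fid = fopen(outName, 'r');
raw = fread(fid, [1 inf], '*uint8');
fclose(fid);
txt = native2unicode(raw, 'UTF-8');
disp(txt)
```

If the characters are already wrong in the table before writing, the output encoding will not repair them, so it is worth inspecting the char codes with double() first.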
You have a problem: 11577.txt and New.txt are ISO-8859-1 (Latin-1) encoded, but 14829.txt is UTF-8 encoded.
It is sometimes possible to tell the difference between the two encodings, but there is no provided routine for doing that.
If you were using R2020a or later, then fileread() would be enough: R2020a improved encoding detection and automatic use of encodings.
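
For older releases, a rough heuristic along the lines described above can be sketched by hand: check whether the raw bytes form structurally valid UTF-8 and fall back to ISO-8859-1 otherwise. This is only an illustration, not a provided routine. It does not reject every invalid sequence (overlong forms and surrogates pass), and the fallback encoding is an assumption about your data.

```matlab
% Sample bytes: 'Gintama' followed by 0xB0, the degree sign in ISO-8859-1.
% For a real file: fid = fopen(name,'r'); bytes = fread(fid,[1 inf],'*uint8'); fclose(fid);
bytes = uint8([71 105 110 116 97 109 97 176]);

isUtf8 = true;
k = 1;
n = numel(bytes);
while k <= n
    b = bytes(k);
    if b < 128
        extra = 0;                      % ASCII byte
    elseif b >= 194 && b <= 223
        extra = 1;                      % lead byte of a 2-byte sequence
    elseif b >= 224 && b <= 239
        extra = 2;                      % lead byte of a 3-byte sequence
    elseif b >= 240 && b <= 244
        extra = 3;                      % lead byte of a 4-byte sequence
    else
        isUtf8 = false;                 % stray continuation or invalid lead byte
        break
    end
    tail = bytes(k+1:min(k+extra, n));
    if k + extra > n || any(tail < 128 | tail > 191)
        isUtf8 = false;                 % truncated or malformed sequence
        break
    end
    k = k + extra + 1;
end

if isUtf8
    enc = 'UTF-8';
else
    enc = 'ISO-8859-1';                 % assumed fallback
end
data = native2unicode(bytes, enc);
fprintf('Decoded as %s: %s\n', enc, data);
```

Pure ASCII input passes the check and is decoded as UTF-8, which is harmless because both encodings agree on ASCII.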
Ahh....... You just reminded me of a stupid mistake I made while acquiring those crude data. Thanks man. I corrected it and it works perfectly now.


More Answers (2)

Rik on 14 May 2020
Edited: Rik on 14 May 2020
I wrote the readfile function for this goal. It will result in a cell array, but you can concatenate them back to a long char array if you prefer.
data=cell2mat(readfile('data.txt'));
Note: this removes all newlines. You can replace them with spaces like this:
data=readfile('data.txt');
data(2,:)={' '};
data=data(:)';
data=cell2mat(data);
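
If you would rather not depend on the interleaving trick, strjoin gives nearly the same result. The sketch below uses a made-up 1-by-N cell array as a stand-in for the output of readfile, so it runs without the File Exchange submission.

```matlab
% Stand-in for the cell array returned by readfile (hypothetical content).
lines = {'<title>', 'Fate/kaleid liner', '</title>'};

% Interleaving approach from above: add a row of spaces, then flatten column-wise.
padded = lines;
padded(2,:) = {' '};
joinedA = cell2mat(padded(:)');

% strjoin puts the delimiter only between elements, so there is no trailing space.
joinedB = strjoin(lines, ' ');

% Use a newline as the delimiter to keep the original line structure instead.
joinedC = strjoin(lines, char(10));

disp(isequal(joinedA(1:end-1), joinedB))
```

The only difference between the two space-joined results is the trailing space that the interleaving approach appends after the last line.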

6 Comments

Thanks.
The first code is working fine except the issue you mentioned yourself.
The second line of the second snippet gives an error, as the output of readfile() I am getting is a 1x1649 cell.
More importantly, what I am really hoping for is a built-in function that satisfies my requirements, because this code is supposed to run on various machines, and not all of them may have the add-on installed.
What error are you getting? If you have a text file for which the reading fails, can you attach it?
You can include this function with any other code you need to run on other machines, so you don't have to worry about portability.
You can find two of the files I am working with here.
I have tested both files and I don't get any error. They are pretty long files, so the number of lines seems to match the array size you mention.
What is your issue with this code?
I am sorry. It's working now. I guess I made some mistake the first time.
But I have encountered another issue with this. The character ° is not read correctly.
I have added a file 'New.txt' which contains this character in the link I shared before.
You can use this code on the extracted character vector to check whether the output is correct.
title = char(extractBetween(data, '<title>', ' - MyAnime'));
Required Output:
' Gintama°: Aizome Kaori-hen'
Observed Output:
' Gintama�: Aizome Kaori-hen'
Sorry man. As Rik pointed out here, that last error was due to my stupidity while acquiring those crude data. Both your answers work perfectly now. I am choosing his, just because I can have one extra function less there. But thanks a lot


As of MATLAB R2020a, fileread accomplishes the desired task.

1 Comment

This is not quite true, as it doesn't work on all Unicode characters:
fid=fopen('foo.txt','w','n','UTF-8');
fprintf(fid,'%s','😀');
fclose(fid);
fid=fopen('foo.txt','rb');fread(fid).',fclose(fid);%display raw bytes
ans = 1×4
240 159 152 128
fileread('foo.txt') % show fileread result
ans = ''
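
One possible workaround is to read the raw bytes yourself and decode them explicitly. This is a sketch under the assumption that native2unicode on your release maps code points above U+FFFF to a UTF-16 surrogate pair (MATLAB char stores UTF-16 code units). The byte values are the ones displayed above.

```matlab
% Raw UTF-8 bytes of U+1F600, as displayed above.
% For a real file: fid = fopen('foo.txt','r'); bytes = fread(fid,[1 inf],'*uint8'); fclose(fid);
bytes = uint8([240 159 152 128]);

% Decode explicitly, so no encoding detection is involved.
txt = native2unicode(bytes, 'UTF-8');
codeUnits = double(txt);

% If decoding produced a surrogate pair, recombine it into the code point.
% 55296 = 0xD800 (high surrogate base), 56320 = 0xDC00 (low surrogate base).
if numel(codeUnits) == 2
    codePoint = (codeUnits(1) - 55296) * 1024 + (codeUnits(2) - 56320) + 65536;
    fprintf('Decoded code point: U+%X\n', codePoint);
else
    fprintf('Unexpected number of code units: %d\n', numel(codeUnits));
end
```

Note that numel(txt) counts code units, not characters, so length-based logic needs care with such characters.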


Release: R2018b

Asked: 14 May 2020

Edited: Rik on 19 Feb 2021
