Processing Big Data Files

Ugur Acar on 24 Oct 2019
Answered: Fangjun Jiang on 24 Oct 2019
I have a 120 MB txt file with around 3,600,000 rows. I need to read this data using the script generated from the Import Data menu.
But when I try to run the script, it gives an out-of-memory error. Is there another way to read data this large?
I have an MSI laptop with an i7-7700HQ CPU @ 2.80 GHz and 8 GB of RAM.
%% Initialize variables.
filename = 'sicaklik.txt';
delimiter = '|';
startRow = 2;
formatSpec = '%s%s%s%s%s%s%s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r','n','UTF-8');
%% Skip the BOM (Byte Order Mark).
fseek(fileID, 3, 'bof');
%% Read columns of data according to the format.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines' ,startRow-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
%% Close the text file.
fclose(fileID);
% Convert the contents of columns containing numeric text to numbers.
%% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
%%
for col=1:length(dataArray)-1
    raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
%%
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,3,4,5,6,7]
    % Converts text in the input cell array to numbers. Replaced non-numeric
    % text with NaN.
    rawData = dataArray{col};
    for row=1:size(rawData, 1)
        % Create a regular expression to detect and remove non-numeric prefixes and
        % suffixes.
        regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
        try
            result = regexp(rawData(row), regexstr, 'names');
            numbers = result.numbers;
            % Detected commas in non-thousand locations.
            invalidThousandsSeparator = false;
            if numbers.contains(',')
                thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
                if isempty(regexp(numbers, thousandsRegExp, 'once'))
                    numbers = NaN;
                    invalidThousandsSeparator = true;
                end
            end
            % Convert numeric text to numbers.
            if ~invalidThousandsSeparator
                numbers = textscan(char(strrep(numbers, ',', '')), '%f');
                numericData(row, col) = numbers{1};
                raw{row, col} = numbers{1};
            end
        catch
            raw{row, col} = rawData{row};
        end
    end
end
%% Split data into numeric and string columns.
rawNumericColumns = raw(:, [1,3,4,5,6,7]);
rawStringColumns = string(raw(:, 2));
%% Make sure any text containing <undefined> is properly converted to an <undefined> categorical
idx = (rawStringColumns(:, 1) == "<undefined>");
rawStringColumns(idx, 1) = "";
%% Create output variable
all_cities = table;
all_cities.Istasyon_No = cell2mat(rawNumericColumns(:, 1));
all_cities.Istasyon_Adi = categorical(rawStringColumns(:, 1));
all_cities.YIL = cell2mat(rawNumericColumns(:, 2));
all_cities.AY = cell2mat(rawNumericColumns(:, 3));
all_cities.GUN = cell2mat(rawNumericColumns(:, 4));
all_cities.SAAT = cell2mat(rawNumericColumns(:, 5));
all_cities.SICAKLIK_C = cell2mat(rawNumericColumns(:, 6));
%% Clear temporary variables
clearvars filename delimiter startRow formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp rawNumericColumns rawStringColumns idx;

Answers (1)

Fangjun Jiang on 24 Oct 2019
Split the large file into smaller files and use a tall array.
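
A minimal sketch of what this suggestion could look like, not a tested solution for the original file. It assumes the header line uses the column names from the script above (Istasyon_No, Istasyon_Adi, YIL, AY, GUN, SAAT, SICAKLIK_C), and it writes a tiny hypothetical sample file (sicaklik_sample.txt, with made-up rows) so that it can run on its own. A datastore reads the text in blocks, so the whole file never has to sit in memory as cell arrays of strings the way it does in the generated import script.

```matlab
% Hypothetical stand-in for sicaklik.txt: same '|' delimiter, assumed header
% names, and invented sample rows (not data from the original file).
sampleFile = fullfile(tempdir, 'sicaklik_sample.txt');
fid = fopen(sampleFile, 'w');
fprintf(fid, 'Istasyon_No|Istasyon_Adi|YIL|AY|GUN|SAAT|SICAKLIK_C\n');
fprintf(fid, '17130|ANKARA|2018|1|1|0|-1.5\n');
fprintf(fid, '17130|ANKARA|2018|1|1|1|-2.5\n');
fprintf(fid, '17064|ISTANBUL|2018|1|1|0|7.0\n');
fclose(fid);

% Evaluate tall arrays in the local MATLAB session (no parallel pool).
mapreducer(0);

% The datastore describes the file(s) without loading them.
ds = tabularTextDatastore(sampleFile, 'Delimiter', '|');

% Option 1: process the data block by block with an explicit loop.
ds.ReadSize = 2;            % rows per block; use a much larger value for real data
rowCount = 0;
while hasdata(ds)
    chunk = read(ds);       % an ordinary table holding one block of rows
    rowCount = rowCount + height(chunk);
end
reset(ds);                  % rewind before reusing the datastore

% Option 2: wrap the datastore in a tall table and defer the computation.
tt = tall(ds);
meanTemp = gather(mean(tt.SICAKLIK_C, 'omitnan'));   % gather triggers the read

fprintf('%d rows, mean temperature %.2f C\n', rowCount, meanTemp);
```

If the large file is split into several smaller files, the same datastore can be pointed at the folder or at a wildcard pattern instead of a single file name. For the real file, the byte order mark and the actual header names may need extra handling that this sketch does not cover.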
