Convert an xml file into a MATLAB structure for easy access to the data.
Wouter Falkena (2021). xml2struct (https://www.mathworks.com/matlabcentral/fileexchange/28518-xml2struct), MATLAB Central File Exchange. Retrieved .
Inspired: Meteomatics Weather API Connector, Download elevations from Google Maps (API key required), xml2struct , with bug fix and added features, acampb311/xml2struct, maxsich/loadSPE, Microscopy Image Browser (MIB), Microscopy Image Browser 2 (MIB2), prowlpush: Prowl Notifications for MATLAB, Freehand Prostate Annotation, colladaParser
Lifesaver!
It is really fast!
I have been using this great code for a long time, but see now that the following xml is not parsed correctly: http://www.steinheim-institut.de:80/cgi-bin/epidat?id=ad2-179-teip5
E.g. the TEI.facsimile contains two children <graphic> but only the first one gets imported.
TEI.teiHeader.profileDesc contains two <language> children after a comment. The comment is imported, but neither of the two <language> elements is.
In TEI.text.body.div.div none of the many <lb> children gets imported.
Thanks. Very handy.
To read from URLs, add this between lines 38 and 39:
elseif contains(file,'http')
    xDoc = xmlread(file);
Perfect!
Nice!
Nice work!
What's up guys! I've noticed we already have some solutions for the "out of memory" bug stated above. Well, I've managed to work around this issue by implementing analogous code in Python. This may help someone. The code follows below:
### Import libs
import xml.etree.ElementTree as ET
import numpy as np
import scipy.io as spy
### xmltools class definition
class xmltools():
    '''
    2020.04
    @author: Laio Marinheiro
    Organize xml data structure as a python dict (which is analogous to a matlab
    struct). This is inspired by xml2struct.m in matlab.
    This also uses
    '''
    def LoadXML(self, xml_filename):
        xml_tree = ET.parse(xml_filename)
        xml_root = xml_tree.getroot()
        return xml_root
    def ChildProp(self, parent):
        child_name = [] # name of each child
        unique_child_names = [] # unique name of children
        child_len = {} # length of each child (number of subchildren of each child)
        child_idx = {} # index of the child on parent
        k = 0
        for child in parent:
            child_name.append(child.tag)
            child_len[child.tag] = len(child) # irrelevant =/
            child_idx[child.tag] = k
            if child.tag not in unique_child_names:
                unique_child_names.append(child.tag)
            k = k + 1
        # count how many times each tag occurs under this parent
        for unique_name in unique_child_names:
            k = 1
            for name in child_name:
                if name == unique_name:
                    child_len[unique_name] = k
                    k = k + 1
        ChildProp = {}
        ChildProp['names'] = child_name
        ChildProp['unique_names'] = unique_child_names
        ChildProp['len'] = child_len
        ChildProp['index'] = child_idx
        return ChildProp
    def GetDict(self, parent):
        Dict = {}
        if len(parent) == 0: # parent is the last field
            if parent.text is not None:
                Dict['Text'] = parent.text
            else:
                Dict['Text'] = '' # empty string
        else: # if parent is not the last, it has children
            childprop = self.ChildProp(parent) # was toolbox.ChildProp, which relied on a global
            if len(parent) == len(childprop['index']): # children are struct fields
                for child in parent:
                    Dict[child.tag] = self.GetDict(child)
            else: # children are elements of list (or lists)
                unique_names = list(childprop['index'].keys())
                for list_name in unique_names: # for each list
                    # np.object was removed in NumPy 1.24; the builtin object works everywhere
                    Dict[list_name] = np.zeros((childprop['len'][list_name],), dtype=object)
                    k = 0
                    for child in parent:
                        if child.tag == list_name:
                            Dict[list_name][k] = self.GetDict(child)
                            k = k + 1
        if parent.attrib != {}:
            Dict['Attributes'] = parent.attrib
        return Dict
    def Dict2Mat(self, xml_filename, dictionary):
        spy.savemat(xml_filename[0:-4]+'.mat', mdict={'pythonxml': dictionary})
    def xml2struct(self, xml_filename):
        xml_root = self.LoadXML(xml_filename)
        xml_dict = {xml_root.tag:{}}
        xml_dict[xml_root.tag] = self.GetDict(xml_root)
        self.Dict2Mat(xml_filename, xml_dict)
### Example
toolbox = xmltools()
xml_filename_str = 'your_filename.xml'
toolbox.xml2struct(xml_filename_str)
Thanks!
Better than mathworks functions!
It works, and is easy to use!
Worked straight out of the box.
thank you!!!
great work! Thank you
where can I find output file of xml2struct
powerful and accurate
I wonder if there is any python equivalent of this script?
well done!
Thanks a lot!
Great stuff! Thanks a lot!
This function works perfectly! Thank you very very much!
Great work! Thanks
Works perfectly for me; this should be part of MATLAB
should be part of MATLAB
Works beautifully and is robust.
Nice work and Thanks!
Super. Great work.
Nice! Exactly what I was looking for!
Nice function. Thank you!
Nice submission!
My advice would be to replace lines 137, 138 and 139 with:
name = matlab.lang.makeValidName(name);
This MATLAB function replaces every character that is not valid in a MATLAB name with an underscore.
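As a rough illustration of that suggestion (the tag names below are made-up examples, not taken from xml2struct itself, and `matlab.lang.makeValidName` needs R2014a or later), the results can be checked with `isvarname`:

```matlab
% Hypothetical XML tag/attribute names that are not valid MATLAB field names.
rawNames = {'my-tag', 'xs:element', 'a.b', '_leading'};

for k = 1:numel(rawNames)
    % makeValidName replaces invalid characters (with underscores by default)
    % and adds a prefix when the name does not start with a letter.
    validName = matlab.lang.makeValidName(rawNames{k});
    fprintf('%-12s -> %s (valid: %d)\n', rawNames{k}, validName, isvarname(validName));
end
```

Note that two different tags can map to the same field name this way (for example `a-b` and `a.b`), so this is a convenience, not a lossless encoding.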
Nice, should be included in MATLAB by default.
Seems to be working well - just wondering how to save the structure back to an XML again? I couldn't find this mentioned in the comments.
very useful for modifying xml files
really useful script, but rather slow on large xmls, the xmlread only used 1/10th of the overall time.
My improvement idea:
change
if (~isempty(regexprep(text.(textflag),'[\s]*','')))
to
if ~all(isspace(text.(textflag)))
and get an overall speedup of a factor of 2 (in my test case at least)
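A quick way to convince yourself that the two tests agree is to run both on a few sample strings. This is only a sanity check on made-up ASCII inputs, not a benchmark, and it does not cover every Unicode whitespace character:

```matlab
% Sample node texts: empty, whitespace-only, and texts with real content.
samples = {'', '   ', sprintf('\n\t '), ' a ', 'data'};

for k = 1:numel(samples)
    s = samples{k};
    hasContentOld = ~isempty(regexprep(s, '[\s]*', ''));  % original test
    hasContentNew = ~all(isspace(s));                     % suggested test
    assert(hasContentOld == hasContentNew);
end
disp('Both tests agree on all samples.')
```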
Thank you so so much!
thanks a lot, great work
thank you
I had to make a few modifications to get my XML file to work. I will put them below, but as this is my first time using this file type, mileage may vary.
Line 95 in version current as of 3/7/2017
children.(name) = text;
That overwrote all of the child nodes that had data stored in them, given that the last node to be parsed was a comment (i.e. it only contained a string). Other nodes contained numerical values held as a string value. Here was my fix:
if isfield(text,'Text')
children.(name) = str2num(text.Text);
else
children.(name).('Comment') = text.Comment;
end
Overall very helpful, and was exactly what I needed after I put the replacement lines in.
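One caveat with the fix above: `str2num` evaluates its input and returns an empty array when the text cannot be converted, so non-numeric node text would be lost. A more cautious variant, sketched here on made-up node texts and assuming each node holds at most one scalar number, converts only when `str2double` succeeds:

```matlab
rawText = {'42', '3.14', 'not a number'};   % hypothetical node texts
values  = cell(size(rawText));

for k = 1:numel(rawText)
    num = str2double(rawText{k});   % NaN when the text is not a single number
    if isnan(num)
        values{k} = rawText{k};     % keep the original text untouched
    else
        values{k} = num;            % store the numeric value
    end
end
```

Texts holding several numbers (for example `'1 2 3'`) stay as strings with this approach.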
Thanks!
That's Great!!! Thank you so much!!!
This is a very useful script! I used it to read AUTOSAR XML files. While reading ARXML files I found two challenges:
1. The long replacement texts for {'-'|':'|'.'} within XML tags can make MATLAB field names longer than 63 characters. I shortened them to {'_'|'c'|'d'}, which helped.
2. If XML comments <!-- Comment --> are used inside the XML file, the script fails. I fixed this issue; have a look:
Replace: Inside the function: parseChildNodes(...)
% CDz 2016-12-21 Commented Out Due to problems with
% XML-Comments
% if(~isempty(fieldnames(text)))
% children.(name){index} = text;
% end
% CDz 2016-12-21 Added to Handle XML - Comments
if(~isempty(text) && isstruct(text))
if find(strcmp(fieldnames(text),'Text'))
children.(name){index}.('Text') = text.Text;
elseif find(strcmp(fieldnames(text),'Comment'))
children.(name){index}.('Comment') = text.Comment;
end
end
and
% CDz 2016-12-21 Commented out due to problems with
% XML-Comments
% if(~isempty(text) && ~isempty(fieldnames(text)))
% children.(name) = text;
% end
% CDz 2016-12-21 Added to Handle XML - Comments
if(~isempty(text) && isstruct(text))
if find(strcmp(fieldnames(text),'Text'))
children.(name).('Text') = text.Text;
elseif find(strcmp(fieldnames(text),'Comment'))
children.(name).('Comment') = text.Comment;
end
end
@ Wouter Falkena: If you are interested, I can provide you with a full copy of this file so that you can update the script.
Well done
Simple and very useful. Very convenient.
I am a PhD student and I need to use this function in my code to get the attributes of an XML file.
Works Great! Tried xml_toolbox but it's broken since 2014. This is a solid replacement.
please I need to edit a xml file
I have looked through the issues, implemented fixes, added some new feature to the script, and uploaded it here : https://www.mathworks.com/matlabcentral/fileexchange/58700-xml2struct . Please also try my updated version and let me know if it works better now.
Neil's fix doesn't do the trick for me...
Doesn't work right out of the box. Stéphane's fix works great!
Also same issue as Sebastien regarding comments and headers.
I received a "java.lang.OutOfMemoryError: GC overhead limit exceeded" when trying to open a Kanji dictionary file - http://www.edrdg.org/kanjidic/kanjidic2.xml.gz
Good, but rather slow for large xml files.
Daniel
Can you update the code? I am having the same problem.
Thanks for this very flexible script! TOP!
this is exactly what I was looking for.
Thanks Wouter,
It's not exactly that, but I appreciate your response. The problem is solved now. I may upload the code if someone has the same problem.
Hi,
Found a bug : When there is text and children in the same node, the text overwrites the children.
Fix:
Replace
if(~isempty(fieldnames(text)))
children.(name){index} = text;
end
by:
if isstruct(text)
for fld=fieldnames(text)'
children.(name){index}.(fld{1}) = text.(fld{1});
end
end
And replace also:
if(~isempty(text) && ~isempty(fieldnames(text)))
children.(name) = text;
end
by:
if isstruct(text)
for fld=fieldnames(text)'
children.(name).(fld{1}) = text.(fld{1});
end
end
Thanks.
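The idea behind this fix can be tried in isolation. The sketch below uses hand-made structs in place of real parse results (the field names `item`, `Text` and `Comment` are illustrative) and shows the text fields being merged into the children, not replacing them:

```matlab
% Hypothetical parse results for one node: one child element plus a comment.
childs = struct('item', struct('Text', '1'));
text   = struct('Comment', ' a note ');

node = childs;                      % start from the parsed children
for fld = fieldnames(text)'         % iterate over a 1-by-N cell row
    node.(fld{1}) = text.(fld{1});  % add Text/Comment next to the children
end

disp(fieldnames(node))              % lists both 'item' and 'Comment'
```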
Simple to use and works great !! Thanks for sharing the work.
It takes a long time to run for larger XMLs. Is there anywhere in the code I can add a waitbar to at least report progress to the user? I tried the two for loops, but they do not seem to be the bottleneck.
Very simple to use and it works.
"Andrew Wilson: The fix from Neill Weiss in an earlier comment/review seems to solve this, so it would be great to see that incorporated into an update!" thanks Andrew Wilson
Works great for the most part, but the issue of nodes being lost when comments are present at the same level of the hierarchy is quite frustrating. The fix from Neill Weiss in an earlier comment/review seems to solve this, so it would be great to see that incorporated into an update!
Seems to work fine except as reported by Sebastien Roy on 09/10/14 - xml comments don't work (resulting in a loss of the other data)
Downloaded this file this evening to process some XML data. worked just fine.
Sorry, pasted the wrong line.
Here is line 154 that fixes the problem for me:
text.(textflag) = char(getTextContent(theNode));
Great stuff.
Regarding that "Undefined function 'toCharArray' for input arguments of type 'double'." Error:
For me it worked to change line 154 into
text.(textflag) = char(getData(theNode))';
as it has been in an earlier version of xml2struct (mentioned in the comments in the code in line 153)
Great time saver when compared to using xmlread directly. However, there is a bug with child nodes when a text is present. The child node content will be set to the text and all other content of the child will be lost. A comment, being processed as text, will cause the same issue. Attempting to read this xml will not provide the expected result:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<!-- Should be a benign comment -->
<mystuff>Valuable data</mystuff>
</root>
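To see why the comment interferes, it may help to list what `xmlread` reports as children of `<root>` for this file. The sketch below only uses `xmlread` and the DOM methods, not xml2struct, and writes the example to a temporary file:

```matlab
% Write the example XML from above to a temporary file.
xmlLines = {'<?xml version="1.0" encoding="UTF-8"?>', ...
            '<root>', ...
            '<!-- Should be a benign comment -->', ...
            '<mystuff>Valuable data</mystuff>', ...
            '</root>'};
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s\n', xmlLines{:});
fclose(fid);

xDoc = xmlread(tmpFile);
delete(tmpFile);

% Collect the node names of all direct children of <root>.
rootNode = xDoc.getDocumentElement();
kids     = rootNode.getChildNodes();
names    = cell(1, kids.getLength());
for k = 1:kids.getLength()
    names{k} = char(kids.item(k-1).getNodeName());
end
disp(names)
```

The comment shows up as a `#comment` node at the same level as `mystuff`, which is why a parser that treats comments like text can overwrite its siblings.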
Some of the attributes in the XML file had underscores at the beginning, which caused an error because they are not valid field names. A simple strrep solved the problem.
Great!
Excellent
Stop using XML and use the json.org/java [1] static XML.toJSONObject() method [2]; there's a precompiled jar file in my dropbox [3]. Or use James Newton-King's Json.NET [4], which is already precompiled by him and available from codeplex [5]: just download and unzip it, then use the version for the .NET framework on your machine. Converting between XML and JSON is described in the documentation [6] and in this SO post [7]. See the MATLAB documentation for more information on using Java [8] or .NET [9] in MATLAB. It's super easy!
[1] http://json.org/java/
[2] http://json.org/javadoc/org/json/XML.html#toJSONObject(java.lang.String)
[3] https://dl.dropboxusercontent.com/u/19049582/JSON.jar
[4] http://james.newtonking.com/pages/json-net.aspx
[5] https://json.codeplex.com/
[6] http://james.newtonking.com/projects/json/help/index.html?topic=html/ConvertingJSONandXML.htm
[7] http://stackoverflow.com/a/814027/1020470
[8] http://www.mathworks.com/help/matlab/using-java-libraries-in-matlab.html
[9] http://www.mathworks.com/help/matlab/using-net-libraries-in-matlab.html
I've seen some other users report this issue but could not find how to fix this:
Undefined function 'toCharArray' for input arguments of type 'double'.
Any idea?
Regards
Works well.
Didn't fully test for empty field cases like some commenters but I got a nice structure out of my input file.
I am disappointed that a similar functionality isn't built in Matlab. xmlread and xmlwrite alone are such a pain to access and/or update xml data.
Hi,
Thanks for the file, it works great.
But I also have the same problem as Erik with empty data fields. Does anyone know how to fix this?
Faster than xml_read, recommended!
Thanks for the file, however I'm having an issue with empty data fields.
I have a 100x50 XML data set which I can easily import into Excel. However, there are a few fields which are empty. For example, at (5,35:40) the XML data is empty.
When I use the xml2struct and then try and create a cell array in the same format (100x50) the data in row 5 between 40:50, shifts to the 35:45 position and I'm left with 5 empty spaces from 45:50 and as such the data is misaligned.
Any idea on how to deal with empty fields in order to maintain their position in the original file?
Thanks!
i was just wondering if someone could just confirm what i am doing is correct. when i want to convert xml into a matlab array, i type:
data=xml2struct('name of the file i want to convert'); ? is that all?
We are encountering the same issue reported by Raoul Herzog: Undefined function or method 'toCharArray' for input arguments of type 'double'. Is there a fix for this?
For the comment bug, @Sirius3, I changed the following code block from:
if (~strcmp(name,'#text') && ~strcmp(name,'#comment') && ~strcmp(name,'#cdata_dash_section'))
%XML allows the same elements to be defined multiple times,
%put each in a different cell
if (isfield(children,name))
if (~iscell(children.(name)))
%put existsing element into cell format
children.(name) = {children.(name)};
end
index = length(children.(name))+1;
%add new element
children.(name){index} = childs;
if(~isempty(fieldnames(text)))
children.(name){index} = text;
end
if(~isempty(attr))
children.(name){index}.('Attributes') = attr;
end
else
%add previously unknown (new) element to the structure
children.(name) = childs;
if(~isempty(text) && ~isempty(fieldnames(text)))
children.(name) = text;
end
if(~isempty(attr))
children.(name).('Attributes') = attr;
end
end
else
to
if (~strcmp(name,'#text') && ~strcmp(name,'#comment') && ~strcmp(name,'#cdata_dash_section'))
%XML allows the same elements to be defined multiple times,
%put each in a different cell
if (isfield(children,name))
if (~iscell(children.(name)))
%put existsing element into cell format
children.(name) = {children.(name)};
end
index = length(children.(name))+1;
%add new element
children.(name){index} = childs;
textFieldNames = fieldnames(text);
for t = 1:length(textFieldNames)
textFieldName = textFieldNames{t};
children.(name){index}.(textFieldName) = text.(textFieldName);
end
if(~isempty(attr))
children.(name){index}.('Attributes') = attr;
end
else
%add previously unknown (new) element to the structure
children.(name) = childs;
if(~isempty(text) && ~isempty(fieldnames(text)))
textFieldNames = fieldnames(text);
numTextFieldNames = length( textFieldNames );
for i = 1:numTextFieldNames
thisFieldName = textFieldNames{i};
children.(name).(thisFieldName) = text.(thisFieldName);
end
end
if(~isempty(attr))
children.(name).('Attributes') = attr;
end
end
else
Now, the children.(name) properties are not blown away when a comment is parsed.
bug: child nodes get lost, when there are comments between them. (line 95)
First of all, thanks for the excellent code.
I have a "small" problem regarding the cells. In your code, if there is MORE THAN ONE child with the same name, you create a cell; otherwise not. What should I change so that a cell (with one element) is created even if the node has ONLY ONE child?
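One option that avoids touching xml2struct at all is to normalize the result afterwards. The sketch below works on a hand-made struct (the field names `single` and `multi` are illustrative) and wraps every non-cell field in a 1x1 cell; on a real result you would probably want to skip the `Text` and `Attributes` fields:

```matlab
% Hypothetical parsed parent with one <single> child and two <multi> children.
parent.single = struct('Text', 'a');
parent.multi  = {struct('Text', 'b'), struct('Text', 'c')};

names = fieldnames(parent);
for k = 1:numel(names)
    if ~iscell(parent.(names{k}))
        % Wrap a lone child so that every child can be indexed as a cell.
        parent.(names{k}) = {parent.(names{k})};
    end
end

disp(parent.single{1}.Text)   % same access pattern as parent.multi{1}.Text
```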
Worked very well for me. Thank you so much.
There seems to be a bug in xml2struct :
I can provide you the corresponding xml file if needed.
??? Undefined function or method 'toCharArray' for input arguments of type 'double'.
Error in ==> xml2struct>parseAttributes at 174
str = toCharArray(toString(item(theAttributes,count-1)))';
Error in ==> xml2struct>getNodeData at 141
attr = parseAttributes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct>getNodeData at 147
[childs,text,textflag] = parseChildNodes(theNode);
Error in ==> xml2struct>parseChildNodes at 72
[text,name,attr,childs,textflag] = getNodeData(theChild);
Error in ==> xml2struct at 57
s = parseChildNodes(xDoc);
One of the problems that I personally encountered is that xml2struct can't handle CDATA blocks.
It can be easily fixed, replace line 67 with:
if (~strcmp(name,'#text') && ~strcmp(name,'#comment') && ~strcmp(name,'#cdata_dash_section'))
and line 94 with:
elseif (strcmp(name,'#text') || strcmp(name, '#cdata_dash_section'))
Works great otherwise, thanks.
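For context on the name used in that fix: the DOM reports CDATA blocks under the node name `#cdata-section`, and the `_dash_` form presumably comes from the script's replacement of `-` in names (an inference from the fix above, not checked against the source). A small sketch using only `xmlread` on a made-up document:

```matlab
% Write a tiny document containing a CDATA block to a temporary file.
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s\n', '<root><![CDATA[1 < 2 & 3 > 2]]></root>');
fclose(fid);

xDoc = xmlread(tmpFile);
delete(tmpFile);

% The first child of <root> is the CDATA node.
cdataNode = xDoc.getDocumentElement().getFirstChild();
domName   = char(cdataNode.getNodeName());      % name reported by the DOM
fieldName = strrep(domName, '-', '_dash_');     % assumed name mangling
fprintf('%s -> %s\n', domName, fieldName);
```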
Excellent! I was pulling my hair out trying to read two numbers from an XML file, and with this I did it in one minute
Works great for small files. I tested it for some larger files with >100000 entries and this takes around 178 seconds.
Thank you for this suggestion Mr. Wanner. I have updated the file and it is currently under review by the MATLAB Central. It will appear here shortly.
Thanks for your work.
You might want to speed up the attribute parsing by about 40% by replacing lines 152-154 by the following:
str=theAttributes.item(count-1).toString.toCharArray()';
k=strfind(str,'=');
attr_name = regexprep(str(1:(k(1)-1)),'[-:.]','_');
attributes.(attr_name) =str((k(1)+2):(end-1));
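The string handling in this suggestion can be tried without any XML file. The sketch below assumes the attribute's `toString` output has the form `name="value"` and uses a made-up attribute:

```matlab
str = 'xml:lang="en"';   % hypothetical toString output for one attribute

k = strfind(str, '=');                                  % position of the first '='
attr_name  = regexprep(str(1:(k(1)-1)), '[-:.]', '_');  % 'xml:lang' -> 'xml_lang'
attr_value = str((k(1)+2):(end-1));                     % strip the surrounding quotes

attributes.(attr_name) = attr_value;
disp(attributes)
```

Using `k(1)` means an `=` inside the attribute value does not break the split.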
Thanks, your auto field naming system worked great for me to work with data parsed out from XML files.
Thanks a lot! I finally came across a tool that can extract info from a ISO19115/19139 xml file.
Simple and works pretty well! The structures are a bit verbose but they're supposed to be parsed by my program anyway; any attempts to collapse some of the nested structures would only slow down the code (some similar submissions do this but are much slower). Thanks!
Thanks v. much! I used it to read a Collada file (geometry file Google Sketch-up). Worked like a charm!
You are correct. I have removed the '.xml' extension assumption, unless the file cannot be found. The updated file is currently under review by MATLAB Central and should appear here soon.
Warning: not all XML files have an '.xml' extension
Worked on the first try for loading an OSM data file.
I was tearing my hair out trying to figure out how to automatically access one tiny piece of data in a .xml file until I found this routine.