Issues with Hadoop Configuration

Nash Gould on 25 Sep 2019
Commented: Jason Ross on 2 Oct 2019
I am trying to integrate a small four-node Hadoop cluster with MATLAB, and to test whether I am communicating with it properly I have written a short MapReduce script:
%Creating the Hadoop cluster
setenv('HADOOP_HOME', 'C:\hadoop');
setenv('SPARK_HOME', 'C:\spark');
cluster = parallel.cluster.Hadoop('HadoopInstallFolder', 'C:\hadoop\', ...
    'SparkInstallFolder', 'C:\spark\');
cluster.HadoopProperties('mapred.job.tracker') = 'nodemaster:50031';
cluster.HadoopProperties('fs.default.name') = 'hdfs://nodemaster:9000';
%Creating a datastore with the library_data csv as the file
ds = datastore('hdfs:///library_input/thing.csv', 'SelectedVariableNames', ...
    'Subjects', 'ReadSize', 1000);
outputDS = 'hdfs:///library_output/MatLab/';
mr = mapreducer(cluster);
outds = mapreduce(ds, @mapper, @reducer, mr, 'OutputFolder', outputDS);
function mapper(data, ~, intermKVStore)
    % Converting the Subjects column to a string array
    subjects = string(data.Subjects);
    % Storing each comma-separated subject as a key
    for i = 1:numel(subjects)
        parts = split(subjects(i), ', ');
        for j = 1:numel(parts)
            add(intermKVStore, parts(j), 1);
        end
    end
end
function reducer(intermKey, intermValIter, outKVStore)
    % Initializing the number of books with the chosen subject as 0
    numBooks = 0;
    % Adding the number of books up
    while hasnext(intermValIter)
        numBooks = numBooks + getnext(intermValIter);
    end
    add(outKVStore, intermKey, numBooks);
end
When I run this script, though, it gives me the following error:
Error using mapreduce (line 124)
The HADOOP job failed to submit. It is possible that there is some issue with the HADOOP configuration.
Error in HadoopFunc (line 16)
outds = mapreduce(ds, @mapper, @reducer, mr, 'OutputFolder', outputDS);
I am probably configuring the Hadoop cluster incorrectly for MATLAB; could anybody point me toward a setting I should change so I can run it?

Answers (2)

Jason Ross on 26 Sep 2019
The instructions for R2019b are here. Note that there have been some changes in setting up and configuring the integration over time (generally -- you need to set a couple configuration options for Hadoop and make sure certain properties aren't "final"), so if you aren't on R2019b I would suggest looking at the documentation for the release you are running. Previous releases are archived here, and you should look in the "MATLAB Distributed Computing Server" topic, under "Configure a Hadoop Cluster" for the relevant steps.
It's also worth verifying that you can run the examples that Hadoop ships with from the CLI; there are instructions here for running a trivial "word count" example. It's also good to check that the basic hadoop command set behaves as expected: commands like "hadoop dfs -ls" and mkdir/put/get should work (and have the correct user IDs and permissions).
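If it's more convenient, the same checks can be driven from the MATLAB command window through system calls. A rough sketch, assuming the hadoop binary is on the client machine's PATH and using the paths from the question; the scratch folder name is just an example:
% List the HDFS root and the input folder from the question
[status, out] = system('hadoop fs -ls /');              % status ~= 0 means the client can't reach HDFS
disp(out)
[status, out] = system('hadoop fs -ls /library_input'); % the input CSV should show up here
disp(out)
% Check write access with a throwaway folder (hypothetical path)
[status, out] = system('hadoop fs -mkdir -p /tmp/matlab_hdfs_test');
disp(out)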
If the job is getting as far as submission, there might also be interesting output in the job logs that could shed more light.
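If a YARN application does get created, its logs can usually be pulled back with the yarn CLI (assuming log aggregation is enabled); the application ID below is a placeholder for the one shown in the ResourceManager web UI (typically port 8088):
% Placeholder ID -- substitute the failed job's ID from the ResourceManager UI
appId = 'application_XXXXXXXXXXXXX_XXXX';
[~, out] = system(['yarn logs -applicationId ' appId]);
disp(out)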
  3 comments
Jason Ross on 27 Sep 2019
Edited: Jason Ross on 27 Sep 2019
I recall a similar issue setting up Hadoop 3. IIRC adding the following to yarn-site.xml fixed it.
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
Nash Gould on 27 Sep 2019
Even with these values set in yarn-site.xml, I am still getting the same error.



Jason Ross on 30 Sep 2019
Edited: Jason Ross on 30 Sep 2019
That's too bad ... looking through my notes for my setup here:
  • Did you distribute yarn-site.xml to all the nodes and restart YARN?
  • core-site.xml: I have port 54310 specified in here. In my example (similar to yours) I use this port number, e.g.
ds = datastore('hdfs://myservername:54310/datasets/sampledataset/*.csv', 'TreatAsMissing', 'NA');
  • hadoop-env.sh: I only have JAVA_HOME set.
  • hdfs-site.xml: there's nothing special in here: where the HDFS metadata is stored, the data directories, the secondary name node.
  • mapred-env.sh: nothing set.
  • mapred-site.xml: mapreduce.map.env and mapreduce.reduce.env are set to /local/hadoop. mapreduce.application.classpath is set to /local/hadoop/share/hadoop/mapreduce/*,/local/hadoop/share/hadoop/mapreduce/lib/*. Other entries set up job history and set YARN as the MapReduce framework.
  • yarn-env.sh: I set a heap size. That's it.
  • yarn-site.xml: I set the two properties described above, set up YARN to handle Spark, set yarn.scheduler.maximum-allocation-mb to ${yarn.nodemanager.resource.memory-mb}, and set up log retention.
  3 comments
Jason Ross on 2 Oct 2019
I have only set up Hadoop on Linux. For Windows, there should be hdfs.dll and hdfs.cmd in the Hadoop install's bin directory, which should allow for the submission. The property settings should be the same or similar.
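A quick way to confirm those files are present from the MATLAB side (a small sketch, assuming HADOOP_HOME is still set to C:\hadoop as in the question):
% Check the Windows Hadoop binaries are where MATLAB expects them
binDir = fullfile(getenv('HADOOP_HOME'), 'bin');
dir(fullfile(binDir, 'hdfs.*'))   % should list hdfs.cmd (and hdfs.dll on Windows builds)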
For memory, the requirement is something like 2 GB per worker minimum, so if you are running three workers it should be OK, at least for a trivial example.
Hadoop uses the "third-party scheduler" model, so when you submit a job you set the ClusterMatlabRoot property to the MATLAB installation you want to use on the cluster. You can have as many installations as your cluster can hold; you just specify the one to use as you submit. Using native Hadoop lets you run MapReduce jobs and Spark jobs, and the cluster can accept submissions other than MATLAB jobs, which can be an advantage if people are using other kinds of code to run jobs on the cluster.
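For example (a sketch only; the MATLAB install path below is an assumption, substitute wherever MATLAB or MATLAB Parallel Server lives on your cluster nodes):
cluster = parallel.cluster.Hadoop('HadoopInstallFolder', 'C:\hadoop', ...
    'SparkInstallFolder', 'C:\spark');
% Point the submission at the MATLAB the workers should run (hypothetical path)
cluster.ClusterMatlabRoot = 'C:\Program Files\MATLAB\R2019b';
mr = mapreducer(cluster);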
With MJS/Parallel Server, you can also use previous MATLAB releases by setting the alternate MATLAB roots in the matlabroot/toolbox/parallel/bin/mjs.bat file. This will automagically select the correct MATLAB when jobs are submitted from other releases. You just need to run the latest version for the actual job manager processes; the workers will swap out as needed.
Jason Ross on 2 Oct 2019
Looking at your files, I see you are specifying 9000 as the port in core-site.xml. This is not the same port as I'm using. You likely need to change your datastore line to use that port, e.g.
ds = datastore('hdfs://nodemaster:9000/library_input/thing.csv', 'SelectedVariableNames', 'Subjects', 'ReadSize', 1000);
assuming that /library_input/thing.csv is the correct full path to that file -- you should be able to browse the HDFS from the web interface to verify.
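The same reasoning would presumably apply to the output location in the original script; a hedged sketch using the paths from the question:
% If the input needs the explicit host:port form, the output folder most likely does too
outputDS = 'hdfs://nodemaster:9000/library_output/MatLab/';
outds = mapreduce(ds, @mapper, @reducer, mr, 'OutputFolder', outputDS);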
