# How to Install and Set Up a 3-Node Hadoop Cluster

*Updated Monday, July 22, 2019, by Linode. Contributed by Florent Houbart.*

## What is Hadoop?

Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It is composed of the **Hadoop Distributed File System (HDFS™)**, which handles scalability and redundancy of data across nodes, and **Hadoop YARN**, a framework for job scheduling that executes data processing tasks on all nodes.

![How to Install and Set Up a 3-Node Hadoop Cluster](hadoop-1-logo.png)

## Before You Begin

1. Follow the [Getting Started](/docs/getting-started/) guide to create three (3) Linodes. They'll be referred to throughout this guide as **node-master**, **node1**, and **node2**. It is recommended that you set the hostname of each Linode to match this naming convention.

   Run the steps in this guide from **node-master** unless otherwise specified.

2. [Add a Private IP Address](/docs/platform/manager/remote-access/#adding-private-ip-addresses) to each Linode so that your cluster can communicate with an additional layer of security.

3. Follow the [Securing Your Server](/docs/security/securing-your-server/) guide to harden each of the three servers. Create a normal user for the Hadoop installation, and a user called `hadoop` for the Hadoop daemons. Do **not** create SSH keys for the `hadoop` user; SSH keys will be addressed in a later section.

4. Install the JDK using the appropriate guide for your distribution ([Debian](/docs/development/java/install-java-on-debian/), [CentOS](/docs/development/java/install-java-on-centos/), or [Ubuntu](/docs/development/java/install-java-on-ubuntu-16-04/)), or install the latest JDK from Oracle.

5. The steps below use example IPs for each node. Adjust each example according to your configuration:

   - **node-master**: 192.0.2.1
   - **node1**: 192.0.2.2
   - **node2**: 192.0.2.3

> **Note:** This guide is written for a non-root user. Commands that require elevated privileges are prefixed with `sudo`. If you're not familiar with the `sudo` command, see the [Users and Groups](/docs/tools-reference/linux-users-and-groups) guide. All commands in this guide are run as the `hadoop` user unless specified otherwise.

## Architecture of a Hadoop Cluster

Before configuring the master and worker nodes, it's important to understand the different components of a Hadoop cluster.

A **master node** maintains knowledge about the distributed file system, like the `inode` table on an `ext3` filesystem, and schedules resource allocation. **node-master** will handle this role in this guide, and host two daemons:

- The **NameNode** manages the distributed file system and knows where stored data blocks inside the cluster are.
- The **ResourceManager** manages the YARN jobs and takes care of scheduling and executing processes on worker nodes.

**Worker nodes** store the actual data and provide processing power to run the jobs. They'll be **node1** and **node2**, and each will host two daemons:

- The **DataNode** manages the data physically stored on the node; it is the worker counterpart of the **NameNode**.
- The **NodeManager** manages execution of tasks on the node.

## Configure the System

### Create Host File on Each Node

For the nodes to communicate with each other by name, edit the `/etc/hosts` file on each of the three servers to add the private IP addresses of all three. Don't forget to replace the sample IPs with your own:

**/etc/hosts**
```
192.0.2.1 node-master
192.0.2.2 node1
192.0.2.3 node2
```
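To confirm that name resolution works before moving on, a quick sanity check such as the following can be run from each node (the hostnames are the ones added above):

```bash
# Ping each node once by name; a failure points to a typo in /etc/hosts
for host in node-master node1 node2; do
    ping -c 1 "$host" > /dev/null && echo "$host: reachable" || echo "$host: unreachable"
done
```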
### Distribute Authentication Key-pairs for the Hadoop User

The master node will use an SSH connection to connect to the other nodes with key-pair authentication. This will allow the master node to actively manage the cluster.

1. Log in to **node-master** as the `hadoop` user, and generate an SSH key:

   ```bash
   ssh-keygen -b 4096
   ```

   When generating this key, leave the password field blank so your Hadoop user can communicate unprompted.

2. View the **node-master** public key and copy it to your clipboard to use with each of your worker nodes:

   ```bash
   less /home/hadoop/.ssh/id_rsa.pub
   ```

3. On each Linode, create a new file `master.pub` in the `/home/hadoop/.ssh` directory. Paste your public key into this file and save your changes.

4. Copy your key file into the authorized key store:

   ```bash
   cat ~/.ssh/master.pub >> ~/.ssh/authorized_keys
   ```
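If password authentication is still enabled on the nodes at this point, steps 2 through 4 can usually be replaced by `ssh-copy-id`, which appends the public key to each node's `authorized_keys` for you. A sketch, assuming the `hadoop` user exists on every node:

```bash
# Push the node-master public key to each node, including node-master itself
for host in node-master node1 node2; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@"$host"
done
```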
### Download and Unpack Hadoop Binaries

Log into **node-master** as the `hadoop` user, download the Hadoop tarball from the [Hadoop project page](https://hadoop.apache.org/), and unzip it:

```bash
cd
wget http://apache.cs.utah.edu/hadoop/common/current/hadoop-3.1.2.tar.gz
tar -xzf hadoop-3.1.2.tar.gz
mv hadoop-3.1.2 hadoop
```
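It's good practice to verify a download against its published checksum before trusting it. A sketch, assuming the Apache archive publishes a `.sha512` file for this release at the usual location (the URL below is an assumption based on the standard archive layout):

```bash
# Fetch the published SHA-512 checksum and verify the local tarball against it
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz.sha512
sha512sum -c hadoop-3.1.2.tar.gz.sha512
```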
### Set Environment Variables

1. Add the Hadoop binaries to your PATH. Edit `/home/hadoop/.profile` and add the following line:

   **/home/hadoop/.profile**
   ```bash
   PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
   ```

2. Add Hadoop to your PATH for the shell. Edit `.bashrc` and add the following lines:

   **/home/hadoop/.bashrc**
   ```bash
   export HADOOP_HOME=/home/hadoop/hadoop
   export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
   ```
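To apply the new PATH without logging out and back in, you can source both files and confirm that the shell now finds the Hadoop binaries:

```bash
source /home/hadoop/.profile
source /home/hadoop/.bashrc
which hadoop   # should print /home/hadoop/hadoop/bin/hadoop
```

(`hadoop version` will also work once `JAVA_HOME` is set in the next section.)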
## Configure the Master Node

Configuration will be performed on **node-master** and replicated to the other nodes.

### Set JAVA_HOME

1. Find your Java installation path. This is known as `JAVA_HOME`. If you installed OpenJDK from your package manager, you can find the path with the command:

   ```bash
   update-alternatives --display java
   ```

   Take the value of the *current link* and remove the trailing `/bin/java`. For example, on Debian the link is `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java`, so `JAVA_HOME` should be `/usr/lib/jvm/java-8-openjdk-amd64/jre`. (A scripted way to derive this value is sketched after this list.)

   If you installed Java from Oracle, `JAVA_HOME` is the path where you unzipped the Java archive.

2. Edit `~/hadoop/etc/hadoop/hadoop-env.sh` and replace this line:

   ```bash
   export JAVA_HOME=${JAVA_HOME}
   ```

   with your actual Java installation path. On a Debian 9 Linode with OpenJDK 8, this will be as follows:

   **~/hadoop/etc/hadoop/hadoop-env.sh**
   ```bash
   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
   ```
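As an alternative to reading the `update-alternatives` output by hand, the following one-liner resolves the `java` binary through its symlink chain and strips the trailing `/bin/java`; a sketch that assumes `java` is on your PATH:

```bash
# Resolve the real java binary, then go up two directories (bin, then its parent)
JAVA_HOME=$(dirname "$(dirname "$(readlink -f "$(which java)")")")
echo "$JAVA_HOME"
```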
### Set NameNode Location

Update your `~/hadoop/etc/hadoop/core-site.xml` file to set the NameNode location to **node-master** on port `9000`:

**~/hadoop/etc/hadoop/core-site.xml**
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>
```
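You can confirm that Hadoop picks up the new value without starting any daemons, since `hdfs getconf` reads the configuration files directly:

```bash
hdfs getconf -confKey fs.default.name
# Expected output: hdfs://node-master:9000
```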
### Set Path for HDFS

Edit `hdfs-site.xml` to resemble the following configuration:

**~/hadoop/etc/hadoop/hdfs-site.xml**
```xml
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```

The last property, `dfs.replication`, indicates how many times data is replicated in the cluster. You can set it to `2` to have all data duplicated on both nodes. Don't enter a value higher than the actual number of worker nodes.
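Note that `dfs.replication` only sets the default for files written from then on. Once the cluster is up, the replication factor of existing data can be changed with `hdfs dfs -setrep`; for example, for a directory such as `/user/hadoop/books` (created later in this guide):

```bash
# Re-replicate everything under the directory with 2 copies;
# -w waits until the target replication factor is reached
hdfs dfs -setrep -w 2 /user/hadoop/books
```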
### Set YARN as Job Scheduler

Edit the `mapred-site.xml` file, setting YARN as the default framework for MapReduce operations:

**~/hadoop/etc/hadoop/mapred-site.xml**
```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
</configuration>
```
### Configure YARN

Edit `yarn-site.xml`, which contains the configuration options for YARN. In the `value` field for `yarn.resourcemanager.hostname`, replace `203.0.113.0` with the public IP address of **node-master**:

**~/hadoop/etc/hadoop/yarn-site.xml**
```xml
<configuration>
    <property>
        <name>yarn.acl.enable</name>
        <value>0</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>203.0.113.0</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```
### Configure Workers

The file `workers` is used by startup scripts to start the required daemons on all nodes. Edit `~/hadoop/etc/hadoop/workers` to include both worker nodes:

**~/hadoop/etc/hadoop/workers**
```
node1
node2
```

## Configure Memory Allocation

Memory allocation can be tricky on low-RAM nodes because the default values are not suitable for nodes with less than 8GB of RAM. This section will highlight how memory allocation works for MapReduce jobs, and provide a sample configuration for nodes with 2GB of RAM.

### The Memory Allocation Properties

A YARN job is executed with two kinds of resources:

- An *Application Master* (AM) is responsible for monitoring the application and coordinating distributed executors in the cluster.
- Executors, created by the AM, actually run the job. For a MapReduce job, they perform the map or reduce operations in parallel.

Both run in *containers* on worker nodes. Each worker node runs a *NodeManager* daemon that's responsible for container creation on the node. The whole cluster is managed by a *ResourceManager* that schedules container allocation on all the worker nodes, depending on capacity requirements and current load.

Four types of resource allocation need to be configured properly for the cluster to work:

1. How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node.

   This value is configured in `yarn-site.xml` with `yarn.nodemanager.resource.memory-mb`.

2. How much memory a single container can consume, and the minimum memory allocation allowed. A container will never be bigger than the maximum, or else allocation will fail; memory is always allocated as a multiple of the minimum amount.

   Those values are configured in `yarn-site.xml` with `yarn.scheduler.maximum-allocation-mb` and `yarn.scheduler.minimum-allocation-mb`.

3. How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit within the container maximum size.

   This is configured in `mapred-site.xml` with `yarn.app.mapreduce.am.resource.mb`.

4. How much memory will be allocated to each map or reduce operation. This should be less than the maximum size.

   This is configured in `mapred-site.xml` with the properties `mapreduce.map.memory.mb` and `mapreduce.reduce.memory.mb`.

The relationship between all those properties can be seen in the following figure:

![Schema of memory allocation properties](hadoop-2-memory-allocation-new.png)

### Sample Configuration for 2GB Nodes

For 2GB nodes, a working configuration may be:

| Property | Value |
|----------|:-----:|
| yarn.nodemanager.resource.memory-mb | 1536 |
| yarn.scheduler.maximum-allocation-mb | 1536 |
| yarn.scheduler.minimum-allocation-mb | 128 |
| yarn.app.mapreduce.am.resource.mb | 512 |
| mapreduce.map.memory.mb | 256 |
| mapreduce.reduce.memory.mb | 256 |
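As a sanity check on these numbers: each node offers 1536MB to YARN containers, so the node running the ApplicationMaster can hold one 512MB AM container plus four 256MB map or reduce containers (512 + 4 × 256 = 1536), and every request is rounded up to a multiple of the 128MB minimum.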
1. Edit `/home/hadoop/hadoop/etc/hadoop/yarn-site.xml` and add the following lines:

   **~/hadoop/etc/hadoop/yarn-site.xml**
   ```xml
   <property>
       <name>yarn.nodemanager.resource.memory-mb</name>
       <value>1536</value>
   </property>
   <property>
       <name>yarn.scheduler.maximum-allocation-mb</name>
       <value>1536</value>
   </property>
   <property>
       <name>yarn.scheduler.minimum-allocation-mb</name>
       <value>128</value>
   </property>
   <property>
       <name>yarn.nodemanager.vmem-check-enabled</name>
       <value>false</value>
   </property>
   ```
   The last property disables virtual-memory checking, which can otherwise prevent containers from being allocated properly on JDK 8.

2. Edit `/home/hadoop/hadoop/etc/hadoop/mapred-site.xml` and add the following lines:

   **~/hadoop/etc/hadoop/mapred-site.xml**
   ```xml
   <property>
       <name>yarn.app.mapreduce.am.resource.mb</name>
       <value>512</value>
   </property>
   <property>
       <name>mapreduce.map.memory.mb</name>
       <value>256</value>
   </property>
   <property>
       <name>mapreduce.reduce.memory.mb</name>
       <value>256</value>
   </property>
   ```
## Duplicate Config Files on Each Node

1. Copy the Hadoop binaries to the worker nodes:

   ```bash
   cd /home/hadoop/
   scp hadoop-*.tar.gz node1:/home/hadoop
   scp hadoop-*.tar.gz node2:/home/hadoop
   ```

2. Connect to **node1** via SSH. A password isn't required, thanks to the SSH keys copied above:

   ```bash
   ssh node1
   ```

3. Unzip the binaries, rename the directory, and exit **node1** to get back to **node-master**:

   ```bash
   tar -xzf hadoop-3.1.2.tar.gz
   mv hadoop-3.1.2 hadoop
   exit
   ```

4. Repeat steps 2 and 3 for **node2**.

5. Copy the Hadoop configuration files to the worker nodes:

   ```bash
   for node in node1 node2; do
       scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/;
   done
   ```
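To verify that the copy succeeded, you can diff one of the configuration files between **node-master** and each worker; a small sketch using the node names from this guide:

```bash
# Compare core-site.xml on each worker against the master copy;
# diff prints nothing when the files are identical
for node in node1 node2; do
    ssh "$node" cat /home/hadoop/hadoop/etc/hadoop/core-site.xml \
        | diff - ~/hadoop/etc/hadoop/core-site.xml \
        && echo "$node: core-site.xml matches"
done
```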
## Format HDFS

HDFS needs to be formatted like any classical file system. On **node-master**, run the following command:

```bash
hdfs namenode -format
```

Your Hadoop installation is now configured and ready to run.

## Run and Monitor HDFS

This section will walk through starting HDFS on the NameNode and DataNodes, monitoring that everything is working properly, and interacting with HDFS data.

### Start and Stop HDFS

1. Start HDFS by running the following script from **node-master**:

   ```bash
   start-dfs.sh
   ```

   This starts **NameNode** and **SecondaryNameNode** on node-master, and **DataNode** on **node1** and **node2**, according to the configuration in the `workers` config file.

2. Check that every process is running with the `jps` command on each node. On **node-master**, you should see the following (the PID numbers will be different):

   ```
   21922 Jps
   21603 NameNode
   21787 SecondaryNameNode
   ```

   And on **node1** and **node2** you should see the following:

   ```
   19728 DataNode
   19819 Jps
   ```

3. To stop HDFS on the master and worker nodes, run the following command from **node-master**:

   ```bash
   stop-dfs.sh
   ```
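Rather than logging in to each node to run `jps`, you can check all three from **node-master** over SSH; a convenience sketch, assuming the key-pair setup from earlier also covers node-master itself and that `jps` is on the default PATH for non-interactive shells:

```bash
# List the Java daemons running on every node in one pass
for host in node-master node1 node2; do
    echo "--- $host ---"
    ssh "$host" jps
done
```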
### Monitor your HDFS Cluster

1. You can get useful information about your running HDFS cluster with the `hdfs dfsadmin` command. For example, try:

   ```bash
   hdfs dfsadmin -report
   ```

   This prints information (e.g., capacity and usage) for all running DataNodes. To get a description of all available commands, type:

   ```bash
   hdfs dfsadmin -help
   ```

2. You can also use the friendlier web user interface. Point your browser to `http://node-master-IP:9870`, where node-master-IP is the IP address of your node-master, and you'll get a user-friendly monitoring console.

![Screenshot of HDFS Web UI](hadoop-3-hdfs-webui-wide.png)

### Put and Get Data to HDFS

Writing to and reading from HDFS is done with the `hdfs dfs` command. First, manually create your home directory. All other commands will use a path relative to this default home directory:

```bash
hdfs dfs -mkdir -p /user/hadoop
```

Let's use some textbooks from the [Gutenberg project](https://www.gutenberg.org/) as an example.

1. Create a *books* directory in HDFS. The following command will create it in the home directory, `/user/hadoop/books`:

   ```bash
   hdfs dfs -mkdir books
   ```

2. Grab a few books from the Gutenberg project:

   ```bash
   cd /home/hadoop
   wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
   wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
   wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt
   ```

3. Put the three books into HDFS, in the `books` directory:

   ```bash
   hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
   ```

4. List the contents of the `books` directory:

   ```bash
   hdfs dfs -ls books
   ```

5. Copy one of the books back to the local filesystem:

   ```bash
   hdfs dfs -get books/alice.txt
   ```

6. You can also directly print the books from HDFS:

   ```bash
   hdfs dfs -cat books/alice.txt
   ```

There are many commands to manage your HDFS. For a complete list, see the [Apache HDFS shell documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html), or print help with:

```bash
hdfs dfs -help
```
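To see how the uploaded files were split into blocks and replicated across the DataNodes, you can ask HDFS for per-file usage and block reports:

```bash
# Show the size of each file in the books directory
hdfs dfs -du -h books

# Show block locations and replication details for the directory
hdfs fsck /user/hadoop/books -files -blocks
```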
## Run YARN

HDFS is a distributed storage system; it doesn't provide any services for running and scheduling tasks in the cluster. That is the role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN.

### Start and Stop YARN

1. Start YARN with the script:

   ```bash
   start-yarn.sh
   ```

2. Check that everything is running with the `jps` command. In addition to the previous HDFS daemons, you should see a **ResourceManager** on **node-master**, and a **NodeManager** on **node1** and **node2**.

3. To stop YARN, run the following command on **node-master**:

   ```bash
   stop-yarn.sh
   ```

### Monitor YARN

1. The `yarn` command provides utilities to manage your YARN cluster. You can print a report of running nodes with the command:

   ```bash
   yarn node -list
   ```

   Similarly, you can get a list of running applications with the command:

   ```bash
   yarn application -list
   ```

   To get all available parameters of the `yarn` command, see the [Apache YARN documentation](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html).

2. As with HDFS, YARN provides a friendlier web UI, started by default on port `8088` of the ResourceManager. Point your browser to `http://node-master-IP:8088`, where node-master-IP is the IP address of your node-master, and browse the UI:

   ![Screenshot of YARN Web UI](hadoop-4-yarn-webui-wide.png)
### Submit MapReduce Jobs to YARN

YARN jobs are packaged into `jar` files and submitted to YARN for execution with the command `yarn jar`. The Hadoop installation package provides sample applications that can be run to test your cluster. You'll use them to run a word count on the three books previously uploaded to HDFS.

1. Submit a job with the sample `jar` to YARN. On **node-master**, run:

   ```bash
   yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount "books/*" output
   ```

   The last argument is where the output of the job will be saved, in HDFS.

2. After the job is finished, you can get the result by querying HDFS with `hdfs dfs -ls output`. In case of success, the output will resemble:

   ```
   Found 2 items
   -rw-r--r-- 2 hadoop supergroup      0 2019-05-31 17:21 output/_SUCCESS
   -rw-r--r-- 2 hadoop supergroup 789726 2019-05-31 17:21 output/part-r-00000
   ```

3. Print the result with:

   ```bash
   hdfs dfs -cat output/part-r-00000 | less
   ```
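Since the wordcount output is plain tab-separated text (one `word<TAB>count` pair per line), standard shell tools can post-process it; for example, to list the ten most frequent words:

```bash
# Sort by the count column (field 2), numerically and descending, and keep the top 10
hdfs dfs -cat output/part-r-00000 | sort -t $'\t' -k2,2nr | head -n 10
```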
## Next Steps

Now that you have a YARN cluster up and running, you can:

- Learn how to code your own YARN jobs with the [Apache documentation](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html).
- Install Spark on top of your YARN cluster with the [Linode Spark guide](/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/).

## More Information

You may wish to consult the following resources for additional information on this topic. While these are provided in the hope that they will be useful, please note that we cannot vouch for the accuracy or timeliness of externally hosted materials.

- [YARN Command Reference](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html)
- [HDFS Shell Documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html)
- [core-site.xml properties](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml)
- [hdfs-site.xml properties](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml)
- [mapred-site.xml properties](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml)
- [yarn-site.xml properties](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml)