Dockerfile + README - Add basic support for YARN #7

Open
davidonlaptop opened this issue Sep 15, 2015 · 8 comments

Comments

@davidonlaptop
Member

Add basic support for running YARN with this image (if it isn't already required by #6).

Make the minimal changes to the Dockerfile and add instructions to the README for running an example MapReduce job (included in the Hadoop source code) on top of YARN.
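
A rough sketch of what the README section might look like, assuming the image is published as gelog/hadoop, that Hadoop 2.6.0 lives under /usr/local/hadoop, and that the image can start each daemon directly (container names, commands, and configuration are illustrative, not the actual Dockerfile's behavior):

# Start the YARN daemons (illustrative; how they discover each other is what the rest of this thread works out)
docker run -d --name resourcemanager gelog/hadoop yarn resourcemanager
docker run -d --name nodemanager gelog/hadoop yarn nodemanager

# Run one of the MapReduce examples bundled with Hadoop on top of YARN
docker run --rm gelog/hadoop hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 4 1000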

davidonlaptop changed the title from "Dockerfile + README - Add support for YARN" to "Dockerfile + README - Add basic support for YARN" on Sep 15, 2015
@codingtony
Contributor

I'm looking at this and there's a chicken-and-egg problem.

The YARN node manager needs to see the YARN resource manager on startup in order to register.

I start the resource manager, then I start the node manager with --link to link the node manager to the resource manager. When I do that, I can see the node registering with the resource manager.

However, when you submit a job, the resource manager needs to communicate with the node manager, and I cannot link the node manager when I start the resource manager, since the node manager does not exist yet.

I could try starting both services within the same container...
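
For reference, a sketch of that startup order with --link, using illustrative container names and assuming the image can start each daemon directly:

# Start the resource manager first
docker run -d --name resourcemanager gelog/hadoop yarn resourcemanager
# Start the node manager and link it, so it can resolve the resource manager and register
docker run -d --name nodemanager --link resourcemanager:resourcemanager gelog/hadoop yarn nodemanager

The link only injects a hosts entry (and environment variables) into the node manager's container, which is why the reverse direction, the resource manager reaching back to the node manager, remains unresolved.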

codingtony pushed a commit to codingtony/docker-ubuntu-hadoop that referenced this issue Sep 22, 2015
@codingtony
Contributor

teragen (1GB)

time docker run --rm  gelog/hadoop         hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar   teragen -Ddfs.block.size=134217728 -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=100 1000000000 unsorted
[...]
15/09/24 17:23:14 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=10540690
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=8595
                HDFS: Number of bytes written=100000000000
                HDFS: Number of read operations=400
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=200
        Job Counters 
                Killed map tasks=2
                Launched map tasks=102
                Other local map tasks=102
                Total time spent by all maps in occupied slots (ms)=7179175
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=7179175
                Total vcore-seconds taken by all map tasks=7179175
                Total megabyte-seconds taken by all map tasks=7351475200
        Map-Reduce Framework
                Map input records=1000000000
                Map output records=1000000000
                Input split bytes=8595
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=60387
                CPU time spent (ms)=2457730
                Physical memory (bytes) snapshot=21296259072
                Virtual memory (bytes) snapshot=95868182528
                Total committed heap usage (bytes)=19480969216
        org.apache.hadoop.examples.terasort.TeraGen$Counters
                CHECKSUM=2147523228284173905
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=100000000000

real    22m20.271s
user    0m0.048s
sys     0m0.026s
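
For reference, TeraGen writes 100-byte records, so the 1,000,000,000 rows requested here amount to 1,000,000,000 × 100 = 100,000,000,000 bytes, which matches the "HDFS: Number of bytes written" counter above.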

@davidonlaptop
Member Author

Nice... so how does this container connect to YARN?


@codingtony
Contributor

Docker 1.8 :-)
It seems that we don't need --link anymore


@codingtony
Contributor

codingtony commented Sep 24, 2015 via email

@davidonlaptop
Member Author

Awesome!!

Huh, it works with --name? So we may not need the -h parameter anymore then?

On Thu, Sep 24, 2015 at 4:22 PM, Tony Bussières notifications@github.com wrote:

BTW it works with --name and not -h.

All the Docker containers that run on the same Docker host auto-populate /etc/hosts with the NAME of the containers and their IP.


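A quick way to verify that from the host, sketched with the illustrative container names used above (docker exec has been available since Docker 1.3; whether ping is installed depends on the image):

docker exec nodemanager cat /etc/hosts
docker exec nodemanager ping -c 1 resourcemanager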

@codingtony
Contributor

codingtony commented Sep 27, 2015 via email

@davidonlaptop
Member Author

Perfect!

It means we'll be able to simplify the README.

On Sat, Sep 26, 2015 at 8:09 PM, Tony Bussières notifications@github.com wrote:

Using -h is still a good practice IMO, and you need it if you continue to use --link.

However, you can specify only --name to use the /etc/hosts "auto populate" feature.


