Dockerfile + README - Add basic support for YARN #7

Open
davidonlaptop opened this issue Sep 15, 2015 · 8 comments

Comments

@davidonlaptop
Member

Add basic support for running YARN with this image (if it isn't already required by #6).

Make the minimal changes to the Dockerfile and add instructions to the README for running an example MapReduce job (included in the Hadoop source code) on top of YARN.
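
A rough sketch of what the README section might look like, assuming the image is published as gelog/hadoop, that Hadoop 2.6.0 lives under /usr/local/hadoop, and that the image can start each daemon directly (container names, commands, and configuration are illustrative, not the actual Dockerfile's behavior):

# Start the YARN daemons (illustrative; how they discover each other is what the rest of this thread works out)
docker run -d --name resourcemanager gelog/hadoop yarn resourcemanager
docker run -d --name nodemanager gelog/hadoop yarn nodemanager

# Run one of the MapReduce examples bundled with Hadoop on top of YARN
docker run --rm gelog/hadoop hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 4 1000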

davidonlaptop changed the title from "Dockerfile + README - Add support for YARN" to "Dockerfile + README - Add basic support for YARN" on Sep 15, 2015
@codingtony
Contributor

I'm looking at this and there's a chicken-and-egg problem.

The YARN node manager needs to see the YARN resource manager on startup in order to register.

I start the resource manager, then I start the node manager with --link to link the node manager to the resource manager. When I do that, I can see the node registering with the resource manager.

However, when you submit a job, the resource manager needs to communicate with the node manager, and I cannot link the node manager when I start the resource manager, since the node manager does not exist yet.

I could try starting both services within the same container...
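
For reference, a sketch of that startup order with --link, using illustrative container names and assuming the image can start each daemon directly:

# Start the resource manager first
docker run -d --name resourcemanager gelog/hadoop yarn resourcemanager
# Start the node manager and link it, so it can resolve the resource manager and register
docker run -d --name nodemanager --link resourcemanager:resourcemanager gelog/hadoop yarn nodemanager

The link only injects a hosts entry (and environment variables) into the node manager's container, which is why the reverse direction, the resource manager reaching back to the node manager, remains unresolved.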

codingtony pushed a commit to codingtony/docker-ubuntu-hadoop that referenced this issue Sep 22, 2015
@codingtony
Contributor

teragen (1GB)

time docker run --rm  gelog/hadoop         hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar   teragen -Ddfs.block.size=134217728 -Dmapred.map.tasks=100 -Dmapred.reduce.tasks=100 1000000000 unsorted
[...]
15/09/24 17:23:14 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=10540690
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=8595
                HDFS: Number of bytes written=100000000000
                HDFS: Number of read operations=400
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=200
        Job Counters 
                Killed map tasks=2
                Launched map tasks=102
                Other local map tasks=102
                Total time spent by all maps in occupied slots (ms)=7179175
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=7179175
                Total vcore-seconds taken by all map tasks=7179175
                Total megabyte-seconds taken by all map tasks=7351475200
        Map-Reduce Framework
                Map input records=1000000000
                Map output records=1000000000
                Input split bytes=8595
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=60387
                CPU time spent (ms)=2457730
                Physical memory (bytes) snapshot=21296259072
                Virtual memory (bytes) snapshot=95868182528
                Total committed heap usage (bytes)=19480969216
        org.apache.hadoop.examples.terasort.TeraGen$Counters
                CHECKSUM=2147523228284173905
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=100000000000

real    22m20.271s
user    0m0.048s
sys     0m0.026s
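
For reference, TeraGen writes 100-byte records, so the 1,000,000,000 rows requested here amount to 1,000,000,000 × 100 = 100,000,000,000 bytes, which matches the "HDFS: Number of bytes written" counter above.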

@davidonlaptop
Member Author

Nice... so how does this container connect to YARN?


@codingtony
Contributor

Docker 1.8 :-)
It seems that we don't need --link anymore


@codingtony
Contributor

codingtony commented Sep 24, 2015 via email

@davidonlaptop
Member Author

Awesome!!

Huh, it works with --name? So we may not need the -h parameter anymore then?

On Thu, Sep 24, 2015 at 4:22 PM, Tony Bussières notifications@github.com wrote:

BTW it works with --name and not -h.

All the Docker containers that run on the same Docker host auto-populate /etc/hosts with the NAME of the containers and their IP.


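A quick way to verify that from the host, sketched with the illustrative container names used above (docker exec has been available since Docker 1.3; whether ping is installed depends on the image):

docker exec nodemanager cat /etc/hosts
docker exec nodemanager ping -c 1 resourcemanager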

@codingtony
Contributor

codingtony commented Sep 27, 2015 via email

@davidonlaptop
Member Author

Perfect!

It means we'll be able to simplify the README.

On Sat, Sep 26, 2015 at 8:09 PM, Tony Bussières notifications@github.com wrote:

Using -h is still a good practice IMO, and you need it if you continue to use --link.

However, you can specify only --name to use the /etc/hosts "auto populate" feature.


