
Find a way to have /data cohesive usage ? #14

Open
mikefaille opened this issue Sep 15, 2015 · 11 comments
@mikefaille
Member

I think /data could be used for many use cases. Currently, the root of this folder contains configs, and subfolders can contain anything else.

The problem? /data is the volume, and we can have subvolumes. I'm not sure why the config is on the volume right now. Maybe I'm missing something?

I think the best way to use folders for mounting data is to keep the main folder (/data) empty and use subfolders for the configs (/data/etc), the HDFS mount (/data/hdfs), or anything else.
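For illustration, a minimal sketch of that mounting pattern (the image name and host paths here are placeholders, not the project's actual ones):

```sh
# Keep /data itself empty; mount only the subfolders we care about.
docker run -d \
  -v /srv/hadoop-etc:/data/etc \
  -v /srv/hadoop-hdfs:/data/hdfs \
  example/hadoop
```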

PS: Maybe we could follow the Filesystem Hierarchy Standard, http://www.pathname.com/fhs/pub/fhs-2.3.pdf (it's why I really love the Unix style of file organisation), but I'm personally OK with using /data if it's clearly documented in README.md. I will meditate on this.

@davidonlaptop
Member

Indeed, it would be great to use the FHS standard instead of /data. The main reason for choosing /data was that it makes docker commands shorter to type, which also improves the image's usability. It's true, though, that something like /hdfs would be more appropriate.

A compromise could be to install and configure Hadoop per FHS and have some kind of symlink usable with Docker, so we could optionally mount -v /HOST_VOLUME:/hdfs. Which FHS folder would you recommend for HDFS data?
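A rough sketch of that compromise (all paths and names here are hypothetical examples):

```sh
# Inside the image: keep HDFS data under an FHS path, expose a short alias.
ln -s /var/lib/hdfs /hdfs

# On the host: the docker command stays short.
docker run -v /HOST_VOLUME:/hdfs example/hadoop
```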

@mikefaille mikefaille changed the title Find a way to have /data unique use ? Find a way to have /data cohesive usage ? Sep 16, 2015
@davidonlaptop
Member

FYI. FHS 3.0 was released in June 2015. I just read it, and think we could use /srv/hdfs for HDFS data. Do you concur?

Reference: http://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s17.html

@mikefaille
Member Author

It sounds good. And it's not too far-fetched either.

Btw, I'm just thinking about the legitimacy of /usr/local/hadoop.
Yes, it's very popular in the community. But it's not FHS compliant. I think FHS exists so that data can be found without app-specific knowledge. So, since the Hadoop community prefers this path, and some apps like Golang prefer it too, it should be OK. But the best place for a package's tertiary hierarchy would be /opt/package_name. Then we can have: /opt/package_name/etc, /opt/package_name/bin, /opt/package_name/lib, etc.

Although, we can think about /srv again. There is a small issue with this path: as far as I know, only openSUSE is compliant with it, and no apps actually use /srv. So the issue we run into here is what I call the familiarity aspect.

So, my personal recommendation for a generic, community-proof standard:

  • Use /data/ the way /var/lib or /srv is used. Example: /data/dfs or /data/hdfs and /data/hbase, but avoid /data directly where possible.
    Explanation: current Hadoop distros use this path. CDH uses /data directly for Hadoop, but that's bad because we lose the ability to store data from several apps. We already need a few distinct paths, e.g. the hbase.rootdir property needs a different path than dfs.*. And /data is also used by the Docker community, which is great if we want to reinforce the familiarity aspect.
  • Use the VOLUME instruction more granularly: not for /data, but for subfolders where needed, like /data/hdfs (see the sketch after this list).
  • Continue to use /usr/local/hadoop, since /usr/local/ is recognized by the community. Personally, I prefer /opt/hadoop because the path is shorter and it strictly respects FHS.
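A minimal Dockerfile sketch of the granular VOLUME idea (base image and directories are illustrative only):

```dockerfile
FROM ubuntu:14.04
# Create the per-application data directories; /data itself stays unmanaged.
RUN mkdir -p /data/hdfs /data/hbase
# Declare volumes per subfolder, never for /data as a whole.
VOLUME ["/data/hdfs", "/data/hbase"]
```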

@davidonlaptop
Member

OK, it's settled then: /srv/hdfs for Hadoop HDFS data.

Agreed that we could also re-evaluate where we put the Hadoop binaries, config, etc. But why do you say that /usr/local/hadoop is not FHS compliant?

FHS 4.1 Purpose of /usr:

> /usr is the second major section of the filesystem. /usr is shareable, read-only data. That means that /usr should be shareable between various FHS-compliant hosts and must not be written to. Any information that is host-specific or varies with time is stored elsewhere.

FHS 4.9 Purpose of /usr/local:

> The /usr/local hierarchy is for use by the system administrator when installing software locally. It needs to be safe from being overwritten when the system software is updated. It may be used for programs and data that are shareable amongst a group of hosts, but not found in /usr.
>
> Locally installed software must be placed within /usr/local rather than /usr unless it is being installed to replace or upgrade software in /usr.

It seems to me that using /usr/local/hadoop is more FHS compliant than using /opt/hadoop, because the latter would force us to nest all the subdirectories (etc, bin, and so on) inside the /opt/hadoop/ dir, while the former allows us to put the config in /usr/local/etc/hadoop or /etc/local/hadoop. But to be honest, /etc/hadoop would be the most user-friendly!

@mikefaille
Member Author

@davidonlaptop Having a subfolder representing a package, like /usr/local/package_name, is not FHS compliant.

@davidonlaptop
Member

> Having a subfolder representing a package, like /usr/local/package_name, is not FHS compliant.

Do you have a reference to support this claim?

@mikefaille
Member Author

> Having a subfolder representing a package, like /usr/local/package_name, is not FHS compliant.
> Do you have a reference to support this claim?

Yes, there is nothing about it in the FHS 👍

> It seems to me that using /usr/local/hadoop is more FHS compliant than using /opt/hadoop, because the latter would force us to nest all the subdirectories (etc, bin, and so on) inside the /opt/hadoop/ dir.

Wrong. /opt/package_name permits us to put configs under /etc:
http://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s13.html
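For example, the split FHS describes would look roughly like this (a sketch of the layout, not taken from any of our images):

```sh
# FHS-style /opt layout (see the /opt, /etc/opt, and /var/opt sections of FHS 3.0):
#   /opt/hadoop/bin       # static binaries
#   /opt/hadoop/lib       # static libraries
#   /etc/opt/hadoop       # host-specific config files
#   /var/opt/hadoop       # variable package data
```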

Nesting subdirectories is already the case with /usr/local/hadoop.

Although, the first goal of FHS is to provide an understandable way to predict the right path. Again, if the community chooses another way, like /data/whatever and /usr/local/<package_name>, that can make paths even more predictable than the FHS way. For nuance, I gave my own point of view in this answer: #14 (comment)

Then, if you really want to have /etc/hadoop, just symlink /usr/local/hadoop/etc/hadoop to /etc/hadoop and, for the logs, /usr/local/hadoop/logs (this path could be way better) to /var/log/hadoop.
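In Dockerfile terms, something like this (a sketch; it assumes the stock Hadoop 2.x tarball layout under /usr/local/hadoop):

```sh
# Expose FHS-friendly aliases for the in-tree Hadoop directories.
ln -s /usr/local/hadoop/etc/hadoop /etc/hadoop   # configs
ln -s /usr/local/hadoop/logs /var/log/hadoop     # logs
```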

@davidonlaptop
Member

> Wrong. /opt/package_name permits us to put configs under /etc:
> http://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s13.html

That's right. So then we have two choices.

> Although, the first goal of FHS is to provide an understandable way to predict the right path.

Agreed. Then let's compare with the industry: CDH, HDP, MapR, and SequenceIQ. Let's see what they are doing.

@mikefaille
Member Author

To give my comments, I already checked the Dockerfiles from CDH, MapR, and SequenceIQ.

MapR uses /mapr (I really don't like it).

CDH is FHS compliant: /var/lib/hadoop-hdfs/cache/${user.name}/dfs/data

SequenceIQ uses the Docker community way (a little awful, but it works for me only if we use /data/package_name as a subfolder): /data/package_name

@davidonlaptop
Member

Links to SequenceIQ dockerfiles (for future reference):

@mikefaille
Member Author

SequenceIQ seems to use the default path under /tmp/hadoop-${user.name}/dfs/data.
But in Docker, /tmp is a tmpfs mount, so the container could lose its data when we shut it down.
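For reference, a sketch of overriding that default so the blocks land on a persistent, mountable path (dfs.datanode.data.dir is the standard Hadoop 2.x property; in practice you would merge this into the existing hdfs-site.xml rather than overwrite it):

```sh
# Point the DataNode away from the hadoop.tmp.dir default
# (/tmp/hadoop-${user.name}/dfs/data) to a persistent directory.
cat > /usr/local/hadoop/etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hdfs</value>
  </property>
</configuration>
EOF
```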
