ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is
ideal for large storage problems faced by HPC, big data, streaming video,
genomics and bioinformatics.

OrangeFS, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

OrangeFS features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using a local file system
    and access methods
  * Userspace implementation that is easy to install and maintain
  * Direct MPI support
  * Stateless
MAILING LIST
============

http://beowulf-underground.org/mailman/listinfo/pvfs2-users

DOCUMENTATION
=============

http://www.orangefs.org/documentation/

USERSPACE FILESYSTEM SOURCE
===========================

http://www.orangefs.org/download

OrangeFS versions prior to 2.9.3 are not compatible with the upstream
version of the kernel client.
BUILDING THE USERSPACE FILESYSTEM ON A SINGLE SERVER
====================================================

Once OrangeFS is upstream, "--with-kernel" shouldn't be needed, but
until then the path to where the kernel with the OrangeFS kernel client
patch was built is needed to ensure that pvfs2-client-core (the bridge
between kernel space and user space) builds properly. You can omit
--prefix if you don't mind things being sprinkled around in /usr/local.

  ./configure --prefix=/opt/ofs --with-kernel=/path/to/orangefs/kernel
  make
  make install

Create an OrangeFS config file:

  /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

At the "Enter hostnames" prompt, use the actual hostname; don't let it
default to localhost.

Create a pvfs2tab file in /etc:

  cat /etc/pvfs2tab
  tcp://myhostname:3334/orangefs /mymountpoint pvfs2 defaults,noauto 0 0

Create the mount point you specified in the tab file if needed:

  mkdir /mymountpoint

Bootstrap the server:

  /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf -f

Start the server:

  /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server is running. At this point you might like to prove that
things are working with:

  /opt/ofs/bin/pvfs2-ls /mymountpoint

You might not want to enforce SELinux; it doesn't seem to matter as of
Linux 3.11.

If things seem to be working, turn on the client core:

  /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem:

  mount -t pvfs2 tcp://myhostname:3334/orangefs /mymountpoint
OPTIONS
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable POSIX locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. POSIX
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.
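For example, reusing the hostname and mount point from the example
above, all three options could be enabled at mount time with:

  mount -t pvfs2 -o acl,intr,local_lock tcp://myhostname:3334/orangefs /mymountpoint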
DEBUGGING
=========

If you want the debug (GOSSIP) statements in a particular source file
(inode.c for example) to go to syslog:

  echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default):

  echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files:

  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging:

  echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords:

  cat /sys/kernel/debug/orangefs/debug-help
PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE
============================================

OrangeFS is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of OrangeFS as "userspace"
from here on out. OrangeFS descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures. Function and variable names in
the kernel module have been transitioned to "orangefs", and the Linux
kernel coding style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.
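The shape of that exchange is sketched below. This is not the actual
client-core source, just a minimal illustration of the poll/read/write
cycle on the pseudo device (/dev/pvfs2-req, described further down);
the real upcall and downcall wire formats are elided:

  /* hypothetical sketch of the userspace side of the pseudo device */
  #include <fcntl.h>
  #include <poll.h>
  #include <unistd.h>

  void pump_requests(void)
  {
          char buf[4096];         /* assumed big enough for tag + upcall */
          int fd = open("/dev/pvfs2-req", O_RDWR);
          struct pollfd pfd = { .fd = fd, .events = POLLIN };

          for (;;) {
                  poll(&pfd, 1, -1);           /* wait for a pending upcall */
                  read(fd, buf, sizeof(buf));  /* fetch tag + upcall */
                  /* ...service the request, build the downcall... */
                  write(fd, buf, sizeof(buf)); /* return tag + downcall */
          }
  }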
THE BUFMAP:

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.
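A rough sketch of that setup, assuming a simplified PVFS_dev_map_desc
layout (the real userspace structure may carry additional members):

  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  /* simplified, assumed layout of the descriptor described above */
  struct PVFS_dev_map_desc {
          void    *ptr;           /* base address of the buffer */
          int32_t  total_size;    /* size of the whole buffer   */
          int32_t  size;          /* size of one partition      */
          int32_t  count;         /* number of partitions       */
  };

  static int describe_io_buffer(struct PVFS_dev_map_desc *desc)
  {
          void *p = NULL;

          if (posix_memalign(&p, 4096, 41943040))
                  return -1;
          mlock(p, 41943040);     /* keep the buffer resident */
          desc->ptr = p;
          desc->total_size = 41943040;
          desc->size = 4194304;   /* partition (block) size */
          desc->count = 10;
          /* desc is then handed to the kernel module's mapping
             routine with an ioctl on the pseudo device */
          return 0;
  }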
A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl. The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap),
which then contains:

  * refcnt - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to page_count * (sizeof(struct page*)) bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc))
    bytes of kcalloced memory. This memory is further initialized
    (see the loop sketched after this list):

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
        .
        .
        .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) + (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.
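The desc_array initialization above collapses to a simple loop. A
sketch, using the member names from the list (the real kernel code may
differ in detail):

  /* pages_per_desc = 4194304 / 4096 = 1024 */
  pages_per_desc = bufmap->desc_size / PAGE_SIZE;

  for (i = 0, offset = 0; i < bufmap->desc_count; i++) {
          bufmap->desc_array[i].page_array = &bufmap->page_array[offset];
          bufmap->desc_array[i].array_count = pages_per_desc;
          bufmap->desc_array[i].uaddr =
                  user_desc->ptr + (i * pages_per_desc * PAGE_SIZE);
          offset += pages_per_desc;
  }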
OPERATIONS:

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful (see the sketch after this list):

  * unknown  - op was just initialized
  * waiting  - op is on request_list (upward bound)
  * inprogr  - op is in progress (waiting for downcall)
  * serviced - op has matching downcall; ok
  * purged   - op has to start a timer since client-core
               exited uncleanly before servicing op
  * given up - submitter has given up waiting for it
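A state set like this is naturally expressed as an enum. A sketch with
hypothetical identifiers (the kernel module's actual names may differ):

  enum orangefs_op_state {          /* hypothetical names */
          OP_STATE_UNKNOWN,         /* just initialized         */
          OP_STATE_WAITING,         /* on the request list      */
          OP_STATE_INPROGR,         /* waiting for the downcall */
          OP_STATE_SERVICED,        /* matching downcall arrived */
          OP_STATE_PURGED,          /* client-core died first   */
          OP_STATE_GIVEN_UP,        /* submitter stopped waiting */
  };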
When some arbitrary userspace program needs to perform a
filesystem operation on OrangeFS (readdir, I/O, create, whatever)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

service_operation changes the op's state to "waiting", puts
it on the request list, and signals the OrangeFS file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

When the OrangeFS file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these (and some additional protocol) copy_to_users fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the OrangeFS
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

service_operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.
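The round trip can be condensed into a few lines of pseudo-C; this is
a paraphrase of the flow above, not the module's actual code:

  /* submitting side (kernel), simplified */
  op->state = WAITING;
  add_to_request_list(op);
  wake_up(poll_waiters);            /* userspace sees POLLIN       */
  wait_for_matching_downcall(op);   /* sleeps until "serviced"     */
  return op->downcall.status;

  /* pseudo device side (kernel), simplified */
  read:  op = next_ready_op();      /* take op off the request list */
         copy_to_user(tag, upcall);
         op->state = INPROGR;       /* hash op on its tag           */
  write: op = lookup_by_tag(tag);   /* remove from in_progress hash */
         copy_from_user(&op->downcall);
         op->state = SERVICED;      /* wake the submitter           */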
- The "client-core" is the bridge between the kernel module and
- userspace. The client-core is a daemon. The client-core has an
- associated watchdog daemon. If the client-core is ever signaled
- to die, the watchdog daemon restarts the client-core. Even though
- the client-core is restarted "right away", there is a period of
- time during such an event that the client-core is dead. A dead client-core
- can't be triggered by the Orangefs file_operations.poll function.
- Ops that pass through service_operation during a "dead spell" can timeout
- on the wait queue and one attempt is made to recycle them. Obviously,
- if the client-core stays dead too long, the arbitrary userspace processes
- trying to use Orangefs will be negatively affected. Waiting ops
- that can't be serviced will be removed from the request list and
- have their states set to "given up". In-progress ops that can't
- be serviced will be removed from the in_progress hash table and
- have their states set to "given up".
- Readdir and I/O ops are atypical with respect to their payloads.
- - readdir ops use the smaller of the two pre-allocated pre-partitioned
- memory buffers. The readdir buffer is only available to userspace.
- The kernel module obtains an index to a free partition before launching
- a readdir op. Userspace deposits the results into the indexed partition
- and then writes them to back to the pvfs device.
- - io (read and write) ops use the larger of the two pre-allocated
- pre-partitioned memory buffers. The IO buffer is accessible from
- both userspace and the kernel module. The kernel module obtains an
- index to a free partition before launching an io op. The kernel module
- deposits write data into the indexed partition, to be consumed
- directly by userspace. Userspace deposits the results of read
- requests into the indexed partition, to be consumed directly
- by the kernel module.
Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

  - int32_t type - type of operation.
  - int32_t status - return code for the operation.
  - int64_t trailer_size - 0 unless readdir operation.
  - char *trailer_buf - initialized to NULL, used during readdir operations.
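Put together, the downcall looks roughly like this. The union members
shown are a subset named after the response types listed below; the
response struct definitions live in the userspace protocol headers:

  /* simplified sketch of the userspace downcall structure */
  typedef struct {
          int32_t type;             /* type of operation  */
          int32_t status;           /* return code        */
          int64_t trailer_size;     /* 0 unless readdir   */
          char   *trailer_buf;      /* NULL unless readdir */
          union {
                  pvfs2_io_response_t       io;
                  pvfs2_statfs_response_t   statfs;
                  pvfs2_getxattr_response_t getxattr;
                  /* ...one member per response type... */
          } resp;                   /* member name assumed */
  } pvfs2_downcall_t;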
The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    fill in a string with the link target when the object is a symlink.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.
Userspace uses writev() on /dev/pvfs2-req to pass responses to the
requests made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t)
  - and, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory

is sent to the function (PINT_dev_write_list) which performs the writev.

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses:

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this:

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable
                         vfs_request
  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                        from out_downcall member of global variable
                        vfs_request
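In C, that initialization reduces to a fragment like the following;
the declarations of proto_ver, pdev_magic, tag, and vfs_request are
assumed from the descriptions above, and dev_fd is an assumed file
descriptor for /dev/pvfs2-req:

  struct iovec io_array[10];
  int count = 4;

  io_array[0].iov_base = &proto_ver;                 /* int32_t       */
  io_array[0].iov_len  = sizeof(int32_t);
  io_array[1].iov_base = &pdev_magic;                /* int32_t       */
  io_array[1].iov_len  = sizeof(int32_t);
  io_array[2].iov_base = &tag;                       /* PVFS_id_gen_t */
  io_array[2].iov_len  = sizeof(int64_t);
  io_array[3].iov_base = &vfs_request.out_downcall;
  io_array[3].iov_len  = sizeof(pvfs2_downcall_t);

  if (vfs_request.out_downcall.trailer_size > 0) {   /* readdir */
          io_array[4].iov_base = vfs_request.out_downcall.trailer_buf;
          io_array[4].iov_len  = vfs_request.out_downcall.trailer_size;
          count = 5;
  }

  writev(dev_fd, io_array, count);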
OrangeFS exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr, which uses two arguments to help it decide
whether or not to update an inode: "new" and "bypass".
OrangeFS keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any iteration of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0) orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. getattr_time is updated each time the
inode is updated.
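That decision might be sketched like this (a hypothetical fragment;
the real function's structure and the name of the timeout interval may
differ):

  /* inside orangefs_inode_getattr(), simplified */
  if (!new && !bypass &&
      time_before(jiffies, orangefs_inode->getattr_time))
          return 0;       /* cached attributes are still trusted */

  /* ...go to userspace for fresh attributes... */
  orangefs_inode->getattr_time = jiffies + timeout;  /* restart the clock */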
Creation of a new object (file, dir, sym-link) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". OrangeFS
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified OrangeFS stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. OrangeFS is a network filesystem, and objects
can potentially change out-of-band with any particular OrangeFS kernel
module instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap; obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.
The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem:

  "In general, if the clock may have wrapped around more than once, there
  is no way to tell how much time has elapsed. However, if the times t1
  and t2 are known to be fairly close, we can reliably compute the
  difference in a way that takes into account the possibility that the
  clock may have wrapped between times."

    from course notes by instructor Andy Wang
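The kernel's time_before()/time_after() macros embody that idea: the
comparison is done on the signed difference of the two samples, so a
single wrap between them still yields the right answer. Roughly (see
include/linux/jiffies.h for the real definitions):

  /* true if a is before b, even if jiffies wrapped in between */
  #define time_before(a, b)   ((long)((a) - (b)) < 0)

  if (time_before(jiffies, dentry->d_time))
          /* the dentry is still within its trust window */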