[ANNOUNCE] relayfs, a high-speed data relay filesystem

Tom Zanussi (zanussi@us.ibm.com)
Tue, 18 Mar 2003 10:34:17 -0600


A kernel patch implementing relayfs, a Linux filesystem designed to
simplify the buffering and efficient transfer of large amounts of data
from kernel to user space for those facilities that need to do such
things, is now available as:

http://www.opersys.com/ftp/pub/relayfs/patch-relayfs-2.5.65-030317.bz2

The relayfs home page, which will host the latest versions of the
above and related patches and info is:

http://www.opersys.com/relayfs

A preliminary description of the characteristics and API for relayfs
was previously posted to this list. Here's a link to that post:

http://marc.theaimsgroup.com/?l=linux-kernel&m=104196329807755&w=2

I've included at the end of this mail the relayfs doc file
(Documentation/filesystems/relafys.txt), which describes the API in
detail, as currently implemented.

Also, the Linux Trace Toolkit has been reworked to make use of
relayfs; a preliminary kernel patch and updated user tools
are available at the LTT website, and can be used as a concrete
working example of using the API:

http://www.opersys.com/relayfs/ltt-on-relayfs.html

I've tested the relayfs code (via the updated LTT code) on my UP PII
machine and so far haven't seen any problems with it, but I still need
to do SMP testing and otherwise pound more heavily on it - I plan on
generating some benchmark data, which in addition to putting relayfs
through its paces, should give some idea of how much overhead might be
introduced by the API/implementation itself, as compared with the
benchmark data previously posted for LTT without it:

http://marc.theaimsgroup.com/?l=linux-kernel&m=103573710926859&w=2

I wouldn't expect to see much difference, as the underlying logging
code is essentially the same.

Comments welcome.

Tom Zanussi <zanussi@us.ibm.com>
IBM Linux Technology Center/RAS

---------

relayfs - a high-speed data relay filesystem
============================================

relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large amounts of data from kernel space
to user space.

The main idea behind relayfs is that every data flow is put into a
separate "channel" and each channel is a file. In practice, each
channel is a separate memory buffer allocated from within kernel space
upon channel instantiation. Software needing to relay data to user
space would open a channel or a number of channels, depending on its
needs, and would log data to that channel. All the buffering and
locking mechanics are taken care of by relayfs. The actual format and
protocol used for each channel is up to relayfs' clients.

relayfs makes no provisions for copying the same data to more than a
single channel. This is for the clients of the relay to take care of,
and so is any form of data filtering. The purpose is to keep relayfs
as simple as possible.

Usage
=====

In addition to the relayfs kernel API described below, relayfs
implements basic file operations. Here are the file operations that
are available and some comments regarding their behavior:

open() enables user to open an _existing_ channel.

mmap() results in channel's memory buffer being mmapped into the
caller's memory space.

read() since we are dealing with circular buffers, the user is only
allowed to read forward. Some apps may want to loop around
read() waiting for incoming data - if there is no data
available, read will put the reader on a wait queue until
data is available.

write() writing from user space operates exactly as relay_write() does
(described below).

close() decrements the channel's refcount. When the refcount reaches
0 i.e. when no process or kernel client has the file open
(see relay_close() below), the channel buffer is freed.

In order for a user application to make use of relayfs files, the
relayfs filesystem must be mounted. For example,

mount -t relayfs relayfs /mountpoint

The relayfs kernel API
======================

relayfs channels are implemented as circular buffers subdivided into
'sub-buffers'. kernel clients write data into the channel using
relay_write(), and are notified via a set of callbacks when
significant events occur within the channel. 'Significant events'
include:

- a sub-buffer is full i.e. the current write won't fit into the
current sub-buffer, and a 'buffer-switch' is triggered, after which
the data is written into the next buffer (if the next buffer is
empty). The client is notified of this condition via two callbacks,
one providing an opportunity to perform start-of-buffer tasks, the
other end-of-buffer tasks.

- data is ready for the client to process. The client can choose to
be notified either on a per-sub-buffer basis (bulk delivery) or
per-write basis (packet delivery).

- all sub-buffers are full. The current write won't fit into the
current sub-buffer, and the next sub-buffer is also full, thus
writing the event would overwrite data in the next sub-buffer. The
client is notified of this condition and can choose to either go
ahead and overwrite, or suspend the channel, depending on the value
it returns.

Here's a summary of the API relayfs provides to in-kernel clients:

int relay_open(channel_path, bufsize, nbufs, channel_flags, callbacks,
start_reserve, end_reserve, channel_start_reserve)
int relay_write(channel_id, *data_ptr, count, time_delta_offset)
void relay_buffers_consumed(channel_id, buffers_consumed)
void relay_resume(channel_id)
int relay_info(channel_id, *channel_info)
int relay_close(channel_id)

----------
int relay_open(channel_path, bufsize, nbufs, channel_flags, callbacks,
start_reserve, end_reserve, channel_start_reserve)

relay_open() is used to create a new entry in relayfs. This new entry
is created according to channel_path. channel_path contains the
absolute path to the channel file on relayfs. If, for example, the
caller sets channel_path to "/xlog/9", a "xlog/9" entry will appear
within relayfs automatically and the "xlog" directory will be created
in the filesystem's root. relayfs does not implement any policy on
its content, except to disallow the opening of two channels using the
same file. There are, nevertheless a set of guidelines for using
relayfs. Basically, each facility using relayfs should use a top-level
directory identifying it. The entry created above, for example,
presumably belongs to the "xlog" software.

The remaining parameters for relay_open() are as follows:

- channel_flags - an ORed combination of attribute values controlling
common channel characteristics:

- logging scheme - relayfs use 2 mutually exclusive schemes
for logging data to a channel. The 'lockless scheme'
reserves and writes data to a channel without the need of
any type of locking on the channel. This is the preferred
scheme, but may not be available on a given architecture (it
relies on the presence of a cmpxchg instruction). It's
specified by the RELAY_SCHEME_LOCKLESS flag. The 'locking
scheme' either obtains a lock on the channel for writing or
disables interrupts, depending on whether the channel was
opened for SMP or global usage (see below). It's specified
by the RELAY_SCHEME_LOCKING flag. While a client may want
to explicitly specify a particular scheme to use, it's more
convenient to specify RELAY_SCHEME_ANY for this flag, which
will allow relayfs to choose the best available scheme i.e.
lockless if supported.

- SMP usage - this applies only when the locking scheme is in
use. If RELAY_USAGE_SMP is specified, it's assumed that the
channel will be used in a per-CPU fashion and consequently,
the only locking that will be done for writes is to disable
local irqs. If RELAY_USAGE_GLOBAL is specified, it's assumed
that writes to the buffer can occur within any CPU context,
and spinlock_irq_save will be used to lock the buffer.

- delivery mode - if RELAY_DELIVERY_BULK is specified, the
client will be notified via its deliver() callback whenever a
sub-buffer has been filled. Alternatively,
RELAY_DELIVERY_PACKET will cause delivery to occur after the
completion of each write. See the description of the channel
callbacks below for more details.

- timestamping - if RELAY_TIMESTAMP_TSC is specified and the
architecture supports it, efficient TSC 'timestamps' can be
associated with each write, otherwise more expensive
gettimeofday() timestamping is used. At the beginning of
each sub-buffer, a gettimeofday() timestamp and the current
TSC, if supported, are read, and are passed on to the client
via the buffer_start() callback. This allows correlation of
the current time with the current TSC for subsequent writes.
Each subsequent write is associated with a 'time delta',
which is either the current TSC, if the channel is using
TSCs, or the difference between the buffer_start gettimeofday
timestamp and the gettimeofday time read for the current
write. Note that relayfs never writes either a timestamp or
time delta into the buffer unless explicitly asked to (see
the description of relay_write() for details).

- bufsize - the size of the 'sub-buffers' making up the circular
channel buffer. For the lockless scheme, this must be a power of 2.
The locking scheme can use any size, but if RELAY_SCHEME_ANY is
specified, a bufsize that's a power of two would ensure that either
scheme could be used.

- nbufs - the number of 'sub-buffers' making up the circular channel
buffer. This must be a power of 2 for the lockless scheme, and
exactly 2 for the locking scheme. If RELAY_SCHEME_ANY is used and
the locking scheme is the only one available, the number of buffers
will be automatically set to 2, thus if this flag is used, it's a
good idea to specify the number of buffers preferred for the
lockless scheme, in case that's the one chosen.

- callbacks - a table of callback functions called when events occur
within the data relay that clients need to know about:

- int buffer_start(channel_id, current_write_pos, buffer_id,
start_time, start_tsc, using_tsc) -

called at the beginning of a new sub-buffer, the
buffer_start() callback gives the client an opportunity to
write data into space reserved at the beginning of a
sub-buffer. The client should only write into the buffer
if it specified a value for start_reserve and/or
channel_start_reserve (see below) when the channel was
opened. In the latter case, the client can determine
whether to write its one-time rchan_start_reserve data by
examining the value of buffer_id, which will be 0 for the
first sub-buffer. The address that the client can write
to is contained in current_write_pos (the client by
definition knows how much it can write i.e. the value it
passed to relay_open() for start_reserve/
channel_start_reserve). start_time contains the
gettimeofday() value for the start of the buffer and start
TSC contains the TSC read at the same time. The using_tsc
param indicates whether or not start_tsc is valid (it
wouldn't be if TSC timestamping isn't being used).

The client should return the number of bytes it wrote to
the channel, 0 if none.

- int buffer_end(channel_id, current_write_pos, end_of_buffer,
end_time, end_tsc, using_tsc)

called at the end of a sub-buffer, the buffer_end()
callback gives the client an opportunity to perform
end-of-buffer processing. Note that the current_write_pos
is the position where the next write would occur, but
since the current write wouldn't fit (which is the trigger
for the buffer_end event), the buffer is considered full
even though there may be unused space at the end. The
end_of_buffer param pointer value can be used to determine
exactly the size of the unused space. The client should
only write into the buffer if it specified a value for
end_reserve when the channel was opened. If the client
doesn't write anything i.e. returns 0, the unused space at
the end of the sub-buffer is available via relay_info() -
this data may be needed by the client later if it needs to
process raw sub-buffers (an alternative would be to save
the unused bytes count value in end_reserve space at the
end of each sub-buffer during buffer_end processing and
read it when needed at a later time. The other
alternative would be to use read(2), which makes the
unused count invisible to the caller). end_time contains
the gettimeofday() value for the end of the buffer and end
TSC contains the TSC read at the same time. The using_tsc
param indicates whether or not end_tsc is valid (it
wouldn't be if TSC timestamping isn't being used).

The client should return the number of bytes it wrote to
the channel, 0 if none.

- void deliver(channel_id, from, len)

called when data is ready for the client. This callback
is used to notify a client when a sub-buffer is complete
(in the case of bulk delivery) or a single write is
complete (packet delivery). A bulk delivery client might
wish to then signal a daemon that a sub-buffer is ready.
A packet delivery client might wish to process the packet
or send it elsewhere. The from param is a pointer to the
delivered data and len specifies how many bytes are ready.

- int buffers_full(channel_id)

called when a buffers-full condition is detected. This
callback is used to notify a client that further writes to
the channel will result in unread data being overwritten.
The client can signal that it doesn't want this to happen
by returning 1, which will 'suspend' the channel. Further
writes to the channel will be discarded, and the channel's
events_lost counter will be incremented for each discarded
write. The channel can be subsequently 'resumed' by
calling the relay_resume() API function. If the client
returns 0, writing will continue unhindered into the next
sub-buffer. This 'feature' could be used to implement
'flight recorder' applications, for instance.

- start_reserve - the number of bytes to be reserved at the start of
each sub-buffer. The client can do what it wants with this number
of bytes when the buffer_start() callback is invoked. Typically
clients would use this to write per-sub-buffer header data.

- end_reserve - the number of bytes to be reserved at the end of each
sub-buffer. The client can do what it wants with this number of
bytes when the buffer_end() callback is invoked. Typically clients
would use this to write per-sub-buffer footer data.

- channel_start_reserve - the number of bytes to be reserved, in
addition to start_reserve, at the beginning of the first sub-buffer
in the channel. The client can do what it wants with this number of
bytes when the buffer_start() callback is invoked. Typically
clients would use this to write per-channel header data.

Upon successful completion, relay_open() returns a channel id
to be used for all other operations with the relay. All buffers
managed by the relay are allocated using rvmalloc/rvfree to allow
for easy mmapping to user-space.

----------
int relay_write(channel_id, *data_ptr, count, time_delta_offset)

relay_write() reserves space in the channel and writes count bytes of
data pointed to by data_ptr to it. Automatically performs any
necessary locking, depending on the scheme and SMP usage in effect (no
locking is done for the lockless scheme regardless of usage). It
returns the number of bytes written, or a negative number on failure.
If time_delta_offset is >= 0, the internal time delta, the internal
time delta calculated when the slot was reserved will be written at
that offset. This is the TSC or gettimeofday() delta between the
current write and the beginning of the buffer, whichever method is
being used by the channel. Trying to write a count larger than the
bufsize specified to relay_open() (taking into account the reserved
start-of-buffer and end-of-buffer space as well) will fail.

----------
void relay_buffers_consumed(channel_id, buffers_consumed)

relay_buffers_consumed() updates the channel's buffers_consumed
counter, which is used in conjunction with its buffers_produced
counter to determine when a buffers_full condition exists. The count
in the buffers_consumed param is added to the channel's
buffers_consumed counter, which should track the actual number of
buffers consumed by the client. This would typically be called once
the client had finished processing a delivered sub-buffer (bulk
clients) or had received end-of-buffer notification (packet clients).
If a client is operating in 'flight recorder' mode it never needs to
call this function; suspending the channel in this case would be done
by returning 1 from the next buffers_full calback, and the channel
could be subsequently resumed via relay_resume(), which automatically
'catches up' the buffers_consumed count with buffers_produced.

----------
void relay_resume(channel_id)

relay_resume() resume a suspended channel. A channel that had a
'buffers full' condition detected for it, and was told to 'suspend'
via a non-zero buffers_full callback return value can again be written
to following this call.

----------
int relay_info(channel_id, *channel_info)

relay_info fills in an rchan_info struct with channel status and
attribute information such as usage modes, sub-buffer size and count,
the allocated size of the entire buffer, buffers produced and
consumed, current buffer id, count of writes lost due to buffers full
condition.

Clients may need to know how many 'unused' bytes there are at the end
of a given sub-buffer. This would only be the case if the client 1)
didn't either write this count to the end of the sub-buffer or
otherwise note it (it's available as the difference between the buffer
end and current write pos params in the buffer_end callback) (if the
client returned 0 from the buffer_end callback, it's assumed that this
is indeed the case) 2) isn't using the read() system call to read the
buffer. In other words, if the client isn't annotating the stream and
is reading the buffer by mmaping it, this information would be needed
in order for the client to 'skip over' the unused bytes at the ends of
sub-buffers.

Additionally, for the lockless scheme, clients may need to know
whether a particular sub-buffer is actually complete. An array of
boolean values, one per sub-buffer, contains non-zero if the buffer is
complete, non-zero otherwise.

----------
int relay_close(channel_id)

relay_close is used to close the channel. It finalizes the last
sub-buffer (the one currently being written to) and marks the channel
as finalized. This doesn't free the channel buffer or channel data
structure - this is handled automatically when the last reference to
the channel is given up.

Writing directly into the channel
=================================

Using the relay_write() API function as described above is the
preferred means of writing into a channel. In some cases, however,
in-kernel clients might want to write directly into a relay channel
rather than have relay_write() copy it into the buffer on the client's
behalf. Clients wishing to do this should follow the model used to
implement relay_write itself. The general sequence is:

- get a pointer to the channel via rchan_get(). This increments the
channel's reference count.
- call relay_lock_channel(). This will perform the proper locking for
the channel given the scheme in use and the SMP usage.
- reserve a slot in the channel via relay_reserve()
- write directly to the reserved address
- call relay_commit() to commit the write
- call relay_unlock_channel()
- call rchan_put() to release the channel reference

In particular, clients should make sure they call rchan_get() and
rchan_put() and not hold on to references to the channel pointer.
Also, forgetting to use relay_lock_channel()/relay_unlock_channel()
has no effect if the lockless scheme is being used, but could result
in corrupted buffer contents if the locking scheme is used.

Limitations
===========

Writes made via the write() system call are currently limited to 2
pages worth of data. There is no such limit on the in-kernel API
function relay_write().

User applications can currently only mmap the complete buffer (it
doesn't really make sense to mmap only part of it, given its purpose).

Latest version
==============

The latest version can be found at:

http://www.opersys.com

An example relayfs client, the Linux Trace Toolkit, can also be found
there.

Credits
=======

The ideas and specs for relayfs came about as a result of discussions
on tracing involving the following:

Michel Dagenais <michel.dagenais@polymtl.ca>
Richard Moore <richardj_moore@uk.ibm.com>
Bob Wisniewski <bob@watson.ibm.com>
Karim Yaghmour <karim@opersys.com>
Tom Zanussi <zanussi@us.ibm.com>

-- 
Regards,

Tom Zanussi <zanussi@us.ibm.com> IBM Linux Technology Center/RAS

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/