R function to create an HDF5 dataset and define its dimensionality and compression behaviour.
h5createDataset(
file,
dataset,
dims,
maxdims = dims,
storage.mode = "double",
H5type = NULL,
size = NULL,
encoding = NULL,
chunk = dims,
fillValue,
level = 6,
filter = "gzip",
shuffle = TRUE,
native = FALSE
)
file: The filename (character) of the file in which the dataset will be
located. Advanced programmers can instead provide an object of
class H5IdComponent representing an H5 location identifier (file or
group). See H5Fcreate(), H5Fopen(), H5Gcreate(), H5Gopen() to
create an object of this kind.
dataset: Name of the dataset to be created. The name can contain group names, e.g. 'group/dataset', but the function will fail if the group does not yet exist.
dims: The dimensions of the array as they will appear in the file. Note that the dimensions will appear in inverted order when viewing the file with a C program (e.g. HDFView), because the fastest changing dimension in R is the first one, whereas the fastest changing dimension in C is the last one.
maxdims: The maximum extension of the array. Use H5Sunlimited()
to indicate an extensible dimension.
storage.mode: The storage mode of the data to be written. Can be
obtained by storage.mode(mydata).
H5type: Advanced programmers can specify the datatype of the dataset
within the file. See h5const("H5T") for a list of available
datatypes. If H5type is specified, the argument storage.mode
is ignored. It is recommended to use storage.mode instead.
size: For storage.mode='character', the maximum string length to use.
The default value of NULL will result in using variable length strings.
See the details for more information on this option.
encoding: The encoding of the string data type. Valid options are "ASCII" or "UTF-8".
chunk: The chunk size used to store the dataset. It is an integer
vector of the same length as dims. This argument is usually set
together with a compression property (argument level).
fillValue: Standard value for filling the dataset. The storage.mode of this value must be convertible to the dataset type by HDF5.
level: The compression level used. An integer value between 0 (no compression) and 9 (highest and slowest compression).
filter: Character defining which compression filter should be applied to the chunks of the dataset. See the Details section for more information on the options that can be provided here.
shuffle: Logical defining whether the byte-shuffle algorithm should be applied to data prior to compression.
native: An object of class logical. If TRUE, array-like objects
are treated as stored in HDF5 row-major rather than R column-major
orientation. Using native = TRUE increases HDF5 file portability
between programming languages. A file written with native = TRUE
should also be read with native = TRUE.
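As a rough illustration of the maxdims argument, the sketch below creates a dataset whose second dimension is extensible and then grows it. It assumes the rhdf5 package is installed and relies on h5set_extent(), which is not described in this section; the file name, the dataset name "grow", and the chunk size are hypothetical choices made only for this example.

```r
library(rhdf5)

# hypothetical temporary file, used only for this illustration
f <- tempfile(fileext = ".h5")
h5createFile(f)

# a 5 x 2 integer dataset whose second dimension may grow without limit;
# the chunk size is given explicitly because extensible datasets are chunked
h5createDataset(f, "grow", dims = c(5, 2), maxdims = c(5, H5Sunlimited()),
                storage.mode = "integer", chunk = c(5, 1))
h5write(matrix(1:10, nrow = 5, ncol = 2), file = f, name = "grow")

# extend the dataset to 5 x 4 and write into the two new columns
h5set_extent(f, "grow", c(5, 4))
h5write(matrix(11:20, nrow = 5, ncol = 2), file = f, name = "grow",
        index = list(NULL, 3:4))

grown <- h5read(f, "grow")
dim(grown)
```

If dims and maxdims are left equal (the default), the dataset has a fixed extent and a call like the h5set_extent() one above would not be able to enlarge it.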
Returns (invisibly) TRUE if dataset was created successfully and FALSE otherwise.
Creates a new dataset in an existing HDF5 file. The function will fail if the file doesn't exist or if a dataset with the same name already exists within the specified file.
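Because the file, and any group named in the dataset path, must already exist, a typical sequence is to create them first. The following is a minimal sketch assuming rhdf5 is installed; the group name "results" and dataset name "counts" are hypothetical.

```r
library(rhdf5)

# hypothetical temporary file, used only for this illustration
f <- tempfile(fileext = ".h5")
h5createFile(f)

# the group has to exist before a dataset can be created inside it
h5createGroup(f, "results")
ok <- h5createDataset(f, "results/counts", dims = c(3, 4),
                      storage.mode = "integer")

# list the file contents to confirm where the dataset was placed
contents <- h5ls(f)
contents[, c("group", "name", "otype")]
```

Capturing the invisible return value in ok makes it possible to check programmatically whether the dataset was created.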
The size argument is only used when storage.mode = 'character'. When
storing strings, HDF5 can use either a fixed or variable length datatype.
Setting size to a positive integer will use fixed length strings, where
size defines the length. rhdf5 writes null-padded strings by default,
so to avoid data loss the value provided here should be the length of the
longest string. Setting size = NULL will use variable length strings. The
best choice depends on the nature of the strings you're writing. The
principal difference is that a dataset of variable length strings will not be
compressed by HDF5, but each individual string only uses the space it
requires, whereas in a fixed length dataset each string is of length
size, but the whole dataset can be compressed. This is explored further in the
examples below.
The filter argument can take several options, corresponding to compression
filters distributed either with the HDF5 library in Rhdf5lib or via
the rhdf5filters package. The plugins available and the corresponding
values for selecting them are shown below:
"GZIP", "ZLIB", "DEFLATE"
"SZIP"
"BZIP2"
"BLOSC_BLOSCLZ"
"BLOSC_LZ4"
"BLOSC_LZ4HC"
"BLOSC_SNAPPY"
"BLOSC_ZLIB"
"BLOSC_ZSTD"
"LZF"
"NONE"
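To give a feel for what the filter argument changes, the sketch below writes the same matrix twice, once with filter = "NONE" and once with filter = "GZIP", and compares the resulting file sizes. It assumes rhdf5 is installed; the data, file names, and chunk size are hypothetical and chosen only so that the data compress well.

```r
library(rhdf5)

# highly repetitive integer data, so a compression filter should help
x <- matrix(rep(1:10, times = 10000), nrow = 1000, ncol = 100)

# hypothetical temporary files, one per filter setting
f_none <- tempfile(fileext = ".h5")
f_gzip <- tempfile(fileext = ".h5")
h5createFile(f_none)
h5createFile(f_gzip)

# identical layout in both files; only the filter differs
h5createDataset(f_none, "x", dims = dim(x), storage.mode = "integer",
                chunk = c(100, 100), filter = "NONE")
h5createDataset(f_gzip, "x", dims = dim(x), storage.mode = "integer",
                chunk = c(100, 100), filter = "GZIP", level = 6)

h5write(x, f_none, "x")
h5write(x, f_gzip, "x")

# compare the on-disk sizes in bytes
sizes <- c(none = file.size(f_none), gzip = file.size(f_gzip))
sizes
```

The size difference depends heavily on the data; random values would show far less benefit than the repetitive matrix used here.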
h5createFile("ex_createDataset.h5")
# create dataset with compression
h5createDataset("ex_createDataset.h5", "A", c(5, 8), storage.mode = "integer", chunk = c(5, 1), level = 6)
# create datasets using the default chunking and compression settings
h5createDataset("ex_createDataset.h5", "B", c(5, 8), storage.mode = "integer")
h5createDataset("ex_createDataset.h5", "C", c(5, 8), storage.mode = "double")
# create dataset with bzip2 compression
h5createDataset("ex_createDataset.h5", "D", c(5, 8), storage.mode = "integer",
                chunk = c(5, 1), filter = "BZIP2", level = 6)
# create a dataset of strings & define size based on longest string
ex_strings <- c('long', 'longer', 'longest')
h5createDataset("ex_createDataset.h5", "E",
                storage.mode = "character", chunk = 3, level = 6,
                dims = length(ex_strings), size = max(nchar(ex_strings)))
# write data to dataset
h5write(matrix(1:40, nrow = 5, ncol = 8), file = "ex_createDataset.h5", name = "A")
# write second column
h5write(matrix(1:5, nrow = 5, ncol = 1), file = "ex_createDataset.h5", name = "B", index = list(NULL, 2))
# write character vector
h5write(ex_strings, file = "ex_createDataset.h5", name = "E")
h5dump("ex_createDataset.h5")
#> $A
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    1    6   11   16   21   26   31   36
#> [2,]    2    7   12   17   22   27   32   37
#> [3,]    3    8   13   18   23   28   33   38
#> [4,]    4    9   14   19   24   29   34   39
#> [5,]    5   10   15   20   25   30   35   40
#>
#> $B
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    0    1    0    0    0    0    0    0
#> [2,]    0    2    0    0    0    0    0    0
#> [3,]    0    3    0    0    0    0    0    0
#> [4,]    0    4    0    0    0    0    0    0
#> [5,]    0    5    0    0    0    0    0    0
#>
#> $C
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    0    0    0    0    0    0    0    0
#> [2,]    0    0    0    0    0    0    0    0
#> [3,]    0    0    0    0    0    0    0    0
#> [4,]    0    0    0    0    0    0    0    0
#> [5,]    0    0    0    0    0    0    0    0
#>
#> $D
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,]    0    0    0    0    0    0    0    0
#> [2,]    0    0    0    0    0    0    0    0
#> [3,]    0    0    0    0    0    0    0    0
#> [4,]    0    0    0    0    0    0    0    0
#> [5,]    0    0    0    0    0    0    0    0
#>
#> $E
#> [1] "long"    "longer"  "longest"
#>
## Investigating fixed vs variable length string datasets
## create 1000 random strings with length between 50 and 100 characters
words <- ceiling(runif(n = 1000, min = 50, max = 100)) |>
  vapply(FUN = \(x) {
    paste(sample(letters, size = x, replace = TRUE), collapse = "")
  },
  FUN.VALUE = character(1))
## create two HDF5 files
f1 <- tempfile()
f2 <- tempfile()
h5createFile(f1)
h5createFile(f2)
## create two string datasets
## the first is variable length strings, the second fixed at the length of our longest word
h5createDataset(f1, "strings", dims = length(words), storage.mode = "character", size = NULL, chunk = 25)
h5createDataset(f2, "strings", dims = length(words), storage.mode = "character", size = max(nchar(words)), chunk = 25)
## Write the data
h5write(words, f1, "strings")
h5write(words, f2, "strings")
## Check file sizes.
## In this example the fixed length string dataset is normally much smaller
file.size(f1)
#> [1] 106866
file.size(f2)
#> [1] 58816