Hardware Recommendations
Ceph was designed to run on commodity hardware, which makes building and maintaining petabyte-scale data clusters economically feasible. When planning out your cluster hardware, you will need to balance a number of considerations, including failure domains and potential performance issues. Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts. Generally, we recommend running Ceph daemons of a specific type on a host configured for that type of daemon. We recommend using other hosts for processes that utilize your data cluster (e.g., OpenStack, CloudStack, etc).
Tip
Check out the Ceph blog too.
CPU
CephFS metadata servers are CPU intensive, so they should have significant processing power (e.g., quad core or better CPUs) and benefit from higher clock rate (frequency in GHz). Ceph OSDs run the RADOS service, calculate data placement with CRUSH, replicate data, and maintain their own copy of the cluster map. Therefore, OSD nodes should have a reasonable amount of processing power. Requirements vary by use-case; a starting point might be one core per OSD for light / archival usage, and two cores per OSD for heavy workloads such as RBD volumes attached to VMs. Monitor / manager nodes do not have heavy CPU demands so a modest processor can be chosen for them. Also consider whether the host machine will run CPU-intensive processes in addition to Ceph daemons. For example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will need to ensure that these other processes leave sufficient processing power for Ceph daemons. We recommend running additional CPU-intensive processes on separate hosts to avoid resource contention.
RAM
Generally, more RAM is better. Monitor / manager nodes for a modest cluster might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB is a reasonable target. There is a memory target for BlueStore OSDs that defaults to 4GB. Factor in a prudent margin for the operating system and administrative tasks (like monitoring and metrics) as well as increased consumption during recovery: provisioning ~8GB per BlueStore OSD is advised.
Monitors and managers (ceph-mon and ceph-mgr)
Monitor and manager daemon memory usage generally scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32GB suffices. For clusters of up to,
say, 300 OSDs, go with 64GB. For clusters built with (or which will grow to)
even more OSDs, you should provision 128GB. You may also want to consider
tuning settings like mon_osd_cache_size or rocksdb_cache_size after careful
research.
Metadata servers (ceph-mds)
The metadata daemon memory utilization depends on how much memory its cache is
configured to consume. We recommend 1GB as a minimum for most systems. See
mds_cache_memory.
Memory
BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the osd_memory_target
configuration option.
Setting the osd_memory_target below 2GB is typically not recommended: Ceph may fail to keep memory consumption under 2GB, which may cause extremely slow performance.
Setting the memory target between 2GB and 4GB typically works but may result in degraded performance, because metadata may be read from disk during IO unless the active data set is relatively small.
4GB is the current default osd_memory_target size. This default was chosen for typical use cases and is intended to balance memory requirements and OSD performance.
Setting the osd_memory_target higher than 4GB can improve performance when there are many (small) objects or when large (256GB/OSD or more) data sets are processed.
Important
The OSD memory autotuning is "best effort". While the OSD may unmap memory to allow the kernel to reclaim it, there is no guarantee that the kernel will actually reclaim freed memory within a specific time frame. This applies especially in older versions of Ceph, where transparent huge pages can prevent the kernel from reclaiming memory that was freed from fragmented huge pages. Modern versions of Ceph disable transparent huge pages at the application level to avoid this, though that still does not guarantee that the kernel will immediately reclaim unmapped memory. The OSD may still at times exceed its memory target. We recommend budgeting around 20% extra memory on your system to prevent OSDs from going OOM during temporary spikes or due to any delay in reclaiming freed pages by the kernel. That value may be more or less than needed depending on the exact configuration of the system.
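As a rough illustration of the budgeting advice above, the following sketch adds the suggested 20% headroom to each OSD's memory target and then adds a fixed allowance for the operating system. The function name and the 16 GiB OS allowance are assumptions made for this example, not Ceph defaults; substitute values measured on your own hosts.

```python
GIB = 1024 ** 3


def osd_host_ram_bytes(num_osds, osd_memory_target=4 * GIB,
                       headroom=0.20, os_reserve=16 * GIB):
    """Rough RAM budget for one host running BlueStore OSDs.

    osd_memory_target: per-OSD target in bytes (4 GiB is the default named in the text).
    headroom: extra fraction per OSD (the ~20% suggested in the text).
    os_reserve: allowance for the OS and other tasks (an assumption, not a Ceph value).
    """
    osd_total = num_osds * osd_memory_target * (1 + headroom)
    return int(osd_total + os_reserve)


if __name__ == "__main__":
    for osds in (4, 12, 24):
        gib = osd_host_ram_bytes(osds) / GIB
        print(f"{osds:>2} OSDs: plan for about {gib:.1f} GiB of RAM")
```

Raising `osd_memory_target` for workloads with many small objects increases the result proportionally, so rerun the estimate whenever that setting changes.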
When using the legacy FileStore back end, the page cache is used for caching data, so no tuning is normally needed. With FileStore, OSD memory consumption is related to the number of PGs per daemon in the system.
Data Storage
Plan your data storage configuration carefully. There are significant cost and performance tradeoffs to consider when planning for data storage. Simultaneous OS operations, and simultaneous request for read and write operations from multiple daemons against a single drive can slow performance considerably.
Hard Disk Drives
OSDs should have plenty of hard disk drive space for object data. We recommend a minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage of larger disks. We recommend dividing the price of the hard disk drive by the number of gigabytes to arrive at a cost per gigabyte, because drive size can significantly affect that figure. For example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the 1 terabyte disks would increase the cost per gigabyte by 50% (0.0732 / 0.0488 = 1.5), rendering your cluster substantially less cost efficient.
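The per-gigabyte comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch using the example prices from the paragraph above; the function name is illustrative, and 1 TB is counted as 1024 GB to match the text.

```python
def cost_per_gb(price_usd, capacity_tb):
    """Drive price divided by capacity in gigabytes (1 TB counted as 1024 GB)."""
    return price_usd / (capacity_tb * 1024)


small = cost_per_gb(75.00, 1)    # 1 TB drive from the example
large = cost_per_gb(150.00, 3)   # 3 TB drive from the example

# Relative premium paid per gigabyte when choosing the smaller drive.
premium_percent = (small / large - 1) * 100

print(f"1 TB drive: ${small:.4f} per GB")
print(f"3 TB drive: ${large:.4f} per GB")
print(f"Premium for the 1 TB drive: {premium_percent:.0f}%")
```

Comparing the unrounded per-gigabyte figures, rather than the values rounded to cents, avoids understating the difference between drive sizes.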
Tip
Running multiple OSDs on a single SAS / SATA drive is NOT a good idea. NVMe drives, however, can achieve improved performance by being split into two or more OSDs.
Tip
Running an OSD and a monitor or a metadata server on a single drive is also NOT a good idea.
Storage drives are subject to limitations on seek time, access time, read and write times, as well as total throughput. These physical limitations affect overall system performance--especially during recovery. We recommend using a dedicated (ideally mirrored) drive for the operating system and software, and one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above). Many "slow OSD" issues not attributable to hardware failure arise from running an operating system and multiple OSDs on the same drive. Since the cost of troubleshooting performance issues on a small cluster likely exceeds the cost of the extra disk drives, you can optimize your cluster design planning by avoiding the temptation to overtax the OSD storage drives.
You may run multiple Ceph OSD Daemons per SAS / SATA drive, but this will likely lead to resource contention and diminish the overall throughput.
Solid State Drives
One opportunity for performance improvement is to use solid-state drives (SSDs) to reduce random access time and read latency while accelerating throughput. SSDs often cost more than 10x as much per gigabyte when compared to a hard disk drive, but SSDs often exhibit access times that are at least 100x faster than a hard disk drive.
SSDs do not have moving mechanical parts so they are not necessarily subject to the same types of limitations as hard disk drives. SSDs do have significant limitations though. When evaluating SSDs, it is important to consider the performance of sequential reads and writes.
Important
We recommend exploring the use of SSDs to improve performance. However, before making a significant investment in SSDs, we strongly recommend both reviewing the performance metrics of an SSD and testing the SSD in a test configuration to gauge performance.
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution. Acceptable IOPS are not enough when selecting an SSD for use with Ceph.
SSDs have historically been cost prohibitive for object storage, though emerging QLC drives are closing the gap. HDD OSDs may see a significant performance improvement by offloading WAL+DB onto an SSD.
One way Ceph accelerates CephFS file system performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
provides a default metadata pool for CephFS metadata. You will never have to
create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for
your CephFS metadata pool that points only to a host's SSD storage media. See
CRUSH Device Class for details.
Controllers
Disk controllers (HBAs) can have a significant impact on write throughput. Carefully consider your selection to ensure that they do not create a performance bottleneck. Notably RAID-mode (IR) HBAs may exhibit higher latency than simpler "JBOD" (IT) mode HBAs, and the RAID SoC, write cache, and battery backup can substantially increase hardware and maintenance costs. Some RAID HBAs can be configured with an IT-mode "personality".
Tip
The Ceph blog is often an excellent source of information on Ceph performance issues. See Ceph Write Throughput 1 and Ceph Write Throughput 2 for additional details.
Benchmarking
BlueStore opens block devices in O_DIRECT and uses fsync frequently to ensure
that data is safely persisted to media. You can evaluate a drive's low-level
write performance using fio. For example, 4kB random write performance is
measured as follows:
# fio --name=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300
Write Caches
Enterprise SSDs and HDDs normally include power loss protection features which use multi-level caches to speed up direct or synchronous writes. These devices can be toggled between two caching modes -- a volatile cache flushed to persistent media with fsync, or a non-volatile cache written synchronously.
These two modes are selected by either "enabling" or "disabling" the write (volatile) cache. When the volatile cache is enabled, Linux uses a device in "write back" mode, and when disabled, it uses "write through".
The default configuration (normally caching enabled) may not be optimal, and OSD performance may be dramatically increased in terms of increased IOPS and decreased commit_latency by disabling the write cache.
Users are therefore encouraged to benchmark their devices with fio as
described earlier and persist the optimal cache configuration for their
devices.
The cache configuration can be queried with hdparm, sdparm, smartctl, or by
reading the values in /sys/class/scsi_disk/*/cache_type, for example:
# hdparm -W /dev/sda
/dev/sda:
write-caching = 1 (on)
# sdparm --get WCE /dev/sda
/dev/sda: ATA TOSHIBA MG07ACA1 0101
WCE 1 [cha: y]
# smartctl -g wcache /dev/sda
smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
Write cache is: Enabled
# cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
write back
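For hosts with many drives, the same sysfs values can be collected programmatically. The sketch below reads every cache_type file under /sys/class/scsi_disk and flags devices that report a volatile (write back) cache. The function names are illustrative, and the root directory is a parameter only so that the logic can be exercised against a test directory.

```python
from pathlib import Path


def read_cache_types(sysfs_root="/sys/class/scsi_disk"):
    """Return {device_id: cache_type} for every SCSI disk found under sysfs_root."""
    result = {}
    for entry in sorted(Path(sysfs_root).glob("*/cache_type")):
        result[entry.parent.name] = entry.read_text().strip()
    return result


def volatile_cache_enabled(cache_type):
    """True when the kernel reports a write back (volatile) cache mode."""
    return cache_type.startswith("write back")


if __name__ == "__main__":
    for device, mode in read_cache_types().items():
        note = "volatile cache enabled" if volatile_cache_enabled(mode) else "ok"
        print(f"{device}: {mode} ({note})")
```

This only reports the current mode; whether disabling the cache actually helps a given device still has to be confirmed by benchmarking.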
The write cache can be disabled with those same tools:
# hdparm -W0 /dev/sda
/dev/sda:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
# sdparm --clear WCE /dev/sda
/dev/sda: ATA TOSHIBA MG07ACA1 0101
# smartctl -s wcache,off /dev/sda
smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Write cache disabled
Normally, disabling the cache using hdparm, sdparm, or smartctl results in the
cache_type changing automatically to "write through". If this is not the case,
you can try setting it directly as follows. (Users should note that setting
cache_type also correctly persists the caching mode of the device until the
next reboot):
# echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type
# hdparm -W /dev/sda
/dev/sda:
write-caching = 0 (off)
Tip
This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write through":
# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"
Tip
This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write through":
# cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"
Tip
The sdparm utility can be used to view/change the volatile write
cache on several devices at once:
# sdparm --get WCE /dev/sd*
/dev/sda: ATA TOSHIBA MG07ACA1 0101
WCE 0 [cha: y]
/dev/sdb: ATA TOSHIBA MG07ACA1 0101
WCE 0 [cha: y]
# sdparm --clear WCE /dev/sd*
/dev/sda: ATA TOSHIBA MG07ACA1 0101
/dev/sdb: ATA TOSHIBA MG07ACA1 0101
Additional Considerations
You typically will run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the full ratio, which causes Ceph to halt
operations as a safety precaution that prevents data loss.
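One way to reason about the host-failure scenario above is to estimate how full the cluster would be once a failed host's data has been re-created on the remaining hosts. The sketch below is a simplified model: it assumes data rebalances evenly across the surviving capacity, the host names and sizes are hypothetical, and the 0.95 threshold is an assumed full ratio that should be checked against your cluster's configuration.

```python
FULL_RATIO = 0.95  # assumed threshold; verify against your cluster's configured full ratio


def utilization_after_host_loss(host_capacity_tb, used_tb, failed_host):
    """Raw utilization after the failed host's data is re-created on the other hosts.

    Assumes the stored data can rebalance evenly over the remaining capacity.
    """
    remaining = sum(cap for host, cap in host_capacity_tb.items()
                    if host != failed_host)
    return used_tb / remaining


if __name__ == "__main__":
    # Hypothetical cluster: four hosts with 100 TB raw capacity each, 300 TB in use.
    hosts = {"host-a": 100, "host-b": 100, "host-c": 100, "host-d": 100}
    used = 300
    for name in hosts:
        u = utilization_after_host_loss(hosts, used, name)
        status = "exceeds full ratio" if u >= FULL_RATIO else "ok"
        print(f"lose {name}: {u:.0%} utilized ({status})")
```

In this hypothetical layout each host holds a quarter of the data, so a cluster that is 75% full cannot absorb the loss of any single host; more hosts, or lower utilization, shrink the share at risk.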
When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See OS Recommendations for notes on glibc and syncfs(2) to
ensure that your hardware performs as expected when running multiple OSDs per
host.
Networks
Provision at least 10Gbps networking in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TB takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 3 hours
respectively. In a petabyte-scale cluster, failure of an OSD drive is an
expectation, not an exception. System administrators will appreciate PGs
recovering from a degraded state to an active + clean state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
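The replication times quoted above can be sanity-checked with a line-rate calculation. This sketch computes the ideal transfer time for a given data size and link speed; the `efficiency` parameter is an assumption added here to model protocol overhead and contention, and real recovery times are typically longer than the ideal figures, which is consistent with the rounded-up numbers in the text.

```python
def transfer_seconds(data_tb, link_gbps, efficiency=1.0):
    """Ideal time in seconds to move data_tb terabytes over a link_gbps link.

    efficiency below 1.0 models protocol overhead and contention (an assumption).
    """
    bits = data_tb * 1e12 * 8
    return bits / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    for tb in (1, 10):
        for gbps in (1, 10):
            hours = transfer_seconds(tb, gbps) / 3600
            print(f"{tb:>2} TB over {gbps:>2} Gbps: {hours:.2f} hours at line rate")
```

Because time scales linearly with data size and inversely with bandwidth, moving ten times the data over a link that is ten times faster takes the same time as the original transfer.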
Additionally, some deployment tools employ VLANs to make hardware and network
cabling more manageable. VLANs using 802.1q protocol require VLAN-capable NICs
and Switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10G Ethernet or better; 40Gb or
25/50/100 Gb networking as of 2020 is common for production clusters.
Top-of-rack routers for each network also need to be able to communicate with spine routers that have even faster throughput, often 40Gbps or more.
Your server hardware should have a Baseboard Management Controller (BMC). Administration and deployment tools may also use BMCs extensively, especially via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band network for administration. Hypervisor SSH access, VM image uploads, OS image installs, management sockets, etc. can impose significant loads on a network. Running three networks may seem like overkill, but each traffic path represents a potential capacity, throughput and/or performance bottleneck that you should carefully consider before deploying a large scale data cluster.
Failure Domains
A failure domain is any failure that prevents access to one or more OSDs. That could be a stopped daemon on a host, a hard disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, and so forth. When planning out your hardware needs, you must balance the temptation to reduce costs by placing too many responsibilities into too few failure domains against the added costs of isolating every potential failure domain.
Minimum Hardware Recommendations
Ceph can run on inexpensive commodity hardware. Small production clusters and development clusters can run successfully with modest hardware.
| Process | Criteria | Minimum Recommended |
|---|---|---|
| ceph-osd | Processor | |
| | RAM | |
| | Volume Storage | 1x storage drive per daemon |
| | DB/WAL | 1x SSD partition per daemon (optional) |
| | Network | 1x 1GbE+ NICs (10GbE+ recommended) |
| ceph-mon | Processor | |
| | RAM | 2-4GB+ per daemon |
| | Disk Space | 60 GB per daemon |
| | Network | 1x 1GbE+ NICs |
| ceph-mds | Processor | |
| | RAM | 2GB+ per daemon |
| | Disk Space | 1 MB per daemon |
| | Network | 1x 1GbE+ NICs |
Tip
If you are running an OSD with a single disk, create a partition for your volume storage that is separate from the partition containing the OS. Generally, we recommend separate disks for the OS and the volume storage.