Tuesday, November 4, 2014

OpenStack-Swift-VS-Ceph-RGW-READ-Performance

Assumption and purpose:
This is an attempt to compare two promising open source object store technologies purely on performance. The use case in mind is a large- or small-scale public cloud storage provider, and the goal is to evaluate which technology suits that use case best.

The feature delta between OpenStack Swift and the Ceph Object Store is ignored here. Ceph is viewed only as an object store serving objects via the Swift REST API (not RADOS objects); Ceph’s other interfaces, which provide file- and block-based access, are also ignored.

The assumption here is that the two technologies are best compared when deployed on the same hardware and topology and tested with the same kind of workload. Data caching is discouraged while collecting numbers (page cache, dentries and inodes are flushed every minute on each server). COSBench is used as the benchmarking tool.

Note:

I got some suggestions from the Ceph community for improving Ceph-RGW performance. I tried all of them; they have a minor impact on overall Ceph-RGW performance (<3%), but nothing that changes the overall conclusion of the study.

It would not be an apples-to-apples comparison, but with multiple RGW-CivetWeb instances and HAProxy I was able to get better results with Ceph-RGW. I will be posting them soon.

Hardware:

Two flavors of Dell PowerEdge R620 servers are used in the study. For simplicity I will call them T1 and T2.

T1:

CPU: 2x Intel-E5-2680 10C 2.8GHz 25M$ (40 Logical CPU, with HT Enabled)

RAM: 4x 16GB RDIMM, dual rank x4 (64GB)

NIC1: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (For Management)

NIC2: Mellanox ConnectX-3, 40 Gigabit Ethernet, Dual Port Full Duplex (For Data)

Storage: 160 GB HDD (For OS).

T2:

CPU: 2x Intel-E5-2680 10C 2.8GHz 25M$ (40 Logical CPU, with HT Enabled)

RAM: 8x 16GB RDIMM, dual rank x4 (128GB)

NIC1: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (For Management)

NIC2: Mellanox ConnectX-3, 40 Gigabit Ethernet, Dual Port Full Duplex (For Data)

Storage1: 160 GB HDD (For OS)

Storage2: 10x 400GB Optimus Eco™ 2.5” SAS SSDs (4TB)

Interface SAS (4 Phy 6Gb/s)

Interface Ports Dual/Wide

Network Bandwidth Check:

Host-A$ date ; sudo iperf -c XXX.XXX.XXX.B -p 5001 -P4 -m ; date

Tue Nov 4 13:16:58 IST 2014

------------------------------------------------------------

Client connecting to XXX.XXX.XXX.B , TCP port 5001

TCP window size: 325 KByte (default)

------------------------------------------------------------

[ 5] local XXX.XXX.XXX.A port 43892 connected with XXX.XXX.XXX.B port 5001

[ 3] local XXX.XXX.XXX.A port 43891 connected with XXX.XXX.XXX.B port 5001

[ 6] local XXX.XXX.XXX.A port 43893 connected with XXX.XXX.XXX.B port 5001

[ 4] local XXX.XXX.XXX.A port 43890 connected with XXX.XXX.XXX.B port 5001

[ ID] Interval Transfer Bandwidth

[ 5] 0.0-10.0 sec 10.9 GBytes 9.35 Gbits/sec

[ 5] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[ 3] 0.0-10.0 sec 9.17 GBytes 7.88 Gbits/sec

[ 3] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[ 6] 0.0-10.0 sec 16.5 GBytes 14.2 Gbits/sec

[ 6] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[ 4] 0.0-10.0 sec 8.72 GBytes 7.49 Gbits/sec

[ 4] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[SUM] 0.0-10.0 sec 45.3 GBytes 38.9 Gbits/sec

Tue Nov 4 13:17:08 IST 2014

Host-B$ date ; sudo iperf -c 10.242.43.100 -p 4001 -P4 -m ; date

Tue Nov 4 13:17:01 IST 2014

------------------------------------------------------------

Client connecting to 10.242.43.100, TCP port 4001

TCP window size: 325 KByte (default)

------------------------------------------------------------

[ 4] local XXX.XXX.XXX.B port 59130 connected with XXX.XXX.XXX.A port 4001

[ 3] local XXX.XXX.XXX.B port 59131 connected with XXX.XXX.XXX.A port 4001

[ 6] local XXX.XXX.XXX.B port 59133 connected with XXX.XXX.XXX.A port 4001

[ 5] local XXX.XXX.XXX.B port 59132 connected with XXX.XXX.XXX.A port 4001

[ ID] Interval Transfer Bandwidth

[ 4] 0.0-10.0 sec 14.6 GBytes 12.6 Gbits/sec

[ 4] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[ 3] 0.0-10.0 sec 7.90 GBytes 6.79 Gbits/sec

[ 3] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[ 6] 0.0-10.0 sec 14.7 GBytes 12.7 Gbits/sec

[ 6] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[ 5] 0.0-10.0 sec 8.40 GBytes 7.21 Gbits/sec

[ 5] MSS size 8960 bytes (MTU 9000 bytes, unknown interface)

[SUM] 0.0-10.0 sec 45.7 GBytes 39.2 Gbits/sec

Tue Nov 4 13:17:11 IST 2014

So the total available bandwidth is ~39 Gbps (~5 GB/s) for inbound traffic and ~39 Gbps (~5 GB/s) for outbound traffic.
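As a quick check of the unit conversion, here is a minimal Python sketch that turns the two iperf [SUM] figures into bytes per second. The function name and the direction labels are only illustrative, and protocol overhead is ignored.

```python
# Convert the iperf [SUM] bandwidth figures from Gbit/s to GByte/s.
def gbits_to_gbytes(gbits_per_sec):
    """8 bits per byte; protocol overhead is ignored."""
    return gbits_per_sec / 8.0


# [SUM] values reported by the two iperf runs above.
iperf_sums = {"Host-A to Host-B": 38.9, "Host-B to Host-A": 39.2}

for direction, gbps in iperf_sums.items():
    print(f"{direction}: {gbps} Gbit/s = {gbits_to_gbytes(gbps):.2f} GB/s")
```

Both directions land a little under 5 GB/s, which is the ceiling any object read throughput in this setup has to stay below.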

Topology & Setup:

The Ceph setup has two more monitor nodes, which are not shown here.

Ceph RGW Setup:

OpenStack Swift Setup:

Software Details

General Configuration:

Ubuntu 14.04 (3.13.0-24-generic)
Linux Tuning options for networking configured on all the nodes

#Configs recommended for Mellanox ConnectX-3

net.ipv4.tcp_timestamps = 0

net.ipv4.tcp_sack = 0

net.ipv4.tcp_low_latency = 1

net.core.netdev_max_backlog = 250000

net.core.rmem_max = 16777216

net.core.wmem_max = 16777216

net.core.rmem_default = 16777216

net.core.wmem_default = 16777216

net.core.optmem_max = 16777216

net.ipv4.tcp_rmem = 4096 87380 16777216

net.ipv4.tcp_wmem = 4096 65536 16777216

net.ipv4.tcp_tw_reuse = 1

net.ipv4.tcp_tw_recycle = 1

kernel.core_uses_pid = 1

MTU size of 9000 is used along with above options.
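One way to confirm that such settings are actually live on a node is to compare them with the values under /proc/sys. The following is a minimal sketch, not part of the original setup: the helper names are made up for illustration, only a few of the keys above are included, and it assumes the usual mapping of dots in a sysctl key to slashes in the /proc/sys path. On kernels from 4.12 onward net.ipv4.tcp_tw_recycle no longer exists, so there it would be reported as missing.

```python
from pathlib import Path

# A subset of the tuning options listed above, in sysctl.conf syntax.
TUNING = """
#Configs recommended for Mellanox ConnectX-3
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
"""


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalize whitespace so "4096  87380" and "4096\t87380" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a sysctl key such as net.ipv4.tcp_sack to its /proc/sys file."""
    return Path(root) / key.replace(".", "/")


def mismatches(expected, root="/proc/sys"):
    """Return {key: (wanted, found)} for every setting that differs.

    'found' is None when the file is missing or unreadable.
    """
    diffs = {}
    for key, want in expected.items():
        try:
            have = " ".join(proc_path(key, root).read_text().split())
        except OSError:
            have = None
        if have != want:
            diffs[key] = (want, have)
    return diffs


print(mismatches(parse_sysctl(TUNING)))
```

An empty dictionary means every listed value matches the running kernel.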

A cron job is configured to flush the DRAM cache every minute on each node.

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
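The value written to drop_caches selects what is freed: 1 drops the page cache, 2 drops reclaimable slab objects (dentries and inodes), and 3 drops both. A minimal Python sketch of the same operation follows; the function and constant names are illustrative, and the `path` parameter exists only so the helper can be tried against an ordinary file without root.

```python
import os

DROP_PAGECACHE = 1  # free the page cache only
DROP_SLAB = 2       # free reclaimable slab objects (dentries and inodes)
DROP_ALL = 3        # free both, as the cron job above does


def drop_caches(mode=DROP_ALL, path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop clean caches.

    Writing to the real /proc file requires root.
    """
    if mode not in (DROP_PAGECACHE, DROP_SLAB, DROP_ALL):
        raise ValueError(f"unsupported drop_caches mode: {mode}")
    os.sync()  # same role as the 'sync' in the shell one-liner
    with open(path, "w") as handle:
        handle.write(f"{mode}\n")


# Example (as root): drop_caches(DROP_PAGECACHE) keeps dentries and inodes cached.
```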

Ceph Configurations:-

Ceph Version: 0.87
RGW is used with Apache + FASTCGI as well as CivetWeb.
Apache version : 2.4.7-1ubuntu4.1, with libapache2-mod-fastcgi -2.4.7~0910052141-1.1
Ceph.conf is placed here. It contains the entire Ceph optimization configuration used in the experiment.
The default region, zone and pools are used. All .rgw* pools created with the default zone are set to use a PG_NUM of 4096.
Replica count is set to 3 (Max_Size=3, Min_Size=2).
Apache configuration parameters:

ServerLimit 4096

ThreadLimit 200

StartServers 20

MinSpareThreads 30

MaxSpareThreads 100

ThreadsPerChild 128

MaxClients 4096

MaxRequestsPerChild 10000
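To see what these limits amount to, the sketch below works out the process and thread ceilings implied by the values above. It assumes a threaded MPM (worker or event), where the maximum number of child processes is MaxClients divided by ThreadsPerChild; the function name is illustrative.

```python
# Apache values copied from the list above.
apache = {
    "ServerLimit": 4096,
    "ThreadLimit": 200,
    "StartServers": 20,
    "ThreadsPerChild": 128,
    "MaxClients": 4096,
    "MaxRequestsPerChild": 10000,
}


def summarize(cfg):
    """Derive the effective ceilings for a threaded MPM configuration."""
    if cfg["ThreadsPerChild"] > cfg["ThreadLimit"]:
        raise ValueError("ThreadsPerChild must not exceed ThreadLimit")
    processes, remainder = divmod(cfg["MaxClients"], cfg["ThreadsPerChild"])
    if remainder:
        raise ValueError("MaxClients should be a multiple of ThreadsPerChild")
    if processes > cfg["ServerLimit"]:
        raise ValueError("MaxClients needs more processes than ServerLimit allows")
    return {
        "max_processes": processes,
        "max_concurrent_requests": processes * cfg["ThreadsPerChild"],
        # Each child exits after this many connections; 0 would mean never.
        "connections_before_child_recycle": cfg["MaxRequestsPerChild"],
    }


print(summarize(apache))
```

With these values Apache tops out at 32 child processes of 128 threads each, so ServerLimit 4096 is far above what MaxClients can ever use.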

CivetWeb is used with its default configuration. However, ‘rgw_op_thread’ seems to control CivetWeb’s configuration option ‘num_op_thread’, which is set to 128. Increasing this parameter beyond that point seems to degrade response time. (I tried setting it to 256 and 512, and more and more HTTP connections ended up in the CLOSE_WAIT state.) I am hitting a CivetWeb bug related to this problem.

OpenStack Swift Configurations:

OpenStack Swift Version: Icehouse / Swift 2.0
Webserver: Default WSGI
All performance optimization is done based on OpenStack Swift deployment guide.
An inode size of 256K is used; all other XFS formatting and mounting options follow the recommendations in the Swift Deployment Guide.
WSGI pipeline is trimmed down and only contains essential middleware.

Proxy Server WSGI pipeline looks like this:

pipeline = healthcheck cache tempauth proxy-server

Each storage node is configured as a zone in a region, and on each node one disk is dedicated to the account and container databases. All the other disks are used for objects only. Ring files are populated based on this configuration.
The proxy node only runs the proxy-server and memcached.
The storage nodes run all the other Swift services, i.e. account-server, container-server and object-server, along with supporting services like auditors, updaters and replicators.

COSBench & Workload Details

COSBench Version: 0.4.0.e1
The COSBench controller and driver are both configured on the same machine, as the hardware is capable of sustaining the workload.
Small File/Object workload is as follows:

Object Size: 1MB

Containers: 100

Objects Per Container: 1000

Large File/Object workload is as follows:

Object Size: 1GB

Containers: 10

Objects Per Container: 100
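For a sense of scale, the two data sets can be sized with a few lines of Python. This is only a sketch: it treats 1 GB as 1000 MB, and it applies the replica count of 3 stated for Ceph to Swift as well, which is an assumption since the Swift replica count is not given here.

```python
REPLICAS = 3  # stated for Ceph above; assumed for Swift too

# Object size in MB, taking 1 GB = 1000 MB.
workloads = {
    "small": {"object_mb": 1, "containers": 100, "objects_per_container": 1000},
    "large": {"object_mb": 1000, "containers": 10, "objects_per_container": 100},
}


def dataset(workload):
    """Return (object count, logical size in GB, raw size in GB with replicas)."""
    count = workload["containers"] * workload["objects_per_container"]
    logical_gb = count * workload["object_mb"] // 1000
    return count, logical_gb, logical_gb * REPLICAS


for name, workload in workloads.items():
    count, logical_gb, raw_gb = dataset(workload)
    print(f"{name}: {count} objects, {logical_gb} GB logical, {raw_gb} GB raw")
```

The small workload comes to 100,000 objects and 100 GB of logical data, the large one to 1,000 objects and 1,000 GB.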

Objects are written once in both cases.
Each workload is run with several COSBench worker counts.
For the small file workload the worker counts are: 32, 64, 128, 256, 512
For the large file workload the worker counts are: 8, 16, 32, 64, 128
Each workload is executed for 900 seconds, and objects are read randomly from the available set of Swift objects.
There is no difference between the Ceph and Swift workloads except the value of the generated token. A token was generated after creating Swift users in both cases. This token is provided along with the Storage-URL in the workload configuration.
Ceph puts all the Swift objects in a single Ceph pool called ‘.rgw.buckets’.
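For readers unfamiliar with the Swift API, each read in such a workload boils down to an HTTP GET on Storage-URL/container/object carrying the token in the X-Auth-Token header, and this is the same whether the endpoint is a Swift proxy or Ceph RGW. The sketch below shows one such read with the Python standard library. It is not how COSBench itself is implemented, and the URL, token, container and object names in the usage comment are hypothetical placeholders.

```python
import urllib.parse
import urllib.request


def build_read_request(storage_url, token, container, obj):
    """Build a Swift API object GET; nothing is sent until it is opened."""
    url = "{}/{}/{}".format(
        storage_url.rstrip("/"),
        urllib.parse.quote(container, safe=""),
        urllib.parse.quote(obj),
    )
    return urllib.request.Request(url, headers={"X-Auth-Token": token}, method="GET")


def read_object(storage_url, token, container, obj):
    """Fetch one object and return its bytes."""
    request = build_read_request(storage_url, token, container, obj)
    with urllib.request.urlopen(request) as response:
        return response.read()


# Hypothetical usage; substitute the Storage-URL and token of your own cluster:
# data = read_object("http://proxy.example:8080/v1/AUTH_test",
#                    "AUTH_tk_example", "container1", "object1")
```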

Results:

Small File Workload:

Large File Workload:

Additional Details:

90% RT: Response time within which 90% of requests completed.

Max RT: Maximum response time across all successful requests.

Small Files

Worker Count	RGW-Apache	RGW-CivetWeb	Swift-Native
32	90% RT < 20 ms, Max RT = 1,450 ms	90% RT < 20 ms, Max RT = 1,440 ms	90% RT < 30 ms, Max RT = 10,230 ms
64	90% RT < 50 ms, Max RT = 2,000 ms	90% RT < 60 ms, Max RT = 1,460 ms	90% RT < 30 ms, Max RT = 16,336 ms
128	90% RT < 110 ms, Max RT = 3,090 ms	90% RT < 120 ms, Max RT = 1,480 ms	90% RT < 70 ms, Max RT = 16,380 ms
256	90% RT < 210 ms, Max RT = 1,760 ms	90% RT < 120 ms, Max RT = 3,280 ms	90% RT < 90 ms, Max RT = 17,020 ms
512	90% RT < 330 ms, Max RT = 33,120 ms	90% RT < 200 ms, Max RT = 20,040 ms	90% RT < 160 ms, Max RT = 16,760 ms

Large Files:

Worker Count	RGW-Apache	RGW-CivetWeb	Swift-Native
8	90% RT < 3,110 ms, Max RT = 4,540 ms	90% RT < 3,060 ms, Max RT = 5,380 ms	90% RT < 6,740 ms, Max RT = 11,210 ms
16	90% RT < 5,550 ms, Max RT = 7,980 ms	90% RT < 5,780 ms, Max RT = 18,150 ms	90% RT < 8,150 ms, Max RT = 13,710 ms
32	90% RT < 10,860 ms, Max RT = 11,900 ms	90% RT < 10,970 ms, Max RT = 12,120 ms	90% RT < 9,800 ms, Max RT = 17,810 ms
64	90% RT < 21,370 ms, Max RT = 24,200 ms	90% RT < 21,190 ms, Max RT = 22,080 ms	90% RT < 19,530 ms, Max RT = 38,760 ms
128	90% RT < 42,410 ms, Max RT = 43,340 ms	90% RT < 41,590 ms, Max RT = 44,210 ms	90% RT < 46,800 ms, Max RT = 74,810 ms
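These response times also put a rough floor under the aggregate bandwidth. The sketch below is a back-of-the-envelope estimate, not a measured result. It assumes the workers run in a closed loop with no think time and no failed requests, so that throughput equals workers divided by mean response time (Little's law), and it bounds the mean by 0.9 times the 90% RT plus 0.1 times the Max RT. The function name is illustrative.

```python
def min_throughput_gb_s(workers, object_gb, p90_s, max_s):
    """Lower bound on aggregate read bandwidth for a closed-loop workload.

    At least 90% of requests finish within p90_s and all within max_s,
    so the mean response time is at most 0.9 * p90_s + 0.1 * max_s.
    """
    mean_rt_upper = 0.9 * p90_s + 0.1 * max_s
    return workers * object_gb / mean_rt_upper


# (90% RT, Max RT) in seconds at 128 workers, from the large file table above.
large_128 = {
    "RGW-Apache": (42.41, 43.34),
    "RGW-CivetWeb": (41.59, 44.21),
    "Swift-Native": (46.80, 74.81),
}

for name, (p90, worst) in large_128.items():
    bound = min_throughput_gb_s(128, 1.0, p90, worst)
    print(f"{name}: at least {bound:.2f} GB/s")
```

Because the bound leans on the Max RT, a single slow outlier pulls it down, so it says more about the floor than about the typical throughput.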

Conclusion:

Native Swift's behaviour and results curve seem sane: a clear relation between concurrency and throughput is established.
Ceph-RGW seems to have a problem with the RGW threading model; a flat throughput curve under increasing concurrency is certainly not a good sign.
Native Swift in general performs better in high-concurrency environments.
Ceph RGW gives better bandwidth at lower concurrency.
Ceph RGW response time is excellent for large objects.
For small objects at lower concurrency, Ceph-RGW seems very promising; however, there is much to do, as concurrency plays a great role in a web server environment.
Ceph RGW's major bottleneck is the web server. CivetWeb and Apache + FastCGI give comparable numbers, although CivetWeb is better than Apache + FastCGI in terms of response time at high concurrency. CivetWeb has an inherent design limitation, which is already reported here.
Digging further, I also benchmarked Ceph using RADOS bench, which works directly on Ceph (RADOS) objects rather than through the Swift object interface. I ran the bench on the same node that is used as the COSBench controller + driver, so RGW is out of the picture in this case. In summary, my observations are as below:

Object Size & Threads	Avg Bandwidth (MB/Sec)	Avg Latency (Sec)	Runtime (Sec)
1M, t=128	3428.361	0.0373327	300
1M, t=256	3485.405	0.0734383	300
4M, t=128	4015.811	0.127454	300
4M, t=256	4080.127	0.250806	300
10M, t=128	3628.700	0.352318	300
10M, t=256	3609.026	0.701526	300

Bandwidth can be converted directly to operations per second by dividing by the object size. In summary, even RADOS bench does not deliver bandwidth beyond ~4 GB/s. The strange thing is that it appears to be optimized for a 4 MB object size; increasing the object size beyond this does not give higher bandwidth.
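The RADOS bench rows can be cross-checked with a short calculation. The sketch below converts each bandwidth figure to operations per second and then applies Little's law (operations per second times average latency gives the number of requests in flight), which should come out close to the thread count if the table is self-consistent. The tuple layout and variable names are illustrative.

```python
# (object size in MB, threads, avg bandwidth in MB/s, avg latency in s)
rados_bench = [
    (1, 128, 3428.361, 0.0373327),
    (1, 256, 3485.405, 0.0734383),
    (4, 128, 4015.811, 0.127454),
    (4, 256, 4080.127, 0.250806),
    (10, 128, 3628.700, 0.352318),
    (10, 256, 3609.026, 0.701526),
]

for size_mb, threads, mb_per_s, latency_s in rados_bench:
    ops_per_s = mb_per_s / size_mb       # bandwidth divided by object size
    in_flight = ops_per_s * latency_s    # Little's law: L = lambda * W
    print(f"{size_mb}M, t={threads}: {ops_per_s:.0f} ops/s, "
          f"{in_flight:.1f} requests in flight")
```

The in-flight figure stays within about 2% of the thread count in every row. Doubling the threads roughly doubles latency while bandwidth barely moves, which is what a saturated system looks like.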

Other Remarks:

Swift is more feature rich in terms of REST API.
S3 API is supported by both.
Finding good documentation is a big pain when setting up Ceph.

18 comments:

Anonymous, November 5, 2014 at 9:35 PM
Good doc Pushpesh :)

pushpesh sharma, November 5, 2014 at 10:54 PM
I am posting some of the mail conversation, that might be useful for everyone.

Anonymous, November 5, 2014 at 10:54 PM
Ben England
Nov 5 (1 day ago)

to me, Mark, Thiago, Peter, Kyle, John, Neil
Thanks Pushpesh, this should give us a lot of ideas about how to tune Ceph and RGW in particular. I'm having some trouble understanding how to configure it so it will help there too. cc'ing Kyle Bader (RGW developer) in case this interests him.

-ben

Anonymous, November 5, 2014 at 10:55 PM

Mark Nelson
Nov 5 (1 day ago)

to Sage, yehuda, Ben, me, Thiago, Peter, Kyle, John, Neil
Hi Pushpesh,

A couple of comments:

I noticed that you are doing:

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

every minute which will flush dentry/inode cache. With how ceph lays objects out on the disk (and I suspect swift too) this will hurt performance for small objects. Given that you are using SSDs I'm not sure how much effect it will have, but it's something to keep in mind. Are you re-reading objects? If so, maybe an "echo 1" here might be better to avoid cached reads while keeping dentries/inodes in cache?

Also, with the number of total objects in the 1MB object tests, you might want to tweak these parameters:

filestore_merge_threshold
filestore_split_multiple

By default those settings restrict each PG to 300 objects before the nested directory structure where objects are stored for each PG is split into a deeper tree. With 1 million objects, 3x replication, and 4096 PGs, you likely hit those limits and caused the directory structure to split at least once. With traditional spinning disks this can have a big performance impact, though again I'm not sure how much of one you would see with SSDs.

Having said that, it's very interesting that your civetweb results for RGW are so much faster than the apache results. I'm not an apache tuning expert, so perhaps there is more tuning we can do on that side.

If you'd like, we have a weekly public ceph performance meeting on Wednesday mornings at 8AM PST. We'd be happy to have you join and talk about what you've been seeing! In particular, Somnath Roy from Sandisk has been doing a ton of work optimizing the read path in Ceph and RGW and might have some additional comments on your tests.

Here's the meeting etherpad:

http://pad.ceph.com/p/performance_weekly

And the meeting URL:

https://bluejeans.com/268261044/browser

Anonymous, November 5, 2014 at 10:55 PM

Yehuda Sadeh
Nov 5 (1 day ago)

to Mark, Ben, me, Thiago, Peter, Kyle, John, Neil, Sage
I agree with Mark that the apache config might need some tweaking.
Also, there are a few more configurables that will need to be updated:

objecter_inflight_op_bytes = 0
objecter_inflight_ops = 0

(I think that setting these to zero will disable them, if not then
will need to set them to higher values than the defaults)

Specifically you can try looking at the radosgw admin socket, read the
perf counters during a run, and see if there's any throttling going
on.

Another thing is that there were a bunch of performance related
changes going into the objecter in Giant that should help concurrency.
I did see that you mentioned that you saw performance regression
there, which might point at some other issue with the setup.

Anonymous, November 5, 2014 at 10:56 PM

Mark Nelson
Nov 5 (1 day ago)

to Yehuda, Ben, me, Thiago, Peter, Kyle, John, Neil, Sage
I wanted to say too that we've traditionally not had any 40GbE or QDR IB equipment in our lab to test RGW at these speeds. Seeing it get 3GB/s+ (at least in the better cases) is actually not terrible for a first result! We just need to figure out where we are getting held up as Yehuda mentioned.

pushpesh sharma, November 5, 2014 at 10:56 PM
Hi All,

Thanks everyone for your time and valuable feedback.

Mark,

There is a chance of objects being re-read; the randomization of reads is done by the COSBench workload, so I just tried my best to avoid this. Yes, "echo 1 >" might prove good for both.

Apache tuning was more or less based on http://httpd.apache.org/docs/2.2/mpm.html. I played around with these parameters, and I believe these configs opened the Apache gates wide. I suspect FastCGI is the bigger problem here; I was also observing very high CPU usage (~95%) on the RGW nodes at peak load. However, I don't rule out the possibility of better numbers with better tuning.

I will try out the additional configurations recommended by you and Yehuda, and will let you know about my findings.

I will try to attend the performance meeting.

Anonymous, November 5, 2014 at 10:57 PM

Yehuda Sadeh
10:54 PM (13 hours ago)

to me, Mark, Ben, Thiago, Peter, Kyle, John, Neil, Sage
Looking more closely into the blog post I have a couple more things:

1. You set the following apache config:

MaxRequestsPerChild 10000

Which basically means that each apache process dies after getting 10k
connections. It really depends on how the benchmark is running, but it
might have severe performance impact. You should really be setting it
to zero.

2. The small IO test uses relatively large IO size, beyond what the
gateway considers as small. It may benefit with adjusting its internal
IO params. You can add:

rgw max chunk size = 1048576

This will cut the number of rados IO operations in the small IO
benchmark by half.

Thanks,
Yehuda

pushpesh sharma, November 5, 2014 at 10:58 PM
Hi All,

I would like to include your responses in the blog post so that it benefits the open source community in general. I will share the blog post with the Ceph community mailing list and get more feedback from there. FYI, I am also giving a talk about this at the OpenStack India Mini-Conf (7-Nov-2014). I will present the Ceph-RGW results with a note that there are active discussions going on.

I hope that makes sense to all.

PS: I am working on the tunables Yehuda suggested.

Anonymous, December 15, 2014 at 11:18 PM
Hi Pushpesh,
It's Anurag. I think you gave a presentation on Ceph vs Swift performance at OSI Days 2014. I am interested in Ceph and am trying to deploy it on simple commodity hardware just for a POC, but I am facing some issues: I am trying to do so inside a network with an NTLM proxy using ceph-deploy. I have done it on VMs with 2 OSDs and 1 monitor and it works, but with physical machines I am facing problems.
Maybe you can give me an email ID on which we can communicate.


Rajesh Sudarsan, November 10, 2015 at 1:37 PM
Hey Pushpesh,

The images for the configuration and the results do not load on this page. Can you please look into it? Otherwise, do you have a pdf file of this blog that you can email it to me?

Anonymous, June 16, 2016 at 5:29 AM
Hi,

When a rgw service is started, by default below pools are created.
.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log

When a Swift user is created, some default pools are created. But I would like to use "Pool_A" for the Swift user.
When I run COSBench from the client, the data should be placed in "Pool_A" instead of the default pools. How can I achieve that?

I also need help on how to run COSBench with a Swift user. Your help is very much appreciated.

Thanks,
kanchana.