Ceph too many pgs per osd: all you need to know

I am getting both of these errors at the same time. I can not reduce the pg count, and I can not add more storage.

This is a new cluster, and I received these warnings when I uploaded about 40 GB. I think it is because radosgw created a bunch of pools.

How can ceph have too many pgs per osd, yet have more objects per pg than average with a "too few pgs" suggestion?

HEALTH_WARN too many PGs per OSD (352 > max 300);
pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)

osds: 4 (2 per site, 500 GB per osd)
size: 2 (cross-site replication)
pg: 64
pgp: 64
pools: 11

Using rbd and radosgw, nothing out of the ordinary.

+11
ceph storage
2 answers

I am going to answer my own question in the hope that it sheds some light on this problem, and on similar misconceptions about Ceph's internals.

Fixing HEALTH_WARN too many PGs per OSD (352 > max 300) once and for all

When balancing placement groups, you must take into account:

Data we need

  • pgs per osd
  • pgs per pool
  • pools per osd
  • the crush map
  • reasonable default pg and pgp num
  • replica count

I will use my setup as an example, and you should be able to use it as a template for your own.

We have

  • num osds: 4
  • num sites: 2
  • pgs per osd: ???
  • pgs per pool: ???
  • pools per osd: 10
  • reasonable default pg and pgp num: 64 (... or is it?)
  • replica count: 2 (cross-site replication)
  • the crush map

ID WEIGHT TYPE NAME                    UP/DOWN REWEIGHT PRIMARY-AFFINITY
root ourcompnay
    site a
      rack a-esx.0
        host prdceph-strg01
            osd.0                      up      1.00000  1.00000
            osd.1                      up      1.00000  1.00000
    site b
      rack a-esx.0
        host prdceph-strg02
            osd.2                      up      1.00000  1.00000
            osd.3                      up      1.00000  1.00000
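For reference, this layout is essentially what ceph osd tree prints (the exact columns differ a little between releases), and ceph osd pool ls detail shows the pools alongside their pg_num and size settings:

    ceph osd tree            # crush hierarchy: sites, racks, hosts, osds
    ceph osd pool ls detail  # per-pool replica size, pg_num and pgp_num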

Our goal is to fill in the '???' above with what we need to serve up a HEALTH OK cluster. Our pools are created by the rados gateway when it initializes. We have one default.rgw.buckets.data where all the data is stored; the remaining pools are administrative and internal to radosgw's metadata and accounting.

PGs per osd (what is a reasonable default anyway?)

From the documentation, this is the calculation used to determine the pg count per osd:

     (osd * 100)
    --------------- = pgs, rounded UP to the nearest power of 2
    replica count

The documentation states that rounding up is optimal. So with our current setup this works out to:

    (4 * 100)
    ---------- = 200, rounded up to the nearest power of 2: 256
        2
  • osd.0 ~= 256
  • osd.1 ~= 256
  • osd.2 ~= 256
  • osd.3 ~= 256
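For what it's worth, the same calculation as a small bash sketch (assuming 4 osds and 2 replicas, and rounding up to the next power of two):

    # (osds * 100) / replicas, rounded UP to the nearest power of 2
    osds=4 replicas=2
    raw=$(( osds * 100 / replicas ))                                 # 200
    pow=1; while [ "$pow" -lt "$raw" ]; do pow=$(( pow * 2 )); done
    echo "$pow pgs per osd (recommended max)"                        # 256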

This is the recommended maximum of pgs per osd. So... what do you actually have right now? Why isn't it working? And if you did set a "reasonable default" and understood the above, WHY ISN'T IT WORKING!!! >=[

There are likely a few reasons. We have to understand what those "reasonable defaults" actually mean, how they are applied, and where. One could misread the above and think it is fine to create a new pool like so:

ceph osd pool create <pool> 256 256

or I might even think I could play it safe and follow the documentation, which says (128 pgs for < 5 osds) can be used:

ceph osd pool create <pool> 128 128

This is wrong, flat out. It in no way explains the relationship or balance between these numbers and what ceph actually does with them. The technically correct answer is:

ceph osd pool create <pool> 32 32

And let me explain why:

If, like me, you provisioned your cluster with those "reasonable defaults" (128 pgs for < 5 osds), then as soon as you tried to do anything with radosgw it created a whole bunch of pools and your cluster spazzed out. The reason is that I misunderstood the relationship between everything mentioned above.

  • pools: 10 (created by radosgw)
  • pgs per pool: 128 (recommended in the docs)
  • osds: 4 (2 per site)

10 * 128 / 4 = 320 pgs per osd

This ~320 could be the number of pgs per osd on my cluster, but ceph may distribute them differently, which is exactly what is happening, and it is way over the 256 max per osd stated above. My cluster's HEALTH WARN is HEALTH_WARN too many PGs per OSD (368 > max 300).
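If you want to confirm this on your own cluster, two standard commands show the warning and the per-osd figures (output columns vary a bit by release):

    ceph health detail   # full text of the 'too many PGs per OSD' warning
    ceph osd df          # the PGS column shows how many pgs each osd currently holds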

Laying the counts out per pool and per osd shows the relationship between the numbers much more clearly (a sketch of one way to generate a similar breakdown follows the table):

    pool :  17  18  19  20  21  22  14  23  15  24  16  | SUM (total pgs per osd)
    ------------------------------------------------------------
    osd.0   35  36  35  29  31  27  30  36  32  27  28  | 361
    osd.1   29  28  29  35  33  37  34  28  32  37  36  | 375
    osd.2   27  33  31  27  33  35  35  34  36  32  36  | 376
    osd.3   37  31  33  37  31  29  29  30  28  32  28  | 360
    ------------------------------------------------------------
    SUM :  128 128 128 128 128 128 128 128 128 128 128    (total pgs per pool)
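A rough sketch of one way to produce a breakdown along these lines; it assumes ceph pg dump pgs_brief prints the pgid in column 1 and the UP set in column 3 (column layout can vary between Ceph releases), so treat it as illustrative only:

    # For each pool, count how many pg replicas land on each osd (every osd in
    # the UP set is counted, so a pool's total is pg_num * replica size; the
    # per-osd totals are what the 'too many PGs per OSD' warning looks at).
    ceph pg dump pgs_brief 2>/dev/null \
      | awk '$1 ~ /^[0-9]+\./ {
          split($1, pg, ".");               # "17.3f" -> pool id 17
          gsub(/[\[\]]/, "", $3);           # UP set "[0,3]" -> "0,3"
          n = split($3, osds, ",");
          for (i = 1; i <= n; i++) count["pool " pg[1] " -> osd." osds[i]]++
        }
        END { for (k in count) print k, count[k] }' \
      | sort -V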

There is a direct correlation between the number of pools you have and the number of placement groups assigned to them. I have 11 pools in the snippet above, each with 128 pgs, and that's too many! My reasonable defaults were 64! So what happened?

I misunderstood how the "reasonable defaults" are used. When I set my default to 64, you can see that ceph has taken my crush map into account, where I have a failure domain between site a and site b. Ceph has to ensure that everything on site a is at least accessible on site b.

WRONG

    site a
      osd.0
      osd.1  TOTAL of ~64 pgs

    site b
      osd.2
      osd.3  TOTAL of ~64 pgs

We needed a grand total of 64 pgs per pool, so our reasonable defaults should have been set to 32 from the start!

If we use ceph osd pool create <pool> 32 32, the relationship between our pgs per pool, our pgs per osd, those "reasonable defaults", and the recommended max pgs per osd all starts to make sense:
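A quick sanity check of that relationship, assuming a roughly even distribution (11 pools x 32 pgs x 2 replicas spread over 4 osds):

    echo $(( 11 * 32 * 2 / 4 ))   # = 176 pg replicas per osd, comfortably under the 256 max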


So you broke the cluster ^_^

Don't worry, we're going to fix it. The procedure, I'm afraid, may vary in risk and time depending on how big your cluster is. But the only ways around it are to either add more storage, so the placement groups can redistribute over a larger surface area, OR move everything over to newly created pools.

I will show an example of moving the default.rgw.buckets.data pool:

    old_pool=default.rgw.buckets.data
    new_pool=new.default.rgw.buckets.data

create a new pool with the correct pg count:

ceph osd pool create $new_pool 32

copy the contents of the old pool to the new pool:

rados cppool $old_pool $new_pool

delete the old pool:

ceph osd pool delete $old_pool $old_pool --yes-i-really-really-mean-it

rename the new pool to 'default.rgw.buckets.data'

ceph osd pool rename $new_pool $old_pool

Now it may be a safe bet to restart your radosgw instances.
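How you restart them depends on how the gateways are deployed; on a typical systemd-based install it is something along the lines of:

    # unit names vary (ceph-radosgw@<instance>.service); the target restarts
    # every gateway instance on the host
    sudo systemctl restart ceph-radosgw.target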

FINALLY CORRECT

    site a
      osd.0
      osd.1  TOTAL of ~32 pgs

    site b
      osd.2
      osd.3  TOTAL of ~32 pgs

As you can see, the pool numbers have incremented, since these are new pools added with new pool ids. And our total pgs per osd is well under ~256, which gives us room to add custom pools if required.

    pool :  26  35  27  36  28  29  30  31  32  33  34 | SUM
    ---------------------------------------------------------
    osd.0   15  18  16  17  17  15  15  15  16  13  16 | 173
    osd.1   17  14  16  15  15  17  17  17  16  19  16 | 179
    osd.2   17  14  16  18  12  17  18  14  16  14  13 | 169
    osd.3   15  18  16  14  20  15  14  18  16  18  19 | 183
    ---------------------------------------------------------
    SUM :   64  64  64  64  64  64  64  64  64  64  64 |

At this point you should test your ceph cluster with everything at your disposal. Personally, I wrote a bunch of python over boto that quickly tests the infrastructure and returns bucket stats and metadata. This confirmed the cluster was back in working order without any of the issues it suffered from before. Good luck!
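If you don't have test scripts of your own handy, radosgw-admin offers a quick read-only check that the buckets and their objects survived the move (<bucket> is a placeholder):

    radosgw-admin bucket list                     # every bucket known to the gateway
    radosgw-admin bucket stats --bucket=<bucket>  # object and size counters per bucket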


Fixing pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?) once and for all

This quite literally means you need to increase the pg and pgp num of your pool. So do it, with everything above in mind. When you do, note that the cluster will start backfilling, and you can watch this process (%) in another terminal window or screen with: watch ceph -s.

    ceph osd pool set default.rgw.buckets.data pg_num 128
    ceph osd pool set default.rgw.buckets.data pgp_num 128

Armed with the knowledge and confidence in the system gained in the section above, we can clearly see the relationship and the impact such a change has on the cluster.

    pool :  35  26  27  36  28  29  30  31  32  33  34 | SUM
    ---------------------------------------------------------
    osd.0   18  64  16  17  17  15  15  15  16  13  16 | 222
    osd.1   14  64  16  15  15  17  17  17  16  19  16 | 226
    osd.2   14  66  16  18  12  17  18  14  16  14  13 | 218
    osd.3   18  62  16  14  20  15  14  18  16  18  19 | 230
    ---------------------------------------------------------
    SUM :   64 256  64  64  64  64  64  64  64  64  64 |

Can you guess which pool id is default.rgw.buckets.data? haha ^_^

+25

In Ceph Nautilus (v14 or later), you can enable the PG autoscaler ("PG autotune"). See the documentation and this blog entry for more information.
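Roughly, enabling it looks like the following (commands as documented upstream; check the docs for your release, and <pool> is a placeholder):

    ceph mgr module enable pg_autoscaler
    ceph osd pool set <pool> pg_autoscale_mode on
    # or make it the default for all newly created pools:
    ceph config set global osd_pool_default_pg_autoscale_mode on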

I had accidentally created pools with live data in them that I could not migrate in order to repair the PG counts. It took several days to recover, but the PGs were tuned optimally with zero problems.

0
