Class imbalance in semantic segmentation

I’m curious whether anyone can share tips for tackling class imbalance in semantic segmentation problems?

Let’s say I want to do binary segmentation for pedestrians only on the Cityscapes dataset… well, a lot of those images don’t contain pedestrians at all. Or the pedestrians are in the background, small relative to their surroundings, etc.

Suppose I’m using an encoder-decoder / U-Net-style architecture…

I’ve tried weighted binary cross-entropy, but didn’t see much improvement. I’ve also noticed that Focal Tversky loss improves training somewhat.
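For reference, a minimal sketch of a Focal Tversky loss for binary segmentation, assuming raw logits of shape `(N, 1, H, W)` (the function name and the alpha/beta/gamma defaults are illustrative, not tuned for Cityscapes):

```python
import torch

def focal_tversky_loss(logits, targets, alpha=0.7, beta=0.3,
                       gamma=0.75, eps=1e-6):
    """logits, targets: (N, 1, H, W); targets in {0, 1}.

    alpha/beta trade off false negatives vs. false positives;
    gamma < 1 sharpens the penalty on hard examples.
    """
    probs = torch.sigmoid(logits)
    p = probs.flatten(1)              # (N, H*W)
    t = targets.flatten(1).float()
    tp = (p * t).sum(dim=1)           # soft true positives per image
    fp = (p * (1 - t)).sum(dim=1)
    fn = ((1 - p) * t).sum(dim=1)
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return ((1 - tversky) ** gamma).mean()
```

Setting alpha above beta penalizes false negatives more, which is usually what you want when the positive class (pedestrians) is rare.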

Is this a matter of selecting training images appropriately? Would online hard example mining be worth trying? Or should I drop examples without pedestrians for a few epochs?

Are there any tips or tricks you can share?

Class imbalance in segmentation is a very common issue. Cityscapes, as a scene-parsing dataset, is dominated by “stuff” pixels (road, sky, …) rather than “things” (traffic signs, pedestrians, …), so the imbalance is even larger.
Removing images without pedestrians will definitely help here; it should relax the model’s tendency to predict pixels as background.
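A quick sketch of that filtering step, assuming a map-style dataset whose `__getitem__` returns `(image, mask)` with `mask > 0` marking pedestrian pixels (adapt to however your loader is set up):

```python
import torch
from torch.utils.data import Subset

def keep_positive_indices(dataset):
    """Return indices of samples whose mask contains at least one
    positive (pedestrian) pixel."""
    keep = []
    for i in range(len(dataset)):
        _, mask = dataset[i]
        if (mask > 0).any():
            keep.append(i)
    return keep

# train_set = Subset(full_set, keep_positive_indices(full_set))
```

If you don’t want to throw the negatives away entirely, the same index list can feed a `WeightedRandomSampler` to merely down-weight them instead.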
As for the training loss, focal loss is a good choice, but the default gamma=2 might be too subtle; try increasing it further to down-weight easy pixels more aggressively. Hard mining works very well for me; it’s more radical than focal loss in that it ignores easy pixels completely.
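Pixel-level hard mining can be sketched like this: compute the per-pixel loss, keep only the top-k hardest pixels in the batch, and ignore the rest (the `keep_fraction` value is illustrative, not tuned):

```python
import torch
import torch.nn.functional as F

def ohem_bce_loss(logits, targets, keep_fraction=0.25):
    """logits, targets: (N, 1, H, W).

    Online hard example mining: average BCE over only the
    highest-loss fraction of pixels; easy pixels contribute nothing.
    """
    per_pixel = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none").flatten()
    k = max(1, int(keep_fraction * per_pixel.numel()))
    hardest, _ = torch.topk(per_pixel, k)
    return hardest.mean()
```

Since only the selected pixels carry gradient, this is the “completely ignore easy pixels” behavior, versus focal loss, which merely down-weights them.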
In general, there is an idea in optimization that learning a harder task can improve performance on an easier one, e.g., learning keypoints alongside object detection. In our case, is flattening the task to a naive pedestrian / non-pedestrian split really the best way to go? The model can benefit from learning semantic information from other classes: learning to segment the sidewalk might help localize pedestrians (law-abiding ones, at least). And is a rider, which is its own class in Cityscapes, different enough from a pedestrian that it should count as background?
Just something to think about