The ORViT (Object-Region Video Transformers) method extends video transformer layers with a block that leverages object representations.
The breakthrough idea: fuse object-centric representations in the early layers and propagate them through the transformer layers, so that they affect the spatio-temporal representations throughout the network.
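For anyone who wants to play with the idea before opening the code, here is a minimal PyTorch sketch of what an object-region style block could look like. It is my own simplification, not the authors' implementation: the `ObjectRegionBlock` name, the linear box embedding, and all hyperparameters are assumptions standing in for the paper's RoI-pooled object descriptors and trajectory encodings.

```python
import torch
import torch.nn as nn


class ObjectRegionBlock(nn.Module):
    """Illustrative ORViT-style block: patch tokens attend jointly to
    themselves and to object tokens derived from box coordinates.
    Pooling details and sizes are simplified assumptions."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Project per-object box coordinates (x1, y1, x2, y2) into token space,
        # a stand-in for the paper's RoI-based object descriptors.
        self.box_embed = nn.Linear(4, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patch_tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) spatio-temporal patch tokens
        # boxes:        (B, num_objects, 4) normalized object boxes for the clip
        obj_tokens = self.box_embed(boxes)                      # (B, O, dim)
        tokens = torch.cat([patch_tokens, obj_tokens], dim=1)   # fuse objects with patches
        x = self.norm(tokens)
        attended, _ = self.attn(x, x, x)                        # joint patch/object attention
        tokens = tokens + attended
        tokens = tokens + self.mlp(self.norm(tokens))
        # Only the patch tokens continue through the rest of the network,
        # now carrying object-aware context.
        return tokens[:, : patch_tokens.shape[1], :]


# Example: one batch of patch tokens and four tracked object boxes per clip.
blk = ObjectRegionBlock()
patches = torch.randn(2, 196, 768)
boxes = torch.rand(2, 4, 4)
out = blk(patches, boxes)   # (2, 196, 768)
```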
Let me know if you’d be interested in doing a quick project with me. We can spend one day max on it: test the library, fine-tune a model, and see what happens.
Outstanding paper, thank you @harpreet.sahota for bringing this to my Attention. ORViT seeks to bring together the three components of any action – the subject (whoever is performing the action), the object (receiving the action), and the action itself – into one transformer module that can easily be embedded into any vision transformer. Fascinating.
By incorporating object positional information alongside traditional patching for action recognition, the results are clearly visible in the included attention maps: the ORViT model attends to each of the objects interacting in the action, over time, throughout its duration.
I am always surprised when models that reason about actions involving spatial relationships insist on using max pooling, which discards positional information, rather than capsule networks – which retain it. In this case, however, I suppose it can be justified: unlike convolutional networks, transformers have no inherent way to represent spatial order within an image and (as of today) rely on positional encodings added to the patch tokens – the “patching” referred to in the paper and standard for vision transformers. With that in mind, it makes sense that capsule networks are not used, since any intermediate positional information they provided would be discarded once the image is split into patches.
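To illustrate the patching point, here is a tiny sketch of how a vision transformer typically recovers spatial order: split the image into non-overlapping patches with a strided convolution and add a learned positional embedding per patch location. This is a generic ViT-style tokenizer, not code from the ORViT paper; the class name and sizes are assumptions.

```python
import torch
import torch.nn as nn


class PatchEmbedWithPosition(nn.Module):
    """Minimal ViT-style tokenizer: split an image into patches and add a
    learned positional encoding so the transformer can recover spatial order."""

    def __init__(self, img_size: int = 224, patch_size: int = 16,
                 in_chans: int = 3, dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the standard trick for non-overlapping patching.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # Learned positional embeddings, one per patch location.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)              # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        return x + self.pos_embed          # spatial order restored via position


tokens = PatchEmbedWithPosition()(torch.randn(1, 3, 224, 224))  # (1, 196, 768)
```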
More surprising is how, time and time again, researchers consider actions occurring in physical space from a single point of view, using one camera to capture the scene and – from that alone – infer the associated action. Using multiple cameras, connected via a third layer of positional encoding that identifies each source and ties them together into an inference mesh, would allow for a truer understanding of actions occurring in physical space, letting the model confirm an action from multiple perspectives (a rough sketch of the idea follows below). For example, the paper shows an object being moved behind a second object (a trash can). With even one additional perspective, from the other side of the trash can, we would be able to understand not just one description but two simultaneously: “an object being moved behind a trash can” and “an object being moved in front of a trash can”. For IoT applications, such a vision transformer would introduce the possibility of increasing our understanding of actions occurring in physical space through horizontal scaling rather than ever more vertical ($$$) scaling.
This would open the possibility of using lower-powered devices distributed throughout our environment for any number of applications – including the authorization and authentication of individuals performing actions with objects – all without ever needing to record anyone.
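To make the multi-camera idea a bit more concrete, here is a hypothetical sketch of that “third layer” of positional encoding: each camera’s tokens get a learned view embedding before being concatenated into one sequence that a shared transformer could attend over. Nothing here comes from the paper; `MultiViewTokenFusion` and its parameters are my own assumptions.

```python
import torch
import torch.nn as nn


class MultiViewTokenFusion(nn.Module):
    """Hypothetical multi-camera fusion: tag each camera's tokens with a
    learned view embedding, then concatenate them into one joint sequence."""

    def __init__(self, dim: int = 768, num_views: int = 2):
        super().__init__()
        self.view_embed = nn.Embedding(num_views, dim)

    def forward(self, per_view_tokens: list) -> torch.Tensor:
        fused = []
        for view_idx, tokens in enumerate(per_view_tokens):
            # tokens: (B, N, dim) from one camera; add its view identity.
            view_id = torch.full(tokens.shape[:1], view_idx, dtype=torch.long,
                                 device=tokens.device)
            fused.append(tokens + self.view_embed(view_id).unsqueeze(1))
        # One joint sequence: "behind the trash can" seen by camera 0 and
        # "in front of the trash can" seen by camera 1 become mutually visible.
        return torch.cat(fused, dim=1)


fusion = MultiViewTokenFusion()
front = torch.randn(1, 196, 768)   # tokens from the front-facing camera
back = torch.randn(1, 196, 768)    # tokens from the camera behind the trash can
joint = fusion([front, back])      # (1, 392, 768)
```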
@harpreet.sahota I have been out of commission the past week or so with a stomach issue, so I wasn’t able to check this out yet. But I am definitely interested in trying it out.
We will be chatting tomorrow, so we can definitely throw some ideas around and see which direction to take this
In the meantime I’ll read up on the white paper and try to wrap my head around this