Your robot could react when it hears a simple hand clap or someone's voice, but the question then becomes: what do you want your robot to do with this information?

Should the robot simply jump back after hearing the clap or the voice, or should it have better object recognition and speech skills so that it can say something or respond properly to a question?
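To make the simplest case concrete, here is a minimal sketch of clap detection in Python. It assumes the audio has already been captured as a list of float samples between -1 and 1, and it treats a sudden jump in loudness as a clap. The frame size, the thresholds and the action names "jump_back" and "idle" are illustrative assumptions, not values from any particular robot kit.

```python
import math


def frame_rms(samples):
    """Root-mean-square loudness of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_claps(samples, frame_size=256, threshold=0.3, ratio=4.0):
    """Return the indices of frames that look like the start of a clap.

    A frame counts as a clap when it is louder than `threshold` and at
    least `ratio` times louder than the frame just before it.
    """
    claps = []
    previous = None
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        level = frame_rms(samples[start:start + frame_size])
        if (previous is not None
                and level > threshold
                and level > ratio * max(previous, 1e-6)):
            claps.append(start // frame_size)
        previous = level
    return claps


def react(claps):
    """Pick a (hypothetical) robot action based on whether a clap was heard."""
    return "jump_back" if claps else "idle"


# Synthetic test signal: quiet room, one short loud burst, quiet again.
quiet = [0.01] * 1024
burst = [0.8, -0.8] * 128
signal = quiet + burst + [0.01] * 768
print(react(detect_claps(signal)))
```

A detector like this only knows that something loud happened. It cannot tell a clap from a slammed door, which is why the question of what the robot should do next matters so much.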

Toy robots from the 2000s like Robosapien were very advanced for their time: they could jump back when you put your hand in front of them, and they could identify certain basic things like a ball.

Now, with about $400 of hardware, we can teach a robot to recognize objects and respond to them.

This article shows how to use a Raspberry Pi computer, a camera and the Google Coral Edge TPU USB Accelerator to recognize objects that the robot has already been shown once: https://magpi.raspberrypi.com/articles/teachable-machine-coral-usb-accelerator
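The idea of recognizing something on the second viewing can be sketched without any special hardware. The Python example below is not the code from the linked article. It is a simplified illustration that assumes each camera image has already been reduced to a short list of numbers (a feature vector), which on real hardware would come from a neural network running on the accelerator. The class name, the toy vectors and the distance cutoff are all made up for the example.

```python
import math


class NearestNeighborRecognizer:
    """Remembers labeled examples and names the closest one later."""

    def __init__(self):
        self.examples = []  # list of (feature_vector, label) pairs

    def teach(self, features, label):
        """Store one example of an object the robot has been shown."""
        self.examples.append((list(features), label))

    def recognize(self, features, max_distance=0.5):
        """Return the label of the closest stored example, or None if
        nothing is close enough to count as a match."""
        best_label, best_distance = None, float("inf")
        for stored, label in self.examples:
            distance = math.dist(stored, features)
            if distance < best_distance:
                best_label, best_distance = label, distance
        return best_label if best_distance <= max_distance else None


robot_eyes = NearestNeighborRecognizer()
robot_eyes.teach([1.0, 0.0], "ball")   # first time the robot sees a ball
robot_eyes.teach([0.0, 1.0], "cup")    # first time the robot sees a cup

print(robot_eyes.recognize([0.9, 0.1]))  # a slightly different view of the ball
print(robot_eyes.recognize([5.0, 5.0]))  # something it was never shown
```

The cutoff distance is what lets the robot say "I do not know this object" instead of always guessing the nearest label.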

A newer area called RAP, or robot audition programming, draws on some forms of AI and aims to let a robot distinguish sounds and speech in noisy environments.

Many robots use ASR (automatic speech recognition) software that listens to speech and transcribes it into short pieces of text, which can then be processed and understood.
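Once the speech has been turned into text, the robot still has to decide what the text means. A minimal sketch in Python, assuming the ASR software hands back a plain transcript string, is to look for known keywords. The intent names and keyword lists here are invented for illustration.

```python
import re

# Hypothetical commands the robot understands, with trigger words for each.
INTENTS = {
    "greet": {"hello", "hi", "hey"},
    "stop": {"stop", "halt", "freeze"},
    "come": {"come", "here", "follow"},
}


def match_intent(transcript):
    """Return the intent whose keywords appear most often in the transcript.

    Returns "unknown" when no keyword matches at all. On a tie, the intent
    listed first in INTENTS wins.
    """
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    best_intent, best_hits = "unknown", 0
    for intent, keywords in INTENTS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent


print(match_intent("Please stop right there"))
print(match_intent("Come here, robot"))
print(match_intent("What is the weather like?"))
```

Keyword matching works only on the words themselves, so "oh great, stop now" said jokingly and said urgently look identical to it.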

The challenge here is that these programs usually cannot detect qualities of the spoken voice such as sarcasm, anger or tone, and they do not know the context of what is being said.

You can run ASR software locally on the robot or use a cloud-based service such as Google's Cloud Speech-to-Text API, where you pay per minute of audio processed.
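Because a cloud service charges by the minute, it helps to estimate the bill before wiring it into a robot that listens all day. The Python sketch below is simple arithmetic. The rate and the billing increment are placeholders, not Google's actual prices, so check the provider's current pricing page before relying on the numbers.

```python
import math


def estimate_asr_cost(durations_s, rate_per_minute, increment_s=1):
    """Estimate the cost of transcribing several audio clips.

    durations_s     -- length of each clip in seconds
    rate_per_minute -- price per minute of audio (placeholder value)
    increment_s     -- each clip is rounded up to a multiple of this many
                       seconds before billing (an assumption; providers differ)
    """
    billed_seconds = sum(
        math.ceil(duration / increment_s) * increment_s
        for duration in durations_s
    )
    return billed_seconds / 60 * rate_per_minute


# Example: 200 voice commands of about 4 seconds each,
# at a made-up rate of 0.02 per minute.
commands = [4] * 200
print(round(estimate_asr_cost(commands, rate_per_minute=0.02), 4))
```

Short commands are cheap when billed per second, but a coarse billing increment can multiply the cost, which is one reason some builders prefer on-device recognition for simple commands.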