Virtual Reality
Prof. Steve LaValle
Department of Multidisciplinary
Indian Institute of Technology, Madras

Lecture – 17
Audio (physics and physiology)
Audio for VR
- More neurons are devoted to vision
- Other senses can still be stimulated as part of the VR experience
- Audio is propagated from various sources in the real world
- Audio is synthetically generated in a VR environment
- E.g., in a cave, speakers generate appropriate sounds in a fixed way
- For an HMD, we hear sound from earphones
- When we turn around, the speakers close to the ears should also react accordingly
- The head tracking we learned for vision is also useful in the case of an HMD
- The issues faced for audio are similar to those in vision, such as latency and resolution

Sound waves; similar to light, except
- Fluctuating air pressure (instead of EM waves / photons)
- Frequency range only about 20 Hz to 20,000 Hz
- Speed 343 m/s in air
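
To get a feel for these numbers, here is a minimal sketch (Python, chosen just for illustration) that computes the wavelengths implied by the 343 m/s speed of sound at the extremes of the audible frequency range.

```python
# Rough sketch: wavelengths of audible sound in air (assumed speed 343 m/s).
SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def wavelength(frequency_hz: float) -> float:
    """Wavelength in metres for a sound wave of the given frequency."""
    return SPEED_OF_SOUND / frequency_hz

print(wavelength(20.0))      # ~17.15 m at the low end of hearing
print(wavelength(20000.0))   # ~0.017 m (about 1.7 cm) at the high end
```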

If we zoom in significantly on the central parts and look at a cross section of the inside, there is a displacement that occurs, due to the vibration, inside a fluid channel. Little hairs bend a bit, and this, depending on the frequency of vibration, causes neural signals to be sent to the brain. Just as there is an optic nerve, there is an auditory nerve; and just as there is a visual cortex, there is an auditory cortex to which the information is transmitted.
These vibrations are transmitted based on motion between this plate and the upper tectorial membrane. The sensors here are what I called mechanoreceptors, as opposed to the photoreceptors in the case of the eye.

This is another depiction of it as well.

And this is what it looks like under an electron microscope.

As you turn your head around, there is a fluid that compresses and decompresses along here. It is not a wave vibration as in the case of sound; it is a very low frequency motion in which the fluid sloshes back and forth along the canal. That causes pressure on these membranes, and there are again sensors that are mechanoreceptors, with little hairs that move back and forth as a consequence of the fluid moving. So, that transmits angular acceleration, because of the fluid displacement.
So, you can measure two axes with one of them and two axes with the other. You get four axes total, but of course you only need three independent axes, so there is a little bit of redundancy. You get the ability to measure both linear acceleration and angular acceleration. Remember that the gyros we use measure angular velocity, so it is a little bit different, but it is essentially the same information: by integrating angular acceleration we get angular velocity, and by integrating that we get orientation information.
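
As a hedged, single-axis illustration of that integration chain (hypothetical values, simple Euler integration), the sketch below shows how angular acceleration integrates once to angular velocity and again to orientation.

```python
# Single-axis sketch: angular acceleration -> angular velocity -> orientation.
def integrate(angular_accelerations, dt=0.001):
    angular_velocity = 0.0  # rad/s (what a gyro would report directly)
    orientation = 0.0       # rad
    for alpha in angular_accelerations:
        angular_velocity += alpha * dt        # first integration
        orientation += angular_velocity * dt  # second integration
    return angular_velocity, orientation

# Example: a constant 1 rad/s^2 for one second gives ~1 rad/s and ~0.5 rad.
print(integrate([1.0] * 1000))
```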

This is what it really looks like, if you were to cut out the vestibular organ and the cochlea as well.

Virtual Reality
Prof. Steve LaValle
Department of Multidisciplinary
Indian Institute of Technology, Madras

Lecture - 17-2
Audio (auditory localization)


Auditory localization is a very important part of perception. If we are going to make a virtual reality system that produces virtual sounds, and if in the real world we have the ability to localize, to figure out where sounds are coming from, we had better not mess that up; we had better not fail when we do it in virtual reality. So, how do we get these abilities? That is why understanding this is going to be very important.

So, there are three components, say 1, 2 and 3. The horizontal-plane direction is called azimuth, which I have represented as theta: some direction from 0 to 2*pi in the horizontal plane tells us where the sound is coming from. It is related to yaw in the coordinate systems we have been talking about for head transformations.
We also have the vertical component: how high or low is the sound? This is called elevation, and I represented it with phi. And then we have distance, which is represented with d. So, we are just using spherical coordinates.
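
As a small illustrative sketch, here is one way to convert a source position given in head-centered Cartesian coordinates into the azimuth, elevation, and distance description above; the axis convention (x forward, y left, z up) is an assumption chosen for this example.

```python
import math

def to_spherical(x: float, y: float, z: float):
    """Convert a head-relative position into (azimuth, elevation, distance)."""
    d = math.sqrt(x * x + y * y + z * z)            # distance
    azimuth = math.atan2(y, x)                      # horizontal-plane angle (theta)
    elevation = math.asin(z / d) if d > 0 else 0.0  # vertical angle (phi)
    return azimuth, elevation, d

print(to_spherical(2.0, 1.0, 0.5))  # a source ahead, slightly left and above
```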
So, this parameterization just ends up being convenient. Where the ears are placed varies the type of information that we get and are able to infer, which allows us to resolve where sounds come from. I am going to give an example of a just noticeable difference with regard to auditory localization.

It is called the minimum audible angle, or MAA, which is an example of a just noticeable difference. This would be exactly as I said: trying to localize where the bird sound is coming from.
In terms of the azimuth, we have the head, and I will draw a nose here, so we are looking top down. I want to understand the smallest angular change, delta theta, that can be detected. You make a very small change and ask people to tell whether or not the source has in fact moved.
One thing that is interesting is that when the source is closer to the front, we are much better at it; when it gets to the sides, we are not as good at it, so that is one thing to pay attention to. Straight ahead, it is around 1 degree if the stimulus is below 1000 Hz; off to the side, it is around 5 degrees.

So, let me go over some monaural cues. One, we get a significant amount of information from the pinna, that is, the shape or geometry of the outer part of our ears, and from the shape of the external ear canal. So, basically the funnelling part provides a significant amount of information about where sound is coming from. A kind of signal-processing filter or transform is performed by the outer ear.
I will get into more details of that shortly, but I just want to point out that this is a significant amount of information that lets us determine where sound is coming from. The sound is distorted in different ways across the frequency spectrum depending on where it is coming from, based on how the sound waves propagate through your pinna and external ear canal.
Two, the intensity decreases by the inverse square law; we talked about that in tracking systems for light. The same thing applies to audio. This may be the equivalent of the monocular depth cue that has to do with retinal image size.
So, if you know how loud something should be, maybe it is the Asian Koel and you know how loud that bird typically is; if you can barely hear it, it is probably far away. I do not need two ears to determine that, I just need one. If it is a very unusual sound that you have never heard before, maybe this cue will not be so good, because you are not sure how loud it is supposed to be anyway.
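
Here is a minimal sketch of that inverse-square cue, assuming we know the source's intensity at some reference distance; the function names are hypothetical and the numbers are only illustrative.

```python
def relative_intensity(distance_m: float, reference_distance_m: float = 1.0) -> float:
    """Intensity relative to the reference distance, by the inverse square law."""
    return (reference_distance_m / distance_m) ** 2

def estimated_distance(intensity_ratio: float, reference_distance_m: float = 1.0) -> float:
    """Invert the inverse square law to guess distance from the relative intensity."""
    return reference_distance_m / intensity_ratio ** 0.5

print(relative_intensity(4.0))          # 1/16 of the intensity measured at 1 m
print(estimated_distance(1.0 / 16.0))   # ~4.0 m
```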

Let us suppose I just transmit the same sound to both speakers, with no stereo separation to worry about. You are listening to some kind of music, it is really mono, and I am just putting the same audio track out to both channels. So, I have a left track and a right track, and everything seems fine.
Now, what I am going to do is move my head over here, and I ask: do you hear the far speaker at all? If I go over here, it should be the case that I hear the sound from this speaker, and the sound from the other one comes in significantly later; there should be a time shift, because that sound is travelling further. But I do not hear that, do I?
I just hear the sound from this one speaker; I do not hear them both. Try it sometime: arrange two speakers and walk back and forth. Some of you have done this before, I think; I am not just making this up. You get really close to one speaker and you hear only that one; you get to the perfect place in the middle and you hear both of them and think, wow, this is perfect, this is where I am supposed to be. Then you get over to the other speaker and you only hear that one.
Of course, your ears are taking in the sound from both, but your brain is masking away the reverberation from the other one. Because it is a secondary effect, it is perceived as essentially the same audio, just time shifted and at lower amplitude. So, it gets masked away; it is an auditory illusion.
You do not hear the extra echo, the time-shifted version of that sound falling on your ears. If, while I am over here, the other speaker were suddenly turned off, of course I would notice it, but you do not perceive its contribution at all while both are playing.

So, based on that time difference, believe it or not, our brains, our neural structure, resolve that temporal difference; they measure the temporal difference, or the phase shift, between the waves that are coming in, and use that information to determine where the sound is coming from.
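
To make the numbers concrete, here is a hedged sketch using the classic Woodworth spherical-head approximation of the interaural time difference (ITD) for a far-field source; the head radius is an assumed average value, and this is only one of several common ITD models.

```python
import math

HEAD_RADIUS = 0.0875    # m, assumed average head radius
SPEED_OF_SOUND = 343.0  # m/s in air

def itd_seconds(azimuth_rad: float) -> float:
    """Approximate ITD for a far-field source at the given azimuth (0 = straight ahead)."""
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (math.sin(azimuth_rad) + azimuth_rad)

print(itd_seconds(0.0))          # 0 s for a source straight ahead
print(itd_seconds(math.pi / 2))  # ~0.00066 s (~660 microseconds) for a source at 90 degrees
```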

Let me say something about the geometry of that, and then we can take a break.

And, as we may remember from basic conic sections and analytic geometry, hyperboloids come in two sheets. You get one sheet if one signal arrives first, and you get the other sheet if the other signal arrives first, so which sheet you get depends on the actual order of arrival. I am drawing it in 2D, but you actually get a hyperboloid; it should be peaking right on the axis here.

And the hyperboloid is referred to by perceptual psychologists by the great name of the cone of confusion. So, there is a cone-shaped, hyperbolic region over which you cannot localize any further using only interaural time differences.
Now, one thing I find fascinating is that we can in fact determine where sound is coming from inside the cone of confusion. Part of that is because we are using interaural level differences, but part of it is because of some more information that is coming later; to give you just a hint, it has to do with the pinna. So, we can use more information, but if you are only looking at interaural time differences, you have a cone of confusion, a region within which you cannot distinguish any further where the sound is coming from based only on the time difference of arrival of the sound waves. Questions about that?

Virtual Reality
Prof. Steve LaValle
Department of Multidisciplinary
Indian Institute of Technology, Madras

Lecture - 18
Audio (rendering)


So, we have sound rendering. Just like we had rendering for the visual part, which we call computer graphics, there is an audio part that has many similarities to the visual part. I will describe four steps and give some comments as I go along; it will be significantly briefer than it was for the visual case. So, four steps. One, we have modeling.
For the visual case, even before we covered vision versus audio or anything else, we talked about geometric models. So, we will again have geometric models, stationary or movable. You could have a moving object that generates sound, or sound bouncing off of a moving object.
So, walls, obstacles, moving bodies; perhaps you have a bird that is flying in VR and generating sounds. When we were modeling objects for rendering in computer graphics, we were concerned about the material properties in some cases; we wanted to know how the light reflects off the surface of our objects.
So, what do we want in this case? We want to know how the sound reflects off the objects, is absorbed into them, transmits through them, or is diffracted around them. So, we have acoustic material properties. And just like we had light sources, we have sound sources: you can have a point sound source, or you could have some kind of parallel wave source, maybe an entire vibrating virtual wall that makes parallel waves. In the case of a point sound source, we can make the same statement as we did for light: we never end up with truly parallel waves; we have propagating spherical wave fronts. So, the source could be a point source or a parallel wave source, like a vibrating plate.

We can talk about the loudness of the source, just like we talked about the brightness of a light source. So, these are the modeling components, just as we had for the visual case.
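
As a loose sketch of these modeling components, the following hypothetical data structures bundle geometry, acoustic material properties, and a point source with a loudness; a real audio engine would organize this differently, so treat the names and fields as illustrative only.

```python
from dataclasses import dataclass

@dataclass
class AcousticMaterial:
    absorption: float    # fraction of incident energy absorbed on reflection (0..1)
    transmission: float  # fraction transmitted through the surface (0..1)

@dataclass
class Surface:
    vertices: list            # geometric model (e.g. a triangle), stationary or movable
    material: AcousticMaterial

@dataclass
class PointSource:
    position: tuple  # (x, y, z) in the virtual world
    loudness: float  # source power, analogous to the brightness of a light source
```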

Two, we need to think about propagation through the virtual world we are constructing. Using the acoustic material properties, we want to figure out how the sound is going to propagate. Again, we have all the things from before; I will just write them here: reflection, diffraction (that should be two f's there), and refraction or transmission. If the avatar is moving through VR we may also get a Doppler effect, or if some object is moving towards us while making a sound.
That is something we did not worry about in the light case. The red shift that occurs in light is useful for tracking the motion of stars in the sky, for example, but it does not end up being a large enough shift to be significant for VR; here it ends up being important. So, we have the Doppler effect, and the overall amount of attenuation: sound waves get weaker as they propagate through the air at a more significant rate than light waves. These are all important factors involved in propagation.
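
As a small worked example of the Doppler effect mentioned above, here is the standard moving-source, moving-listener formula for sound in air; the sign convention (positive speeds mean approaching) is a choice made for this sketch.

```python
SPEED_OF_SOUND = 343.0  # m/s in air

def doppler_frequency(source_hz: float,
                      listener_speed_toward: float,
                      source_speed_toward: float) -> float:
    """Observed frequency; speeds are components along the line between source and listener."""
    return source_hz * (SPEED_OF_SOUND + listener_speed_toward) / (
        SPEED_OF_SOUND - source_speed_toward)

print(doppler_frequency(440.0, 0.0, 10.0))   # source approaching at 10 m/s: ~453 Hz
print(doppler_frequency(440.0, 0.0, -10.0))  # source receding at 10 m/s: ~428 Hz
```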
And we want to think about computational approaches. Remember, we are making an alternate world generator; as we discussed very early on in the course, we are making a kind of simulator to propagate the audio being constructed in the virtual world. At the highest level, there are two big categories, I would say: what I would call a numerical approach versus a combinatorial approach.


So, you have the same kinds of choices for audio: you can go in the numerical direction, or you can try to find some kind of combinatorial, quick visibility or ray-tracing kind of hack, let us say, that is hopefully good enough. These choices exist here as well: the numerical one is very expensive but very physically accurate, and the combinatorial one is hopefully good enough.
I think there is a lot of research and a lot of development to be done on the combinatorial side to make fast and efficient methods, just as was done in computer graphics across the 1970s and 80s, when people had very limited hardware and nevertheless generated very nice-looking pictures, and eventually videos, synthetically.
So, we are in the same kind of place here with audio: it is not yet known how much of these models and the acoustic material properties need to be fully specified, or how much of this information is going to be critical for a good audio experience. You will hear something, but will it be reproduced in the same kind of way as it would in the physical world, giving you the ability to localize the source, for example, as you can in the real world? That could be very important for the type of experience you are building. As I said in the beginning, it depends on the task: what is the experience you are trying to make?
If you are trying to make a horror experience, maybe some kind of scary experience, then where the sounds are coming from in some kind of not very well lit environment might be an important aspect, and you might want to capture that very well. If you are just making an experience where you are talking with your friends, maybe some amount of localization is important, but maybe not a large amount.

So, 3: Rendering. Rendering is going to determine what output needs to be given to the audio display, which we may call the speakers; so, what needs to be presented there. One thing we have to do is use the head position and orientation to determine the appropriate air-pressure signal at the right and left ears. Just like we have to give the right visual information to your virtual right and left eyes in VR, you have to figure out what waves should be hitting your virtual right and left ears in virtual reality.
That means that if you are moving your head around, this has to be tracked, and you have to adjust the sound that goes into your ears based on that. What goes into the audio display, the speaker, should depend on your head position and orientation. And if you grab a controller and move your character, that also changes your position and orientation in the virtual world, even though it is not changing in the physical world; the audio should be adjusted accordingly as well if you want to convince your brain that you are in fact moving. That may further contribute to the vestibular mismatch, or maybe it helps that problem; maybe putting audio and visual together helps overwhelm the vestibular signal. I do not know which one will happen; you would have to do experiments to see.
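
Here is a deliberately crude spatialization sketch, not a production method: given a tracked head pose and a source position, it derives a per-ear delay from path length and a per-ear gain from the inverse square law. The two-point ear model, the 2D geometry, and all names are assumptions made for illustration; a real renderer would use HRTFs, as discussed next.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0875    # m, assumed half the distance between the ears

def ear_positions(head_pos, yaw_rad):
    """Left/right ear positions for a head at head_pos facing yaw_rad (x-y plane only)."""
    right_dir = (math.cos(yaw_rad - math.pi / 2), math.sin(yaw_rad - math.pi / 2))
    left = (head_pos[0] - right_dir[0] * HEAD_RADIUS, head_pos[1] - right_dir[1] * HEAD_RADIUS)
    right = (head_pos[0] + right_dir[0] * HEAD_RADIUS, head_pos[1] + right_dir[1] * HEAD_RADIUS)
    return left, right

def delay_and_gain(ear, source):
    """Propagation delay (s) and inverse-square gain for one ear."""
    dist = math.hypot(source[0] - ear[0], source[1] - ear[1])
    return dist / SPEED_OF_SOUND, 1.0 / max(dist, 1e-6) ** 2

left, right = ear_positions((0.0, 0.0), yaw_rad=0.0)
print(delay_and_gain(left, (2.0, 1.0)))   # a source off to the left arrives earlier and louder
print(delay_and_gain(right, (2.0, 1.0)))  # at the left ear than at the right ear
```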

Virtual Reality
Prof. Steve LaValle
Department of Multidisciplinary
Indian Institute of Technology, Madras

Lecture - 18-1
Audio (Spatialization and display)


Well, we then have to account for some additional scattering of sound waves. The problem becomes: account for the scattering of sound waves due to the outer ear, the part which I called the pinna, and the canal that goes into your inner ear, as well as the shape of your head and your whole body. It could even scatter sound based on what clothes I am wearing today; maybe it is different from day to day based on what I am wearing, or different based on whether I am wearing a hat.
So, the sound that ultimately comes into your ears propagates differently, and you have to simulate the rest of that. Remember when we talked about how close the stimulus generator is to the sense organ: if you make it very close, you end up having to simulate everything that lies beyond it. So, how do you do that? How do you deal with that? Well, people have studied this very carefully, and they have come up with what is called an HRTF.

Well, it sounds very similar to the BRDF from when we talked about reflectance models; it is a similar kind of function, but not exactly the same. It is the head-related transfer function. This is actually the extra information, the scattering due to your outer ear (the pinna), your head, and your whole body, that we use in order to resolve the source of a sound inside the cone of confusion that I talked about last time. So, it is an extra amount of scattering: there is a transformation happening based on where the sound is coming from, and it is a transformation in the frequency domain.
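
In practice, an HRTF is commonly applied by convolving the dry (mono) source signal with a measured head-related impulse response (HRIR) for each ear. Here is a minimal sketch of that idea; the HRIR arrays stand in for measured data and are not supplied here.

```python
import numpy as np

def spatialize(mono_signal: np.ndarray,
               hrir_left: np.ndarray,
               hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with per-ear HRIRs to get a two-channel output."""
    left = np.convolve(mono_signal, hrir_left)
    right = np.convolve(mono_signal, hrir_right)
    n = max(len(left), len(right))
    stereo = np.zeros((n, 2))
    stereo[:len(left), 0] = left    # channel 0: left earphone
    stereo[:len(right), 1] = right  # channel 1: right earphone
    return stereo
```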

Let me show you how these are measured in a studio. So, you have the head; I will put some ears here, and suppose it is looking upward, or we just have a top-down view.
We place the subject in what is called an anechoic chamber, which means that the walls all around the subject fully absorb the sound, so there is no echo back; the materials all around are absorbing.
Then, inside this chamber, you have the ability to place a speaker that generates a sound source, and you put it at different locations, perhaps every 15 degrees. You generate an impulse, just a single impulse, and you look at the impulse response; this is the standard way of designing filters. So, you look at the impulse response from that particular location, and you record it across the frequency spectrum.
Then you move to another location and do the same. You can do this at various angles, both in the horizontal direction and the vertical direction. You can also look at different distances, or you can just make a simplifying assumption that the sounds are coming from sufficiently far away that you do not need to worry about distance. So, we could measure this HRTF as a function of frequency, the two angles, horizontal and vertical, and some distance d; as I said, we could put the speakers further and further away.

Or we use a far-field approximation. This is like the assumption of parallel wave fronts for light, and it does correspond to this case; it simplifies things down to characterizing the transfer function in terms of frequency, and beyond that it only depends on theta and phi.
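
Under the far-field assumption, a stored set of measured responses is indexed only by the two angles, so rendering can simply look up the pair of impulse responses measured closest to the source direction. A hypothetical sketch, where hrir_table is a placeholder for measured data:

```python
def nearest_hrir(hrir_table: dict, azimuth_deg: float, elevation_deg: float):
    """hrir_table maps (azimuth_deg, elevation_deg) -> (hrir_left, hrir_right)."""
    def angular_gap(key):
        az, el = key
        # Wrap azimuth differences around 360 degrees; elevation is compared directly.
        return min(abs(az - azimuth_deg), 360 - abs(az - azimuth_deg)) + abs(el - elevation_deg)
    return hrir_table[min(hrir_table, key=angular_gap)]
```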
And where I have drawn the center here makes me wonder: I put it at the center of the head, but it may make more sense to study the HRTF for each particular ear and just move the center over. So, I could take this picture, center it on the ear, and then move all the way around at a fixed distance from the ear. In fact, that might make more sense than centering on the head. I could do it once for the right ear and once for the left ear; I am pretty sure the pinna is not exactly the same from ear to ear, left and right, for an individual person.

And then, if we think about the choices that we have, there is surround sound, of which I gave an example in the first lecture. We could have a surround sound system that is fixed in space. I suppose I could mix things: I could have a head-mounted display for the visual part but still have audio surrounding me in the room. So, it could be half cave-like and half head-mounted-display-like; we could separate them out, though that is probably not too convenient.
This is what we would have in a cave system, or if we wanted to do it for a head-mounted display, we could put speakers all around. So, with surround sound, the display, the speakers, is fixed in space; versus, I could be wearing what we normally refer to as headphones, or the kind that sit only in the ears, which I was calling earphones.

And in this particular case, we have speakers that are placed on the outside of the ear. There is an interesting question of whether that compresses your pinna when you put them on. Very often, with what are called closed headphones, you are compressing the pinna, blocking off all of the outside sound, and then generating a sound for your ear. So, the part that ends up being important here, this extra scattering, may get lost.
So, do you try to compensate for that with an HRTF? You may have to. Versus open headphones, which may leave more room for the sound to propagate in through the pinna and also be combined with outside sounds. So, open headphones means that the sound from outside the headphones is being added.

So, what are the problems or challenges for developers? One is how much modeling detail or accuracy is needed. How much do we have to worry about this? Can we make very coarse, crude, simplified models that will be sufficient, or do we really have to pay attention to every bit of detail? Do I have to worry about the fabric in all of your clothing if I want to model the acoustics of my lecture today? I do not know how much of it matters, and if it does, how to accomplish this. What we would like to have is good middleware to facilitate this.
We have geometric modeling tools, we have game engines; we have all kinds of things out there for the visual part. We need good tools for the audio part. It is going to take many years to develop these to make it easier for developers of virtual reality content, so right now we are sort of in the early ages of this, where you have to do all the hard work yourself.
So, the developer who is trying to be creative has to get into a lot of the technical details and do the implementations themselves. Whereas, for the visual part, you could go to something like Unity or Unreal Engine and very quickly make beautiful visual content without worrying about the technical implementation aspects, in most cases.
Another challenge is how to evaluate correctness or sufficiency for your task: have you accomplished your goal or not? If it is visual, you may look at it and say it looks fine. If it is audio, do you care about the fact that you might have lost some localization capability? And if it is important for your task to have that localization capability, how do you know that you have maintained it? How do you know that you have reproduced the sounds well enough to maintain it?
So, the ears are not as sensitive as the eyes in some ways. And these HRTFs: is it important to get them correct, or does your brain just adapt to a different HRTF? If you eliminate it altogether, how much of your localization capability have you lost, and is it critical for your application? So, this gets very difficult.

And generally, you have a problem that I would call one of psychophysics, as when we talked about the perception of sound: you have to design experiments to determine whether or not you have succeeded. So, it may become much more complicated; you may have to design experiments and bring in a handful of subjects to evaluate whether or not the simulation you are performing is correct, or sufficient for your task.
And generally, what are the computational costs associated with doing these audio simulations? Can you take shortcuts and still be effective with regard to number two, getting it correct or sufficient for your task, while staying within your computational budget?
In computer graphics, people struggled with this for a very long time and then designed GPUs. Are there going to be audio processing units that handle exactly the most important acoustic aspects, or will it turn out that audio does not have to be as high fidelity as the visual case, so that specialized processing units are not needed? How far do we have to go?
Virtual Reality
Prof. Steve LaValle
Department of Multidisciplinary
Indian Institute of Technology, Madras

Lecture - 18-2
Audio (combining other senses)
Any questions about this? I would like to go on to the next topic, which is going to be interfaces for virtual reality; I want to give you a high-level overview this time, and then in the next lectures we can finish up. Actually, let me back up just a bit; I made a mistake. I want to say something about the combination of senses first. I have finished talking about audio and about the visual senses; let us think about multiple senses now, and how all of these come together in virtual reality.
So, I want to talk about that first, before getting to higher-level concerns such as interfaces.

So, what have we looked at so far? We have the vision sense, we have the auditory sense, hearing, and we have the vestibular sense. We have not talked about rendering directly to the vestibular organ, or making displays for it, but it is nevertheless impacted by virtual reality, so it becomes important. We did not spend time on haptics, or touch, but that is also very important for virtual reality; I could easily spend a couple of lectures on haptic feedback, as I did for the auditory case. And one that gets the least attention, but which we might as well include for completeness, is smell and taste, which provide information through chemoreceptors.

This is called the McGurk effect. So, I do the following experiment: I show you a video of someone speaking words, or in this case particular syllables or sounds, and I also present audio of that at the same time, and I put them together. In the real world, these are consistent: if I make a "ba" sound, you hear "ba" and you see my lips coming together to make "ba"; if I say "ga", then you do not see my lips come together. But I can make it so that the auditory and visual parts are in conflict.
It turns out, as shown by McGurk and colleagues who have done similar experiments, that if you change what you see versus what you hear, your brain will construct something different from both. One of the very simple experiments they did was that the spoken part was "ba"; so you hear "ba", you see "ga", and then you swear that you have heard "da", some other sound altogether. You can even tell people about this effect: you say, ok, you are about to see the McGurk effect, this is what you are going to experience; you tell them this, and they still cannot prevent their brain from hearing a "da" sound.
So, it will still seem to you that you have heard a "da" sound, because there is a mismatch between the visual and auditory parts. You can go watch videos on this; it is a fascinating effect. Sometimes with an optical illusion, I can tell you about the illusion and you go, ok, I do not see it anymore, I see what is going on; sometimes not.