Alison's New App is now available on iOS and Android! Download Now

Module 1: File Organisation

    Study Reminders
    Support

    Right. So in the last few lectures we have seen the structure of disk files, right. Andthen we have also seen what are the primary, possible primary organizations for thefiles. And continuing our discussion on how to handle data on secondary storage, stablestorage. In this lecture we will focus on what are called index structures okay. Sobasically in index is a disk data structure which will enable very efficient retrieval of arecord given the value of a certain field on which we have constructed that index. Sothose attributes on for on which we construct this index are usually called theindexing attributes, okay.So given the value of these indexing attributes, this data structure will help usretrieval the records in the data file that have values matching the given values. Theremay be multiple records, there may be one record, etc., okay. So essentially how toget hold of those records, given these values for the indexing attributes. And what arethe various ways ideas that are available for us to build these indexes, what kind ofindexes are there?How do we build indexes? In fact, you can actually use the idea of hashing also tokind of construct indexes, okay. So In this lecture, we will basically focus on youknow index structures where we are going to put the physical address, physicaladdress of the data block in which the record is present in the index, in the index. Sowhen you consult the index, the index will finally give you a disk block address or afew disk block addresses where you will find those records, okay.So directly you can go to those disks, disk blocks and then pick up your records. It isalso possible for you to kind of organize this in such a way that finally the indexactually will give you some something like the key of the you know record so that youcan go to the you can exploit the primary organization of the data file on which youknow whatever it is the primary organization and then get hold of the actual records,okay.So that is also possible. So there are, so where I am going to discuss a few ideas, and Iexpect you to kind of mix and match these ideas, right? Okay, so let us look at whatkind of indexes are available. So we call something as a primary index if the index isbuilt on the ordering key field of the file. That means a file is actually an ordered file.It is an ordered file. And the ordering field actually is also a key, right?And if you build an index on that, then such an index is usually called as a primaryindex. Now a clustering index is an index built on ordering non-key field of a file. Thefield is still an ordering field. That means the data is available in the ascending orderor descending order of the particular attributes values. But it turns out that it is notaccurate key, right. So that is one possibility.The other one is called the secondary index where the index is actually built on somearbitrary field. See if the data file is ordered on a particular field right, all the otherfields are non-ordering fields, right. So if you want to build a index on one of thesenon-ordering fields, so you will such a kind of index is what is called a secondaryindex. Now, so mostly we are actually focusing on ordered files, ordered files.So for an ordered file, you can only have one primary index because there is only oneordering field, right. So one primary index is can be built. And if that ordering fieldhappens to be not a key field, then you will build one clustering index. So either aprimary index or a clustering index you will build. But on the same file, you canactually build several secondary indexes.If you want faster access to the data records on fields, which are not the ordering field,then you can actually build several for each of those fields, you can build a secondaryindex, okay. So these, how do we build these indexes and where do we store them?Obviously, these are all disk files again. So in index will again generate a file. So thatfile has to be linked to the data file.And then the availability of the what is the primary index for the file? And how manysecondary indexes are available? If there are there where are there, all thatinformation has to be there in the header of the data file, okay. So let us look at theseareas. These are very interesting ideas and they will definitely help us get the, youknow very fast access to the index to the data records, okay. Let us start off with theprimary index. So the primary index is a very simple idea. We have a data file versus this is the datafile, right. So the data file has records and let us assume that this is the field on whichwe want to build the index. So this is an ordered file. We can see that these are theindividual records. These individual records are in the increasing order of thisparticular value. Let us assume that it is something like a roll number. And it is a keyalso.So there are no two records with the same values, okay. So but each of these, each ofthese things is a block, you know this thing is a block. So it will have a bunch ofrecords, okay. So we have a sequence of these blocks. And what we do here in orderto build a primary index is keep the value of the first block and a pointer to the firstblock as the first entry in the index.Keep the value of the first record of the second block and a pointer to that in thesecond entry and so on like that. So in general the index these are called index entriesand this is the index file. So the index file will have value of the ordering key field forthe first record of the block B j, okay and then the disk address of the block B j. Nowthese first records of each of the blocks are actually called block anchors.You can call them as they are called as block anchors. They are the first especiallywhen the data file is a ordered file we call them as block anchors. So for the blockanchors, we have the values and the disk pointer to the block. So that will constitutethe index, okay. Now let us look at the index file. The index file, this index fileobviously itself is an ordered file. It is ordered on this values.It is sorted on this ordering key field values. And what will be its size? The size willbe the number of blocks that are there in the data file. So compared to the data file, theindex file will be small, because it has one entry per block of the data file. Andremember the block has huge number of records and the maximum number of recordsthat is blocked in whole is called the blocking factor, the blocking factor.So now you can calculate this index file itself has to be on the disk. So it has to beaccommodated in a few blocks. So now we can calculate the index file blockingfactor or the blocking factor for the index file. So blocking factor for the index file isthe block size divided by the length of the record. Because that is the maximumnumber of records that you can hold in a block.The blocking factor is nothing but the maximum number of records that you can holdin a block. So now the records are of length V + P, V is the length of the value. I meanordering key field size and P is a block pointer, whatever is the block pointer size. It isusually about 6 to 8 bytes. So block pointer size. So B divided by V + P will give usthe blocking factor for the index file.Now this is the number of index entries that can be held in a particular block. Theblock size is same because we are using the same disk. The block size is same.Usually this blocking factor for the index file will be much larger because the indexentry is much smaller compared with data record. Data records are usually long. Theindex entries are small. So the blocking factor for the index file will be high.And the number of blocks that the index occupies will be low, you get it? Sogenerally, the blocking factor is much higher for the data than the data file blockingfactor. And now you can calculate what is the number of index file blocks that areneeded. So let us call this b i the and let us call b as the number of data file blocks. Sothese are the data file blocks 0, 1, 2, 3 actually you should probably number them as1, 1, 3 up to b. So b is the number of data file blocks that we have.b divided by the blocking factor BF i for the index file will give us how many numberindex file blocks. Now what is the advantage of this thing? If you want to have aaccess to a particular record given this ordering key field value, what you can do is todo a, you can do a binary search. You can do a binary search on this and then get holdof the record. Go to the block where that particular record would possibly be there andthen fetch the record. So that is the advantage. So record access using this primary index. Given the ordering key field value x versussome value x carry out a binary search on the index file, okay. So let m be the valueof the ordering key field for the first record in the middle block k of the index file. Solet us consider the cases now. So x is strictly less than m then obviously you knowthat this data record will not be there in the portion in the lower portion of the datablock.So you have to do a binary search on the blocks 0 through k - 1 of the index file, k isthe middle block, k is the middle block. How do you get the middle block? Again theindex file is a like a file. So it will have its own file headers and it will have thenumber of blocks that are present in the file and then block pointers for all of theblocks. Let us assume that, okay.See in the index file header, you have what are the number of blocks and informationabout what are the number of blocks available and for each of the block where is theblock, the block pointers are available. Okay, so you can find out what is the middleblocks thing and get the value of m or the first record in the middle block. So youhave already done one block access here.Then you do a binary search on 0 through k - 1 of the index file blocks. Otherwise, ifx is either equal to m or greater than m, then you basically have got hold of thismiddle block k. So in that middle block you search for this entries v j and v j + 1 okaytwo entries v j, v j + 1 that contain this x. So x is either less than or equal I meanbetween v j and v j + 1, strictly less than v j + 1 okay.So if this is the case, then the block pointer P j will give you the data file block wherethis particular record is present, okay. So that is the so remember this block of theindex file, block of the index file has several disk pointers, right? Remember, I willshow you this picture again. So this has been blocked I mean it has been divided intoblocks. So each of the blocks will have several disk pointers.So that is why you need to search through here and then figure out between what twovalues is this particular value that you are searching for is present and then pick up theappropriate data file pointer. So use the block pointer P j to get the data file block andfirst search for the actual data record in that block, in the data block.Otherwise, if x is not present here, if such index entries are not present, then youknow that the record that you are looking for does not have a pointer in the middleblock. So you go and do binary search in k + 1 through b i. So like this you can figureout actually where the data file block is that contains the record that has this value x.These is only one record, right.So the maximum number of block accesses required will be log b i to the base 2because you are doing binary search. So let us look at some sample values and get anidea of this this figures. Let us say the data file has some 9500 blocks and the block size is about 4 kilobytesand the ordering key field length is about 15 bytes then block pointer is about 6 bytes,okay. So now the index file will have 9500 records. It will have exactly as manyrecords as there are blocks in the data file. So 9500 records will be there. And the sizeof the entry will be 21 bytes, 15 + 6 21 bytes.So now you can calculate the blocking factor for the index file that is 4 kilobytesdivided by 21 equal to 195 index entries will fit into one block. Now you can see theactual number of blocks that are required will be 9500 divided by this 195. We willsee that about 49 blocks will be required to hold the index file itself. And so basically,you are going to do search on this blocks.So the maximum number of block accesses for getting a record using the primaryindex will be log b i to the base 2 c + 1. Because this is the number of block accesseson the index file to get hold of the actual index entry that gives the block pointer.Then finally you have to access the data file using that pointer and then one moreaccess is required. So if you do not have this primary index then the you can of coursedo the same binary search on the data file itself.You can do binary search on the data file also, right. Because the data file itself isordered, okay. So and it has 9500 records, so log b to the base 2 flow that will giveyou 14. So from 14 block access, you are able to reduce it to 7 block accesses. Andthen you will be able to get hold of your data record. Now this is what it is calledprimary index. It is constructed for data files that are ordered and on the orderingfield, okay?If the ordering field is not a key, then similar thing can be done, we will call that as aclustering index. Now here is a again another interesting idea. The index file is itself an ordered fileanyway. So why do we not try and construct the index for the index file? Why do wenot construct an index for that index file. So we have data file which has a 9500blocks and we have first level index which has 49 blocks and 9500 entries. And thisitself is a you can treat this as a data file which is ordered.Now obviously, I can also construct an index for this. An index for this will have 49entries, okay. So I can convert this into a multilevel index. So there can be asuccessive levels of indices built till the last level has one block. This is last level isthis one. So now this 49 entries will be there because there are 49 blocks here in theindex file. So there will be 49 index entries here and these 49 index will fit into oneblock because the blocking factor for the data file is 195.You can fit a large number of records. So this is one block. Now for the search, youactually now do not need a binary search at all. I mean, disk binary search you do notneed. You can get hold of this block first, okay. That will give you access to locate anappropriate index entry given the value here. So you can of course do in memorybinary search, but we do not count in memory binary search at all.Our complexity is in terms of block accesses. So memory operations we will simplyignore. They are so fast that in compared to disk block accesses, we will not countthem, okay. So you can get hold of this the index for this index file, and then do asearch to record to locate an appropriate record here, okay. And then get hold of thedata pointer and then go to the actual data file get hold of the record.So now you can see, so for such a kind of a multilevel, multilevel index, we call theheight as the number of levels. So this is one level first level, this is a second level. Sothe number of levels is 2 here, so we will call it as you know the height is 2 plus soyeah so height is 2. And the number of block access required is height plus 1 becauseheight, number of block accesses are required to navigate through these index levels.And then finally you will get a data file pointer. So plus one will be required, onemore block access is required to get the actual data block. Okay, so the number ofblock accesses required with this primary, this kind of a index will now be just 3. Wewill reduce it from 7 to 3. So one here one here. And finally one here. So withoutindex it is 14. Now we have brought it down to something like 3.So three block accesses especially when you keep this particular 4 kilobytes of youknow data in the memory, then you do not have. See if you are frequently accessingthis data file, you might actually buffer this and then keep it in memory. So in 2 blockaccesses you usually get your record, which is not bad compared to 14 block accessesthat we had to do earlier. Okay, so this is the idea of turning indexes into multilevelindexes. Okay, now let us see how you can, you can do range search pretty efficiently now. Onthis data file, you can do range search pretty efficiently because all that you have todo is to say supposing you are to get hold of records ordering key field value betweenx 1 and x 2 inclusive, then you can locate the record which has value x 1 using theindex and then keep on reading successive records from the data file till the orderingkey field value exceeds x 2.So it will be very efficient. It is as good as locating x 1. From a block access point ofview, it is locating x 1 and then reading as many data file blocks as required. So thatof course depends on the range that has been given. So range search will be veryefficient. What about insertions and deletions? What about insertions and deletions?So imagine that you have a data file and then you have constructed this first levelindex, and then constructed another index for that maybe in this case it is stopped with2, maybe there is one more level, etc. So there is some h number of indexes. And thennow somebody want to come and insert a few records into your data file. And it is anordered file. And so there can be movement of records across blocks.And if records move across blocks, then the whole of your structure goes for a toss,right? We are in great trouble if you have to insert new records into the data file. So ifthe data file is, you know relatively static, you do not have too many insertions to beperformed in the data file. Then this idea of you know constructing indexes and thenmaking them multilevel and all that will be very useful.But in practice, there can be some insertions but how do you handle them? So what dowe do? There are multiple ideas to take care of insertions. One idea is to keep somesomething like 24% of the space in each block free anticipating that there will be datarecords coming into the, into that particular block. The moment you have some freespace in each block, then the block anchors do not change.If a new record comes into the picture if it comes to the first version, the block anchorwill change, otherwise you know nothing much happens. So it is not a major issue. Soto take care of future insertions, you can keep some 25 to 30% of the block spaceempty or you have to use overflow chains for the blocks. Basically have some bunchof overflow blocks where you can hold this overflow records.And then but essentially the idea is do not change the, do not change the orderingamong the records of the data files. If you change them, then you have to change notonly the first level index and then probably other indexes you will have to anddeletion is always handled using deletion markers. So for each of the record you havea field, you create a field where you mark whether it is valid or invalid, in case it isdeleted.So you can turn that you know flag off on in case it is to be deleted. And essentially,we will if you are organizing data files like this we will try and avoid changes to theindex. Okay but this is of course, this will, there are there would be files like youknow which we will create say once in a year or something like that. The all the, youknow students admitted into the university in that particular year, you create a datafile.And you keep accessing that all through the year or you know probably for 4, 5 years.So for such kind of files, this is a kind of a perfect situation where you can accessthem on say roll number and things like that. You want to have quick access to therecords. Now the idea of a clustering index is similar. The clustering index is similar. Onlything is the clustering index is actually built on the ordering field, which is not a key.The ordering field is not a key. So in that case, the index attribute index entries will belike this distinct value of the ordering field and what is the address of the first blockthat has a record with this particular value V i.So distinct value V i and what is the address of the first block that has a record withvalue V i. In fact the first record of V i because there may be multiple records havingthe same value V i, okay. So what is the address of the first block. So that is that willbe the index entry. They of course index entries are small in size so the index will fitinto fewer number of records, fewer number of blocks compared to the data file.Now the index file itself of course is an ordered file and the number of entries in thatis the number of distinct values of the ordering field. And you can actually constructeven you can even convert this into also a multilevel index. Remember that the firstlevel index entry first level index is an ordered file and this ordering field isaccurately fill in that file because we have taken distinct values of V i.So from the first level onwards it becomes the field on which we are searching, in factis a key. Only for the data file it is not key, okay. So if you are constructingconverting a clustering index into a multilevel index, you can again for the secondlevel onwards you can again exploit block anchors and then have entries only for thefirst records of the blocks in the subsequent levels of the index, okay.So it is possible that we can convert the clustering index also into a entire multilevelindex and then benefit from the fast access to the data file, a data block that containsthe record whose value is being given. In fact in this case, there will be a bunch ofrecords, which you will have to retrieve from that particular block or the consecutiveblock.It is possible that a certain value you know in the ordering starts at one block and thenthere will be a few more few records in the subsequent block also in which case youstart from that data block and then access successive blocks in order to get hold of allthe records, okay. So is this idea of clustering index clear? It is very similar to theprimary index except that we have the index entries will have distinct values of theordering field, okay. Now let us move on to secondary indexes. The secondary indexes are actually veryuseful. If you do not have secondary index for certain data files you know there willsee if you do not have primary index you can still do binary search on the ordered fileand then get hold of the record in 14 block access for example in that example that wehave seen, okay.But supposing you have been given the value of some field that is not the orderingfield of that particular data file. Then what else what is the option you have? Youhave to basically do a file scan where files can read all the data blocks of thatparticular data file and get hold of the records that have these matching values. So thealternate option is so let us see in that situation, what can we do?So this index is built on some field, which is the obviously it is a non-ordering field ofa data file. Okay, if you take a non-ordering field there are again two possibilitieshere. That field could be a key, could be a key. For example, in a data file, you haveordered the data file of student records on roll number values, okay. And somewherethere, there is a field which says it is a Aadhaar number of the student.Aadhaar number obviously is a key, okay. But you have ordered the data file onincreasing values of the roll number. And so the data records will not be there in theincreasing order of the Aadhaar number obviously, okay. So the Aadhaar number is asecondary key. We call them as secondary key. Remember, when we are definingkeys, we call them one of them is chosen as the primary key.The reason why we choose a primary key is that we will organize the primary indexfor the data file using that key. And the other one is called the secondary key. So nowwe are looking at how to construct the secondary indexes, okay. So that non-orderingfield could be a key in which case we would have called them as secondary keys.And so the index entry in this case will be the value of that particular non-orderingfield V i and the pointer to the record with V i as the NOF value. Take Aadhaarnumber for example, Aadhaar number here pointed to the student record with thatAadhaar number. It will be exactly one. Now here we are first time we are usingpointers to the records ordering the pointer to the record.We have been always having block pointers, okay. A pointer to the record is nothingbut a block pointer plus some offset value, which will give you what is the position ofthat particular record within the block that is all. So it is some small couple of bytesyou know which will give you an integer telling you the position of the record in yourin the block.So pointer of the so you can so that is how the index entries for the for this case, if thenon-ordering field is also a key, then we will have this kind of index entries. Now theother case is that the field on which we are constructing this index is not a key, okay.Like for example, you want to have fast access of the names of the students. Namestypically are not keys.There may be students which have same name values. So in this case, we haveactually a couple of options as to how to organize the index. First thing is we canorganize the index entry like this. The value of that particular field. And then pointersto the records which have that value. There will be multiple records which have thatparticular value, collect together all those pointers, and then put them in this recorditself.Now the moment you do this, this particular record, actually is not a fixed lengthrecord. Because you do not know how many number of records will have this value Vi. And so how many pointers are going to be there in this field you do not know,right? So in some index entries there may be 5 pointers. In some index entries theymay be 10 pointers, okay. So you will get hold, you will see that this particular recordis a variable length record, okay.The other option is do one more level of index indirection. What you do is to kind ofmake this index entry as a fixed length record. What you can do is to keep the firstfield as it is. In the second field instead of putting all these pointers, put a pointer to ablock and in that block keep all these pointers, okay. You have a bunch of pointers 5or 10 or whatever, record pointers.So put all of them in one block, put all of them in one block and put a pointer to thatblock here. So pointed so a block that has pointers to the records which have thisparticular value, okay. The advantage of this option two is that the index entrybecomes fixed length. So this is usually a good option because if you have fixedlength records it is easier to handle them, okay.But the downside of this is that there is one more block access that is required becauseyou will only get a pointer to the block that has the pointers to the data records. Soyou have to first access that block and then you will get hold of your pointers to theactual data records, okay. Is that clear? Is there some issue with this? Okay, so theseare the two options that we have while constructing secondary indexes.