
Module 1: File Organisation


So let us start. In the last lecture, I was talking to you about the structure of disks, the standard option for storing data in a non-volatile way. What we observed is that disks are a good option because they give us random access to blocks. A block is the unit of data transfer between the disk and main memory, and disks are random-access devices: given a block address, the disk controller can bring that block of data, usually somewhere between half a kilobyte and about 8 kilobytes, into the memory buffers.

We have also seen that the access time for a block involves the so-called seek time, needed to position the read/write head over the cylinder from which we are reading, and then the rotational delay for reading off the particular block from the surface as it spins.

Now, when there is a file, the file has records, and we need to store this file onto the disk. What we do is pack these records into block-sized units, which gives us a sequence of what we call file blocks. These file blocks then need to be mapped onto disk blocks.

So we were looking at the mapping policies that are possible here. One of them is what is called contiguous allocation, where we map consecutive file blocks onto consecutive disk blocks. The advantage is that scanning the file can be done very fast, because the file blocks all lie on consecutive disk blocks.

We have also seen the notion of consecutiveness: from an access-time point of view, the blocks that lie on one particular cylinder are said to be close to each other.
Next closest is the adjacent cylinder, and so on.

The downside of contiguous allocation shows up when you try to include a new block in the file. Since our policy stores consecutive file blocks on consecutive disk blocks, the new block typically has to go somewhere in the middle. If you insert roughly in the middle, that involves reading off half the blocks of the data file and writing them back, which is a very costly affair: reading and writing so many blocks is a lot of cost.

At the other extreme of the spectrum, we have linked allocation, where each file block is stored on some arbitrary disk block. In order to find where the next file block is lying, each disk block carries a pointer to the disk block that contains the next file block in the sequence. File expansion is very easy in this case: all you have to do is a standard linked-list insertion.

So in some sense, the file, which is logically an array of blocks, is made into a singly linked list on the disk. The pointer here is a disk pointer, which is nothing but a disk address: it consists of the surface number, cylinder number and block number, and takes about 10 bytes or so.
This disk address is stored in each disk block, so that you can find where the next file block lies on the disk. Expansion is easy because inserting a new element into a linked list is easy: you do some pointer manipulation and insert the block.

The downside is that if you have to access the first hundred blocks of the file for some reason, it involves chasing a hundred disk pointers, and each of them might again incur seek time and rotational delay. If your file blocks were on consecutive disk blocks, you would hope the accesses involve less seek time because the blocks are on the same cylinder. So scanning can become a pretty costly operation.

You can also think of a mixed allocation, where we say: take the first 25 blocks of the file sequence and put them on consecutive disk blocks, then put the next 25 in some other contiguous place, with a pointer from the first chunk to the next. With a policy like that, using contiguous chunks of disk blocks for parts of the file connected by pointers, inserting a new file block becomes easier: you do not have to read and write large portions of the disk, only the disk blocks involved in the particular segment that receives the new block. So mixed allocation is also possible.

Now, this is how a sequence of records is stored. In our course, the files we are dealing with are essentially sequences of records. These records need not be directly the tuples of the relations that we are talking about.
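The linked allocation just described can be sketched in a few lines of Python. This is only an illustrative model, not code from any real system: the dictionary stands in for the disk, keys stand in for disk addresses, and every pointer chased corresponds to one disk access with its seek time and rotational delay.

```python
# Sketch of linked allocation: each disk block stores its data plus the
# disk address of the block holding the next file block (or None at the end).

class DiskBlock:
    def __init__(self, data, next_addr=None):
        self.data = data            # the file block's contents
        self.next_addr = next_addr  # disk address of the next file block

def scan_file(disk, first_addr, n):
    """Read the first n file blocks; each step is one disk access."""
    blocks, addr = [], first_addr
    while addr is not None and len(blocks) < n:
        block = disk[addr]          # one pointer chased = one block read
        blocks.append(block.data)
        addr = block.next_addr
    return blocks

def insert_after(disk, addr, new_addr, data):
    """File expansion is just singly-linked-list insertion on the disk."""
    prev = disk[addr]
    disk[new_addr] = DiskBlock(data, prev.next_addr)
    prev.next_addr = new_addr

# File blocks A -> B -> C scattered over disk addresses 0, 7, 3:
disk = {0: DiskBlock("A", 7), 7: DiskBlock("B", 3), 3: DiskBlock("C")}
insert_after(disk, 7, 9, "B2")      # cheap insertion, no mass movement
print(scan_file(disk, 0, 4))        # ['A', 'B', 'B2', 'C']
```

Note how scanning four file blocks chases four disk pointers, while insertion touched only two blocks; that is exactly the trade-off against contiguous allocation.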
These records could contain some extra information, or they could be longer; a record could even hold information about multiple tuples.

Recall that long back we discussed what is called physical data independence: the ability to change the physical organization without affecting the logical view of the data. All this is possible only if we have appropriate mapping information as to how the relations are realized as actual files of records. That information, namely which file implements which relation and how that file is configured (is it one file or two, what is the tuple sequence), will be kept in some metadata store. So before we can actually use disk files for data access, the file headers are read first; from them we get the information about what is in the file and whether there are additional access structures.

We are also going to discuss what are called disk data structures. We have seen the usual in-memory data structures; now we will see data structures that lie entirely on the disk. You need to know whether such a structure is available for faster access to a given file, and then make use of it.

First, let us look at the various operations that we may have to do on files from the database perspective. Obviously, inserting a new record: a bunch of records is already there, and you want to insert a new record into the file.
Depending on how the particular file is organized, which we are going to discuss shortly, this may involve finding the appropriate location, the slot into which this new record has to go. This will become clear when we discuss file organizations.

Then, deletion of a record. Deletion involves first locating the record: where exactly is it, in which block? That might involve some kind of search, and then deleting the record from that block. It may also involve movement of other records, depending on the allocation policy in use.

Updating a record field, or fields, is in some sense equivalent to deleting the old record and inserting the updated one.

Then, search for a record: given values of some fields, whether a key field or non-key fields, you have to locate the record or records that have those specific values. This is a very frequently required operation on files.

And then range search: we may be given two values for a key or non-key field, a minimum and a maximum, and we are asked to retrieve all records whose values lie between them. Let us call that a range search.

Now, how efficiently we can carry out these operations depends crucially on the organization of the file, and also on whether any indexes are available for it. Indexes are additional access structures that we are going to discuss later. The organization of the file determines how records are actually placed inside it, and depending on that placement, some of these operations will be easy and some will be tough. We will see examples of that.
That brings us to what we call the primary file organization. The primary file organization is the logical policy, the method, that we use for placing records into file blocks. How does it matter? Consider, for example, a student file. We might want to organize it as a sorted file, in increasing order of the roll number values. We know that roll number is a key for the student records, so we might adopt the logical policy of storing the student records in increasing order of roll number. This is an example of a file organization: a sorted file, because the records are in increasing order of the roll number values.

In general, the goal of a file organization is to ensure that the operations performed frequently on this particular file execute fast.

Suppose we have organized the file as above. Then roll-number-based access will be pretty fast, because we can exploit the sorted nature of the file and quickly get hold of the record with a given roll number. But what if you give a name and ask for the records that match that particular student name? We have no means other than scanning the file: start from the first record and keep searching for that name, to find all the student records with matching values in the name field.

Now suppose you say that access to student records both by roll number and by name is frequent, and you would like the organization to make both of them fast. These are conflicting requirements: if you organize by roll number values and choose to make roll-number access fast, then access based on name might be slow.
So what we typically do is first choose one primary file organization that makes one of these accesses fast, say access by roll number. Then, to make the other access fast, say access by name, we construct an additional access structure: a separate data structure through which we can search on the name and get hold of either the roll number or a pointer to the particular block in which those student records are. Such structures are called index structures, and we will see their details later.

Let us first look at the various options we have for file organizations. We will discuss what are called heap files, sorted files and hashed files.

A heap file is a file that does not really have much of an organization. Records are simply appended to the file as they are inserted; it is just an append-only file of blocks. This is the simplest of organizations; you do not have to think about it at all. When you get a new record, you add it at the end of the existing records. That means you can keep a pointer to the last file block, read that block, add the new record and write it back. So inserting a new record is done pretty quickly: read the last file block, append the record, write the block back.

But locating a record, given values of any of the attributes, would be a nightmare, because you just have to scan the entire file. You have no other means of getting hold of the records.
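A heap file is simple enough to sketch directly. This is a toy in-memory model with illustrative names (the class, the blocking factor of 4 records per block, and the sample records are all made up for the example); it shows the asymmetry described above: insertion touches only the last block, while any search scans every block.

```python
# Toy heap file: records are appended to fixed-capacity blocks as they arrive.

BLOCKING_FACTOR = 4  # assumed number of records per block

class HeapFile:
    def __init__(self):
        self.blocks = [[]]

    def insert(self, record):
        # Read the last block; if it is full, start a new one; append, write back.
        if len(self.blocks[-1]) == BLOCKING_FACTOR:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def search(self, predicate):
        # No organization to exploit: scan every block of the entire file.
        return [r for block in self.blocks for r in block if predicate(r)]

f = HeapFile()
for roll, name in [(3, "Asha"), (1, "Ravi"), (2, "Meena"), (5, "Ravi"), (4, "John")]:
    f.insert({"roll": roll, "name": name})

print(f.search(lambda r: r["name"] == "Ravi"))  # full scan, whatever the field
```

Insertion here is O(1) block accesses; every search, by roll number or by name, costs as many block reads as the file has blocks.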
Files of this kind are useful in some situations, but unless there are other access structures, we rarely use this organization. It can be used, for example, for organizing the log data of some system: as and when operations happen, you just want to keep a log of what happened, so you generate a log record. Most of the time you may not actually read the log; if you do need it, you can get hold of the file and search for the record. So you may be willing to pay the price of searching. But mostly we do not use the heap file organization unless we also build other access structures that help us locate records.

The next organization is a very interesting one, called the sequential file or sorted file. What we assume here is that a field of the record, called the ordering field, is chosen, and the records are kept sorted on that field. The field whose values are used for sorting the records in the data file is the ordering field, and it may or may not be a key. We call it the ordering key field in case the ordering field is also a key; otherwise it is just called an ordering field. So we store the file with the records in the order of the ordering field values, and we call such a file a sorted file or a sequential file.

Now, given a value of this ordering field, locating a record is somewhat easy. How do you do it? You know that in a sorted array you locate a value by binary search, so you can do some kind of binary search here: locating a record with value x of the ordering field can be done by binary search. But here we are doing binary search on the disk file.
So what is the assumption here? The important assumption is that the disk address of the n-th block of the file can be obtained from the file header. The file header has the disk addresses of all the blocks; then you can do binary search. If the file has 1000 blocks, you take the 500th block's address and read that block, 8 kilobytes of data or so, and look in it for the record with ordering field value x. Depending on whether the values you find there are higher or lower than x, you decide whether to move to the upper portion or the lower portion of the data file, again using the disk pointers in the file header to fetch the next file block, and continue.

If there are N disk blocks, then about log N disk accesses are required to get the right block. Since log N is a slowly rising function, this is reasonably efficient; we will later see how to improve it further.

Range search is also efficient: given two values v1 and v2, you locate v1 first and then sequentially scan the file until you exhaust all the records whose values lie between v1 and v2.

What about inserting a new record? There is a problem, because the ordering might get disturbed. You have already ordered the file on roll number, and it is sitting on the disk.
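Before turning to insertion, the binary search over file blocks described above can be sketched as follows. This is an illustrative model only: `header[i]` stands for "the i-th block address obtained from the file header", each block is a sorted list of (key, record) pairs, and each probe corresponds to reading one whole block from disk.

```python
# Binary search on a sorted file, assuming the file header can give us
# the disk address of the n-th file block (modelled as header[n]).

def find(header, key):
    lo, hi = 0, len(header) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        block = header[mid]            # one disk block access per probe
        if key < block[0][0]:          # smaller than the block's smallest key
            hi = mid - 1
        elif key > block[-1][0]:       # larger than the block's largest key
            lo = mid + 1
        else:                          # the key, if present, is in this block
            for k, rec in block:       # in-memory scan of one block: cheap
                if k == key:
                    return rec
            return None
    return None

# Three blocks, sorted on the ordering field:
blocks = [[(1, "A"), (3, "B")], [(5, "C"), (8, "D")], [(9, "E"), (12, "F")]]
print(find(blocks, 8))   # prints D, after about log2(N) block accesses
```

With 1000 blocks this needs about 10 block reads instead of up to 1000 for a scan, which is the log N behaviour mentioned above.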
If you are very unfortunate, the new record has to go into the very first block, or somewhere among the first few blocks, and that block has already reached the number of records it can hold. So the overflowing record has to go into the next block, and so on: all the following blocks get affected. That is messy, because if the insertion point is followed by, say, a hundred blocks, then all those remaining blocks have to be read and written back. This can be pretty costly.

One technique you can adopt here is to not insert the new records into the main data file at all. Instead, you insert them into an auxiliary file: keep putting whatever new records arrive into this auxiliary file. The idea is that after some time, after several days of use of the file, you take some lean time off and reorganize: insert all the records from the auxiliary file into the main data file. You merge the auxiliary file and the main data file and make the main file a sorted file again.

If you are using an auxiliary file, then for locating a record you first search the auxiliary file, because there are fewer records there and they probably fit in just a few blocks, so you can quickly check whether the requested record is there. If it is not, you go to the main data file and do the standard search.

We handle insertion this way mainly because a new record might affect the ordered nature of the file. The same is the case with deletion: if a request comes for deleting a record, that too might affect the order.
So what do you do? You can simply place a marker on that particular record saying that it has been deleted, without actually deleting it. There will be, in some sense, an empty slot there: it is marked as deleted, but not physically removed. The reason you do not delete it is that if you deleted records and tried to compact the blocks, there would again be a lot of movement among the blocks, which you want to avoid.

So both these techniques are used together: use an auxiliary file to hold inserted records while the file is in operation, and use deletion markers to mark deletions. Then periodically reorganize the whole thing: insert the auxiliary file records into the main data file, ensure that all the marked records are actually removed, and make the data file clean. How often you should reorganize depends on how often the file is used.

With this organization, access to the records on the ordering field is fast because the file is sorted. What about other fields? For other fields you basically have to scan the file. So what you typically do, if you need access on other fields, is build an index: create a separate data structure through which you can find the block address where the record is.

The other primary organization is the hash file. You have studied hashing in a data structures course, so I am not going to repeat the details of hashing; we will use the hashing idea for organizing the file. This is a very useful file organization.
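The sorted-file maintenance scheme just described, an auxiliary file for insertions, deletion markers, and periodic reorganization, can be sketched like this. All names are illustrative, and plain Python lists and sets stand in for disk blocks.

```python
# Sketch of sorted-file maintenance: insertions go to an auxiliary file,
# deletions only set a marker, and a periodic reorganization merges
# everything back into one clean sorted main file.
import bisect

class SortedFile:
    def __init__(self, records):
        self.main = sorted(records)   # main file, sorted on the ordering key
        self.aux = []                 # small unsorted auxiliary file
        self.deleted = set()          # deletion markers

    def insert(self, key):
        self.aux.append(key)          # cheap: no shifting of the main file

    def delete(self, key):
        self.deleted.add(key)         # just mark; do not move any blocks

    def find(self, key):
        if key in self.deleted:
            return False
        if key in self.aux:           # search the small auxiliary file first
            return True
        i = bisect.bisect_left(self.main, key)   # then binary-search main
        return i < len(self.main) and self.main[i] == key

    def reorganize(self):             # done periodically, in lean time
        live = (set(self.main) | set(self.aux)) - self.deleted
        self.main, self.aux, self.deleted = sorted(live), [], set()

f = SortedFile([10, 20, 30, 40])
f.insert(25)
f.delete(30)
print(f.find(25), f.find(30))   # True False, before any reorganization
f.reorganize()
print(f.main)                   # [10, 20, 25, 40]
```

Day-to-day operations stay cheap, and the expensive read-and-rewrite of the whole file happens only during the occasional reorganization.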
We use it when we really want very quick access to data records given the value of one single attribute, which is what we choose as the hashing attribute. The hashing field is the attribute on which we want quick access, and that is the attribute on which we perform hashing.

Let us see how hash files are organized, that is, how we use the idea of hashing to organize a file as one of the primary organization methods. We use a term called bucket in this context: a bucket is either a single block or a few consecutive blocks, say two or three. The data file is organized in the form of buckets, numbered 0 through M - 1. And then we use a hash function. What the hash function does is map the values in the domain of the hashing attribute into bucket numbers: you give it a value, and it gives you a bucket number.

Now, to insert a new record into such a hash file: the main data lies in the main data buckets, and in addition there are overflow buckets; we will see shortly what exactly they are for. Each of the main file buckets can potentially have an overflow chain attached to it. An overflow chain is a pointer to one of the records, that record has another pointer to the next, and so on. We will see why we need overflow chains.

Given some record, you have the value of its hashing field. Apply the hash function to that value and you get a bucket number; go to that bucket. The locations of all these buckets, their disk pointers, are stored in the file header.
For bucket number i you have the disk pointer in the header, so you get hold of that bucket and go to insert. There are two possibilities here. The bucket might be full: there is no guarantee that the hash function distributes the records evenly; the distribution might be skewed, you do not know. If there is free space in the bucket, well and good, insert the record there. Otherwise, insert the new record into the overflow chain. This works like linked-list insertion: you take a free slot from the overflow buckets, put your record there, adjust the overflow chain to point to that slot, and from that slot put a pointer to the record that was next in the chain, splicing the new record into the overflow chain.

You would probably have studied in a data structures course how to maintain a free list: the available free slots can themselves be arranged as a list. So you take one item out of that free list, splice it into the overflow chain, and put the record in that slot. That is how we handle overflows from the main data buckets.

The overflow chains of all the buckets are maintained in these overflow buckets. In the file header there will be a pointer to the overflow buckets and to the free list, so that when necessary you can borrow a slot from the free list for your record and then adjust the pointers.

Deleting a record is similar.
All you have to do is first use the hash function to locate the record that has to be deleted, and then remove it from the bucket or from the overflow chain, wherever it is. If the bucket has no overflow chain, the record is in the bucket and you can just remove it. If there is an overflow chain, there are two possibilities: the record might be in the bucket or in the overflow chain. In the latter case you follow the overflow chain, find the record, and adjust the chain so that the record is no longer part of it; that slot can then be spliced back into the free list. That is how you remove a record.

If possible, you can also bring a record from the overflow chain into the main bucket. For example, if a slot in the main bucket becomes empty because of a deletion, you might bring the first record of the overflow chain into the main bucket and cut the chain short, because having an overflow chain is a burden: you have to follow it.

Search is done as follows: given the hash field value k, compute r, the hash of k, get bucket r, and search for the record in it. The main cost is getting the bucket, that is, reading those consecutive file blocks into main memory; once they are there, the in-memory search is fast. If you are unlucky, the record is not in the bucket but in the overflow chain; then the operation will be a little costly, because it requires several more disk accesses.

You would have seen the analysis of hashing.
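The insert and search procedures above can be sketched as follows. This is a deliberately simplified model: the bucket count M, the bucket capacity, and the mod hash function are illustrative choices, and each overflow chain is modelled as a plain list rather than a linked chain of slots with a free list.

```python
# Sketch of a static hash file: M main buckets of fixed capacity, with an
# overflow chain attached to a bucket once it fills up.

M = 4                 # number of main buckets (fixed, hence "static" hashing)
CAPACITY = 2          # records that fit in one main bucket

buckets  = [[] for _ in range(M)]   # main data buckets
overflow = [[] for _ in range(M)]   # one overflow chain per bucket

def h(key):
    return key % M    # hash function applied to the hashing field

def insert(key):
    b = h(key)
    if len(buckets[b]) < CAPACITY:
        buckets[b].append(key)
    else:
        overflow[b].append(key)     # bucket full: splice into overflow chain

def search(key):
    b = h(key)                      # one bucket access most of the time...
    if key in buckets[b]:
        return "main bucket"
    if key in overflow[b]:          # ...extra disk accesses only on overflow
        return "overflow chain"
    return None

for k in [3, 7, 11, 8, 12]:        # 3, 7, 11 all hash to bucket 3 (skew)
    insert(k)
print(search(7), search(11), search(5))   # main bucket overflow chain None
```

The skewed keys show why overflow chains are needed: bucket 3 fills up, so the third colliding record lands in its overflow chain and costs extra accesses to reach.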
On average, most of the time you will get your record in the main bucket itself, and the overflow chains will be of very small length. If your hash function is chosen so that the records hash uniformly across the data buckets, and there is enough space in the main buckets, then the overflow chains will be small, if they exist at all.

So most of the time you get your record on the first access to the main file: you compute the bucket number, access that bucket, and you are most likely to find the data there. That is pretty fast, because if you have organized your file with, say, two blocks per bucket, then you are making just a couple of block accesses to get your record, irrespective of the total number of blocks in the file. This is one of the organizations that gives very quick access to records, given the value of the hashing attribute.

So we have now looked at what is called primary file organization, the logical method by which we organize records into a file. One of them is sorted files, another is hash files. This particular hash file organization makes use of what is called static hashing.