Slashdot Mirror


Large File Problems in Modern Unices

david-currie writes "Freshmeat is running an article that talks about the problems with the support for large files under some operating systems, and possible ways of dealing with these problems. It's an interesting look into some of the kinds of less obvious problems that distro-compilers have to face."

27 of 290 comments (clear)

  1. Re:Why large files by mr.henry · · Score: 3, Funny

    Who needs more than 512k of RAM??

  2. Not really that groundbreaking... by CoolVibe · · Score: 4, Interesting

    The problem is nonexistant in the BSD's, which use the large file (64 bit) versions anyway. And that you have to use a certain -D flag if your OS (like Linux) doesn't use the 64 bit versions. Whoopdiedoo. Not so hard. Recompile and be happy.

  3. Re:Why large files by voodoopriestess · · Score: 3, Informative

    Databases, Movie files, Backup files (think dumps to tapes). Animations, 3D modelling.... Lots of things need a > 2GB file size. Iain

    --
    ---- "I would be careful in separating your weirdness, a good quirky quantum weirdness, from the disturbed weirdnes
  4. Re:Why large files by Big+Mark · · Score: 5, Insightful

    Video. Raw, uncompressed, high-quality video with a sound channel is fucking HUGE. Look how big DivX files are, and they're compressed many, many times over.

    And compressing video on-the-fly isn't feasible if you're going to be tweaking with it, so that's why people use raw video.

    -Mark

  5. Re:Why large files by Anonymous Coward · · Score: 5, Interesting

    Real analytical work can easily produce files this large. Output for analyses of structures with more than half a million elements and several million degrees of freedom can EASILY produce output of over two gigs. Yes, these results can and should be split, but sometimes it makes sense to keep them together as a matter of convenience. Plus, there IS a small performance hit when dealing with multiple files on most of the major FEA packages.

  6. Re:Why large files by hbackert · · Score: 4, Informative

    vmware uses files as virtual disks. 2GB would be a really, really small disk. UML does the same, using the loop device feature of Linux. Again, a filesystem in a file. Again, 2GB is not much. Simulating 20GB would need 10 files.

    Feels like 64kbyte segments somehow...and I really don't want to have those back.

  7. data warehouse, and any database for that matter by CrudPuppy · · Score: 5, Insightful

    my data warehouse at work is 600GB and grows at a rate of 4GB per day.

    the production database that drives the sites is like 100GB

    welcome to last week. 2GB is tiny.

    --
    A year spent in artificial intelligence is enough to make one believe in God.
  8. Its funny how some lamers dont listen... by cheekyboy · · Score: 3, Insightful

    I said this to some unix 'so called experts' in 95, and they said, oh why why do you need >2gig

    I can just laugh at them now...

    --
    Liberty freedom are no1, not dicks in suits.
  9. 640 K ought to be enough for anybody by cyber_rigger · · Score: 3, Funny

    --Bill Gates

  10. It will happen with time_t, too by wowbagger · · Score: 5, Informative

    We are seeing problems with off_t growing from 32 to 64 bits. We are also going to see this when we start going to a 64 bit time_t, as well (albeit not as badly - off_t is probably used more than time_t is.)

    However, the pain is coming - remember we have only about 35 years before a 64 bit time_t is a MUST.

    I'd like to see the major distro venders just "suck it up" and say "off_t and time_t are 64 bits. Get over it."

    Sure, it will cause a great deal of disruption. So did the move from aout to elf, the move from libc to glibc, etc.

    Let's just get it over with.

  11. A woman's perspective . . . by pariahdecss · · Score: 5, Funny

    So my wife says to me, "Honey, do I look fat in this filesystem ?"
    I replied, "Sweetie, I married you for your trust fund not your cluster size."

  12. Re:Why large files by CoolVibe · · Score: 5, Interesting
    raw video can easily exceed 2 GB in size. Why raw video? Because (like others said) it's easier to edit. Then you encode to MPEG2, which will shrink the size somewhat (usually still bigger than 2 GB, ever dumped a DVD to disk?), so it'll be "small" enough to burn onto a DVD or somesuch. Oh, editing 3 hours of raw wave data also chews away at the disk size. Also, since you need to READ the data from the media to see if it looks nice, you need to have support for those big files as well. Right, now why don't we need files bigger than 2 GB again? Well?

    Oh, you're still not convinced, well see it this way: when in the future will you ever need to burn a DVD?

    Well? A typical one sided DVD-R holds around 4 GB of data (somewhat more), if you use both sides, you can get more than 8 GB of data on it. That's way bigger than 2 GB, no? Now, how big must your image be before you burn it on there? well?

    Right...

  13. Re:Wrong point of view. by KDan · · Score: 4, Insightful

    Two words:

    Video Editing

    Daniel

    --
    Carpe Diem
  14. Funny...in AIX... by cshuttle · · Score: 4, Informative

    We don't have this problem-- 4 petabyte maximum file size 1 terabyte tested at present http://www-1.ibm.com/servers/aix/os/51spec.html

  15. Have you ever seen some people's email? by alen · · Score: 4, Insightful

    On the Windows side many people like to save every message they send or receive to cover their ass just in case. This is very popular among US Government employees. Some people who get a lot of email can have their personal folders file grow to 2GB in a year or less. At this level MS recommends breaking it up since corruption can occur.

    1. Re:Have you ever seen some people's email? by nentwined · · Score: 5, Funny

      I agree with MS on this one. government employees shouldn't be allowed to hold their positions for longer than a year. DOWN WITH GOVERNMENTAL CORRUPTION! ... :)

      --
      heaven
  16. Switch to gnu/hurd by Anonymous Coward · · Score: 3, Funny

    It has a nice small 1gb filesystem limit. I have partitioned my hard disk in to 64 little chunks and it runs very slowly, and unstabilly, but its completley open source and im happy.

  17. Re:Wrong point of view. by heby · · Score: 5, Funny

    "oh yes, those were the days." - misty eyed smile - "when i was young and filesizes were small. you should have seen it. today's youth is so spoiled that they don't even learn assembly language any more. i tell you, you're all going to die because of your large files, yes, die!" - madly waves his cane in the air - "2gb, that's more than anybody will ever need and you are greedy for even more! the holy bit will punish you for this, it will!" - dies of a heart attack.

  18. Re:Wrong point of view. by cvande · · Score: 5, Insightful

    In a world everything is small and manageable. Unfortunately, some databases need tables BIGGER than 2gb. Even splitting that table into multiple files still finds you with files larger than two gb. Try adding more tables? OK. Now they've grown to over 2gb and the more tables the more complicated everthing gets. I still need to back these suckers up and a backup vendor that I won't name can't help me because their software wasn't large file (for Linux) ready. So let's get into the game with this and make it the default so we don't need to worry about these problems in the future. Linux IS an enterprise solution.....(my $.02)

  19. Re:Why large files by kasperd · · Score: 3, Insightful

    The seek times alone withinr these files must be huge

    Who moded that as Insightful? Sure, if you are using a filesystem designed for floppy disks, it might not work well with 2GB files. In the old days where the metadata could fit in 5KB a linked list of diskblocks could be acceptable. But any modern filesystem uses tree structures which makes a seek faster than it would be to open another file. Such a tree isn't complicated, even the minix filesystem has it.

    If you are still using FAT... bad luck for you. AFAIK Microsoft was stupid enough to keep using linked lists in FAT32, which certainly did not improve the seek time.

    --

    Do you care about the security of your wireless mouse?
  20. The O/S should do it and do it well. by tjstork · · Score: 3, Interesting

    1) Splitting up a big file turns an elegant solution into a an inelegant nightmare.

    2) Instead of 10 different applications writing code to support splitting up an otherwise sound model, why not have 1 operating system have provisions for dealing with large files.

    3) You are going to need the bigger files with all those 32 bit wchar_t and 64 time_ts you got!

    --
    This is my sig.
  21. Re:Wrong point of view. by costas · · Score: 4, Insightful

    Maybe in your problem domain that's true. I work with retailer data mines and we've hit the 2GB file limit, oh, 4-5 yrs ago? We've been forced to partition databases causing maintainance issues, scalability issues, and the like, just because of the size of a B-tree index.

    True, it looks like the optimal solution is lower-level partitioning, rather than expanding the index to 64bits (tests showed that the latter is slower), but that still means that the practical limit of 1.5-1.7 GB per file (because you have to have some safety margin) is far too constraining. I know installations who could have 200GB files tomorrow if the tech was there (which it isn't, even with large file support).

    I am also guessing that numerical simulations and bioinformatics apps can probably produce output files (which would then need to be crunched down to something more meaningful to mere humans) in the TB range.

    Computing power will never be enough: there will always be problems that will be just feasible with today's tech that will only improve with better, faster technology.

  22. Re:Wrong point of view. by Yokaze · · Score: 4, Interesting

    I'm not a specialist on this matter, so maybe you can enlighten me, where I am wrong or misunderstood you.

    > fragmentation: large files increase to fracmentation of most file systems
    What kind of fragmentation?

    Small files lead to more internal fragmentation.
    Large files are more likely to consist of more fragments, but when splitting this data into small files, those files are fragments of the same data.

    >entropy pollution
    What kind of entropy? Are you speaking of compression algorithms?

    Compression ratios are actually better with large files than small files, because similarities between files across file-boundaries can be found. Therefor, gzip(bzip2) compresses a single large tar-file. (Simple test, try zip on many files and then zip without compression and subsequent compression on the resulting file).

    >data pollution
    How should limiting file size improve that situation? Then, people tend to store data in lot of small files. What a success. People will waste space, whether there is a file size limit or not.

    >These limits are there for very good reasons and in my opinion they are even much to big.

    Actually, they are there for historical reasons.
    And should a DB spread all its tables over thousands of files instead of having only one table in one file and mmapping this single file into memory? Should a raw video stream be fragmented into several files to circumvent a file limit?

    >[...] original K&R Unix [...] was much faster than modern systems

    Faster? In what respect?

    --
    "Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
  23. Re:BeOS Filesystem by Yokaze · · Score: 4, Informative

    Mine is bigger than yours :)

    Linux XFS: 9 exabytes

    Also supports extended attributes.

    --
    "Between strong and weak, between rich and poor [...], it is freedom which oppresses and the law which sets free"
  24. Re:Only 35 years... by Dan+Ost · · Score: 3, Informative

    For most programs, it would require little more
    than to change the typedef that defines __time_t
    in bits/types.h.

    For stupidly written programs that assume the
    size of __time_t or that use __time_t in unions,
    each will need to be addressed individually to
    make sure things still work correctly.

    --

    *sigh* back to work...
  25. Large filesystem lack more of a problem by mauriceh · · Score: 3, Interesting

    A much bigger problem is that Linux filesystems have a capacity limit of 2TB.
    Many servers now have the physical capacity of over 2TB on a filesystem storage device.
    Unfortunately this is still a very significant limitation.
    This problem is much more commonly encountered than file size limitations.

    --
    Maurice W. Hilarius Voice: (778) 347-9907
  26. The "l" in lseek() by edhall · · Score: 3, Informative

    Once upon a time (prior to 1978) there was no lseek() call in Unix. The value for the offset was 16 bits . Larger seeks were handled by using the different value for "whence" (the third argument to seek()) which causes seeks to occur in 512-byte increments. This resulted in a maximum seek of 16,777,216 bytes, with an arbitrary seek() often requiring two calls, one to get to the right 512-byte block and a second to get to the right byte within the block. (Thank goodness they haven't done any such silliness to break the 2GB barrier.)

    When Research Edition 7 Unix came out, it introduced lseek() with a 32-bit offset. 2,147,483,648 bytes should be enough for anyone, hmmm? :-).

    -Ed