Developer Shares A Recoverable Container Format That's File System Agnostic (github.com)
Long-time Slashdot reader MarcoPon writes: I created a thing: SeqBox. It's an archive/container format (and corresponding suite of tools) with some interesting and unique features. Basically, an SBX file is composed of a series of sector-sized blocks, each with a small header containing a recognizable signature, an integrity check, info about the file it belongs to, and a sequence number. The result of this encoding is the ability to recover an SBX container even if the file system is corrupted, completely lost or just unknown, no matter how fragmented the file is.
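The block layout described in the summary can be sketched roughly as follows. This is a minimal illustration, not SeqBox's actual code: the 16-byte header (3-byte signature, a version byte, 16-bit CRC, 48-bit UID, 32-bit sequence number) is pieced together from the field sizes mentioned elsewhere in this thread, while the 512-byte block size, the zero padding, the CRC start value and the names make_block and parse_block are assumptions made for the example.

```python
import binascii
import struct

BLOCK_SIZE = 512                      # assumed sector size
MAGIC = b"SBx"                        # 3-byte signature mentioned in the thread
VERSION = 1                           # hypothetical version byte
HEADER = struct.Struct(">3sBH6sI")    # magic, version, CRC-16, 48-bit UID, 32-bit sequence
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size   # 496 bytes of file data per block


def make_block(uid: bytes, seq: int, payload: bytes) -> bytes:
    """Build one sector-sized block: 16-byte header plus zero-padded payload."""
    if len(uid) != 6:
        raise ValueError("UID must be exactly 6 bytes (48 bits)")
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload does not fit in one block")
    # The CRC covers everything after the checksum field: UID, sequence, data.
    body = uid + struct.pack(">I", seq) + payload.ljust(PAYLOAD_SIZE, b"\x00")
    crc = binascii.crc_hqx(body, 0)
    return MAGIC + bytes([VERSION]) + struct.pack(">H", crc) + body


def parse_block(block: bytes):
    """Return (uid, seq, payload) for a valid block, or None otherwise."""
    if len(block) != BLOCK_SIZE:
        return None
    magic, _version, crc, uid, seq = HEADER.unpack_from(block)
    if magic != MAGIC:
        return None
    if binascii.crc_hqx(block[6:], 0) != crc:
        return None                   # signature matched but contents are damaged
    return uid, seq, block[HEADER.size:]
```

Because every block carries its own identity and position, a block is meaningful on its own, whatever happens to the file system around it. The padding means the original file length has to be recorded somewhere else, for example in a dedicated metadata block.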
Thanks, looks interesting. I can see some applications for use in long term storage... it's better to get some data back rather than lose it all.
The Geek in Black
I know my BCD's (when I'm Sober)
That's an interesting property, but what's the use case?
How often does your filesystem get corrupt and instead of restoring from backups, you curse the fragmented tar file that can't be reassembled?
How practical is it to keep files in an sbx container rather than extracting them? Can apps read files inside an sbx container?
...but this is better than a backup, how, exactly?
Great question. We just ordered five new SANs because we can't compact disk images. I doubt it can, but if it could, we would have gotten raises instead of a 2% pay cut.
1) Obvious troll.
2) Nowhere is compression stated.
3) If you can compress your encrypted file, then your encryption isn't worth anything.
There's no way it can. LUKS is great but it wastes tons of disk space on vms.
What if your file system and/or hardware uses a different sector size? Didn't those change size over the last decades?
#DeleteFacebook
Because it never occurred to you to compress before encrypting?
tar cJf - dir | gpg -c -o archive.tar.xz.gpg
Compressed and encrypted archive, plain and simple.
Wait, so your job gives you impossible tasks, and then docks your pay when you can't do it? Get a new job!
And this is how you get an Internet on a Disk.
Doesn't GPG already compress the input data for improved input cleartext?
Ezekiel 23:20
I did a quick read of the code and see that it relies on a magic cookie in the first four bytes of every physical sector to identify a block. This may not work for files small enough to fit entirely within the MFT on NTFS since that data isn't guaranteed to be aligned on a physical sector. There are other filesystems that store small file segments in the metadata structures as well.
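The alignment concern can be illustrated with a small sketch. It assumes a scanner that only inspects offsets that are multiples of the sector size, and contrasts that with a slower byte-wise search for the signature. The function name find_candidates, the 512-byte sector size and the offset 56 used in the usage note are all made up for the example and do not reflect any particular NTFS layout.

```python
SECTOR = 512          # assumed sector size
MAGIC = b"SBx"        # 3-byte signature mentioned in the thread


def find_candidates(raw: bytes, magic: bytes = MAGIC) -> list:
    """Return every byte offset where the signature appears, aligned or not."""
    offsets = []
    pos = raw.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = raw.find(magic, pos + 1)
    return offsets


def aligned_only(offsets: list, sector: int = SECTOR) -> list:
    """Keep only the hits a sector-aligned scanner would have noticed."""
    return [off for off in offsets if off % sector == 0]
```

For an image with one block embedded at byte 56 and another at byte 1024, find_candidates reports both offsets, while aligned_only keeps just 1024. A hit from the byte-wise search is only a candidate; it would still need to pass the block's integrity check before being trusted.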
Therefore if everything uses this container then hard drives would be uncorruptable? Sounds too good to be true.
You solved a problem I've never had in 25 years of using computers and losing data. Congratulations to you.
Or I could just use ZFS and set copies = 3. Which begs the question, how well will your solution fail when data is striped across drives and one of the drives is unrecoverable, or the uC in an MMC refuses to let you read any of the flash and not just the FAT table.
Take your freshman school project and get the fuck gone, man. I hate kids today.
There's no way it can. LUKS is great but it wastes tons of disk space on vms.
It can! Just turn on discard (and have the system inside issue trim commands). This does have an impact on encryption, though, which might or might not be acceptable for you: it is possible to tell used from unused disk space, which leaks information about usage patterns inside the VM.
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
This. But I would guess they can't support that since libvirt doesn't already.
I didn't say compression! Nice attempt at trolling. I said compact. Very often virtual machine images can be compacted to be less than 80% of their previous size. If you have a vm that uses a lot of disk space, and for example, you copy those logs off, then they can be greatly compacted.
Could the identifying signature in each sector be obfuscated with user generated encryption? This could prove useful as part of a reversible dead man switch for a filesystem where the hidden container could later be recovered. Also useful possibly for hidden encrypted containers such as ones used by truecrypt/veracrypt.
He said compact.
I use write only memory for all my backup needs. I don't have to worry about any sector size b.s.
He said compact, not compress.
Since the disk image is encrypted, it doesn't compress well. Compacting is very different from compressing.
You obviously tried to keep the per-block header small to minimize overhead. But that has led to some questionable decisions that may make this format less useful than it could be. Firstly, at 48 bits, the UID is a bit short. If UIDs are chosen randomly and with even distribution, there's a 1 in 1000 chance of a duplicate UID with just 750,000 files. That might seem low enough, but I think it's cutting it a bit close. Secondly, the block sequence number is a 32-bit value, so 4 billion blocks in a file max. With this format, files are limited to 2TB. Thirdly, the 2-byte checksum is too small. No two ways about it. Fourthly, you wasted 3 bytes on a "magic" string. These would have been put to better use in the checksum. You could tell an SBx block from other blocks by making the verification calculations.
I think you should redo this with a header twice the size and sequencing that relies on a hash of the previous block and a small block index (actual block index modulo a small number). That would remove the file size limit and make it much more robust against collisions.
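The figures in the comment above can be checked with a few lines of arithmetic. This is only a sketch: it uses the standard birthday-bound approximation for the UID collision chance, and it assumes 512-byte blocks with a 16-byte header. The false-positive estimate also assumes that sectors of unrelated data behave like uniformly random bytes, which real disks only approximate.

```python
import math


def collision_probability(n_files: int, uid_bits: int) -> float:
    """Birthday-bound approximation: chance that any two random UIDs collide."""
    space = 2 ** uid_bits
    return -math.expm1(-n_files * (n_files - 1) / (2 * space))


def expected_false_positives(disk_bytes: int, check_bits: int, sector: int = 512) -> float:
    """Unrelated sectors expected to pass a check_bits-wide check by chance."""
    return (disk_bytes // sector) / 2 ** check_bits


p_duplicate = collision_probability(750_000, 48)   # about 0.001, i.e. 1 in 1000
max_container = 2 ** 32 * 512                      # 32-bit sequence number: 2 TiB of blocks
max_payload = 2 ** 32 * (512 - 16)                 # file data, minus the assumed header

# Scanning a 1 TB disk while relying on the checksum alone:
with_16_bits = expected_false_positives(10 ** 12, 16)   # tens of thousands of bogus hits
with_40_bits = expected_false_positives(10 ** 12, 40)   # far less than one
```

Under these assumptions, dropping the signature only works out if its bytes really are folded into a wider check, as the comment proposes; with the 16-bit checksum by itself, a full-disk scan would turn up a large number of false matches.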
I mean the chances of the filesystem being corrupted without the file itself also being corrupted seem small to none to me.
Unlike HDD controllers, SSD controllers do wear-leveling, so there is no guarantee that your data will be written as a contiguous block of memory (regardless of what the filesystem says), only that it will be in 4096-byte blocks. Recovering deleted data from an SSD is no simple task because it means you need to know or guess the controller's wear-leveling behavior in order to go back and find the order of previously written data. With this you would be able to just read the raw memory even after the controller has been reset and still be able to recover the data. I think it would be a nice option to have a filesystem be able to encode user files in something like this highly recoverable format. The only real problem is that the file has to be completely rewritten even if you only modify part of it, in order to differentiate the new version from the old version.
Anons need not reply. Questions end with a question mark.
It seems to me this would be a lot more useful if it directly incorporated forward error correction.
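As a rough idea of what the simplest form of forward error correction looks like, here is a sketch of a single XOR parity block per group of data blocks. It is not something SeqBox is described as doing in this thread. It can rebuild exactly one lost block per group, and a practical design would more likely use a stronger erasure code such as Reed-Solomon. The function names are invented for the example, and all blocks in a group are assumed to have the same length.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_block(blocks: list) -> bytes:
    """XOR of all data blocks in the group; stored alongside them."""
    return reduce(xor_bytes, blocks)


def rebuild_missing(blocks: list, parity: bytes) -> list:
    """Restore the single None entry in blocks using the parity block."""
    missing = [i for i, blk in enumerate(blocks) if blk is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one lost block per group")
    present = [blk for blk in blocks if blk is not None]
    # XORing the parity with every surviving block leaves the lost one.
    restored = reduce(xor_bytes, present, parity)
    repaired = list(blocks)
    repaired[missing[0]] = restored
    return repaired
```

The cost is one extra block per group, so a group size of 16 would add about 6% overhead in exchange for surviving one bad sector in each group.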
Compact for vms has a very specific meaning.
It can't compress blocks that have previously encrypted data. That's the problem.
There is some confusion as to what this is actually doing.
Most filesystems use special structures to store the name and location of your files on the drive: directories, cluster bitmaps, and so on. The reason it's difficult at best to recover files from a hard drive when parts of the filesystem have been damaged is that it's hard to identify where on the drive the files are. Apart from those filesystem structures, nothing else records what is stored where. If you lose the directory, it's hard to tell one file's data from another's on your hard drive.
That is where SBX comes in. What it does is make sure that every physical sector that stores data for a particular file is labelled with a number that identifies that file, and a sequence number so you can reconstruct what order that piece is in the original file. Really, for the amount of overhead, something like that should be embedded into every filesystem. Basically a distributed backup of all the filesystem metadata.
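The recovery idea described above can be sketched as a scan over raw sectors that ignores the file system entirely. This is an illustration under stated assumptions, not the project's real tooling: the 16-byte header layout follows the field sizes quoted in this thread, the block size is taken to be 512 bytes, and storing the file length in block 0 is a hypothetical convention chosen so the padding can be trimmed. The names encode and recover are invented for the example.

```python
import binascii
import struct
from collections import defaultdict

BLOCK = 512                           # assumed sector size
MAGIC = b"SBx\x01"                    # assumed 3-byte signature plus a version byte
HDR = struct.Struct(">4sH6sI")        # signature, CRC-16, 48-bit UID, 32-bit sequence
DATA = BLOCK - HDR.size               # 496 payload bytes per block


def encode(uid: bytes, data: bytes) -> list:
    """Split data into blocks; block 0 (hypothetical) holds the file length."""
    chunks = [struct.pack(">Q", len(data))]
    chunks += [data[i:i + DATA] for i in range(0, len(data), DATA)]
    blocks = []
    for seq, chunk in enumerate(chunks):
        body = uid + struct.pack(">I", seq) + chunk.ljust(DATA, b"\x00")
        blocks.append(MAGIC + struct.pack(">H", binascii.crc_hqx(body, 0)) + body)
    return blocks


def recover(image: bytes) -> dict:
    """Scan a raw image sector by sector and rebuild every complete file."""
    found = defaultdict(dict)         # uid -> {sequence number: payload}
    for off in range(0, len(image) - BLOCK + 1, BLOCK):
        sector = image[off:off + BLOCK]
        magic, crc, uid, seq = HDR.unpack_from(sector)
        if magic != MAGIC or binascii.crc_hqx(sector[6:], 0) != crc:
            continue                  # unrelated data, or a damaged block
        found[uid][seq] = sector[HDR.size:]
    files = {}
    for uid, parts in found.items():
        if 0 not in parts:
            continue                  # length unknown without block 0
        (length,) = struct.unpack_from(">Q", parts[0])
        count = -(-length // DATA)    # number of data blocks, rounded up
        if any(seq not in parts for seq in range(1, count + 1)):
            continue                  # at least one data block is missing
        data = b"".join(parts[seq] for seq in range(1, count + 1))
        files[uid] = data[:length]
    return files
```

Feeding recover the blocks of several files, shuffled together with unrelated sectors, returns each file intact keyed by its UID, which is the fragmentation-proof property the format is built around. A file with a lost or damaged block is skipped here, which matches the point that this does not protect against bad sectors inside the file itself.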
Some people are criticizing this for solving non-problems. I disagree. While it isn't the solution to global warming, it is both simple and clever (and will thus suffer from a lot of people who will disparage it out of a "well anyone could have thought of that" attitude). It won't save you from a full hardware crash. It won't save you from physically bad sectors in that file. What it will save you from is accidental deletion and loss of the filesystem's metadata structures. How often does this happen? Twice to me, from failures of a whole-disc-encryption system driver.
I wouldn't use this for every file, but for critical ones, sure. Why not. The problem is, where it is most useful, for very volatile files that change a lot (databases etc) between backups, is where it can't really be used until/unless different applications start supporting it. So it unfortunately has limited use in the places where it would really help the most. Like I said above, this sort of thing really needs to get rolled into a filesystem. The amount of overhead it costs is meaningless in today's storage environment.
If the disk image is encrypted then it doesn't compress well. Plus the underlying file system doesn't know about unused blocks. That is why compacting LUKS encrypted disk images just doesn't work. Compression doesn't either due to the random (as is expected of any good encryption algorithm) data in the disk image. We need a file format for vms that supports encryption and supports compaction.
If you can meaningfully compact *anything* that's encrypted, the encryption was improperly implemented. You *always* want to compact files prior to encryption, and a well-encrypted compressed file should be statistically indistinguishable from random noise.
Zero your freespace? Then your unused sectors can at least be compressed.
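The difference between zeroed and encrypted space is easy to see with a general-purpose compressor. This sketch uses os.urandom as a stand-in for well-encrypted data, which is an assumption (good ciphertext should look statistically similar), and the helper name compressed_ratio is invented for the example.

```python
import os
import zlib


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 6)) / len(data)


size = 1 << 20                                  # 1 MiB sample
zeroed_ratio = compressed_ratio(bytes(size))    # zero-filled free space
noise_ratio = compressed_ratio(os.urandom(size))  # stand-in for encrypted blocks
```

Zero-filled space shrinks to a tiny fraction of its size, while the random sample does not shrink at all. That is why zeroing only helps when the zeros are visible to whatever does the compressing, that is, written outside the encryption layer rather than through it.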
Compression before encryption often results in a compression oracle or other problems, because the length of the ciphertext leaks information about the plaintext. If you're designing a system that is supposed to be secure, avoid compression until you fully understand the issues. Avoid compressing and encrypting chosen plaintext at all - you'll never be sure you understand all of the issues with that.
Hash the password AND block number through a key-stretching routine to get the encryption key. It is important to avoid using the same key for all blocks. If different blocks are XORed with the same key, I can still see your penguin:
https://blog.filippo.io/the-ec...
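The per-block key suggestion can be sketched with PBKDF2 from Python's standard library. Treat this as a toy: the XOR "cipher" is there only to show why key reuse leaks structure, the iteration count of 1000 is a deliberately low placeholder, and the salt handling and the function names are assumptions. A real design would use an authenticated cipher with a unique nonce per block.

```python
import hashlib


def block_key(password: bytes, salt: bytes, block_no: int, length: int = 32) -> bytes:
    """Derive a key for one block by stretching the password with the block number."""
    block_salt = salt + block_no.to_bytes(8, "big")
    return hashlib.pbkdf2_hmac("sha256", password, block_salt, 1000, dklen=length)


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


password, salt = b"correct horse", b"example-salt"
p1, p2 = b"A" * 32, b"B" * 32                 # two plaintext blocks

# Same key for both blocks: XORing the ciphertexts cancels the key out
# and exposes the XOR of the plaintexts.
shared = block_key(password, salt, 0)
leak = xor(xor(p1, shared), xor(p2, shared))

# A distinct key per block: the same trick no longer cancels anything.
k0, k1 = block_key(password, salt, 0), block_key(password, salt, 1)
no_leak = xor(xor(p1, k0), xor(p2, k1))
```

Here leak equals the XOR of p1 and p2 even though the key was never revealed, while no_leak does not, which is the pattern-leakage problem the comment warns about.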
Could this also be used when the file contents are deliberately separated? E.g., distribute the file pieces (sectors?) to different audiences / storage locations, such that one has to get cooperation from all piece-holders to retrieve the net results? E.g., nuclear launch codes, and other less dramatic scenarios.
The Lisa and early Macintosh drives supported 532-byte sectors. The extra bytes were used for "tags" - basically a less-sophisticated version of this scheme and without the "block 0."
For details on why "tags" were eliminated, see Macintosh Technote #94, "Tags," by Bryan Stearns, November 15, 1986.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Been struggling to understand this for the last 3 decades (not full time) ... :/
I know about that, but what does that have to do with the fact that there's apparently already an extra compression step in the pipeline above?
Ezekiel 23:20
You've never worked in government, have you?
My comment doesn't directly relate to having the *two* compression steps. I probably should have replied to the same person you replied to.
Yes I have
I want storage that is File System Atheist!
(And would that be like Write Only Memory?)
Tracy Johnson
Old fashioned text games hosted below:
http://empire.openmpe.com/
BT
Creimer is that you?
xD
I wrote three pieces of software:
Strongbox
Throttlebox
Clonebox
Then you chose Seqbox. :)
That's a real joke...
Have you tried modeling the randomicity after encryption and regenerating with an according RNG? If it is lossy graphics algorithms, then maybe it would be enough, for a given encryption algorithm. No idea what that algorithm may be, though.
I've starred SeqBox on GitHub and I think it's a really good idea, but I suggest you rewrite it in C, because:
1) C programs have much higher performance and a smaller memory footprint
2) More portable (C can be compiled virtually anywhere)
3) It could be used from languages other than Python (C libraries can be used from C++ and can be easily made to work with Java and C#)
4) The above points would make it much more attractive for use on servers and embedded systems.