Dumping Lots of Data to Disk in Realtime?

lol by Anonymous Coward · 2005-05-14 01:08 · Score: 0

fourth ask slashdot in a row. slow day eh?

Re:lol by Anonymous Coward · 2005-05-14 01:11 · Score: 0

Errr. Are you by any chance reading http://ask.slashdot.org/ by mistake?

2-stage approach by eagl · 2005-05-14 01:11 · Score: 5, Informative

Have you considered a 2-stage approach? Stuff it to disk, and process/index it separately? A fast stream of data would let it all get recorded without loss, and then you could use whatever resources are necessary to index and search without impacting the data dump.

Cost... Are you going to go for local storage or NAS? Need SCSI and RAID or a less expensive hardware setup? Do you think gigabit ethernet will be sufficient for the transfer from the data dump hardware to the processing/indexing/search machines?

Sounds like you might want to run a test case using commodity hardware first.
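
A minimal sketch of the two-stage split described above, assuming a made-up fixed-size record (instrument id, timestamp, value) and hypothetical function names. Stage 1 only appends; stage 2 scans the dump later and builds an index, so a slow query never sits in the recording path.

```python
import struct
from collections import defaultdict

# Hypothetical record layout: uint32 instrument id, float64 timestamp, float64 value.
RECORD = struct.Struct("<Idd")

def record_stage(path, samples):
    """Stage 1: append raw samples in arrival order and do nothing else."""
    with open(path, "ab") as f:
        for instrument_id, ts, value in samples:
            f.write(RECORD.pack(instrument_id, ts, value))

def index_stage(path):
    """Stage 2: scan the dump separately and map instrument id -> byte offsets."""
    index = defaultdict(list)
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            instrument_id, _, _ = RECORD.unpack(chunk)
            index[instrument_id].append(offset)
            offset += RECORD.size
    return index

def read_at(path, offset):
    """Fetch one record back using an offset taken from the index."""
    with open(path, "rb") as f:
        f.seek(offset)
        return RECORD.unpack(f.read(RECORD.size))
```

In a real setup the index stage would probably run on another machine against a copy of the dump; here both stages share one file only to keep the sketch short.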

Re:2-stage approach by Anonymous Coward · 2005-05-14 03:05 · Score: 0

I have to second the two-stage recommendation. I have no idea how hard your real-time constraints are, but you want to be sure the queries can't mess up the recording. That's much easier to show with a two-stage design.
Re:2-stage approach by Anonymous Coward · 2005-05-15 00:05 · Score: 0

cat /dev/instruments >database.txt
I'll be contacting you regarding my exorbitant consultant fees you now owe.
Re:2-stage approach by Nutria · 2005-05-15 02:15 · Score: 1

A messaging system would work.

The front-end gets all the data, then passes it along using a file-based backing-store queueing system to the back-end that posts the data to your permanent store.

This also gives you the flexibility to let the front-end choose which back-end to send it to (usually on another machine).
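
A rough sketch of such a file-backed queue, assuming one file per message in a spool directory. The class name `SpoolQueue` and the file naming are invented for illustration and are not taken from any particular messaging product.

```python
import os

class SpoolQueue:
    """Hypothetical file-backed queue: one file per message in a spool directory."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._seq = 0  # a real system would recover this counter after a restart

    def put(self, payload: bytes):
        # Front-end side: write under a temporary name, then rename,
        # so the back-end never picks up a half-written message.
        self._seq += 1
        final = os.path.join(self.directory, f"{self._seq:012d}.msg")
        tmp = final + ".tmp"
        with open(tmp, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, final)

    def drain(self, handler):
        """Back-end side: deliver in order; delete a message only after the handler returns."""
        delivered = 0
        for name in sorted(os.listdir(self.directory)):
            if not name.endswith(".msg"):
                continue
            path = os.path.join(self.directory, name)
            with open(path, "rb") as f:
                handler(f.read())
            os.remove(path)
            delivered += 1
        return delivered
```

If the handler raises, the message file stays on disk and is retried on the next drain, which is the point of keeping the backing store on the file system.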

--
"I don't know, therefore Aliens" Wafflebox1

Suuuure. by Seumas · 2005-05-14 01:12 · Score: 4, Funny

Yeah, like it isn't obvious that this guy works for the government's TIA program and is looking for ways to maintain all of the data culled from the thousands of audio and video sensors they have planted around.

Suuuure.

Re:Suuuure. by mabhatter654 · 2005-05-15 16:50 · Score: 1

And I was just going to suggest he ask "homeland security" for advice... beat me to it!
Of course we can see how well the Govt's spyware works... of course he could have a network of "volunteers" allowing him to "monitor" their computing habits... that would be a lot of info too...

Wonderware InSQL by Dios · 2005-05-14 01:15 · Score: 4, Informative

Check out Wonderware InSQL. We update roughly 50k points every 30 seconds without loading the server much at all. Pretty nice product, also has some custom extensions to SQL built in for querying the data (e.g. cyclic, resolution, delta storage, etc.).

http://www.wonderware.com/

Of course, you'll need your data to come from an OPC/Suitelink/other supported protocol, but should work nicely for you.

- Joshua

Re:Wonderware InSQL by btlzu2 · 2005-05-14 02:15 · Score: 2, Interesting

How does archiving work? What is the performance of querying on a large table? (Hundreds of millions of rows) Can you hook into the database with any language/package you desire or proprietary tools only?

Do you actually charge a license fee PER point?

We had a need for a smaller SCADA system in our company and Wonderware could not answer these questions (except for the fee per point, which they actually charge PER POINT). This department is going with a different product.

Sorry, but be very cautious of Wonderware.

--
Zed's dead baby. Zed's dead.
Re:Wonderware InSQL by kernelistic · 2005-05-14 02:57 · Score: 1

We update 50,000 points at the bottom of every minute, archive every 2 minutes and have SQL tables that are several trillion (Yes, trillion) rows long on COTS Dell servers with MSSQL 2000 and a standard middleware approach.

Sounds to me like you're either not throwing the hardware you ought to at this project or you are looking at the wrong software.

SCADA is very versatile and powerful. Are you feeding data in mostly from local or remote RTU's?
Re:Wonderware InSQL by btlzu2 · 2005-05-14 04:36 · Score: 2, Informative

We stopped at the investigation phase. They couldn't answer simple questions and were going to charge us if we needed to add more points. Unacceptable.

SCADA is very versatile and powerful. Are you feeding data in mostly from local or remote RTU's?

You do understand that SCADA is a general term which describes a type of system, right? A SCADA system could be designed (and has been) :) that is not versatile and powerful. Sorry to be nitpicky, but I'm just trying to understand what you mean.

Anyway, we work with a much larger SCADA system vendor, which actually has the SCADA market share for our industry. Wonderware would never come close to providing the functionality we'd need in our industry and we do not want to be tied to a Microsoft platform.

Wonderware was a candidate for a smaller sub-system, but we've decided to go with another system that's working out very well--is more open for development purposes and is generally better designed. I wasn't on the smaller project, but I was on the big system project and continue to maintain and develop for it.

SCADA is a fun area to work in for geeks--loads of administration, development, design opportunities in various technologies including, but not limited to, LANs, WANs, telecommunications, backend/frontend development, database maintenance, etc.

--
Zed's dead baby. Zed's dead.
Re:Wonderware InSQL by Dios · 2005-05-14 05:06 · Score: 2, Informative

InSQL works as an OLE Processes for SQL Server. You can use pretty much any tool (ODBC/ADO/excel/DAO/whatever) to query the database. Yes I realize I mixed libraries/methods/applications in the tool list, but just trying to get across a basic idea.

Yes, per point licensing, I believe we licensed for 60k points, not sure on the cost. This is pretty typical in the SCADA world I believe.

Sample query I'd use to get all data for a specific rtu
select * from live where tagname like 'StationName%'

The two tables you typically work with are live and history: live being the latest values, history for historical queries.

As for query times, very respectable. I believe we have about 50k points right now, updated/stored every 30 seconds (actually, it's delta storage, so some discrete points that don't change every 30 seconds would be stored only on change...). So how many rows is that?

1440 minutes per day * 2 samples per minute * 50000 points * 180 days (approx history we have online) = 25,920,000,000 rows.
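
The same arithmetic as a quick check, written as a sketch with a hypothetical helper name. It treats every point as stored on every cycle, so with delta storage the real row count would be lower.

```python
def history_rows(points, samples_per_minute, days):
    """Upper bound on stored rows if every point is written on every sample."""
    minutes_per_day = 24 * 60  # 1440
    return minutes_per_day * samples_per_minute * points * days

rows = history_rows(points=50_000, samples_per_minute=2, days=180)

# Worst-case sustained write rate implied by 50k points every 30 seconds.
writes_per_second = 50_000 / 30

print(rows)                      # 25920000000
print(round(writes_per_second))  # 1667
```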

We have asp pages people query the data from, we limit 30 second resolution data to only 2 days at a time (to help prevent loading down the machines) but a query for any point will typically return in a few seconds.

We are pretty satisfied with the product. It may not fit your needs, but it's been good for us.
Re:Wonderware InSQL by btlzu2 · 2005-05-14 06:14 · Score: 1

Thank you for the information! It was more helpful than the Sales support we received from Wonderware. :)

Actually, I would refuse to pay a license per amount of points. That is completely an arbitrary way to make more money. The only thing the amount of points should affect is disk space and possibly CPU power.

Numerous companies do not charge for a license based on how many points you have and I find the practice of charging for points reprehensible. Similar to the concept of an ISP charging per packet transmitted. What conceivable extra software engineering work do they need to do if you buy a system and enter 10,000 points as opposed to buying a system with 1,000 points?

The rough query times are about what we achieve on an order of magnitude larger database.

Of course, Wonderware doesn't meet our needs because we have numerous other requirements including no single-point-of-failure distributed architecture, 100% up time (which we've achieved for 5 years now), and other performance issues.

--
Zed's dead baby. Zed's dead.
Re:Wonderware InSQL by kernelistic · 2005-05-14 07:47 · Score: 1

I understand exactly what SCADA is. I was wondering if you are using it for local or remote network control. The extent of my SCADA experience has been interfacing with PLCs in large manufacturing and power generation.

For those looking to find out more about SCADA and/or OPC, you might want to have a look at the SCADA Working Group webpage or primers such as this one.
Re:Wonderware InSQL by Anonymous Coward · 2005-05-14 13:31 · Score: 0

I've been working with InSQL for the last couple of months. I preface by saying I have no experience with other systems, but it's been a hellish pain in the ass. The associated applications that are provided with it are even worse (ActiveFactory in particular).

It's possible that it's the implementation rather than the product itself, but it's not been fun.

YMMV.
Re:Wonderware InSQL by gyanesh · 2005-05-15 09:39 · Score: 1

My experience with Wonderware is that it can be a pain to get data out of if you don't want to use their add-ons. Once you move outside their little world into other products, the rate for retrieval goes right down. Their help is not up to much... all in all I'd say go with whatever else you can find/build.
Re:Wonderware InSQL by meme_police · 2005-05-16 06:35 · Score: 1

SCADA is cool. I had 2 job leads 2 years ago, one at a local water district, and the other at a network-owned TV station. Unfortunately I didn't get an offer from the water district. I think I would have had much more fun at the water district even though most people would think the TV station would be more fun (I can say it does have some perks). The only thing challenging here is to keep the place running with such a low budget.

--
The meme police, They live inside of my head

Don't roll your own by btlzu2 · 2005-05-14 01:21 · Score: 3, Informative

Unless you really want to do a LOT of work. This sounds very much like a SCADA system. There are vendors of such systems. Most of the realtime databases are designed to stay in a large, proprietary, RAM database which is occasionally dumped to disk for backup purposes.

In order to process so many points realtime, it usually will have to be in RAM for performance reasons.

--
Zed's dead baby. Zed's dead.

Cluster it by canuck57 · 2005-05-14 01:24 · Score: 3, Insightful

I know you're working with Windows, but when I read this I said yes.

I'm guessing someone out there has done something like this before.

Google has a cluster of machines far larger than you need, but their approach was a Linux cluster. Plus, for the amount of writes going on, you're going to want not to have any burdens on the system that are not needed.

Re:Cluster it by Gopal.V · 2005-05-14 02:39 · Score: 1

Didn't you read this? It talks about the same thing, but it's patented shit (lots of prior art anyway).

--
Quidquid latine dictum sit, altum videtur
Re:Cluster it by HyperChicken · 2005-05-14 04:06 · Score: 1

Google's GFS is mainly a write-once-read-many system. It doesn't function that well for something with lots of writes.

--
Free of Flash! Free of Flash!
Re:Cluster it by MrScience · 2005-05-14 06:36 · Score: 1

Of course, to do this they created a new file system. Might be more work than you want to take on. :)

--
You quitting proves that the karma kap worked. The most annoying of the whores shut up. --CmdrTaco

Just dump it by marat · 2005-05-14 01:24 · Score: 1

You may want to look at how video streams are composed, but the basic idea is very simple: just dump it all in arrival order and keep track of what you wrote at which offset in some table of contents. Dump tables of contents at regular offsets so you can find them easily. That's it. Just one thing: use offsets relative to the TOC, so they consume fewer bits each, and align the data, which also saves several bits from the other side.

And remember: Keep It Simple, Stupid. Be sure you can read it in a hex editor when trouble comes.
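
One possible reading of this layout, as a sketch only. The segment size, the entry format and all names below are assumptions, not something the post specifies: the file is cut into fixed-size segments, each starting with a TOC at a regular offset, and each TOC entry stores a 16-bit offset relative to its own segment.

```python
import struct

SEGMENT = 4096                  # assumed: a TOC starts every SEGMENT bytes
MAX_ENTRIES = 64                # assumed cap on records per segment
ENTRY = struct.Struct("<IHH")   # instrument id, offset relative to segment start, length
TOC_SIZE = 2 + MAX_ENTRIES * ENTRY.size

class SegmentWriter:
    """Buffers one segment in memory, then writes TOC and data in a single call."""

    def __init__(self, f):
        self.f = f
        self.entries = []
        self.data = bytearray()

    def append(self, instrument_id, payload):
        if len(payload) > SEGMENT - TOC_SIZE:
            raise ValueError("payload larger than a segment's data area")
        full = len(self.entries) == MAX_ENTRIES
        if full or TOC_SIZE + len(self.data) + len(payload) > SEGMENT:
            self.flush()
        self.entries.append((instrument_id, TOC_SIZE + len(self.data), len(payload)))
        self.data += payload

    def flush(self):
        if not self.entries:
            return
        toc = struct.pack("<H", len(self.entries))
        toc += b"".join(ENTRY.pack(*e) for e in self.entries)
        segment = toc.ljust(TOC_SIZE, b"\0") + bytes(self.data)
        self.f.write(segment.ljust(SEGMENT, b"\0"))  # pad so the next TOC is aligned
        self.entries, self.data = [], bytearray()

def read_instrument(f, instrument_id):
    """Walk the TOC at every SEGMENT boundary and collect that instrument's payloads."""
    out = []
    f.seek(0)
    while True:
        seg = f.read(SEGMENT)
        if len(seg) < SEGMENT:
            break
        (count,) = struct.unpack_from("<H", seg, 0)
        for i in range(count):
            iid, rel, length = ENTRY.unpack_from(seg, 2 + i * ENTRY.size)
            if iid == instrument_id:
                out.append(seg[rel:rel + length])
    return out
```

Because every TOC sits at a multiple of `SEGMENT`, a reader (or a person with a hex editor) can find one without any external index.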

Just use the file system by amorsen · 2005-05-14 01:27 · Score: 1

Keep a file per device. The OS will cache appropriately. The files will eventually get horribly fragmented, depending on which file system you choose. This should not be too much of a problem, depending on the read access pattern -- and if it is a problem, just be careful about which file system you pick. Reiser4 with automatic repacking would be the perfect candidate, but I haven't followed the development closely or tried the repacking myself.

--
Finally! A year of moderation! Ready for 2019?

Re:Just use the file system by Kinlan · 2005-05-14 01:59 · Score: 1

And how do you do that on a Windows box?

--
As cunning as a fox, which has just been appointed professor of cunning at Oxford University. http://www.kinlan.co
Re:Just use the file system by eric17 · 2005-05-14 03:23 · Score: 1, Funny

No problem..almost all Windows boxes have this upgrade option called "Linux". Check the manual...
Re:Just use the file system by TheSHAD0W · 2005-05-14 04:32 · Score: 1

You can avoid the fragmentation if you pre-allocate space based on what you think you'll need.
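
A hedged sketch of pre-allocating a per-device file. It uses `os.posix_fallocate` where the platform provides it (Unix only) and falls back to writing zeros elsewhere; the function names are made up, and whether this actually yields contiguous blocks depends on the file system.

```python
import os

def preallocate(path, size, chunk=1 << 20):
    """Reserve `size` bytes for a file up front; returns the resulting file size."""
    with open(path, "wb") as f:
        try:
            # Ask the OS to allocate the blocks now (not available on Windows).
            os.posix_fallocate(f.fileno(), 0, size)
        except (AttributeError, OSError):
            # Fallback: physically write zeros so the space is really allocated.
            zeros = bytes(chunk)
            remaining = size
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(zeros[:n])
                remaining -= n
    return os.path.getsize(path)

def write_at(path, offset, payload):
    """Write into the already reserved region without growing the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)
    return offset + len(payload)
```

The writer then keeps its own "next free offset" per file (the return value of `write_at`) instead of relying on the file length.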
Re:Just use the file system by ignorant_coward · 2005-05-14 04:43 · Score: 1

Why do all the people who use UNIX and Linux for these things use UNIX and Linux and not Windows?
Re:Just use the file system by Anonymous Coward · 2005-05-14 15:30 · Score: 0

actually, most of them use OpenVMS.

A commercial RDMS can cut it by jbplou · 2005-05-14 01:28 · Score: 4, Informative

You can definitely use Oracle to write out 2000 updates per second if your hardware is up to it and your db skills are good.

Re:A commercial RDMS can cut it by marat · 2005-05-14 01:49 · Score: 1

No you cannot. Oracle is designed to handle a lot of updates of the same data per second, but we are talking about a completely different task (databases are usually populated via separate batch interfaces by the way). There're specialized tools for this task as well (IBM had something, but I cannot remember correct TLA right now), but this is not hard to write yourself as I outlined in other reply.
Re:A commercial RDMS can cut it by jbplou · 2005-05-14 02:02 · Score: 1

According to MySQL there are sites that run with 800 updates/inserts per second: http://dev.mysql.com/doc/mysql/en/innodb-overview.html

Here is a SQL Server performance test that gets over 9000 inserts per second:
http://www.sql-server-performance.com/jc_large_data_operations.asp

It took me two minutes to find these two examples. I didn't find one for Oracle, but you do realize that 2000 inserts per second is not that many; OLTP database design is made for this.
Re:A commercial RDMS can cut it by zyzko · 2005-05-14 02:03 · Score: 1

Even MySQL can do this.

I've built a system like this, only the amount of data is smaller. Our system is written in Java and has a MySQL backend. In a stress test it could perform about 1000 updates per second on single-processor x86 hardware. With better hardware and a few optimizations even our system could perform at 2000 updates / sec.

-Kari
Re:A commercial RDMS can cut it by Anonymous Coward · 2005-05-14 02:31 · Score: 0

Yes you can. You are wrong. Oracle can easily update >2000 records/sec... Even random records (not the same ones). I've seen it and done it plenty of times.

This problem is really a question of your storage backend; Can it handle 2000 random seeks per second (technically you'll need 3-5 seeks for each update, plus 3 writes: undo, redo, data).
Re:A commercial RDMS can cut it by marat · 2005-05-14 02:47 · Score: 1

This is pointless. We are talking about efficient hardware management, of course any database can parse 2000 requests per second. Why use the wrong tool and compensate it with expensive hardware?

And BTW why do you think messing with database connections would be easier than doing it manually? I did these things in REXX (read Perl); it takes about a hundred lines for all of it.
Re:A commercial RDMS can cut it by Anonymous Coward · 2005-05-14 03:04 · Score: 0

The commercial solution is called Tuxedo
Re:A commercial RDMS can cut it by gvc · 2005-05-14 03:56 · Score: 4, Interesting

"Can [the storage backend] handle 2000 random seeks per second?"

The short answer is "no."

A 10,000 RPM disk has a period of 6 mSec. That's 3 mSec latency on average for random access (not counting seek time or the fact that read-modify-write will take at least 3 times this long: read, wait one full rotation, write).

So one disk can do, as a generous upper bound, 333 random accesses per second. I'll spare you the details of the Poisson distribution, but if you managed to spread these updates randomly over a disk farm, you'd need about 2000/333*e = 16 independent spindles.

The trick to high throughput is harnessing, and creating, non-randomness. You can do a much better job of this with a purpose-built solution.
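
The numbers above can be reproduced with a few lines. This is only a sketch of the post's own estimate: rotational latency alone, no seek time, and the factor of e is taken from the post as given rather than derived here.

```python
import math

def spindles_needed(rpm, updates_per_sec):
    """Return (random accesses per second per disk, estimated spindle count)."""
    rotation_s = 60.0 / rpm          # 10,000 RPM -> 6 ms per revolution
    avg_latency_s = rotation_s / 2   # on average, wait half a revolution
    per_disk = 1.0 / avg_latency_s   # generous upper bound, ignores seeks
    return per_disk, updates_per_sec / per_disk * math.e

per_disk, spindles = spindles_needed(rpm=10_000, updates_per_sec=2000)
print(round(per_disk))   # 333
print(round(spindles))   # 16
```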
Re:A commercial RDMS can cut it by Nutria · 2005-05-15 02:07 · Score: 1
So one disk can do, as a generous upper bound, 333 random accesses per second. I'll spare you the details of the Poisson distribution, but if you managed to spread these updates randomly over a disk farm, you'd need about 2000/333*e = 16 independent spindles.

You seem to be presuming that there's no:
1. database caching on the host,
2. intelligent flushing by the RDBMS,
3. Tagged Command Queueing &
4. caching at the SCSI or SAN level
--
"I don't know, therefore Aliens" Wafflebox1
Re:A commercial RDMS can cut it by gvc · 2005-05-15 02:23 · Score: 1

If the accesses are really random, caching will do no good. As you'll note, my computations already assume no seek time, so reordering to shorten seeks won't improve it. The only way caching could help is if it were to accumulate adjacent sectors for writing. There won't be many of those unless the cache is nearly as big as the database.

The whole idea behind caching and any other memory hierarchy is that it takes advantage of locality of reference, which is explicitly precluded by the stipulation in the great-grandparent that the accesses are random.
Re:A commercial RDMS can cut it by Nutria · 2005-05-15 16:41 · Score: 1

The whole idea behind caching and any other memory hierarchy is that it takes advantage of locality of reference,

Yes.

which is explicitly precluded by the stipulation in the great-grandparent that the accesses are random.

Didn't notice that part. I was thinking more of the Original Asker, but in the context of an RDBMS.

If I'm trying to shove as much data as possible into a table, caching will definitely help. And since the table will have to be indexed, caching may help there, too, depending on the keys in the index.

--
"I don't know, therefore Aliens" Wafflebox1

my two cents worth by Anonymous Coward · 2005-05-14 01:33 · Score: 0

ok, so you've got several hundred thousand instruments? if you're not military then you're a meteorologist or something similar. (if not give us a hint:) ). this means that you're pulling small amounts of data from many sources which may or may not change in your designated unit of time.

so. why are you not thinking about a real big enterprise level database? if NASDAQ can do it you can too.

going with the flat file/caching solution: if you're handling that many transactions is a windows os/file system truly a viable solution? i'm not bashing MS here i'm just curious what others think about so "many" disk and cache transactions in say 2003 or longhorn.

Have you tried a relational database? by photon317 · 2005-05-14 01:35 · Score: 1

With your specs, chances are you will either need a very beefy machine, or a distributed approach spreading the load across many machines, regardless of the software approach. But I wouldn't be surprised if a good RDBMS would outperform a flatfile approach. It is what they're designed for after all.

--
11*43+456^2

Re:Have you tried a relational database? by LuckyStarr · 2005-05-14 01:59 · Score: 2, Interesting

I agree. In fact SQLite performs quite well on a reasonable sized machine. 3000+ SQL updates on an indexed table should be no problem.

--
Meme of the day: I browse "Disable Sigs: Checked". So should you.
Re:Have you tried a relational database? by Anonymous Coward · 2005-05-14 04:57 · Score: 0

3000+ updates on an indexed table will be murder to performance, as it's going to have to update the index for every update or in a big ol batch on commit. You don't want an index on the real time log database, that's for your warehouse.
Re:Have you tried a relational database? by LuckyStarr · 2005-05-14 06:16 · Score: 1

I am fully aware of the performance penalty of an index.

What I meant is: SQLite (and presumably other RDBMS as well) is quite fast. Even with an index.

--
Meme of the day: I browse "Disable Sigs: Checked". So should you.

Yes, this sort of thing has been built before by Andy_R · 2005-05-14 01:40 · Score: 2, Informative

I have a system that can record 32 streams of data 44,100 times per second. It's called a recording studio, and I make music with it.

If your data streams are continuous, and can be represented as audio data, then you are pretty much dealing with a solved problem, and your other problem of selecting from large number of possible 'instruments' is solved by an audio patchbay.

If this isn't feasible, then a number of solutions might be appropriate (spreading the load over a number of machines/huge ram caches/buffering/looking at the problem and thinking of a less intensive sampling strategy/etc.) but without more information on the sort of data you are collecting, and exactly how quickly you need to access it, it's very hard to be specific.

--
A pizza of radius z and thickness a has a volume of pi z z a

Proprietary patented stuff - but yeah... by Anonymous Coward · 2005-05-14 01:51 · Score: 0

Posting as AC, so that nobody sues me ..

Where I work, they handle like 300 million users and have data associated with each user. Unlike AOL which used sybase to store users (and crawled) these guys use a filesystem based repository. It's a fast replicated database indexed by only one key - the username. It scales great and works on FreeBSD.

this patent and related patent should answer a few questions.... (Google fs is not as good for search scans)

Re:Proprietary patented stuff - but yeah... by dereference · 2005-05-14 04:28 · Score: 1

Where I work, they handle like 300 million users [...]
Hmm, where have I heard that number before...? Oh, right, that's just about exactly the current population of the US!

So, you say these are your "users" ?

[...] and have data associated with each user.

Ok, well, I don't think I'm going to sue you, and I really don't care who you work for, but I do think I'm going to go find my tinfoil hat RIGHT NOW...!
Re:Proprietary patented stuff - but yeah... by jbplou · 2005-05-14 04:50 · Score: 1

Well, he claimed they authenticated users faster than AOL (not much of a speed claim there), but I'm wondering too who authenticates 300 million users, since no company has 300 million employees or customers. At least not until Wal-Mart takes over all business in the world.
Re:Proprietary patented stuff - but yeah... by Anonymous Coward · 2005-05-14 09:17 · Score: 0

Yahoo? FreeBSD and the links being the clues I used to guess...

300 million still seems like a lot, even for Yahoo.
Re:Proprietary patented stuff - but yeah... by GebsBeard · 2005-05-15 01:58 · Score: 1

Gee I thought the name on the Patent, "Yahoo" would have been enuf to give it away.

horizontal scaling is good... by anon+mouse-cow-aard · 2005-05-14 01:58 · Score: 2, Interesting

Sure, optimize single node performance first, but keep in mind that horizontal scaling is something to look for. Put N machines behind a load balancer, ingest gets scattered among 'n' machines, queries go to all simultaneously. Redundant Array of Inexpensive Databases :-)

Linux Virtual Server in front of several instances of your windows box will do, with some proxying stuff for queries. Probably cheaper than spending months trying to tweak single node to get to your scaling target, and will scale trivially much farther out.

in-memory by zm · 2005-05-14 01:58 · Score: 1

You will likely need to run this baby all in RAM, with optional persistent storage if needed. If you don't have enough memory, go for a distributed solution: data from devices a,b,c go to machine1, from devices d,e,f to machine2, etc. The per device distribution algorithm should consider the amount of data from each device.
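
One simple way to weigh devices by data volume, offered as a sketch: a greedy assignment that places the heaviest device first, always onto the currently least-loaded machine. The function name and the example rates are invented for illustration.

```python
import heapq

def assign_devices(rates, machines):
    """Map device -> machine index, balancing the summed data rate per machine."""
    heap = [(0.0, m) for m in range(machines)]  # (current load, machine index)
    heapq.heapify(heap)
    assignment = {}
    # Heaviest devices first; ties broken by device name for a stable result.
    for device, rate in sorted(rates.items(), key=lambda kv: (-kv[1], kv[0])):
        load, m = heapq.heappop(heap)
        assignment[device] = m
        heapq.heappush(heap, (load + rate, m))
    return assignment

# Hypothetical rates in updates per second.
example = assign_devices({"a": 5, "b": 3, "c": 3, "d": 1}, machines=2)
print(example)  # {'a': 0, 'b': 1, 'c': 1, 'd': 0}
```

Both machines end up with a load of 6 in this example. The assignment would need to be recomputed, or made sticky, when devices are added.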

--
Sig ?

Re:in-memory by AmiChris · 2005-05-17 02:51 · Score: 1

An entire hour of updates might well fit in RAM! My proposed solution with flat files would take advantage of this. I'm thinking of using linked lists to store the entries for each instrument and having a background thread come round and write out all the entries of an instrument at once.

This thing is going to run for months at a time. Eventually the stuff has to go to disk.
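
A sketch of that buffering scheme, assuming Python's `deque` in place of hand-rolled linked lists and a caller-supplied `write_batch` callback (both are stand-ins, not part of the proposal). The background thread swaps the pending map out under a lock and then writes each instrument's entries in one go.

```python
import threading
from collections import defaultdict, deque

class InstrumentBuffers:
    """Per-instrument in-memory queues that are flushed to disk in batches."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = defaultdict(deque)

    def add(self, instrument_id, entry):
        with self._lock:
            self._pending[instrument_id].append(entry)

    def flush(self, write_batch):
        # Swap under the lock so producers are blocked only for an instant.
        with self._lock:
            pending, self._pending = self._pending, defaultdict(deque)
        for instrument_id, entries in pending.items():
            write_batch(instrument_id, list(entries))  # one write per instrument
        return len(pending)

def run_flusher(buffers, write_batch, interval_s, stop_event):
    """Background loop: flush every interval_s seconds until stop_event is set."""
    while not stop_event.wait(interval_s):
        buffers.flush(write_batch)
    buffers.flush(write_batch)  # final flush so nothing is lost on shutdown
```

Anything still in RAM is lost on a crash, so the flush interval is effectively the amount of data you are willing to lose.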

The Solution by cwraig · 2005-05-14 02:07 · Score: 2, Funny

the solution to your problem comes in the form of a little known software application from a vendor called Microsoft.
The program is called Microsoft Access 97
:P

Re:The Solution by Anonymous Coward · 2005-05-14 03:16 · Score: 0

Wouldn't it be faster and easier just to use the nul device?
Re:The Solution by mabhatter654 · 2005-05-15 16:55 · Score: 1

he's probably trying to upgrade something already in Access!! Don't we all get those at work...
Re:The Solution by Anonymous Coward · 2005-05-16 01:44 · Score: 0

Hey

Access is great for sending relatively small amounts of relational data around for quick local analysis.

Just don't use it for anything medium, large, multiuser or that is going to take any sort of beating.

Sounds like an automated stock trading app by Anonymous Coward · 2005-05-14 02:32 · Score: 0

Check out Kx or VhaYu

If you were running Linux OR BSD by LP_Tux · 2005-05-14 02:33 · Score: 0

You could use the XFS file system to get faster read/write speeds. In addition I'd recommend a special RAID setup. You would want SCSI320 RAID striping over 4 drives, in addition you'd want it mirroring over a further 4 drives. You'd need to set up a RAID array to achieve this, but it's well worth it for the performance gains. Make sure your RAID is 8xAGP or PCI-X. PCI is far too slow.

--
Open-Source > *

Re:If you were running Linux OR BSD by timigoe · 2005-05-14 03:00 · Score: 1

All well and good.

RAID will improve disk access performance but... there's always a but: you might want to take note that AGP is for graphics only. You'll have fun finding an AGP RAID card.

--
Tim (http://tim.igoe.me.uk)
Computers are like Air-con, open windows and they stop working!
Re:If you were running Linux OR BSD by SirTalon42 · 2005-05-14 13:31 · Score: 1

I think he meant PCI-Express 8x.
Re:If you were running Linux OR BSD by LP_Tux · 2005-05-21 11:09 · Score: 1

Not true. Specialist stores have 8x AGP RAID for pure speed. If you really want to break the bank you could even go for PCI-Xpress. And if AGP is only for Gfx, how come I have an AGP soundcard? Obviously a twelve year old...

--
Open-Source > *
Re:If you were running Linux OR BSD by timigoe · 2005-05-21 11:30 · Score: 1

Do show me, as I've never seen anything other than graphics for AGP (Accelerated GRAPHICS Port).

--
Tim (http://tim.igoe.me.uk)
Computers are like Air-con, open windows and they stop working!

Sequential files are your friend by gvc · 2005-05-14 02:45 · Score: 1

You didn't specify some key parameters. How big are these updates, and how do they get multiplexed? What kind of retrieval do you want to do in the data?

If your data are already arriving on a single socket, just mark up the data and write it out. Then you can retrieve anything you like with linear search. And you can be reasonably certain that you have captured all the data and will never lose it due to having trusted it to some mysterious DB software.

If linear search isn't good enough, you have to specify the sorts of queries you want. All information from a particular sensor? Information from all sensors at a particular time? Does this information have to be available on-line, or can you answer your queries in batch. Sort/merge is really efficient if you don't need real-time queries. You can build indexes in real-time almost as efficiently, if you know what you want to index. The basic technique is the same, but more complicated to set up - batch up the information to be indexed, and do a series of sort-merges to accumulate the indexes.
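
A small in-memory sketch of the batch-then-merge idea, with hypothetical function names. In practice each sorted run would be a file on disk and the merge would stream from those files; lists are used here only to keep the example short.

```python
import heapq

def make_runs(records, batch_size, key):
    """Cut the arrival-order stream into batches and sort each one (a 'run')."""
    runs, batch = [], []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            runs.append(sorted(batch, key=key))
            batch = []
    if batch:
        runs.append(sorted(batch, key=key))
    return runs

def merge_runs(runs, key):
    """Merge already sorted runs into one sorted index in a single pass."""
    return list(heapq.merge(*runs, key=key))

# Records arrive ordered by time as (sensor, timestamp); the index is by sensor.
arrivals = [(2, 0), (1, 0), (2, 1), (1, 1), (3, 2)]
by_sensor = lambda r: (r[0], r[1])
index = merge_runs(make_runs(arrivals, batch_size=2, key=by_sensor), key=by_sensor)
print(index)  # [(1, 0), (1, 1), (2, 0), (2, 1), (3, 2)]
```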

just writ the shit as it comes in. by Anonymous Coward · 2005-05-14 02:50 · Score: 0

block it.
interleave it.
write a new timestamp periodically

as for what instruments you are recording and their parameters, use a simple hash table.
the time stamp that corresponds to the introduction or deletion of instruments or change in recording parameters is hashed with the corresponding configuration. This allows 100% utilization of existing file system speed and space for recording. be careful with the parameter record so you don't lose sync to data that looks like a time stamp. the probability might be once in a million years but Murphy will have it happen 40 times in your most important hour of recording.

it's sort of rocket science...but
more like geophysical data recording in oil exploration industry, where you might look for examples.

If you need some help, i'm available for systems and algorithm design. I WILL NOT code. $2K/day plus first class travel and expenses.

I've some 35 years experience in instrumentation and telemetry.

Re:just writ the shit as it comes in. by abradsn · 2005-05-14 06:28 · Score: 1

Interesting post coming from an AC

Data warehousing by Anonymous Coward · 2005-05-14 03:11 · Score: 0

would like to know if: someone has already built something like this; and if not, would someone want to use it if I build it? I'm not sure what other applications there might be.

I'd find yourself someone with data warehousing experience (not the same thing as standard DBAs). I've worked with such people and 2000 updates a second isn't a big deal. We have no problem doing hourly bursts of millions of records with Oracle on some relatively modest hardware. It will cost you though...

One word. by Anonymous Coward · 2005-05-14 03:17 · Score: 0

iSeries

Re:One word. by mabhatter654 · 2005-05-15 17:08 · Score: 1

There aren't good tools for real-time stuff on iSeries though... It's geared toward being right, not always being fast. iSeries can easily handle that load of transactions with the right hardware... I'd love to see something NEW written [less than 5 years old] to really take advantage of the new hardware.
Anything above a trivial level of reporting gets nasty fast. Personally, I'd love to see an iSeries pulled into SCADA apps... the programming model fits the PLC world perfectly. But it's just not optimized for that type of purpose... and far too much of an iSeries is hardwired.

Mostly, we'd need more info to say anything "real" about this... most hardware can handle its network bandwidth worth of transactions nowadays.. the exact particulars of type of info, devices, and purpose/usage are critical to getting the right program. Cost and complexity are entirely dependent on what you value most to monitor/report.

Ramdisk database by Glonoinha · 2005-05-14 03:18 · Score: 4, Informative

Here's a thought - just use a hard-RAM based database.
Either make a big ramdisk and put your database out there (see my Journal from a few months back, ramdisk throughput is pretty damn fast from the local machine, given certain constraints, and random access writing is hella fast), or use a database that runs entirely in memory (think Derby, aka Cloudscape that comes with WebSphere Application Developer.)

When you've got your data, save it out to the hard drive.

Granted it helps to have a box with a ton of memory in it, but they are out there now, almost affordable. If you are collecting more than 4G of data in one session, well YMMV - but 4G is a LOT of data, perhaps consider your approach.

--
Glonoinha the MebiByte Slayer

Re:Ramdisk database by btlzu2 · 2005-05-14 04:39 · Score: 1

This is VERY insightful and I'd like to hire you. :) This is EXACTLY what well designed SCADA systems do.

--
Zed's dead baby. Zed's dead.
Re:Ramdisk database by Glonoinha · 2005-05-14 08:47 · Score: 2, Insightful

You will find that my imagination and abilities are only limited by my budget. Well that and, as I am finding, the Sarbanes / Oxley mandates that recently came down from the Productivity Prevention Team, quite effective in keeping me from actually getting any work done.

I don't really care what it pays if it has anything to do with real-time systems (guidance or delivery systems a plus), if the R&D budget has enough wiggle room for better hardware (toys) than I have at home, if you promise that I will be able to participate in the production roll-out and be allowed to make the production environment succeed, and esp if there are a few challenges that are categorized "can't be done."

Apollo 13 didn't get home because a bunch of mediocre guys sat around filling out paperwork requesting permission and setting up a committee to discuss business impact - Apollo 13 got home because a bunch of crack-junkie hardcore engineers decided that failure wasn't an option.

So the stuff you do at work - is it hard? :)

--
Glonoinha the MebiByte Slayer
Re:Ramdisk database by calidoscope · 2005-05-14 08:56 · Score: 1

If you are collecting more than 4G of data in one session, well YMMV - but 4G is a LOT of data, perhaps consider your approach.
My recent forays to Crucial show 4 sticks of 2GB reg/ecc PC2700 DDR memory will set one back a bit over $3k. For 8 to 16GB of data, the most economical route would be a dual Opteron box, things start getting expensive above 16GB.

--
A Shadeless room is a brighter room.
Re:Ramdisk database by Glonoinha · 2005-05-14 09:39 · Score: 1

You could pick up a Dell PowerEdge 1800 dual 3GHz (64 bit?) Xeon (2M cache free upgrade) right now with their quad memory upgrade promotion cranking it up to 2G for somewhere in the neighborhood of $1,700 ($500 of that being the second Xeon CPU)- leaving four slots for more memory. Add in four 2G sticks of the stuff Crucial has for that machine (the 1800 has six slots for memory) and you are looking at 10G of physical memory on a dual 3GHz Xeon machine for just shy of $6k.

That said, as I understand it AMD got their memory bus figured out quite a bit better than Intel, particularly in multi-CPU machines - if it was going to serve as a memory based SMP database engine, I might look into the Opteron platform.

Damn, I just re-read that first sentence.
$1,700 for a dual 64-bit 3.0GHz Xeon machine with 2G of RAM, upgradable to 10G for a complete system cost of under $6k. Gonna be a good Christmas this year, I'm guessing.

--
Glonoinha the MebiByte Slayer

Did something like this some years ago by isj · 2005-05-14 03:36 · Score: 2, Insightful

My current company did something like this back in 2001 with real-time rating performance, which conceptually is much like what you want to do: receive a lot of items and store them in a database, in real time. But you did not mention some of the more important details about the problem:

How much processing has to be done per item?
How long can you delay committing them to a database?
Do the clients wait for an answer? Can you cheat and respond immediately?
How many simultaneous clients must you support? 1? 5? 100?
What is the hardware budget?

2,000 items/sec means that you must do bulk updates. You cannot flush to disk 2,000 times per second. So your program will have to store the items temporarily in a buffer, which gets flushed by a secondary thread when a timer expires or when the buffer gets full. Use a two-buffer approach so you can still receive while committing to the database.
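
A minimal sketch of the two-buffer idea, assuming a single receiving thread and a single flusher thread. The `Item` layout, the class name and the `commit` callback are hypothetical; the thread does not say what an item looks like or which database is used.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical record type; the real item layout is not given in the thread.
struct Item {
    int instrument;
    long long timestamp;
    double value;
};

// Two buffers: the receiver appends to active_ while the flusher owns spare_.
class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        active_.reserve(capacity);
        spare_.reserve(capacity);
    }

    // Called by the receiving thread. Returns true once the buffer is full,
    // which is one of the two flush triggers (the other being a timer).
    bool push(const Item& item) {
        std::lock_guard<std::mutex> lock(mutex_);
        active_.push_back(item);
        return active_.size() >= capacity_;
    }

    // Called by the one flusher thread only. The swap under the lock is O(1);
    // the slow commit runs with the lock released, so receiving continues.
    template <typename Commit>
    std::size_t flush(Commit commit) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(active_, spare_);
        }
        const std::size_t n = spare_.size();
        if (n > 0) commit(spare_);  // one bulk write, e.g. one transaction
        spare_.clear();             // keeps the allocated capacity for reuse
        return n;
    }

private:
    std::size_t capacity_;
    std::mutex mutex_;
    std::vector<Item> active_;
    std::vector<Item> spare_;
};
```

The flusher would call `flush` whenever `push` reports a full buffer or its timer fires; with 2,000 items/sec and a one-second timer that is one commit per second instead of 2,000.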

Depending on your application it may be beneficial to keep a cache of the most recent items for all instruments.

You also have to consider the disk setup. If you have to store all the items then any multi-disk setup will do. If you actually only store a few items per instrument and update them, then raid-5 will kill you because it performs poorly with tiny scattered updates.

Do you have to back up the items? How will you handle backups while your program is running? This affects your choice of flat-file or database implementation.

Yup... by joto · 2005-05-14 03:40 · Score: 2, Informative

Someone has done this before. It's called a data acquisition system. The basic design for one is even sketched out in one of Grady Booch's books (before he became one of the three amigos).

The design of a data acquisition system will of course differ, depending on how much data it records per sensor, how many sensors there are, how often to record the data, and if the data is to be available for online or offline processing.

In most of the "hard" cases, you will use a pipelined architecture, where data is received on one or more realtime boxes, and buffered for an appropriate (short) period. A second stage occurs when data is collected from these buffers, and buffered/reordered/processed to make writing the desired format to a file or DBMS easier. The last stage is, of course, to write it. You might use zero or more computers at each stage, with a fast dedicated network in-between. You might even decide to split up some of the stages even further. Depending on how much you care about your data, you may also add redundancy. And make sure it's fault-tolerant: it's generally better to lose some data, as long as it's tagged as missing, than to lose it all. To check this in real-time you can also add data-monitoring anywhere it makes sense for your system.
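
One way to tag lost data as missing can be sketched as follows, assuming each sensor stamps its samples with a strictly increasing sequence number. That assumption, the `Record` layout and the `GapTagger` name are illustrative only and do not come from any particular acquisition system.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical output record: either a real sample or a marker for a gap.
struct Record {
    std::uint64_t first_seq;  // sequence number (first lost one for a gap)
    std::uint64_t count;      // 1 for a sample, number of lost samples for a gap
    bool missing;             // true for a gap marker
    double value;             // unused when missing is true
};

class GapTagger {
public:
    explicit GapTagger(std::uint64_t first_expected = 0) : next_(first_expected) {}

    // Appends the sample to out, preceded by a gap marker if samples were lost.
    // Late or duplicate samples (seq below the expected one) are dropped here.
    void accept(std::uint64_t seq, double value, std::vector<Record>& out) {
        if (seq < next_) return;
        if (seq > next_) out.push_back({next_, seq - next_, true, 0.0});
        out.push_back({seq, 1, false, value});
        next_ = seq + 1;
        // Invariant: the output now accounts for every sequence number so far.
        assert(out.back().first_seq + out.back().count == next_);
    }

private:
    std::uint64_t next_;  // next sequence number we expect to see
};
```

Downstream stages then see an explicit "2 samples lost starting at sequence 2" record rather than a silent hole, which is also a convenient hook for the real-time monitoring mentioned above.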

In the simpler cases, you simply remove things not needed, such as a soundcard instead of dedicated realtime-boxes, redundancy, monitoring, dedicated network, etc...

Some commercial off-the-shelf systems will surely do this. But the more advanced systems, you still build yourself, either from scratch, or by reusing code you find in other similar projects (I'm sure there is some scientific code available from people interested in medical science, biology, astrophysics, geophysics, meteorology, etc...).

Most of the "heavy" systems will not run on Windows, or even Intel, due to limitations of that platform for fast I/O. This has obviously changed a lot recently, so it's no longer the stupid choice it was, but don't expect too many projects of this kind to have noticed, as they probably have existed much longer.

Have you considered memory-mapped files? by Teancum · 2005-05-14 03:50 · Score: 3, Interesting

I did some work on a DVD-Video authoring system that had some incredible file system requirements (obviously, when involving video data and the typical 4 GB data load for a single DVD disc).

The standard file API architecture just didn't hold up, so we (the development team I was working with) had to rewrite some of the file management routines ourselves and work directly with the memory-mapped architecture. This does give you some other advantages beyond speed as well, as once you establish the file link and set it in a memory address range you can treat the data in the file as if it were RAM within your program, having fun with pointers and everything else you can imagine. Copying data to the file is simply a matter of a memory move operation, or copying from one pointer to another.

The thing to remember is that Windows (this is undocumented) won't allow you to open a memory-mapped file that is larger than 1 GB, and under FAT32 file systems (Windows 95/98/ME/and some low-end XP systems) the total of all memory mapped files on the entire operating system must be below 1 GB (this requirement really sucks the breath out of some applications).

Remember that if you are putting pointers into the file directly, that it works better if the pointers are relative offsets rather than direct memory pointers, even though direct memory pointers are in theory possible during a single session run.

Re:Have you considered memory-mapped files? by p3d0 · 2005-05-14 05:24 · Score: 1

Remember that if you are putting pointers into the file directly, that it works better if the pointers are relative offsets rather than direct memory pointers, even though direct memory pointers are in theory possible during a single session run.
Good advice. These are "self-relative-pointers". Instead of this:

Foo *Bar::getFoo(){ return _fooField; }

...you write something like this:

Foo *Bar::getFoo(){ return (Foo*)((char*)&_fooField + _fooField); } // _fooField holds a signed byte offset (e.g. ptrdiff_t), not a Foo*
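
A somewhat fuller sketch of the same idea, assuming the field stores a signed byte offset from its own address and that the field and its target live in the same mapping. `SelfRelPtr`, `Node` and `follow_after_relocation` are hypothetical names; the `memcpy` stands in for unmapping the file and mapping it again at a different address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Stores the distance from the field itself to its target, so the stored
// value stays valid wherever the file is mapped. Offset 0 means "null".
template <typename T>
class SelfRelPtr {
public:
    SelfRelPtr() : offset_(0) {}
    SelfRelPtr(const SelfRelPtr&) = delete;             // a copy would live at a
    SelfRelPtr& operator=(const SelfRelPtr&) = delete;  // new address: wrong offset

    void set(T* target) {
        offset_ = target
            ? reinterpret_cast<char*>(target) - reinterpret_cast<char*>(this)
            : 0;
    }
    T* get() {
        return offset_
            ? reinterpret_cast<T*>(reinterpret_cast<char*>(this) + offset_)
            : nullptr;
    }

private:
    std::ptrdiff_t offset_;
};

// A record as it might sit inside a mapped file.
struct Node {
    SelfRelPtr<int> link;
    int value;
};

// Builds a record in one buffer, moves the raw bytes to another buffer and
// follows the link there. A raw pointer would still point into the old buffer.
inline int follow_after_relocation() {
    alignas(Node) unsigned char first[sizeof(Node)];
    alignas(Node) unsigned char second[sizeof(Node)];
    Node* n = new (first) Node;
    n->value = 42;
    n->link.set(&n->value);
    std::memcpy(second, first, sizeof(Node));
    Node* moved = reinterpret_cast<Node*>(second);
    assert(moved->link.get() == &moved->value);
    return *moved->link.get();
}
```

Copying is disabled on purpose: the offset is only meaningful at the address where it was computed, so a copied field would need its offset recalculated.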

--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Re:Have you considered memory-mapped files? by Anonymous Coward · 2005-05-14 10:57 · Score: 0

The thing to remember is that Windows (this is undocumented) won't allow you to open a memory-mapped file that is larger than 1 GB,
It was documented 4 or 5 years ago (last time I touched a Windows box). Curious how file mapping used to be so poorly documented even on Unix variants.
Re:Have you considered memory-mapped files? by Teancum · 2005-05-14 11:47 · Score: 1

At the time I had to find out the hard way from some obscure Microsoft support line at $500 per incident that this was the case... and even then the tech support wasn't really that sure or understood why that was the case.

I may have been the reason why it got documented in the first place, and it seems like a really silly limitation.
Re:Have you considered memory-mapped files? by Anonymous Coward · 2005-05-14 21:55 · Score: 0

Yes, silly limitation, and if my memory serves me well, a Windows file mapping could not grow once opened. I remember an MSDN article about some guy using page faults to close/recreate the whole file to achieve some 'growable' functionality.
At this time AIX has growable shared memory, but unfortunately with a severe size limit (512 MB?).

Is there a good explanation of the state of file mapping on Linux today?
Re:Have you considered memory-mapped files? by julesh · 2005-05-15 04:43 · Score: 1

Second the suggestion of using memory mapped IO. It allows the system to optimise caching much more effectively than you're likely to be able to.

The thing to remember is that Windows (this is undocumented) won't allow you to open a memory-mapped file that is larger than 1 GB

OK... does it fail at CreateFileMapping or MapViewOfFile? If the latter, you can work with larger files; you'll just need to restrict yourself to a 1 GB window within them.
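
The window arithmetic can be sketched without touching any mapping API. This is only the offset calculation, assuming view offsets must be a multiple of some granularity (on Windows that value comes from GetSystemInfo, on POSIX systems the page size from sysconf plays the same role). `Window` and `window_for` are hypothetical names, and the actual mapping calls are left out.

```cpp
#include <cassert>
#include <cstdint>

struct Window {
    std::uint64_t view_offset;  // where the mapped view starts in the file
    std::uint64_t view_size;    // how many bytes to map
    std::uint64_t delta;        // position of the requested byte inside the view
};

// Computes which part of a large file to map so that byte `pos` is reachable,
// never mapping more than `max_view` bytes at once.
inline Window window_for(std::uint64_t file_size, std::uint64_t pos,
                         std::uint64_t max_view, std::uint64_t granularity) {
    assert(granularity > 0 && max_view >= granularity && pos < file_size);
    Window w;
    w.view_offset = pos - (pos % granularity);           // round down to alignment
    w.view_size = file_size - w.view_offset;             // rest of the file...
    if (w.view_size > max_view) w.view_size = max_view;  // ...capped at the window
    w.delta = pos - w.view_offset;
    return w;
}
```

For a 5 GB file and a 1 GB window, a request for a byte a little past the 3 GB mark yields a view that starts on the nearest lower 64 KB boundary; the caller remaps whenever an access falls outside the current view.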

and under FAT32 file systems (Windows 95/98/ME/and some low-end XP systems) the total of all memory mapped files on the entire operating system must be below 1 GB (this requirement really sucks the breath out of some applications).

Are you sure this is related to the filesystem? My understanding was that this restriction was because Win9x could only share memory in the address range 0xc0000000 - 0xffffffff (which is automatically shared between all running applications!), and memory mapped IO had to take place in the shared memory segment. If so, it shouldn't apply to XP.

Remember that if you are putting pointers into the file directly, that it works better if the pointers are relative offsets rather than direct memory pointers, even though direct memory pointers are in theory possible during a single session run.

If you use the MapViewOfFileEx function, you can specify the location at which you want the file mapped. This may or may not be useful: if you have 3rd party DLLs whose versions may change you can't easily predict where space will be available, and if you're on a Win95 family OS, it has to be available for _all_ applications, and the allowed range of addresses is severely restricted.
Re:Have you considered memory-mapped files? by AmiChris · 2005-05-17 00:50 · Score: 1

Won't fly. The current beast I'm trying to improve has files larger than 5GB, which is larger than any 32bit memory space can be.

I'm also not sure I see an advantage. I've considered trying windows "overlapped IO", but I'd like to steer clear of everything platform specific even if it means using lots of threads.
Re:Have you considered memory-mapped files? by Teancum · 2005-05-18 01:28 · Score: 1

While the MapViewOfFile function does have an impact with the problems of Windows regarding memory space, the 1GB limitation is not restricted to just this function.

It is indeed the "CreateFile" portion of what Windows NT deals with that causes the problems. I did experiment with different memory window ranges and various strategies to access the data. It didn't seem to have any effect at all regarding this absolute limit as it appears Windows does some sort of alternate mapping beyond what is formally published through the MapViewOfFile function. I hope Wine got this one right with their implementation of this.

The massively frustrating issue came up with the 1GB limit for "ALL" open files on the operating system when used under Win 95/98/ME. At the time I was doing development under Windows 98 (to simplify the distribution requirements... and at the time I wasn't as familiar with the quirks of NT). This was indeed a limitation of the file system, and was generally not documented. Your explanation regarding the shared memory space is likely the reason for this problem, but at the time I couldn't get an answer even from Microsoft itself.

Re: XP.... there are some versions of Win XP that are basically modified versions of Windows ME. I know that seems silly, but this is Microsoft. I'm not talking XP Professional, but some low consumer-grade versions that MS branded as Windows XP. This isn't that hard of a limitation to get around, but as a software developer you need to be aware of it if you have a customer with their own computer that is installing your software and you are expected to troubleshoot why it won't work.

This was the result of work that I did several years ago, and the projects I was working on were fairly high value pieces of software, where you could not only specify the operating system, but much of the computer equipment as well. (Windows was specified by folks higher up the food chain).
Re:Have you considered memory-mapped files? by Teancum · 2005-05-18 01:44 · Score: 1

The real advantage of going with memory-mapped files is really speed. By throwing the file into memory mapped space, you are by-passing much of the overhead that the operating system (Windows in this case) throws in regarding memory management with the file system. It is still abstracted enough that you don't have to deal with specific hard drive architectures, but it is pretty much right where the operating system deals with the disk data anyway.

When I did data flow experiments, I got up to a 3x to 5x data throughput improvement going this route as opposed to typical file I/O access routines found in the standard I/O libraries of a typical compiler.

This is a case of where if you need to get high performance, you need to tweak the operating system and use those APIs that can help your application the most. If you don't care about the maximum performance but are more interested in cross-platform issues, then you need to stick with standard I/O functions and APIs.

I'm curious about the 5GB files though... I was under the impression that NTFS in general won't allow you to create or open a file > 1GB. I'm sure you are fighting other issues in dealing with data files of that size, and you would have to, in essence, create your own internal file system just to be able to access a specific piece of data within such a huge file. That is the real issue here, and data organization when you are dealing with that much data is a huge issue. It is also something that you can easily screw up if you shove data in haphazardly.
Re:Have you considered memory-mapped files? by AmiChris · 2005-05-20 01:14 · Score: 1

Actually having started I do see a major advantage of the memory mapping. I don't have to worry about multiple threads read/writing to the same file. Right now I'd have to put a mutex around the file, or do some kind of file sharing between threads. Why? Because you have to seek and then read/write.
I'm curious about the 5GB files though... I was under the impression that NTFS in general won't allow you to create or open a file > 1GB.
Nahh, I've had files of ~4 GB. FAT32 has a limit around there, at 2^32 bytes. The old DB had multiple files. NTFS will theoretically hold ~2^64 bytes per file and volume:
Size Limitations in NTFS and FAT File Systems

Oh this is interesting:
If you use large numbers of files in an NTFS folder (300,000 or more), disable short-file name generation, especially if the first six characters of the long file names are similar. For more information, see "Optimizing NTFS Performance" later in this chapter.

Maybe that's why things sucked so bad when I had a bunch of files in one directory. It also seems to imply I am allowed to have that many files.

Check out RDM Server by devexial · 2005-05-14 04:01 · Score: 0

They have their own algorithm that claims to be 200x faster than normal RDBMs, using 'tick tables'. modulus RDM server

Specialized Hardware by mschaef · 2005-05-14 04:23 · Score: 2, Informative

This may be gross overkill, but there's specialized hardware specifically designed for sustained high-throughput disk storage. A company called Conduant makes specialized disk controllers that use on-board microcontrollers to drive arrays of disks. When I last saw them demoed, they could sustain writes of 100MB/sec using direct card to card transfers across the PCI bus. They can configure a data acquisition card to directly store information into a shared buffer on the disk controller across the PCI bus. The disk controller then picks the data up and drives it across ten IDE channels. That was a few years ago; these days it looks like they can sustain 200MB/sec with a controller, and up to 600MB/sec and 6TB of capacity with a custom box mounted in a rack.

I'm not so sure what their story is regarding reading or querying. My guess is you lose a lot of bandwidth, but not all. Anyway, it might be worth checking out.

http://www.conduant.com/products/overview.html

Another thing is that modern computers can have lots of innate capacity themselves. My hunch is that you could do a lot with a couple of modern disks on separate SATA channels and several GB of RAM. Maybe this is only a software problem...

just dump to disk by Anonymous Coward · 2005-05-14 04:41 · Score: 1, Insightful

as others have said, just stream the data to disk with some kind of big RAM buffer in between. each instrument can go to a separate directory, each minute or hour of data goes to a separate file. A separate thread indexes or processes the data as needed.

And don't forget the magic words: striping. you should interleave your data across many disks, and the index files should be on separate disks as well.

Do striping+mirroring for data protection. do the striping at the app level for maximum throughput, do the mirroring at the hardware level.
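
A rough sketch of the layout described above: one directory per instrument, one file per hour, and a coarse app-level form of striping that rotates files across disks. The mount points, the naming scheme and the rotation rule are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Picks the disk, directory and file for one instrument at one point in time.
// `roots` are hypothetical mount points, one per physical disk.
inline std::string path_for(const std::vector<std::string>& roots,
                            std::uint32_t instrument,
                            std::uint64_t unix_seconds) {
    assert(!roots.empty());
    const std::uint64_t hour = unix_seconds / 3600;  // one file per hour
    // Neighbouring instruments land on different disks, and a single busy
    // instrument moves to the next disk every hour.
    const std::size_t disk =
        static_cast<std::size_t>((instrument + hour) % roots.size());
    return roots[disk] + "/inst" + std::to_string(instrument) +
           "/" + std::to_string(hour) + ".dat";
}
```

Because the path is a pure function of instrument and time, the indexing thread and any later reader can locate a file without consulting a catalogue. Index files would go under a separate root that is not in `roots`.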

When you aren't going through layers of crap like an SQL database, you should *fly* like this on modern hardware.

what about reordering requests? by Anonymous Coward · 2005-05-14 04:47 · Score: 0

IMO, it is done by both O/S and SCSI hardware

Re:what about reordering requests? by gvc · 2005-05-14 05:50 · Score: 1

Post-hoc reordering won't do it. For a vast database, the probability of accessing adjacent sectors within the lifetime of the cache is vanishingly small.

Kdb+ by RussHart · 2005-05-14 05:28 · Score: 3, Informative

Kdb+ by KX Systems (http://www.kx.com/) is by far and away the best thing for this. Its main use is to store tick data from financial markets, and it is excellent at this (if expensive).

From how you described your needs, this would probably fit the bill.

Been there, done that by toddbu · 2005-05-14 06:01 · Score: 1

No time to read the thread, so some of this may have already been covered. I did a similar project where we had to keep track of billions of hits on a web site. The volumes got to be too great to handle using SQL Server inserts. The nature of our data (which is common for data sets this size) is that some loss was acceptable but only in situations where the servers experienced a problem (power loss, server lockup, etc) We weren't running a bank. So we'd write stuff to an in-memory queue and have a background thread pick up the data and write it to disk on idle cycles. Every hour we'd start a new disk file, pick up the previous hour's data, and load it into the db. Eventually even this didn't work because our loads were too great, so then our hourly processing process got a makeover and started doing some of the summarization of the data sets that we needed and we just dumped the raw data. There were many people who didn't like that idea because we lost the original values, but once we proved that it didn't affect the final values then it was accepted.

The moral of the story is to determine up-front how much of that data you really need.
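
The hourly summarization step might look roughly like this. It is a sketch only: the thread does not say which summaries were kept, so count, sum, min and max per key and hour are assumptions, as are the `Summary` and `HourlyRollup` names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

struct Summary {
    std::uint64_t count = 0;
    double sum = 0.0;
    double min = 0.0;
    double max = 0.0;
};

// Keeps one running summary per (key, hour) instead of every raw value.
class HourlyRollup {
public:
    void add(std::uint32_t key, std::uint64_t unix_seconds, double value) {
        Summary& s = buckets_[{key, unix_seconds / 3600}];
        if (s.count == 0) {
            s.min = s.max = value;  // first value in this bucket
        } else {
            s.min = std::min(s.min, value);
            s.max = std::max(s.max, value);
        }
        s.sum += value;
        ++s.count;
    }

    // The bucket must exist; callers ask only for hours they have fed.
    const Summary& get(std::uint32_t key, std::uint64_t hour) const {
        auto it = buckets_.find({key, hour});
        assert(it != buckets_.end());
        return it->second;
    }

private:
    std::map<std::pair<std::uint32_t, std::uint64_t>, Summary> buckets_;
};
```

Loading one row per key and hour instead of every raw record is what shrinks the database load; the cost, as noted above, is that the original values are gone once the raw file is discarded.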

--
If you don't want crime to pay, let the government run it.

Re:Been there, done that by Stormcrow309 · 2005-05-16 04:53 · Score: 1

From another perspective, we used to post inventory transactions to our ERP system at the band-aid level, creating 100,000 inventory journals a day. This is a large load for a hospital's ERP and makes financial analysis a headache. I would do a use case to determine the 'real' granularity needed for the data. Remember, users ask for everything; we give them what they need.

--
In God we trust, all others require data.

More info by Halvard · 2005-05-14 06:45 · Score: 1

You don't mention the type of instruments or data. Perhaps you could store it via syslog on a remote syslog server.

NetCDF or HDF5 by Salis · 2005-05-14 07:19 · Score: 2, Informative

NetCDF and HDF5 are optimized binary file formats for storing incredibly large amounts of data and quickly retrieving it.

I'm more familiar with NetCDF (because I use it) so let me tell you some of the things it can do. (HDF5 can also do these things, I'm sure).

With NetCDF, you can store +2 gigabyte files on a 32 bit machine (it supports Large File support). I've saved 12 gigabyte files with no problems. It supports both sequential and direct access, meaning you can read and write either starting from the beginning of the file or at any point in the middle of the file.

The format is array-based. You define dimensions of arrays and variables consisting of zero, one, or more dimensions. You can also define attributes that are used as metadata, information describing the data inside your variables.

You can read or write slices of your data, including strides and hyperslabs. This allows you to read/write only the data you're interested in and makes disk access much faster.

It's also easy to use with good APIs. They have APIs for C, Fortran95, C++, MATLAB, Python, Perl, Java, and Ruby.

Take a look at it. It might be what you're looking for.

-Howard Salis

--
Favorite /. tagline: "On the eighth day, God created FORTRAN." And it was good.

Thousands of instruments? by checkyoulater · 2005-05-14 07:41 · Score: 0

Man, that must be some cool sounding music if it has thousands of instruments playing at the same time. Care to share the name of this supergroup?

--
Is that a real poncho? I mean, is that a Mexican poncho or is that a Sears poncho?

Re:Thousands of instruments? by Anonymous Coward · 2005-05-14 09:28 · Score: 0

Asia

Now for bonus points - how old am I?
Re:Thousands of instruments? by Anonymous Coward · 2005-05-14 17:25 · Score: 0

Now for bonus points - how old am I?
Doesn't matter, they have a new album.
Re:Thousands of instruments? by Anonymous Coward · 2005-05-17 06:40 · Score: 0

Testing some shit.

SQLite by shadowpuppy · 2005-05-14 13:50 · Score: 1

I seem to remember the SQLite homepage saying it could handle a few million inserts in a few seconds. So assuming you mean 2000+ updates a second in total and not 2000+ per instrument, that's quite a safety margin.

An RDBMS won't cut it? by Anonymous Coward · 2005-05-14 19:57 · Score: 1, Informative

You need to do 2000+ updates a second?

*Many* RDBMS systems can do this without breaking a sweat.

Do some googling on Interbase for example - one of the success stories for IB is a system that does 150,000 inserts per second - sustained. It's a data capture system that may well be similar to yours.

Oracle can definitely do it - but you'll probably need a good Oracle DBA to tune it up properly.

Informix can definitely do it as well - don't know about the latest version, never used it, but whatever was current circa 1999 (v5?) could handle your needs as well.

300 million users by da5idnetlimit.com · 2005-05-14 21:38 · Score: 1

Hello, this is your bank calling ...

"MasterCard member banks added an EMV chip to 40% of the 200 million MasterCard"

"In Asia Pacific, Visa has a greater market share than all other payment card brands combined with 59 percent of all card purchases at the point of sale being made using Visa cards. There are currently more than 365 million Visa cards in the region." (2003)

If Visa had 365 million cardholders just in Asia Pacific in 2003, I wonder how many they have worldwide nowadays...

--
It takes 40+ muscles to frown, but only four to extend your arm and bitchslap the motherfucker

HP-IB and ISAM by Decker-Mage · 2005-05-14 21:44 · Score: 3, Informative

This was what the Hewlett Packard Interface Bus (HP-IB) was invented for and your instruments may already be equipped for it. As for what to do with the data stream from the instruments, you stuff it into an ISAM database. Why anyone would even think of using an RDBMS for this is beyond me. ISAM (Indexed Sequential Access Method) has been around forever, exists to take tons of sequential data and store it to the media of choice. From your description, retrieval is only going to be based on a few criteria anyway (instrument, time), so those indices are perfect in this instance.
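
To make the idea concrete, here is an in-memory stand-in for the ISAM pattern: records are appended sequentially, and a sorted index on (instrument, time) points at their positions. It is a sketch, not any real ISAM library's API; `Sample` and `IsamSketch` are hypothetical names, and a real implementation would keep both the data and the index on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Sample {
    std::uint32_t instrument;
    std::uint64_t time;
    double value;
};

class IsamSketch {
public:
    // Sequential append plus one index entry. A second sample with the same
    // (instrument, time) replaces the index entry; the old record stays in
    // the data but becomes unreachable through the index.
    void append(const Sample& s) {
        index_[{s.instrument, s.time}] = data_.size();
        data_.push_back(s);
    }

    // All samples of one instrument with from <= time < to, in time order,
    // regardless of the order in which they arrived.
    std::vector<Sample> range(std::uint32_t instrument,
                              std::uint64_t from, std::uint64_t to) const {
        assert(from <= to);
        std::vector<Sample> out;
        auto it = index_.lower_bound({instrument, from});
        auto end = index_.lower_bound({instrument, to});
        for (; it != end; ++it) out.push_back(data_[it->second]);
        return out;
    }

private:
    std::vector<Sample> data_;  // the sequential "data file"
    std::map<std::pair<std::uint32_t, std::uint64_t>, std::size_t> index_;
};
```

Writes stay cheap because the data itself is only ever appended; all of the ordering work lives in the index, which is small compared with the data.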

On the coding end, there are numerous (hell, hundreds of) commercial and F/OSS ISAM libraries, and books about them, for you to use for the actual storage and retrieval. It may even be included in your existing libraries given how old the technique is now. I was doing this back in the '80s for the US Navy using a 24 bit, very slow, mini-computer, so any normal box should be able to handle it today!

We use these techniques in electronic instrument monitoring, logistical systems, systems engineering, you get the idea. You may want to mosey over to the HP developer web site to see if there is a drop in solution, as I imagine there is (sorry, haven't looked).

I hope this helps.

--
"[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go

MySQL In-Memory Table, memcached, or Prevailer? by Anonymous Coward · 2005-05-15 00:38 · Score: 0

If you want speed, I'd look into either of these.

RDMS Insert Performance by Anonymous Coward · 2005-05-15 03:43 · Score: 0

If you plan on inserting into a database at some point, whether directly or buffered, pay attention to insert performance. There are two lessons I learned. One, the typical ODBC interface creates an implied transaction for each separate insert statement. So, group many thousands of inserts into one transaction. The second point is using bulk inserts. ODBC has a mechanism for sending arrays of parameters for an insert statement. So, you could create arrays of 2000 parameters and send one bulk instruction to the server, rather than 2000 individual inserts. This makes a huge difference in performance. The problem is that not all ODBC drivers are up to it. I am able to insert many thousands of records in a few seconds using no special hardware. Good luck.
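
The batching itself is independent of the database API and can be sketched on its own. In the code below `send_batch` is a hypothetical stand-in for the driver-specific step (with ODBC, that would be binding the slice as a parameter array and executing one insert inside one transaction); no real ODBC calls are shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `rows` into slices of at most `batch_size` rows and calls
// send_batch(first, count) once per slice. Returns the number of calls,
// i.e. the number of round trips to the server.
template <typename Row, typename SendBatch>
std::size_t insert_in_batches(const std::vector<Row>& rows,
                              std::size_t batch_size, SendBatch send_batch) {
    assert(batch_size > 0);
    std::size_t round_trips = 0;
    for (std::size_t i = 0; i < rows.size(); i += batch_size) {
        const std::size_t n = std::min(batch_size, rows.size() - i);
        send_batch(rows.data() + i, n);  // one bulk instruction for n rows
        ++round_trips;
    }
    return round_trips;
}
```

With a batch size of 2000, 4500 buffered rows cost three round trips instead of 4500, which is where the difference in performance comes from.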

DBM Family: esp GDBM and Berkeley DB by Xife · 2005-05-15 09:03 · Score: 1

This family of databases is the heart of sendmail, and some SQL engines are built on top (MySQL if memory serves).

The interface is a model of simplicity: pointers to arbitrary length buffers for keys and data. All you need is a key scheme that provides the post-acquisition access that you require.

Berkeley offers hash and BTree style organization of the keys.
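
One possible key scheme for the BTree organization, assuming the default byte-by-byte key comparison: encode the instrument id and the timestamp most-significant byte first, so that byte order and numeric order agree. The 4+8 byte layout and the function names are assumptions for illustration, and no Berkeley DB calls are shown.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends the low `bytes` bytes of v, most significant first, so comparing
// the bytes left to right gives the same result as comparing the numbers.
inline void put_be(std::string& out, std::uint64_t v, int bytes) {
    assert(bytes >= 1 && bytes <= 8);
    for (int shift = 8 * (bytes - 1); shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
}

// 12-byte key: 4 bytes of instrument id followed by 8 bytes of timestamp.
inline std::string make_key(std::uint32_t instrument, std::uint64_t time) {
    std::string key;
    key.reserve(12);
    put_be(key, instrument, 4);
    put_be(key, time, 8);
    return key;
}
```

With keys built this way, all records of one instrument sit next to each other in time order, so "instrument X between t1 and t2" becomes a single range scan. Writing the integers in the machine's native little-endian layout would break that ordering.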

It may use memory mapped FileIO under the hood and handles all transfer of multiple buffers.

It provides multiple files or multiple tables in one file, and you can control the cache size.

It can run 2,000 inserts per second on hardware from the mid 90s. (UltraSparc II 450)

Berkeley DB (www.sleepycat.com)

As far as I know it runs on just about everything including several embedded OS's, Windows and every variant of Unix.

--
---- Smokin' another sig.

Re:DBM Family: esp GDBM and Berkeley DB by AmiChris · 2005-05-17 02:22 · Score: 1

Ok, looking at BDB at sleepycat. It looks like you've got one table per file with this thing. There's also the problem that I can't open up our source :-(

GDBM looks really simple. It also seems to have just one table per file. So all my instruments would have to go in that one. I'm wondering if something that looks so simple really has the performance I need.

How do you know? by hotpotato · 2005-05-15 09:23 · Score: 1

Have you actually tried doing this with a relational database? Which ones?

Based on my (relatively basic) knowledge of how databases work these days, using large in-memory caches and fast commits, I wouldn't be surprised if a good enterprise database could handle this rate of commits.

You should remember that 2000 commits != 2000 random disk accesses!

High energy physics by Anonymous Coward · 2005-05-15 10:27 · Score: 0

Maybe what you are looking for has already been solved by high energy physicists: The ROOT toolkit is at least supposed to handle very large datasets (I never tried that, though).

This might be able to do it by Bozovision · 2005-05-15 13:17 · Score: 1

Faircom CTree-Plus might.

Advantages:
- it's fast and it's not constrained by column length. If you want a table with 16,000 columns, go right ahead.
- it's very portable. Runs on just about every operating system that has more than 100 users.

The disadvantages:
- last time I looked (admittedly about 4 years ago), their SQL integration could have been better.
- it's not a high-level database. To work most effectively with it, you need to know about the way that your data is stored.

I'm sure it's improved a great deal since then.

sqlite @ 120,000 inserts per second by bundaegi · 2005-05-15 23:46 · Score: 1

here. Effects of filesystem/RAM/CPU/SCSI on the results are discussed.

--
bundaegi is good for you

Re:sqlite @ 120,000 inserts per second by AmiChris · 2005-05-17 03:10 · Score: 1

Umm, with your example. Is that 120K/s for the first 10s, or will it keep that up for a few months? Is it all in memory or can I have several GB of data?
Re:sqlite @ 120,000 inserts per second by bundaegi · 2005-05-18 23:47 · Score: 1

>Is that 120K/s for the first 10s, or will it keep that up for a few months?
That, I want to know as well
>Is it all in memory or can I have several GB of data?
Definitely GB of data stored on the disk:

All I have now is an athlon 1.7 with 512 megs of ram (debian).
I don't care that much for fast inserts, more like, I have HUGE quantities of images (png+32bit timestamp) which I grab and store in a sqlite database (insert rate could go up to gigabytes of data per hour for 5 image/sec, I need to check if the rate goes down).
I then try to locate images using their timestamp. takes up to a couple of seconds to locate an image in a multigig database (from my experience if you wanted to extract a sequence of consecutive images, only the first image would take a long time to find). I use an index on the timestamps, need to check whether that helps or not.

Anyway, I just bought a 400 gig hd which I formatted in ext3 to store my database, so I'll be able to tell if I notice a slowdown in reading/writing for a 10-100 gig sqlite database (I'll put something in my journal if that's of any interest).

I really like working with sqlite (both in c and python with apsw) so when I saw the thread, I kind of remembered the 120000 inserts/sec tests (the guy provided source code for the test, so that's something you could try for yourself).

--
bundaegi is good for you
Re:sqlite @ 120,000 inserts per second by bundaegi · 2005-05-19 21:58 · Score: 1
This is pretty much my set-up:
- I run daemons for logging my data into the database
- I use a web server on the database side (thttpd) with cgis that let me access the database in certain ways.
- I have cgis written in both c and python
- Keep it simple: each cgi is self contained, small and does only one thing well.
- The front-end (written in wxpython) queries the database over the web and displays pretty graphs
- Replies from the webserver can be compressed/encrypted if need be
I wanted to access the database with an easy protocol (http) rather than writing my own socket stuff (I tried but... why bother?). I don't need full database access either (like ODBC), just access to data for a given time interval.
At the moment, I have my mind set on using sqlite. If I decide to change database for X reason, I don't want to get screwed. I found a wrapper called libsdb which I may use... the same SQL queries can be proxied to a variety of databases (Oracle, PostgreSQL, sqlite...). The one thing I could lose in sqlite3 is the ability to join records in a table spanning across different files. Well... I don't know if that could speed up my queries or not. Will need to try and check.

This tutorial I also found useful.
--
bundaegi is good for you

Re:Just use the file system - windows blows by AmiChris · 2005-05-17 00:38 · Score: 1

I tried writing a prototype a while back to see if it could be done like that. The performance was OK with 1 file per instrument, as far as the program goes. I had two threads. The writer would append all the entries for one instrument at a time to the end of its file.

When you click near the directory (the parent directory did it) in the "file explorer", the whole thing locks up for a few minutes: desktop, task bar, etc. I haven't tried making a tree of subdirectories to avoid this problem. I'm not too sure it would help.

I'm thinking it would probably be better to have just a few files and allocate blocks out of them.

Re:NetCDF or HDF5 - interesting by AmiChris · 2005-05-17 01:04 · Score: 1

Cool, but these are file format specifications. Are there any engines that work with these which are really fast? Do they cache a bunch of stuff in memory or will that still be my job?

Re:HP-IB and ISAM - Ahh now I know as name for it. by AmiChris · 2005-05-17 01:46 · Score: 1

Searching freshmeat I actually found some projects with ISAM in their names. I'm not sure they look too promising though.

Thanks. I now know the name for it. It still looks like I might be better off writing something from scratch. Maybe I can slap it up on sourceforge afterwards.

Wrap up by AmiChris · 2005-05-17 03:18 · Score: 1

Boy I didn't expect this thread to explode like this while I was gone. Some people asked for more info so I'll just make some points:

* The database is 5GB right now. After improving the thing's performance, it could be 10 times bigger: 50GB.

* Yes, some people guessed it. It's financial data. I'm trying to dump all the trades of all stocks and futures in the US and EU. Right now we do a subset, but there's always something missing.

* Hardware. Yes I can get one or two monster machines for our server farms. Some of our customers run the current software locally, so I can't demand anything too fancy, like a better OS, Oracle, or 20 disk RAID from them.

* The data needs to be accessed in real time. If you've got a chart open you want to see the ticks coming in real time, and you want to be able to scroll back a few weeks.

* As far as clustering/load balancing goes: yeah, I can do this in our server farm, but I want each unit to work better first.

* The individual entries are small (~50B) and fixed-size per instrument.

* "How much processing has to be done per item?" - almost nil
* "How long can you delay committing them to a database?" as long as they'll fit in memory
* I'd say clients ask for a chart a few times a second. A chart doesn't require all the data points of an instrument.
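
Since a chart doesn't need every data point, one way to serve chart requests is to collapse ticks into one bar per time bucket before sending them. The sketch below is only an illustration of that idea: the open/high/low/close bar format, the function name `downsample`, and the (timestamp, price) tick shape are assumptions, not something taken from the existing software.

```python
def downsample(ticks, bucket_seconds):
    """Collapse (timestamp, price) ticks, sorted by time, into one bar per bucket.

    Returns a list of (bucket_start, open, high, low, close) tuples.
    """
    bars = []
    for ts, price in ticks:
        start = ts - ts % bucket_seconds  # start of the bucket this tick falls in
        if bars and bars[-1][0] == start:
            # Same bucket as the previous tick: widen high/low, move the close.
            _, o, h, l, _ = bars[-1]
            bars[-1] = (start, o, max(h, price), min(l, price), price)
        else:
            # First tick of a new bucket.
            bars.append((start, price, price, price, price))
    return bars
```

For a chart that scrolls back a few weeks, the bucket size could grow with the zoom level so that the number of bars sent stays roughly the width of the chart in pixels.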

Re:Wrap up by Anonymous Coward · 2005-05-17 11:15 · Score: 0

50B/item * 1200 items/sec = 60KB/sec, not a big deal. Just stuff it in append mode into a max-sized file (n entries per file), and the read/store part is done. Add extra processes (soft/hard) to re-read/process the raw data stored in those files as needed.

It seems you just need a good disk sub-system to store the incoming data.
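
A rough sketch of that append-only scheme, assuming a made-up 20-byte record layout (timestamp, price, volume) rather than the real ~50B format. The class name `TickLog` and the file naming pattern are also hypothetical. Because every record has the same size, record number i can be found with a single seek and no index is needed.

```python
import os
import struct

# Hypothetical record layout: 8-byte timestamp, 8-byte price, 4-byte volume.
RECORD = struct.Struct("<qdI")

class TickLog:
    """Append-only log of fixed-size records, split into files of max_records each."""

    def __init__(self, directory, max_records=1_000_000):
        self.directory = directory
        self.max_records = max_records
        os.makedirs(directory, exist_ok=True)

    def _path(self, file_no):
        return os.path.join(self.directory, "ticks-%06d.bin" % file_no)

    def count(self):
        """Total number of records, derived from the file sizes alone."""
        total, file_no = 0, 0
        while os.path.exists(self._path(file_no)):
            total += os.path.getsize(self._path(file_no)) // RECORD.size
            file_no += 1
        return total

    def append(self, records):
        """Append (timestamp, price, volume) tuples, rolling to a new file when one fills up."""
        index = self.count()
        out, current = None, -1
        try:
            for rec in records:
                file_no = index // self.max_records
                if file_no != current:
                    if out:
                        out.close()
                    out = open(self._path(file_no), "ab")
                    current = file_no
                out.write(RECORD.pack(*rec))
                index += 1
        finally:
            if out:
                out.close()

    def read(self, index):
        """Fetch record number index with one seek; no scan is needed."""
        file_no, slot = divmod(index, self.max_records)
        with open(self._path(file_no), "rb") as f:
            f.seek(slot * RECORD.size)
            return RECORD.unpack(f.read(RECORD.size))
```

If the records are written in time order, a binary search over `read()` would locate a timestamp without any separate index file.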

Re:NetCDF or HDF5 - interesting by Salis · 2005-05-17 05:11 · Score: 1

The people who develop these formats are used to dealing with large data sets that need to be read and written fast. I've seen terabyte files used as inputs/outputs for scientific computing applications. They've certainly thought about the fastest ways of doing I/O. You can even substitute your own FFIO routines (people using Crays do this).

You can set the buffer to whatever you want and it really depends on your computing architecture on how buffering is handled. Normally, the data is kept in memory until flushed or until the file is closed.

If you have more questions, check out their websites and send emails to their mailing lists.

--
Favorite /. tagline: "On the eighth day, God created FORTRAN." And it was good.

You should check out... by sexylicious · 2005-05-17 09:08 · Score: 1

data acquisition systems for large experiments. Things like the hardware and software used at particle physics labs on their detectors: lots of individual sensors in a huge array that has to be sampled a hell of a lot in one second.

Another thing to look into is testing for dynamic loads on cars or aircraft. At least for aircraft, they'll put thousands of accelerometers all over the frame to measure the various accelerations.

Both of those are prime examples of a similar system to yours. And such users would have insight into the problems you are looking at, as well as be potential customers in the future. :D

4GB not that much for a session... by coyote_oww · 2005-05-17 10:34 · Score: 1

4GB is peanuts. My best friend (same company as me) works on a system that does ~50MB/sec, max rate. That comes to 1GB/20secs. They have a proprietary box to do it, with 4 or 5 processors and a buttload of DSPs (they are storing essentially a set of 24 oscilloscope signals).

The application is vibration monitoring in industrial machinery. A reasonable "session" in such an environment would be a machine startup or stop - depending on the machine this could take several minutes to hours. For the "hours" scenarios, you are typically looking at heat soaking - run the machine up part way, soak, run it up some more, soak, etc. You want to avoid differential thermal expansion. That's where the rotor of the machine expands faster ('cause it's nearer the heat source) and grows right into the case of the machine. That's bad... :-(

Anyhow, I have no doubt there are even higher speed collection systems out there... likely for very specialized applications.

Re:Just use the file system - windows blows by amorsen · 2005-05-17 20:50 · Score: 1

How about just keeping the file explorer away from the directory? The file explorer tries watching for changes in the directories that you have open; that's probably what messes things up. The tree approach should help, but just remember that it's a workaround for broken applications -- NTFS itself is capable of handling many files in a directory.
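
The tree approach could look something like the sketch below, which spreads one-file-per-instrument across small subdirectories chosen by a hash of the symbol. The function name `instrument_path`, the `.dat` extension, and the choice of SHA-256 with two-character directory names are all assumptions for illustration.

```python
import hashlib
import os

def instrument_path(root, symbol, levels=2):
    """Return a file path for symbol under a tree of small subdirectories.

    Each level is named after two hex digits of a hash of the symbol, so no
    single directory holds more than 256 subdirectories.
    """
    digest = hashlib.sha256(symbol.encode("utf-8")).hexdigest()
    parts = [digest[2 * i: 2 * i + 2] for i in range(levels)]
    return os.path.join(root, *parts, symbol + ".dat")
```

The caller would still need to create the directories (for example with `os.makedirs(os.path.dirname(path), exist_ok=True)`) before opening the file for append.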

Anyway, the only alternative to using the OS file system is to implement your own file system in user space, or use a database as a file system. If you choose to roll your own, remember to implement fsck and possibly journalling, otherwise you risk losing everything if the power goes.
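
To give a feel for the recovery side, here is a minimal sketch of a checksummed append-only log with a startup check. This is a much simpler stand-in for a real fsck or journal, not an implementation of either: each record carries its length and a CRC32, and the recovery pass drops whatever torn tail a power cut left behind. The names `append_record` and `recover` and the header layout are hypothetical.

```python
import os
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload

def append_record(f, payload):
    """Write one length- and checksum-prefixed record and force it to disk."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())  # slow if done per record; batching is the usual trade-off

def recover(path):
    """Return the valid records in path and truncate any torn or corrupt tail."""
    records, good_end = [], 0
    with open(path, "r+b") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of file, or a header cut short
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # record was only partly written or is damaged
            records.append(payload)
            good_end = f.tell()
        f.truncate(good_end)  # discard everything after the last good record
    return records
```

This only protects the tail of a single append-only file. Anything that updates data in place, or has to keep several files consistent with each other, needs real journalling.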

--
Finally! A year of moderation! Ready for 2019?

Re:HP-IB and ISAM - Ahh now I know as name for it. by Decker-Mage · 2005-05-17 21:37 · Score: 1

I haven't looked in the C++ libs in quite a while but I would be rather surprised if the functionality were not in an existing library. I would, however, put serious thought into rolling your own. I'd offer to help but it's been far too long since I mucked with either C++ or rolling my own db code (25 years). Sadly, these days it's all SQL, XML, and web services, and that is about as interesting as watching paint dry, or grass grow {sigh}.

--
"[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go
