Slashdot Mirror


Community Test Data Repository?

BlizzyMadden inputs this query: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this." "Such a repository would be useful for files like the following:
Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.
Files like this would be great if developers were to share them to debug their own applications."

13 of 50 comments

  1. You could try Mangleme by Gopal.V · · Score: 4, Informative
    Mangleme generates malformed HTML for testing browsers.

    Another good idea is to pull a couple hundred websites with Wget -r :)

    Of course, Slashdot belongs in the "Broken HTML No-CSS Table Mess" variety of HTML (just like they call a Crushed Bean No-Froth Dark Latte a coffee)
    1. Re:You could try Mangleme by Gopal.V · · Score: 4, Informative
      >Another good idea is to pull a couple hundred websites with Wget -r :)

      Feels weird replying to my own post... but I remembered something else that I had: a copy of the Google Programming Contest data files. Get a whopping 16000 web pages in one shot from research.google.com. (wish they'd gzipped it - but content-encoding: gzip works too)

      Sadly, all those pages are from .edu websites :)
    2. Re:You could try Mangleme by LardBrattish · · Score: 3, Insightful

      These are all useful resources but it's not what he's asking. What he wants to know is: is there a project that deliberately collects test data in a GPL sort of way so developers don't have to generate the test data themselves...

      --
      What are you listening to? (http://megamanic.blogetery.com/)
  2. here's mine by DrSkwid · · Score: 3, Funny

    sed -E 's/<[^>]+>//g'

    =)
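
    A regex of this kind treats everything between < and > as a tag, so it trips over a > inside a quoted attribute and leaves script bodies in the output. As a rough sketch only (not the original poster's utility), the same job can be done with Python's standard html.parser module. The names TextExtractor and html_to_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so the output is easy to diff.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

    Calling html_to_text("<b>bold</b> and <i>italic</i>") returns the text with the tags removed; files that defeat a converter like this are exactly the sort of thing a shared test repository would collect.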

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
  3. or you could... by cyborch · · Score: 2, Informative

    ... just use lynx --dump.

    1. Re:or you could... by seanyboy · · Score: 2, Interesting
      --
      Training monkeys for world domination since 1439
  4. Sourceforge? by LardBrattish · · Score: 5, Interesting

    If there isn't a test data project maybe you could start one. If people agree that it's a good idea then it'll grow... if not...

    I believe the idea has merit and should be done. This would be useful for the developers of many FOSS applications. A "torture test" of nasty Excel files or Word files would help Open Office etc. HTML files would be good for the Mozilla team. Maybe they would be interested in providing the first few sets of data.

    I'd also recommend tying the automated regression tests to this open source test data so every developer could download the source & the test data and make sure the new feature doesn't break anything...

    Any new troublesome files could be added to the test data and new tests could be built to ensure that the software deals with them.
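
    A minimal sketch of what tying regression tests to shared test data might look like, in Python. The directory layout (each case.html next to a case.expected.txt) and the name run_regressions are assumptions made for this example, not part of any existing project.

```python
from pathlib import Path


def run_regressions(data_dir, convert):
    """Return the names of input files whose converted output differs
    from the stored expectation.

    Assumed (hypothetical) layout: every case.html has a
    case.expected.txt beside it in data_dir.
    """
    failures = []
    for source in sorted(Path(data_dir).glob("*.html")):
        expected_path = source.with_name(source.stem + ".expected.txt")
        if not expected_path.exists():
            # A new troublesome file with no recorded answer yet.
            failures.append(source.name + " (no expected output)")
            continue
        actual = convert(source.read_text(encoding="utf-8"))
        expected = expected_path.read_text(encoding="utf-8")
        # Ignore leading/trailing whitespace differences only.
        if actual.strip() != expected.strip():
            failures.append(source.name)
    return failures
```

    A developer would pass in their own converter function; an empty list back means the new feature did not break any of the shared cases.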

    --
    What are you listening to? (http://megamanic.blogetery.com/)
  5. Great idea. by seanyboy · · Score: 3, Insightful

    Not only that, but it'd be great to see things like lists of made up addresses and other test data.

    --
    Training monkeys for world domination since 1439
    1. Re:Great idea. by seanyboy · · Score: 4, Interesting

      Why the hell is that a troll? In the past I've wanted 100,000 or so mailing addresses to test an indexing routine on, and have ended up spending time writing a random address generator. If I'd been able to go to a site (like lorem ipsum), ask for 100,000 addresses in CSV format and download them as a zipped file, it'd have saved time. I'm sure I'm not the only developer this has happened to. Jeez.
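
      For illustration, a random address generator of the kind described can be quite small. This is a sketch under stated assumptions: the word lists, the US-style five-digit postcode, and the function names are all invented here, and the output is not meant to resemble real address data closely.

```python
import csv
import io
import random

# Tiny, made-up vocabularies; a real generator would use far larger lists.
STREETS = ["Oak", "Maple", "Cedar", "Elm", "Birch"]
SUFFIXES = ["St", "Ave", "Rd", "Ln"]
CITIES = ["Springfield", "Riverton", "Lakeside", "Fairview"]


def fake_addresses(count, seed=0):
    """Yield (street, city, postcode) tuples; the same seed repeats the same data."""
    rng = random.Random(seed)
    for _ in range(count):
        street = f"{rng.randint(1, 9999)} {rng.choice(STREETS)} {rng.choice(SUFFIXES)}"
        postcode = f"{rng.randint(0, 99999):05d}"
        yield street, rng.choice(CITIES), postcode


def addresses_as_csv(count, seed=0):
    """Return `count` fake addresses as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["street", "city", "postcode"])
    writer.writerows(fake_addresses(count, seed))
    return buffer.getvalue()
```

      Seeding the generator means two developers asking for addresses_as_csv(100000, seed=1) get identical files, which is most of what a shared download would provide.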

      --
      Training monkeys for world domination since 1439
  6. yes by Tom7 · · Score: 3, Funny

    I hear that the internet is a community-driven repository of html

  7. IAWTP by Clover_Kicker · · Score: 2, Interesting
    I once needed a few thousand names for test data. The only big list I could find was the list of men killed in Vietnam.

    Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.

    If anyone knows of (or starts) a project like this I'd probably contribute.

  8. Interesting Idea, but basically useless by Jahz · · Score: 5, Insightful

    The idea of a testing repository is quite interesting, but, in practice, a useless one.

    Such a repository would end up as no more than a garbage collection. Additionally, it is generally not too hard to create test data for most projects. Also, the chance that someone else has created test data for the exact problem you are working on is quite slim. And then there is always the most important point of them all:

    If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.

    --
    There are 10 types of people in the world. Those who understand binary and those who do not.
    1. Re:Interesting Idea, but basically useless by dubl-u · · Score: 2, Insightful
      >The idea of a testing repository is quite interesting, but, in practice, a useless one.

      Your imagination is pretty limited. This is of use in any area where developers will use similar data for lots of different things, especially in areas of active research. Some examples include:
      • mailing addresses - all sorts of apps need to parse international mailing addresses: wouldn't it be better to test with real samples?
      • email - corpuses of known good email and known spam email are necessary for any spam recognition tool
      • speech - for both speech recognition and speech synthesis, a large body of audio with a variety of speakers and accents is useful
      • text - there are several large databases of text on-line, useful for everything from machine translation research to historical linguistic analysis
      • web pages - right now one of my clients is working on a project to extract structured data from web pages. If somebody had a large corpus of messy HTML pages and matching clean structured data, it would be hugely useful in evaluating algorithms. Instead, we're doing that by hand.
      So yes, I agree, don't reinvent the wheel. But also, don't recreate a test dataset.
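
      To make the last example concrete: with a corpus of pages and matching hand-cleaned records, evaluating an extraction algorithm comes down to comparing its output with the known-good record for each page. A minimal sketch in Python follows; the function name and the flat field-to-value record shape are assumptions for illustration only.

```python
def score_extraction(predicted, gold):
    """Compare two dicts of field -> value and return (precision, recall).

    precision: share of predicted fields whose value matches the gold record.
    recall: share of gold fields that were predicted with the right value.
    """
    correct = sum(
        1 for field, value in predicted.items()
        if field in gold and gold[field] == value
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

      Averaging these scores over every page in a shared corpus would give different algorithms a common yardstick, which is the work currently being done by hand.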