Slashdot Mirror


Community Test Data Repository?

BlizzyMadden inputs this query: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this." "Such a repository would be useful for files like the following:
Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.
Files like this would be great if developers were to share them to debug their own applications."
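
As a rough illustration of how shared test files could drive the kind of regression testing described above, here is a minimal Python sketch. Everything in it is an assumption for illustration: `html_to_text` is a hypothetical stand-in for the submitter's converter (built on the standard library's `html.parser`), and `CASES` is a made-up handful of input/expected pairs standing in for files from a shared repository.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for the converter under test: keeps text, drops tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Convert HTML to plain text and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return " ".join("".join(parser.chunks).split())


# Made-up samples: each pairs an HTML input with the plain text it should produce.
CASES = [
    ("<p>Hello <b>world</b></p>", "Hello world"),
    ("<ul>\n<li>one</li>\n<li>two</li>\n</ul>", "one two"),
    ("<p>Fish &amp; chips</p>", "Fish & chips"),
    ("<p>unclosed <i>tag", "unclosed tag"),
]


def run_regression(cases, convert):
    """Return a list of (input, expected, actual) tuples for every failing case."""
    failures = []
    for html, expected in cases:
        actual = convert(html)
        if actual != expected:
            failures.append((html, expected, actual))
    return failures


if __name__ == "__main__":
    failed = run_regression(CASES, html_to_text)
    print(f"{len(CASES) - len(failed)} passed, {len(failed)} failed")
```

Because the cases are plain input/expected pairs, any other converter could be passed to `run_regression` in place of `html_to_text`, which is the sense in which such files could be shared between projects.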

4 of 50 comments

  1. Re:You could try Mangleme by LardBrattish · · Score: 3, Insightful

    These are all useful resources, but it's not what he's asking. What he wants to know is: is there a project that deliberately collects test data in a GPL sort of way so developers don't have to generate the test data themselves...

    --
    What are you listening to? (http://megamanic.blogetery.com/)
  2. Great idea. by seanyboy · · Score: 3, Insightful

    Not only that, but it'd be great to see things like lists of made-up addresses and other test data.

    --
    Training monkeys for world domination since 1439
  3. Interesting Idea, but basically useless by Jahz · · Score: 5, Insightful

    The idea of a testing repository is quite interesting, but, in practice, a useless one.

    Such a repository would end up as no more than a garbage collection. Additionally, it is generally not too hard to create test data for most projects. Also, the chance that someone else has created test data for the exact problem you are working on is quite slim. And then there is always the most important point of them all:

    If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.

    --
    There are 10 types of people in the world. Those who understand binary and those who do not.
    1. Re:Interesting Idea, but basically useless by dubl-u · · Score: 2, Insightful
      The idea of a testing repository is quite interesting, but, in practice, a useless one.

      Your imagination is pretty limited. This is of use in any area where developers will use similar data for lots of different things, especially in areas of active research. Some examples include:
      • mailing addresses - all sorts of apps need to parse international mailing addresses: wouldn't it be better to test with real samples?
      • email - corpuses of known good email and known spam email are necessary for any spam recognition tool
      • speech - for both speech recognition and speech synthesis, a large body of audio with a variety of speakers and accents is useful
      • text - there are several large databases of text on-line, useful for everything from machine translation research to historical linguistic analysis
      • web pages - right now one of my clients is working on a project to extract structured data from web pages. If somebody had a large corpus of messy HTML pages and matching clean structured data, it would be hugely useful in evaluating algorithms. Instead, we're doing that by hand.
      So yes, I agree, don't reinvent the wheel. But also, don't recreate a test dataset.
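
To make the web-pages example concrete, here is a small Python sketch of how a corpus of HTML pages with matching clean structured data might be used to evaluate an extraction algorithm. It is only an illustration under stated assumptions: the two-page `CORPUS`, the field names, and the `extract_title_only` extractor are all hypothetical, and field-level precision and recall are just one reasonable way to score the results.

```python
import re


def evaluate(corpus, extract):
    """Score an extractor against (html, expected_fields) pairs.

    Returns (precision, recall) counted over individual fields.
    """
    correct = predicted = wanted = 0
    for html, expected in corpus:
        actual = extract(html)
        # A field counts as correct only if it is expected and the value matches.
        correct += sum(
            1 for key, value in actual.items()
            if key in expected and expected[key] == value
        )
        predicted += len(actual)
        wanted += len(expected)
    precision = correct / predicted if predicted else 0.0
    recall = correct / wanted if wanted else 0.0
    return precision, recall


def extract_title_only(html):
    """Hypothetical toy extractor: finds the page title and nothing else."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip()} if match else {}


# Made-up labeled corpus: messy HTML paired with hand-written structured data.
CORPUS = [
    ("<html><title>Widget</title><body>Price: $5</body></html>",
     {"title": "Widget", "price": "5"}),
    ("<TITLE>Gadget</TITLE>", {"title": "Gadget"}),
]

if __name__ == "__main__":
    precision, recall = evaluate(CORPUS, extract_title_only)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

With a shared corpus, different extraction algorithms could be passed to `evaluate` and compared on the same pages, instead of each team labeling pages by hand.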