Slashdot Mirror


Community Test Data Repository?

BlizzyMadden inputs this query: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this." "Such a repository would be useful for files like the following:
Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.
Files like this would be great if developers were to share them to debug their own applications."

50 comments

  1. You could try Mangleme by Gopal.V · · Score: 4, Informative
    Mangleme generates Malformed HTML used for testing browsers.

    Another good idea is to pull a couple hundred websites with Wget -r :)

    Of course, slashdot belongs in the "Broken HTML No-Css Table Mess" variety of HTML (just like they call Crushed Bean No-Froth Dark Latte - a coffee)
    1. Re:You could try Mangleme by Gopal.V · · Score: 4, Informative
      >Another good idea is to pull a couple hundred websites with Wget -r :)

      Feels weird replying to my own post... but I remembered something else that I had. A copy of the Google Programming contest data files. Get a whopping 16000 web pages in one shot from research.google.com. (wish they'd gzipped it - but content-encoding: gzip works too)

      Sadly, all those pages are from .edu websites :)
    2. Re:You could try Mangleme by LardBrattish · · Score: 3, Insightful

      These are all useful resources but it's not what he's asking. What he wants to know is: is there a project that deliberately collects test data in a GPL sort of way so developers don't have to generate the test data themselves...

      --
      What are you listening to? (http://megamanic.blogetery.com/)
    3. Re:You could try Mangleme by JohnFluxx · · Score: 1

      Also try the pages that kde use. They are in the kde cvs tree:

      http://webcvs.kde.org/khtmltests/

    4. Re:You could try Mangleme by Anonymous Coward · · Score: 0

      // Sadly, all those pages are from .edu websites :)
      Googlebot -> Playboy, I crawl it for the text.

  2. here's mine by DrSkwid · · Score: 3, Funny

    sed -E 's/<[^>]+>//g'

    =)

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    1. Re:here's mine by Anonymous Coward · · Score: 0

      What about this:

      <a href="/" title="gotcha > here">click here</a>

    2. Re:here's mine by Anonymous Coward · · Score: 0

      It was a joke, but you can see other "weird HTML" examples here.

    3. Re:here's mine by DrSkwid · · Score: 1

      garbage in, garbage out

      you also forgot

      <a href="/"
      >click here</a>

      did you notice the =) ?

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
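
The two counterexamples in this sub-thread show where a tag-stripping regex falls over. Below is a minimal sketch comparing the regex with Python's standard `html.parser`; the class and function names (`TextExtractor`, `strip_tags_regex`, `strip_tags_parser`) are made up for illustration, and this is not a full HTML-to-text converter.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_regex(html):
    # Same idea as the sed one-liner: delete anything that looks like <...>.
    return re.sub(r"<[^>]+>", "", html)


def strip_tags_parser(html):
    # Let a real tokenizer decide where each tag starts and ends.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


tricky = '<a href="/" title="gotcha > here">click here</a>'
multiline = '<a href="/"\n>click here</a>'

print(repr(strip_tags_regex(tricky)))      # leaks part of the title attribute
print(repr(strip_tags_parser(tricky)))     # just the link text
print(repr(strip_tags_parser(multiline)))  # tag split across two lines
```

The multi-line case trips up sed in particular because sed works one line at a time, so a pattern can never match a tag that spans a line break.
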
  3. or you could... by cyborch · · Score: 2, Informative

    ... just use lynx --dump.

    1. Re:or you could... by Anonymous Coward · · Score: 1, Funny

      Yeah, but who wants all the bloated dependencies of that?

    2. Re:or you could... by Christopheles · · Score: 1

      What is this? 1999? elinks -dump

    3. Re:or you could... by Anonymous Coward · · Score: 0
      there != their
      That's why I always use they're. That way your correct alot of the time.
    4. Re:or you could... by seanyboy · · Score: 2, Interesting
      --
      Training monkeys for world domination since 1439
    5. Re:or you could... by RabidSquirrel · · Score: 1

      We did some tests with lynx and links and found that links produced slightly nicer looking output.

      Unfortunately both lynx and links had a maximum line length when dumping (width option). This created random line breaks in the middle of paragraphs and what not.

      You can increase the max size of the width by editing the links source and recompiling. Clip from an email I sent to my coworker about changing this limit:

      "In case you want to play with links before I get there in the morning.

      LI character replacement:
      In parser.c under the html_li function. Line 858. Looks like there may be some sort of option to define an alternate replacement too

      Size limit:
      In default.c under the option struct. Line 1177

      I set it to 65536 which should be the max size of an integer right? For some reason it lets me go over that though. wtf?"

      I didn't test this much (nor do I know C++) so there may be more to change. So far the output looks very nice though.

    6. Re:or you could... by cyborch · · Score: 1

      I set it to 65536 which should be the max size of an integer right? For some reason it lets me go over that though. wtf?"

      Max size of a 16-bit unsigned integer is 65535. Today most integers are 32 bits or larger, leaving you with a maximum of at least 4294967295, though I wouldn't recommend a max line length that high, since lynx most likely (I didn't look at the lynx source) allocates enough memory to store the entire line, and a 4 GB memory footprint per line of output seems a bit excessive.
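
The limits quoted above follow from the bit width: an unsigned n-bit integer tops out at 2^n - 1. A minimal sketch of that arithmetic, with a made-up helper name `max_unsigned`:

```python
def max_unsigned(bits):
    """Largest value an unsigned integer of the given bit width can hold."""
    return (1 << bits) - 1


for bits in (16, 32):
    print(bits, max_unsigned(bits))

# A line buffer sized to the 32-bit maximum, at one byte per character,
# would need about this many GiB:
print(max_unsigned(32) / 2**30)
```
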

  4. Sourceforge? by LardBrattish · · Score: 5, Interesting

    If there isn't a test data project maybe you could start one. If people agree that it's a good idea then it'll grow... if not...

    I believe the idea has merit and should be done. This would be useful for the developers of many FOSS applications. A "torture test" of nasty Excel files or Word files would help Open Office etc. HTML files would be good for the Mozilla team. Maybe they would be interested in providing the first few sets of data.

    I'd also recommend tying the automated regression tests to this open source test data so every developer could download the source & the test data and make sure the new feature doesn't break anything...

    Any new troublesome files could be added to the test data and new tests could be built to ensure that the software deals with them.

    --
    What are you listening to? (http://megamanic.blogetery.com/)
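
One way the "regression tests tied to shared test data" idea could look in practice is sketched below. It assumes a hypothetical layout in which every input file `name.html` sits next to a `name.expected.txt` holding the known-good output; the layout, the function names and the toy converter in `demo` are all assumptions for illustration, not an existing project's convention.

```python
import re
import tempfile
from pathlib import Path


def run_regression(data_dir, convert, in_suffix=".html",
                   expected_suffix=".expected.txt"):
    """Run convert() over every input file and compare with the stored output.

    Returns a list of (file name, reason) tuples, empty when everything passes.
    """
    failures = []
    for source in sorted(Path(data_dir).glob(f"*{in_suffix}")):
        expected_path = source.with_name(source.stem + expected_suffix)
        if not expected_path.exists():
            failures.append((source.name, "missing expected output"))
            continue
        actual = convert(source.read_text(encoding="utf-8"))
        if actual != expected_path.read_text(encoding="utf-8"):
            failures.append((source.name, "output differs"))
    return failures


def demo():
    """Build a tiny throwaway data set and run a naive converter over it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.html").write_text("<b>hi</b>", encoding="utf-8")
        (root / "a.expected.txt").write_text("hi", encoding="utf-8")
        (root / "b.html").write_text("<i>orphan</i>", encoding="utf-8")
        naive = lambda html: re.sub(r"<[^>]+>", "", html)
        return run_regression(root, naive)


print(demo())
```

A newly found troublesome file would then be added as one more input/expected pair, and every project sharing the data would pick it up on its next test run.
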
    1. Re:Sourceforge? by Kiaser+Zohsay · · Score: 1

      Mozilla has done a tiny bit of html testing.

      --
      I am not your blowing wind, I am the lightning.
  5. Great idea. by seanyboy · · Score: 3, Insightful

    Not only that, but it'd be great to see things like lists of made up addresses and other test data.

    --
    Training monkeys for world domination since 1439
    1. Re:Great idea. by seanyboy · · Score: 4, Interesting

      Why the hell is that a troll? In the past I've wanted 100,000 or so mailing addresses to test an indexing routine on, and have ended up spending time writing a random address generator. If I'd been able to go to a site (like lorem ipsum), ask for 100,000 addresses in CSV format and download them as a zipped file, it'd have saved time. I'm sure I'm not the only developer this has happened to. Jeez.

      --
      Training monkeys for world domination since 1439
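
For readers who end up writing their own generator anyway, here is a rough sketch of a random address generator that writes CSV. The street, suffix and city lists are invented placeholders, the addresses are not real or validated, and the function names are hypothetical.

```python
import csv
import io
import random

# Placeholder vocabularies; a real test set would want far more variety.
STREETS = ["Oak", "Maple", "Cedar", "Elm", "Washington", "Lake"]
SUFFIXES = ["St", "Ave", "Rd", "Blvd"]
CITIES = [("Springfield", "IL"), ("Portland", "OR"), ("Austin", "TX")]
FIELDS = ["street", "city", "state", "zip"]


def fake_addresses(count, seed=0):
    """Yield `count` made-up addresses; the same seed gives the same data."""
    rng = random.Random(seed)
    for _ in range(count):
        city, state = rng.choice(CITIES)
        yield {
            "street": f"{rng.randint(1, 9999)} {rng.choice(STREETS)} "
                      f"{rng.choice(SUFFIXES)}",
            "city": city,
            "state": state,
            "zip": f"{rng.randint(0, 99999):05d}",
        }


def write_csv(rows, out):
    """Write rows to an open text file object and return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    written = 0
    for row in rows:
        writer.writerow(row)
        written += 1
    return written


buffer = io.StringIO()
print(write_csv(fake_addresses(100000), buffer), "addresses generated")
```

Seeding the generator keeps the data set reproducible, which matters if the addresses feed a regression test.
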
    2. Re:Great idea. by Justice8096 · · Score: 1

      Yes - especially if you had common misspellings of the streets.

  6. Mozilla has it by Anonymous Coward · · Score: 1, Informative

    Mozilla has a plaintext serializer for HTML.

    Vidar Braut Haarr
    http://www.q1n.org/

  7. css zen garden by Free_Trial_Thinking · · Score: 1

    Here's a python script I wrote to download all the zen garden examples. It works by incrementing the url and getting the next page. (myutils.pad turns '1' into '001') This puts all the pages into one big file, but you could easily make it do separate files:

    import os,sys,time,urllib2,urlparse,re
    import myutils

    baseurl=r'http://www.csszengarden.com/'
    for i in range(1,146):
        paddedi=myutils.pad(str(i),3,'0',True)
        url=baseurl + paddedi + '/' + paddedi + '.css'
        print 'trying: ' + url
        try:
            urlfile=urllib2.urlopen(url)
            content=urlfile.read()
            urlfile.close()
            savedurl=file(paddedi+'.css','w')
            savedurl.write(content+'\n')
            savedurl.close()
        except Exception, inst:
            print "problem at " + url
            print inst # __str__ allows args to be printed directly

    1. Re:css zen garden by DrSkwid · · Score: 1


      those who don't know unix .......

      curl -f 'http://www.csszengarden.com/[001-146]/[001-146].css' -o 'csszengarden#1.css'

      ok it does a 145 * 145 extra requests but hey, who cares !!

      btw how can you trust the design advice of a site that has dark brown text on a lighter brown background and grey body text. awful, try reading that when you're over 65 and your eyes get 30% less contrast!

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    2. Re:css zen garden by Anonymous Coward · · Score: 0

      I could write that in one line of perl if I wanted to.

    3. Re:css zen garden by Anonymous Coward · · Score: 0
      You can replace this:
      paddedi=myutils.pad(str(i),3,'0',True)
      with this:
      paddedi='%03d' % i
    4. Re:css zen garden by neglige · · Score: 1

      try reading that when you're over 65 and your eyes get 30% less contrast!

      Don't read. "Become one with the web." (SCNR)

      --
      My cats ate my karma. They also wrote this comment.
    5. Re:css zen garden by Anonymous Coward · · Score: 0

      You could also be less lazy if you wanted to.

    6. Re:css zen garden by Free_Trial_Thinking · · Score: 1

      Thanks, good idea. How does it work? Is it some kind of formatting instruction deal?
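
On the question of how `'%03d' % i` works: it is Python's printf-style string formatting. `d` means "format as a decimal integer", `3` is the minimum width, and the leading `0` asks for zero padding instead of spaces. A small sketch follows; the function name `pad3` is made up here, and the two alternatives shown are standard Python equivalents rather than anything from the thread.

```python
def pad3(i):
    # %d = decimal integer, 3 = minimum field width, 0 = pad with zeros
    return '%03d' % i


for i in (1, 42, 145):
    # Three equivalent spellings of the same zero-padding
    print(pad3(i), str(i).zfill(3), format(i, '03d'))
```

Numbers wider than the field are left alone, so `pad3(1000)` gives `'1000'` rather than a truncated value.
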

  8. dmoz... by Anonymous Coward · · Score: 0

    www.dmoz.org

    Tons of links there...that's where most web crawlers/caches start to get pages. Just follow some links and get some data from there.

  9. Proprietary Problems... by Justice8096 · · Score: 1

    I'd like to see something like this centralized for everything... (databases, C++ compilers, etc...) but there would need to be a way to anonymously post, because otherwise corporate counterintelligence could be gleaned from checking which things most companies check for (and don't check for).
    For your purposes, check out www.org . They have "test suites" that check the web standards compliance of browsers, readers, HTML, CSS, etc... I've used them whenever I do web sites as a way of assuring that my display difficulties aren't due to the inabilities of the browser being used.

  10. yes by Tom7 · · Score: 3, Funny

    I hear that the internet is a community-driven repository of html

  11. IAWTP by Clover_Kicker · · Score: 2, Interesting
    I once needed a few thousand names for test data. The only big list I could find was the list of men killed in Vietnam.

    Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.

    If anyone knows of (or starts) a project like this I'd probably contribute.

    1. Re:IAWTP by archeopterix · · Score: 1
      I once needed a few thousand names for test data. The only big list I could find was the list of men killed in Vietnam

      Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.

      I would just scan and OCR a few pages from a phone book. As far as I know, the data in the phonebook cannot be copyrighted, although there might be some privacy protection laws that forbid keeping databases of personal data without some form of consent from the people whose data you gather. To be on the safe side I'd make a separate list of first names and family names and re-match them at random.
    2. Re:IAWTP by Thng · · Score: 1

      I used the US Census bureau list of names for a school project once (this is the 1990 listing). Wrote a small perl script that took random names from each file and put them together for a full name.
      There are last names, men's first names and women's first names files.
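
The "separate lists, re-matched at random" approach described in these two replies can be sketched in a few lines of Python. This assumes each input line starts with the name followed by whitespace-separated statistics, which is how the sample lines below are laid out; check the actual files before relying on that. The function names and sample lines are illustrative only.

```python
import random


def load_names(lines):
    """Take the first whitespace-separated token of each non-empty line."""
    return [line.split()[0].title() for line in lines if line.strip()]


def random_full_names(first_names, last_names, count, seed=None):
    """Pair first and last names at random, so no output row is a real person."""
    rng = random.Random(seed)
    return [f"{rng.choice(first_names)} {rng.choice(last_names)}"
            for _ in range(count)]


# Stand-ins for the contents of the first-name and last-name files.
first = load_names(["MARY 2.629 2.629 1", "JAMES 3.318 3.318 1", ""])
last = load_names(["SMITH 1.006 1.006 1", "JOHNSON 0.810 1.816 2"])

for name in random_full_names(first, last, 5, seed=42):
    print(name)
```
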

  12. And what about media tags? by ciroknight · · Score: 1

    To expand the original prompt: how about media tags? EXIF, ID3, etc?

    --
    "Victory means exit strategy, and it's important for the President to explain to us what the exit strategy is." G.W.Bush
  13. Interesting Idea, but basically useless by Jahz · · Score: 5, Insightful

    The idea of a testing repository is quite interesting, but, in practice, a useless one.

    Such a repository would end up as no more than a garbage collection. Additionally, it is generally not too hard to create test data for most projects. Also, the chance that someone else has created test data for the exact problem you are working on is quite slim. And then there is always the most important point of them all:

    If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.

    --
    There are 10 types of people in the world. Those who understand binary and those who do not.
    1. Re:Interesting Idea, but basically useless by thpr · · Score: 1
      Such a repository would end up as no more than a garbage collection.

      I fear that this is a significant problem, but disagree with some of the rest of your analysis.

      If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.

      You have a powerful point, but that one solution may not work for everyone. It may not be in a suitable programming language or it might be an unusable license. Also, just because one issue with a file format has been solved (playing an mp3, for example) doesn't mean they all have (modifying the ID3 tags, for example) [In other words, while the very specific test data may not be reusable, more general files are reusable across a domain of problems]

      On the other hand, to add to Jahz's point: One other challenge is that the "debugging" of a library of test files (to determine the "right" answer) may take longer than explicitly building unit tests. The library of files would need to be well documented or well supported; and would have to contain both positive and negative examples of good files. This makes it a greater challenge than just having a list of files of a given file format.

    2. Re:Interesting Idea, but basically useless by Jahz · · Score: 1

      Good point.

      Although CPAN and SourceForge host almost only GPL'd (or MIT'd etc) code. Thus you should not have a problem using it as long as you license the derivative works under an equal or lesser restricting license.

      Also your point about other solutions being very close to what is needed, but not close enough, was interesting. Such a collection would be far more beneficial if the testing files came with a list of OSS that used them. That way you can see how other developers used the testing code.

      All in all, this repository would require a lot more funding than is viable. You would need tons of space and decent bandwidth just to serve "garbage." An interesting idea, however impractical.

      --
      There are 10 types of people in the world. Those who understand binary and those who do not.
    3. Re:Interesting Idea, but basically useless by dubl-u · · Score: 2, Insightful
      The idea of a testing repository is quite interesting, but, in practice, a useless one.

      Your imagination is pretty limited. This is of use in any area where developers will use similar data for lots of different things, especially in areas of active research. Some examples include:
      • mailing addresses - all sorts of apps need to parse international mailing addresses: wouldn't it be better to test with real samples?
      • email - corpuses of known good email and known spam email are necessary for any spam recognition tool
      • speech - for both speech recognition and speech synthesis, a large body of audio with a variety of speakers and accents is useful
      • text - there are several large databases of text on-line, useful for everything from machine translation research to historical linguistic analysis
      • web pages - right now one of my clients is working on a project to extract structured data from web pages. If somebody had a large corpus of messy HTML pages and matching clean structured data, it would be hugely useful in evaluating algorithms. Instead, we're doing that by hand.
      So yes, I agree, don't reinvent the wheel. But also, don't recreate a test dataset.
    4. Re:Interesting Idea, but basically useless by Meostro · · Score: 1

      Also, something like facial recognition needs large test datasets, and it's never a "solved" problem. There's always a way to do it faster or better or more easily. Other things like Canterbury Corpus or Calgary Corpus are datasets used for comparison between compression algorithms. Meaningful comparisons can be made between different algorithms based on how well they perform on them simply because they've been used enough and are standard enough.

      I'm so interested in this that I just registered gpldata.com... finally something useful to do with my free time!!!
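
The value of a shared corpus for comparing algorithms can be illustrated with the compressors in Python's standard library. This is only a sketch: the sample bytes are a stand-in, and a real comparison would read the corpus files instead. The function name `compression_ratios` is made up.

```python
import bz2
import lzma
import zlib


def compression_ratios(data):
    """Return compressed size / original size for each stdlib compressor."""
    compressors = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(fn(data)) / len(data)
            for name, fn in compressors.items()}


# Stand-in input; swap in the bytes of a corpus file for a real comparison.
sample = b"the quick brown fox jumps over the lazy dog\n" * 1000

for name, ratio in sorted(compression_ratios(sample).items()):
    print(f"{name}: {ratio:.4f}")
```

Numbers from different tools are only comparable when everyone runs them on the same input, which is the point of a standard dataset.
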

  14. I'd love to help... by Ciaran_H · · Score: 1

    I would definitely be interested in helping with making something like this if it turns out there isn't one (or if there is, I'd be interested in helping to maintain it). It sounds like a good idea.

  15. Crystal reports? by Destoo · · Score: 0, Offtopic

    I need a Crystal report to plain text converter.

    Anyone can cook up a script or something? I really can't make sense out of them...

    just drop me a note at my gmail if you'd like to try to help.

    --
    Nouvelles de jeux et technologies en français. TC
  16. Just found an existing repository by malcomvetter · · Score: 1

    It's located at www.theWholeDangInternet.com. For an older copy try the Internet Archive.

    1. Re:Just found an existing repository by BlizzyMadden · · Score: 1

      Oh, snap! You got me. No, my whole point is test data in all sorts of formats, such as PDF and Excel. Test data that is known to cause all sorts of problems. For example, if I have a Word file that crashes OO.org at one point, why not offer that file publicly to developers of other products (Abiword, KWord) to see if they perhaps have similar problems.

  17. Great Idea by Agent_9191 · · Score: 1

    I think this would be a great idea because it would give developers a great starting point for applications. Specifically if your application can handle the files in the repository the way you expect them to, then you've done good. Granted you may be reinventing the wheel because if it's up there, it means that somebody else may have solved the problem already. OR they used the file to solve a different problem...Either way it'd be a great thing to have. Sounds like it'd be prime for Sourceforge (as long as it actually gets built out)...

  18. Use html2text by devillion · · Score: 1
    Works fine for me.

    link

  19. International languages by tungwaiyip · · Score: 1

    It would be great to have web pages for all natural languages that the current computer infrastructure supports.

  20. Test pages from the W3C by vil · · Score: 1

    This might be what you're looking for:

    http://www.w3.org/MarkUp/Test/

  21. For all sort of "languages" by Anonymous Coward · · Score: 0

    This would be useful for any kind of "language". For programming languages also. Any greedy wget for html or so is no good choice. Regression tests should be efficient, so a healthy balance between the amount of test data and coverage (amount of language constructs covered by the tests) is a must.

    There is some kind of public repository for the Java programming language at http://www-124.ibm.com/developerworks/oss/cvs/jikes/~checkout~/jacks/jacks.html

    Unfortunately (or not), it depends on tcl/tk. With scripting, they can reuse common templates and concentrate on the important parts for the specific tests. Good for reading and writing.