Slashdot Mirror


Community Test Data Repository?

BlizzyMadden inputs this query: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this." "Such a repository would be useful for files like the following:
Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.
Files like this would be great if developers were to share them to debug their own applications."
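
As a rough illustration of the regression setup the submitter describes, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the toy converter built on the standard library's html.parser stands in for the submitter's utility, and the layout (each NAME.html paired with a NAME.txt holding the expected text) is just one possible convention for a shared test-data directory.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Toy HTML-to-text converter; a stand-in for the utility under test."""

    SKIP = {"script", "style"}  # element content that should not appear as text

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after unclosed tags
    # Collapse whitespace runs so the comparison ignores layout differences.
    return " ".join("".join(parser.parts).split())


def run_regression(case_dir):
    """Return the names of cases whose output differs from the saved .txt file."""
    failures = []
    for html_file in sorted(Path(case_dir).glob("*.html")):
        expected_file = html_file.with_suffix(".txt")
        if not expected_file.exists():
            failures.append(html_file.name)  # a missing expected file counts as a failure
            continue
        expected = " ".join(expected_file.read_text(encoding="utf-8").split())
        actual = html_to_text(html_file.read_text(encoding="utf-8"))
        if actual != expected:
            failures.append(html_file.name)
    return failures
```

Calling run_regression on a directory of contributed cases returns the list of files that no longer convert as expected, so an empty list means nothing regressed. A shared repository would mostly need to agree on a pairing convention like this one.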

4 of 50 comments

  1. You could try Mangleme by Gopal.V · · Score: 4, Informative
    Mangleme generates malformed HTML for testing browsers.

    Another good idea is to pull a couple hundred websites with Wget -r :)

    Of course, Slashdot belongs in the "Broken HTML No-CSS Table Mess" variety of HTML (just as they call a Crushed Bean No-Froth Dark Latte a coffee).
    1. Re:You could try Mangleme by Gopal.V · · Score: 4, Informative
      >Another good idea is to pull a couple hundred websites with Wget -r :)

      Feels weird replying to my own post, but I remembered something else that I had: a copy of the Google Programming Contest data files. Get a whopping 16000 web pages in one shot from research.google.com. (Wish they'd gzipped it, but Content-Encoding: gzip works too.)

      Sadly, all those pages are from .edu websites :)
  2. or you could... by cyborch · · Score: 2, Informative

    ... just use lynx --dump.
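
    For anyone who wants lynx's output as a reference to compare their own converter against, a hedged sketch follows. It assumes lynx is installed and on the PATH; the helper name lynx_dump is hypothetical, and it uses the single-dash -dump and -nolist options as the lynx manual spells them (-nolist drops the numbered link list that -dump appends).

```python
import shutil
import subprocess


def lynx_dump(path):
    """Render an HTML file to plain text with lynx.

    Returns None when lynx is not installed, so callers can skip the comparison.
    """
    lynx = shutil.which("lynx")
    if lynx is None:
        return None
    result = subprocess.run(
        [lynx, "-dump", "-nolist", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if lynx exits with an error
    )
    return result.stdout
```

    Lynx wraps and indents its output, so a comparison against another converter would probably need to normalize whitespace first.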

  3. Mozilla has it by Anonymous Coward · · Score: 1, Informative

    Mozilla has a plaintext serializer for HTML.

    Vidar Braut Haarr
    http://www.q1n.org/