Community Test Data Repository?
BlizzyMadden asks: "Currently I am working on a small utility to convert HTML to plain text. As I test this, I create more and more different types of HTML files to regression test it. I wonder to myself if these test files that I make would be beneficial to other developers who may be doing similar work. To expand on this thought, I wonder if there is a community-based repository of test data anywhere that developers can use and contribute to. Just curious if anyone knows of any project website out there that offers this."
"Such a repository would be useful for files like the following:
Complex HTML files.
RTF and Word files with lots of formatting.
Large text files.
Excel files with complex equations and macros.
Files like this would be great if developers were to share them to debug their own applications."
Another good idea is to pull a couple hundred websites with Wget -r :)
Of course, Slashdot belongs in the "Broken HTML No-CSS Table Mess" variety of HTML (just like they call Crushed Bean No-Froth Dark Latte a coffee).
Quidquid latine dictum sit, altum videtur
sed -E 's/<[^>]+>//g'
=)
There are places where the networks are not touching, and there are places where they are. -Boeing's Lori Gunter
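For comparison, here is a minimal Python sketch of the same idea as the sed one-liner, next to a slightly more careful version built on the standard library's html.parser. The names strip_tags_regex, TextExtractor and html_to_text are made up for illustration, and this is only a starting point, not a complete converter: the regex leaves script bodies and entities like &amp; in the output, while the parser-based version skips script/style contents and decodes entities.

```python
import re
from html.parser import HTMLParser


def strip_tags_regex(html):
    """Rough equivalent of the sed one-liner: delete anything shaped like a tag."""
    return re.sub(r"<[^>]+>", "", html)


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the output is easy to diff in tests.
    return " ".join("".join(parser.parts).split())
```

Pages with inline scripts, entities, or unclosed tags are exactly the kind of test files where the two approaches give different answers.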
... just use lynx --dump.
If there isn't a test data project maybe you could start one. If people agree that it's a good idea then it'll grow... if not...
I believe the idea has merit and should be done. This would be useful for the developers of many FOSS applications. A "torture test" of nasty Excel files or Word files would help Open Office etc. HTML files would be good for the Mozilla team. Maybe they would be interested in providing the first few sets of data.
I'd also recommend tying the automated regression tests to this open source test data so every developer could download the source & the test data and make sure the new feature doesn't break anything...
Any new troublesome files could be added to the test data and new tests could be built to ensure that the software deals with them.
What are you listening to? (http://megamanic.blogetery.com/)
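One way the regression idea above could look in practice, as a rough sketch: every input file in a shared test-data directory has an expected-output file next to it, and the runner reports which inputs no longer match. The layout (foo.html beside foo.txt), the function name run_regression and the convert callback are all assumptions made for this example, not an existing project's conventions.

```python
import pathlib


def run_regression(data_dir, convert):
    """Run convert() on every *.html file in data_dir and compare the result
    with the expected *.txt file beside it. Returns the names that failed."""
    failures = []
    for source in sorted(pathlib.Path(data_dir).glob("*.html")):
        golden = source.with_suffix(".txt")
        if not golden.exists():
            # A new troublesome file was added without an expected output yet.
            failures.append(source.name + " (no expected output)")
            continue
        actual = convert(source.read_text(encoding="utf-8"))
        if actual != golden.read_text(encoding="utf-8"):
            failures.append(source.name)
    return failures
```

Adding a newly discovered problem file then means dropping the input and its expected output into the directory; the runner picks it up without code changes.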
Not only that, but it'd be great to see things like lists of made up addresses and other test data.
Training monkeys for world domination since 1439
Mozilla has a plaintext serializer for HTML.
Vidar Braut Haarr
http://www.q1n.org/
Here's a Python script I wrote to download all the Zen Garden examples. It works by incrementing the number in the URL and getting the next stylesheet. (zfill(3) turns '1' into '001'.) As written it saves each stylesheet to its own file, but you could easily make it append everything to one big file:
import urllib.request

baseurl = 'http://www.csszengarden.com/'
for i in range(1, 146):
    paddedi = str(i).zfill(3)  # '1' -> '001'
    url = baseurl + paddedi + '/' + paddedi + '.css'
    print('trying: ' + url)
    try:
        # fetch the stylesheet; the with block closes the connection
        with urllib.request.urlopen(url) as urlfile:
            content = urlfile.read()
        # save the raw bytes, e.g. to 001.css
        with open(paddedi + '.css', 'wb') as savedurl:
            savedurl.write(content + b'\n')
    except Exception as inst:
        print('problem at ' + url)
        print(inst)  # __str__ allows args to be printed directly
www.dmoz.org
Tons of links there... that's where most web crawlers/caches start to get pages. Just follow some links and get some data from there.
I'd like to see something like this centralized for everything... (databases, C++ compilers, etc...) but there would need to be a way to anonymously post, because otherwise corporate counterintelligence could be gleaned from checking which things most companies check for (and don't check for).
For your purposes, check out www.org . They have "test suites" that check the web standards compliance of browsers, readers, HTML, CSS, etc... I've used them whenever I build web sites, to make sure my display problems aren't due to the limitations of the browser being used.
I hear that the internet is a community-driven repository of html
Anyone have a less disturbing list of real or fake names? I suppose someone could grab some data from a genealogy site, strip out just the names, and use that.
If anyone knows of (or starts) a project like this I'd probably contribute.
To expand the original prompt: how about media tags? EXIF, ID3, etc?
"Victory means exit strategy, and it's important for the President to explain to us what the exit strategy is." G.W.Bush
The idea of a testing repository is quite interesting, but, in practice, a useless one.
Such a repository would end up as no more than a garbage collection. Additionally, it is generally not too hard to create test data for most projects. Also, the chance that someone else has created test data for the exact problem you are working on is quite slim. And then there is always the most important point of them all:
If someone has already created test data for your specific problem, they have probably already solved your problem! Enter repositories like CPAN and SourceForge.
There are 10 types of people in the world. Those who understand binary and those who do not.
I would definitely be interested in helping with making something like this if it turns out there isn't one (or if there is, I'd be interested in helping to maintain it). It sounds like a good idea.
I need a Crystal Reports to plain text converter.
Can anyone cook up a script or something? I really can't make sense of the report files...
Just drop me a note at my gmail if you'd like to try to help.
Gaming and technology news in French. TC
It's located at www.theWholeDangInternet.com. For an older copy try the Internet Archive.
I think this would be a great idea because it would give developers a great starting point for applications. Specifically, if your application can handle the files in the repository the way you expect it to, then you've done good. Granted, you may be reinventing the wheel, because if a file is up there, it means that somebody else may have solved the problem already. Or they used the file to solve a different problem... Either way it'd be a great thing to have. Sounds like it'd be prime for SourceForge (as long as it actually gets built out)...
It would be great to have web pages for all natural languages that the current computer infrastructure supports.
This might be what you're looking for:
http://www.w3.org/MarkUp/Test/
This would be useful for any kind of "language", programming languages included. A greedy wget of HTML pages is not a good choice, though. Regression tests should be efficient, so a healthy balance between the amount of test data and coverage (the number of language constructs covered by the tests) is a must.
There is some kind of public repository for the Java programming language at http://www-124.ibm.com/developerworks/oss/cvs/jikes/~checkout~/jacks/jacks.html
Unfortunately (or not), it depends on tcl/tk. With scripting, they can reuse common templates and concentrate on the important parts for the specific tests. Good for reading and writing.
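The balance between amount of test data and coverage mentioned above could be approached roughly like this: given which constructs each candidate file exercises, greedily keep the file that adds the most not-yet-covered constructs until everything is covered. The function name pick_test_files and the file names in the usage note are made up for this sketch, and the coverage map itself would have to come from somewhere (for example, a parser that lists the tags each file contains). Greedy selection gives a small set, not necessarily the smallest possible one.

```python
def pick_test_files(coverage):
    """Greedy set cover. coverage maps a file name to the set of constructs
    that file exercises. Returns a short list of files covering all of them."""
    remaining = set().union(*coverage.values())
    chosen = []
    while remaining:
        # Take the file covering the most still-uncovered constructs;
        # iterating over sorted names makes ties break alphabetically.
        best = max(sorted(coverage), key=lambda name: len(coverage[name] & remaining))
        chosen.append(best)
        remaining -= coverage[best]
    return chosen
```

With tables.html covering {table, tr, td}, forms.html covering {form, input}, kitchen_sink.html covering {table, tr, td, form, img} and image.html covering {img}, only kitchen_sink.html and forms.html are kept, and the other two files add nothing new.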