Domain: crummy.com
Stories and comments across the archive that link to crummy.com.
Comments · 23
-
Re:cryptographic signature?
The thing is, typos happen.
Sometimes it's not even that. Beautiful Soup is the module for parsing HTML and XML files. However, beautifulsoup (bs) is the legacy version and beautifulsoup4 (bs4) is the version that everyone should be using. It's very easy to install the former when you need the latter.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
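One way to see which of the two actually ended up in an environment is to ask the packaging metadata. This is only a minimal sketch, assuming Python 3.8+ and that the legacy distribution is registered as BeautifulSoup and the current one as beautifulsoup4; the function name is made up for illustration.

```python
from importlib import metadata


def installed_soup_packages():
    """Return {distribution name: installed version, or None if absent}."""
    found = {}
    # Assumed distribution names: legacy 3.x line vs. the current 4.x line.
    for name in ("BeautifulSoup", "beautifulsoup4"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_soup_packages().items():
        print(name, "->", version or "not installed")
```

If only the legacy name reports a version, the wrong package was probably installed; code written for the current line imports from bs4 rather than from BeautifulSoup.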
-
Re: html sucks
No he is taking the piss because CSV is probably one of the worst ways to store verbose text data, as all escaping sequences are likely to turn up in the text, plus newline characters complicate record delimiting.
Wrong thread then. I used a text editor to fix HTML back in 1997.
Out of interest, how are you handling this?
Python and Beautiful Soup 4 to parse the HTML tree. Once I have the comment text, I use string manipulation to remove the newlines, extraneous tags and white space. The end result is straight HTML.
This is what the previous comment looks like in the CSV file:
<div class="quote"><p>Did you export the data into to CSV format and then import into Microsoft Excel?</p></div><p>Post to the wrong thread, friend?</p>
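The flattening step can be sketched roughly like this. It assumes the comment's HTML fragment has already been pulled out of the page (the Beautiful Soup part is left out), and the helper names flatten_html and to_csv_row are invented for the example. The standard csv module takes care of quoting the double quotes inside attributes.

```python
import csv
import io
import re


def flatten_html(fragment: str) -> str:
    """Collapse a multi-line HTML fragment into a single line."""
    # Turn every run of whitespace (newlines included) into one space.
    collapsed = re.sub(r"\s+", " ", fragment).strip()
    # Drop the whitespace that is left between adjacent tags.
    return re.sub(r">\s+<", "><", collapsed)


def to_csv_row(comment_id, fragment: str) -> str:
    """Serialize one comment as a CSV record, letting csv handle the quoting."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([comment_id, flatten_html(fragment)])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '<div class="quote">\n  <p>Quoted text</p>\n</div>\n<p>Reply text</p>'
    print(to_csv_row(1, sample), end="")
```

Because the newlines are gone before the row is written, each comment stays on one physical line of the CSV file.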
-
Re:$200 more?
Everything you said is perfectly true but you totally didn't address my actual point, which is the question of why Dell is charging $150 more (my mistake, not $200) for a laptop with a literally free operating system than for an identical one with an OS they have to buy a licence for.
It is very difficult to say why Dell is doing that although I am quite sure they can spin it in a way that makes them the good guys instead of saying things like "We just want to rip you off so you will feel obliged to purchase one of our laptops with a Microsoft OS on it". :-)
I am pretty sure Dell pulled a similar stunt a few years ago. See the following (2010), although don't bother trying to follow the links since they conveniently don't exist anymore.
I do think, like it or not, if you get a laptop it is cheaper to just get one with Windows installed and pay the Microsoft tax, then wipe it and put on your preferred Linux distro (obviously check you can do this before you buy). You should also get the Windows 10 ISO file (4.2GB) from here and keep it, so if you ever wish to sell your laptop to the Windows brainwashed you can sell it with a legitimate fresh install of Windows 10 without the bloatware (takes about 10 to 30 minutes).
In fact, I would actually recommend getting the ISO and doing a fresh install over the default Windows 10 (don't do the quick setup) even if you were not going to install a Linux distribution.
-
Re:Goodbye, Python 2 - NOT
> Ah, denial
Cool intro, bro!
> So it's not [Guido's] problem that [Python] sucks.
Were your parents eaten by Python or something? Calm it down a notch! Here's some facts about the projects that I know something about:
- sgmllib was built into Python 2 (which I mention in case you thought it was an external library). It existed solely to be used by the HTMLParser module, and as such it never fully supported SGML, making it useless for its stated purpose. Moreover, it's so absurdly trivial to port to Python 3 that I did it for the feedparser project.
- BeautifulSoup relied on HTMLParser, which relied on sgmllib. That could be overcome but the author, Leonard Richardson, doesn't enjoy working on BeautifulSoup:
Beautiful Soup is a hobby that I don't really enjoy and that's similar to the work I do all day. It's competing against other hobbies and committments I have, hobbies and committments that are more enjoyable and significantly different from my day job.
He also notes that BeautifulSoup has been surpassed by other libraries, and recommends using those instead. It's no reflection on Python 3 that a library you used to use is not in active development.
- I ported feedparser to Python 3 over the course of a week. It weren't no thang.
> If you're using Python for anything important, start working on your exit strategy.
I'm sorry Python made you cry, but I really do bristle that you hoisted up feedparser to support your sarcasm and hyperbole, particularly since you clearly have no idea what you're talking about in these three instances (and I think GooberToo handily dealt with some of your other points).
-
Re:What If Linus Torvalds Gets Hit By A Bus?
"What If Linus Torvalds Gets Hit By A Bus?" - An Empirical Study
by Leonard Richardson
Published on segfault.org 02/23/2000
http://www.crummy.com/writing/segfault.org/Bus.html
It even coined the "Bus factor" phrase:
http://en.wikipedia.org/wiki/Bus_factor
And, in other news, Steve Ballmer was earlier seen buying shares in Greyhound.
-
Re:Good and bad news
has any syntax inspired more flamewars than python's?
I suppose you mean the spaces vs. tabs thing, maybe you're right, but no one can deny that Python has an extremely simple syntax.
You can do anything with it, from HTML parsing to a game physics engine to 3d graphics to Excel spreadsheets to... you name it.
Even if Python isn't quite enough for your needs, you can easily link it with C or Fortran modules.
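For instance, the standard ctypes module can call into an existing C library with no compilation step at all. A rough sketch, assuming a POSIX system where the C runtime can be located; the wrapper name c_strlen is made up here.

```python
import ctypes
import ctypes.util

# find_library may return None on some systems; on POSIX, CDLL(None) then
# falls back to the symbols already loaded into the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Length of a byte string as computed by the C library's strlen."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"python"))
```

Fortran code is usually wrapped with a separate tool rather than ctypes, so this only illustrates the C side.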
If I have an alternative that is, at the same time, simpler and more powerful, then why should I bother with this whole Octave/Scilab/Matlab mess?
-
Browsers are far too forgiving
Browsers are incredibly forgiving of bad HTML. Worse, the definition of "acceptable HTML" is undocumented, both for IE and Firefox. We discovered this writing Sitetruth's parser. We started out with BeautifulSoup, which is supposed to be a "forgiving" HTML parser. By browser standards, it's not; we had to make some improvements. Here are some things that show up in real-world HTML:
- Incorrectly terminated HTML comments: These are so widespread that you have to handle them, or entire web pages are sucked into unterminated comments.
- Unescaped spaces in URLs: Spaces in URLs are supposed to be escaped, but there are A tags out there using URLs with spaces.
- Unescaped CR/LF within a URL: This is rare, and invalid, but multiline URLs are out there. Usually in hostile code.
- Unicode URLs: I've seen a Unicode "Pi" symbol, unescaped, in a URL in a UTF-8 document. This was on a phishing site, so it was probably there because it broke some security product.
Part of the reason for the growth in bad HTML is that Adobe seems incapable of making a version of Dreamweaver that consistently generates correct HTML for anything later than HTML 3.2. (Create a moderately complex page in Dreamweaver 8 in HTML 4.x or XHTML mode, and run it through a validator. It will fail.) If the best tools can't get it right, why should anybody else?
Since real world HTML parsing is ambiguous, and bad HTML is widespread, differences between browser parsers and other tools can be exploited as security holes.
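The URL problems in the list above can be normalized along these lines. This is a sketch under stated assumptions, not the code from the parser described here: the function name clean_href is invented, and it only percent-encodes characters in place (a non-ASCII hostname would need IDNA handling instead).

```python
from urllib.parse import quote

# Characters that may legitimately appear unescaped in a URL; "%" is kept so
# sequences that are already escaped are not encoded a second time.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def clean_href(raw: str) -> str:
    """Normalize an href value taken from sloppy real-world HTML."""
    # Multiline URLs: drop embedded CR, LF and tab characters entirely.
    no_breaks = "".join(ch for ch in raw if ch not in "\r\n\t")
    # Spaces and non-ASCII characters are percent-encoded (UTF-8 by default).
    return quote(no_breaks.strip(), safe=SAFE_CHARS)


if __name__ == "__main__":
    for href in ("http://example.com/my page.html",
                 "http://example.com/a\r\nb",
                 "http://example.com/\u03c0"):
        print(clean_href(href))
```

Whether a browser would resolve such a link the same way is exactly the ambiguity described above, so a tool that rewrites URLs like this can still disagree with what the browser does.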
-
Re:Opensource and Campaigns
A better link to BLOOP:
http://www.crummy.com/software/Bloop/
-
Opensource and Campaigns
Campaigns have a long history of turning to low-cost alternatives to conserve money. Often opensource, free software, and other channels where costs are subsidized by third parties or grassroots efforts fill this need.
You can see this behind the recent push on YouTube, where campaigns are able to skimp on bandwidth by having YouTube shoulder the cost. Or handmade signs by grassroots activists to save cash for the campaign headquarters. Or the substitution of free or opensource software in place of more expensive proprietary packages.
But though campaigns may avail themselves of free/opensource solutions, they very rarely contribute code back to the community.
I think the Wes Clark '04 campaign was special in that it was a truly grassroots effort that gave back to the opensource community. They took SCOOP, modified it heavily, and gave it back to the community in the form of BLOOP. Not many other campaigns can say they contributed code to the opensource community.
http://www.crummy.com/cgi-bin/msm/map.cgi/Bloop
Here's hoping Wes Clark decides to run again. We can use more candidates who don't just take code, but actually give code back to the community.
-
Re:What spam?
It seems to me that if you got a stock spam email, the motive of the person sending the email is to temporarily increase the amount a company is worth before selling it. The spammer could buy stock in the company before promoting it, then once it reaches a certain level they could sell. The point at which they sell could also be time based, as in, after 1 day the spammer sells their stock.
It should be feasible to play the market in the same way if you're not the spammer. The trick is to jump in and out at the right moment so that you don't get stuck. I'm no investor, but it seems like information about stocks is available with very frequent updates. It should be possible to determine if you're one of the first people buying into the stock, or if you'd be one of the last. If you're one of the first, you'll purchase the stock at a relatively low point. Then, over the course of a few hours or days, more people will buy on and the stock price will rise, at which point you can be disciplined enough to sell your stock. The tactic works at least moderately well for the spammers (or they wouldn't be doing it), so why not make it work for you? This is most certainly a short term type thing. In fact it's so short term, I'm not even sure it should be called investing. Maybe that was the problem with my original post.
I wonder what the activity of past spam-stock looks like. It might be possible to determine some kind of trend for past performance that would be useful in predicting future spammer behavior.
It's unfortunate that thinking differently than the rest of the Slashdot crowd garners a mod of flamebait. That just sucks. Many people here say that the folks who invest in spam stocks are idiots, but I just want to point out that maybe the smart ones aren't. You might think they're assholes, but they're not idiots. If there is a flaw in my logic, please let me know. Here are a couple of links to studies and information done about this method.
http://it.slashdot.org/article.pl?sid=07/01/21/2029210
http://www.crummy.com/features/StockSpam/
-
Proof It Doesn't Work For Recipients
There's a great page that tracks spammed stocks. While TFA shows that people who buy in before the touts start arriving make a 5-6% gain, the spammed stock tracker shows that once the spam starts showing up in inboxes, it's too late.
The guy's got records going back over 2 years. It's pretty interesting.
- Greg
-
Bah
-
Okay kids...
Just so people who may come across this know, if you're going to do some HTML or XHTML parsing in Python, you'd be insane not to use BeautifulSoup or a similar tool.
Example to find all links in a document:
    from BeautifulSoup import BeautifulSoup
    for tag in BeautifulSoup(html_document).findAll("a"):
        print tag["href"]
Yes, it's that simple. For an URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
-
Re:I can't help but wonder but...
That was a highly misleading article.
I recommend checking out the following monitors:
http://www.spamstocktracker.com/
http://www.crummy.com/features/StockSpam/
You will notice that, without fail, they are money losers. The odds that you *might* actually make money are directly proportional to how soon you know they will hit a certain threshold, i.e. very slim. They go up very briefly, then plummet like a rock for the long term.
It bugs me that nobody has piped up about the legitimacy of that report. What it fails to take into account are (specifically) the timing of the trades, and the long term result versus the short term. Long term will always mean you lose, without fail.
It's important to note that in most cases the stocks being spammed are not even real companies. They're paper companies that exist solely to have stock and be pumped and dumped. It's a further illegal manipulation of the market. The SEC needs to do a lot more to fight this kind of crime but (naturally) they don't have the staff to do so.
ad
-
Re:The only thing suprising about this is...
For those who haven't already, check out Beautiful Soup, which is a great Python module for web scraping, particularly when used together with ClientCookie... the results are shockingly elegant in many cases.
I've personally written functionally equivalent scripts of 100-200 lines to search MySpace for underag... oops, I've said too much.
-
Beautiful Soup
If you like Python, there is an app out there called Beautiful Soup which can suck in ugly, malformed markup and give you a parse tree you can play with before dumping it back out to HTML.
P.S. There is a Ruby port as well.
-
Re:Quantity over Quality
or just randomly generate them...
Like this?
-
Not to trivialize Robin, but she's hot in leather
I come not to trivialize Robin, but to celebrate her. Not to take away from her skills but to acknowledge her assets.
Lara Croft ain't got nuthin on her.
in leather, and now with more leather, and a smile.
You go gurrl!
-
Re:Important one...
Are you Weird Al Yankovic? Or one of his relatives?
I second this question... To quote Leonard Richardson, Larry Wall has many of the mannerisms of Weird Al Yankovic, and he also had the Weird Al glasses and Hawaiian shirt (and old-school mustache).
I'm serious, you look a lot like him to me... ! :-)
Even perl.org mentions this possibility (with pictures)!
-
Leonard Richardson of Segfault.org
I just nominated segfault's Leonard Richardson. Not only does he write some of the funnier stories (this being my favorite), he is also one of the two main cofounders and coders of backend stuff. Granted, not everything on segfault is hilarious, but you have to work with what you're given. Plus, imagine reading the stuff that gets weeded out. Ugh.
Not to slight the hard work of Scott James Remnant (the other main segfault guy), but Scott is often recognized in press interviews and in friends and family stock offerings of Linux IPOs. And Leonard is chained to his Internet connection fixing things while Scott is off having a life or something. I say $2,000 is a small price to pay to keep a sense of humor about things. After all, laughter is the best medicine and who can put a price tag on your health?
:)