It was my task to develop log analysis software to count visitors
for large web sites. I was not only surprised to find out how inaccurate
the art was, but also had difficulty convincing other web-experienced
colleagues of how impossible it was. "All the other web analysis programs
display the number of visitors," they said. Well, the answer is that they all make guesses.
The current best practice is to count unique login names,
but most sites don't use authenticated logins, and even then one person can have many Hotmail accounts.
Here's the disclaimer I eventually wrote for my site's unique visitor stats.
The number of visitors displayed does not accurately
represent the number of actual people visiting your site.
Many people can appear as a single IP address by
sharing proxies, caches, NAT firewalls or
even simply sharing the same computer at home.
One person can also appear as many IP addresses by
using dynamic IP addressing (most dialup and PPPoE users),
being load balanced across proxies and caches or simply using
multiple computers (e.g., at home and at work.)
Other reasons for overcounting include:
robots; rogue client software that keeps changing
its ID string; users that delete cookies,
upgrade software or use multiple client software agents.
Other reasons for undercounting are
clients that don't (or have been set not to)
accept cookies or operate through anonymizers.
If authenticated logins are used, determining the number
of real people from server-side logs may be best derived
from a cookie which is only set after authenticated login
that only holds a value which uniquely corresponds to the user
(e.g., a user name or account number).
When the statistics for more than one day are selected,
the peak daily number of visitors is displayed rather than a
sum of the daily visitors.
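To illustrate the last point of the disclaimer with made-up numbers: the sketch below compares the sum of daily visitors, the peak day, and the true number of distinct visitors. The visitor IDs and day names are hypothetical, and it assumes visitors can be identified at all, which is exactly what the disclaimer says you can't rely on.

```python
# Hypothetical visitor IDs seen on each day (illustrative data only).
daily_visitors = {
    "mon": {"alice", "bob", "carol"},
    "tue": {"alice", "bob"},
    "wed": {"alice", "dave"},
}

daily_counts = [len(ids) for ids in daily_visitors.values()]

# Summing the days counts every returning visitor once per day.
sum_of_days = sum(daily_counts)

# The peak day is what the report displays for a multi-day range.
peak_day = max(daily_counts)

# The real number of distinct visitors over the whole range.
true_unique = len(set().union(*daily_visitors.values()))

print(sum_of_days, peak_day, true_unique)
```

The sum can only overstate the distinct visitors and the peak can only understate them, so the peak is the more conservative figure to show.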
Since the above was written, I discovered that a common practice of sysadmins and help desks
is to suggest manually deleting all cookies (since you can't do it selectively with MS-IE)
to get around site bugs.
And now the increasingly popular spyware removal tools (e.g., Spybot)
remove the third-party cookies used just to count unique visitors, in
the name of removing spyware and viruses from your computer.
Originally I thought of defining a visitor for HTTP domains as
the cookie if it exists, and the client IP address otherwise.
But the flaw in this is that it
will double count first-time HTTP visitors:
once for the log line of their first hit, which has no cookie,
and again for the subsequent hits, which do.
With streaming logs, using the GUID (effectively a cookie these days)
and the client IP address is more useful as a unique visitor.
The log lines in streaming are actually the summary of a
sequence of request/reply transactions, and so the
first "hit" log line does have a GUID/cookie logged.
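A minimal sketch of the double counting described above. The log records are hypothetical dictionaries rather than any real log format, and the field names (`ip`, `cookie`, `guid`) are made up for illustration.

```python
def visitor_key(hit):
    """Return the cookie if the hit carries one, else the client IP."""
    return hit["cookie"] or hit["ip"]

# Hypothetical HTTP hits from ONE new visitor. The first request has no
# cookie yet, because the server only sets it in its first response.
http_hits = [
    {"ip": "192.0.2.10", "cookie": None},
    {"ip": "192.0.2.10", "cookie": "id=abc123"},
    {"ip": "192.0.2.10", "cookie": "id=abc123"},
]
http_visitors = {visitor_key(h) for h in http_hits}

# Hypothetical streaming log lines. Each line summarizes a whole
# request/reply sequence, so even the first one carries the GUID.
stream_hits = [
    {"ip": "192.0.2.10", "guid": "g-42"},
    {"ip": "192.0.2.10", "guid": "g-42"},
]
stream_visitors = {(h["guid"], h["ip"]) for h in stream_hits}

print(len(http_visitors), len(stream_visitors))
```

One real person shows up as two HTTP "visitors" (the bare IP and the cookie) but as a single streaming visitor.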
What follows is additional research I turned up:
ABCi (a web traffic auditor)
says: ``
A visitor is defined as "a unique IP address with heuristic."
To properly account for visits, the Web site needs to identify a "visitor"
so that visitor activity is properly tracked. Registration and/or
cookies are the best way to track a visitor's activity through the Web site.
Unfortunately, a lot of Web sites do not require registration, nor do they
use cookies [and browsers can disable cookies]
If cookies are used, it is the clients' responsibility to provide
the auditor with details on how the server sets the cookie,
the cookie format and how the cookies are used.
An alternative that has been suggested is to use the IP
address AND user-agent in combination, to identify a unique visitor.
The interaction with the site by this "visitor" is then analyzed to
determine the number of visits which should be recorded.
Using only the IP address to identify a visitor is not acceptable
due to the number of visitors that may not be accurately reported
because they are operating behind a proxy server or firewall. ''
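The IP address plus user-agent alternative in the quote might look something like the sketch below. The 30 minute inactivity timeout used to split hits into visits is my own assumption for illustration; the quote only says the interaction "is then analyzed" and does not spell out the heuristic. The log tuples are hypothetical.

```python
from collections import defaultdict

VISIT_TIMEOUT = 30 * 60  # seconds; an assumed heuristic, not taken from ABCi

def count_visitors_and_visits(hits):
    """hits: iterable of (timestamp_seconds, ip, user_agent) tuples."""
    times_by_visitor = defaultdict(list)
    for ts, ip, agent in hits:
        # A "visitor" is the combination of IP address and user-agent.
        times_by_visitor[(ip, agent)].append(ts)

    visits = 0
    for times in times_by_visitor.values():
        times.sort()
        visits += 1  # the first hit always starts a visit
        for prev, cur in zip(times, times[1:]):
            if cur - prev > VISIT_TIMEOUT:
                visits += 1  # a long gap starts a new visit
    return len(times_by_visitor), visits

hits = [
    (0, "198.51.100.7", "Mozilla/4.0"),
    (600, "198.51.100.7", "Mozilla/4.0"),
    (7200, "198.51.100.7", "Mozilla/4.0"),  # long gap: counted as a new visit
    (30, "198.51.100.7", "Opera/7.0"),      # same proxy IP, different agent
]
print(count_visitors_and_visits(hits))
```

Two people behind the same proxy with the same browser build still collapse into one visitor here, so this is only a better guess than IP alone.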
Actually, last month's flop was not a DNS issue, but it did also effectively shut down websites for one and a half hours.
And if we're pushing dual sourcing, don't overlook Speedera.
You can even see today's outage on the Speedrank performance page.
BTW, C & W USA declared Chapter 11 a while ago and got acquired by Savvis. It seems like they change company name every couple of years.
AUP really is a classic. I may buy it just for sentimental reasons, even though I don't need the tutorial introduction to Unix anymore.
Nowadays, though, my definitive reference for writing portable Unix programs is the merged IEEE POSIX and Open Group
Single Unix Specification. Registration is free.
Agreed that this is not new to the world of cardiology. Ten years ago pacemakers already logged heart rate, breath rate and volume, and motion detection (running vs. walking), all in a package smaller than a box of matches, and the battery lasted for 5 years with no recharging. I wouldn't be surprised to see this device become much smaller.
An interesting sensor to add would be a GPS receiver. Some prisoners are released into public with GPS ankle bracelets with this kind of technology. There are plenty of non-nefarious spying uses for this too. A GPS combined with the CPOD could be used by athletes to track their performance.
And I agree there are many reasons why you wouldn't want to wear one permanently. When I was working for a pacemaker company 10 years ago we were developing a new kind of motion sensor to detect running vs. walking. I strapped on the prototype for a day to collect some data. That night my wife refused to have sex with me, because she didn't want anyone at the office to annotate the trace with "they did it here" and stick it up on the cubicle wall.
"A NASA guy [That was me, but I don't work for NASA directly, but for Speedera who delivers their traffic] says... Slashdot was a drop in the bucket compared to links from mainstream news web sites".
I said it here. The Slashdot load depends on the size of the objects downloaded of course, but a reasonable generalization is that the traffic from a top 10 portal is about five to ten times higher.
I work at Speedera, which is delivering their content and NASA TV.
At 6pm EST, when Slashdot posted this story, the traffic increased by only about 100 Mbps. Articles posted on the AOL, MSN and Yahoo home pages increase the traffic much more. The NASA TV live stream when Opportunity landed was 4 Gbps. There are lots of other sources that are bigger than the Slashdot effect.
See the press release for more details on the traffic
and our SpeedRank index for historical performance and availability of NASA's site.
Let me explain with an example symlink (which works fine on Unixen):
/mnt/server1/symlink -> /mnt/server2/target
where the server1 and server2 directories are NFS mounted.
I want the equivalent thing (a symlink on one remote file system targeting another remote file system) to work on Windows.
I too liked the fact that SFU has more access to the Windows core. E.g., some per-process stuff can be seen via ps and /proc, and the cmd.exe shell executes many of the utilities. But that is still not enough for me to kick Cygwin off my system.
The Cygwin bash shell's default setup beats ksh.
Here are some features that would have excited me, but that I didn't find in SFU.
I was hoping to be able to truss(1) the native Windows executables, but I didn't have any luck with that.
A list of file descriptors in use under /proc/PID/fd/...
The SFU NFS client did follow symlinks when the target was on the same device, but it didn't seem to follow a symlink to another device. I tried making the targets c:\temp and \\host\share, but even though Windows Explorer could see the target directly, the symlink target did not resolve when Windows Explorer browsed the remote NFS network. (A trace shows the NFS server returning the right target name to the SFU NFS client.)
Microsoft has had this PC-NFS client out for a while now. I see that Knowledge Base article 324084 was last updated on 6/6/2003, and my August 2002 MSDN Windows Services for UNIX 3.0 CD included it too.
One is licensed under GPL, and the other isn't....
Actually, the August 2002 20.1 MSDN Windows Services for UNIX 3.0 CD I used does contain the GPL. And section 1.e of the Microsoft EULA says:
Component Products. The Product includes certain components licensed to Microsoft from third parties (each, a "Component Product"). A Component Product may contain its own license agreement and/or copyright notice (each, a "Component Agreement"). The Component Agreements are located on the Product media at \PUBS\CPYRIGHT.TXT and \PUBS\GPL.TXT. In the event of inconsistencies between this EULA and any Component Agreement, the terms of the Component Agreement shall control solely with respect to that Component Product
And on this CD I also see the sources for all the GNU software I checked.
My pet peeve is not being able to use NetMeeting without a server in the middle when both ends are behind a NAT. This happens all the time from one work place to another work place.
Doesn't the same problem affect all p2p applications?
While Comcast and other ISPs may be running a transparent proxy, note that non-transparent proxies are coming. The Open Pluggable Edge Services (OPES) group is working on a standard framework for non-transparent proxies.
Personally I approve of this because it will allow for more efficient operation of many useful web services like content filtering, virus checking and ad stripping.
An important part of this work will also be to define a standard way for conforming OPES software to invoke edge services only after authorization from end users and/or content providers.
The Tasmanian Wilderness Society
has decorated an 80-metre (262 ft) tall Eucalyptus tree as a way of attracting attention to the plight of their tall native forests.
And here's a link more likely to survive the Slashdot effect.
Is eMFORCE part of the problem or solution? (on "Crazy Stats on Spam", Score: 1)
I notice that eMFORCE's main business seems to be in sending "targeted" email.
Some quotes from their service plan are below, with my comments:
"Capable of sending differentiated and personalized 3,000,000 emails to each customers based on transactions, preferences, and demographics data within one hour by effective targeting tools and a high-tech assembling solution."
Score 1: part of the problem.
"Maintaining appropriate email sending speed and considering effective speed with stability. Utilization of perfect gradual email sending, considering spam regulations of the email service provider."
Is this in order to be nice, or to avoid tripping the automated spam detectors? The latter, I think.
"Analysis of the co-relationships between targeting variables and reaction rates.
Based on analysis and customer scoring algorithms, such as Recency, Frequency, and Monetary Value (RFM), segment and score their customer base."
I think I'm prepared to give up some privacy for fewer and better targeted ads, but I am skeptical of ever seeing less spam. In the ideal world, the only spam I get will be news and ads for stuff that interests me but that I didn't know existed. Unfortunately, given the cost structures and the unenforceability of global regulations on the net, I don't see less spam becoming a reality.
Resource limits are needed by hosting companies (on "One-Machine Linux Cluster", Score: 3, Insightful)
My particular interest was to find virtual hosting solutions that would (1) not allow one runaway virtual server to deny the others at least a predefined minimum level of CPU, RAM and I/O (disk and network) resources and (2) give any one virtual server extra resources if they were available. From my reading of other Slashdotters' postings and the info on the web, I've summarized below the various virtual server hosting solutions mentioned. Anyone who has actually used these products, please correct me.
Linux can natively be configured to enforce disk quotas and (with more difficulty) manage network bandwidth without any special virtual server software.
Also, the native Unix process scheduling algorithm does reduce the priority of CPU-bound tasks.
The setrlimit(2) system call (documented alongside getrlimit(2)) can be used to set various limits per process (not per virtual server,
unless the virtual server runs as one process, I guess). I know of no way to specifically limit disk bandwidth on Linux.
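As a small illustration of the per-process limits mentioned above, here is a sketch using Python's standard `resource` module (Unix only), which wraps getrlimit(2) and setrlimit(2). The 600 second CPU cap is an arbitrary example value, and the `lowered` helper is made up for this sketch. Note that it limits only the calling process and its future children, not a whole virtual server.

```python
import resource

def lowered(limit, cap):
    """Return cap, unless the existing limit is already tighter."""
    if limit == resource.RLIM_INFINITY:
        return cap
    return min(limit, cap)

# Read the current soft and hard limits on CPU time (in seconds).
soft, hard = resource.getrlimit(resource.RLIMIT_CPU)

# Lower only the soft limit. An unprivileged process may always do this;
# raising the hard limit again would need privileges.
# 600 seconds is an arbitrary example value.
new_soft = lowered(soft, 600)
resource.setrlimit(resource.RLIMIT_CPU, (new_soft, hard))

print(resource.getrlimit(resource.RLIMIT_CPU))
```

A process that exceeds the soft CPU limit is sent SIGXCPU, so this stops a runaway task but does nothing to guarantee a minimum share to anyone else.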
Freeware such as s_context and User Mode Linux provides no control over how many resources one virtual server gets compared with another, besides disk usage. Other limited resources like CPU, disk and network bandwidth (RAM?) are shared just as they would be by separate processes under a single Linux system.
FreeVSD is not a virtual server, but a collection of scripts, binaries and multiple copies of hard-linked read-only filesystems for the common system environment. It has the best chance of winning the total performance award but has no extra features for resource limits between systems.
True virtual machines (e.g., VMware) provide very good isolation, but I believe this leads to little sharing of excess unused resources between virtual servers. They also have poorer performance in general because so much emulation is done.
The commercial, proprietary Private Server product from Ensim seems good from the marketing blurbs, which say that they have "their own guaranteed share of the servers resources, including CPU, memory and bandwidth". I wonder what the performance penalty for this is and how much it costs. Can anyone comment?
Yeah, but afaik, akamai doesn't cache the actual html pages, just flash, images, videos, and so forth. Kinda difficult for those to be useful when no one can get CNN's index.html file, eh?
I don't know about Akamai, but other CDNs, such as my employer, Speedera Networks, can cache HTML pages. We can even provide the raw logs back to the content provider so you don't lose your statistics. E.g., we do this for the PGA, HP, our own page www.speedera.com and some news portals.
As for CNN on Sept 11th, they never delivered their HTML base page via a CDN, which would have made for seamless handling of the traffic. Instead they solved the immediate congestion problem (after 3 hours and 40 minutes) by creating a single stripped-down static page that used fewer resources for the site. Here is a timeline of the www.cnn.com home page as seen by our Site Analyser service.
08:50 EDT - Base page errors started occurring, presumably because the volume of requests put too high a load on CNN's servers. This resulted in end users not being able to see any of the site's content.
12:00 to 13:30 - Base page errors fluctuate with embedded content errors and a few seconds of DNS response time to 205.188.214.121, which nslookup calls tswebsys2.ptn.aol.com.
13:30 - Successful, sub-second delivery of a stripped-down 2915-byte index.html page from www.cnn.com with only a single 14144-byte image from akamai.net.
Okay, you've convinced me that it is currently as
bad as you report, especially for the first page
at a site. (Subsequent pages probably result
in significant browser cache hits on site-wide navigation
images.)
Theoretically, if you send all the requests at the
beginning of the connection, you can reduce that latency. However you
still have a minimum added latency of 2.25 seconds,
...
There's the big win. The HTTP/1.1 spec (RFC 2616 Section 8.1.2.2) already explicitly
allows for such pipelining.
...
and it's questionable
whether or not you can do that with current browsers (I personally don't
know).
Me neither.
But maybe if two-way satellite delivery becomes
popular enough, more browsers and proxies will make use of this.
However, for things like web surfing
where you're setting up lots of connections
(up to 30 or 40 per
page sometimes), it's unbearable.
It probably isn't as bad as you suggest.
You shouldn't be seeing 30 to 40 connections
to pages with modern (or at least future) browsers and servers.
After getting the HTML,
the popular browsers usually open up 4 to 7
concurrent connections which incur the round
trip times in parallel and then reuse them
for subsequent requests when possible.
Browsers could also send all
the requests in advance (pipelining) and then await all the replies.
But I don't think any popular browsers do any request
pipelining.
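To put rough numbers on the argument above, here is a back-of-the-envelope sketch. It is a deliberately simplified model of my own: a TCP handshake costs one round trip, each request/response costs one round trip, and transfer time, slow start and server delays are ignored. The 0.6 second round-trip time and the 40 object page are assumed example values, not measurements.

```python
import math

def round_trips(objects, connections=1, persistent=True, pipelined=False):
    """Round trips to fetch `objects` embedded items after the HTML arrives."""
    per_connection = math.ceil(objects / connections)
    if not persistent:
        # Every object pays for its own handshake plus its request.
        return 2 * per_connection
    if pipelined:
        # One handshake, then all requests sent in a single burst.
        return 1 + 1
    # One handshake, then one request at a time on each connection.
    return 1 + per_connection

RTT = 0.6  # seconds; an assumed satellite round-trip time

scenarios = [
    ("one new connection per object, serially", dict(persistent=False)),
    ("4 parallel persistent connections", dict(connections=4)),
    ("1 persistent connection with pipelining", dict(pipelined=True)),
]
for label, kwargs in scenarios:
    trips = round_trips(40, **kwargs)
    print(f"{label}: {trips} round trips, about {trips * RTT:.1f} s")
```

Under these assumptions a 40 object page needs 80 round trips the naive way, 11 with four persistent connections, and 2 with pipelining, which is why pipelining is the big win on a high-latency link.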
They are making most of their source code available.
That's giving something back.
Their Business FAQ says:
Which portions of the QNX platform can I access in source form?
You can download source for most components, including driver toolkits, OS
utilities, TCP/IP stacks, startup code, media players, Internet
applications, games, and so on.
Components that will remain protected include the OS kernel, core OS modules
(e.g. QNX process manager, QNX file system manager), and software licensed
from third parties.
Also they say:
Why doesn't QNX provide source to the kernel and other core OS modules?
Because QNX developers don't need kernel source to extend the OS. With QNX's
advanced architecture, most OS-level services (drivers, file systems, and so
on) exist as user programs that run outside the kernel, just like regular
applications. As a result, developing OS extensions doesn't require kernel
source - or for that matter, kernel debuggers (tricky) and kernel
programmers (expensive). You just use the same tools as for developing user
applications.
I completely agree with them here.
I used QNX for 3 years and never needed
to see how the kernel was implemented.
Sometimes, to help with writing my own
applications, I needed to request
the source code of their
system utilities under a Non-Disclosure
Agreement. Now these should be
freely available without an NDA.
While our embedded customers want the flexibility provided by source code,
they also demand a stable, high-performance core of technology that they can
rely on. With our approach, they can enjoy both. Put simply, we can offer
OEMs key benefits of an open source OS, but without the drawbacks.
But I think this is a bogus excuse.
If a customer wants a stable, high-performance core of technology,
they could choose to use the QNX Software Systems
core only, but that doesn't stop
QSS from open sourcing it.
I think the real reason why they don't open source it is a commercial one.
Maybe they would be better off open sourcing
the lot and offering their services for hire,
but that's a risky decision which is theirs to make. At this point in time I'd rather
thank them for
the code they are making available, instead of
chastising them for the code they aren't making available.
On the minus side, QNX (at least then) did NOT let you create a bootable floppy,
something that annoys me no end. We had sufficient licences for all nodes (at $hundreds
per node), but ya still needed those double-damned fingerprinted floppies to make it
work.
When I used QNX 5 years ago I didn't need
to use a floppy to boot them.
In fact, I often didn't use a hard disk either.
Disclaimer. I work for Speedera, an Akamai competitor.
It also seems that cheap DOS/Windows NFS clients have been available for a long time. In 1994, this summary mentions XFS (a shareware NFS client from Germany, not the SGI filesystem), TSoft and Sun's PC-NFS.
Nowadays you also have at least these options, and you are right, many are not cheap.
- HummingBird $300 My past impressions were always of good quality and features.
- Reflection $88 I know this name.
- ProNFS $40 (shareware?)
- DiskAccess $179
- SuperNFS $160 Found with Google.
I had only heard of the first two. The rest I found with Google.