Ask Slashdot: Remote Server Support and Monitoring Solution?
New submitter Crizzam writes I have about 500 clients which have my servers installed in their data centers as a hosted solution for time & attendance (employee attendance / vacation / etc). I want to actively monitor all the client servers from my desktop, so I know when a server failure has occurred. I am thinking I need to trap SNMP data and collect it in a dashboard. I'd also like to have each client connect to my server via HTTP tunnel using something like OpenVPN. In this way I maintain a site-to-site tunnel open so if I need to access my server remotely, I can. Any suggestions as to the technology stack I should put together to pull off this task? I was looking at Zabbix / Nagios for SNMP monitoring and OpenVPN for the other part. What else should I include? How does one put together a good remote monitoring / access solution that clients can live with and will still allow me to offer great proactive service to my servers located on-site?
Set up a script to initiate a reverse-SSH tunnel from the remote device back to a monitoring server, set up no-login on the tunnel but distribute keys for the monitoring user on the remote devices.
You should be able to passwordless login from the monitoring box over a completely secure link that doesn't require port-forwarding at the remote site.
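A minimal sketch of the reverse-tunnel idea above, in Python, for anyone who wants a starting point. The host name `support.example.com`, the `tunnel` account, and port 20001 are placeholders made up for illustration; the `ssh` options used (`-N`, `-R`, `ExitOnForwardFailure`, `ServerAliveInterval`, `ServerAliveCountMax`) are standard OpenSSH, but check them against your own version.

```python
import subprocess

def reverse_tunnel_argv(remote_port, server="support.example.com",
                        user="tunnel", local_port=22):
    """Build the ssh command that exposes this box's sshd on the server.

    All default values are hypothetical placeholders.
    """
    return [
        "ssh",
        "-N",                               # no remote command, tunnel only
        "-o", "ExitOnForwardFailure=yes",   # exit if the remote port is taken
        "-o", "ServerAliveInterval=30",     # probe the link every 30 seconds
        "-o", "ServerAliveCountMax=3",      # give up after 3 missed probes
        "-R", f"{remote_port}:localhost:{local_port}",
        f"{user}@{server}",
    ]

def run_tunnel(remote_port):
    """Run the tunnel once; cron or a supervisor would restart it on exit."""
    return subprocess.call(reverse_tunnel_argv(remote_port))

if __name__ == "__main__":
    print(" ".join(reverse_tunnel_argv(20001)))
```

With a tunnel like this up, connecting to port 20001 on localhost from the monitoring box should land on the remote device's sshd. Each remote device needs its own port number.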
Will you do my job if I tell you the answer? You've already gotten your start. What more do you need?
Install one central server at a colo facility. Use VPN between each of yours to it. You'll call your server 'Support.'
Now you can use SSH to connect to the remote systems as needed to solve problems. Capture SNMP traps, syslog, etc. from each via the VPN connection to Support. Filter those and have notifications emailed to you.
How much data is created on each system? If not much, and you use MySQL, you can run a MySQL instance per client on different ports on Support as slaves to those masters. In this case you'll be backing up their data live. At night do a dump of each to a file and compress. Now you have a snapshot. Use these to recover from hardware failures where they have not done the backup.
Should ask them!
Check out www.newrelic.com - even their free service tier offers great features and it's easy to deploy on all servers
For Server active status (eg: am i dead?)
Inside a while loop or sleep() if you can't be bothered.
for (int i = 0; i < MAX_SERVERS; i++)
{
IcmpSendEcho(..........);
}
For everything else monitoring related. Employ someone to make a custom monitoring application ,or, Google "server monitoring software".
Have the clients connect with ssh to your server and open a reverse port. They'll each have to pick a different port on your server.
Use something like autossh ( http://www.harding.motd.ca/aut... ) to make sure the ssh connection is always open.
Having said all that, sounds like a great security hole if your server is ever breached. Plus lots of potential privacy violations.
Marqis
500 OpenVPN connections is going to be a bit of a headache to keep straight. Obviously you won't have 500 tun devices so it'll be a multi-client to server config. You'll need a means of knowing that 10.20.20.x is client x and 10.20.20.y is client y. Of course OpenVPN allows you to do this but maintaining that table by hand could be a bit of a pain.
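One way to avoid maintaining that table by hand is to generate it. The sketch below assumes OpenVPN's `client-config-dir` mechanism with `topology subnet`, where a file named after each client's certificate common name contains an `ifconfig-push` line. The 10.20.0.0/16 range and the client names are made up; a /24 like 10.20.20.x would not hold 500 clients anyway.

```python
import ipaddress

def assign_addresses(common_names, network="10.20.0.0/16"):
    """Map each certificate common name to a fixed tunnel address.

    Names are assigned in list order, so only ever append to the list
    or existing clients will be renumbered.
    """
    net = ipaddress.ip_network(network)
    hosts = net.hosts()
    next(hosts)  # keep the first usable address for the VPN server itself
    return {name: next(hosts) for name in common_names}, net

def ccd_line(address, net):
    """Content of the per-client file in OpenVPN's client-config-dir."""
    return f"ifconfig-push {address} {net.netmask}"

if __name__ == "__main__":
    table, net = assign_addresses(["acme", "globex"])  # hypothetical clients
    for name, address in table.items():
        print(name, "->", ccd_line(address, net))
```

The same table can then feed whatever monitoring system needs to know which address belongs to which client.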
HTTPS solutions like NewRelic aren't an option because you want to be able to ssh back into the host..
Assuming all clients will allow it I can only think to create an out-of-band registration process whereby the clients do something like HTTPS POST to a URL you manage. The POST would contain some degree of identifying information which your system would then use to configure a new OpenVPN client config.
For monitoring, check out PRTG (http://www.paessler.com). I think I'd purchased it by the second or third day of the free trial. Nagios worked but we spent a lot of time fiddling with it.
ditto.
Just download JFFNMS - it's a Net Monitoring system more than capable enough to watch 500+ servers. It can also be configured to do email and text alerting. It monitors CPU, Memory, Disk etc. It's pretty much the open source version of Nagios.
Nagios monitoring servers over OpenVPN? Works like a charm.
BUT before you set this up, be damn sure that you don't punch a hole in your customers' firewalls by having a VPN to your monitoring server. Having 500+ VPN connections from one Linux box to servers located in customers' internal networks might backfire at some point if it's implemented incorrectly.
I did something similar by having a cron script in the server in customer's network that POSTed some statistics over HTTPS to my server. The firewall in the customer's network blocked pretty much everything else. On my end it was relatively easy to get periodically received statistics into Nagios from 100+ servers around.
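Roughly what such a cron-driven reporter could look like, as a sketch only. The URL, the bearer token, and the choice of statistics are assumptions for illustration, not what the poster actually ran; `os.getloadavg` and `os.statvfs` are Unix-only.

```python
import json
import os
import socket
import time
import urllib.request

def collect_stats():
    """Gather a few basic numbers (Unix only)."""
    load1, _load5, _load15 = os.getloadavg()
    st = os.statvfs("/")
    return {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "load1": load1,
        "disk_free_bytes": st.f_bavail * st.f_frsize,
    }

def build_request(stats, url="https://support.example.com/report",
                  token="CHANGE-ME"):
    """Wrap the stats in an HTTPS POST; URL and token are placeholders."""
    body = json.dumps(stats).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )

def report():
    """Send one report; call this from cron every few minutes."""
    with urllib.request.urlopen(build_request(collect_stats()), timeout=10) as resp:
        return resp.status
```

Because the connection is outbound HTTPS only, nothing has to be opened on the customer's firewall. A missing report is itself the alert condition on the receiving side.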
2X works
You can install remote probes, you can monitor any number of things, not only SNMP (apache's server_status limited to your PRTG server's IP, for example, is great)
www.paessler.com/PRTG
Saltstack is a framework designed to accomplish precisely this kind of thing. They don't quite have the fancy dashboards yet, but it has the remote control and the framework for monitoring all there. It's free, open, cross-platform, and documented. It's lightweight and doesn't require any VPNs, and it scales to any size.
Learn about it here:
http://docs.saltstack.com/en/latest/
Spiceworks is free and easy to set up and maintain. You get some pretty good reporting and monitoring capabilities. It's not as robust as New Relic; we use NR for our "real systems" and SW for our internal servers/desktops. NR really shines at monitoring applications. SW is definitely more of an inventory software but you get the basics - heartbeat, RAM, and hard drive stats. I've also used OpManager. SNMP is good if you have the ability to get into the network, but if you don't then you probably want something that has an agent on it. SW has a remote agent that collects stats and sends them to a main point for collection through http or https and is pretty much made for Windows, so depending on the server OS at the client site it may be a good fit. SSH is good for the connections, just use fingerprints and keys for authentication and make a plan to swap out the keys 3/4 times a year.
Hope that helps.
I do something similar. I use openvpn and x11vnc. I have a cron on each client that runs a small perl script that grabs the output of several programs like top, uptime, and sensors and then saves the results in an easy to parse file that my server periodically grabs so that I have stuff like cpu temperature, cpu usage, memory usage, etc...
I also grab a screenshot of x11vnc using vnccapture.
I also have a way to remotely activate reverse ssh if for some reason openvpn fails.
My only problem with openvpn is key management. Creating and distributing unique keys to each client is kind of a pain.
I'm sure that there's a Microsoft Solution for you.
Now, let's discuss the licensing terms . . .
We are not going to allow a permanent remote access into our network.
But there is really only one.
Www.n-able.com.
It is not free but is designed to manage thousands of nodes. If you are looking for free then you really need some more experience and to change your mindset.
I tried multiple options out but Hobbit (now called Xymon) fit the bill. It's simplistic enough but also has the features I wanted. I found some of the other systems felt like I was configuring the system more than just monitoring the servers.
Make damn sure your clients are aware of exactly what you're doing. They probably don't care about the specifics (e.g. openvpn, reverse ssh); but they need to know you can remotely access the boxes.
It's probably a good idea to have some sort of document to give them that does spell out all the specifics - something they need to acknowledge/sign, with both of you keeping copies.
#DeleteChrome
is the solution I use and is working well. Routers are 1U mini ATX boards with pfSense. Nagios mostly with NRPE, SNMP for devices on which I can't install packages. Has worked well for the last ... 8 years or so.
not really. snmp is an afterthought for them and it's clumsy as hell to add snmp to it. I tried and gave up. instead, I picked hobbit (uhm, the new name is 'xymon').
xymon has its quirks but it was not hard to modify to add more snmp features to, and its coding was not too bad to get thru. it's not written in a lot of 'strange' languages, and that's a plus, to me, too.
personally, I usually just write snmp code fresh, from scratch, using net-snmp mgr tools. it's not hard and you get just what you want and you are not bogged down in lots of 'infrastructure' that someone else thought was good but is useless to you (like zabbix).
--
"It is now safe to switch off your computer."
Excellent monitoring solution, can generate KPI based reports, email/sms/snmp notifications etc, comes with a bunch of out of box server monitoring modules and you can build your own with scripts or SNMP GETs. I swear by it.
1) Set up your own VPN server elsewhere, reachable on the net. Make sure OpenVPN (or the like) supports client isolation/encapsulation, and be ready to enforce it with your firewall (at the very least) or some authentication besides the VPN certificates.
2) Deploy to each client's machine its own VPN access, and let them connect at boot and reconnect if the link goes down.
3) Set up a .php script or the like to read the VPN active user database, let it display nicely and place it on your VPN server's webserver (with authentication of course).
If a host is up you know it, and you can use the very same vpn to reach each single client. Anyway, each client can't and won't reach anything besides your vpn server. The same VPN should help with nagios, collectd or anything in between.
An agent on each box creates outbound connections to your central server...
I know of a bunch of little IT shops that use it, so its not overly expensive... $2.50 or less per agent.
You might take a look at the RHQ project. It can likely do what you need. You install a server part, plus an agent on each client machine. The agent uses various plugins to monitor various aspects of the server ranging from OS parameters (disk space, I/O, CPU usage, memory usage), to specific pieces of software (JBoss, Tomcat, Apache, MySQL, Oracle, etc). You can define alerts to monitor using any of the metrics gathered by the client agents.
Oh, and it's open-source, though RedHat would be more than happy to sell you a license with support. https://docs.jboss.org/author/display/RHQ/Getting+Started
We have used Kaseya to monitor our servers. It seems like it may be worth looking into for your situation.
a very basic php script that returns a status code on all the remote servers, a db on my server with list of the various urls, and a jquery page that changes the codes into pretty colored lights or something
I was researching this exact situation. Two great remote server support solutions are SecureLink and Bomgar. Both allow access to unattended machines, both have strong auditing for customers that want to track that, and both allow command line access. SecureLink is cheaper, easier and more barebones; Bomgar is way more expensive and has a more involved setup, but has a lot more features.
The idea is you have a public location and an outgoing protocol, which removes the VPN issue.
I came across a service called logic monitor last year though didn't start to evaluate it until about a month ago. I've been doing custom monitoring (mainly graph/trending stuff) for 15 years and nothing I've ever seen comes close to what this service offers and the price is really cheap too, with costs starting at $20/server/month(costs go down with volume of course).
If you have a vcenter server for example you can monitor all of the metrics for all of your VMs on that vcenter server and it only counts as 1 server. For my org anyway it paid for itself in the first few days of using it since we can consolidate load balancer, firewall, switch, vmware, mysql, and other metrics into very easy to use dynamic dashboards. They do alerting and reporting too but have not had time to mess with them yet.
The service works on agents, you deploy one or more agents per network segment and they communicate back to their SaaS platform. Then you configure things and view the graphs/dashboards/etc on their platform with your desktop or mobile browser.
They have another feature that I have not tried that allows you to SSH and remote desktop in through their platform (to their agents which then proxy the connection to the destination system). I suspect ssh wouldn't work for me since all of our servers use ssh keys and I wager the java ssh client they are using doesn't support them (but I haven't tried, don't really need that feature). You can disable this functionality if you wish as well.
I use it to monitor Sonicwall firewalls, Citrix Netscalers, VMware vCenter (w/600 VMs) & ESX/ESXi hosts, mysql, network switches(SNMP & sFlow), power strips, fibre channel switches, memcache servers, varnish servers, rabbitmq servers, and will be adding custom HP 3PAR monitoring as well soon too. See more info here - http://www.logicmonitor.com/monitoring/ . I got it mainly for the infrastructure end of things and less for the linux-end of things, though it does that well. Literally probably 20,000 data points a minute being collected (probably 15k of those are coming from vCenter).
It is secure, unlike a massive set of openvpn connections, because it is push, it doesn't need any VPN, it will keep your clients isolated from each other, it will allow you to consolidate monitoring across clients (say create a graph of top 10 CPU usage of your servers cross client for example).
It is way overkill if all you're looking for is simple availability checks, but if you need more sophisticated monitoring, again, I haven't found anything that comes close. I've spent literally thousands of hours working on monitoring stuff over the years and this product/platform makes it so easy I want to cry. The only knock I have against it is that it is SaaS; I would prefer to host it myself, but since the choice is between SaaS and some other product that can't do this, I have to use SaaS.
I am not affiliated with this company at all. As another poster mentioned, New Relic is good too (I am a customer of theirs as well), though IMO New Relic is more of a developer's platform; their main value add is real-time code instrumentation. They try to do the ops thing as well, they just aren't as attractive a platform, for me anyway.
They have a free 2 week trial available as well.
I've posted only maybe 5 times on slashdot in the past 15 years so I don't have an account, if you wish to get more info from me personally on this you can reach me temporarily anyways at slashdot .@t. linuxpowered .dot. net (I will kill that email address in coming days to avoid spam)
nate
Zenoss is awesome and as your business scales so can it. Our organization monitors 5000+ servers worldwide in all sorts of places. Zenoss lets you do everything you'd want. Setup notifications for one or more servers, types of errors, and filters within filters. It's a rocking platform and if you're big enough, they'll set it all up for you for a fee.
Would need more information on the locations. Running Linux, Windows, Solaris? I personally use Zenoss for all of my monitoring. It is handling around 1800 devices right now and monitors all aspects of the network and servers. Zabbix uses agents, so you could run the server at your location and of course the agents connect to it for monitoring. People talk about needing a VPN connection to be safe, but another solution that I would consider is using stunnel for encryption. I do run a large openvpn setup as well. With this large of a VPN setup I would look at possibly using Quagga and doing RIP. It will be easier to manage all the routes and netblocks.
When all else fails, hire me!
NAV is a great network and server monitoring suite...I have it monitoring much stuff connected over VPN.
I once did this and it worked like a charm. I had a central server via which I established a connection to the remote sites. You don't need to write more than a few lines and add a cronjob to make it functional.
The ELK Stack (ElasticSearch, Logstash, Kibana) are great tools for capturing logs from *anything*, indexing and massaging of the data captured, and then offering up visualization, searches, and dashboards (that refresh). Built with Angular.js so the speed happens.
We could be talkin' web server logs of the NY Times servers, centralized and displaying dashboards in real-time, or maybe 24/7 sensor data streaming from the ocean floor. The ELK Stack can do it.
First googled citation, and there's plenty more where this came from: http://thepracticalsysadmin.co...
You can't be ahead of the curve, if you're stuck in a loop.
Or, do the right thing and hire a network admin so someone with a clue is involved.
If you have to ask this question on slashdot, you need to change the question to something appropriate. Based on exactly what was posted, he doesn't have any idea what his requirements are. He knows the conceptual goals, but not the actual goals or requirements. Unless he is trying to change careers from whatever he is to a full time network infrastructure person he is going to be wasting a lot of time getting a clue. That means time he won't be spending doing whatever his actual job is.
He needs someone who can look at his actual setup, figure out what actually needs to be monitored, and knows the appropriate ways to do it.
Short of multiple Bennett Haselton length posts, and many discussions in depth, no answer coming from slashdot or all of them combined is going to be useful.
Everyone here posting solutions has their own, certainly incorrect idea of what he wants but no one actually knows. No one so far has even started by asking the right questions. It's the blind leading the blind at best.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
The reverse-SSH tunnel is the correct way to "phone home". Maintaining a VPN is a shit show.
Pure Storage does it this way, and they are quite the experts.
How does he not have it in the first place? Explorer with 500 client servers.. Soon as I had 5 servers I setup central mobile monitoring lol He needs to hire someone that knows what they are doing for sure. Google it and the top open source monitor comes up as a start...
Problem solved. Next topic please.
http://www.gfimax.com/
Life is not for the lazy.
Just because you're unfamiliar with networking administration doesn't mean this needs to be blown up into "hire a network guy". That's just ignorance and (I suspect) trying to make yourself sound important on an anonymous message board.
The solution is not complicated and has been pointed out numerous times in this thread: ping NewRelic, set up the system and you're done.
As my granddaddy used to say, if you don't know what you're talking about, it's best to not open your mouth and prove it. So no need to apologize, just take the advice and consider it a lesson learned. Best of luck.
You must check out n-central from n-able. It is really great and gives you all kind of monitoring features.
Admin console is web based and agents push data over ssh back to your central server.
You really must check it out.
But shouldn't this have been part of the design BEFORE you rolled out 500 servers?
The only thing worse than a Democrat is a Republican.
Maybe he should hire someone considering he has 500 servers and he is just now thinking of implementing a monitoring solution. And this board is only anonymous if you post as AC ;) also, fuck your granddaddy
I would write a wrapper though to make the whole thing bit more robust. Groundwork does this with their GDMA agent and it allows you centrally configure and have the client pick up its configuration.
-----
That paid product looks like shit
Check_mk works like a charm. We have over 2000+ servers and 100,000 items monitored all done by phone home autossh. And yes this is my day job.
Let me know if you want to do something like this and we can work something out. Reply to this and we can connect.
A place I worked for did exactly that. There are a few details that you should attend to - give out ip addresses based on the ssl certificate used by the openvpn client (and make sure you don't deploy the same ssl cert to two servers!), and have a method of restarting openvpn every time it crashes/disconnects (and exits). You'd be surprised how flaky enterprise internet connections can be. From there my work kept a database of all the openvpn servers and used it to generate a nagios config. Honestly, I've never loved nagios since it frequently doesn't QUITE do what I want, but it's good enough. If your clients are all internet accessible, I've been using a slightly expensive commercial service called Monitis which I really like. Contrary to what a number of people here have said, I don't think you need a network admin at all; if you can get the vpn stuff working with a simple acl (to keep clients' interns from bothering each other) then you should be set.
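Generating the Nagios config from a client list can be very little code. A hedged sketch: the client names and addresses are invented, and the `generic-host` template is assumed to be defined elsewhere in your Nagios configuration (the sample configs shipped with Nagios define one by that name).

```python
def nagios_host(name, address, template="generic-host"):
    """Render one Nagios host object definition."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )

def nagios_config(clients):
    """Render all hosts, sorted by name so the file diffs cleanly."""
    return "\n".join(nagios_host(n, a) for n, a in sorted(clients.items()))

if __name__ == "__main__":
    # Hypothetical rows, e.g. pulled from the database of VPN clients.
    print(nagios_config({"acme": "10.20.0.2", "globex": "10.20.0.3"}))
```

Write the output to a file under the Nagios config directory and reload Nagios whenever the client database changes.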
Have you considered SolarWinds Server & Application Monitor? The latest version, currently in beta adds an optional agent that negates the need for VPN tunnels. It supports overlapping IP address space, NAT traversal, passing through authenticated proxy servers, and communications are fully encrypted. These agents report back to a single, centralized server at your location, or in the cloud, such as Amazon EC2, Azure, RackSpace, etc.. More information can be found at the following links. https://thwack.solarwinds.com/... https://thwack.solarwinds.com/... https://thwack.solarwinds.com/... If that doesn't fit the bill, you should consider taking a look at N-able, which is a purpose built solution designed specifically with MSP's in mind. More information on N-able can be found at the following link. http://www.n-able.com/
I want to drastically increase my clients' exposure to attack by opening remote holes in their network firewall through my equipment. How can I best go about doing so?
The reverse-SSH tunnel is the correct way to "phone home". Maintaining a VPN is a shit show.
A blanket statement like this shows your cluelessness and shear ignorance.
Without considerably more information neither you nor I nor anyone else can make such a statement.
Pure Storage does it this way, and they are quite the experts.
Oh well, since a company that's barely 5 years old does it this way, and since their primary business line is selling flash drive arrays ... not network administration and monitoring ... they must be the most qualified and perfect example to follow.
IS IT the right way for THEM? Maybe. Maybe not. To pretend that just because they do it that way, they are experts again just shows your ignorance. Let me guess, you work for them on their monitoring team, don't you?
... is your friend. A simple shell script run from cron every so many minutes to test each server, and then text / email / raise an alarm if there is no answer. I'd do this from at least 2 locations to allow for transient network issues, or for the monitoring system itself having a hardware issue and tanking. And don't use windows for critical stuff. A couple of low end linux systems on amazon or similar would work. Low cost, efficient and very manageable.
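The post suggests a shell script; the same check is sketched here in Python so it can be unit-tested. It only tests whether a TCP port accepts connections. The host names and ports are placeholders, and the alerting step (text, email) is left out.

```python
import socket

def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def failed_servers(servers, check=is_reachable):
    """Return the (host, port) pairs that did not answer."""
    return [(host, port) for host, port in servers if not check(host, port)]

if __name__ == "__main__":
    # Hypothetical server list; run this from cron at two locations.
    servers = [("client1.example.com", 443), ("client2.example.com", 443)]
    for host, port in failed_servers(servers):
        print(f"NO ANSWER: {host}:{port}")
```

Running it from two locations and alerting only when both report a failure filters out most transient network blips.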
Just because you're unfamiliar with networking administration doesn't mean this needs to be blown up into "hire a network guy". That's just ignorance and
As someone who's been a network admin for a few years, I'm fairly confident in my statements. Do you do even minor surgery on yourself if you're not a surgeon? If you come to slashdot to ask how to do something for your business, you already fucked up and the only valid responses you should be getting from slashdot are help on finding someone who can help you. If he asked 'how do I find someone, like a consultant for a short term project, like this' that would be one thing. He didn't, he came here expecting a solution which illustrates his complete lack of understanding of the problem, THAT IS WHY he needs to hire a network guy.
He is, by definition, ignorant, which is why he is asking for help ... clearly you are as well as your choice of words indicates. I suggest you learn what the word ignorant means before you brandish it about like an insult as you just end up insulting yourself through your own ignorance.
(I suspect) trying to make yourself sound important on an anonymous message board.
I have no need to make myself sound important, I certainly don't need your approval ... and if you bother to google for my nick, you'll find its not even a little difficult to link to a real name, address, and everything else. I'm not in the least bit anonymous. People have been able to recognize that nick and its association with me for 20+ years. On the other hand ... your post ... is from ... anonymous coward. Do you know the meaning of the word ironic?
As my granddaddy used to say, if you don't know what you're talking about, it's best to not open your mouth and prove it. So no need to apologize, just take the advice and consider it a lesson learned. Best of luck.
Your granddaddy said that to you a lot, didn't he? Did you ever wonder WHY he said it to you so much? Maybe he was trying to get some sort of point across to you ... Go look in the mirror and repeat those words until you get the point of them and who he was talking about. Hint: It's the guy in the mirror.
You're an absolutely shitty troll. You just suck at it. Nothing you've said did anything other than show how stupid YOU are, not me.
Pure is the premier brand in SAN storage at the moment. Their engineers are simply doing it better than the competition from EMC etc. They understand Linux and they understand network storage.
They also happen to understand the cloud and the Internet of Things. And apparently they understand how to use the networking Swiss Army knife of our time, SSH.
You can take your non-technology-related quibbles to your management.
Nagios + NRPE (+ iptables).
As someone who's been a network admin for a few years, I'm fairly confident in my statements. Do you do even minor surgery on yourself if you're not a surgeon?
I am a network (and Linux) admin by profession, but I can also repair my audio equipment and do some repairs on my car, even though I do not work as a car mechanic or electronics repair guy. While I could find a mechanic to repair my car (and sometimes I do), a lot of the time it is cheaper and faster to do it myself.
So, if the OP wants to create a monitoring solution himself (assuming he knows something about the monitoring systems) more power to him. I probably would ask a similar question if I had to monitor 500 remote servers that are in different locations (if they are all in the same place I would just use VPN). It would be possible to use VPN or SSH tunnels or something else, but sometimes one may need advice from others as to which option is the best.
Smug prick.
Frankly, I never understood why ANYONE would ask Slashdot contributors a question. You get nothing but grief from arrogant puds who never give a helpful answer.
LabTech Software is lightweight, paid software for monitoring servers. Minimal RAM usage and good at sleeping until needed. The Managed Service Provider industry is pretty big now, so you've got choices for professional tools specifically designed for this kind of job.
Disclaimer: I work for LabTech Software.
I agree. A meeting needs to be held with the technical team to determine what exactly needs to be monitored.
With that being said, ask yourself a few questions:
Are you looking for a heartbeat?
Are you actually more concerned for the applications running on the servers?
Are you looking to monitor individual pieces of hardware, e.g. CPU, RAM, etc.
Are you trying to determine if there is a network hardware failure as well, e.g. router, switch, etc. (did a switchport die and did I lose a particular subnet or cluster?)
All or none of these things can be important, but BitZtream is correct. Without a lot more knowledge of what is needed there is no way of giving OP a method of accurately monitoring the required infrastructure.
Take a look at The Assimilation Project : What we do: Continually discover and monitor systems, services, switches and dependencies with very low human and network overhead.
The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
You sound like a Windows admin for a gov't entity.
You spend a lot of energy telling people they do it wrong without having any real insight or advice on how to do it correctly.
A blanket statement like this shows your cluelessness and shear ignorance.
What does his knowledge of a specific cutting tool have to do with anything?
Someone flopped a steamer in the gene pool.
xymon (aka hobbit) grew out of bigbrother, one of the early monitoring systems. This has extensive extensions, is highly customisable and doesn't require a client. The only weakness is the windows client, bbwin, but it does work.
Anon - Why base your opinion on an experience back in 2008? This is six years later and the product has matured since then. The Zenoss Core (http://www.zenoss.org) open source project is bigger than it's ever been, it is very reliable, and is used by many large corporations today.
OP - For what it's worth, any open source monitoring software should play just fine with OpenVPN. However, the monitoring feature set should be simplified into a single interface, you don't want to have to be fixing scripts and maintaining the software all the time.
I actually used to deploy OpenVPN + Zenoss for remote site monitoring. In my case I needed to monitor multiple systems at the customer premises (using Zenoss Enterprise/Service Dynamics for the remote collector integration), but you should have it a bit easier since you only have one server to monitor. I found configuring OpenVPN to be a bit of a challenge, but once that part was done the rest was a piece of cake. It will be a lot of work with the sheer volume of 500 clients (with that amount of traffic you might even need to break it into two OpenVPN endpoints) but I'm sure you are already aware of that.
I would say definitely take a closer look at Zenoss Core. A side note, Zenoss Service Dynamics is their enterprise product with advanced features, but for you the "technology stack" needs only to consist of Zenoss Core (free) + OpenVPN. Set up OpenVPN as you described so that the clients deployed on your remote servers can connect back through https - as long as they have an internet connection no holes need to be poked through your customer's firewalls. Drop Zenoss on the OpenVPN endpoint box(es). Then use the OpenVPN IPs to monitor the servers. For each individual server, configure the SNMP string if Linux, or set up WMI if windows (no need to configure traps, Zenoss polls the boxes at specific intervals). Use the wizard on the Zenoss web interface to add the host and model it. Away you go, you can now see the events in the Zenoss console for everything from ping status to CPU utilization. Events go to the console which you can monitor, or you can easily set up e-mail alerts to trigger. For example, say one of the disks throws a SMART error; trigger an e-mail to you so you can ship the customer a new disk to install, just like NetApp does.
As I mentioned, you can definitely use Zabbix or some other variant to do the monitoring part. I researched and played with many monitoring solutions (commercial and free) before I settled on Zenoss. What made the difference for me was that I found I was spending way too much time learning the quirks of the software (e.g. Nagios - config file to add a client, really! SolarWinds - Agent installation required, really!) and not enough time actually deploying monitoring to the targets. Good luck, hopefully this info helps you find the right fit for your environment!
Ping is almost the worst way to check to see if your server is up. In fact, certain machines will return an ICMP response even after you've broken into their bios-equivalent (hello, Solaris).
Do a service level check. It's not that hard to do a curl instead of a ping. A curl's results can show you if the service is present and functioning. A ping just shows you whether the network interface is responding.
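To make the curl-instead-of-ping idea concrete, here is a rough Python sketch of a service-level check. The function names, the 5-second timeout, and the idea of looking for an expected string in the page are my own assumptions for illustration, not something the poster specified.

```python
import urllib.error
import urllib.request


def evaluate_response(status, body, expected_text=None):
    """Decide whether an HTTP response looks like a working service."""
    if not 200 <= status < 300:
        return False, f"HTTP {status}"
    if expected_text is not None and expected_text not in body:
        return False, f"HTTP {status}, but expected text is missing"
    return True, f"HTTP {status}"


def check_service(url, expected_text=None, timeout=5.0):
    """Fetch the URL and return (ok, detail). A ping cannot tell you this much."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body, expected_text)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return evaluate_response(exc.code, "", expected_text)
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and so on.
        return False, f"unreachable: {exc}"
```

Something like `check_service("https://client.example/login", "Sign in")` (a hypothetical URL) then fails when the box answers but the application behind it is broken, which is exactly the case a ping misses.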
People disable ping because if you don't know a server is there you can't attack it. It's like enabling MAC address filtering - it doesn't really help that much, but in a specific set of circumstances it helps a bit.
The media says Target was breached due to a compromise at their HVAC vendor. Do you want to be the vendor that gets hit with a liability suit because someone broke in through your network?
It's obvious from your question that you're not really sure what you're doing. SNMP? That's for network crap, not for server and application level stuff. Why would you even talk about SNMP? Why would you even want a VPN into the customer network?
If you need access to your server, write it into your support contract, and ask the vendor for a VPN login. Then the vendor can turn that login on and off when an outage occurs. Then just use NewRelic for monitoring (assuming your machine can get out).
If you need continuous access to your server, write it into your support contract, then make sure that (1) you really need it, and (2) your security is better than your customers' security.
Or, if you want to screw everyone, just run a TeamViewer instance on it and connect to it on the sly. I'm sure your customers would love that, but that's what you're basically asking them to allow you to do.
I manage a hub server and a backup server. Every 60 seconds the backup server's crontab (wget) fetches a 'web page' from the hub server, which as a side effect records the caller's IP address into a file. Even though the backup server has a dynamic IP address I can always find it by going to the hub server and looking into that file.
I have a page I can go to on the hub server which checks the timestamp on the file BackupServer.ip. If it is suspiciously old then that web page turns red and tells me that things are cut off. If all is OK the background stays green. You can see it at http://gregor/ServerCheck.php. I check it every time I start my browser.
It would be trivial to support more than one call-in server. It would be easy to add more complex status information. From your notebook computer anywhere in the world you can go to that web page and see that all is OK, or, if it is not, which remote server has a problem.
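The green/red logic is small enough to sketch. This is not the poster's PHP page, just a guess at the same idea in Python: the 180-second threshold (three missed 60-second call-ins) and the function names are assumptions.

```python
import os
import time


def heartbeat_status(path, max_age_seconds=180.0, now=None):
    """Return ("green" or "red", age_in_seconds) for one call-in file.

    The file's modification time stands in for the last call-in time.
    A missing file counts as red, with an age of None.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return "red", None
    return ("green" if age <= max_age_seconds else "red"), age


def status_board(paths, max_age_seconds=180.0, now=None):
    """Check several call-in files at once, one per remote server."""
    return {p: heartbeat_status(p, max_age_seconds, now)[0] for p in paths}
```

With one file per remote server (for example a hypothetical `Client042.ip` next to `BackupServer.ip`), `status_board` gives you the list of servers that have gone quiet.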
Our company uses 'Whats Up' by Ipswitch. Currently monitoring over 2500 devices such as servers, routers, temperature sensors. You can ping devices, monitor for SNMP events, logged events in Windows, AIX, Linux, WMI monitoring, services, tasks.... You can script custom monitors either via VBscript, Powershell, or JavaScript. You can script custom actions for Whats Up to take upon detecting a condition. Can restart services on either *nix or Windows boxes if they go down. Can launch applications if needed if a condition is detected. Can create audio, visual, and email alerts, as well as SMS. They license on a per-device basis as opposed to a per-port basis like SolarWinds. Only thing I don't care for on this software is you can only run Microsoft SQL for a database. Can't use any open source solutions. The default install uses MS-SQL desktop version, but the db size is limited. If you need to go bigger, you have to install a full install of SQL on the server, or connect to a remote SQL server on your network to host your database (as we are). My .02 cents...
You're messin' with my Zen Thing, man.....
Your spelling of sheer incorrectly shows your sheer ignorance.
Depending on the composition of your client base, this could easily run afoul of their security policies and trigger all kinds of intrusion detection alerts. Make sure that your clients have a clear understanding of what you are implementing and that they are OK with it. Preferably get it in writing.
I've used cacti to monitor servers before, works quite well.
Supports many SNMP functions, easy to setup.
Everyone who is posting suggestions is actually being useful. These are called 'Ideas' and the original poster can read these 'ideas' and see if something suitable comes to mind. It may also provide him with the valuable insight after researching these solutions that maybe he in fact does need a network admin.
It's also possible one of the solutions offered will actually suit his needs. He may not require Slashdot's "help" in any form; he may simply be reaching out to see if there is something he hasn't thought of yet. Communal advice and information. Welcome to the digital age.
For all you know he could be an exceptionally skilled network admin and more advanced than the people around him so a conversation with them may have a lower chance of yielding results, so he thought he'd throw it out there and see if anyone with knowledge would come forward and offer some ideas.
Being thorough doesn't mean you're "screwed" because you're at the point where you "need" help from people who browse slashdot.
You are asking for two things. Both these tasks are easy to implement, but a more important issue is security. You are interfacing with a system with very sensitive information. Employee SSN, bank info, full names, etc. You may also have access to sensitive company information. Employee names, Employee pay, etc. This is the information you need to protect or you will lose a customer.
Setting up a notification system should be easy. Sending SNMP is one way of doing this, but you could also use HTTP to POST information to your website. You would not need to modify firewall rules, nor set up a VPN tunnel. By documenting the information and limiting it to generic data (what happened, and who did it so you know what happened) you can make your customer feel comfortable that you are taking security seriously. I must mention the "who did it" should be a unique ID to identify who performed the task that caused the issue and not the user's SSN nor name; keep it cryptic so that if your system is compromised you can assure your customers that their information is safe.
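As a rough illustration of the HTTP POST approach, here is a Python sketch. Everything specific in it is an assumption on my part: the field names, the JSON format, and especially the use of a keyed hash (HMAC) to turn an internal user name into a cryptic ID. The key would stay on the customer's server, so the ID means nothing to anyone who only sees your side.

```python
import hashlib
import hmac
import json
import urllib.request


def opaque_actor_id(site_key, internal_user):
    """Derive a cryptic, stable ID from an internal user name."""
    digest = hmac.new(site_key, internal_user.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def build_event(site_id, what_happened, internal_user, site_key):
    """Build a generic event: what happened and an opaque 'who did it'."""
    return {
        "site": site_id,
        "event": what_happened,
        "actor": opaque_actor_id(site_key, internal_user),
    }


def post_event(url, event, timeout=5.0):
    """POST the event as JSON over an outbound HTTP(S) connection."""
    data = json.dumps(event).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Because the connection is outbound from the customer's server, this needs no inbound firewall rule and no tunnel.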
The second part, remote access, raises the bar significantly. As you will have access to all the customer's information and network, you might want to consult a network and security expert on best practices. To give your customer the ultimate level of control you might want to have some hardware that they can plug in to give you access and unplug to cut off all access. This could be as simple as a VPN router or a POTS modem.
This is an amazing product. I've used this in the past and LOVE it! Need to run a remote powershell command from your android? It does that. Dashboards for all the things? Has that covered.
Check it out:
http://www.pulseway.com/
If you're using linux or BSD, another option to reverse ssh tunnels or openvpn would be EPS Conduits: http://eps-conduits.sourceforg...
It was written with the goal of having a large number of remote devices form a virtual network for ease of management/maintenance.
We have VPNs to each data centre and client site and administer them over SSH generally. Some systems (eg ones dealing with customer details like credit cards) we have a single external facing host with Yubikey authentication to reach that network, and we use SSH port tunnelling to reach other hosts.
Can you rebuild the transmission and improve shift firmness while doing it? Can you replace a damaged quarter panel and color-match the paint? That's the analogy for developing a full remote monitoring solution. He's probably already doing basic "replace the plugs" type work on the network, but there's a big jump to what he's asking, and he doesn't seem to know enough to even ask the right questions.
No, an "exceptionally advanced admin" would not ask the question this way. We do know that's not the case here. :)
So about 7 years ago I tested out Nagios, What's Up Gold, Cacti, Zabbix, SolarWinds Orion, and a variety of other software monitoring solutions and the problem that we had for almost all of them is that they required heavy customization or that they were incredibly expensive when they included more initial customization regarding device discovery, included templates, etc. (a la SolarWinds). We finally settled on PRTG (www.paessler.com) because it had some of the industry standard devices templated already in a basic fashion, has an easy to use interface, and has the ability to be heavily customized.
Another feature that we were really needing was remote monitoring for our customers as we are an MSP. All Remote Probe agents with PRTG will create an encrypted SSL tunnel between the Remote Probe and your core server installation at your office or colocation. This requires no customization at all unless you are denying certain ports outbound from the probe server, in which case you simply need to allow port 23560 (or whatever you've customized it to) outbound to your core server's public NAT IP. This does not give you remote control of servers necessarily but it does provide a channel for all locally monitored data to be sent upstream to your location without requiring an OpenVPN or anything like that (except if you wanted remote access you could have PRTG's remote probe piggyback across there as well and you would then also have the ability to remote control). You can deploy as many remote probes as you would like and can therefore centralize all your monitoring data as well as create reports, custom maps, and even provide customer access via nested Access Rights dependencies.
One thing I will mention - SNMP trap monitoring is a wasted effort. I know there are many proponents of it out there but if you are not actively polling your data and gathering graphable results then you have no troubleshooting abilities, no trending reports, no data utilization analysis for service management, etc. You should configure templates for your devices to standardize them and monitor all of your critical data actively so you can then use the historical information to say "Ok...this server just went down - why? Check CPU utilization - OH it looks like all cores on this CPU jumped to 100% CPU utilization just before this device went unresponsive. Let me check my individual process utilization - OH there's the process causing the problem." Troubleshooting done. Imagine receiving a trap for this device - if the device is already unresponsive by the time the trap is sent, the trap never reaches your monitoring server and everything is still hunky-dory. You may also have ICMP monitoring in place so you know the device is offline but is the ISP down? Is some LAN resource like a Router/Firewall/Switch down? Is the server down? Why? Most of these questions can be answered by historical monitoring data and I cannot say enough that SNMP traps are useless 95% of the time.
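The "check the history" step can be sketched in a few lines of Python. This is only an illustration of why polled data helps, not anything PRTG does internally; the sample format, the 5-minute window, and the 95% threshold are assumptions.

```python
def cpu_before_outage(samples, outage_time, window_seconds=300, threshold=95.0):
    """Summarize polled CPU history in the window just before an outage.

    samples: iterable of (unix_time, cpu_percent) pairs from periodic polling.
    outage_time: unix time at which the device stopped responding.
    """
    start = outage_time - window_seconds
    recent = [cpu for t, cpu in samples if start <= t < outage_time]
    if not recent:
        return "no polled data in the window before the outage"
    peak = max(recent)
    if peak >= threshold:
        return f"CPU peaked at {peak:.0f}% shortly before the outage"
    return f"CPU looked normal (peak {peak:.0f}%) before the outage"
```

A trap-only setup would hit the first branch every time, since there is no stored history to look back on.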
For validation of my claims & experience with SNMP, I am a Principal Network Engineer for an MSP in LA for over 9 years and we currently operate a PRTG install for our MSP customer monitoring with over 18,000 sensors monitored actively, polled every 30 seconds.
Really, it's time for your medication.
Check out http://www.continuum.net/. I've been using their services for over 5 years, and they've been steadily improving it since they split from Zenith Infotech. No, it's not free, but it's quite cheap per unit and you get a lot of bang for your buck. Remote monitoring and alerts on any service, remote access, at-a-glance dashboard, etc. With 500 clients, I'm guessing you'd rather spend your time monitoring the situation than putting together a custom solution.
Similar situation, servers (appliances really) all over the world in customers' networks. We use chef to manage the systems (or puppet, pick your poison). Each system connects back to our management network using OpenVPN, certs managed by chef. Collectd runs on all servers, with some custom plugins for our own stuff, plus statsd for instrumenting our own code. All collectd metrics and logs (via syslog protocol) are sent back to our management network, stored in graphite and elasticsearch respectively. Nagios is configured using the chef nagios cookbook plus our own custom layer, which dynamically adds servers and metrics as they appear in chef, and removes them if they are deleted. A good chunk of the checks are requesting metrics from graphite instead of going out to the servers directly, some are passive checks with logstash triggering alerts based on log patterns. We're still in beta-ish stage, and I currently have 3K checks in Nagios, which would be impossible to manage by hand. This entire setup would be impossible without completely trusting chef.
It took a lot of automation work to get to this point, but I'm confident I can easily scale the number of systems out in the wild and everything will continue to work with the exception of needing larger servers for nagios/graphite/logstash.
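For anyone wondering what a check that reads from graphite instead of the server might look like, here is a minimal Python sketch. It is not the poster's actual setup. It assumes the JSON shape returned by Graphite's render API with format=json (a list of series, each with "datapoints" as [value, timestamp] pairs, where value can be null) and the usual Nagios plugin exit codes; the function names and thresholds are made up.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json):
    """Return the newest non-null value from Graphite render JSON, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def check_metric(render_json, warn, crit):
    """Map the latest metric value to a Nagios-style (exit_code, message)."""
    value = latest_value(render_json)
    if value is None:
        return UNKNOWN, "UNKNOWN - no recent datapoints"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value is {value}"
    if value >= warn:
        return WARNING, f"WARNING - value is {value}"
    return OK, f"OK - value is {value}"
```

A real plugin would fetch the JSON from the graphite server, print the message, and exit with the code. The UNKNOWN branch matters here: a server that has stopped sending metrics shows up as missing data rather than as a bad value.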
We've worked with plenty of Open Source monitoring solutions and in your situation www.gwos.com might be useful. Essentially it includes Nagios, Cacti, RRDTool etc. and you can create custom monitoring for specific services or pick out the services that are already configured (the list is quite extensive). Hope that helps?
I used to support Foglight, which is an enterprise monitoring tool. At 500 systems you are just big enough to maybe justify using it. It's a bit expensive, but it's pretty nice when everything is all set up.