Slashdot Mirror


Ask Slashdot: Getting a Grip On an Inherited IT Mess?

First time accepted submitter bushx writes "A little over a month ago, I assumed the position of programmer and sole IT personnel at a thriving e-commerce company. All the documentation I have is of my own creation, as I've spent most of my time reverse-engineering the systems in place just so I can understand how everything works together. Since I've started, I've done everything from network and phone upgrades to database maintenance with Perl, and thus far it's been immensely rewarding. But as I dig deeper, I notice the alarming number of band-aids applied by my predecessor, and it seems like the entire company's infrastructure is just a few problems away from a total meltdown. The big question now is, how do I, as a single person, effectively audit the network, servers, databases, backups, and formulate a long-term plan that can be implemented by one person? Is it possible? Where do I begin?"

14 of 424 comments

  1. Configuration management by Neil+Watson · · Score: 4, Informative

    Automate your servers so you can focus your time elsewhere. I use Cfengine.
    http://watson-wilson.ca/2011/03/enterprise-system-administration-using-configuration-management.html

    1. Re:Configuration management by 1s44c · · Score: 4, Informative

      Yes, automate everything, monitor everything, backup everything, document everything.

      I used to use cfengine but find puppet an easier tool to work with. Nagios and BackupPC are also wonderful tools but you might want to choose alternatives if they better fit your needs.

      You might want to express some concerns to management so that, if something critical does fall over, you don't look quite so bad.

    2. Re:Configuration management by vajrabum · · Score: 4, Informative

      Lots of folks here have talked about backups, but if your company is really successful then restores could be more of a problem than backups. Restoring large databases and system configurations can take a loooong time. Develop a plan for restore and execute it regularly as a test. Make sure management understands the time needed for restoration. Two other things: virtualize (that reduces the coefficient of friction for moving things considerably), and consider using Amazon or some other cloud provider in your restore plan in case your cage/server room/whatever burns. Some of those services are low or no cost until you start loading things up. If you go the cloud route, be sure to get a read on your traffic, storage and other billable numbers. If that's the disaster plan and the numbers are of any size at all, you need to run the cost by the CFO to make sure that it's sustainable.
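
      A rough sketch of what a scheduled restore test might look like, in Python. The file copy is only a stand-in for the real restore step, and the function names and file names are hypothetical; the point is to record how long the restore took and whether the restored data matches a known checksum.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def timed_restore_test(backup, restore_dir, expected_sha256):
    """Restore a backup into a scratch directory, time it, and verify it.

    ASSUMPTION: shutil.copy2 stands in for the real restore command;
    swap in whatever your backup tool actually does.
    """
    started = time.monotonic()
    restored = Path(shutil.copy2(backup, restore_dir))
    elapsed = time.monotonic() - started
    return {
        "file": restored.name,
        "seconds": elapsed,
        "verified": sha256_of(restored) == expected_sha256,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        # Hypothetical backup file created just for this demonstration.
        backup = Path(scratch) / "orders.sql"
        backup.write_bytes(b"-- pretend this is a database dump\n")
        target = Path(scratch) / "restore"
        target.mkdir()
        print(timed_restore_test(backup, target, sha256_of(backup)))
```

      The "seconds" figure from runs like this is the number to show management when you explain how long a real restoration would take.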

    3. Re:Configuration management by bill_mcgonigle · · Score: 3, Informative

      They all have root access still :-( A political fight I'm not yet prepared to have. I was able to take it away on the web servers, at least, and that's the only thing our developers touch, so life is a bit better.

      A fine baby step is to move everybody over to sudo. If you can get buy-in that everybody will track changes with git, then you have somebody to blame and can build a case if they break it. With sudo you have a record of who was mucking (in your /var/log/secure).
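
      As a minimal sketch of reading that record, the Python below counts sudo commands per user. It assumes the usual sudo syslog line layout ("sudo: user : TTY=... ; COMMAND=..."); check your own /var/log/secure before relying on the pattern. The sample lines, host name, and user names are made up.

```python
import re
from collections import Counter

# ASSUMPTION: log lines look like
#   "Mar  3 10:15:01 web1 sudo:    alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/bin/ls"
SUDO_LINE = re.compile(
    r"sudo(?:\[\d+\])?:\s+(?P<user>\S+)\s+:.*?COMMAND=(?P<command>.*)$"
)


def sudo_commands_by_user(lines):
    """Count how many sudo commands each user ran, ignoring unrelated lines."""
    counts = Counter()
    for line in lines:
        match = SUDO_LINE.search(line.rstrip("\n"))
        if match:
            counts[match.group("user")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mar  3 10:15:01 web1 sudo:    alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/usr/bin/systemctl restart httpd",
        "Mar  3 10:16:40 web1 sudo:      bob : TTY=pts/1 ; PWD=/etc ; USER=root ; COMMAND=/usr/bin/vi /etc/hosts",
        "Mar  3 10:17:02 web1 sshd[812]: Accepted publickey for alice",
        "Mar  3 10:20:13 web1 sudo:    alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/bin/ls /root",
    ]
    for user, count in sudo_commands_by_user(sample).most_common():
        print(user, count)
```

      To run it against the real log, pass an open file object (for example open("/var/log/secure")) instead of the sample list.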

      If they're perfectly reasonable/responsible and you can track changes, it's not such a problem, really, unless you're worried that they're secret agents meaning to break your stuff. I typically only see frustrating carelessness where people can get away with it.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  2. Re:I wouldn't worry... by mikael_j · · Score: 3, Informative

    Sadly there's a lot of truth to this. In my experience the difference between most "good" and "bad" networks is whether the WTFs are vendor-blessed hacks or in-house hacks.

    Of course, there are always those places where this is not the case but I've seen enough IT environments to believe that for a majority of companies this is sadly the state of things. If maintenance in the average factory was handled the same way IT is handled at the average company most machines would consist of approximately 30-50% duct tape, newspaper, string and glue...

    --
    Greylisting is to SMTP as NAT is to IPv4
  3. What, where, why... by ScottyLad · · Score: 5, Informative

    I've spent the best part of my career undertaking tasks like this (as an external consultant), with my average time on an assignment lasting somewhere between 18 months and 3 years.

    My aim on every project is to make myself obsolete - in that I try to get documentation up to a point where a suitably qualified individual could come in, read the documentation, and work the rest out for themselves.

    My primary objectives are to implement some form of inventory control to document the what / where / why...

    • What - What have you got (servers, software, services, contracts, operating systems, databases, users)
    • Where - Where is it - where are your servers, what machine is this software licence running on?
    • Why - What is the Business Justification for this service - what is the Business Impact if this database stopped running tomorrow?
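
    The what / where / why above can be kept as simply as a spreadsheet. A minimal Python sketch, in which the field names and the sample entries are hypothetical and not taken from any real environment:

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class InventoryItem:
    what: str             # server, software, service, contract, database...
    where: str            # physical location or host it runs on
    why: str              # business justification
    impact_if_down: str   # what happens if it stops tomorrow


def to_csv(items):
    """Render inventory items as CSV text, one row per item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(InventoryItem)])
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))
    return buffer.getvalue()


def missing_justification(items):
    """Return the items nobody has written a 'why' for yet."""
    return [item for item in items if not item.why.strip()]


if __name__ == "__main__":
    inventory = [
        InventoryItem("orders database", "db1, server room rack 2",
                      "stores all customer orders", "no sales can be processed"),
        InventoryItem("old reporting box", "under a desk", "", "unknown"),
    ]
    print(to_csv(inventory))
    print("Needs a business justification:",
          [item.what for item in missing_justification(inventory)])
```

    Anything that ends up in the "needs a business justification" list is a good candidate for a conversation with the business before any technical work goes into it.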

    Once you've got to that stage, then you're ready to get in to the real technical details. Remember that you are pitching your documentation to your successor, or to some imaginary "suitably qualified individual", so documenting what a system does and why is a higher priority than commenting every line of code.

    It is possible to do with one person, depending on the size of the organisation, and it can be particularly rewarding to do on your own. In a small business you often find that some of the users have a good understanding of some of the systems, or are keen to learn.

    You stated in your post that you've assumed the role of programmer and sole IT personnel - which means you need to learn to think like a manager as well as a techie (which is harder than most people imagine!). Once you learn to focus on the business priorities, you'll understand where to begin with the technical detail, and what level of documentation is required.

    --
    Philosopher (n) - a wise person who is calm and rational; someone who lives a life of reason with equanimity
  4. Re:Getting a Grip by Atanamis · · Score: 5, Informative

    agreed. As soon as I saw this was an IT department of one, I could tell the exact amount of care that management has on getting things like this corrected. These things are in place because management does not want to provide what is needed. If they only want to pay for band-aids, that is all they will have.

    This isn't necessarily the case though. I have a friend who took over IT at a small business. When he walked in they were using pirated software and their IT was a complete mess. After he put in hours to get it fixed up (with personal support from the owner), they ended up offering him an executive position with a massive pay increase. Some small shops with one IT guy really just don't know what they are doing, and haven't had a person in the job to tell them what is being done wrong. Your advice is still good though. A person in that situation needs to test whether they have management support to do things better. If so, it can turn into a career making opportunity to turn things around. If you can't get the management on your side though, it very well could be time to start looking for another job with more supportive management.

    --
    Atanamis
  5. Re:Quit by v1 · · Score: 5, Informative

    Yep. Hop into the waders and get to work. It can be a very rewarding experience turning a steaming pile back into a smooth running good looking machine.

    To add to the above, document everything. Though it sounds like you're already doing that. Make sure it's documentation that works for anyone not just you. Don't take anything for granted. Automate whatever you can, including problem detection and notification. (save yourself from having to check things daily or weekly, have it shoot you an email or something if a common issue crops up again)
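
    A minimal sketch of that kind of automated check, in Python, using a disk-space test as the example problem. The 90% threshold and the addresses are placeholders, and the script only builds the message; actually sending it (for instance with smtplib) depends on your mail setup and is left out.

```python
import shutil
from email.message import EmailMessage


def check_disk(path, threshold=0.90):
    """Return a warning string if the filesystem holding `path` is at or
    above `threshold` (a fraction of capacity), otherwise None."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total
    if fraction >= threshold:
        return f"{path} is {fraction:.0%} full"
    return None


def build_alert(problems, sender, recipient):
    """Build (but do not send) an email summarising the detected problems."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"{len(problems)} problem(s) detected"
    message.set_content("\n".join(problems))
    return message


if __name__ == "__main__":
    # Placeholder path and addresses; adjust for your own hosts.
    found = [p for p in (check_disk("/"),) if p]
    if found:
        print(build_alert(found, "monitor@example.com", "admin@example.com"))
```

    Run from cron or a similar scheduler, a script like this stays silent until something needs attention.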

    Make sure your employer fully understands the situation you and they are in, so they don't expect you to be doing improvements and striking things off their sore to-do-list that they were probably hoping you'd tackle the day you started. Get them a timeline as soon as you get something of a grip on the situation, tell them where you're going to be spending your time to start with, and the reasons why it's essential and going to delay their getting their bells and whistles and visible bang-for-the-buck of hiring you. Otherwise they may think you're just sitting on your butt because they're seeing no tangible benefits.

    If you've got a LOT of things that need to be fixed, things that can be done by closer to trained-monkey level, consider getting a temp assistant to help you dig out. Someone to run around and reimage machines, fix networks, repair stations, do RMAs, etc while you pull up your sleeves and unhack the servers. But if they're not in that big of a hurry this may not be appropriate.

    Good luck with it, sounds like fun actually, a challenge at the least.

    --
    I work for the Department of Redundancy Department.
  6. Re:Get management buy in... by kiwimate · · Score: 4, Informative

    Exactly correct.

    Step 1.

    Document. Look at your critical systems. Document what they are. Start at a high level - line of business, internal (HR, etc). Drill down - I have an Oracle server, I have a Citrix system to allow the users to remote connect, which uses a VPN, etc.

    Cost: your time.

    Step 2.

    Prioritize. What are the most important systems? Start with the systems which, if they go down, will cause the company to lose money. Then the ones which support internal processes. Rank order.

    Cost: your time. Possibly management's time - they may have input into priorities.

    Step 3.

    Audit. Start at the top and find out just what state they're in. If you don't feel sufficiently comfortable with a particular technology to do this yourself, hire an SME for a few hours.

    Cost: potentially the consulting SME to evaluate various systems. Note - the initial contract is an audit, not a "find everything and fix".

    Step 4.

    Fix. If you have audit notes which say "this critical line of business system is on the verge of death and once it dies it can't be resurrected", that goes first. If you have audit notes which say "this is a system which provides some reporting capabilities and it's a bit shaky, but worst case is you have to reboot the server and the reports to management go out a bit late", not so bad.

    If you get to step 3 and management won't pay, then you have a problem.

    If you get to step 2 and management won't give up their time, then you have a very big problem.

    A big question will be the level of support from management. If they are not supportive, or if money is tight and they say "we'd like to pay for the consultants but", then that's why you've rank ordered.

    If they're cooperative but don't have the money, work with them to figure out some kind of timeline based on highest risk.

    If they're stubborn, urgh, bad spot. Do your best to determine level of risk. Work with the company accountant to figure out the cost to the company if a critical line of business system goes down for 10 minutes. 2 hours. Include some waffle about reputation, if you can. Include any penalties or SLA violations, if you have those.
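
    The downtime arithmetic is simple enough to sketch in a few lines of Python. Every figure below is invented for illustration; the real revenue and penalty numbers have to come from the accountant.

```python
def downtime_cost(minutes, revenue_per_hour, sla_penalty=0.0):
    """Estimate the direct cost of an outage: lost revenue plus any fixed penalty."""
    return revenue_per_hour * minutes / 60 + sla_penalty


if __name__ == "__main__":
    # Hypothetical figures: 6,000 per hour in sales, 500 SLA penalty after 2 hours.
    print("10 minutes:", downtime_cost(10, 6000))
    print("2 hours:   ", downtime_cost(120, 6000, sla_penalty=500))
```

    This leaves out the harder-to-quantify reputation cost, which still belongs in the write-up for management.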

  7. Observium, ESXi, and Hobbit by charnov · · Score: 4, Informative

    The combo of Observium (network monitoring), Hobbit (monitor everything with extreme ease), and either ESXi or Proxmox VE for consolidation and ease of management/isolation/testing/etc. has served me well for years to take control of large organizations quickly. At the last two businesses I was hired to fix, I set this up, then built a parallel enterprise as VMs (the right way this time), and then cut everyone over in a weekend. No one noticed the change except to say stuff didn't crash anymore and it was really fast.

    Also OpenFiler and NexentaStor make for a great SAN.

    If you need more: PFSense for firewall or VLAN router, BlueIris for IP cameras, PBX in a Flash for VoIP, SoGo for Outlook compatible email, LibreOffice, etc.

    --
    [RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
  8. Re:1 suggestions by Anonymous Coward · · Score: 5, Informative

    Heavily

  9. Re:Quit by malkavian · · Score: 4, Informative

    I've been in a few similar situations over the years. The first thing you put on the table is "This is not an acceptable situation. Your risks are .".
    If they don't cover this, then that's really not your problem. I've been coding for 32 years, and doing sysadmin stuff as well for about 20 (among other strings to the bow), and live in despair of people who really don't understand that this stuff doesn't happen by waving a magic wand, and that there is more to it than making pretty buttons appear on a screen.
    At interview, if someone said they'd reverse engineered and documented a system in this environment (and yes, I interview people for dev/admin jobs from time to time), I would seriously ask them why they didn't get management to provide a junior to handle the paperwork and cover duties, while they dealt with the heavy lifting of reverse engineering and planning. I want someone around who will grok the risks, take responsibility and come up with a resilient service (not just a few machines that may be able to fail over). Budget isn't always easy to come by, especially if there are political axes to grind.
    I'm with the AC on this, from the limited info available. Either get them to get you a second, or get out. If the business is thriving, they can afford it, and they're just being cheapskates (and in many years, I've met quite a few like that) if they don't. You don't want to work for a cheapskate.

    The time to take this kind of work on solo is if you're part of a startup, when you've got a lot invested in the success of the company. You live or fall on your wits, capability, and ability to lose every evening, weekend, and many a night too, on keeping this up and running as cheaply as possible.
    Once the 'thriving' level arrives, you'd better make sure you're not still carrying that load alone, otherwise your own lifespan (as well as that of the company) may be quite severely limited.

  10. Re:methodically and late into the night by Anarke_Incarnate · · Score: 3, Informative

    The pride and arrogance in this trend of "I am god" here is sickening. I was part of a 2 man crew on a company of 40 people and it sucked. I got called on my way to vacation, on vacation and on the way back. I was called when ON the doctor's table. I left, and while I value the learning opportunities I was given, it is no way to run a company.

    If a company does not value IT and the efforts put into maintaining it, then they are bleeding you and deceiving themselves.

  11. Re:methodically and late into the night by racermd · · Score: 4, Informative

    I'm likely commenting too deeply for the person that asked the original question, but my advice seems to fit best here. What the company needs is an IT manager, whether hired directly or outsourced.

    Firstly, assess the corporate attitude towards hiring (competent) staff directly and buying or leasing hardware directly vs. purchasing outsourced services. Once you know where that conversation leads, you'll have a better idea of how to address the larger problems that only a bunch of time (and usually money) can solve.

    If the former, start the interviewing process ASAP. What you're looking for is self-starters that really do know their stuff. Take a handful of real-world scenarios, change some of the minor details a bit, and ask candidates what they'd do in that situation (or if they've encountered something like it before). Don't take them at their word, ask them to back it up with details of their own. Also, since you're going to wind up spending money on staff, you're probably going to be spending money on tools like new systems, software, and basic architecture hardware. Use an appropriate procurement process (and make sure it's followed) to meet your specific needs.

    If the latter, like I and many others here suspect it is, be sure to negotiate favorable contract terms with this in mind - everything is about money. You might be able to get a better rate on some services if you limit support to 8x5 instead of going 24x7, for instance. Is remote support acceptable or do you want someone on-site when you have to make that call? What is the response time to various levels of service calls? Do you want to host hardware on-site or have that done elsewhere? Things like that should be priced out and assessed against the needs of the business.

    Lastly, and importantly, regardless of how the company wants to do it, the goal is to streamline operations, which includes any support that's required when systems are not operating properly. Identify the weak subsystems and put them on a roadmap to be replaced with something more robust. It's a boring exercise in IT management that involves budgets and change control procedures, but it does pay off in the long run. If you need to get approval for spending, it helps to show what the current cost is, what the cost could be if things go wrong, and what the costs could be if the weak system is replaced with the more robust one. As long as you speak to your management in terms of money, they should listen.

    --
    My sources are unreliable, but their information is fascinating. -- Ashleigh Brilliant