Slashdot Mirror


Google Open Sources Its Image-Captioning AI (zdnet.com)

An anonymous Slashdot reader quotes ZDNet: Google has open-sourced a model for its machine-learning system, called Show and Tell, which can view an image and generate accurate and original captions... The image-captioning system is available for use with TensorFlow, Google's open machine-learning framework, and boasts a 93.9 percent accuracy rate on the ImageNet classification task, inching up from previous iterations.

The code includes an improved vision model, allowing the image-captioning system to recognize different objects in images and hence generate better descriptions. An improved image model meanwhile aids the captioning system's powers of description, so that it not only identifies a dog, grass and frisbee in an image, but describes the color of grass and more contextual detail.

2 of 40 comments

  1. Re:would be cool if it stayed on MY machine by Anonymous Coward · Score: 2, Informative

    It does stay on your machine. The Google Cloud Compute API doesn't even offer image captioning as a service right now. If you want to test this, you're going to have to get a decent NVIDIA GPU and build their TensorFlow code by following the README.md on GitHub.

    The reality is this isn't a useful product for robotics because the output of the network is a natural language caption. If you wanted to use this model for robotics, you would chop off the classifier and use the pre-trained Inception v3 model for whatever your needs were.

  2. Re:Wish I could spend serious time on this by Anonymous Coward · Score: 5, Informative

    If you've got $1200, you've got enough money to play in the arena. If you want to do "DeepMind"-level work, you need a substantially larger farm of GPUs.

    If you don't feel a need to replicate the latest flashy advances, there's still plenty of opportunity to make really interesting contributions with an NVIDIA GTX 960, training networks on MNIST (28x28x1 images).

    The time requirement is mostly reading in 15-30 minute chunks. It took me a year to read enough to feel fluent.

    Start here:
    http://www.dspguide.com/ch26.htm
    Then read these:
    https://en.wikipedia.org/wiki/Artificial_neuron
    https://en.wikipedia.org/wiki/Artificial_neural_network
    https://en.wikipedia.org/wiki/Multilayer_perceptron
    https://en.wikipedia.org/wiki/Softmax_function
    http://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-network
    http://image.slidesharecdn.com/cnn-toupload-final-151117124948-lva1-app6892/95/convolutional-neural-networks-cnn-44-638.jpg?cb=1455889178
    (TLDR: Using sigmoid/tanh as your transfer function suffers from something called "vanishing gradients": the derivative (used for "backpropagation") approaches zero as the input to a neuron becomes large in magnitude, which happens as the weights of the network grow, and the effect compounds across layers. Restricted Boltzmann Machines (RBMs) are trained with an alternative to backpropagation known as "contrastive divergence", so it was popular to stack them to form "deep belief networks" (DBNs, just a multi-layer RBM trained one layer at a time). The ReLU transfer function has grown popular because it largely avoids this problem with far less effort, which means you can safely skip RBMs and DBNs in your reading, at least initially.)
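
    To make the vanishing-gradient point concrete, here is a rough sketch in plain Python. It is a toy, not a real network: the helper names are made up for illustration, and `chained_grad` assumes every layer sees the same input and has weight 1, which real networks do not.

```python
import math

def sigmoid(x):
    # Fine for the moderate inputs used here; very negative x would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s). It peaks at 0.25 when x == 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_grad(local_grad, x, depth):
    # Toy assumption: backprop through `depth` layers multiplies `depth`
    # identical local derivatives together (all weights fixed at 1).
    return local_grad(x) ** depth

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

print("10 sigmoid layers, best case:", chained_grad(sigmoid_grad, 0.0, 10))
print("10 ReLU layers, active units:", chained_grad(relu_grad, 1.0, 10))
```

    Even in the sigmoid's best case (input 0), ten layers shrink the gradient by roughly a factor of a million, while active ReLU units pass it through unchanged.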

    Then read these:
    https://en.wikipedia.org/wiki/Support_vector_machine
    https://en.wikipedia.org/wiki/Convolutional_neural_network (Will explain what "Pooling Layers" are)
    https://www.reddit.com/r/MachineLearning/comments/3klqdh/q_whats_the_difference_between_crossentropy_and/

    Difference between "regression" and "classification":
    A regression network outputs the activations of the output neurons directly, while a classifier network passes them through the softmax function so that the output activations are all positive and sum to one, and can be read as class probabilities.
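
    A minimal sketch of that difference, using only the standard library. The raw output values are made up for illustration; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    # Shift by the max so math.exp never sees a large positive number.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw activations of three output neurons.
raw_outputs = [2.0, 1.0, 0.1]

regression_out = raw_outputs        # regression: use the activations as-is
class_probs = softmax(raw_outputs)  # classification: squash into probabilities

print("regression:", regression_out)
print("classifier:", class_probs, "sum =", sum(class_probs))
```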

    The most important thing to understand: it is trivial to train a neural network to perform well on its own training data (that's what backpropagation DOES). What is difficult is collecting enough data (preferably labeled) that you can hold out a significant portion for validation (to detect overtraining), plus another set of holdout data for TESTING. Your goal is to teach the network to generalize, i.e. to work on data it has never seen. Techniques that push the network toward this are called "regularization". The test holdout set is for verifying that you didn't overfit the validation data via "hill climbing" (repeatedly tuning against it).
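
    The train/validation/test holdout described above might look something like this. The function name and the 70/15/15 proportions are assumptions for illustration, not a recommendation from any particular library.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    # Shuffle a copy so the caller's data is left untouched.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                # touched once, at the very end
    val = shuffled[n_test:n_test + n_val]   # used to tune and to stop training
    train = shuffled[n_test + n_val:]       # the only data backprop ever sees
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

    The three sets never overlap, so a good score on the test set cannot come from memorized examples.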
    Cool trick: https://en.wikipedia.org/wiki/Dropout_(neural_networks)
    http://fastml.com/regularizing-neural-networks-with-dropout-and-with-dropconnect/
    https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_topologies (Neural Networks meet Evolutionary Algorithms)
    https://people.cs.uct.ac.za/~gnitschke/projects/papers/2009-Niche%20Particle%20Swarm%20Optimization%20for%20Neural%20Network%20Ensembles.pdf
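
    For the dropout trick linked above, here is a rough sketch of the commonly used "inverted dropout" variant, which rescales the surviving activations during training so nothing needs to change at test time. The function and parameter names are made up for illustration; real frameworks apply this to whole tensors.

```python
import random

def dropout(activations, drop_prob, training, rng):
    # At test time (or with drop_prob == 0) dropout is a no-op.
    if not training or drop_prob == 0.0:
        return list(activations)
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Zero each activation with probability drop_prob; scale the survivors
    # by 1 / keep_prob so the expected value stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print("train:", dropout(hidden, 0.5, training=True, rng=rng))
print("test: ", dropout(hidden, 0.5, training=False, rng=rng))
```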

    Other things to know: the learning rate is how quickly the network adjusts its weights (how big a step you take during stochastic gradient descent). Bigger steps = faster approach to a local minimum, but you tend to "overshoot" the high-performing valleys and get stuck on the low-performing surface. This is why it's generally a good idea to "anneal" (gradually reduce) your learning rate over time.
    http://sebastianruder.com/optimizing-gradient-descent/
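
    A toy illustration of overshooting and annealing, assuming a one-dimensional loss f(w) = w**2 and an inverse-time decay schedule. Both are picked for simplicity; the schedule is one of several in common use, and the names are hypothetical.

```python
def grad(w):
    # Derivative of the toy loss f(w) = w**2, whose minimum is at w = 0.
    return 2.0 * w

def descend(w0, steps, lr0, decay):
    # Plain gradient descent with an inverse-time learning-rate schedule:
    # lr_t = lr0 / (1 + decay * t). decay == 0 means a fixed learning rate.
    w = w0
    for t in range(steps):
        lr = lr0 / (1.0 + decay * t)
        w -= lr * grad(w)
    return w

fixed = descend(5.0, steps=50, lr0=1.0, decay=0.0)
annealed = descend(5.0, steps=50, lr0=1.0, decay=0.1)

print("fixed learning rate, final |w|:   ", abs(fixed))
print("annealed learning rate, final |w|:", abs(annealed))
```

    With the fixed rate, every step jumps clean over the valley to the mirror-image point and never gets closer. With the annealed rate, the steps shrink until the weight settles at the minimum.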

    Other cool things to learn about:
    Autoencoders and "Transfer Learning", i.e. you can get most of the value of Google's enormous GPU farms by simply downloading their pretrained Inception models, then using their outputs as pretrained features for other experiments.

    Caffe vs. TensorFlow vs. Keras vs. Torch? I vote: TensorFlow.
    https://www.tensorflow.org/versions/r0.9/tutorials/mnist/beginners/index.html

    Good luck!