I recently watched a TED Talk with Fei-Fei Li where she discusses visual intelligence. Much of her discussion is about how her team is creating software that lets computers recognize and describe objects in digital formats such as images and videos. Dr. Li and her team of researchers at Stanford University and Princeton University launched the ImageNet project in 2007. The project built one of the largest image databases, with over 14,000,000 images organized into roughly 22,000 classes. ImageNet became a testbed for the convolutional neural network algorithm, which can describe who is in a picture and what is happening, at almost any angle. However, the algorithm is not perfect and still makes mistakes with identification.
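For readers curious what "convolutional" means here, the core operation is sliding a small filter over an image and summing the products at each position. This toy sketch in plain Python is purely illustrative (a real network like the ones trained on ImageNet stacks many such layers with learned filters, nonlinearities, and pooling):

```python
def conv2d(image, kernel):
    """Slide a small kernel over a 2D image and sum the
    element-wise products at each position (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            # Sum of pixel * kernel-weight over the kernel window
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 3x3 "image" filtered by a 2x2 kernel yields a 2x2 feature map.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d(image, kernel))  # → [[6, 8], [12, 14]]
```

A trained network learns thousands of these kernels, each picking up a different visual pattern (edges, textures, eventually object parts).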
The hope is to connect human vision and computer vision. Before the ImageNet project a computer could not decipher objects in photographs; it could only tell us numbers. Now it can tell us so much more. The research has progressed, but it is still not where Dr. Li had imagined. Her goal for the research is to have computers tell the stories behind images, the way humans do. Computers should mimic a human's level of thought and understanding. Dr. Li hopes that her research on visual intelligence will be put to broader use: fighting crime more intelligently, helping in times of natural disaster, medicine, surveillance, and much more.
Listening to Dr. Li's discussion got me thinking about Facebook. Facebook uses facial recognition software similar in spirit to what her team is researching: it can detect faces in pictures of users. The project, called DeepFace, was developed by Facebook's Artificial Intelligence research group. It uses a method called deep learning, drawing on large amounts of data to find recurring patterns in faces. Essentially, it can take two photographs and determine whether the faces belong to the same person. It uses a 3D model to rotate each face to a specific angle so that the face is looking at the camera, which the system treats as the canonical version. The model has a 97.25% accuracy rate, which is almost as good as humans.
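Systems like this typically turn each aligned face into a vector of numbers (an "embedding") and then compare the two vectors. This is only a sketch of that comparison step; the function names and the threshold value are illustrative assumptions, and the real DeepFace representation and decision rule are learned from data:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_person(emb1, emb2, threshold=0.8):
    """Declare a match if the two face embeddings are similar enough.
    The 0.8 threshold is a made-up example value."""
    return cosine_similarity(emb1, emb2) >= threshold

# Two nearly identical embeddings match; very different ones do not.
print(same_person([0.9, 0.1, 0.2], [0.85, 0.12, 0.18]))  # → True
print(same_person([0.9, 0.1, 0.2], [0.1, 0.9, 0.05]))    # → False
```

The hard part, of course, is producing embeddings where the same person photographed at different angles lands close together, which is what the deep network and the 3D alignment step are for.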
The DeepFace project recognizes faces, whereas the ImageNet project identifies objects and actions, giving us a short description of a photograph. ImageNet goes a step further by aiming for a computer that can create a story. I wonder if Facebook will eventually adopt this type of technology. After all, the network seeks to make our usage easier. Since we can already tag people in photos with a single click (without doing it manually), why couldn't Facebook create captions and descriptions for our pictures? Obviously, we aren't there yet, but it looks as though it may not be as far away as we thought.
When I think about the ultimate goal of ImageNet, I wonder about the creativity aspect. Dr. Li wants computers to learn how to tell a story from images, but what does she mean by story? A computer can already produce a brief description of a photograph. How much detail can, or should, a computer have? When I think about stories, I think of them as a way to connect with readers, to entertain or inspire them. There is emotion in storytelling. My question is: how can you make a computer have emotion?
The discussion about visual intelligence brings up another big issue: privacy. For decades we have been able to detect faces in photographs, but identifying a person in photos is relatively new. While the technology has many perks and has become quite useful, not everyone is happy. It can be scary or frustrating, because we seemingly no longer have control over our identities. Anyone can snap pictures of us, whether via drones, surveillance cameras, or people walking the streets. This is no surprise to any of us. We are aware we are being watched and monitored. But how will we know when they've gone too far? Have they already? Is there such a thing as privacy anymore?