Abstract

This paper describes MIT UAV Team's full system for Mission 7a of the International Aerial Robotics Competition (IARC). Our system involves a hardware layer and vision, modeling, communication, and planning software layers to herd Roombas in an environment denied of external navigation aids such as GPS or large stationary points of reference. The vision component processes the camera input from a GoPro attached to the aerial vehicle to identify Roombas and gridlines at each frame. These partial observations of the field are combined to create a global model, which allows for vehicle localization and Roomba tracking. Using this model, our vehicle utilizes a heuristic strategy for high-level coordination planning consisting of a finite state machine, where each state is a hard-coded "behavior module".

The Project

The seventh mission of the IARC involves interaction between an aerial vehicle and moving ground robots. In particular, Roombas moving in a noisy trajectory must be guided across one side of a square field within a time constraint all while avoiding four moving obstacles in the field. The interaction of an agent with moving objects in an environment has applications in the use of aerial robots in moving platforms such as ships, trucks, or other air vehicles. There has been extensive previous work on aircraft control and navigation in noisy environments. However, few of these works involve the aerial robot interacting closely with moving objects in the environment. In such cases, determining the position of the aircraft is crucial. Traditionally, the aircraft position is determined using a combination of a global positioning systems (GPS) receiver and an inertial measurement unit (IMU). However, this solution is prone to failure when the GPS signal is weak or unavailable, as in indoor environments. In such cases, the localization task is handled with a vision system combined with other on-board sensors.These methods, such as Simultaneous Localization and Mapping (SLAM), generally rely on large stationary points of reference such as walls.

In our work, we are constrained with an environment denied of all external navigation aids except for gridlines on the field. As such, we introduce a localization method based on gridline tracking which allows the aircraft to build a global model of the field from frame-level observations. Our localization and mapping method is combined with a heuristic strategy for high-level coordination planning to complete Mission 7a of the IARC.

The Mission

The arena consists of a 20-by-20 meter field with three red sides and one green side. The entire field contains gridlines that divide the field into 400 one-by-one meter squares. The UAV hovers near one of the corners of the arena at the start of the game. During the mission, it cannot hover above three meters from the ground nor leave the arena for more than five seconds.

There are 10 Roombas initially placed around the center of the arena facing outwards. Every 5 seconds, a trajectory noise of <= +- 20 degrees is added independently to each Roomba, and every 20 seconds, all of the Roombas rotate 180 degrees. Otherwise, the Roombas move in a straight line at 0.33 meters per second.

In addition to these 10 Roombas, which the UAV must herd, there are also four obstacle Roombas with large poles of varying length on top that the UAV must avoid. The obstacle Roombas are placed 5 meters from the center, and all travel clockwise in a fixed circular motion.

The mission consists of autonomously herding the 10 Roombas with a UAV across the green side of the arena. The UAV acts on a Roomba by tapping the top of the Roomba to move it 45 degrees clockwise. Additionally, the UAV can also land in front of a Roomba, which activates the Roomba's collision sensor, turning it 180 degrees. The task of the UAV is to herd at least seven Roombas across the green side of the arena within 10 minutes

The System

Hardware

Solo 3DR quadcopter with gimbal, optical flow, and laser rangefinder.

Vision

Custom and classical machine vision techniques to identify gridlines and roombas.

Modeling

Persistent state belief updating based on vision's estimates.

Communications

Tapping/landing/movement control with IMU integration.

Planning

Heuristical algorithms to direct movement based on Modeling's estimates.

Hardware

The hardware and communication systems are based off of the 3DR Solo Drone. Rather than building a custom hardware and communication solution, we opted to use the Solo based on reliability of the Pixhawk, strength of the open-source community, and familiarity with ArduPilot, the autopilot firmware. In addition to the Solo Gimbal and GoPro, we equipped the Solo with a LIDAR Lite optical rangefinder, a PX4Flow sensor, and a kill switch.

The flight controller used onboard the Solo is the latest stable developer release of the quadcopter version of ArduPilot - ArduCopter PX4-quad version 3.4.0. The Solo uses a Pixhawk 2 to run ArduCopter onboard. The Mavlink protocol is supported by this software, which is used to send and receive information between the groundstation and the quadcopter's autopilot. Through Mavlink, data such as the quadcopter's current yaw angle, its arming state, and its local position are accessible. Additionally, commands can be sent through Mavlink to instruct the quadcopter to takeoff, land, navigate to a position, and more. The open-source ArduPilot community is very active, and this year was particularly advantageous to use the Solo as much work was done on GPS-denied navigation.

The PX4Flow sensor can be seen as a hardware substitute in a GPS-denied environment. It is an optical flow sensor that tracks the change in pixels of a point and estimates velocity. This sensor reading is a substitute for the velocity received from a GPS signal, and allows the quadcopter's onboard estimate of position to converge. Coupling the optical flow sensor with the LIDAR Lite optical rangefinder allows for a precise measurement of height that feeds into the velocity and position estimates. These two sensors enable precise indoor navigation of the Solo and reliable autonomous flight.

The kill switch was built to take advantage of the 3DR Solo's electronic speed controller (ESC) functionality. The Solo's ESCs automatically cut power to the motor as soon as the signal wire is cut. This is important, as we are able to cut power to the motor without needing to account for the total 380A of current that can be sent to the motors at maximum. Our design of the kill switch enables us to cut all four signal wires to the motors at the same time, leading to an instantaneous cut to the motors' power. These signal wires are sent through a relay that has a separate remote controller. This system is completely independent of the groundstation and on-board autopilot. Two remote controllers are provided and are on a different radio frequency than any other component onboard which eliminates radio interference.

A horizontal plexiglass plate is attached to the legs of the Solo. This custom plate is intended to be the piece of hardware that activates a Roomba upon landing. A hole is cut for each of the three optical elements on the bottom of the Solo: the optical flow sensor, the optical rangefinder, and the downward facing GoPro camera.

Vision

The vision system is the first level of processing. It takes in the raw camera input and outputs a list of gridlines and roomba positions represented in pixels. The transformation happens over a number of stages using a mixture of OpenCV functions and custom image processing code.

A GoPro Hero 3 attached to the bottom of the drone using a Gimbal provides a continuous downward facing camera stream. The first stage of vision processing reads in the GoPro image and debulges it to remove warp effects from the GoPro's large field of view. The debulging is accomplished through a spherical remapping tuned to the GoPro's focal length, which produces a resulting image with straightened gridlines.

Now that the camera feed is straightened, image processing begins by identifying roombas. The image is thresholded by red and green, noise is removed, and connected components are identified. The connected components are then pruned to identify potential roomba candidates (and remove other similarly colored objects, such as gridlines). The centers of the potential roomba candidates are calculated, and saved.

The roomba identification is placed on hold and the gridline identification begins. The image is thresholded to produce a binary image representing the white areas of the camera frame. The binary image is then cleaned to remove noise from white specks that aren't a part of the gridlines. Using a Hough line transform, the image is then fitted with a number of lines represented by starting and ending pixel positions. Over the remainder of the processing steps, these lines will be transformed to be represented by their line number on the field.

Dealing with the camera frame directly is now complete, and line processing & pruning begins. The primary axes are identified. Duplicate lines are then merged using information about the line's starting/ending points and angles. A number of pruning steps then occur: short, misoriented, floating, and intermediate endlines are all processed out to deal with a number of edge cases. Endlines are determined by looking for lines that act as boundaries which no other lines extend past. The pruned lines are then reprocessed using the new endline information, and incorrectly removed lines are re-added. We now have a set of lines represented in pixels each of which corresponds to a distinct gridline in the camera frame.

Modeling

The modeling system deals with the next level of processing: transforming vision's gridline & roomba estimates into continuous position estimates for the UAV & roombas. Most of the modeling processing happens closely in tandem with vision, with data and processing steps being traded back and forth.

Modeling takes vision's gridline estimates and compares them to gridline estimates from the previous frame. Additionally, modeling maintains a few persistent variables: the UAV's integer coordinate position, the UAV's floating point coordinate position, and the current axis of rotation. Combined, this information allows for a persistent beliefs about gridline numbers and UAV positions.

The current gridlines and previous gridlines are compared and greedily merged. The center of the camera frame is considered to be directly below the UAV: if a line was previously on one side of the center and is on the other after the update, the believed UAV's integer coordinate position is updated accordingly. The line numbers are then updated based on the new belief of the UAV's position. A corrective step occurs to check that no unreasonable line numbers are predicted (lines that would be off the grid) and to ensure that the endlines identified from vision have the correct numbers. Finally, the UAV's floating point coordinate position is updated by comparing the distance to the immediate gridlines on the UAV's right and left.

The roomba model persists and updates information about roomba beliefs across frames. Game rules are used in the update step: roombas move at a consistent velocity and turn around every 20 second. Each roomba is represented by a coordinate position, an angle, and a belief certainty that decays over time.

Modeling takes vision's identified roombas in the camera feed (represented in pixels) and transforms them into grid positions (in the same manner that the center of the camera is transformed into the UAV's floating point coordinate position). Next, modeling updates the roomba positions by greedily merging the observed roombas with roombas from the model. Missing roombas that should be in the camera frame but aren't are pruned.

Communications

Tapping roombas involves a coordinated effort between the vision, modeling, and communications systems.

Vision operates in two modes: the "observing" mode described above and an additional "landing" mode. The secondary mode is used when tapping to ensure that estimates don't get confused when the UAV descends. In landing mode, vision disables position processing/updates and switches to tracking the target's pixel position, which is communicated to the communication system.

The communication system executes tapping by attempting to keep vision's reported pixel target in the center of the camera feed.

Planning

Vision and modeling act as the UAV's peripheral system; they read in sensor input and translate it to a believed view of the world. Planning takes this model and treats it as a reality. As the final level of processing, the planning subsystem directs the communication system where to move the UAV to on the field.

Initially, we attempted to apply deep reinforcement learning to the planning problem directly. We provided a deep RL agent with the true state of a simplified, two-dimensional simulation, and we rewarded it when it successfully tapped roombas and when it scored points. We heavily penalized crashing into obstacles and leaving the arena.

With these rewards in place, the agent quickly learned to stay within the bounds and to avoid the obstacle roombas. It eventually learned how to tap roombas. However, at this point, learning plateaued, and the system did not improve further. Perhaps this was due to our reward function being too vague, but regardless, we sought other strategies.

We made a user-controlled, two-dimensional simulation to examine how humans would perform if they had full control of the UAV. We soon discovered that we could consistently beat the challenge (score seven roombas) with a little bit of practice. Having had this experience, we realized that human players adopt a handful of distinct strategies at different stages of the game. These strategies complement one another in such a way that makes successful play possible.

A number of such strategies were developed and then tested in our Gazebo simulator.

The "follow" strategy "babysits" the topmost (closest to the green line) roomba in the roomba model. This refers to following a roomba and tapping it whenever its path deviates from that of the direct route to the goal line. Humans use this strategy when they mentally commit to scoring a particular roomba.

The "circle" strategy loops around the arena to get a sense of where all of the roombas are. After all, only a fraction of the arena floor is visible at any given time.

The "defensive" strategy travels along the out-of-bound edges and taps roombas that are in danger of leaving. This helps prevent potential future points from being lost and is an essential part of scoring seven or more roombas.

Our planning strategy involves alternating between these strategies when certain criteria of the belief state are met. Time is an essential component of the transition decisions, but the locations and angles of the roombas also matter. Note: planning controls both the movement of the UAV and when it decides to land and tap a roomba.