Maker Kit

The ultimate DIY AI machine. Use machine learning to make projects with accelerated vision and audio intelligence.


The AIY Maker Kit is a quickstart manual for embedded artificial intelligence (AI) projects.

Using a Raspberry Pi and a few accessories, our tutorials show you how to build a portable device that can identify objects, people, and body poses with a camera, and recognize voice commands or other sounds with a microphone.

You'll use machine learning (ML) to create new TensorFlow Lite models that recognize your own objects and respond to your own speech commands. And you'll learn to write code that reacts to these inputs to create a variety of smart applications.

You don't need to know anything about machine learning and you don't need to be a skilled programmer, but you should know your way around a computer. We'll provide step-by-step instructions to run some code examples and make a few code changes, and then you can decide how much more code you want to write to build your own projects. (We provide some help and ideas toward that as well.)

So let's get started!

Locate objects

Identify body parts

Recognize speech

List of materials

The Maker Kit hardware is based on a Raspberry Pi computer, and to execute advanced ML models at high speeds, we use the Coral USB Accelerator. This USB accessory includes a Coral Edge TPU, which is a computer chip that's specially designed to run machine learning (ML) models really fast—it's like an ML turbocharger for your Raspberry Pi. Then you just need the eyes and ears—a Pi Camera and USB microphone—to build your smart projects.

Required hardware

  1. Raspberry Pi Camera (×1)
  2. Raspberry Pi 4 (or Pi 3) (×1)
  3. Coral USB Accelerator (×1)
  4. USB-C cable (included w/ Coral) (×1)
  5. microSD card (8 GB or larger) (×1)
  6. 5V/3A power adapter (USB-C for Pi 4, or Micro-USB for Pi 3) (×1)
  7. USB microphone (×1)

Get the hardware

All the parts listed above are available from a variety of electronics distributors. If you need all the parts or just some of them, you can find them at the online shops below.

You also need a personal computer that's connected to Wi-Fi. You can use a Windows, Mac, or Linux computer. Our instructions include steps for Windows and Mac only, but if you're a Linux user, you should be able to follow along just fine.
Hardware exceptions: You can use a USB camera instead of the Pi Camera (it just won't look as neat in the cases below). You can also use a Raspberry Pi Zero 2, instead of a Pi 3 or Pi 4, but you'll need an extra USB adapter for the accelerator and microphone.

Optional materials

The AIY Maker Kit isn't just about making AI; it's also about being a maker! A maker is a person who builds stuff. Often, a maker will creatively combine technology with hardware to solve problems, provide entertainment, create art, or just to explore technology. So it's up to you to create a case for your kit.

We offer two DIY options, but neither is required and you can create a case that's completely different. If you build your own case, just make sure the Raspberry Pi and USB Accelerator have room to breathe because they can get hot.

3D-printed case

  1. Case top (×1)
  2. Case base (×1)

Cardboard case

  1. Glue stick (×1)
  2. Craft knife (×1)
  3. Scissors (×1)
  4. Cardstock or cereal box cardboard (10" x 6") (×1)
  5. Box from the Coral USB Accelerator (×1)

Create a case

Here are the design files for our DIY cases.

3D-printed case

This 3D-printed case snaps together with two parts. The top includes a mount for the Pi Camera, and you can re-orient the camera by printing a separate camera enclosure (not pictured). The design includes variants for Raspberry Pi 4 and Raspberry Pi 3.

If you don't own a 3D printer, you can still get this case:

  • Search the internet for a "local makerspace", which might have a 3D printer you can use.
  • Find an online "3D printing service" that will print the parts and mail them to you.

The following ZIP file includes all the STL files for printing, plus STEP files so you can remix the design. For more information, see the included README files.

Note: Begin printing the case or place your order for one and then continue with the setup guide. You can put the hardware into your case later on.

Cardboard case

This cardboard case holds everything together and can be assembled in less than an hour. The case is built using a small sheet of cardstock (or other thin cardboard such as from a cereal box) and the box in which your Coral USB Accelerator was packaged.

The following PDF files include all the assembly instructions and the templates you need to print for construction. Be sure you download the appropriate PDF format for your printer paper.

Note: Don't begin assembling the cardboard case until you begin flashing the SD card.
Beware: Unfortunately, if your Coral USB Accelerator included a black USB cable (instead of a white USB cable), it will not fit into this cardboard box design.

Setup guide

The following instructions show how to flash your microSD card with our AIY Maker Kit system image and assemble your case. The Maker Kit system image includes Raspberry Pi OS with all software and settings that enable the ML projects that follow.

Before you start, double check that you have all the required materials. It's okay if you don't have your case yet; you can start programming now and put that together later.

Note: If you already have a Raspberry Pi system, we suggest you follow these instructions to re-flash your microSD card (or flash a different SD card), because it will avoid many possible conflicts with the software dependencies. However, if you insist on using an existing system, we do offer a guide to upgrade an existing Raspberry Pi system (for experienced Raspberry Pi users only).

1. Flash the SD card

Get the SD card image

Download the Maker Kit system image (based on Raspberry Pi OS):

Download aiy-maker-kit-2022-05-18.img.xz

Install the flashing software

  1. Download the Raspberry Pi Imager from here.

    The web page detects your operating system, so just click the blue Download button.

  2. Install the Raspberry Pi Imager app.

    • On Windows, launch the .exe file and follow the setup wizard.
    • On Mac, launch the .dmg file and drag the Raspberry Pi Imager into the Applications folder.

Note: You must have Raspberry Pi Imager v1.7 or higher.

Connect the SD card to your computer

Use an SD card adapter (if necessary) to connect your microSD card to your Windows, Mac, or Linux computer.

Select the SD card and system image

  1. Launch Raspberry Pi Imager.
  2. Click Choose OS, scroll to the bottom, and select Use custom.
  3. Select the Maker Kit system image (the .img.xz file you downloaded above).
  4. Click Choose storage and select your microSD card (it might be named something like "Generic MassStorageClass Media").
  5. Do not click Write yet.

Configure advanced settings

In order to log into the Raspberry Pi wirelessly, you must specify the Wi-Fi details for the system image:

  1. Click the gear icon in Raspberry Pi Imager to open the Advanced options.
  2. Scroll down a little, select Configure wireless LAN, and then enter your Wi-Fi name (SSID) and password. Beware that the SSID name is case sensitive (capitalization matters).
  3. Be sure that Wireless LAN country is set to your current location.
  4. Triple-check that your Wi-Fi name and password are correct, and then click Save.
  5. Now click Write to begin flashing the SD card.

This can take several minutes. So while that’s working, proceed to assemble your case.

Need more help? Watch this video about how to use the Raspberry Pi Imager.

2. Assemble your case

The following instructions are for the 3D-printed case. If you don't have your case yet, that's okay, just connect the hardware as described below (connect the camera and USB Accelerator) and you can put it in the case later.

If you're building the cardboard case, instead follow the instructions in the PDF download and when you're done, skip to turn it on.


Open the camera latch on the Raspberry Pi

Between the HDMI ports and the headphone jack, you'll find the camera cable latch.

Pull up on the black latch to open it.

Insert the flex cable and close the latch

Grab the end of the camera flex cable (it should already be attached to the Pi Camera) and insert it so the blue strip is facing toward the black latch.

Be sure the cable is inserted fully, and then push down on the latch to lock the cable in place.

Insert the Raspberry Pi

Start by inserting the side of the board with the headphone jack so those plugs rest in the corresponding holes.

Then press down on the opposite side of the board (with the 40-pin header) so it snaps into place. You can gently pull outward on the case wall so it snaps in easier.

Make sure it's inserted fully: The Ethernet port should rest flat on the case opening, and the SD card slot on the opposite side should be accessible.

Note: The SD card should not be in the board yet—it can make inserting the board more difficult.

Attach the Camera

First set the camera's top mounting holes onto the matching pegs, and hold them firmly in place.

Be sure the holes on the camera board are aligned with the posts, and then press firmly on the cable connector to snap the camera into place. You might need to press hard.

Ensure that both side clips are now holding the board and the camera lens is set in the hole.

Fold the flex cable

Press the cable against the roof of the case and then fold it backward where it meets the wall, creasing the cable.

Then fold the cable again at a 90-degree angle after it passes over the camera, creasing it so the cable turns toward the outer USB cutout, as shown in the photo.

Don't worry, you will not damage the cable by creasing it.

Set the cable inside

As you close the lid, the flex cable should turn backward one more time, right behind the USB ports. You do not need to crease the cable here.

Snap the lid shut

First set the clips on the side near the board's GPIO header.

Then set the opposite side by pressing gently on the side with the plug openings, while pressing down on the lid.

Attach the USB Accelerator

Start by setting the top corners of the USB Accelerator into the case, so the USB plug is on the same side as the microSD port on the Raspberry Pi.

Then press down to snap it in.

Connect the USB cable

Connect the USB cable to one of the blue ports on the Raspberry Pi, and press the cable into either of the two cable guides.

The blue ports support USB 3.0, which provides much faster data-transfer speeds than USB 2.0.

Leave the USB Accelerator unplugged so you can access the microSD slot.

Attach the USB microphone

If you plan on using audio classification (speech recognition), then attach your USB microphone to any of the available USB ports.

Note: The headphone jack on the Raspberry Pi does not support input, so you must use a USB microphone (unless you're also using an audio HAT with a mic input).

3. Turn it on

Insert the SD card

  1. Wait until your SD card is done flashing. Raspberry Pi Imager will show a dialog that says Write Successful and it automatically "ejects" the card.
  2. Remove the SD card from your computer.
  3. Locate the SD card slot on the bottom of the Raspberry Pi and insert the microSD card (the card label faces away from the board).
  4. If you're using the 3D-printed case, you can now plug the USB cable into the USB Accelerator.

Power on the Raspberry Pi

Plug in your USB power adapter to the board, and connect it to your wall power supply.

Note: You do not need a monitor, keyboard, or mouse attached to the Raspberry Pi. You will connect to the board over Wi-Fi.

Let it boot up

Make sure the red LED next to the SD card slot is illuminated. The green LED also flickers during boot-up.

Allow about two minutes for the system to boot up for the first time. When it's ready, the green LED mostly stops blinking, while the red LED remains solid. (Subsequent boots are much faster.)

Warning! The Coral USB Accelerator can become very hot after extended use. Always avoid touching the metal surface during operation.

Caution: The Raspberry Pi OS is running directly from the microSD card, so you should never remove the card while the board is powered on. Doing so can corrupt the card data.

4. Log into the Pi with VNC

Install VNC viewer

  1. Download the VNC Viewer from here.

    The web page detects your operating system, so just click the blue Download VNC Viewer button.

  2. Install the VNC Viewer app.

    • On Windows, launch the .exe file and follow the setup wizard (we recommend you leave any install settings as their defaults).
    • On Mac, launch the .dmg file and drag the VNC Viewer into the Applications folder.

Launch VNC Viewer

On Windows:
  1. Press the Windows-logo key to open Windows search and type "vnc".
  2. Click VNC Viewer to launch the app.
On Mac:
  1. Press Command + Spacebar to open Spotlight search and type "vnc".
  2. Click VNC Viewer to launch the app.

At the startup screen, click Got it to continue.

Find the Raspberry Pi's IP address

Your Raspberry Pi should already be booted up and connected to your Wi-Fi network (because we specified the Wi-Fi login during the flash procedure). So we now need to get its IP address.

On Windows:

Open a Command Prompt (a Windows app that allows you to issue command-line instructions; to open it, press the Windows-logo key to open search, type "command prompt", and then press Enter) and run this command:

ping raspberrypi -4 -n 1

The first line in the response should include the IP address in brackets. Keep this IP address visible for the next step.

On Mac:

Open a Terminal (an app that allows you to issue command-line instructions; to open it, press Command + Spacebar to open search, type "terminal", and then open the Terminal app) and run this command:

ping raspberrypi -c 1

The first line in the response should include the IP address in parentheses. Keep this IP address visible for the next step.

Help! If this does not print the IP address, try the following:

  • Wait a couple minutes in case the Pi is not fully booted up. If the ping command still fails, unplug the Raspberry Pi and plug it in again, and then wait a minute to try again.
  • Connect an HDMI monitor to your Raspberry Pi. On the Raspberry Pi desktop, you should see a file with your IP address as its name—write it down. If the filename is "Network not found," the Raspberry Pi could not connect to your Wi-Fi, so reflash the SD card and verify the Wi-Fi name and password in the advanced options.
  • If you're still unable to connect over Wi-Fi, connect an Ethernet cable between your Raspberry Pi and PC and run the ping command again. You'll need to keep the cable connected while using VNC.
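
If you would rather not pick the address out of the ping output by eye, the short Python sketch below shows one way to do it. It is only an illustration: the extract_ip() helper and the two sample output lines are assumptions made up for this example, and the exact wording that ping prints can vary between systems.

```python
import re

# Matches a dotted IPv4 address wrapped in [brackets] (Windows style)
# or (parentheses) (Mac style).
_IP_PATTERN = re.compile(r"[\[(](\d{1,3}(?:\.\d{1,3}){3})[\])]")


def extract_ip(ping_line):
    """Return the first wrapped IPv4 address in a line of ping output, or None."""
    match = _IP_PATTERN.search(ping_line)
    return match.group(1) if match else None


# Hypothetical first lines of ping output; your address will differ.
windows_line = "Pinging raspberrypi [192.168.1.42] with 32 bytes of data:"
mac_line = "PING raspberrypi (192.168.1.42): 56 data bytes"

print(extract_ip(windows_line))
print(extract_ip(mac_line))
```

Both sample lines carry the same address, one in brackets and one in parentheses, which matches the difference between the Windows and Mac responses described above.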

Enter the IP address

Type the IP address in the text box at the top of the VNC Viewer window.

VNC Viewer will save this device address. So next time you want to connect, you can just double-click on the device that appears in VNC Viewer.

Note: After you shut down your Raspberry Pi for a while, it's possible the Raspberry Pi's IP address will change when you turn it back on. So if VNC is unable to connect to your saved device, repeat the previous step to find the IP address. Then right-click on the device in VNC Viewer, select Properties, and update the IP address on that screen.

Log in using the default password

You should then see a login dialog where you can use the following login information:

Username: pi
Password: raspberry

Click OK, and in a moment you'll see the Raspberry Pi desktop appear.

Note: You might need to resize the VNC Viewer so you can see the whole Raspberry Pi desktop. (Because we did not connect a monitor to the Raspberry Pi, it uses a fixed window size. You can customize the resolution later by running sudo raspi-config.)

Complete the Pi OS setup

The desktop might show two dialogs on top of each other. If the dialog on top is a warning to change your password, click OK to dismiss it (you'll change the password next).

The other dialog says "Welcome to Raspberry Pi" and helps you configure some settings:

  1. Click Next and follow the wizard to set your country, language, and timezone.
  2. You'll then see a page where you can change the password, so enter a new password and click Next.
  3. Ignore the page to set up your screen and click Next.
  4. Also ignore the page about WiFi because you're already online, so click Skip.
  5. When it offers to update your software, click Skip because it's not necessary and can take some time.

Now you're ready to start developing with ML!

Note: If you want to change your password again later, open the Terminal and run the passwd command.

How to shut down

You just got started, so don't shut down yet, but when you decide you're done for the day, it's important that you properly shut down the system before unplugging the power.

You can shut down the system in two ways:

  • When viewing the Raspberry Pi OS desktop, open the applications menu (the Raspberry Pi icon in the top-left corner of the desktop) and click Shutdown.
  • From the Raspberry Pi terminal, run the following command:

      sudo shutdown

Wait for the green LED to stop blinking (the red LED stays on even when the operating system is shut down), and then unplug the power.

Caution: Failure to properly shut down the system before unplugging the power could corrupt the SD card data.

Project tutorials

Now the real fun begins! Let's build some Python projects with machine learning models!

To make these projects easier, we created some helpful Python APIs in the aiymakerkit module. These APIs perform the complicated (and boring) tasks that are required to execute an ML model. Our APIs also perform common tasks such as capturing images from a camera, drawing labels and boxes on the images, and capturing audio from a microphone. This way, you need to write a lot less code and can focus on the features that make your projects unique.

So let's get started!

Note: Basic programming experience is required to complete these tutorials. Each tutorial begins with a fully functional Python script that you can run from the terminal. To expand on these scripts, the tutorials provide some new code you can copy-paste into the files. That means you need to know just the basics about Python code syntax, such as line indentations, so you know how to insert the code.

Verify your setup

All the software you need is already installed on the system image you flashed to your SD card. So let's run a quick test to make sure your hardware is connected and working as expected.

  1. Open the Terminal on your Raspberry Pi. (The Terminal is an app that allows you to issue command-line instructions for the computer. On the Raspberry Pi desktop, the Terminal icon appears in the task bar at the top and looks like a black rectangle. Click it to open a terminal window.) By default, the terminal opens to your Home directory and the prompt looks like this:

    pi@raspberrypi:~ $

    pi is the username and raspberrypi is the device hostname. These are the default names, but perhaps yours are different if you changed them in the Raspberry Pi Imager options.

    The squiggly line (called a tilde) represents the Home directory, which is currently where the command prompt is located. (You can run pwd to see the full path name for your current directory.)

  2. Change your directory to the aiy-maker-kit folder by typing this command and then pressing Enter:

    cd aiy-maker-kit

  3. Then run our test script with this command:


    This runs some tests to be sure the required hardware is detected. If all the tests succeeded, the terminal says "Everything looks good." If something went wrong, you should see a message with a suggestion to solve the problem, so take any necessary steps and try again.

Now you're ready to build some projects with ML!

But before you continue, change your directory again to the examples folder where we'll get started:

cd examples

Build a face-detecting smart camera


  • Create an app that takes a photo when all visible faces are centered in the frame.

What you'll learn:

  • How to continuously capture frames from the camera.
  • How to feed each frame to an object detection model.
  • How to check if a detected object is inside a rectangular region.

This project builds upon the example, so let's start by running that code:

  1. Your terminal should already be in the examples directory, so just run this command:


    Make sure the camera is pointed toward you and your face is well lit (point the camera away from any windows/lights).

    When the video appears on the screen, you should see a box drawn around your face. Also try it with multiple faces. If there's nobody else around, you can hold up a photo instead.

    Note: If you see a warning that says, "Locale not supported by C library," you can ignore it. It's harmless and related to your locale change during setup, so it will go away after a reboot.

  2. Press Q to quit the demo. (If it does not close, click on the camera window so it has focus and then press Q again.)

  3. To start building the smart camera, make a copy of with a new name that will be your project file. Here's a quick way to make a copy from the terminal:


  4. Now open the new file for editing in Thonny, an integrated development environment (IDE) for Python programming (basically a fancy text editor that specializes in Python code):

    thonny &>/dev/null &

Next, we'll start writing some code to build the smart camera.

Note: If you think the above thonny command looks weird, you're right, but those extra instructions are helpful. When you launch an app like Thonny from the command line, by default, it runs in the "foreground" of the terminal, which means you cannot issue new commands until that app quits (try running just thonny to see for yourself). So, by adding &>/dev/null & at the end, we specify that any terminal output should be discarded into the "null" file (it's ignored) and the app process should run in the "background" so we can get the command prompt back. Alternatively, you can run xdg-open . (include the period) to open the current directory in the File Manager, and then you can double-click the file to open it in Thonny.

1. Get to know BBox

For this project (and the next one), you need to know the location of a detected object inside the image. To help with that, we'll use the BBox API from PyCoral, which specifies a set of bounding-box coordinates. The code you're looking at now doesn't use BBox because the vision.draw_objects() function actually draws the boxes around each face. So let's take a look at how we can get the bounding-box details for our own purposes.

The detector.get_objects() function is what actually detects objects using an ML model, and it returns a list of the objects it found (as Object objects). So let's look at this by printing the function's returned value:

  1. In your editor, add a line to print the faces results:
for frame in vision.get_frames():
    faces = detector.get_objects(frame, threshold=0.1)
    print(faces)  # You add this line
  2. Press CTRL + R to run the script from Thonny. After a moment, the video should appear and you will see lots of results like this in the Thonny shell:

    [Object(id=0, score=0.79140625, bbox=BBox(xmin=227, ymin=84, xmax=347, ymax=227))]

What this means is the get_objects() function returns a list (denoted by the square brackets) with objects (each defined by the Object class) that hold members named id, score, and bbox:

  • The id is the type of object detected (in this case, the model detects only faces so it's always zero).
  • The score is the model's confidence about the detection (in this case, the model is 79% sure it sees a face).
  • The bbox is a BBox object that specifies the bounding box coordinates (that is, the face's location in the image).
Note: To quit the program, either press Q when the video window has focus, or press Stop in Thonny (if you launched it from Thonny).

As you build programs using object detection models like this one, you'll find this BBox object and its functions to be very useful. For now, let's focus on the member variables within it that specify the box coordinates:

  • xmin is the box's left-most point.
  • ymin is the box's top-most point.
  • xmax is the box's right-most point.
  • ymax is the box's bottom-most point.

Just remember that the x and y coordinates begin at the top-left corner.

Now that you understand the bounding-box details, you can do all sorts of stuff by reading the location of each detected object from the BBox API.
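
To get a feel for these coordinates without the camera running, here is a small sketch that works on the sample result printed earlier. The BBox and Object types below are stand-ins built with namedtuple that only mirror the field names shown in that output (they are not the PyCoral classes), and box_size() and box_center() are helper names invented for this example.

```python
from collections import namedtuple

# Stand-ins that mirror the field names in the printed result, so this
# runs without the camera or PyCoral installed.
BBox = namedtuple("BBox", ["xmin", "ymin", "xmax", "ymax"])
Object = namedtuple("Object", ["id", "score", "bbox"])


def box_size(bbox):
    """Return (width, height) of the box in pixels."""
    return bbox.xmax - bbox.xmin, bbox.ymax - bbox.ymin


def box_center(bbox):
    """Return the (x, y) pixel at the middle of the box."""
    return (bbox.xmin + bbox.xmax) // 2, (bbox.ymin + bbox.ymax) // 2


# The sample detection from the Thonny shell output above.
face = Object(id=0, score=0.79140625,
              bbox=BBox(xmin=227, ymin=84, xmax=347, ymax=227))

print("size:", box_size(face.bbox))      # size: (120, 143)
print("center:", box_center(face.bbox))  # center: (287, 155)
```

Because y grows downward from the top-left corner, a smaller ymin means the box sits higher in the image.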

2. Draw a rectangle

You can also use the BBox API to define your own rectangles. So let's draw a rectangle that specifies the center region where all faces must be in order to trigger a photo.

First import this API at the top of the file:

from pycoral.adapters.detect import BBox

Then, instead of specifying a fixed rectangle size, we'll make our rectangle proportional to the image provided by the camera, which is specified by vision.VIDEO_SIZE. And you need to define this box only once, so add the following code before the for loop:

# Define the auto shutter detection zone
width, height = vision.VIDEO_SIZE
xmin = int(width * 0.25)
xmax = int(width - (width * 0.25))
ymin = int(height * 0.2)
ymax = int(height - (height * 0.2))
camera_bbox = BBox(xmin, ymin, xmax, ymax)

Notice that each coordinate is cast to an integer with int() because when the time comes to draw the rectangle, the pixel coordinates cannot be float values (fractional pixel locations don't make sense).
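
If you want to check the arithmetic before running the camera, the sketch below wraps the same calculation in a function. The function name centered_zone() and the 640 x 480 frame size are assumptions for illustration only; in your script the real size comes from vision.VIDEO_SIZE.

```python
def centered_zone(width, height, x_margin=0.25, y_margin=0.2):
    """Return (xmin, ymin, xmax, ymax) for a zone inset from each edge.

    x_margin and y_margin are fractions of the frame width and height.
    """
    xmin = int(width * x_margin)
    xmax = int(width - (width * x_margin))
    ymin = int(height * y_margin)
    ymax = int(height - (height * y_margin))
    return xmin, ymin, xmax, ymax


# Assumed frame size, for illustration only.
print(centered_zone(640, 480))  # (160, 96, 480, 384)
```

With those margins the zone covers the middle half of the width and the middle 60% of the height.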

Then to draw this rectangle on the video, call the vision.draw_rect() function every time the for loop provides a new frame, passing it the frame to draw on and the BBox to draw. So the for loop should now look like this:

for frame in vision.get_frames():
    faces = detector.get_objects(frame, threshold=0.1)
    print(faces)  # You can delete this line now
    vision.draw_objects(frame, faces)
    vision.draw_rect(frame, camera_bbox)

You can run the script again and you should see both a red box around your face and a blue box in the center of the video. This blue box is what we'll use to detect when faces are near the center.

3. Check if a face is inside the rectangle

Assuming your face is the only one visible in the room, you now have two BBox objects during each loop (the BBox returned by get_objects() and the BBox you just created). So you just need to figure out if the face box is inside the camera box.

Here's a function to do just that—put this just above the for loop:

def box_is_in_box(bbox_a, bbox_b):
    if ((bbox_a.xmin > bbox_b.xmin) and (bbox_a.xmax < bbox_b.xmax)) and (
        (bbox_a.ymin > bbox_b.ymin) and (bbox_a.ymax < bbox_b.ymax)):
        return True
    return False
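
You can convince yourself that this comparison works by trying it on a couple of hand-written boxes. In the sketch below, BBox is a namedtuple stand-in with the same field names (not the PyCoral class), the zone assumes a 640 x 480 frame, and face_centered is a made-up box; only face_high comes from the sample output earlier. The function is written as a single boolean expression, which behaves the same as the if/return version.

```python
from collections import namedtuple

# Stand-in with the same field names as the PyCoral BBox.
BBox = namedtuple("BBox", ["xmin", "ymin", "xmax", "ymax"])


def box_is_in_box(bbox_a, bbox_b):
    """True when bbox_a lies strictly inside bbox_b on all four sides."""
    return (bbox_a.xmin > bbox_b.xmin and bbox_a.xmax < bbox_b.xmax and
            bbox_a.ymin > bbox_b.ymin and bbox_a.ymax < bbox_b.ymax)


zone = BBox(160, 96, 480, 384)           # assumed zone for a 640x480 frame
face_high = BBox(227, 84, 347, 227)      # sample detection from earlier
face_centered = BBox(227, 120, 347, 263)  # hypothetical, moved down 36 pixels

print(box_is_in_box(face_high, zone))      # False
print(box_is_in_box(face_centered, zone))  # True
```

The first face fails only because its top edge (ymin=84) is above the zone's top edge (ymin=96), so a single side out of bounds is enough to return False.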

Then at the end of the for loop, you can detect if a face is inside the camera detection box by passing the face BBox and camera BBox to box_is_in_box(). Because get_objects() returns a list of detected objects (faces) and this code assumes there is only one face, we get the first item in the list with faces[0] (zero is the index position of the first list item):

for frame in vision.get_frames():
    faces = detector.get_objects(frame, threshold=0.1)
    vision.draw_objects(frame, faces)
    vision.draw_rect(frame, camera_bbox)
    if faces and box_is_in_box(faces[0].bbox, camera_bbox):
        print("Face is inside the box!")
Hint: Later on, if you want to make this work with multiple faces, you should create a loop that calls box_is_in_box() for every item in the faces list (instead of calling it just once with faces[0]).
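
One possible shape for that loop is sketched here. It is not the finished project code: all_faces_in_box() is a name invented for this example, and BBox and Object are namedtuple stand-ins so the snippet runs without the camera.

```python
from collections import namedtuple

# Stand-ins with the same field names as the PyCoral types.
BBox = namedtuple("BBox", ["xmin", "ymin", "xmax", "ymax"])
Object = namedtuple("Object", ["id", "score", "bbox"])


def box_is_in_box(bbox_a, bbox_b):
    """True when bbox_a lies strictly inside bbox_b."""
    return (bbox_a.xmin > bbox_b.xmin and bbox_a.xmax < bbox_b.xmax and
            bbox_a.ymin > bbox_b.ymin and bbox_a.ymax < bbox_b.ymax)


def all_faces_in_box(faces, zone):
    """True only if at least one face was found and every face is inside zone."""
    return bool(faces) and all(box_is_in_box(face.bbox, zone) for face in faces)


zone = BBox(160, 96, 480, 384)  # assumed zone for a 640x480 frame
inside = Object(0, 0.9, BBox(200, 120, 300, 260))
outside = Object(0, 0.8, BBox(10, 10, 120, 150))

print(all_faces_in_box([inside], zone))           # True
print(all_faces_in_box([inside, outside], zone))  # False
print(all_faces_in_box([], zone))                 # False
```

The bool(faces) check matters because all() returns True for an empty list, which would otherwise trigger a photo when nobody is in view.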

4. Take a picture

Instead of giving you all the code for this project, we'll leave this part up to you.

But our API does provide another function that will help: vision.save_frame(). Just call this function anytime during the for loop and pass it a filename and the current frame to save it as a photo. For example:

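# Assumes "import os" and "from datetime import datetime" at the top of the file,
# and a PICTURE_DIR variable that holds the folder where photos are saved.
timestamp = datetime.now()
filename = "SMART_CAM_" + timestamp.strftime("%Y-%m-%d_%H%M%S") + '.jpg'
filename = os.path.join(PICTURE_DIR, filename)
vision.save_frame(filename, frame)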

But there's still more code you need to write to make this work properly, such as:

  • Count how many faces are detected and make sure each one is inside the box.
  • Do you want the saved photo to include the bounding boxes? Maybe don't draw them when you save a photo.
  • Even when all faces are inside the box, do you want to take a photo every single frame (several times in a second)? Maybe you should add a time delay between each photo.

We encourage you to try solving these challenges on your own and expand the project with your own ideas. But if you get stuck, you can peek at the rest of our finished code in ~/aiy-maker-kit/projects/
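
As a nudge for the time-delay challenge, here is one way you might track the time since the last photo. This is only a sketch and not our finished code: the Cooldown class is invented for this example, and the 3-second interval is an arbitrary choice.

```python
import time


class Cooldown:
    """Reports ready at most once per `seconds` interval."""

    def __init__(self, seconds, clock=time.monotonic):
        self.seconds = seconds
        self._clock = clock          # injectable so the timer is easy to test
        self._last_trigger = None    # None means it has never fired

    def ready(self):
        """Return True (and restart the timer) if enough time has passed."""
        now = self._clock()
        if self._last_trigger is None or now - self._last_trigger >= self.seconds:
            self._last_trigger = now
            return True
        return False


photo_cooldown = Cooldown(seconds=3)
print(photo_cooldown.ready())  # True: nothing has fired yet
print(photo_cooldown.ready())  # False: called again right away
```

Inside the for loop, you could check photo_cooldown.ready() only after confirming that the faces are inside the box, and save a frame only when it returns True.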

Build a person-detecting security camera


  • Create an app that detects when a person enters a forbidden area.

What you'll learn:

  • How to read a labels file and filter results so you respond to only certain objects.
  • How to detect a specific amount of overlap between bounding boxes.

This project has some similarity with the smart camera because you'll inspect the location of an object. But this project expands on the concepts you already learned because it teaches you how to handle a model that can recognize lots of different things (not just faces) and you'll perform more sophisticated bounding-box comparisons.

First, try the object detection demo that we'll use as our starting point:


See if it can correctly label your keyboard, cell phone, or scissors.

Remember that each ML model can recognize only what it was trained to recognize. Whereas the previous face detection model could recognize only faces, this model can recognize about 90 different things, such as a keyboard, an apple, or an airplane. To see what objects it was trained to recognize, open the coco_labels.txt file in aiy-maker-kit/examples/models/.

Play with this for a while by showing it different objects that are listed in the labels file (try searching the web for images on a phone or computer, and hold that up to the camera). You'll probably notice the model is better at recognizing some things than others. This variation in accuracy is due to the quality of the model training. But one thing the model seems to detect consistently well is people.

So let's build a program that responds when a person appears in the image and detects when they only partially enter an area.

To get started, make a copy of and open it:

thonny &>/dev/null &

1. Define a fence region

Just like the smart camera above, your security camera needs a rectangular region that triggers an action when a person enters it. This region is called a fence because it will behave like a virtual fence that detects intruders.

Here's how to define a new BBox for your fence as a proportion of the camera's image size:

# Define the protected fence region
width, height = vision.VIDEO_SIZE
xmin = 0
ymin = 0
xmax = int(width * 0.5)
ymax = int(height * 0.5)
fence_box = BBox(xmin, ymin, xmax, ymax)

Looking at this code, you should be able to figure out that fence_box is a bounding-box that occupies the top-left quadrant of the image: the top-left corner of the box is the top-left corner of the image, and bottom-right corner is at the middle of the image. If this isn't clear, take another look at the section above about BBox coordinates.

Now you can draw the fence inside the for loop:

for frame in vision.get_frames():
    vision.draw_rect(frame, fence_box)

2. Filter the detection results

If you recall from the face detector, when you printed the results from get_objects(), it always printed 0 for the object's id:

[Object(id=0, score=0.79140625, bbox=BBox(xmin=227, ymin=84, xmax=347, ymax=227))]

The id didn't matter for face detection, but now it does matter because it tells us exactly what kind of object the model thinks it found. But the id is just a number, so we need to translate it into a text label. This is easy to do by reading the labels file that belongs to the model: each label is on its own line, and the id number corresponds to the line number (starting at zero for the first line, just like the index position in an array).

But writing code to parse a text file is no fun, so we'll use the read_label_file() function from PyCoral. This function takes the path to the labels file and returns a Python dictionary (a structure of key-value pairs). You can see this line of code in the file already, which then draws boxes and labels for all detected objects. But now we want to inspect the label of each detected object so we can respond only when a person is detected.

To understand what the model actually detected before we draw anything, we need to look at the object ids returned by get_objects() and compare them to the labels returned by read_label_file(). Because read_label_file() provides a dictionary that maps label ids to text labels, all we need to do is call Python's get() method on the dictionary. This method returns the value (the label string) that matches the given key (the label id).
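The lookup itself is ordinary dictionary access. The sketch below builds a tiny dictionary by hand instead of calling read_label_file() (the three entries and the label_for() helper are illustrative assumptions), and shows how get() behaves for a known id and for an unknown one.

```python
# Hypothetical id-to-label mapping, shaped like the dictionary that
# read_label_file() is described as returning.
labels = {0: 'person', 1: 'bicycle', 2: 'car'}


def label_for(object_id, labels):
    """Return the text label for an id, or '' if the id is unknown."""
    return labels.get(object_id, '')


print(label_for(0, labels))   # a known id
print(label_for(42, labels))  # an unknown id gives an empty string
```

Passing an empty string as the default means a later test such as 'person' in label never has to deal with None.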

Now, by iterating over all objects returned by get_objects(), we can figure out what each object is with a string comparison:

for frame in vision.get_frames():
    vision.draw_rect(frame, fence_box)
    objects = detector.get_objects(frame, threshold=0.4)
    for obj in objects:
        label = labels.get(obj.id, '')
        if 'person' in label:
            print('A person was detected!')

And you should delete the call to draw_objects() so the code ignores everything that is not a person.

3. Detect an intersection of boxes

Now you know if the camera sees a person, so let's figure out if they entered the security fence area.

The smart camera above only cared if a face was fully inside a rectangle, but for our security system, we want to know if a person even partially breaches the fence area. To do this, we'll use the BBox.intersect() method, which takes two BBox arguments, compares them, and returns a new BBox that represents the area where those two boxes intersect.

So after you call BBox.intersect(), you have 3 bounding-box objects:

  • One for the fence area
  • One for the person
  • One for the intersection of them both

How you now determine whether to trigger your security alarm depends on how sensitive you want your system to be. For a very sensitive system, you might want to trigger the alarm if there's any intersection at all between them. It also depends on how far away the camera is from the area in question (how big the fence is compared to people in the image).

For this example, let's allow a little forgiveness: We'll allow up to 30% of the person's body to enter the fence; any more and they're classified as an intruder (we merely change the color of their box from green to red, so this is where you can add your own alarm shenanigans).

We calculate the overlap by comparing the total area of the intersection box to the total area of the person box:

if 'person' in label:
    person_area = obj.bbox.area
    overlap_area = BBox.intersect(obj.bbox, fence_box).area
    if (overlap_area / person_area) > 0.3:
        # More than 30% of the person is inside the fence: intruder
        vision.draw_rect(frame, obj.bbox, color=RED)
    else:
        vision.draw_rect(frame, obj.bbox, color=GREEN)

Also be sure to import the BBox API at the top of the file, and define the color constants:

from pycoral.adapters.detect import BBox

RED = (0, 0, 255)
GREEN = (0, 255, 0)
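To try the overlap math without a camera, here is a minimal self-contained sketch. The Box namedtuple and the area(), intersect(), and is_intruder() helpers are stand-ins written for this example, not the PyCoral API, and the coordinates are made up.

```python
from collections import namedtuple

# Stand-in for the PyCoral BBox (assumption) so the sketch runs anywhere.
Box = namedtuple('Box', ['xmin', 'ymin', 'xmax', 'ymax'])


def area(box):
    """Area in pixels; boxes with no width or height count as zero."""
    return max(0, box.xmax - box.xmin) * max(0, box.ymax - box.ymin)


def intersect(a, b):
    """Box covering the region shared by a and b (may be empty)."""
    return Box(max(a.xmin, b.xmin), max(a.ymin, b.ymin),
               min(a.xmax, b.xmax), min(a.ymax, b.ymax))


def is_intruder(person, fence, max_overlap=0.3):
    """True if more than max_overlap of the person box is inside the fence."""
    person_area = area(person)
    if person_area == 0:
        return False
    return area(intersect(person, fence)) / person_area > max_overlap


fence = Box(0, 0, 320, 240)
print(is_intruder(Box(300, 100, 400, 200), fence))  # 20% inside
print(is_intruder(Box(250, 100, 350, 200), fence))  # 70% inside
print(is_intruder(Box(400, 300, 500, 400), fence))  # no overlap
```

Clamping the width and height at zero matters for the last case: when two boxes don't touch, the "intersection" has a negative width and a negative height, and multiplying those would give a misleading positive area.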

That's about it, but there's certainly more you can do to improve your system. Try it out and see if there are any situations where the system doesn't behave as expected. For example:

  • Adjust the fence area to match an area of the room you want to guard.
  • Adjust the overlap_area math to suit your camera's environment.
  • Add a comparison between the intersection area and the fence area to account for situations such as when the fence area is small enough to be occluded by the person even if they are not entirely in the fence area (such as when they are very close to the camera).
  • Add some speakers and play an obnoxious alarm noise!

To see our finished code, open the aiy-maker-kit/projects/ file.

Build a custom image classifier


  • Train an image classification model to recognize new objects (all on the Raspberry Pi).

What you'll learn:

  • How to create your own image dataset and labels file to train an ML model.
  • How to train an image classification model on the Raspberry Pi.

The object detection model used in the previous project was trained to recognize only a specific collection of objects (such as people, bicycles, and cars). You can see all the objects it's trained to recognize in the coco_labels.txt file (about 90 objects). If you want to recognize something that's not in this file, then the model won't work; you would have to train a new version of the model using sample images of the object you want it to recognize.

However, training an object detection model (as used for face and person detection) requires a lot of work to prepare the training data because the model not only identifies what something is, but where it is (that's how we draw a box around each person). So to train this sort of model, you need to provide not only lots of sample images, but you must also annotate every image with the coordinates of the objects you want the model to recognize.

Instead, it's much easier to train an image classification model, which simply detects the existence of certain objects, without providing the coordinates (you cannot draw a bounding box around the object). This type of model is useful for lots of situations because it's not always necessary to know the location of an object. For example, an image classification model is great if you want to build a sorting machine that identifies one item at a time as they pass the camera (perhaps you want to separate marshmallows from cereal).

There are several ways you can train an image classification model that works with the Coral USB Accelerator, and we've linked to some options below. However, our favorite (and the fastest) method is to perform transfer learning on the Raspberry Pi, using images collected with the Pi Camera. (Training an ML model ordinarily takes a long time and requires a huge dataset. However, transfer learning allows you to start with a model that's already trained for a related task and then perform further training to teach the model new classifications but with a much smaller training dataset.)

So that's what we're going to do in this project: collect some images of objects you want your model to recognize, and then train the model right on the Raspberry Pi, and immediately try out the new model.

1. Prepare your photoshoot

  1. Find a few objects you can place in front of your camera.

    Anything will do for your first attempt, because you'll soon understand how easy it is to retrain the model again and again. For example, grab a shoe, a bottle, a game controller, a figurine, or a potted plant. Try at least 3 objects, but no more than 9.

  2. Orient the Pi Camera so it's facing a mostly-empty background. The camera's field of view should not be filled with complicated shapes like a large bookshelf or a messy desk. You should also stay out of the pictures as much as possible.

  3. Make sure you have good lighting so the camera can clearly see the objects you put in front of it:

    • Good: A light source coming from behind the camera or from the sides.
    • Bad: A strong light source coming from behind the object, placing the object in shadow.

2. Collect the training images

To simplify this part, we've created a camera program that captures photos and organizes them into separate folders based on the number key you press to take the photo. For example, when you press 1, it takes a picture and saves it in a folder named "1", and when you press 2, the photo is saved in folder "2". This way, you'll have all the photos for each object in a separate folder. But actual names are certainly better than numbers for folder names, so we'll create a labels file that maps each number to an object label.
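The number-to-label mapping is simply the line order of the labels file: the first line belongs to key 0, the second to key 1, and so on. Here's a small sketch of that idea; the key_to_label() helper is hypothetical and is not taken from the camera program.

```python
def key_to_label(labels_text):
    """Map number keys 0-9 to labels, one label per line, in file order."""
    names = [line.strip() for line in labels_text.splitlines() if line.strip()]
    return dict(enumerate(names[:10]))


# Example labels file content: "Background" first, then one object per line.
mapping = key_to_label("Background\nShoe\nBottle\n")
for key, name in mapping.items():
    print(key, '->', name)
```

Only ten number keys exist, which is one reason to keep your first dataset to nine objects plus the background.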

  1. Create the labels file:

    1. On your Raspberry Pi, open the ~/aiy-maker-kit/examples/ folder in the File Manager, and then right-click on an empty space in the window and select New File.
    2. Name the file "labels.txt" and click OK.
    3. Double-click on the new file to open it in the text editor.
    4. On the first line, type "Background" (do not include quotation marks).
    5. Then for each item you'll use for training, enter its name on a new line (such as "Shoe" or "Bottle").
    6. When you're done, save the file and close the editor.
  2. Open the Terminal, navigate to the ~/aiy-maker-kit/examples/ directory, and start the camera program (it won't take any pictures until you press a number key):

    python3 --labels labels.txt

    You'll see the camera preview appear on the desktop, and the Terminal prints how each number key maps to each object from your labels file. Notice which number represents each object.

  3. Start by taking pictures of your mostly empty background: (None of your objects should be visible in the camera preview, and you should also not be visible.) Press 0 (the zero key) about 10 times to capture several pictures of your "Background."

  4. Now start with the object represented by number 1: (If you forgot which number is which object, press the H key and the terminal will print the label mapping again.) Place that object in front of the camera and press number 1 just once.

    Then reposition the same object so the camera can see it from a different angle (perhaps also move it closer or farther from the camera), then press 1 again. Now repeat this process several times: Move the object slightly and take another picture. Do this about 10 times.

  5. Then go on to object 2, and do the same thing, taking several photos with the object in slightly different positions and visible from different angles. But be sure you never move the camera!

  6. After you've captured several photos of each object, press Q to close the camera.

  7. Open the ~/aiy-maker-kit/examples/ folder in the File Manager, and open the new captures folder. Look inside each label directory to review your images and count how many images are in each folder. For the best training results, you should have the same number of images in each folder.
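If you'd rather not count files by hand, a few lines of Python can do it. This is a sketch that assumes the layout described above (one subfolder per label inside the captures folder, holding .png files); the count_images() and is_balanced() helpers are names made up for this example.

```python
from pathlib import Path


def count_images(capture_dir):
    """Return {folder name: number of .png files} for each subfolder."""
    root = Path(capture_dir)
    return {folder.name: len(list(folder.glob('*.png')))
            for folder in sorted(root.iterdir()) if folder.is_dir()}


def is_balanced(counts):
    """True when every folder holds the same number of images."""
    return len(set(counts.values())) <= 1


if __name__ == '__main__':
    captures = Path.home() / 'aiy-maker-kit' / 'examples' / 'captures'
    if captures.is_dir():
        counts = count_images(captures)
        print(counts, 'balanced' if is_balanced(counts) else 'unbalanced')
```

If the counts are uneven, the simplest fix is to take a few more photos of the objects that have fewer images.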

Note: You might notice that the photos created with are saved in PNG format, instead of JPG, which is often used for photos. This is because the JPG format is a compressed image format, meaning some image detail is lost in order to make the file size smaller, while PNG format is lossless and all image detail is retained. Although you might not be able to see the difference (which is why JPG is such a popular image format), our machine learning model can be sensitive to these compression details, so we want to train our model with the uncompressed data that's provided by a PNG file.

3. Retrain the classification model

Now you can train the model with just this command:

python3 -l labels.txt

By default, the script expects your training images to be in the captures directory. If you renamed that directory, you need to specify the name with an extra --capture_dir argument.

After a moment, the training script begins and it prints a line for every image in the training dataset. For example:

Class: Can (id=1)
  capture/Can/PI_CAM_20210827_013859660080.png => [-0.0546875  0.015625   0.0078125 ... -0.03125   -0.0234375  0.0390625]
  capture/Can/PI_CAM_20210827_013845829924.png => [-0.0390625  0.0078125  0.0078125 ... -0.0234375 -0.0078125  0.03125  ]

When this finishes, your retrained model is ready to perform image classifications, saved at ~/aiy-maker-kit/examples/my-model.tflite.

As for what's being printed here, you probably recognize the image file name, but the numbers on the right are more complicated and you can just ignore them. But if you're curious, the numbers are an abbreviated view of the "image embedding" (it's truncated by the ellipsis; the full length is 1,024 numbers): It's a type of tensor that provides valuable information about the image features. This tensor comes out of the base model (upon which we're performing transfer learning) so it's actually the data we start with to learn similarities between images within each of our classes (such as how all the "Can" images are similar), and learn the differences across classes… And because the base model has already done all the work to calculate these image features, the learning that remains to be done for our new images happens very fast.
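To get a feel for why learning from embeddings is so quick, here is a toy sketch. It is not the algorithm used by the training script (which isn't shown here); it swaps in a much simpler technique, nearest-centroid classification, and uses made-up 3-number embeddings in place of the real 1,024-number ones.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(embeddings_by_class):
    """'Training' here is just computing one centroid per class."""
    return {name: centroid(vecs) for name, vecs in embeddings_by_class.items()}


def classify(model, embedding):
    """Return the class whose centroid is closest to the embedding."""
    return min(model, key=lambda name: squared_distance(model[name], embedding))


# Tiny made-up embeddings; real ones are far longer.
model = train({
    'Background': [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1]],
    'Can':        [[0.9, 0.8, 0.1], [1.0, 0.9, 0.0]],
})
print(classify(model, [0.95, 0.7, 0.05]))
```

Because the base model has already turned each image into a compact list of numbers, the remaining work is cheap arithmetic on those lists rather than on raw pixels.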

4. Run the new model

Now you can run the model and evaluate how well it has learned to recognize the objects you provided for training. Just run this script that performs image classification and specify the new model and your labels file:

python3 -m my-model.tflite -l labels.txt

As long as the lighting conditions and camera perspective match the settings from when you captured your training images, it should correctly label each object on the screen, along with the confidence score.

Build a pose classifier


  • Train a pose classification model to recognize your body poses.

What you'll learn:

  • The difference between pose detection and pose classification.
  • How to collect a large image dataset.
  • How to train a TensorFlow Lite model using Google Colab.

If you finished the project above to train an image classification model then you've taken your first big step into the world of machine learning, because you actually performed the "learning" part of machine learning (usually called "training"). In this next project, we're going to continue with model training, but things are going to get more interesting (and a bit more complicated).

In this project, we're going to train a brand new model that can recognize a variety of different poses that you make with your body. The basis for our pose classification model is the output from a pose detection model (also called a pose estimation model), which is trained to identify 17 body landmarks (also called "keypoints") in an image, such as the wrists, elbows, and shoulders, as illustrated below.

Photo credit: imagenavi / Getty Images

This pose detection model is already trained and we won't change it, but it only identifies the location of specific points on the human body. It doesn't know anything about what sort of pose the person is actually making. So we're going to train a separate model that can actually label various poses (such as the name of yoga poses) based on the keypoints provided by the pose detection model.

To get started, try the pose detection model:


Orient your camera and step away so it can see all or most of your body. Once it's running, you should see a sort of stick figure drawn on your body.

To understand this a little better, open the file and update it so it prints the pose data returned by get_pose():

for frame in vision.get_frames():
    pose = pose_detector.get_pose(frame)
    vision.draw_pose(frame, pose)
    print(pose, '\n')

Now when you run the code, it prints a large array of numbers with each inference, which specifies the estimated location for 17 keypoints on the body. For each keypoint there are 3 numbers: The y-axis and x-axis coordinates for the keypoint, and the prediction score (the model's confidence regarding that keypoint location). The draw_pose() function simply uses that data to draw each point on the body (if the prediction score is considered good enough).
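To make those numbers easier to read, you can group them into named triples. The sketch below assumes the values arrive as one flat sequence in (y, x, score) order; the actual array shape returned by get_pose() may differ, and the Keypoint type, helper names, and 0.3 cutoff are assumptions made for this example.

```python
from collections import namedtuple

Keypoint = namedtuple('Keypoint', ['y', 'x', 'score'])


def to_keypoints(values):
    """Group a flat sequence of numbers into (y, x, score) keypoints."""
    if len(values) % 3 != 0:
        raise ValueError('expected 3 numbers per keypoint')
    return [Keypoint(*values[i:i + 3]) for i in range(0, len(values), 3)]


def confident(keypoints, min_score=0.3):
    """Keep only keypoints whose score reaches min_score (assumed cutoff)."""
    return [kp for kp in keypoints if kp.score >= min_score]


# Two made-up keypoints instead of the full 17, to keep the example short.
pose = [0.20, 0.50, 0.91, 0.45, 0.30, 0.12]
points = to_keypoints(pose)
print(points)
print(confident(points))
```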

This is very cool on its own, but we want to create a program that actually recognizes the name of each pose we make.

To do that, we'll train a separate model that learns to identify particular patterns in the keypoint data as distinct poses. For example, the model might learn that if both wrist keypoints are above the nose and elbow coordinates, that is a "Y" pose. So, this model will be trained, not with images from the camera, but with the keypoint data that's output from the pose detection model above.
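A trained model discovers patterns like this on its own, but it can help to see the "Y" pose example written out as a hand-made rule. In this sketch the keypoint index numbers follow the COCO keypoint order, which is an assumption you should check against your own detector's output; the is_y_pose() and make_pose() helpers and all coordinates are made up.

```python
# Index positions follow the COCO keypoint order (assumption; check the
# output of your own pose detector before relying on these numbers).
NOSE, LEFT_ELBOW, RIGHT_ELBOW, LEFT_WRIST, RIGHT_WRIST = 0, 7, 8, 9, 10


def is_y_pose(keypoints):
    """keypoints: list of (y, x, score) tuples; y grows toward the bottom
    of the image, so 'above' means a smaller y value."""
    limit = min(keypoints[NOSE][0],
                keypoints[LEFT_ELBOW][0], keypoints[RIGHT_ELBOW][0])
    return (keypoints[LEFT_WRIST][0] < limit and
            keypoints[RIGHT_WRIST][0] < limit)


def make_pose(wrist_y):
    """Build a made-up 17-keypoint pose with both wrists at wrist_y."""
    pose = [(0.5, 0.5, 0.9)] * 17
    pose[NOSE] = (0.30, 0.5, 0.9)
    pose[LEFT_ELBOW] = (0.25, 0.7, 0.9)
    pose[RIGHT_ELBOW] = (0.25, 0.3, 0.9)
    pose[LEFT_WRIST] = (wrist_y, 0.8, 0.9)
    pose[RIGHT_WRIST] = (wrist_y, 0.2, 0.9)
    return pose


print(is_y_pose(make_pose(0.10)))  # wrists above nose and elbows
print(is_y_pose(make_pose(0.60)))  # arms down
```

Writing a rule like this for every pose quickly gets tedious and brittle, which is why we let a model learn the patterns from examples instead.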

So let's start by collecting images of these poses…

1. Collect your training data

You can choose whatever poses you want your model to recognize, but just as an example, we'll train our model to recognize some aircraft marshalling signals. (These are the signals you might see from folks on the airport tarmac to communicate with pilots.)

Here are the four marshall poses we're going to teach our model:

Note: These poses are illustrated from the perspective of the observer. So, to indicate "left," you actually hold your right arm straight while your left is bent upward. (Also, these poses are simplified because real aircraft marshall signals involve movement, but our model can recognize only stationary poses.)

To train this pose classifier, we need a sample dataset much bigger than the one we used for the image classifier. So instead of just 10 images, let's capture 80 images for each pose. And rather than capturing one image at a time, we'll capture a few photos every second by adding the --continuous argument when we run the camera program.

And to be sure we don't get these images mixed up with the others from the previous project, we'll specify a new images directory name with the --capture_dir argument.
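If you're curious how a script can accept arguments like these, here is a minimal sketch using Python's argparse module. It is a hypothetical parser, not the camera program's own code: only the three flag names come from this guide, while the defaults, types, and help text are guesses.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the three flags used in this section."""
    parser = argparse.ArgumentParser(description='Capture training images')
    parser.add_argument('--labels', required=True,
                        help='text file with one label per line')
    parser.add_argument('--capture_dir', default='captures',
                        help='folder where images are saved')
    parser.add_argument('--continuous', type=int, default=0,
                        help='photos to take per key press (0 = one photo)')
    return parser


# Parse the same arguments used for the pose dataset.
args = build_parser().parse_args(
    ['--labels', 'pose_labels.txt', '--capture_dir', 'poses',
     '--continuous', '80'])
print(args.labels, args.capture_dir, args.continuous)
```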

So here are the steps to collect your pose dataset:

  1. Create the labels file, like you did for the image classifier, but name it pose_labels.txt:

    No pose

  2. Start the camera with the extra arguments:

    python3 --labels pose_labels.txt --capture_dir poses --continuous 80

  3. Position the camera so it can see your whole body (or at least everything above your thighs). You'll have three seconds to move away from your keyboard and get into position.

    Note: While the camera is capturing images, do not stand still. It's important that you move around the frame a little and also reposition your arms slightly while still holding roughly the same pose.

  4. Similar to the "Background" images you created before, start by taking pictures of yourself not posing: Press 0 (the zero key) and keep your arms down while the camera takes pictures of you.

    Because the camera automatically captures 80 photos, you need to press each number only once.

  5. Now collect pictures for each pose. (If you forgot which number represents each pose, press the H key and the terminal will print the label mapping again.)

  6. When you're all done with each pose, press Q to stop the camera, and then go back to the Terminal and compress all the new images into a ZIP file:

    zip -r poses

You need the images zipped up in order to upload them to the model training script in Google Colab...

Note: As mentioned above, the pose classifier will be trained using only the pose keypoint data, not these images. We're merely uploading the images to Google Colab so we can convert all the images into keypoint data using a highly accurate pose detection model.

2. Train the model in Google Colab

What is Google Colab?

It's a web-based development tool (built with Jupyter notebooks) that runs Python scripts on Google servers, providing you with a predictable development environment that might be more powerful than your local CPU (it's definitely more powerful than the Raspberry Pi). Google Colab is a great platform for sharing Python scripts that train machine learning models, with documentation included in the script file (called a "notebook"). You can edit the notebook code from your web browser, immediately run that code, and make a copy of the notebook to save your changes.

To learn more about Colab, watch this short video.

Note: This procedure requires a Google Account to log into Google Colab. If you don't have one, create one here.

  1. On your Raspberry Pi, open the Web Browser from the task bar.

  2. In the browser address bar, enter This redirects you to our model training notebook on

    If this is your first time using Google Colab, take a moment to look around the interface, and watch this short video to learn more about how it works, but we'll walk you through each step required for this project...

  3. Click Sign in at the top-right corner of the web app. It will then ask if you want to leave the site, so click Leave and it will take you to the login screen.

  4. Log in using a Google account and then it will take you back to the Colab notebook.

  5. Below the Sign in button, click Connect. A dialog might appear asking if it's okay for Colab to show notifications, such as to notify you when the script completes—you can either allow or deny this permission.

  6. In the Table of contents panel on the left side, click Part 1: Preprocess the input images to jump to that section.

  7. There are a few drop-down options in this section. Click the drop-down for use_custom_dataset and set it to True. The other options should be set to False.

  8. Click the Files tab on the left (folder icon) and then click Upload to session storage (file icon).

  9. Select the file you created in the previous section, and wait for the upload to complete. You'll see a circular progress indicator at the bottom of the left panel.

  10. Scroll past the notebook's instructions to find a code block that begins with if use_custom_dataset. Here, you need to edit a couple of lines to specify the name of your ZIP file and images folder name:

    1. Change to

    2. Change YOUR_DATASET_DIR_NAME to poses.

  11. In the Google Colab toolbar at the top, click Runtime > Run all.

The notebook is now running on Google servers to train a new TensorFlow model using your pose images!

This should take only about 5 minutes to complete. Just make sure not to close the web browser where Colab is running.

When the notebook finishes training the model, it automatically downloads the TensorFlow Lite model and the associated labels file in a ZIP file.

3. Run the pose classification demo

Now that you've trained a new pose classification model, we just need to move the model and labels files into our project directory, and then we can run the pose classification demo that will print the labeled poses onto the screen:

  1. In the same Terminal window where you zipped up the pose folder, unzip the file downloaded from Google Colab and save its contents to the ~/aiy-maker-kit/examples/models directory:

    unzip ~/Downloads/ -d models

  2. Now run the pose classifier script, which is already coded to look for the pose_classifer.tflite and pose_labels.txt files in the models directory:


  3. Now strike one of the poses to signal forward, left, right, or stop, and you should see the pose label appear on the video display.

Note: Be sure your camera is oriented in the same position and you stand in the same location that you used when you captured your training images.

Build a voice-activated camera


  • Create an app that takes a picture in response to a voice command of your choice.

What you'll learn:

  • How to train a speech recognition model with Teachable Machine.
  • How to take a picture using the picamera API.

In this project, we'll train an audio classification model to recognize new speech commands. There are lots of things you can build that respond to voice commands, but we'll start with a voice-activated camera.

Before we begin, let's quickly discuss what goes into and comes out of the audio model. For comparison, the input for the previous image ML models is simple: they take an image. Even when using an object detection model with video, the model takes just one image at a time and performs object detection, once for every frame of video. But audio is a time-based sequence of sounds, so you might be wondering how we put that kind of data into the model.

First of all, how an audio model receives input and provides results depends on the type of task you want to perform. For example, a model that performs voice transcription (speech to text) is very different from a model that recognizes specific words or phrases (such as "Hey Google"). Our model is the second type; it will recognize only specific words or phrases. In fact, it can recognize only speech commands or sounds that fit within one second of time. That's because the input for our model is a short recording: the program listens to the microphone and continuously sends the model one-second clips of audio.

What happens next inside our speech recognizer model is similar to what happens in one of the image classification models, because the one-second audio clip is first converted into an image. Using a process called the fast Fourier transform, the model creates a spectrogram image, which provides a detailed image representation of the audio signal in time. Though you don't need to understand how this works, it's useful to simply know that the audio model actually learns to recognize different sounds by recognizing differences in the spectrogram image. Then the model outputs a classification (such as "on" or "off") with a confidence score (just like an image classifier or object detector).
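If you'd like a peek at how sound becomes an image, here is a toy sketch. To keep it short and dependency-free, it uses a plain (slow) discrete Fourier transform rather than the fast algorithm, which produces the same values for these inputs. The frame size, the made-up tones, and the helper names are all assumptions, and the real model's preprocessing is certainly more elaborate.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each frequency bin up to half the frame length."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]


def spectrogram(samples, frame_size):
    """One row of frequency magnitudes per non-overlapping frame."""
    return [dft_magnitudes(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, frame_size)]


# Made-up signal: a low tone followed by a higher tone, 32 samples each.
low = [math.sin(2 * math.pi * 2 * t / 32) for t in range(32)]
high = [math.sin(2 * math.pi * 6 * t / 32) for t in range(32)]
rows = spectrogram(low + high, frame_size=32)

# The strongest frequency bin in each row shows which tone was playing.
peaks = [row.index(max(row)) for row in rows]
print(peaks)
```

Each row is one slice of time and each column is one frequency, so drawing the magnitudes as pixel brightness gives the kind of spectrogram image the model learns from.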

So, to create our own speech recognizer, we need to record a collection of one-second samples for each word or phrase we want to teach the model. When the training is done, our program will repeatedly feed one-second sound recordings to the model and the model will do its best job to predict whether we said one of the learned commands.

1. Test the microphone

This is the first time we're using the microphone, so let's make sure it's working well.

  1. Make sure the microphone is completely plugged into a USB port on your Raspberry Pi.

  2. Run this command in your terminal (that's a lowercase "L", not a one):

    arecord -l

    You should see your USB device listed like this:

    **** List of CAPTURE Hardware Devices ****
    card 2: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
      Subdevices: 1/1
      Subdevice #0: subdevice #0

  3. Now let's record a few seconds of audio. As soon as you run the following script, watch the terminal output and say something into the microphone when it prints "Now recording."

    cd ~/aiy-maker-kit

    python3 scripts/

  4. To play your recording, you need to plug in headphones or speakers to the headphone jack on the Raspberry Pi. Then double-click the test.wav file saved in the /home/pi/aiy-maker-kit/ folder. To get there, you can open the current path in the File Manager with this command:

    xdg-open .

  5. If the recording is too quiet, you can increase the microphone sensitivity as follows:

    1. Open AlsaMixer in the terminal with this command:


    2. When AlsaMixer appears, press F4 on your keyboard to switch to the capture levels.

    3. Press the up arrow on your keyboard to increase the level.

      You can increase it as high as you want (up to 100, as shown in the screenshot below), but this also increases any white noise produced by the microphone (which is common with low quality microphones). The extra noise should not be a problem for our machine learning project, though, because as long as the noise is present in the training samples, the model will learn to ignore it (or consider it part of the "background" noise).

    4. Now run the test script again to check the results.

      python3 scripts/

2. Train a new model

Training a custom model has never been easier than this: We're going to use a web app called Teachable Machine to record samples for training, and then train the model with the click of a button.

To get the best results from your trained model, it's important that you capture training samples that are very similar to (if not identical to) the type of data you will feed the model when you perform predictions. This means, for example, you should record the training samples in the same environment and with the same microphone you will use when running the model. Thus, we recommend you use your Raspberry Pi with the connected microphone to perform the following sample collection.

Open Teachable Machine

On your Raspberry Pi, open the web browser and go to

Record background noise

  1. In the Background Noise panel, click Mic.
  2. Click Record 20 Seconds and then don't speak while it records.
  3. When it's finished recording, click Extract Sample.

It will convert the 20-second recording into 20 one-second samples.

Record the first speech command

  1. In the second panel, click Class 2 and rename it to the first voice command, "Cheese".
  2. Click Mic to open the recorder. This time, it will record just 2 seconds at a time, and then split that into 1-second samples. So for each recording, you need to say "cheese" twice.
  3. Get ready... As soon as you click Record 2 seconds, you need to say "cheese" once and then say it again right after the cursor passes the middle marker. Give it a try.

The recorded sample shows the spectrogram of your recording, so if you timed it well, you should see the bright color when you said "cheese" on the left and right side. If either "cheese" recording overlaps the white middle line, click Record 2 Seconds to try again.

  4. If the "cheese" samples look good, click Extract Sample and the recording will be split into separate samples on the right side.
  5. Repeat steps 3 and 4 nine times so that you get a total of 20 audio samples on the right.

The more samples you record, the more accurate your model becomes. For this project, 20 samples should work fine.

Tip: Each time you record yourself, adjust your pronunciation slightly (for example, say it faster/slower and with a higher/lower pitch). This will help the model learn to recognize the same word even when said in a different manner. You can make your model even better if you get samples from several different people.

Record another speech command

You can add as many other speech commands as you want. For this project, let's do one more. The "Cheese" command will take a photo and then keep listening for more commands, so let's add the ability to quit the program when we say "Stop."

  1. Below the Cheese panel, click Add a class.
  2. Click Class 3 and rename it "Stop."
  3. Repeat the steps above to record samples. As before, perform 10 recordings (each time, say "Stop" twice) so you get a total of 20 audio samples.

Train and review the model

  1. Click Train Model.

In less than a minute, the model will be ready.

  2. Teachable Machine then starts running the model by listening to your microphone and passing one-second clips to the model.

  3. Say "Cheese" and "Stop" or whatever else you recorded to see how well the model recognizes each command.

If you think the model could be better, you can record more samples and/or add more classes, and then click Train Model again.

Note: The model always performs inference using one second of audio, but the preview includes an option to adjust the overlap factor, which is set to 0.5 (half a second) by default. This means, instead of recording one second at a time and sending data to the model just once per second, it records one-second samples that overlap each other by a half second, thus sending data to the model every half second.
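The effect of the overlap factor is easy to see with a little arithmetic. This sketch lists the start time of each one-second clip cut from a short recording; the clip_starts() helper is made up for illustration and is not part of Teachable Machine.

```python
def clip_starts(total_seconds, clip_seconds=1.0, overlap=0.5):
    """Start times of the clips cut from a recording.

    overlap is the fraction of each clip shared with the next one,
    so the step between clips is clip_seconds * (1 - overlap).
    """
    if not 0 <= overlap < 1:
        raise ValueError('overlap must be at least 0 and less than 1')
    step = clip_seconds * (1 - overlap)
    starts = []
    start = 0.0
    while start + clip_seconds <= total_seconds + 1e-9:
        starts.append(round(start, 6))
        start += step
    return starts


print(clip_starts(3, overlap=0.5))  # a new clip every half second
print(clip_starts(3, overlap=0.0))  # a new clip every second
```

With more overlap, a spoken word is less likely to be cut in half at a clip boundary, at the cost of running the model more often.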

Download the TF Lite model

  1. If you're satisfied with the model performance, click Export Model.
  2. In the dialog that appears, click the Tensorflow Lite tab and then click Download my model.

The model should appear on your Raspberry Pi at ~/Downloads/

Back up your dataset

Collecting audio samples is tedious work, so you should keep the recordings you created, in case you want to retrain your model later. (If you close the web browser now, you will lose all those samples!)

  1. For each class, click the three-dot button in the top-right corner of the class panel and click Download Samples or Save Samples to Drive.

  2. If you later decide to retrain the model, such as to record a larger dataset or add more commands to the model, you can easily upload these recordings by clicking the Upload button (next to the Mic button).

3. Run the model

Note: Unlike the vision models used above, the speech recognition model does not benefit from acceleration on the Coral USB Accelerator, because the speech model is much smaller and the Raspberry Pi is powerful enough to handle it. So for your own projects that use only a speech recognition model, you can disconnect the Coral USB Accelerator because it will not be used.
  1. Open a terminal on the Raspberry Pi and unzip the package from Teachable Machine like this:

    unzip ~/Downloads/

    This will extract two files: soundclassifier_with_metadata.tflite and labels.txt.

  2. Run our example for audio classification, passing it the path for the .tflite file:

    python3 ~/Downloads/soundclassifier_with_metadata.tflite

    The program will print various info to the terminal, but then you should begin to see classification results and scores.

  3. Try the voice commands! You should see results like this:

    CALLBACK:  Background Noise => 0.81608325
    CALLBACK:  Cheese => 0.9744848
    CALLBACK:  Background Noise => 0.672197
    CALLBACK:  Background Noise => 0.9117817
    CALLBACK:  Background Noise => 0.91303355
    CALLBACK:  Stop => 0.6549769

Woohoo! Now let's write some code to respond to these voice commands.

To quit the demo, press CTRL + C.

Note: We don't need to specify the labels file when we run this model, because this .tflite file was built with metadata that includes all the labels. So the API reads the labels right from the .tflite file.

4. Take a picture on command

Now let's take a photo in response to the "cheese" voice command.

  1. Make a copy of with a new name and open it in Thonny:


    thonny &>/dev/null &

  2. Take a look at the current code: It defines one function called handle_results() and the main script calls classify_audio(), passing it our trained model and the handle_results() function as a callback.

    The classify_audio() function activates the default microphone, initializes the model, and starts passing one-second audio samples to the model. Then, when classify_audio() receives results from the model, it passes them back to your code in the handle_results() function. That's why it's called a "callback": The classify_audio() function calls back to our program through the handle_results() function.

    Also notice that classify_audio() is a blocking function, which means it blocks all code that follows it from executing, until it's finished. And the only way it will finish is if you return False from your callback function. Currently, the callback function always returns True so classify_audio() never finishes and you must force-quit the program.
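If the callback pattern is new to you, the following toy sketch may help. It is not the aiymakerkit implementation: classify_stream() is a hypothetical stand-in that loops over a fixed list of fake results instead of a microphone, but it shows how a blocking function keeps running until the callback returns False.

```python
def classify_stream(results, callback):
    """Toy stand-in for a blocking classifier loop.

    `results` is any iterable of (label, score) pairs. The loop keeps
    running until the callback returns False or the results run out.
    """
    for label, score in results:
        keep_going = callback(label, score)
        if not keep_going:
            break

seen = []

def handle_results(label, score):
    seen.append(label)
    return label != 'Stop'  # returning False ends the loop

fake_results = [('Background Noise', 0.8), ('Cheese', 0.97),
                ('Stop', 0.65), ('Cheese', 0.9)]
classify_stream(fake_results, handle_results)
print(seen)  # the final 'Cheese' is never delivered
```

Any code placed after the call to classify_stream() runs only once the callback has returned False, which is the same behavior described for classify_audio().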

  3. To create a voice-activated camera, let's start by writing the code that will capture photos.

    In our previous vision ML projects, we used a loop to continuously pass camera frames to our vision model. But we don't have that kind of loop here, so we need a different approach to capture photos. One of the most popular APIs for this task on Raspberry Pi is the picamera API, so we'll use that.

    Let's start by importing some new modules at the top. We'll use PiCamera to capture photos, datetime to add a timestamp to the filename, and os to set the path where we save the file:

from picamera import PiCamera
from datetime import datetime
import os
  4. Next, we need to specify the location to save photos and initialize PiCamera. This code should go below the import statements and before handle_results():
PICTURE_DIR = os.path.join(os.path.expanduser('~'), 'Pictures')
IMAGE_SIZE = (640, 480)
camera = PiCamera(resolution=IMAGE_SIZE)
  5. Now we've initialized the camera, but it's important that we also terminate the camera object when we're done with it, using PiCamera.close(). And, because the call to classify_audio() is the heart of our program, we can do that by wrapping it in a try-finally block like this:
try:
    audio.classify_audio(model_file=args.model, callback=handle_results)
finally:
    camera.close()

This ensures that, no matter what happens during execution of classify_audio(), it will always be followed by the code in the finally block, which closes the camera.
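To see that guarantee in action without any hardware, here is a small sketch. FakeCamera, run(), and failing_work() are hypothetical names made up for this illustration; the point is only that the finally block runs even when the work raises an error.

```python
class FakeCamera:
    """Hypothetical stand-in for a camera object with a close() method."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def run(camera, work):
    try:
        work()            # may return normally or raise an error
    finally:
        camera.close()    # runs in both cases

def failing_work():
    raise RuntimeError('microphone unplugged')

camera = FakeCamera()
try:
    run(camera, failing_work)
except RuntimeError as error:
    print('Error:', error)
print('Camera closed:', camera.closed)
```

The error still reaches the caller, but only after the camera has been closed.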

  6. Now the camera is ready to be used anywhere in our program. So let's add a function that captures and saves a photo. This needs to appear after the camera object is initialized and before handle_results():
def capture_photo():
    timestamp = datetime.now()
    filename = "VOICE_CAM_" + timestamp.strftime("%Y-%m-%d_%H%M%S") + '.png'
    filename = os.path.join(PICTURE_DIR, filename)
    camera.capture(filename)
    print('Saved', filename)
  7. All that's left to do is rewrite handle_results() to respond when the returned label is either "Cheese" or "Stop".

    If the returned label is "Stop," we return False to quit the program. If the returned label is "Cheese," we take a photo and return True. We also need to be sure we return True when neither is detected, so the model keeps running:

def handle_results(label, score):
    if label == 'Stop':
        return False
    elif label == 'Cheese':
        capture_photo()
    return True

That's it! Now save your changes, run the program with your trained model, and say "cheese"! :)

python3 --model_file ~/Downloads/soundclassifier_with_metadata.tflite
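One optional refinement: the sample output earlier shows that a command can sometimes be reported with a fairly low score (for example, "Stop" at about 0.65). The sketch below ignores results under a minimum score before acting on them. MIN_SCORE and its value are assumptions you would tune for your own model, and capture_photo() is replaced with a placeholder so the sketch runs without a camera.

```python
MIN_SCORE = 0.8  # hypothetical cutoff; tune it for your own model

photos_taken = []

def capture_photo():
    # Placeholder so this sketch runs without a camera.
    photos_taken.append('photo')

def handle_results(label, score):
    if score < MIN_SCORE:
        return True  # not confident enough; keep listening
    if label == 'Stop':
        return False
    if label == 'Cheese':
        capture_photo()
    return True

print(handle_results('Stop', 0.65))    # True (ignored, keeps listening)
print(handle_results('Cheese', 0.97))  # True (photo taken)
print(handle_results('Stop', 0.93))    # False (quits)
```

A higher cutoff reduces accidental photos and accidental exits, at the cost of sometimes needing to repeat a command.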

Advanced projects

Now that you've learned a bit about what's possible using machine learning, what else can you create?

Discover more APIs

Before you start a new project, it's a good idea to get familiar with the available tools. The project tutorials above introduce several important features from the aiymakerkit API, but there are still more APIs available.

To get started, read through the aiymakerkit API reference (it's not very long). We didn't use all these APIs in the projects above, so you might discover some APIs that will help you with your next project. Plus, some of the functions we already used support more parameters that we didn't use. For example, the draw_objects() function allows you to specify the line color and line thickness for the bounding boxes.

Explore the source code

This is optional. There's a lot more you can learn by looking at the aiymakerkit source code, but if you're not in the mood to read lots of code, you can skip this section.

On your Raspberry Pi, you might have noticed there's a lot more in the aiy-maker-kit/ directory than just example code. This directory also includes the source code for the aiymakerkit Python module (which is what we import at the top of all the examples).

So if you're curious how these APIs work (are you curious how draw_pose() actually draws the keypoints on your body?), just open the aiy-maker-kit/aiymakerkit/ directory and view the files inside. You'll notice that the aiymakerkit API depends on a variety of other APIs, such as PyCoral to perform ML inferencing, OpenCV for video capture and drawing, and PyAudio for audio capture.

These other API libraries offer a wealth of capabilities not provided by aiymakerkit, so you might want to explore those libraries as well to find other APIs you can use in your project.

And, if you see something you'd like to change or add to the aiymakerkit API, just edit the code right there. To then use the changes in your project, just reinstall the aiymakerkit package:

python3 -m pip install -e ~/aiy-maker-kit

The -e option makes the installation "editable." That means if you continue to change the aiymakerkit source code, any projects that import these modules automatically receive those changes, so you won't need to reinstall aiymakerkit again.

Finally, if you want to explore the aiymakerkit code online, you can see it all on GitHub.

Vision projects

The number of projects you can build using vision intelligence (such as image classification, object detection, and pose detection) is endless. So here are just a few ideas to get you going:

Smart bird watcher

The idea is to build a smart camera that keeps records about all the birds that visit you. This is a popular project for bird enthusiasts, especially because there are already great bird recognition models available for you to use.

You can quickly get started by downloading a MobileNet bird classification model and just passing it to the example:

cd ~/aiy-maker-kit/examples
curl -OL
python3 -m mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite
Tip: To test the model, try holding up photos of birds in front of the camera (use a computer or smartphone screen).

To start building the bird watcher, you can simply copy the example and then start adding code to keep records of the birds.

For example, you might take a picture each time a new bird arrives and record the bird name (the label returned by labels.get(classes[0].id)) and a timestamp in the image file name:

timestamp = datetime.now()
filename = "BIRD_CAM_" + label + "_" \
           + timestamp.strftime("%Y-%m-%d_%H%M%S") + '.jpg'
filename = os.path.join(PICTURE_DIR, filename)
vision.save_frame(filename, frame)
Be careful! The example classifies images very fast, so if you merely add this code to save an image, you might save hundreds of images of the same bird. Plus, saving each file slows down your program thread. So be sure you add some logic that will avoid lots of duplicate bird entries, such as a time delay between saving images and a comparison to the most recently detected bird.
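As one possible starting point for that logic, the sketch below saves only when the bird differs from the most recent one or when enough time has passed. The SaveGate class, its method names, and the 10-second interval are all hypothetical choices for illustration, not part of aiymakerkit.

```python
import time

class SaveGate:
    """Decides whether a new detection is worth saving."""

    def __init__(self, min_interval=10.0):
        self.min_interval = min_interval  # seconds between saves of one label
        self.last_label = None
        self.last_time = None

    def should_save(self, label, now=None):
        now = time.monotonic() if now is None else now
        is_new_bird = label != self.last_label
        waited = (self.last_time is None
                  or now - self.last_time >= self.min_interval)
        if is_new_bird or waited:
            self.last_label = label
            self.last_time = now
            return True
        return False

gate = SaveGate(min_interval=10.0)
print(gate.should_save('robin', now=0.0))      # True: first sighting
print(gate.should_save('robin', now=2.0))      # False: same bird, too soon
print(gate.should_save('blue jay', now=3.0))   # True: different bird
print(gate.should_save('blue jay', now=14.0))  # True: enough time has passed
```

In your own loop you would call gate.should_save(label) before saving the frame. Note that this simple version still saves repeatedly if the classifier flickers between two labels, so you may want to combine it with a minimum score.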

Then place the camera somewhere it can see birds close up, such as facing out a window where there's a bird feeder. If you want to put your Maker Kit outdoors, then you'll need to put it in a water-proof enclosure.

Advanced photo booth

We already built a face-detecting smart camera, but you can still improve this project in several ways. For example:

  • Use a different camera. This is a simple way to improve the image quality. The aiymakerkit API for image capture (vision.get_frames()) works with any camera attached using the flat flex cable (such as the Raspberry Pi High Quality Camera) or with a USB camera. If using a USB camera, just make sure to disconnect the Pi Camera so the USB camera becomes the default camera; otherwise, you'll need to specify capture_device_index when you call get_frames().

  • Use two cameras. You can use the primary camera (Pi Camera or USB camera) as the input for your ML model, but then capture the photos with a digital camera connected via USB, which you can control using the gphoto2 library. This allows you to use a much more sophisticated camera like a DSLR to save photos, but you still need the smaller camera to run your ML model.

  • Add speech recognition. Perhaps capture a photo only when your face(s) is visible and you say a magic word.

  • Use pose classification. Maybe you have to strike one of several poses to take a photo. For the best performance, train your pose classification model with samples you collected in the same environment where you'll use the photo booth.

When setting up the booth, point the camera(s) toward a backdrop or a blank wall and make sure there's plenty of light.

Pose memory game

The pose classification project shows how you can build a model that recognizes your body poses. So maybe you can turn this into a memory game in which you must repeat specific poses as instructed by the game.

This idea is just like the electronic Simon game, except instead of pressing colored buttons, you must strike specific poses. So the game starts by telling you to strike one pose. If you get it right, it repeats the same pose instruction and adds another. You must then strike the first pose, followed by the second pose. The game adds another pose with each round and continues until you get a pose wrong.


  • Start by building the part of the game that randomly selects one of the known poses and prints the name in the terminal. Then write the code to "accept" the player's pose and compare it to the randomly selected pose.

  • Beware that the pose classification model delivers pose predictions very fast, so your game should not immediately "accept" the model results. You should be sure that the player holds the pose for a moment. So wait until the model consistently delivers the same prediction over and over for a short period of time, perhaps 2 seconds. (Allow some leeway so the prediction is allowed to vary a little within that period.)

  • Each time the game selects a new random pose, add the pose name to a Python list. After the game prints all poses to the terminal (by reading from the list), compare the player poses to each item in the list. Repeat until the player gets a pose wrong.

Just to get started, here's some example code that prints a pose name from a list, waits a moment, and then deletes the line before printing the next pose (so the player must remember the pose):

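import time

poses = ['left', 'right', 'forward', 'stop']

for pose in poses:
    # Show the pose name, and flush so it appears before we pause
    print(' -> ', pose, end='\r', flush=True)
    time.sleep(1)  # Give the player a moment to read it
    # Overwrite the pose name with spaces to hide it again
    print(' -> ', ' ' * len(pose), end='\r', flush=True)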
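For the tip about making sure the player holds a pose, one possible approach is sketched below. PoseHold is a hypothetical helper (not part of aiymakerkit): it accepts a pose only after the same label has been the running candidate for hold_seconds, and it tolerates up to max_misses consecutive different predictions as leeway.

```python
class PoseHold:
    """Accepts a pose only after it has been held for a while."""

    def __init__(self, hold_seconds=2.0, max_misses=3):
        self.hold_seconds = hold_seconds
        self.max_misses = max_misses  # leeway for brief misreads
        self.reset()

    def reset(self):
        self.candidate = None
        self.start = None
        self.misses = 0

    def update(self, label, now):
        """Feed one prediction; return the held pose name, or None."""
        if label == self.candidate:
            self.misses = 0
        elif self.candidate is not None and self.misses < self.max_misses:
            self.misses += 1  # tolerate a brief misread
        else:
            # A different pose took over, so restart the timer.
            self.candidate, self.start, self.misses = label, now, 0
        if now - self.start >= self.hold_seconds:
            return self.candidate
        return None

hold = PoseHold(hold_seconds=2.0, max_misses=1)
labels = ['left'] * 4 + ['right'] + ['left'] * 4  # one brief misread
accepted = None
for step, label in enumerate(labels):
    accepted = hold.update(label, now=step * 0.25)
    if accepted:
        break
print(accepted)  # left
```

In a real game you would pass time.monotonic() as the timestamp and call reset() after each accepted pose, before waiting for the next one in the sequence.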

Audio projects

So far, we've created just one project using audio from a microphone, but there are so many things you can create with a simple speech or sound classification model.

Here are some other project ideas:

Voice-activated lights

The code we used to create the voice-activated camera can be easily repurposed for a variety of projects that basically push a button or flip a switch for you in response to a voice command. The opportunities here are limitless, but they often require electronics hooked up to the Raspberry Pi.

To do something like turn lights on and off, you need to wire your lighting components to the Raspberry Pi's GPIO pins (the two rows of gold pins standing up on top of the Raspberry Pi). If you've never used the GPIO pins before, check out this introduction to GPIOs.

And when it comes to lighting, there are varying degrees of electrical complexity which can be dangerous when working with lights that require an additional power source. So if you're new to this sort of thing, the safest place to start is by lighting up a single LED (which can be powered entirely by the Raspberry Pi). The aforementioned introduction to GPIOs also shows you how to do this. So once you get to the point you can switch an LED on and off with some code, just add the speech recognition code to turn the light on or off based on the recognized speech commands.
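As a rough sketch of that last step, the callback below switches a light based on the recognized label. The label names ("On", "Off", "Stop") are assumptions about how you train your model, and FakeLED is a stand-in so you can try the logic without any wiring. On a Raspberry Pi you could pass in any object with on() and off() methods, such as a gpiozero LED.

```python
def make_light_handler(led):
    """Build a callback that switches `led` using 'On' and 'Off' labels.

    `led` can be any object with on() and off() methods, such as a
    gpiozero.LED. The label names are assumptions; use your model's labels.
    """
    def handle_results(label, score):
        if label == 'On':
            led.on()
        elif label == 'Off':
            led.off()
        elif label == 'Stop':
            return False  # quit the program
        return True
    return handle_results

class FakeLED:
    """Stand-in that lets you try the logic without any wiring."""
    def __init__(self):
        self.is_lit = False

    def on(self):
        self.is_lit = True

    def off(self):
        self.is_lit = False

# On a Pi, you might instead use: from gpiozero import LED; led = LED(17)
# (pin 17 is only an example; use the pin your LED is wired to).
led = FakeLED()
handler = make_light_handler(led)
handler('On', 0.95)
print('LED lit:', led.is_lit)
```

You would then pass the returned handler as the callback to classify_audio(), just like handle_results() in the voice-activated camera.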

Voice-controlled robot

A robot that responds to your commands isn't a science-fiction fantasy anymore! Now you have the tools to make it a reality. Using what you've learned so far, this robot probably won't do the dishes or the laundry just yet, but you can start by giving it some basic navigational instructions.

To get started, you need to assemble a small robot. There are plenty of assembly guides online for this sort of project that use a Raspberry Pi plus some wheels, motors, a motor driver, and a battery pack (to power the motors). You'll also need a power bank to power the Raspberry Pi (a battery pack that can charge your phone usually provides the 5V and 3A needed by the Raspberry Pi). Perhaps the simplest guide is the one from the Raspberry Pi Foundation to build a robot buggy.

As soon as it can do some basic maneuvering, like driving in a square over and over, just add some code for a speech recognition model so you can give the robot your driving commands. Just like we did to build a voice-activated camera, train a new speech model to understand commands such as "go," "stop," "left," and "right."

Tip: Instead of recording a new speech dataset in Teachable Machine, try the Colab tutorial to train an audio classification model, which provides a huge speech dataset that includes commands for "go," "stop," "left," and "right."

Then make it a game to see how quickly you can navigate your robot through an obstacle course using only voice commands.
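To connect the speech labels to driving actions, one simple pattern is a lookup table from label to robot method, sketched below. FakeRobot is a hypothetical stand-in that only records commands. On a real robot you would pass in your own motor controller object (for example, the gpiozero Robot class offers forward(), left(), right(), and stop() methods). The label names are assumptions about your trained model.

```python
class FakeRobot:
    """Records commands so the dispatch logic can be tried off-robot."""
    def __init__(self):
        self.log = []

    def forward(self):
        self.log.append('forward')

    def left(self):
        self.log.append('left')

    def right(self):
        self.log.append('right')

    def stop(self):
        self.log.append('stop')

def make_drive_handler(robot):
    # Map each (assumed) model label to a robot method.
    actions = {
        'go': robot.forward,
        'left': robot.left,
        'right': robot.right,
        'stop': robot.stop,
    }

    def handle_results(label, score):
        action = actions.get(label.lower())
        if action is not None:
            action()
        return True  # keep listening; other labels are ignored

    return handle_results

robot = FakeRobot()
drive = make_drive_handler(robot)
for label in ['go', 'Background Noise', 'left', 'stop']:
    drive(label, 0.9)
print(robot.log)  # ['forward', 'left', 'stop']
```

Adding a new command is then just a matter of adding another entry to the table.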

Run your app at bootup

When you're done developing your project and ready to put your Maker Kit to work, you might want your Python program to automatically start when the Raspberry Pi boots up. That way, if the board loses power or you need to reset for any reason, all you need to do is power it up and your program starts without requiring you to connect to the desktop or a remote shell.

You can make this happen by creating a systemd service, which is defined in a .service file.

As an example, we included the following detect_faces.service file in ~/aiy-maker-kit/examples. It's designed to run the example when the device boots up, but it's disabled by default.

[Unit]
Description=Face Detection example

[Service]
Environment=DISPLAY=:0
ExecStart=/usr/bin/python3 /home/pi/aiy-maker-kit/examples/


To enable this service so it runs the face detection example at bootup, run the following commands in the Raspberry Pi terminal:

  1. Create a symbolic link (a "symlink") so the .service file can be found in /lib/systemd/system/ (do not move the original file):
sudo ln -s ~/aiy-maker-kit/examples/detect_faces.service /lib/systemd/system
  2. Reload the service files so the system knows about this new one:
sudo systemctl daemon-reload
  3. Now enable this service so it starts on bootup:
sudo systemctl enable detect_faces.service

All set! You can try manually running the service with this command:

sudo service detect_faces start

Notice that you don't see any terminal output from the Python script, because you didn't actually execute the script from the terminal—it was executed by the service manager. But you can still press Q to quit the example when the camera window is in focus.

Or you can reboot the system to see it automatically start at bootup:

sudo reboot

As soon as the desktop reappears, you should see the camera window and a bounding box around your face.

Customize the service config

To create a service configuration for your own project, just copy the above file using a unique file name (the name must end with .service) and save it somewhere in your Home directory (such as /home/pi/projects/). Then edit the Description and change ExecStart so it executes your program's Python script.

You must then repeat the steps above to create a symlink to your service file, reload systemctl, and enable the service.

The .service file accepts a long list of possible configuration options, which you can read about in the .service config manual.

Note: The above example includes the line Environment=DISPLAY=:0 because the face detection demo requires a display to show the camera images. So if your project doesn't require a display, you can remove this line. However, if you are still fetching camera images from get_frames(), then you must also pass the function argument display=False so it does not attempt to show images in a desktop window.

Other tips

To manually stop the service once it's running, use this command:

sudo service detect_faces stop

If you want to disable the service so it no longer runs at bootup, use this command:

sudo systemctl disable detect_faces.service

You can also check the status of your service with this command:

sudo service detect_faces status

Models guide

The projects above use some trained models and even train a couple new ones, but there are more models you can try and the opportunities to train your own are endless. This section introduces just a few places where you can start exploring other models.

Trained models

The following models are fully trained and compiled for acceleration on the Coral USB Accelerator.

Each of the following links takes you to a page where you'll find the model files. On that page, there's a table that describes each available model and the last column includes the download links. Make sure you download the Edge TPU model.

Image classification

Models that recognize the main subject in an image, and are compatible with the Classifier API.

Most models are trained with the same dataset of 1,000 objects, but there are three models trained with the iNaturalist dataset to recognize different types of birds, insects, or plants.

Object detection

Models that identify the location and label of multiple objects, and are compatible with the Detector API.

These are trained with a dataset of 90 objects (except for the one model trained to recognize faces—the same model used above). Notice that although some models are more accurate (the mAP score is higher), they are also slower (the latency is higher).

Pose detection

Models that identify the location of several points on the human body, but only the MoveNet models are compatible with the PoseDetector API.

The PoseNet models detect the same 17 keypoints but they output a different format, so although they do work with the Maker Kit, you must use different code as demonstrated in this PoseNet example. One of the only reasons you might want PoseNet instead is to detect the pose of multiple people.

Semantic segmentation

Models that can classify individual pixels as part of different objects, allowing you to know the more precise shape of an object, compared to object detection (which provides just a bounding-box area).

We do not offer an API for this type of model in aiymakerkit because the applications for such models are more complicated. To try it out, you can use this code with the PyCoral API.

Custom models

Nowadays, creating a model for common machine vision tasks such as image classification and object detection does not require that you actually design the neural network yourself. Instead, you can (and should) use one of many existing network designs that are proven effective, and instead focus on the training effort (gathering your training dataset and fine-tuning the training script).

Training a neural network model from scratch means you start with a network that has no learned weights—it's just a blank neural network. Depending on the model, training a model from scratch can take a very long time. So another shortcut often used (in addition to using an existing network design) is re-using the weights somebody else created when training the model, and then "fine tuning" the weights with a much shorter version of training.

This simplified training process is often called "transfer learning" or just "retraining," and it's the technique we recommend you use with the following training scripts.

All the training scripts below run on Google Colab, which we introduced in the pose classification project. Each script includes its own training dataset so you can immediately learn how to retrain a model, but you can then replace that dataset with your own. To learn more about Google Colab, watch this video.

Did you know?
Every training dataset includes sample input data (such as images) along with the corresponding "ground truth" results that we want to receive from the model's output (the label for an object in an image). A training program feeds the sample data to the model and the model performs a prediction (also called an "inference") for each input. Then, by calculating how close the prediction is to the ground truth result, the training program updates some numbers in the model so it can provide a better result next time. This repeats over and over and that's how machine learning models actually learn.
Note: If you're an experienced ML engineer, you should instead read TensorFlow models on the Edge TPU.
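To make the "Did you know?" description concrete, here is a deliberately tiny training loop in plain Python. It is only an illustration of the predict, compare, and update cycle: the "model" is a single line (prediction = weight * x + bias) rather than a neural network, and the dataset, learning rate, and number of passes are made-up values.

```python
# Tiny dataset: inputs and their "ground truth" outputs (here, y = 2x + 1).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weight, bias = 0.0, 0.0   # the "numbers in the model", initially untrained
learning_rate = 0.05

for epoch in range(2000):
    for x, truth in samples:
        prediction = weight * x + bias   # the model's inference
        error = prediction - truth       # how far off was it?
        # Nudge each number in the direction that shrinks the error.
        weight -= learning_rate * error * x
        bias -= learning_rate * error

print(round(weight, 2), round(bias, 2))  # close to 2.0 and 1.0
```

Real training programs follow the same cycle, but with millions of numbers to adjust and far more sample data, which is why retraining from existing weights saves so much time.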

Train an image classification model

Creating your own image classifier is quite easy these days. In fact, you already did it in the custom image classifier project! And if you collect enough training data (more images for each class in your dataset), that technique allows you to create very effective image classifiers that are quite fast on the Raspberry Pi with the USB Accelerator.

Another tool we recommend is the web-based Teachable Machine, which provides a very simple graphical interface to upload and train a classification model:

  1. Open a New Project on Teachable Machine, and then click Image Project and choose Standard image model.
  2. Rename each class (default names are "Class 1" and "Class 2") and add more classes as needed.
  3. Click the corresponding buttons for each class to upload or capture training images.
  4. Click Train model to generate your model. This should finish almost immediately.
  5. Click Export model > TensorFlow Lite > EdgeTPU, and then Download my model.

    It might take a couple minutes to convert the model, but it will soon download a ZIP with a .tflite and labels.txt file you can use with the Classifier API.

Or, if you want to go the more traditional route and train a model using TensorFlow directly, then we suggest you try our Colab tutorial to Retrain a classification model for the Edge TPU.

Train an object detection model

Training an object detection model isn't very complicated either, but collecting and annotating your dataset can be a bit confusing at first and then very time consuming to perform.

That's because, whereas an image classification dataset only needs to be organized into labeled folders, an object detection dataset must be fully annotated with bounding box information. That is, in order for your model to learn how to locate the position of an object, you must specify the correct bounding-box coordinates for each object you want it to learn (in every image of your training dataset).

Fortunately, there are several programs available online that can help you annotate your object detection dataset. We suggest using a free, open-source annotation tool such as Make Sense or LabelImg to add bounding-box annotations to your images. Then, for compatibility with the following Colab tutorials, export your annotations into the Pascal VOC format, which creates an XML file for each image with all the annotation information.
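If you're curious what those Pascal VOC files contain, the sketch below reads the bounding boxes from one annotation using only the Python standard library. The XML here is a hand-written example with made-up values (file name, label, and coordinates), and read_boxes() is a hypothetical helper, so treat it as a way to inspect your annotations rather than as part of any training pipeline.

```python
import xml.etree.ElementTree as ET

# A hand-written example annotation; real files come from your labeling tool.
VOC_XML = """
<annotation>
  <filename>feeder_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>bird</name>
    <bndbox>
      <xmin>120</xmin><ymin>85</ymin><xmax>310</xmax><ymax>260</ymax>
    </bndbox>
  </object>
</annotation>
"""

def read_boxes(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall('object'):
        label = obj.findtext('name')
        box = obj.find('bndbox')
        coords = [int(float(box.findtext(tag)))
                  for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        boxes.append((label, *coords))
    return boxes

print(read_boxes(VOC_XML))  # [('bird', 120, 85, 310, 260)]
```

Each object element holds one label and one box, so an image with several objects simply has several object elements.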

But if you've never trained an object detection model, we suggest you first try running one of these tutorials so you can observe the process and even inspect the annotations before you start building your own dataset.

Both of these Colab tutorials provide a default training dataset for retraining the model, but also support uploading a custom dataset:

  • Retrain the EfficientDet detection model

    For beginners, this tutorial is the easiest to reuse with your own dataset. The Colab notebook is designed to support custom datasets that are uploaded in the Pascal VOC format and it usually finishes retraining in less than an hour.

  • Retrain the MobileDet detection model

    This Colab also accepts annotations in the Pascal VOC XML format and is designed to support custom dataset uploads. Beware that training this MobileDet model takes much longer, compared to the EfficientDet model in the previous tutorial. However, the MobileDet model typically executes faster than EfficientDet, while its accuracy is still very good (although usually not as good as the larger variations of EfficientDet).

Train a pose classification model

We covered this in the project to build a pose classifier (using this Colab notebook).

Just remember that this pose classification model does not perform pose estimation. For that, you need to actually feed your image to a model like MoveNet or PoseNet. Your custom pose classification model does not take an image as input; it takes the keypoint data that's output by one of those pose estimation models.

Train an audio classification model

If you want to create a model that can classify sounds from the microphone (including speech), we recommend the following options:

  • Follow the steps in the above project to build a voice-activated camera, using Teachable Machine. You can record audio samples to classify any sounds you want, such as types of music or speech commands.

    This is the easiest way to train a model with your own audio recordings because Teachable Machine includes a microphone recorder to gather your dataset. You can even download the dataset to preserve it for future Teachable Machine projects, or to process the samples using other tools.

  • Follow the Colab tutorial to Retrain a speech recognition model with TFLite Model Maker. This produces the same kind of model as Teachable Machine but gives you more control of the training process. The difficult part with this option is creating a new audio dataset, because unlike Teachable Machine, this Colab does not record audio samples for you. So to use a custom dataset, you instead need to upload a collection of one-second WAV files that you already recorded.

    However, this Colab also offers the ability to train the model with an existing dataset that includes thousands of samples for 30 speech commands.

    By default, the Colab uses just a subset of this speech dataset—it trains a model to classify "up", "down", "left", "right", "go", "stop", "on", and "off." Just open the Colab and click Runtime > Run all and it will train this version of the model for you!

    If you want to choose your own commands from the same dataset, just find the line of code that specifies the commands list (under "Prepare the speech commands dataset") and specify a different set of commands that are available in the dataset. Here are all the available commands:

    bed, bird, cat, dog, down, eight, five, four, go, happy, house, left, marvin, nine, no, off, on, one, right, seven, sheila, six, stop, three, tree, two, up, wow, yes, zero

    Several words in this list are not necessarily useful as speech commands (such as "tree," "bed," and "marvin") but they're included so the trained model can learn to recognize (and ignore) non-trigger words. For example, your model may become better at accurately classifying "three" if it also learns to distinguish it from "tree." Thus, you might find it useful to train the model with more commands than you actually plan on using.

    You can learn more about the speech commands dataset here.
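Before you edit the commands list in the Colab, it can help to check your choices against the words listed above. The following is a small standalone sketch for that purpose. The names AVAILABLE and check_commands() are hypothetical and do not come from the Colab notebook.

```python
AVAILABLE = {
    'bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'four', 'go',
    'happy', 'house', 'left', 'marvin', 'nine', 'no', 'off', 'on', 'one',
    'right', 'seven', 'sheila', 'six', 'stop', 'three', 'tree', 'two',
    'up', 'wow', 'yes', 'zero',
}

def check_commands(commands):
    """Return the commands unchanged, or raise if any are not in the dataset."""
    unknown = sorted(set(commands) - AVAILABLE)
    if unknown:
        raise ValueError('Not in the speech commands dataset: '
                         + ', '.join(unknown))
    return list(commands)

print(check_commands(['go', 'stop', 'left', 'right']))
```

A word such as "cheese" would raise an error here, which tells you that command needs your own recordings instead.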

Tip: If the commands you want to use are available in the speech commands dataset, we recommend using them instead of recording your own with Teachable Machine. This dataset has thousands of samples recorded by thousands of different voices, so the model will be much more accurate compared to a dataset you create with only one voice sampled only dozens of times.

API reference

The following Python APIs are provided by the aiymakerkit module, which is used in the above tutorials.

We designed these APIs to simplify your code for the most common TensorFlow Lite inferencing tasks. But by no means does this represent everything you can do with ML on the Raspberry Pi and Coral USB Accelerator.

To learn more, check out the source code.

Image classification


Performs inferencing with an image classification model.


model (str) – Path to a .tflite file (compiled for the Edge TPU).

get_classes(frame, top_k=1, threshold=0.0)

Gets classification results as a list of ordered classes.

  • frame – The bitmap image to pass through the model.
  • top_k (int) – The number of top results to return.
  • threshold (float) – The minimum confidence score for returned results.

A list of Class objects representing the classification results, ordered by scores.
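To illustrate how top_k and threshold interact, here is a plain-Python sketch that filters and ranks a list of scores. It is not the aiymakerkit implementation: top_classes() is a hypothetical function, it returns simple tuples rather than Class objects, and treating the threshold as inclusive is an assumption.

```python
def top_classes(scores, top_k=1, threshold=0.0):
    """Return up to top_k (class_id, score) pairs scoring >= threshold."""
    candidates = [(class_id, score)
                  for class_id, score in enumerate(scores)
                  if score >= threshold]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:top_k]

scores = [0.05, 0.70, 0.20, 0.05]  # one made-up score per class id
print(top_classes(scores))                          # [(1, 0.7)]
print(top_classes(scores, top_k=3, threshold=0.1))  # [(1, 0.7), (2, 0.2)]
```

Notice that asking for three results can return fewer when the threshold removes low-scoring classes.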

Object detection


Performs inferencing with an object detection model.


model (str) – Path to a .tflite file (compiled for the Edge TPU). Must be an SSD model.

get_objects(frame, threshold=0.01)

Gets a list of objects detected in the given image frame.

  • frame – The bitmap image to pass through the model.
  • threshold (float) – The minimum confidence score for returned results.

A list of Object objects, each of which contains a detected object’s id, score, and bounding box as BBox.

Pose detection


Performs inferencing with a pose detection model such as MoveNet.


model (str) – Path to a .tflite file (compiled for the Edge TPU).


get_pose(frame)

Gets the keypoint pose data for one person.


frame – The bitmap image to pass through the model.


The COCO-style keypoint results, reshaped to [17, 3], in which each keypoint has [y, x, score].

get_keypoint_types(frame, keypoints, threshold=0.01)

Converts keypoint data into a dictionary with values scaled for the image size.

  • frame – The original image used for pose detection.
  • keypoints – A COCO-style keypoints tensor in shape [17, 3], such as returned by PoseDetector.get_pose().
  • threshold (float) – The minimum confidence score for returned results.

A dictionary with an item for every body keypoint detected above the given threshold, wherein each key is the KeypointType and the value is a tuple for its (x,y) location.
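
The conversion described above can be sketched in plain Python. This is only an illustration, not the library's code. It assumes the [y, x, score] coordinates are normalized to the range 0 to 1, it takes the image width and height directly instead of a frame, and it uses a stand-in KeypointType enum that follows the standard COCO keypoint order (only NOSE = 0 is confirmed by this reference). The keypoints_to_pixels name is hypothetical.

```python
from enum import IntEnum


class KeypointType(IntEnum):
    """Stand-in enum using the standard COCO 17-keypoint ordering."""
    NOSE = 0
    LEFT_EYE = 1
    RIGHT_EYE = 2
    LEFT_EAR = 3
    RIGHT_EAR = 4
    LEFT_SHOULDER = 5
    RIGHT_SHOULDER = 6
    LEFT_ELBOW = 7
    RIGHT_ELBOW = 8
    LEFT_WRIST = 9
    RIGHT_WRIST = 10
    LEFT_HIP = 11
    RIGHT_HIP = 12
    LEFT_KNEE = 13
    RIGHT_KNEE = 14
    LEFT_ANKLE = 15
    RIGHT_ANKLE = 16


def keypoints_to_pixels(keypoints, width, height, threshold=0.01):
    """Map each keypoint scoring at or above threshold to its (x, y) pixel position.

    keypoints: 17 rows of [y, x, score], with y and x assumed normalized to 0..1.
    """
    result = {}
    for index, (y, x, score) in enumerate(keypoints):
        if score >= threshold:
            result[KeypointType(index)] = (int(x * width), int(y * height))
    return result


# Hypothetical pose: only the nose and left wrist are confidently detected.
pose = [[0.0, 0.0, 0.0] for _ in range(17)]
pose[KeypointType.NOSE] = [0.25, 0.5, 0.9]
pose[KeypointType.LEFT_WRIST] = [0.75, 0.125, 0.6]

points = keypoints_to_pixels(pose, width=640, height=480, threshold=0.2)
print(points[KeypointType.NOSE])        # (320, 120)
print(points[KeypointType.LEFT_WRIST])  # (80, 360)
```

Note that each input row is ordered [y, x, score], while the output tuples are (x, y), which is the order that drawing functions usually expect.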


KeypointType

Pose keypoints in COCO-style format.

NOSE = 0

Pose classification


Performs pose classification with a model from


model (str) – Path to a .tflite file.

get_class(keypoints, threshold=0.01)

Gets the top pose classification result.

  • keypoints – The COCO-style pose keypoints, as output from a pose detection model.
  • threshold (float) – The minimum confidence score for the returned classification.

The class id for the top result.

Camera & drawing

(title='Camera', size=(640, 480), handle_key=None, capture_device_index=0, mirror=True, display=True, return_key=False)

Gets a stream of image frames from the camera.

  • title (str) – A title for the display window.
  • size (tuple) – The image resolution for all frames, as an int tuple (x,y).
  • handle_key – A callback function that accepts arguments (key, frame) for a key event and the image frame from the moment the key was pressed. This has no effect if display is False.
  • capture_device_index (int) – The Linux device ID for the camera.
  • mirror (bool) – Whether to flip the images horizontally (set True for a selfie view).
  • display (bool) – Whether to show the camera images in a desktop window (set False if you don’t use a desktop).
  • return_key (bool) – Whether to also return any key presses. If True, the function returns a tuple with (frame, key) instead of just the frame.

An iterator that yields each image frame from the default camera, or a (frame, key) tuple if return_key is True.

(filename, frame)

Saves an image to a specified location.

  • filename (str) – The path where you’d like to save the image.
  • frame – The bitmap image to save.

(frame, classes, labels, color=(86, 104, 237))

Draws image classification names on the display image.

  • frame – The bitmap image to draw upon.
  • classes – A list of Class objects representing the classified objects.
  • labels (str) – The labels file corresponding to the model used for image classification.
  • color (tuple) – The BGR color (int,int,int) to use for the text.

(frame, objs, labels=None, color=(86, 104, 237), thickness=5)

Draws bounding boxes for detected objects on the display image.

  • frame – The bitmap image to draw upon.
  • objs – A list of Object objects for which you want to draw bounding boxes on the frame.
  • labels (str) – The labels file corresponding to the model used for object detection.
  • color (tuple) – The BGR color (int,int,int) to use for the bounding box.
  • thickness (int) – The bounding box pixel thickness.

(frame, keypoints, threshold=0.2, color=(86, 104, 237), circle_radius=5, line_thickness=2)

Draws the pose skeleton on the image sent to the display and returns structured keypoints.

  • frame – The bitmap image to draw upon.
  • keypoints – A COCO-style pose keypoints tensor in shape [17, 3], such as returned by PoseDetector.get_pose().
  • threshold (float) – The minimum confidence score for returned keypoint data.
  • color (tuple) – The BGR color (int,int,int) to use for the pose skeleton.
  • circle_radius (int) – The radius size of each keypoint dot.
  • line_thickness (int) – The pixel thickness for lines connecting the keypoint dots.

A dictionary with an item for every body keypoint that is detected above the given threshold, wherein each key is the KeypointType and the value is a tuple for its (x,y) location. (Exactly the same return as get_keypoint_types().)

(frame, label, color=(86, 104, 237))

Draws a text label on the image sent to the display.

  • frame – The bitmap image to draw upon.
  • label (str) – The string to write.
  • color (tuple) – The BGR color (int,int,int) for the text.

(frame, point, radius, color=(86, 104, 237), thickness=5)

Draws a circle onto the image sent to the display.

  • frame – The bitmap image to draw upon.
  • point (tuple) – An (x,y) tuple specifying the circle center.
  • radius (int) – The radius size of the circle.
  • color (tuple) – The BGR color (int,int,int) to use.
  • thickness (int) – The circle’s pixel thickness. Set to -1 to fill the circle.

(frame, bbox, color=(255, 0, 0), thickness=5)

Draws a rectangle onto the image sent to the display.

  • frame – The bitmap image to draw upon.
  • bbox – A BBox object.
  • color (tuple) – The BGR color (int,int,int) to use.
  • thickness (int) – The box pixel thickness. Set to -1 to fill the box.
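
The color parameters in this section take BGR tuples, which reverse the more familiar RGB order. If you pick colors in RGB, a small helper like this one can convert them. The rgb_to_bgr function is a hypothetical convenience and is not part of the aiymakerkit module.

```python
def rgb_to_bgr(rgb):
    """Reverse an (R, G, B) tuple into (B, G, R) order."""
    r, g, b = rgb
    return (b, g, r)


# An RGB color of (237, 104, 86) is the BGR default (86, 104, 237) used above.
print(rgb_to_bgr((237, 104, 86)))  # (86, 104, 237)
```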

Audio classification

classify_audio(model, callback, labels_file=None, inference_overlap_ratio=0.1, buffer_size_secs=2.0, buffer_write_size_secs=0.1, audio_device_index=None)

Continuously classifies audio samples from the microphone, yielding results to your own callback function.

Your callback function receives the top classification result for every inference performed. Although the audio sample size is fixed based on the model input size, you can adjust the rate of inference with inference_overlap_ratio. A larger overlap means the model runs inference more frequently but with larger amounts of sample data shared between inferences, which can result in duplicate results.

  • model (str) – Path to a .tflite file.
  • callback – A function that takes two arguments (in order): a string for the classification label, and a float for the prediction score. The function must return a boolean: True to continue running inference, or False to stop.
  • labels_file (str) – Path to a labels file (required only if the model does not include metadata labels). If provided, this overrides the labels file provided in the model metadata.
  • inference_overlap_ratio (float) – The amount of audio that should overlap between each sample used for inference. May be 0.0 up to (but not including) 1.0. For example, if set to 0.5 and the model takes a one-second sample as input, the model will run an inference every half second, or if set to 0, then there is no overlap and it will run once each second.
  • buffer_size_secs (float) – The length of audio to hold in the audio buffer.
  • buffer_write_size_secs (float) – The length of audio to capture into the buffer with each sampling from the microphone.
  • audio_device_index (int) – The audio input device index to use.
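
As a rough illustration of the callback contract described above (two arguments in, a boolean out), here is a sketch you can run without a microphone. The make_callback helper, the stop word, and the score cutoff are all hypothetical, and fake_classify is only a stand-in that feeds canned results to the callback in place of real inference.

```python
def make_callback(stop_word="stop", min_score=0.5):
    """Build a callback that records confident results and halts on stop_word."""
    heard = []

    def on_result(label, score):
        if score < min_score:
            return True  # Ignore low-confidence results but keep listening.
        heard.append(label)
        return label != stop_word  # Returning False ends the loop.

    return on_result, heard


def fake_classify(results, callback):
    """Hypothetical stand-in for the inference loop, using canned results."""
    for label, score in results:
        if not callback(label, score):
            break


callback, heard = make_callback()
fake_classify([("go", 0.9), ("left", 0.3), ("stop", 0.8), ("go", 0.9)], callback)
print(heard)  # ['go', 'stop']
```

The final "go" is never seen because the callback returned False when it heard "stop".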

Performs classifications with a speech classification model.

This is intended for situations where you want to write a loop in your code that fetches new classification results in each iteration (by calling next()). If you would rather receive a callback each time a new classification is detected, use classify_audio() instead.

  • model (str) – Path to a .tflite file.
  • labels_file (str) – Path to a labels file (required only if the model does not include metadata labels). If provided, this overrides the labels file provided in the model metadata.
  • inference_overlap_ratio (float) – The amount of audio that should overlap between each sample used for inference. May be 0.0 up to (but not including) 1.0. For example, if set to 0.5 and the model takes a one-second sample as input, the model will run an inference every half second, or if set to 0, it will run once each second.
  • buffer_size_secs (float) – The length of audio to hold in the audio buffer.
  • buffer_write_size_secs (float) – The length of audio to capture into the buffer with each sampling from the microphone.
  • audio_device_index (int) – The audio input device index to use.
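
The arithmetic behind inference_overlap_ratio can be checked with a few lines of Python. This sketch simply restates the examples given above (a one-second sample with 0.5 overlap runs every half second). The inference_interval function is a hypothetical helper, not part of the library.

```python
def inference_interval(sample_secs, overlap_ratio):
    """Seconds between consecutive inferences for a given overlap ratio."""
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0.0, 1.0)")
    return sample_secs * (1.0 - overlap_ratio)


print(inference_interval(1.0, 0.5))  # 0.5
print(inference_interval(1.0, 0.0))  # 1.0
```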

Returns a single speech classification.

Each time you call this, it pulls from a queue of recent classifications. So even if there are many classifications in a short period of time, this always returns them in the order received.


block (bool) – Whether this function should block until the next classification arrives (if there are no queued classifications). If False, it always returns immediately and returns None if the classification queue is empty.
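
The queue behavior described above can be sketched with Python's standard queue module. This is only a model of the documented behavior and not the library's code. The next_result function is hypothetical, and representing each result as a (label, score) tuple is an assumption.

```python
import queue

results = queue.Queue()


def next_result(block=True):
    """Return the oldest queued result, or None if non-blocking and empty."""
    try:
        return results.get(block=block)
    except queue.Empty:
        return None


results.put(("yes", 0.9))
results.put(("no", 0.8))
print(next_result(block=False))  # ('yes', 0.9)
print(next_result(block=False))  # ('no', 0.8)
print(next_result(block=False))  # None
```

Results come back in the order they arrived, and a non-blocking call on an empty queue returns None right away instead of waiting.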



Reads labels from the model file metadata.


model (str) – Path to the .tflite file.


A dictionary of (int, string), mapping label ids to text labels.

More information

Learn more

To learn more about some of the tools and topics covered above, check out these other resources:

  • Linux basics: This is a great resource if you're new to Raspberry Pi OS (which is a Linux operating system). It includes lots of information and tips about using the terminal (the command line) and many other features of a Linux OS.

  • Python basics: This is a good reference for basic Python programming syntax. If you want a comprehensive introduction to Python, there are lots of online tutorials you can find just by searching the internet for "python for beginners."

  • Machine learning basics with TensorFlow: You can find a plethora of educational content online about ML, but this page offers some specific recommendations about learning ML from the TensorFlow team.

Or maybe you're looking for the Maker Kit source code:

Get help

If you're having trouble with any part of the AIY Maker Kit code or documentation, try these resources:

Upgrade an existing Raspberry Pi system

For experienced Raspberry Pi users only!

If you have a fully-functional Raspberry Pi system and you want to upgrade it to accelerate TensorFlow Lite models with the Coral USB Accelerator, then these instructions are for you.

Beware: We cannot guarantee this will work for you because there are countless system configurations you might have that we have not tested. If you follow these instructions and encounter errors you're unable to resolve, we cannot offer support and you should probably follow the instructions to flash the Maker Kit system image.

Before you start, be sure you have all the required hardware:

  • Raspberry Pi 3 or 4 running Raspberry Pi OS with desktop (Buster only)
  • Raspberry Pi Camera
  • Coral USB Accelerator
Note: We currently do not support Raspberry Pi OS Bullseye, due to breaking changes in the camera framework. Use the Buster release, and make sure it includes the desktop.

Here's how you can upgrade your system with the Maker Kit software:

  1. Before you power the board, connect the Pi Camera Module. Leave the USB Accelerator disconnected until you finish the setup script.

  2. Power the board and connect to your Raspberry Pi terminal. You must have access to the Raspberry Pi OS desktop because our demos depend on the desktop to show the camera video.

  3. Open the Terminal on your Raspberry Pi. Navigate to a path where you want to save the Maker Kit API repository and run this command:

    bash <(curl

That's it! Now go check out the Maker Kit project tutorials!

Project complete!

You did it! Whether this was your first hackable project or you’re a seasoned maker, we hope this project has sparked new ideas for you. Keep tinkering and remixing this project!