Grasp Planning
Finding a gripper configuration that maximizes a success (or quality) metric.
Methods fall into two categories based on success criteria:
- analytic methods
- empirical (or data-driven) methods
Computer Vision Techniques in Robot Grasping
Analytic grasp planning methods register images of rigid objects to a known database of 3D models, which typically involves segmentation, classification, and geometric pose estimation from 3D point cloud data in order to index precomputed grasps.
Deep learning approaches:
- estimate 3D object shape and pose directly from color and depth images
- detect graspable regions directly in images without explicitly representing object shape and pose
Problem Statement
Planning a robust planar parallel-jaw grasp for a singulated rigid object resting on a table, based on point clouds from a depth camera.
Input: a candidate grasp and a depth image
Output: an estimate of robustness or probability of success under uncertainty in sensing and control
A. Assumptions
- a parallel-jaw gripper
- rigid objects singulated on a planar worksurface
- single-view (2.5D) point clouds taken with a depth camera
B. Definitions
States
\(x = (O,T_o,T_c,\gamma)\)
O: the geometry and mass properties of an object
\(T_o\): the 3D pose of the object
\(T_c\): the 3D pose of the camera
\(\gamma\): the coefficient of friction between the object and gripper.
Grasps
\(u = (p,\phi) \in \mathbb{R}^3 \times S^1\)
denotes a parallel-jaw grasp in 3D space, specified by a center \(p = (x, y, z) \in \mathbb{R}^3\) and an angle \(\phi \in S^1\) of the grasp axis in the table plane.
Point Clouds
\(y \in \mathbb{R}_+^{H \times W}\): a 2.5D point cloud, represented as a depth image with height \(H\) and width \(W\)
Robust Analytic Grasp Metrics
\(S(u,x) \in \{0,1\}\): a binary-valued grasp success metric, such as force closure or physical lifting.
\(p(S, u, x, y)\): a joint distribution on grasp success, grasps, states, and point clouds modeling imprecision in sensing and control.
\(Q(u,y) = \mathbb{E}[S \mid u, y]\): the robustness of a grasp given an observation.
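A minimal sketch of these definitions as data structures, together with a Monte-Carlo estimate of the robustness \(Q(u,y) = \mathbb{E}[S \mid u, y]\). The `sample_state` and `grasp_success` helpers (a posterior state sampler and the success metric) are hypothetical placeholders, not part of the original formulation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    """x = (O, T_o, T_c, gamma)."""
    obj_mesh: object        # O: geometry and mass properties of the object
    T_obj: np.ndarray       # T_o: 4x4 object pose
    T_cam: np.ndarray       # T_c: 4x4 camera pose
    friction: float         # gamma: object-gripper friction coefficient

@dataclass
class Grasp:
    """u = (p, phi) in R^3 x S^1."""
    center: np.ndarray      # p = (x, y, z) grasp center
    angle: float            # phi: grasp axis angle in the table plane

def robustness(grasp, depth_image, sample_state, grasp_success, n=100):
    """Monte-Carlo estimate of Q(u, y) = E[S | u, y].

    sample_state(depth_image) draws x ~ p(x | y) and grasp_success(u, x)
    evaluates S(u, x); both are hypothetical helpers supplied by the caller.
    """
    samples = [grasp_success(grasp, sample_state(depth_image)) for _ in range(n)]
    return float(np.mean(samples))
```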
C. Objective
Learn a robustness function \(Q_{\theta^*}(u,y) \in [0,1]\) over many possible grasps, objects, and images that classifies grasps according to the binary success metric:
\(\theta^* = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{p(S,u,x,y)}\big[L(S, Q_{\theta}(u,y))\big]\)
\(L\): the cross-entropy loss function
\(\Theta\): the space of parameters of the Grasp Quality Convolutional Neural Network (GQ-CNN)
Learning Q rather than directly learning the policy allows us to enforce task-specific constraints without having to update the learned model.
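A sketch of this objective as an empirical binary cross-entropy loss over sampled (grasp, image, success) tuples; `q_theta` stands in for the GQ-CNN forward pass and is an assumption, not a given API.

```python
import numpy as np

def cross_entropy(success_labels, q_pred, eps=1e-7):
    """L(S, Q_theta(u, y)): binary cross-entropy between labels and predictions."""
    q = np.clip(q_pred, eps, 1.0 - eps)
    return -np.mean(success_labels * np.log(q)
                    + (1.0 - success_labels) * np.log(1.0 - q))

# theta* = argmin_theta E_{p(S,u,x,y)}[ L(S, Q_theta(u, y)) ] is approximated by
# minimizing the empirical loss over a dataset of (grasp, image, success) samples:
#   loss = cross_entropy(S_batch, q_theta(u_batch, y_batch))
```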
Learning a Grasp Robustness Function
A. Dataset Generation
Graphical Model:
\(p(S,u,x,y)\) is modeled as the product of a state distribution \(p(x)\),
an observation model \(p(y \mid x)\),
a grasp candidate model \(p(u \mid x)\),
and an analytic model of grasp success \(p(S \mid u,x)\).
Model the state distribution as:
\(p(x) = p(\gamma)\, p(O)\, p(T_o \mid O)\, p(T_c)\)
Grasp candidate model p(u | x) is a uniform distribution over pairs of antipodal contact points on the object surface that form a grasp axis parallel to the table plane.
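A sketch of the antipodality test behind \(p(u \mid x)\): a pair of surface contacts qualifies when the line between them lies inside both friction cones and the grasp axis is approximately parallel to the table. Sampling contact points from the object mesh is abstracted away, and the tolerance value is illustrative.

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, friction,
                 table_normal=np.array([0.0, 0.0, 1.0]), parallel_tol=0.05):
    """Return True if contacts (p1, n1), (p2, n2) form an antipodal, table-parallel grasp.

    n1, n2 are outward surface normals; friction is the coefficient gamma.
    """
    axis = p2 - p1
    axis = axis / np.linalg.norm(axis)
    half_angle = np.arctan(friction)   # friction cone half-angle
    # The grasp line must lie inside the friction cone at each contact.
    in_cone_1 = np.arccos(np.clip(np.dot(-n1, axis), -1.0, 1.0)) < half_angle
    in_cone_2 = np.arccos(np.clip(np.dot(n2, axis), -1.0, 1.0)) < half_angle
    # Grasp axis approximately parallel to the table plane.
    parallel = abs(np.dot(axis, table_normal)) < parallel_tol
    return bool(in_cone_1 and in_cone_2 and parallel)
```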
Observation model: \(y = \alpha \hat{y} + \epsilon\)
ŷ is a rendered depth image for a given object in a given pose
α is a Gamma random variable modeling depth-proportional noise
\(\epsilon\) is zero-mean Gaussian Process noise over pixel coordinates with bandwidth l and measurement noise σ modeling additive noise
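A simplified sketch of this noise model, assuming the Gamma multiplier \(\alpha\) has mean 1 and approximating the correlated Gaussian Process noise \(\epsilon\) by bilinearly upsampling low-resolution i.i.d. Gaussian noise (a cheap surrogate, not the exact implementation); parameter values are placeholders.

```python
import numpy as np
from scipy.ndimage import zoom

def corrupt_depth(rendered_depth, gamma_shape=1000.0, gp_sigma=0.005, gp_rescale=8):
    """Apply y = alpha * y_hat + epsilon to a rendered depth image.

    alpha ~ Gamma with mean 1 models depth-proportional noise; epsilon is
    spatially correlated additive noise whose correlation length is roughly
    gp_rescale pixels.
    """
    h, w = rendered_depth.shape
    alpha = np.random.gamma(gamma_shape, 1.0 / gamma_shape)   # mean 1, small variance
    # Low-resolution i.i.d. Gaussian noise, upsampled to correlate nearby pixels.
    low = np.random.normal(0.0, gp_sigma, (h // gp_rescale, w // gp_rescale))
    eps = zoom(low, (h / low.shape[0], w / low.shape[1]), order=1)
    return alpha * rendered_depth + eps
```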
Model grasp success as:
\(S(u,x) = \begin{cases} 1 & E_Q > \delta \text{ and collfree}(u,x) \\ 0 & \text{otherwise} \end{cases}\)
\(E_Q\): the robust epsilon quality
\(\delta\): a threshold on the robust epsilon quality
collfree(u,x): true when the gripper does not collide with the object or the table
Database
Parallel-Jaw Grasps
For each grasp, evaluate the expected epsilon quality \(E_Q\) under object pose, gripper pose, and friction coefficient uncertainty using Monte-Carlo sampling.
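A sketch of this Monte-Carlo evaluation and the thresholding into the binary label \(S\), reusing the hypothetical `State`/`Grasp` structures from the Definitions sketch; `epsilon_quality` and `collision_free` stand in for the wrench-space quality metric and collision checker, and the perturbation magnitudes and threshold are illustrative.

```python
import numpy as np

def expected_epsilon_quality(grasp, state, epsilon_quality, n=25,
                             pose_sigma=0.0025, angle_sigma=0.01, friction_sigma=0.1):
    """E_Q: mean epsilon quality under pose and friction-coefficient perturbations."""
    qualities = []
    for _ in range(n):
        center = grasp.center + np.random.normal(0.0, pose_sigma, 3)
        angle = grasp.angle + np.random.normal(0.0, angle_sigma)
        friction = max(np.random.normal(state.friction, friction_sigma), 0.0)
        qualities.append(epsilon_quality(state.obj_mesh, center, angle, friction))
    return float(np.mean(qualities))

def success_label(grasp, state, epsilon_quality, collision_free, threshold=0.002):
    """S(u, x) = 1 iff E_Q exceeds the threshold and the grasp is collision-free.

    The default threshold is an illustrative value, not the paper's.
    """
    e_q = expected_epsilon_quality(grasp, state, epsilon_quality)
    return int(e_q > threshold and collision_free(grasp, state))
```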
Rendered Point Clouds
B. Grasp Quality Convolutional Neural Network
Architecture: GQ-CNN
Input: the gripper depth from the camera \(z\) and a depth image centered on the grasp center pixel \(v = (i, j)\) and aligned with the grasp axis orientation \(\phi\).
Normalize the input data by subtracting the mean and dividing by the standard deviation of the training data
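A minimal sketch of this normalization, assuming the image and gripper-depth statistics are computed once over the training set and stored in a `stats` dictionary (a hypothetical convention).

```python
import numpy as np

def normalize_inputs(depth_crops, gripper_depths, stats):
    """Standardize both GQ-CNN input streams with training-set statistics.

    stats holds precomputed 'im_mean', 'im_std', 'z_mean', 'z_std'.
    """
    ims = (np.asarray(depth_crops) - stats["im_mean"]) / stats["im_std"]
    zs = (np.asarray(gripper_depths) - stats["z_mean"]) / stats["z_std"]
    return ims, zs
```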
Training Dataset:
Associate each grasp with a pixel \(v\), orientation \(\phi\), and depth \(z\) relative to a rendered depth image.
Compute these parameters by transforming grasps into the camera frame of reference using the camera pose \(T_c\) and projecting the 3D grasp position and orientation onto the imaging plane of the camera.
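A sketch of this projection step, assuming a standard pinhole camera with a known intrinsics matrix `K` (an assumption, not stated above): the grasp center is mapped into the camera frame with the world-to-camera transform derived from \(T_c\), projected to a pixel, and the image-plane direction of the grasp axis gives the angle \(\phi\).

```python
import numpy as np

def grasp_to_image(center_world, axis_world, T_world_to_cam, K):
    """Map a 3D grasp to (pixel v, orientation phi, depth z) in the depth image.

    T_world_to_cam: 4x4 world-to-camera transform (the inverse of the camera pose T_c).
    K: 3x3 pinhole intrinsics matrix (assumed known).
    """
    # Grasp center in the camera frame.
    p_cam = (T_world_to_cam @ np.append(center_world, 1.0))[:3]
    z = p_cam[2]                                   # gripper depth from the camera
    uvw = K @ p_cam
    v = (uvw[0] / uvw[2], uvw[1] / uvw[2])         # pixel coordinates
    # Grasp axis in the camera frame; its image-plane direction gives phi.
    axis_cam = T_world_to_cam[:3, :3] @ axis_world
    phi = np.arctan2(axis_cam[1], axis_cam[0])
    return v, phi, z
```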
Optimization:
Stochastic gradient descent (SGD)
Initialize the model weights by sampling from a zero-mean Gaussian with variance \(\frac{2}{n_i}\), where
\(n_i\) is the number of inputs to the \(i\)-th network layer.
Augment the dataset with reflections about the vertical and horizontal image axes, rotations, and adaptively sampled image noise.
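A sketch of the reflection-based augmentation, assuming the crop is already rotated into the grasp frame so that reflections preserve the success label; the noise magnitude is a placeholder for the adaptive noise sampling.

```python
import numpy as np

def augment(depth_crop, gripper_depth, label, rng=np.random):
    """Generate reflected, noise-perturbed copies of a grasp-aligned depth crop."""
    samples = []
    for img in (depth_crop,
                np.fliplr(depth_crop),                  # reflect about vertical axis
                np.flipud(depth_crop),                  # reflect about horizontal axis
                np.flipud(np.fliplr(depth_crop))):      # 180-degree rotation
        noisy = img + rng.normal(0.0, 0.001, img.shape) # placeholder noise magnitude
        samples.append((noisy, gripper_depth, label))
    return samples
```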
Grasp Planning
Grasping policy: \(\pi_{\theta}(y) = \operatorname*{argmax}_{u \in \mathcal{C}} Q_{\theta}(u,y)\)
\(\mathcal{C}\): a discrete set of antipodal candidate grasps sampled uniformly at random in image space, using surface normals defined by the depth image gradients.
Each grasp candidate must be:
- kinematically reachable
- not in collision with the table
The highest-quality candidate satisfying these constraints is executed.
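A sketch of this policy: score a discrete candidate set with the GQ-CNN and execute the highest-scoring grasp that passes the constraints; `sample_candidates`, `gqcnn_predict`, `reachable`, and `collision_free_with_table` are hypothetical helpers.

```python
def plan_grasp(depth_image, sample_candidates, gqcnn_predict,
               reachable, collision_free_with_table):
    """pi_theta(y) = argmax_{u in C} Q_theta(u, y), subject to execution constraints."""
    candidates = sample_candidates(depth_image)           # discrete antipodal set C
    scored = [(gqcnn_predict(u, depth_image), u) for u in candidates]
    # Rank by predicted robustness and return the best executable grasp.
    for q, u in sorted(scored, key=lambda s: s[0], reverse=True):
        if reachable(u) and collision_free_with_table(u, depth_image):
            return u, q
    return None, 0.0
```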