Electronic
Marking
and
Identification Techniques
to Discourage Document Copying
J. Brassil, S. Low, N. Maxemchuk, L. O'Gorman
AT&T
Bell
Laboratories
Murray
Hill,
NJ
Modern computer networks make it possible to distribute documents quickly and economically by electronic means rather than by conventional paper means. However, the widespread adoption of electronic distribution of copyrighted material is currently impeded by the ease of illicit copying and dissemination. In this paper we propose techniques that discourage illicit distribution by embedding each document with a unique codeword. Our encoding techniques are indiscernible by readers, yet enable us to identify the sanctioned recipient of a document by examination of a recovered document. We propose three coding methods, describe one in detail, and present experimental results showing that our identification techniques are highly reliable, even after documents have been photocopied.
1.
Introduction
Electronic distribution of publications is increasingly available through on-line text databases, CD-ROMs, computer network based retrieval services, and electronic libraries [Lesk90, Lynch90, Basch91, Arms92, Saltzer92, Fox93]. One electronic library, the RightPages¹ Service [Hoffman93, Story92, O'Gorman92], has been in place within Bell Laboratories since 1991, and has recently been installed at the University of California in San Francisco.
Electronic publishing
is being driven
by
the
decreasing
cost
of
computer processing and
high
quality printers
and displays. Furthermore, the increased availability
of
low
cost,
high
speed data communications
makes
it
possible
to
distribute electronic documents
to
large
groups quickly
and
inexpensively.
While photocopy infringements of copyright have always concerned publishers, the need for document security is much greater for electronic document distribution [Garrett91, Vizard93]. The same advances that make electronic publishing and distribution of documents feasible also increase the threat of "bootlegged" copies. With far less effort than it takes to copy a paper document and mail it to a single person, an electronic document can be sent to a large group by electronic mail. In addition, while originals and photocopies of a paper document can look and feel different, copies of electronic documents are identical.
In
order
for
electronic publishing
to become
accepted, publishers must
be
assured
that
revenues
will
not
be
lost
due to
theft
of
copyrighted materials.
Widespread illicit document dissemination should ideally be at least as costly or difficult as obtaining the documents legitimately. Here we define "illicit dissemination" as distribution of documents without the knowledge of, and payment to, the publisher; this contrasts with legitimate document distribution by the publisher or the publisher's electronic document distributor.
This paper describes a means of discouraging illicit copying and dissemination. A document is marked in an indiscernible way by a codeword identifying the registered owner to whom the document is sent. If a document copy is found that is suspected to have been illicitly disseminated, that copy can be decoded and the registered owner identified.
The techniques we describe here are complementary to the security practices that can be applied to the legitimate distribution of documents. For example, a document can be encrypted prior to transmission across a computer network. Then even if the document file is intercepted or stolen from a database, it remains unreadable to those not possessing the decrypting key. The techniques we describe in this paper provide security after a document has been decrypted, and is thus readable to all. We also briefly describe a cryptographic protocol in Section 3 of this paper to secure the document transmission process.
1.
RightPages
is
a trademark
of
AT&T
1278    10a.2.1    0743-166X/94 $3.00 © 1994 IEEE
In
addition
to
discouraging
illicit
dissemination
of
documents distributed
by
computer network,
our
proposed encoding techniques
can
also make paper
copies
of
documents traceable.
In
particular, the
codeword embedded
in
each
document survives
plain
paper copying. Hence, our techniques can also
be
applied to "closely held" documents,
such
as
confidential, limited distribution correspondence.
We
describe
this
both
as
a potential application
of the
methods
and
an
illustration
of
their
robustness
in
noise.
2.
Document
Coding Methods
Document
marking can be
achieved
by
altering the
text
formatting,
or
by
altering certain characteristics
of
textual elements
(e.g.,
characters). The goal
in
the
design
of
coding methods
is to
develop alterations
that
are reliably decodable
(even
in
the presence
of
noise)
yet
largely indiscernible
to
the reader. These
criteria,
reliable decoding and
minimum
visible
change, are
somewhat conflicting;
herein
lies
the
challenge
in
designing document
marking
techniques.
The marking techniques
we
describe
can be
applied
to
either
an
image representation
of
the document or
to a
document format
file.
The
document format
file
is
a
computer
file
describing the document content
and
page
layout
(or
formatting),
using
standard format description
languages such
as
PostScript², TeX, troff,
etc.
It is from
this
format
file
that the image
-
what
the reader sees
-
is
generated. The image representation describes
each
page
(or
sub-page)
of
a
document
as
an
array
of
pixels.
The image
may be
bitmap (also called
binary
or
black-and-white), gray-scale,
or
color.
For this work,
we
describe
both
document format
file and
image coding
techniques; however,
we
restrict
the
latter
to bitmaps
encoded
within
the binary-valued
text
regions.
Common
to
each
technique
is
that
a codeword
is
embedded
in
the
document
by
altering particular textual
features.
For instance, consider the codeword 1101
(binary). Reading
this
code right to
left from
the
least
significant
bit,
the
first
document feature
is altered
for
bit
1,
the second feature
is not altered for
bit
0, and
the
next
two features
are
altered
for
the two
1
bits.
It is the
type
of feature that distinguishes
each
particular encoding
method.
We describe these features for each
method
below
and
give
a
simple comparison
of
the
relative
advantages
and
disadvantages
of
each
technique.
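
The bit-to-feature mapping in the 1101 example can be sketched as follows. This is only an illustration of the reading order described above; the function name `features_to_alter` and the use of a string of '0'/'1' characters are assumptions made for the sketch, not part of the original system.

```python
def features_to_alter(codeword: str) -> list[bool]:
    """Return one flag per document feature: True means 'alter this feature'.

    The codeword is read right to left, so the least significant bit
    controls the first feature in the document.
    """
    if not codeword or set(codeword) - {"0", "1"}:
        raise ValueError("codeword must be a non-empty string of 0s and 1s")
    return [bit == "1" for bit in reversed(codeword)]


# Codeword 1101: first feature altered, second left alone, next two altered.
flags = features_to_alter("1101")
```

What counts as a "feature" (a text line, a word space, or a character endline) depends on which of the three coding methods is in use.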
2.
PostScript
is
a trademark
of
Adobe
Systems,
Inc.
The three coding techniques that
we
propose
illustrate
different approaches rather
than
form
an
exhaustive
list
of
document marking techniques.
The
techniques
can
be
used
either separately
or
jointly.
Each
technique enjoys certain advantages
or
applicability
as
we
discuss
below.
2.1
Line-Shift
Coding
This
is
a method
of
altering a document
by
vertically
shifting
the locations
of
text lines
to
encode the
document uniquely.
This
encoding
may
be
applied
either
to
the
format
file
or
to the bitmap
of
a page image.
The embedded codeword
may
be
extracted
from
the
format
file
or
bitmap.
In
certain cases
this
decoding
can
be
accomplished without
need of
the
original image,
since the original
is known to have
uniform
line
spacing
between
adjacent lines
within
a paragraph.
2.2
Word-Shift Coding
This
is
a
method
of
altering a document
by
horizontally
shifting
the locations
of words within text
lines
to
encode the document
uniquely.
This encoding
can
be
applied
to
either the format
file
or to the bitmap of
a
page image. Decoding
may
be
performed
from the
format
file
or
bitmap.
The
method
is applicable
only
to
documents
with
variable spacing
between
adjacent
words.
Variable spacing
in
text documents
is commonly
used
to
distribute white space
when
justifying
text.
Because
of
this
variable spacing, decoding requires the
original image
-
or
more
specifically,
the spacing
between words
in
the
unencoded
document. See Figure
1
for
an
example
of
word-shift coding.
[Figure 1 image: the text line "Now is the time for all men/women to ..." shown twice with different word spacing]
Figure
1
-
Example
of
word-shift coding.
In
a),
the
top
text line
has added
spacing
before the "for," the
bottom
text line
has the
same
spacing after
the
"for."
In
b),
these
same text
lines
are
shown
again
without
the
vertical
lines
to
demonstrate that
either spacing
appears natural.
Consider
the
following example
of how
a
document
might be
encoded
with
word-shifting. For
each
text
line,
the largest and smallest spacings
between
words are
found.
To
code a
line,
the largest spacing
is decremented
by
some amount
and the
smallest is augmented
by
the
same amount. This
maintains
the
text
line length, and
produces little qualitative change
to the text image.
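
The per-line adjustment in this example can be sketched in a few lines. This is a minimal illustration under assumptions: spacings are given as a list of numbers in arbitrary units, the function name `word_shift_line` is hypothetical, and how codeword bits select which lines are adjusted is not specified here and is left out.

```python
def word_shift_line(spacings: list[float], delta: float) -> list[float]:
    """Shrink the largest inter-word spacing and grow the smallest by delta.

    The sum of the spacings, and hence the text line length, is unchanged.
    """
    if len(spacings) < 2:
        raise ValueError("need at least two inter-word spacings")
    largest = max(range(len(spacings)), key=spacings.__getitem__)
    smallest = min(range(len(spacings)), key=spacings.__getitem__)
    if largest == smallest:
        # All spacings are equal; the method applies only to variable spacing.
        raise ValueError("line has uniform word spacing")
    shifted = list(spacings)
    shifted[largest] -= delta
    shifted[smallest] += delta
    return shifted


original = [4.0, 6.0, 3.0, 5.0]
encoded = word_shift_line(original, 0.5)
```

Because the decoder compares against the spacings of the unencoded line, it needs the original spacing values, as noted in the text.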
2.3
Feature Coding
This
is a coding
method
that
is
applied either
to
a format
file
or
to
a bitmap
image
of
a
document. The
image is
examined
for
chosen
text features, and those features
are
altered,
or
not
altered, depending
on
the
codeword.
Decoding
requires
the
original
image,
or
more
specifically,
a
specification
of
the
change
in
pixels
at
a
feature. There are
many
possible choices
of
text
features;
here,
we
choose
to
alter
upward, vertical endlines, that is, the tops of the letters b, d, h, etc. These endlines are
altered
by
extending
or
shortening
their
lengths
by
one
(or
more)
pixels,
but
otherwise
not
changing
the
endline
feature. See Figure
2
for
an
example
of
feature coding.
Figure
2
-
Example shows feature coding performed
on a portion
of
text from a
journal
table
of contents. In
a),
no coding has
been
applied. In
b),
feature coding
has
been
applied
to
select characters.
In
c),
the
feature
coding has been exaggerated
to
show feature
alterations.
Among the proposed encoding techniques, line-shifting is likely to be the most easily discernible by readers. However, we also expect line-shifting to be the most robust type of encoding in the presence of noise. This is because the long lengths of text lines provide a relatively easily detectable feature. For this reason, line shifting is particularly well suited to marking documents to be distributed in paper form, where noise can be introduced in printing and photocopying. As we will show in Section 4, our experiments indicate that we can easily encode documents with line shifts that are sufficiently small that they are not noticed by the casual reader, while still retaining the ability to decode reliably.
We
expect that
word-shifting
will
be less
discernible
to
the reader
than
line-shifting, since
the spacing
between
adjacent words
on
a
line
is often
varied
to
support
text
justification.
Feature
encoding
can
accommodate
a
particularly large
number
of
sanctioned
document
recipients, since there are
frequently two
or
more
features available for encoding
in
each word.
Feature
alterations are also largely indiscernible to
readers. Feature
encoding
also
has
the
additional
advantage
that
it
can
be applied
directly
to
image files,
which
allows encoding to
be
introduced
in
the absence
of
a format
file.
A technically sophisticated "attacker" can detect that a document has been encoded by any of the three techniques we have introduced. Such an attacker can also attempt to remove the encoding (e.g., produce an unencoded document copy). Our goal in the design of encoding techniques is to make successful attacks extremely difficult or costly. We will return to a discussion of the difficulty of defeating each of our encoding techniques in Section 5.
3.
Transmission Security
by
Cryptographic Protocol
A publisher can distribute documents as either image or format files. The coding methods described above are intended to discourage illicit copying and dissemination of readable images. However, before a document is to be displayed or printed (e.g., during transmission on a computer network), the document can be secured by being encrypted. Though this paper primarily describes image coding methods, we briefly describe a complete system for document security using a cryptographic protocol proposed to secure transmitted documents against theft [Choudhury93].
The proposed cryptographic techniques for document distribution use both public key and secret key cryptography. Each document recipient has a public key, PR, with which anyone can encode information, and a private key, SR, with which only the recipient can decode the information. The publisher first sends the recipient a program to process a document. The program is changed often, to reduce the value of reverse engineering the program. The program includes a secret key, SD, that is encrypted with PR, so that only the individual with SR can run the program and recover SD. The document that is transmitted by the publisher is encrypted so that SD is required to receive it.
Although a user may be willing to share the program and document, it is assumed that SR is too valuable to give away. Perhaps it is the same key that is used in a signature system to charge purchases of documents. (It is unlikely that anyone would give his credit card to a person who is unscrupulous enough to violate the copyright laws.)
The information transmitted by the publisher includes a unique identification number and a format file. The same format file is transmitted to every recipient, which makes things easier for the publisher by keeping document preparation and secure distribution separate.
The program on the recipient's computer
- requests SR from the recipient,
- uses SR to decrypt SD,
- uses SD to decrypt the identification number and format file, and
- generates the image file with the identification number encoded in the image.
This example illustrates that the image encoding techniques introduced in this paper may be viewed as one component of a larger, secure document distribution system.
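
The key flow in this protocol can be sketched with toy primitives. Everything below is an assumption made for illustration: the paper does not name any cipher, so the sketch substitutes textbook RSA with tiny primes for the public key step and a SHA-256 keystream XOR for the secret key step. Neither is secure, and the names `publisher_package` and `recipient_open` are hypothetical. The final step, generating the marked image, is omitted.

```python
import hashlib

# Toy RSA parameters (p=61, q=53): PR = (N, E) is public, SR = D is private.
# Illustration only; real keys are vastly larger.
N, E, D = 3233, 17, 2753


def pk_encrypt(data: bytes) -> list[int]:
    """Encrypt each byte with the recipient's public key PR."""
    return [pow(b, E, N) for b in data]


def pk_decrypt(blocks: list[int], d: int) -> bytes:
    """Decrypt with the recipient's private key SR."""
    return bytes(pow(c, d, N) for c in blocks)


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a SHA-256 counter keystream.

    Applying it twice with the same key restores the input.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(c ^ k for c, k in zip(data[offset:offset + 32], pad))
    return bytes(out)


def publisher_package(sd: bytes, ident: int, format_file: bytes) -> dict:
    """Publisher side: wrap SD under PR, encrypt ID number and format file under SD."""
    return {
        "wrapped_sd": pk_encrypt(sd),
        "payload": keystream_xor(sd, ident.to_bytes(4, "big") + format_file),
    }


def recipient_open(package: dict, sr: int) -> tuple[int, bytes]:
    """Recipient side: use SR to recover SD, then SD to recover ID and format file."""
    sd = pk_decrypt(package["wrapped_sd"], sr)
    plain = keystream_xor(sd, package["payload"])
    return int.from_bytes(plain[:4], "big"), plain[4:]


package = publisher_package(b"toy-SD", 42, b"%!PS-Adobe\n")
ident, format_file = recipient_open(package, D)
```

The point of the structure is that the format file is identical for every recipient, while only the wrapped key and identification number are recipient-specific.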
4.
Implementation
and
Experimental Results for Line-Shift
Coding
Method
In this
section
we
describe
in
detail
the
methods for
coding
and
decoding
we
used
for
testing
the
line-shift
coding method.
Each
intended document recipient was
preassigned
a
unique codeword.
Each
codeword
specified
a
set
of
text
lines
to be moved
in
the document
specifically
for
that recipient.
The
length
of
each
codeword equaled the
maximum number
of
lines
that
were
displaced
in
the area
to
be encoded.
In our line-shift encoder, each codeword element belonged to the alphabet {-1, +1, 0}, corresponding to a line to be shifted up, down or remain unmoved.
Though our encoder was capable
of
shifting
an
arbitrary text
line
either
up
or down,
we
found that the
decoding performance
was
greatly improved
by
constraining the
set
of
lines
moved.
In
the
results
presented
in
this
paper,
we
used
a
differential
(or
difference) encoding technique.
With
this
coding
we
kept every other
line
of
text
in
each
paragraph unmoved,
starting
with
the
first
line
of
each paragraph. Each
line
between
two
unmoved
lines was always
moved
either
up
or
down. That
is,
for each paragraph,
the
1st,
3rd,
5th,
etc.
lines were unmoved,
while
the 2nd,
4th,
etc. lines
were
moved. This encoding
was
partially
motivated
by
image defects
we
will discuss
later
in
this
section. Note
that
the consequence
of using
differential
encoding
is
that
the
length
of
each codeword
is cut approximately
in
half.
While
this
reduces
the
potential
number of
recipients for an encoded document, the
number can
still
be
extremely
large.
In
each
of
our experiments
we
displaced
at least
19
lines,
which
corresponds
to
a
potential
of
at
least
2^19 = 524,288
distinct
codewords/page.
More
than
a
single page
per
document
can be coded for
a
larger
number
of
codeword
possibilities
or
redundancy
for
error-correction.
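
A minimal sketch of differential encoding for one paragraph follows. Two choices here are assumptions made for illustration, not statements from the text: a line is treated as movable only if it has an unmoved neighbor on both sides, and a 1 bit is mapped to an upward shift (-1) while a 0 bit is mapped to a downward shift (+1).

```python
def differential_shifts(num_lines: int, bits: list[int]) -> list[int]:
    """Return a shift in {-1, 0, +1} for each line of a paragraph.

    Lines 1, 3, 5, ... (0-based indices 0, 2, 4, ...) stay unmoved; each line
    between two unmoved lines is moved up (-1) or down (+1).
    """
    movable = list(range(1, num_lines - 1, 2))
    if len(bits) != len(movable):
        raise ValueError(f"paragraph carries exactly {len(movable)} bits")
    shifts = [0] * num_lines
    for index, bit in zip(movable, bits):
        shifts[index] = -1 if bit else +1
    return shifts


def codewords_per_page(moved_lines: int) -> int:
    """Each moved line carries one binary choice (up or down)."""
    return 2 ** moved_lines


shifts = differential_shifts(5, [1, 0])
capacity = codewords_per_page(19)
```

With 19 displaced lines the capacity matches the figure quoted above; coding additional pages multiplies the count or leaves room for redundancy.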
Each of
our experiments
began
with
a
paper copy
of
an
encoded
page.
Decoding
from
the paper copy
first
required scanning to obtain the
digital
image.
Subsequent image processing improved
detectability;
salt-and-pepper noise
was
removed [O’Gorman92]
and
the image was deskewed
to
obtain horizontal text
[O’Gorman93]. Text
lines
were located
using
a
horizontal
projection profile.
This
is
a
plot of
the
summation
of
ON-valued pixels along each
row.
For
a
document
whose
text
lines
span
horizontally,
this profile
has
peaks
whose
widths
are
equal
to
the character height
and
valleys whose widths
are
equal
to
the white space
between
adjacent
text
lines.
The distances
between
profile
peaks
are
the
interline
spaces.
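
The projection profile and the location of text lines within it can be sketched as below. This assumes the page is a list of rows of 0/1 pixel values (1 for ON) that has already been cleaned and deskewed; the function names and the simple zero threshold are illustrative choices, not the authors' implementation.

```python
def horizontal_profile(bitmap: list[list[int]]) -> list[int]:
    """Sum the ON-valued pixels along each row of a binary image."""
    return [sum(row) for row in bitmap]


def text_line_spans(profile: list[int], threshold: int = 0) -> list[tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows above the threshold.

    Each run corresponds to one peak of the profile, i.e. one text line;
    the gaps between runs are the valleys of white space.
    """
    spans = []
    start = None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y
        elif count <= threshold and start is not None:
            spans.append((start, y - 1))
            start = None
    if start is not None:
        spans.append((start, len(profile) - 1))
    return spans


page = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
profile = horizontal_profile(page)
lines = text_line_spans(profile)
```

On a real scan, residual noise would likely call for a threshold above zero.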
The
line-shift
decoder measured the distance
between each
pair
of
adjacent text
line
profiles (within
the
page
profile).
This
was
done
by
one
of
two
approaches
-
either
we
measured the distance
between
the
baselines
of
adjacent
line profiles,
or
we measured
the difference
between
centroids
of
adjacent
line
profiles.
A
baseline
is
the logical horizontal
line
on
which
characters
sit;
a
centroid
is
the center
of
mass
of
a
text
line
profile.
As
seen
in
Figure
3,
each text
line
produces
a
distinctive
profile
with
two peaks,
corresponding
to
the midline and baseline. The peak
in
the
profile
nearest
the
bottom
of
each text
line is
taken
to
be
the baseline.
To define the centroid of a text line precisely, suppose the text line profile runs from scan line $y$, $y+1$, ..., to $y+w$, and the respective numbers of ON bits per scan line are $h(y), h(y+1), \ldots, h(y+w)$. Then the text line centroid is given by

\[
\frac{y\,h(y) + \cdots + (y+w)\,h(y+w)}{h(y) + \cdots + h(y+w)}. \tag{3.1}
\]

The measured interline spacings (i.e., between adjacent centroids or baselines) were used to determine if white space had been added or subtracted because of a text line shift. This process, repeated for every line, determined the codeword of the document; this uniquely determined the original recipient.

We now describe our decision rules for detection of line shifting in a page with differential encoding. Suppose text lines $i-1$ and $i+1$ are not shifted and text line $i$ is either shifted up or down. In the unaltered document, the distances between adjacent baselines, or baseline spacings, are the same. Let $s_{i-1}$ and $s_i$ be the distances between baselines $i-1$ and $i$, and between baselines $i$ and $i+1$, respectively, in the altered document. Then the baseline detection decision rule is:

\[
\begin{array}{ll}
\text{if } s_{i-1} > s_i: & \text{decide line } i \text{ shifted down} \\
\text{if } s_{i-1} < s_i: & \text{decide line } i \text{ shifted up} \\
\text{otherwise}: & \text{uncertain}
\end{array}
\tag{3.2}
\]
[Figure 3 image: horizontal projection profile of a page; vertical axis "ON bits", horizontal axis "scan line"]
Figure
3
-
Profile
of
a recovered document page. Decoding a page with line shifting requires measuring
the
distances
between adjacent text line
centroids (marked with o)
or
baselines (marked with
+)
and deciding whether white space has
been
added
or
subtracted.
Unlike baseline spacings, centroid spacings between adjacent text lines in the original unaltered document are not necessarily uniform. In centroid-based detection, the decision is based on the difference of centroid spacings in the altered and unaltered documents. More specifically, let $s_{i-1}$ and $s_i$ be the centroid spacings between lines $i-1$ and $i$, and between lines $i$ and $i+1$, respectively, in the altered document; let $t_{i-1}$ and $t_i$ be the corresponding centroid spacings in the unaltered document. Then the centroid detection decision rule is:

\[
\begin{array}{ll}
\text{if } s_{i-1} - t_{i-1} > s_i - t_i: & \text{decide line } i \text{ shifted down} \\
\text{otherwise}: & \text{decide line } i \text{ shifted up}
\end{array}
\tag{3.3}
\]
An
error
is said
to
occur
if
our
decoder decides
that
a
text
line
was
moved
up
(down)
when
it was moved
down
(up).
In
baseline detection,
a
second type
of
error exists.
We say that
the
decoder is
uncertain
if
it
cannot
determine
whether
a line
was
moved
up
or down. Since,
for
our
encoding
method,
every
other
line
is
moved
and
this
information is known
to
the
decoder, false
alarms
do
not
occur.
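
The centroid computation and the two decision rules of this section can be sketched as follows. The function names and the string return values are assumptions made for the illustration; the profile is taken to be a list of ON-pixel counts indexed by scan line.

```python
def centroid(profile: list[int], y: int, w: int) -> float:
    """Center of mass of the text line profile on scan lines y .. y+w."""
    rows = range(y, y + w + 1)
    total = sum(profile[r] for r in rows)
    if total == 0:
        raise ValueError("text line profile contains no ON pixels")
    return sum(r * profile[r] for r in rows) / total


def baseline_decision(s_prev: float, s_next: float) -> str:
    """Baseline rule: compare the spacings above and below line i."""
    if s_prev > s_next:
        return "down"
    if s_prev < s_next:
        return "up"
    return "uncertain"


def centroid_decision(s_prev: float, s_next: float,
                      t_prev: float, t_next: float) -> str:
    """Centroid rule: compare spacing changes relative to the unaltered page.

    s_* are centroid spacings in the altered document, t_* the corresponding
    spacings in the unaltered document. There is no 'uncertain' outcome.
    """
    return "down" if s_prev - t_prev > s_next - t_next else "up"


c = centroid([0, 2, 3, 0, 1, 0], y=1, w=1)
```

The baseline rule needs no reference copy because the original baseline spacings are uniform, whereas the centroid rule requires the unaltered spacings.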
4.1
Experimental
Results
for
Line-Shift
Coding
We
conducted
two
sets
of
experiments. The
first
set
tested
how
well
line-shift coding
works with
different
font
sizes
and
different
line
spacing shifts
in
the
presence
of
limited,
but
typical,
image
noise. The second
set
tested how well
a fixed
line
spacing shift
could
be
detected
as
document
degradation
became
increasingly
severe.
In
this
section,
we
first
describe these
experiments and
then
present
our
results.
The equipment we used in both experiments was as follows: a Ricoh FS1S 400 dpi Flat Bed Electronic Scanner, Apple LaserWriter IIntx 300 dpi laser printer, and a Xerox 5052 plain paper copier³. The printer and copier were selected in part because they are typical of equipment found in wide use in office environments. The particular machines we used could be characterized as being heavily used but well maintained.
Writing the software routine to implement a rudimentary line-shift encoder for a PostScript input file was simple. We chose the PostScript format because: 1) it is the most common Page Description Language in use today, 2) it enables us to have sufficiently fine control of text placement, and 3) it permits us to encode documents produced by a wide variety of word processing applications. PostScript describes the document content a page at a time. Roughly speaking, it specifies the content of a text line (or text line fragment such as a phrase, word, or character) and identifies the location for the text to be displayed. Text location is specified by an x-y coordinate representing a position on a virtual page.
Though it depends on the application software generating the PostScript, text placement can typically be modified by as little as 1/720 inch (approximately 1/10 of a printer's "point"). Most personal laser printers in common use today have about half this resolution (e.g., 1/300 inch).
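
The unit arithmetic behind these figures can be checked directly. The sketch assumes the PostScript convention of 72 points per inch (a traditional printer's point is slightly smaller, which is why the text says "approximately"); the 300 dpi and 400 dpi values are the printer and scanner resolutions quoted in this section.

```python
from fractions import Fraction

POINTS_PER_INCH = 72   # PostScript point; assumed here
PRINTER_DPI = 300      # laser printer resolution
SCANNER_DPI = 400      # scanner resolution

step_inch = Fraction(1, 720)                 # smallest placement step
step_points = step_inch * POINTS_PER_INCH    # the same step in points
step_printer_dots = step_inch * PRINTER_DPI  # the same step in printer dots
step_scanner_pixels = step_inch * SCANNER_DPI

# One printer dot expressed in placement steps.
steps_per_printer_dot = Fraction(1, PRINTER_DPI) / step_inch
```

A single placement step is therefore smaller than one printer dot, so a shift must span at least a few steps before it can appear on the printed page.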
3. Xerox and 5052 are trademarks of Xerox Corp. Apple and LaserWriter are trademarks of Apple Computer, Inc. Ricoh and FS1 are trademarks of Ricoh Corp.