Electronic marking and identification techniques to discourage document copying - INFOCOM '94. 'Networking for Global Communications'., 13th Proceedings IEEE

Electronic Marking and Identification Techniques to Discourage Document Copying

J. Brassil, S. Low, N. Maxemchuk, L. O'Gorman

AT&T Bell Laboratories
Murray Hill, NJ

Modern computer networks make it possible to distribute documents quickly and economically by electronic means rather than by conventional paper means. However, the widespread adoption of electronic distribution of copyrighted material is currently impeded by the ease of illicit copying and dissemination. In this paper we propose techniques that discourage illicit distribution by embedding each document with a unique codeword. Our encoding techniques are indiscernible by readers, yet enable us to identify the sanctioned recipient of a document by examination of a recovered document. We propose three coding methods, describe one in detail, and present experimental results showing that our identification techniques are highly reliable, even after documents have been photocopied.

1. Introduction

Electronic distribution of publications is increasingly available through on-line text databases, CD-ROMs, computer network based retrieval services, and electronic libraries [Lesk90, Lynch90, Basch91, Arms92, Saltzer92, Fox93]. One electronic library, the RightPages¹ Service [Hoffman93, Story92, O'Gorman92], has been in place within Bell Laboratories since 1991, and has recently been installed at the University of California in San Francisco.

Electronic publishing is being driven by the decreasing cost of computer processing and high quality printers and displays. Furthermore, the increased availability of low cost, high speed data communications makes it possible to distribute electronic documents to large groups quickly and inexpensively.

While photocopy infringements of copyright have always concerned publishers, the need for document security is much greater for electronic document distribution [Garrett91, Vizard93]. The same advances that make electronic publishing and distribution of documents feasible also increase the threat of "bootlegged" copies. With far less effort than it takes to copy a paper document and mail it to a single person, an electronic document can be sent to a large group by electronic mail. In addition, while originals and photocopies of a paper document can look and feel different, copies of electronic documents are identical.

In order for electronic publishing to become accepted, publishers must be assured that revenues will not be lost due to theft of copyrighted materials. Widespread illicit document dissemination should ideally be at least as costly or difficult as obtaining the documents legitimately. Here we define "illicit dissemination" as distribution of documents without the knowledge of, and payment to, the publisher; this contrasts with legitimate document distribution by the publisher or the publisher's electronic document distributor.

This paper describes a means of discouraging illicit copying and dissemination. A document is marked in an indiscernible way by a codeword identifying the registered owner to whom the document is sent. If a document copy is found that is suspected to have been illicitly disseminated, that copy can be decoded and the registered owner identified.

The techniques we describe here are complementary to the security practices that can be applied to the legitimate distribution of documents. For example, a document can be encrypted prior to transmission across a computer network. Then even if the document file is intercepted or stolen from a database, it remains unreadable to those not possessing the decrypting key. The techniques we describe in this paper provide security after a document has been decrypted, and is thus readable to all. We also briefly describe a cryptographic protocol in Section 3 of this paper to secure the document transmission process.

1. RightPages is a trademark of AT&T.

0743-166X/94 $3.00 © 1994 IEEE

In addition to discouraging illicit dissemination of documents distributed by computer network, our proposed encoding techniques can also make paper copies of documents traceable. In particular, the codeword embedded in each document survives plain paper copying. Hence, our techniques can also be applied to "closely held" documents, such as confidential, limited distribution correspondence. We describe this both as a potential application of the methods and an illustration of their robustness in noise.

2. Document Coding Methods

Document marking can be achieved by altering the text formatting, or by altering certain characteristics of textual elements (e.g., characters). The goal in the design of coding methods is to develop alterations that are reliably decodable (even in the presence of noise) yet largely indiscernible to the reader. These criteria, reliable decoding and minimum visible change, are somewhat conflicting; herein lies the challenge in designing document marking techniques.

The marking techniques we describe can be applied to either an image representation of the document or to a document format file. The document format file is a computer file describing the document content and page layout (or formatting), using standard format description languages such as PostScript², TeX, troff, etc. It is from this format file that the image, that is, what the reader sees, is generated. The image representation describes each page (or sub-page) of a document as an array of pixels. The image may be bitmap (also called binary or black-and-white), gray-scale, or color. For this work, we describe both document format file and image coding techniques; however, we restrict the latter to bitmaps encoded within the binary-valued text regions.

Common to each technique is that a codeword is embedded in the document by altering particular textual features. For instance, consider the codeword 1101 (binary). Reading this code right to left from the least significant bit, the first document feature is altered for bit 1, the second feature is not altered for bit 0, and the third and fourth features are altered for the two 1 bits. It is the type of feature that distinguishes each particular encoding method. We describe these features for each method below and give a simple comparison of the relative advantages and disadvantages of each technique.
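The bit-to-feature mapping in the 1101 example can be sketched in a few lines. This is only an illustration, not the authors' implementation; the function names `alteration_flags` and `recover_codeword` are hypothetical, and a "feature" is left abstract (a line, a word gap, or a character endline, depending on the method).

```python
def alteration_flags(codeword: int, n_features: int) -> list[bool]:
    """Return one flag per document feature, least significant bit first.

    flags[k] is True when feature k (0-based) should be altered.
    """
    return [bool((codeword >> k) & 1) for k in range(n_features)]


def recover_codeword(flags: list[bool]) -> int:
    """Inverse mapping: rebuild the integer codeword from observed alterations."""
    return sum(1 << k for k, altered in enumerate(flags) if altered)


# Codeword 1101 (binary): features 1, 3 and 4 are altered, feature 2 is not.
flags = alteration_flags(0b1101, 4)
```

Reading the flags back with `recover_codeword(flags)` returns the original integer, which is the step a decoder would perform once it has decided, feature by feature, whether an alteration is present.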

2. PostScript is a trademark of Adobe Systems, Inc.

The three coding techniques that we propose illustrate different approaches rather than form an exhaustive list of document marking techniques. The techniques can be used either separately or jointly. Each technique enjoys certain advantages or applicability as we discuss below.

2.1 Line-Shift Coding

This is a method of altering a document by vertically shifting the locations of text lines to encode the document uniquely. This encoding may be applied either to the format file or to the bitmap of a page image. The embedded codeword may be extracted from the format file or bitmap. In certain cases this decoding can be accomplished without need of the original image, since the original is known to have uniform line spacing between adjacent lines within a paragraph.

2.2 Word-Shift Coding

This is a method of altering a document by horizontally shifting the locations of words within text lines to encode the document uniquely. This encoding can be applied to either the format file or to the bitmap of a page image. Decoding may be performed from the format file or bitmap. The method is applicable only to documents with variable spacing between adjacent words. Variable spacing in text documents is commonly used to distribute white space when justifying text. Because of this variable spacing, decoding requires the original image or, more specifically, the spacing between words in the unencoded document. See Figure 1 for an example of word-shift coding.

[Figure 1: the text line "Now is the time for all men/women to ..." rendered twice with different word spacing, in panels a) and b).]

Figure 1 - Example of word-shift coding. In a), the top text line has added spacing before the "for," the bottom text line has the same spacing after the "for." In b), these same text lines are shown again without the vertical lines to demonstrate that either spacing appears natural.

Consider the following example of how a document might be encoded with word-shifting. For each text line, the largest and smallest spacings between words are found. To code a line, the largest spacing is decremented by some amount and the smallest is augmented by the same amount. This maintains the text line length, and produces little qualitative change to the text image.
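The word-shifting example above can be sketched as follows. This is a minimal illustration under stated assumptions: inter-word gaps are given as a list of numbers in arbitrary units, `delta` is a hypothetical shift amount, and the function name `word_shift_line` is not from the paper. How a codeword bit selects which lines are coded is not specified in the example, so the sketch only performs the shift itself.

```python
def word_shift_line(spacings: list[float], delta: float) -> list[float]:
    """Shrink the widest inter-word gap and widen the narrowest by `delta`.

    The sum of the gaps is unchanged, so the text line keeps its length.
    """
    if len(spacings) < 2:
        return list(spacings)
    widest = max(range(len(spacings)), key=spacings.__getitem__)
    narrowest = min(range(len(spacings)), key=spacings.__getitem__)
    if widest == narrowest:  # all gaps equal: nothing to distinguish
        return list(spacings)
    shifted = list(spacings)
    shifted[widest] -= delta
    shifted[narrowest] += delta
    return shifted


original = [4.0, 6.0, 3.0, 5.0]           # gaps between five words
coded = word_shift_line(original, 0.5)    # widest gap shrinks, narrowest grows
```

A decoder that holds the original gaps can compare them with the measured gaps to see whether this shift was applied, which is why word-shift decoding needs the unencoded spacing.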

2.3 Feature Coding

This is a coding method that is applied either to a format file or to a bitmap image of a document. The image is examined for chosen text features, and those features are altered, or not altered, depending on the codeword. Decoding requires the original image, or more specifically, a specification of the change in pixels at a feature. There are many possible choices of text features; here, we choose to alter upward, vertical endlines, that is, the tops of letters b, d, h, etc. These endlines are altered by extending or shortening their lengths by one (or more) pixels, but otherwise not changing the endline feature. See Figure 2 for an example of feature coding.

Figure 2 - Example shows feature coding performed on a portion of text from a journal table of contents. In a), no coding has been applied. In b), feature coding has been applied to select characters. In c), the feature coding has been exaggerated to show feature alterations.

Among the proposed encoding techniques, line-shifting is likely to be the most easily discernible by readers. However, we also expect line-shifting to be the most robust type of encoding in the presence of noise. This is because the long lengths of text lines provide a relatively easily detectable feature. For this reason, line shifting is particularly well suited to marking documents to be distributed in paper form, where noise can be introduced in printing and photocopying. As we will show in Section 4, our experiments indicate that we can easily encode documents with line shifts that are sufficiently small that they are not noticed by the casual reader, while still retaining the ability to decode reliably.

We expect that word-shifting will be less discernible to the reader than line-shifting, since the spacing between adjacent words on a line is often varied to support text justification.

Feature encoding can accommodate a particularly large number of sanctioned document recipients, since there are frequently two or more features available for encoding in each word. Feature alterations are also largely indiscernible to readers. Feature encoding also has the additional advantage that it can be applied directly to image files, which allows encoding to be introduced in the absence of a format file.

A technically sophisticated "attacker" can detect that a document has been encoded by any of the three techniques we have introduced. Such an attacker can also attempt to remove the encoding (e.g., produce an unencoded document copy). Our goal in the design of encoding techniques is to make successful attacks extremely difficult or costly. We will return to a discussion of the difficulty of defeating each of our encoding techniques in Section 5.

3. Transmission Security by Cryptographic Protocol

A publisher can distribute documents as either image or format files. The coding methods described above are intended to discourage illicit copying and dissemination of readable images. However, before a document is to be displayed or printed (e.g., during transmission on a computer network), the document can be secured by being encrypted. Though this paper primarily describes image coding methods, we briefly describe a complete system for document security using a cryptographic protocol proposed to secure transmitted documents against theft [Choudhury93].

The proposed cryptographic techniques for document distribution use both public key and secret key cryptography. Each document recipient has a public key, PR, with which anyone can encode information, and a private key, SR, with which only the recipient can decode the information. The publisher first sends the recipient a program to process a document. The program is changed often, to reduce the value of reverse engineering the program. The program includes a secret key, SD, that is encrypted with PR, so that only the individual with SR can run the program and recover SD. The document that is transmitted by the publisher is encrypted so that SD is required to receive it. Although a user may be willing to share the program and document, it is assumed that SR is too valuable to give away. Perhaps it is the same key that is used in a signature system to charge purchases of documents. (It is unlikely that anyone would give his credit card to a person who is unscrupulous enough to violate the copyright laws.)

The information transmitted by the publisher includes a unique identification number and a format file. The same format file is transmitted to every recipient, which makes things easier for the publisher by keeping document preparation and secure distribution separate.

The program on the recipient's computer
- requests SR from the recipient,
- uses SR to decrypt SD,
- uses SD to decrypt the identification number and format file, and
- generates the image file with the identification number encoded in the image.

This example illustrates that the image encoding techniques introduced in this paper may be viewed as one component of a larger, secure document distribution system.

4. Implementation and Experimental Results for Line-Shift Coding Method

In this section we describe in detail the methods for coding and decoding we used for testing the line-shift coding method.

Each intended document recipient was preassigned a unique codeword. Each codeword specified a set of text lines to be moved in the document specifically for that recipient. The length of each codeword equaled the maximum number of lines that were displaced in the area to be encoded. In our line-shift encoder, each codeword element belonged to the alphabet {-1, +1, 0}, corresponding to a line to be shifted up, down, or remain unmoved.

Though our encoder was capable of shifting an arbitrary text line either up or down, we found that the decoding performance was greatly improved by constraining the set of lines moved. In the results presented in this paper, we used a differential (or difference) encoding technique. With this coding we kept every other line of text in each paragraph unmoved, starting with the first line of each paragraph. Each line between two unmoved lines was always moved either up or down. That is, for each paragraph, the 1st, 3rd, 5th, etc. lines were unmoved, while the 2nd, 4th, etc. lines were moved. This encoding was partially motivated by image defects we will discuss later in this section. Note that the consequence of using differential encoding is that the length of each codeword is cut approximately in half. While this reduces the potential number of recipients for an encoded document, the number can still be extremely large.

In each of our experiments we displaced at least 19 lines, which corresponds to a potential of at least $2^{19} = 524,288$ distinct codewords/page. More than a single page per document can be coded for a larger number of codeword possibilities or redundancy for error-correction.
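The differential encoding described above can be sketched as a function that turns codeword bits into per-line vertical offsets for one paragraph. This is an illustration only: the function name `differential_offsets`, the `shift` parameter, and the convention that bit 1 means "down" and bit 0 means "up" are assumptions, not details from the paper. In the paper every line between two unmoved lines is always moved; here a carrier line with no bit assigned is simply left in place.

```python
def differential_offsets(bits: list[int], n_lines: int, shift: float = 1.0) -> list[float]:
    """Vertical offset for each line of one paragraph (positive = down).

    Lines 1, 3, 5, ... (indices 0, 2, 4, ...) stay put. Each line that sits
    between two unmoved lines carries one bit: 1 -> down, 0 -> up (assumed).
    """
    offsets = [0.0] * n_lines
    carriers = range(1, n_lines - 1, 2)  # needs an unmoved neighbour on both sides
    if len(bits) > len(carriers):
        raise ValueError("paragraph too short for this codeword")
    for bit, idx in zip(bits, carriers):
        offsets[idx] = shift if bit else -shift
    return offsets


offsets = differential_offsets([1, 0, 1], n_lines=7)

# Each moved line has two possible positions, so 19 moved lines
# distinguish 2**19 codewords per page.
capacity = 2 ** 19
```

Keeping the odd-numbered lines fixed gives the decoder a local reference on both sides of every moved line, at the cost of roughly halving the codeword length.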

Each of our experiments began with a paper copy of an encoded page. Decoding from the paper copy first required scanning to obtain the digital image. Subsequent image processing improved detectability; salt-and-pepper noise was removed [O’Gorman92] and the image was deskewed to obtain horizontal text [O’Gorman93]. Text lines were located using a horizontal projection profile. This is a plot of the summation of ON-valued pixels along each row. For a document whose text lines span horizontally, this profile has peaks whose widths are equal to the character height and valleys whose widths are equal to the white space between adjacent text lines. The distances between profile peaks are the interline spaces.
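A horizontal projection profile is simple to compute from a binary image. The sketch below is an illustration under stated assumptions: the bitmap is a list of rows of 0/1 values with 1 meaning ON, and the helper `text_line_bands` with its `threshold` parameter is a hypothetical way to separate peaks from valleys, not the procedure used in the experiments.

```python
def horizontal_profile(bitmap: list[list[int]]) -> list[int]:
    """Number of ON pixels in each row (scan line) of a binary image."""
    return [sum(row) for row in bitmap]


def text_line_bands(profile: list[int], threshold: int = 0) -> list[tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows above `threshold`."""
    bands = []
    start = None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y                      # a peak (text line) begins
        elif count <= threshold and start is not None:
            bands.append((start, y - 1))   # a valley (white space) begins
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
profile = horizontal_profile(page)
bands = text_line_bands(profile)
```

On a scanned page a small positive threshold would probably be needed so that residual noise in the valleys is not mistaken for text.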

The line-shift decoder measured the distance between each pair of adjacent text line profiles (within the page profile). This was done by one of two approaches: either we measured the distance between the baselines of adjacent line profiles, or we measured the difference between centroids of adjacent line profiles. A baseline is the logical horizontal line on which characters sit; a centroid is the center of mass of a text line profile. As seen in Figure 3, each text line produces a distinctive profile with two peaks, corresponding to the midline and baseline. The peak in the profile nearest the bottom of each text line is taken to be the baseline.

To define the centroid of a text line precisely, suppose the text line profile runs from scan line $y$, $y+1$, ..., to $y+w$, and the respective numbers of ON bits per scan line are $h(y)$, $h(y+1)$, ..., $h(y+w)$. Then the text line centroid is given by

$$\frac{y\,h(y) + \cdots + (y+w)\,h(y+w)}{h(y) + \cdots + h(y+w)}. \qquad (3.1)$$

The measured interline spacings (i.e., between adjacent centroids or baselines) were used to determine if white space has been added or subtracted because of a text line shift. This process, repeated for every line, determined the codeword of the document; this uniquely determined the original recipient.

We now describe our decision rules for detection of line shifting in a page with differential encoding. Suppose text lines $i-1$ and $i+1$ are not shifted and text line $i$ is either shifted up or down. In the unaltered document, the distances between adjacent baselines, or baseline spacings, are the same. Let $s_{i-1}$ and $s_i$ be the distances between baselines $i-1$ and $i$, and between baselines $i$ and $i+1$, respectively, in the altered document. Then the baseline detection decision rule is:

$$\begin{array}{ll}
\text{if } s_{i-1} > s_i: & \text{decide line } i \text{ shifted down} \\
\text{if } s_{i-1} < s_i: & \text{decide line } i \text{ shifted up} \\
\text{otherwise}: & \text{uncertain}
\end{array} \qquad (3.2)$$
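The centroid definition (3.1) and the baseline decision rule (3.2) translate directly into code. The sketch below is illustrative: the function names are hypothetical, and the `tolerance` parameter is an added assumption (with the default of 0 the function reduces exactly to rule (3.2)).

```python
def centroid(h: list[int], y: int = 0) -> float:
    """Centre of mass of a text line profile h(y), ..., h(y+w), as in (3.1)."""
    total = sum(h)
    if total == 0:
        raise ValueError("profile has no ON pixels")
    return sum((y + k) * count for k, count in enumerate(h)) / total


def baseline_decision(s_prev: float, s_next: float, tolerance: float = 0.0) -> str:
    """Rule (3.2): compare the baseline spacing above and below line i.

    s_prev is the spacing between baselines i-1 and i, s_next the spacing
    between baselines i and i+1, both measured in the altered document.
    """
    if s_prev - s_next > tolerance:
        return "down"        # line i moved away from line i-1
    if s_next - s_prev > tolerance:
        return "up"          # line i moved towards line i-1
    return "uncertain"


c = centroid([1, 2, 1], y=10)   # symmetric profile over scan lines 10..12
```

Because the unaltered baseline spacings are equal, the rule needs only the two measured spacings and no reference copy of the original page.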


[Figure 3: plot of ON bits (vertical axis) against scan line (horizontal axis, 0 to 3000).]

Figure 3 - Profile of a recovered document page. Decoding a page with line shifting requires measuring the distances between adjacent text line centroids (marked with o) or baselines (marked with +) and deciding whether white space has been added or subtracted.

Unlike baseline spacings, centroid spacings between adjacent text lines in the original unaltered document are not necessarily uniform. In centroid-based detection, the decision is based on the difference of centroid spacings in the altered and unaltered documents. More specifically, let $s_{i-1}$ and $s_i$ be the centroid spacings between lines $i-1$ and $i$, and between lines $i$ and $i+1$, respectively, in the altered document; let $t_{i-1}$ and $t_i$ be the corresponding centroid spacings in the unaltered document. Then the centroid detection decision rule is:

$$\begin{array}{ll}
\text{if } s_{i-1} - t_{i-1} > s_i - t_i: & \text{decide line } i \text{ shifted down} \\
\text{otherwise}: & \text{decide line } i \text{ shifted up}
\end{array} \qquad (3.3)$$
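The centroid rule (3.3) can be sketched in the same way. The function name and argument names are hypothetical; the spacings are assumed to be already measured from the altered and unaltered page profiles.

```python
def centroid_decision(s_prev: float, s_next: float,
                      t_prev: float, t_next: float) -> str:
    """Rule (3.3): compare the change in centroid spacing above and below line i.

    s_* are spacings in the altered document, t_* the corresponding
    spacings in the unaltered document.
    """
    if (s_prev - t_prev) > (s_next - t_next):
        return "down"
    return "up"


# Unaltered spacings differ slightly (12.1 and 12.0); line i was moved down.
decision = centroid_decision(s_prev=13.2, s_next=10.9, t_prev=12.1, t_next=12.0)
```

Unlike the baseline rule, this rule always returns a direction, so it has no "uncertain" outcome, and it requires the spacings of the unaltered document as a reference.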

An error is said to occur if our decoder decides that a text line was moved up (down) when it was moved down (up). In baseline detection, a second type of error exists. We say that the decoder is uncertain if it cannot determine whether a line was moved up or down. Since, for our encoding method, every other line is moved and this information is known to the decoder, false alarms do not occur.

4.1 Experimental Results for Line-Shift Coding

We conducted two sets of experiments. The first set tested how well line-shift coding works with different font sizes and different line spacing shifts in the presence of limited, but typical, image noise. The second set tested how well a fixed line spacing shift could be detected as document degradation became increasingly severe. In this section, we first describe these experiments and then present our results.

The equipment we used in both experiments was as follows: a Ricoh FS1S 400 dpi Flat Bed Electronic Scanner, Apple LaserWriter IIntx 300 dpi laser printer, and a Xerox 5052 plain paper copier³. The printer and copier were selected in part because they are typical of equipment found in wide use in office environments. The particular machines we used could be characterized as being heavily used but well maintained.

Writing the software routine to implement a rudimentary line-shift encoder for a PostScript input file was simple. We chose the PostScript format because: 1) it is the most common Page Description Language in use today, 2) it enables us to have sufficiently fine control of text placement, and 3) it permits us to encode documents produced by a wide variety of word processing applications. PostScript describes the document content a page at a time. Roughly speaking, it specifies the content of a text line (or text line fragment such as a phrase, word, or character) and identifies the location for the text to be displayed. Text location is specified by an x-y coordinate representing a position on a virtual page. Though it depends on the application software generating the PostScript, text placement can typically be modified by as little as 1/720 inch (approximately 1/10 of a printer's "point"). Most personal laser printers in common use today have about half this resolution (e.g., 1/300 inch).
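The placement and printer resolutions quoted above can be checked with exact arithmetic. The sketch assumes the PostScript convention of 72 points per inch; the variable names are illustrative only.

```python
from fractions import Fraction

POINTS_PER_INCH = 72  # PostScript default user space: 1 point = 1/72 inch


def inches_to_points(inches: Fraction) -> Fraction:
    """Convert a length in inches to printer's points."""
    return inches * POINTS_PER_INCH


placement_step = Fraction(1, 720)   # finest text placement step, in inches
printer_pixel = Fraction(1, 300)    # one pixel of a 300 dpi printer, in inches

step_pt = inches_to_points(placement_step)         # 1/10 point
pixel_pt = inches_to_points(printer_pixel)         # 6/25 point
steps_per_pixel = printer_pixel / placement_step   # 12/5 placement steps
```

Under these assumptions one 300 dpi printer pixel spans 2.4 placement steps, so a line shift specified more finely than that cannot be reproduced exactly on such a printer.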

3. Xerox and 5052 are trademarks of Xerox Corp. Apple and LaserWriter are trademarks of Apple Computer, Inc. Ricoh and FS1 are trademarks of Ricoh Corp.
