Labeling Manual - Columbia University



Labeling Manual

for

News Data

(dLabel 1.02)

23 September 2003

Columbia University

Table of Contents

1. Introduction

2. Installation and Use of dLabel 1.02

3. Annotation

1. Segmentation

2. SpeakerType

3. Named Entities

4. Time

5. Number

4. Data and Use of Web

1. INTRODUCTION

We are labeling the broadcast news data manually with a set of tags. We want to use the labeled data to find the tagged entities automatically in a new set of data using machine learning techniques.

The labeling tool dLabel 1.02 can be used to label the data. A pair of tags is put around the entities that are to be labeled. The labeling tool has icons that help to do so.

Besides labeling specific entities, we also want to do a broad segmentation of the news: we want to find the headline and interview sections. We also want to tag whether each person mentioned in the news is a reporter, an anchor, a speaker, or some other person. Hence, you will notice that we are doing three kinds of tagging:

• segmentation (headlines, interview)

• speaker type (reporter, anchor, etc)

• entities (names, time, numbers).

You can follow links in

for answers to questions you may have.

Particularly, you can post your questions at



You can download the data including this manual at



You can upload your file at



and see the uploaded file at



2. INSTALLATION AND USE OF DLABEL 1.02

dLabel 1.02 is a labeling tool that is still under development. Please post any questions about the software on the web board.

INSTALLATION

You need to have Java installed on your computer before you can run this software. To install Java:

1. Download Java 2 Platform, Standard Edition (J2SE) from .

2. Download for your operating system and install it.

3. If you are running a Windows machine, the binary to run Java will be in c:\j2sdk(your version)\bin

4. To run the software:

In Linux, run the command "java LabelData" in the directory where you downloaded the software.

In Windows, open the DOS prompt by clicking All Programs -> Accessories -> Command Prompt, change to the directory where you downloaded the software using the cd command ("cd directory name"), and then type "java dataLabel".

Problems?

--------

If you can't run "java dataLabel", it is very likely that the path is not set correctly. You can either set the path or run the software by giving the full path, like "c:\j2sdk(your version)\bin\java LabelData".

USING THE TOOL

Highlight the text and click on the icon to place a beginning tag and an ending tag around the text you highlighted. Please remember that there can be tags within tags. To place tags within tags, first tag the text using the first button. Then highlight the text including the first tags and click on the second button to place the second set of tags.
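The highlight-and-click operation can be pictured as a small text transformation. The sketch below is only an illustration: it assumes the bracket notation used in this manual's examples (the format dLabel actually writes to the file is not described here), the function name tag_span is hypothetical, and the choice of a Place tag inside a Company tag is made up for the example.

```python
def tag_span(text: str, start: int, end: int, label: str) -> str:
    """Wrap text[start:end] in a beginning tag and an ending tag."""
    if not (0 <= start < end <= len(text)):
        raise ValueError("selection must be a non-empty range inside the text")
    return f"{text[:start]}[{label}: {text[start:end]}]{text[end:]}"


sentence = "The Boston Celtics won."

# First button: tag the inner span "Boston".
start = sentence.index("Boston")
inner = tag_span(sentence, start, start + len("Boston"), "Place")

# Second button: highlight the text INCLUDING the first tags, then tag again.
outer_start = inner.index("[Place")
outer_end = inner.index("Celtics") + len("Celtics")
nested = tag_span(inner, outer_start, outer_end, "Company")

print(nested)  # The [Company: [Place: Boston] Celtics] won.
```

Note that the second selection starts at the opening bracket of the first tag; selecting only part of an existing tag would produce overlapping rather than nested tags.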

The next version of the tool will include features like button shortcuts and color coding. If there is any particular feature that you think would be useful in the tool, please post your suggestions on the web board.

3. ANNOTATION

3.1 Segmentation

News broadcasts contain many items of interest. Some represent “parts” of the show. These include:

▪ Anchor initial greeting/signon and identification to the show

▪ Listing of headlines for the show

▪ Interviews within the show

▪ Anchor signoff/closing from the show

Normally, there will be only one initial greeting, one headline listing, and one closing/signoff. If you think there are multiple instances of any of these, please check with Julia or post a message to the bboard. There can be multiple interviews within the show. For these, select everything from the reporter’s first question to the interviewee through the interviewee’s last contribution.

Examples:

[Anchor greeting: ]

[Headlines: ]

[Interview: ]

[Anchor closing: ]

3.2 SpeakerType Tagging

We tag all personal names, just as we tag other entities (business, place). However, we also want to classify personal names into special types, which we call SpeakerType. Hence, we do not give personal names a single generic tag as we do for other entities; instead, we classify them into the following set of tags.

Anchor names:

This is [Anchor: Peter Jennings] for [Company: ABC] news.

“Yes, [Anchor: Peter], I’m here in [Place: Baghdad]…”

Reporter/correspondent names:

This is [Reporter: Mitch Renley] reporting live…

“Tell me, [Reporter: Mitch], what do you see in the direction of the airport?”

Interviewee names: Use with people who are actually recorded in the broadcast.

“I am here with [Interviewee: Rudolph Giuliani], former mayor of [Place: New York].”
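To make the notation in these examples concrete, here is a minimal sketch of how tagged text could be read back into (label, text) pairs. It assumes that a tag opens with "[Label: " and closes with "]", as in the examples above; the function name extract_tags is hypothetical and this is not part of dLabel.

```python
import re

# Assumption: a tag opens with "[Label: " and closes with "]".
OPEN_TAG = re.compile(r"\[\s*([A-Za-z/ ]+?)\s*:\s*")


def extract_tags(annotated: str) -> list[tuple[str, str]]:
    """Return (label, tagged text) pairs; inner tags come before outer ones."""
    results = []
    stack = []  # one (label, collected text pieces) entry per open tag
    i = 0
    while i < len(annotated):
        opened = OPEN_TAG.match(annotated, i)
        if opened:
            stack.append((opened.group(1), []))
            i = opened.end()
        elif annotated[i] == "]" and stack:
            label, pieces = stack.pop()
            inner = "".join(pieces)
            results.append((label, inner.strip()))
            if stack:
                # Text inside a nested tag also belongs to the enclosing tag.
                stack[-1][1].append(inner)
            i += 1
        else:
            if stack:
                stack[-1][1].append(annotated[i])
            i += 1
    return results


line = "This is [Anchor: Peter Jennings] for [Company: ABC] news."
print(extract_tags(line))  # [('Anchor', 'Peter Jennings'), ('Company', 'ABC')]
```

A stack is used instead of a single regular expression so that tags within tags are handled.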

3.3. Named Entities

Please remember that SpeakerTypes are still names, even though they are tagged as Anchor, Reporter, or Interviewee.

Now let us look at the rest of the named entities that need to be tagged.

Other personal names: Use for anyone other than the above who is identified by name. Do not use when the name is not present (e.g. “a former mayor of [Place: New York] said”).

Former [Place: New York] mayor [Other personal name: Rudolph Giuliani] spoke today of his political ambitions.

Business/corporation/organization names:

[Company: IBM] announced layoffs today.

[Company: Intel’s] profits rose dramatically.

Business executives now follow the [Company: GE] model.

Include possessives (with the possessive suffix) and other modifiers as well as noun uses. Include acronyms as well as full company names.

Place names:

from [Place: Paris] to [Place: London]

He comes from [Place: Brixton, England]

Include the smallest contiguous place identifier. When additional modifiers occur (e.g. “[Place: Southampton] in the south of [Place: England]”), bracket only the sub-segments of the phrase.

ISSUES

▪ For all person names, include only title, first name/initial, middle name/initial, and last name. Do not include appositives (e.g. “[Other name: Lee Bollinger], president of [Company: Columbia University]”) or other following identifying material (e.g. “[Reporter: Mitch Renley], [Company: CNN] News”).

▪ Do not include Mr., Ms., or Mrs. when they appear in front of a name, but do include titles like Jr., Sr., or III (as in William III).

▪ If there are regular words within the title or the name of an organization, include those words as well (e.g. Boston Chickering Corporation).

▪ In possessive constructions like “Citibank’s president Bill Ford”, tag the organization and the name separately.

▪ If quote marks appear around the name, include the quote marks in the tag.

▪ If a word is integrally associated with a place name, tag the associated word as well (e.g. Ohio River).

▪ Do not tag sub-regional names, e.g. North, East, South, West, or Northeast of the building.

▪ There can be nested tags. Hence, if a word can be given more than one set of tags, place nested tags (e.g. Boston Celtics).

▪ Include the definite article in the tag if it is an integral part of the name. In most cases, such as the article in front of university names, do not include it (e.g. The Godfather, the University of Kansas).

▪ Tag nicknames as well.

▪ False starts and repairs go inside the segments. For example, “…This is [Reporter: George Thomas] reporting from [Place: Kab uh from Kabul].”

▪ Use the Notes button in the transcriber to add notes to a transcript. For example, if the transcript is cut off or begins in the middle of a program, indicate that with a Note. Any question you have about a label may also be indicated with a Note.

3.4. TIME/DATE

Date:

on the [ Date: fourteenth of April ]

on [Date: 4/14/03]

Today is [Date: Monday, April 14, 2003]

If you can’t put a date on the calendar exactly, then do not use Date, e.g.

▪ They meet every Tuesday

▪ This occurs once a month

▪ Next year’s figures will be available

Use Date only if you could actually identify the date in question on a calendar if you needed to.
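The calendar test can be illustrated for numeric dates with a small sketch. It is only a rough stand-in for the annotator's judgment: the function name resolve_date and the list of formats are assumptions made for this example, and spelled-out dates such as "fourteenth of April" still have to be judged by hand.

```python
from datetime import date, datetime
from typing import Optional

# Assumption: only these two numeric formats are tried in this sketch.
FORMATS = ("%m/%d/%y", "%m/%d/%Y")


def resolve_date(expression: str) -> Optional[date]:
    """Return the calendar date an expression names, or None if it names none."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(expression, fmt).date()
        except ValueError:
            continue  # try the next format
    return None


print(resolve_date("4/14/03"))        # 2003-04-14
print(resolve_date("every Tuesday"))  # None
```

An expression that resolves to a single day qualifies for the Date tag; a recurring or vague expression such as "every Tuesday" does not.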

Time:

[ Time: three o’clock central time ]

[ Time: 8:30 CST ]

about [ Time: eleven thirty ]

This is a specific time, not a duration. Do not use Time for the following:

▪ forty five minutes ago

▪ lunch time

▪ the meeting should last about a half hour

3.5 NUMBERS

You should also tag numbers that represent monetary value or temperature.

Money

Contiguous words representing numbers should be tagged. Include the currency within the tags.

e.g. US $45.8 million

Temperature

Tag the numbers representing temperature including the degrees.

e.g. forty five degrees
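Since the labeled data is meant to support finding such entities automatically, the following sketch shows what a very simple pattern-based pre-tagger for these two categories might look like. Everything in it is an assumption made for illustration: the label names "Money" and "Temperature", the function name pretag_numbers, and the patterns, which cover digits only and would miss spelled-out numbers such as "forty five degrees".

```python
import re

# Assumptions: label names and patterns are illustrative only.
MONEY = re.compile(
    r"(?:US\s*)?\$\d+(?:\.\d+)?(?:\s+(?:thousand|million|billion))?"
)
TEMPERATURE = re.compile(r"\b\d+(?:\.\d+)?\s+degrees\b")


def pretag_numbers(text: str) -> str:
    """Wrap monetary values and temperatures in bracket tags."""
    text = MONEY.sub(lambda m: f"[Money: {m.group(0)}]", text)
    return TEMPERATURE.sub(lambda m: f"[Temperature: {m.group(0)}]", text)


print(pretag_numbers("Profits rose to US $45.8 million."))
# Profits rose to [Money: US $45.8 million].
print(pretag_numbers("It was 45 degrees."))
# It was [Temperature: 45 degrees].
```

The currency marker and the word "degrees" are kept inside the tags, as the guidelines above require.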

4. DATA

Follow these steps to get the data and to upload your annotations once you are done.

1. For each message, you will be highlighting various types of information and clicking on the appropriate button to label that information with its type.

2. To do this:

• Download a news program to be labeled from



• Fill in the file name, start date, your name, … in the information table at .

• Bring up the labeling tool on your pc.

• Load in the file you have downloaded.

• Go through the file, labeling all of the items described above by highlighting them and clicking the appropriate button.

• When you have finished, upload the file to .

• Fill in the end date for this file in the information table (see above).

• Post any questions that arose during the transcription to the labeling bboard at ; you may also post questions during the transcription of course, or send them by email to julia@cs.columbia.edu.

• If you have any problems while labeling, send mail to julia@cs.columbia.edu and smaskey@cs.columbia.edu.
