Manual




PROTEUS PROJECT

NEW YORK UNIVERSITY

Ali Argyle

Darren Jahnel

Jon Liebowitz

Sachiko Omatoi

Jeremy Shapiro

Graig Warner

Dan Melamed

A Bitext Harvesting and Distribution System

PROTEUS PROJECT AT NEW YORK UNIVERSITY

Bitext Harvester Project Guide

© NYU

Proteus Project

715 Broadway 7th floor

New York, NY 10003

Table of Contents

CHAPTER 1
What is a Bitext Harvester?

CHAPTER 2
Bitext Harvester Database
Database Requirements
Database Installation

CHAPTER 3
Bitext Harvester Spider
Spider Introduction
Spider Installation
Spiderman Menu Command Summary

CHAPTER 4
Bitext Harvester Filters
Filter Setup
Running a Filter
Terminating the Filter
A Sample Scenario
Notes on Filters

CHAPTER 5
Web Based User Interface
Installation Steps

CHAPTER 6
Developer Wish List for Future Improvements

APPENDIX A
Requirements Document

APPENDIX B
Database Spec

Chapter 1

What is a Bitext Harvester?

We must acknowledge the "fact" of bilingualism and build upon it.

      -Maurice Beaudin

The Bitext Harvester Suite of applications was created to exploit the ever-expanding resource of online parallel text data. A parallel text, or 'bitext', is a pair of documents that are identical in content but have been written in different languages. These bitext resources are extremely useful in the development of natural language processing techniques, and can serve as both training and testing data for the NLP community.

The Bitext Harvester is a system for collecting, processing, and distributing parallel texts retrieved from the Web.

The system works as follows:

❑ Spider - A spider constantly trolls the Web looking for pairs of documents that might be parallel texts, downloads them locally, and places key management information into a database.

❑ Filter - The resulting documents are processed by filtering programs, which help decide whether the pair is indeed a parallel text worth saving. For example, one filter might decide what language each of the documents is written in. The results are recorded in the database.

❑ Web Interface - A web site enables people to investigate the progress of the spider and the filtering; for example, someone could specify 2 languages and ask for all parallel texts in those particular languages. The value of this resource is significantly increased by the harvester and the tools that have been developed to tag and keep track of the candidate parallel texts that have been gathered.

The Bitext Harvester application consists of four main components: the spider, the user interface, the filter capability, and the database backend. Each of the first three may be used independently along with the database to perform its specific function. The web spider downloads pages and deposits data into the database, the filters analyze and tag the data in the database, and the user interface allows a zipfile summary of the data to be downloaded via the web. The installation instructions pertain to all four components.

Chapter 2

Bitext Harvester Database

The Database works in conjunction with all parts of the Bitext Harvester Suite, and is the first thing you will need to install. The installation instructions below will help you set up your MySQL database with the appropriate tables and fields.

Database Requirements


The first step to getting any of the three main components working is to have a working database for them to talk to. The server used for development and testing was MySQL version 4.1, which can be downloaded from the MySQL web site.

Note: As of Dec 15, 2003 this was the only version of MySQL that had the subquery capabilities needed by the project. Future versions of the Bitext Harvester will hopefully be able to use stored procedures, which are going to be supported in MySQL version 5.

Database Installation

1. Download and install MySQL >= 4.1 from the MySQL web site.

2. Run the MySQL database server daemon on your machine.

3. Create a database named bitextharvester as root: mysqladmin create bitextharvester

4. Create the tables: mysql -u root -p[password] bitextharvester < [TableName].tbl

5. Load the static data: mysql -u root -p[password] bitextharvester < [TableName].dat (for Languages and Topics.)

6. Add rows to the Filters and FilterInstances tables, as described in the Filter chapter of this guide.

7. Details of these MySQL commands can be found in the online MySQL manual.

8. See Table.xls for table details such as data types and field descriptions.

9. CreateBitextHarvester.sh [database password] will perform steps 3-5.
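
For example, a setup session for steps 3-5 might look like the following. The file names Texts.tbl and Languages.dat are hypothetical; substitute the actual .tbl and .dat files shipped with the distribution:

> mysqladmin -u root -p create bitextharvester
> mysql -u root -p bitextharvester < Texts.tbl
> mysql -u root -p bitextharvester < Languages.dat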

Developer Notes

All Ids are auto-incremented (MySQL assigns Ids automatically). Deleting all rows from a table will not reset the row count; re-install the table or use the TRUNCATE command.
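
For example, to clear the Texts table and reset its auto-increment counter (note: this deletes all rows, so use with care):

> mysql -u root -p[password] bitextharvester -e "TRUNCATE TABLE Texts"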

At this point, there are no specific users or table access permissions for different processes (Spider, Filter, and Web file download).

Chapter 3

Bitext Harvester Spider

Not all keys hang from one girdle.

      -Anonymous

The spider's job is to trawl the web looking for possible parallel text candidate pairs. Each spider has some basic configuration information, such as where to start, how to choose the pages that are most likely to be bitext pairs, and what to call itself. For your first spider we recommend that you try the default spider, simplespider. Once you have a better idea of how the spider works you will be able to build one that does exactly what you need and grabs the type of documents that you want.

Spider Introduction


Now that the database is in place you can install the spider and start putting data into it. There are two different executables that you will need: the spider.sh executable is used to instantiate a new spider, and the spiderman.sh executable is used to manage any spiders you may have running. Two spider package tars are included in the distribution, one for developers with all of the source included and one for run-only purposes. If you are following this demonstration, just use the simpler run-only version, called spiderman_bin.tar.

Once the SpiderMan has been started, it runs until manually shut down. When the SpiderMan starts up, it also starts a background thread. This thread monitors the SPIDER database table for spider entries. Each entry gives the SpiderMan the spider's name, along with the RMI registry where it has registered. [RMI is a Java mechanism that allows remote processes to call methods on one another. A more detailed explanation is beyond the scope of this document, but easily found on the web.] For each spider found in the DB, the SpiderMan attempts to look it up and validate that it is alive. If so, the spider is stored in a HashMap; if not, the SpiderMan deletes the row from the DB. Once the spider is in the map, the user is able to call various actions on the spider.
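
The lookup-and-validate step might be sketched as follows. This is a hedged illustration, not the actual SpiderMan code: the registry URL format is standard RMI, but the class and method names here are assumptions.

import java.rmi.Naming;
import java.rmi.Remote;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the SpiderMan monitoring step.
public class SpiderMonitor {
    private final Map<String, Remote> liveSpiders = new HashMap<String, Remote>();

    // name, host, and port come from a row in the SPIDER table.
    public void checkSpider(String name, String host, int port) {
        try {
            // Look the spider up in the RMI registry where it registered.
            Remote spider = Naming.lookup("//" + host + ":" + port + "/" + name);
            liveSpiders.put(name, spider); // alive: keep it for later commands
        } catch (Exception e) {
            liveSpiders.remove(name);
            // not alive: the real SpiderMan deletes this row from the DB here
        }
    }
}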

The SimpleSpider implementation is a web crawler that iterates through a list of queries and calls a particular search engine with each query. The resulting HTML page is scraped for all links. Each linked page is retrieved and then scraped for the link that caused the page to be returned by the engine in the first place. If such a link exists, both files are retrieved and stored (along with duplicate checking using a hash scheme to avoid unnecessary duplicate files). An example query might be 'click for French', in which case the page found, and the page that it points to, may be retrieved as a potential bitext. That said, this demo spider does a lousy job of finding accurate bitext pairs; it is deliberately greedy, bringing in many texts so that additional filters may decide to keep or discard them. Why is it lousy? When a page is found by querying a search engine for 'click for French', the page that may be the French version is very seldom behind a link whose text is exactly 'click for French'. This greedy implementation simply searches for any link that contains any word in the query (and hopes for the best). Other problems occur in how these links are formatted (how their paths are structured, and so on), so a high percentage produce malformed URLs. More precise scraping methods could be used, but were beyond the scope of this project.
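
The greedy matching rule described above can be illustrated with a short sketch. The class and method names are hypothetical; only the "any word of the query appears in the link text" behavior comes from the SimpleSpider description:

// Illustrative only: accept a link if its anchor text contains any query word.
public class GreedyLinkMatcher {
    public static boolean matches(String anchorText, String query) {
        String text = anchorText.toLowerCase();
        for (String word : query.toLowerCase().split("\\s+")) {
            if (text.contains(word)) {
                return true; // e.g. query "click for French" matches "French version"
            }
        }
        return false;
    }
}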

Spider Installation

1. Make certain that you have J2SE1.4.1 or greater

2. Follow the database installation instructions above if you haven't already done so

3. Choose which tar file (run-only or src) you would like to use and unpack it.

Unpack the run only tar file: > tar xzf spiderman_bin.tar.gz

The following structure will be created:

startrmi: start script for RMI

spiderman

spiderman.sh: start script

config

lib

log

spider

spider.sh: start script

config

lib

log

Or unpack the developer tar file: > tar xzf spiderman_dev.tar.gz

The following structure will be created:

dist: (distribution of spiderman and simplespider)

lib: (all needed jars)

config: (all config files)

src: (java files)

edu.nyu.bitext.server

edu.nyu.bitext.server.db

edu.nyu.bitext.spider

edu.nyu.bitext.spider.db

build (class files)

edu.nyu.bitext.server

edu.nyu.bitext.server.db

edu.nyu.bitext.spider

edu.nyu.bitext.spider.db

5. Starting the RMIRegistry

The rmiregistry will need to be running for the apps to work correctly. Start the RMIRegistry by typing: > ./startrmi

6. Starting a Spider with Spider.sh:

The spider.sh executable is responsible for instantiating a spider. Before running the spider manager, let's go over an example of starting a spider so that we have something to manage. By default, ./spider.sh uses ./config/simplespider.cfg for its configuration. If you look inside this file you will notice that the name of the spider is specified as 'SimpleSpider'; this is the name you will use when referring to the spider in the spider manager utility, so make sure it is unique and descriptive for each spider that you start.


Figure 1 shows steps 5 and 6 in a terminal window.

7. Using the Spider Manager ('SpiderMan'):

The spider manager ('SpiderMan') system is a mechanism built to allow for central control of multiple spiders. A spider may be loosely defined as a process that scans the web for possible bitexts (a bitext is two 'identical' texts in different languages). Technically, a spider can be anything, as long as it implements the edu.nyu.bitext.shared.Spider interface. For instance, a spider could potentially scan repository files instead of web crawling, and still be controlled via the SpiderMan.
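
For instance, a minimal spider skeleton might look like the following. This is a hedged sketch: SpiderLike stands in for the real edu.nyu.bitext.shared.Spider interface, whose exact method set is an assumption here, based on the SpiderMan commands described below.

// Stand-in for edu.nyu.bitext.shared.Spider; the real method set may differ.
interface SpiderLike {
    void run(String configFile, String queryFile);
    void halt();
    void resume();
    String status();
}

// A spider that scans a local repository instead of crawling the web.
class RepositorySpider implements SpiderLike {
    private volatile boolean halted = false;

    public void run(String configFile, String queryFile) {
        // Walk the repository here, pausing whenever halted is true.
    }
    public void halt()     { halted = true; }
    public void resume()   { halted = false; }
    public String status() { return halted ? "halted" : "running"; }
}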

To run the SpiderMan, go to the install directory and type: > ./spiderman.sh

A usage menu will be shown [NOTE: items in brackets are optional parameters]:

Implementations of these commands are spider specific, and it is up to the individual spider to implement them as it chooses. The SpiderMan provides a centralized and convenient way to run these commands against any registered spiders. The spider developer should do their best to implement these commands in a reasonable manner. The summary below tries to describe how these commands should act in a generic way.


Figure 2 – the SpiderMan in action

In the SpiderMan menu, 'spider' refers to the name of the spider that you would like to run the command for. The spider name is specified by the user in the spider config file. All spiders must have a unique id. If you look at the spider_french/config/spider.cfg example:

The SpiderID is set in the first line: SPIDERID=SimpleSpider_FRENCH

This is inserted into the 'name' field of the SPIDER table in the database. The 'id' field is an auto-increment field in MySQL, so for this example MySQL would generate the following row:

|Id |name |host |port |status |created |

|4 |SimpleSpider |localhost |1099 |0 |2003-12-22 14:24:56|

And yes (as you may be wondering), it is unfortunately possible for another spider to overwrite an existing spider by using the same name, which causes the SpiderMan to use the most recently registered one. In some ways this makes it easier to redeploy, without worrying about running an old version, but it also carries the risk of accidentally (or unknowingly) reusing a name. In future versions this might change.

[Note that in the SpiderMan there is a HashMap keyed by spider name (not id); if you changed the lookup to use the spider ID, the map would have to change.]

Spiderman Menu Command Summary

run spider [configfile] [queryfile]

Tells the spider to begin crawling the web (or file systems) in order to locate bitexts. This command may take two parameters:

• configFile - This is a property file that can be set for a spider. Individual spiders will most likely need specific property files, which should be documented by the spider developer. A configFile for the included 'SimpleSpider' implementation is given as an example in the spiderman/config folder.

• queryFile - This is also spider specific, but should most likely be a list of queries for a spider to send to a search engine. A queries.txt for the included 'SimpleSpider' implementation is given as an example in the spiderman/config folder.

setconfig spider configFile

This command sends a property file to the spider. The spider may be implemented to act on it while running (this is up to the spider implementation).

setquery spider queryFile

This command sends a query list to the spider. The spider may be implemented to act on it while running (this is up to the spider implementation).

halt spider

Halts the spider, pausing it from fetching bitexts.

continue spider

Wakes the spider from the halted state.

throttle spider

This is an optional feature that one could build into a spider's activity: a way to automatically halt the spider if the number of bitexts retrieved gets ahead of filtering, giving the filters a chance to catch up.

status [spider]

Shows the status of all registered spiders. It is up to the spider implementation to provide status information when this command is called.

showquery spider

showconfig spider

Dumps the spider's properties/queries to the console.

quit

Exit SpiderMan

help

Display the full menu of options.
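
For example, a hypothetical session at the SpiderMan prompt, using the example spider and config files from above, might look like:

? run SimpleSpider config/simplespider.cfg config/queries.txt
? status SimpleSpider
? halt SimpleSpider
? continue SimpleSpider
? quit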

Chapter 4

Bitext Harvester Filters

All roads do not lead to Rome.

      -Slovenian Proverb

Now that you have gathered some bitext candidate documents, the next step is to perform any additional processing that can help you decide which of the text pairs are worth keeping. Some basic filters are included in the distribution and should be used as a guide to help you build filters of your own. The filtering system provides a means for automating the process of filtering the documents and bitexts downloaded by the spider. It accomplishes this by storing information about the filtering results in the database and judging, based on user-defined criteria, whether the filtered entity should remain valid for further testing or be marked as invalid.


Three varieties of filters:

1. Text filters (update TestResults)

2. Bitext filters (update TestResults)

3. Column value filters (update the Texts or Bitexts table)

Filtering runs from the command line. The user needs to provide the following: a configuration file, a properties file, and the executable file. Configuration and executable file locations can be loaded from the database for filters that have already been inserted into the database.

Purpose of the various files:

Properties file – environmental settings (database driver / location / username / password, directory where logs are written to)

Executable file – Performs the testing on one or more files. At minimum, read permission should exist for the executable. Execute permission needs to exist if this file is run without prefixing its name with a shell or interpreter language name (e.g. perl, sh, awk, and so on). It's recommended that the minimum permissions set for this file be 755.

Configuration file – settings specific to a filter executable. Authored in XML. This communicates the following to the filter system (details included later):

• How many files are to be tested by the script

• What is the syntax used by this script at the command line

• Of the texts / bitexts kept in the database, which ones should be tested by this filter

• What results, when printed to standard output, imply that the test has passed or failed

• Which tables and columns in the database need to be updated

Filter Setup

1. For developers and users: filters.zip

>unzip filters.zip

The following directory structure will be created:

• filters

o filter_launcher.sh: the launch script

o properties: the default properties file. Logging is disabled by default.

o edu/nyu/bitext/filters: source and class files

o edu/nyu/bitext/filters/utils: source and class files

• lib: all needed jars

• logs: default directory where logs will be stored

• scripts: sample filters

• config: DTD and XML configuration files for the above scripts

2. If you are running on a Linux machine, you may need to convert the end-of-line characters in filter_launcher.sh to Linux format. You may also need to change some of the paths in filter_launcher.sh to match your install locations.

Running the Filter

Execution from the command line takes one of three forms:

1. To add a new filter to the database and execute it on the bitexts or texts in the database:

>filter_launcher.sh -n FILTER_NAME -c CONFIG_FILE -EXECUTABLE [-d DESCRIPTION] -o PROPERTIES_FILE

where the description is optional and the other tags mandatory.

2. To launch a filter with information in the Filters table of the database:

>filter_launcher.sh -i FILTER_ID -o PROPERTIES_FILE

where FILTER_ID is the primary key value from the Filters table.

3. To query the database for information about available filters:

>filter_launcher.sh -q [FILTER_ID] -o PROPERTIES_FILE

where "-q FILTER_ID" gets information about one filter, and "-q" without a FILTER_ID gets information about all filters.

After executing the application in the first or second manner, it would be useful to type CTRL-Z and subsequently bg in order to run the application in the background.

Logging may be enabled by adding a setting logs.level=SETTING (where SETTING is one of the following: ALL, DEBUG, ERROR, FATAL, INFO, WARN, or OFF) to the properties file.
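
Putting this together, a properties file might look like the following. The database keys shown here are illustrative assumptions; only logs.level is documented above:

# hypothetical properties file
db.driver=com.mysql.jdbc.Driver
db.url=jdbc:mysql://localhost/bitextharvester
db.user=root
db.password=secret
logs.dir=logs
logs.level=INFO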

Terminating the Filter

The application may be terminated under one of the following conditions:

1. Retry count exceeded – while the run() method of a thread is executing, failed connections to the database or SQLExceptions may occur. After five retries, the threads spawned by the filter launcher will die and the application will terminate gracefully.

2. Forceful termination – under normal circumstances, this is the means by which a user will wish to end a filtering process. Using the Unix kill command should terminate the process without causing database inconsistencies since transactions and batched queries are utilized. Future developers may wish to make use of the methods provided in the edu.nyu.bitext.filters.Filter interface in order to provide a means to shut down the filters programmatically.

A Sample Scenario:

We want to ensure that the files we download are composed of text data (e.g. HTML and text pages) rather than binary data (images and sound files). The executable file, in this case, would be /usr/bin/file, a UNIX command that returns a file's type. On non-BSD UNIX systems, the file command will return output containing the word "text" if the contents of the file being tested are readable by humans. Therefore, we have to look for the word "text" in order to determine whether the file should pass this test.

The syntax of the command would be /usr/bin/file -b filename. The configuration file for this command represents this as $0 -b $1, where the general rule for specifying command syntax is:

$0 [some_parameters] $1 [some_parameters] $2 [some_parameters]... $n

The term $0 represents the command itself, while arguments such as $1, $2, and so on represent the first and second files, respectively.

The configuration file specifies the logic for evaluating the output of a filter command. The units of evaluation are called "rules"; in turn, each rule consists of one or more "tokens", which are regular expressions that the filtering output must match to either pass or fail a rule. If the test passes all of the tokens in a rule, the rule specifies whether this means that the rule passes or fails. If the test passes all of the rules specified in the configuration file, then the test is considered to be successful and the bitext or text remains as a valid candidate for further evaluation by other filters.

In our sample scenario, there is only one rule (or group of tokens) that needs to pass, and the rule consists of one token – the regular expression "text". If the output of the file command emits a string that contains the word text, then the test passes.

[graig@csstupc16 graig]$ file -b bitextharvester.txt

ASCII text

In the above case, a command line execution of the filter emits "ASCII text", which matches the regular expression supplied in the configuration file. Therefore, the test would be successful and the appropriate tables in the database would be updated.
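
Putting the pieces together, the configuration file for this scenario might be sketched as follows. This is a hedged example: the element and attribute names are taken from the reference in the next section, but the exact nesting should be checked against filters/config/filterconfig.dtd.

<?xml version="1.0"?>
<!-- Hypothetical sketch of a configuration for the /usr/bin/file filter -->
<config textCount="1">
  <cmdSyntax>$0 -b $1</cmdSyntax>
  <rule ignoreCase="true">
    <!-- the test passes if the command output matches this pattern -->
    <pattern>text</pattern>
  </rule>
</config>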

Notes on Filters

Java notes - Please see the javadoc included in the code for notes regarding the Java code.

XML notes - The structure of the XML documents parsed by objects of this class is specified by the DTD or XSD supplied to the SAXParserFactory, which the instance of this class is bound to.

Relevant elements include:

• config - The root node of the document. Must contain the attribute textCount and the element cmdSyntax, and optionally the elements tables, features, and rule.

• tables - Information for filters that need to update either the Texts or BitextPairs table. This is optional. Elements enclosed by this element include:

updateColumn - The column in Texts or BitextPairs that will be updated. This column is a foreign key to lookupColumn.

tableToLookup - The table that contains possible return values.

lookupColumn - The primary key of the aforementioned table.

lookupAuxColumn - The column in tableToLookup to which we compare the test's return value.

• cmdSyntax - Specifies the command syntax. $0 = script name, $1 = first file, $2 = second file...

• features - Defines a group of features that a text or bitext must possess in order to be processed.

• date - Specifies a minimum or maximum date of discovery for a text or bitext. Attributes used by this element include:

after - The minimum date of discovery

before - The maximum date of discovery

exclude - A true or false value that determines whether or not files in this range should be excluded from testing.

• languages - Defines a group of languages that the filter should run against or exclude

• length - Specifies a minimum or maximum size of a text to be filtered. Attributes used by this element include:

minimum - The minimum size

maximum - The maximum size

• pattern - A specific pattern being searched for in the result string

• rule - A rule that may be applied to the text or bitext. Attributes used by this element include:

ignoreCase - A boolean that determines case sensitivity

• topics - Defines a group of topics that the filter should run against or exclude. Attributes used by this element include:

exclude - A true or false value that determines whether or not files in this range should be excluded from testing.

Please note that additions to the configuration file schema should be made either to the DTD that comes with the filtering system (in filters/config/filterconfig.dtd) or to a new DTD that extends the existing one.

The recommended way to extend parsing of XML files that use tags that are defined in a different DTD is:

1. In the root (config) tag, include an attribute xsi:noNamespaceSchemaLocation="your_dtd_name.dtd" pointing to the DTD that contains the new tags.

2. In the application, create a new subclass of FilterImpl (for filters that update the TestResults table) or ColumnValueFilter (for filters that update only one value per row in the BitextPairs or Texts table). In the subclass, it would be wise to override the startElement(…) and endElement(…) methods to parse tags defined in the new or updated DTD.

3. Update the FilterFactoryImpl class to determine the appropriate Filter implementation to return when the newFilter() methods are called.

DTD was chosen over XSD because a DTD is easier to maintain. However, XSD can be used in place of DTD with no changes to the Java application code.
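
A hedged sketch of step 2 follows. In the real system you would subclass FilterImpl or ColumnValueFilter, but here SAX's DefaultHandler (which declares the same startElement/endElement callbacks) stands in so the sketch is self-contained, and the minWordCount element is an invented example tag:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hedged sketch: extend FilterImpl/ColumnValueFilter in the real system.
public class MyExtendedFilter extends DefaultHandler {
    private boolean inMinWordCount = false;
    private int minWordCount;
    private final StringBuilder buffer = new StringBuilder();

    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) throws SAXException {
        if ("minWordCount".equals(qName)) { // a made-up new tag
            inMinWordCount = true;
            buffer.setLength(0);
        }
        // Tags from the base DTD would be delegated to the superclass here.
    }

    public void characters(char[] ch, int start, int length) {
        if (inMinWordCount) buffer.append(ch, start, length);
    }

    public void endElement(String uri, String localName, String qName) {
        if ("minWordCount".equals(qName)) {
            minWordCount = Integer.parseInt(buffer.toString().trim());
            inMinWordCount = false;
        }
    }
}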

Chapter 5

Web Based User Interface

They are ill discoverers that think there is no land, when they can see nothing but sea.

      -Sir Francis Bacon

Now that candidate bitexts have been gathered and filtered, the web based user interface lets researchers search the results and download the bitexts in the languages they choose. The installation steps below will get the site up and running.

Installation Steps for the Web Based User Interface

Preconditions:

MySQL is installed

JWSDP 1.2 is installed and the JAVA_HOME environment variable is defined

The bitext harvester DB has been set up

Steps:

1. Get the MySQL driver jar (mysql-connector-java-3.0.9-stable-bin.jar, or the appropriate jar for the version of MySQL that is installed); this file can be downloaded from the MySQL web site

2. Put that jar in jwsdp-1.2/common/lib/

3. Create a directory named "bitext" under jwsdp-1.2/webapps/

4. Put the 5 JSP files in the bitext directory

5. Put the css, images, and WEB-INF directories in the bitext directory on the same level as the JSPs

6. Leave the existing directory structure under WEB-INF as it is

7. Start the server by going into the bin subdir of your jwsdp-1.2 installation and typing ./startup.sh


Figure 3 – Starting the server

8. That's it! Open the bitext web application in your browser to see if it's working; it should look like the following:


Figure 4 – the intro page

When you click 'Search and download bitexts' you will get the following choice screen:


Figure 5 – search page

Once you have made your selections, click the submit button and proceed to download the zipfile. Add the file extension .zip to the filename if you are on Windows.

Notes on the web site

1. All of the .java files are included in the WEB-INF directory if you need to make changes.

2. Note that the zip file for download should contain both files and a tab file that notes the relationship between the downloaded files.

3. The filter display will look for filter names inserted into the database and display them dynamically.
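
As an illustration, a downstream tool could use that tab file to pair up the downloaded documents. This sketch assumes, hypothetically, one pair per line as two tab-separated file names; check the actual file in the zip for its real layout:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical reader for the pair-relationship tab file in the zip.
public class PairFileReader {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("pairs.tab"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t");
            if (cols.length >= 2) {
                System.out.println(cols[0] + " <-> " + cols[1]);
            }
        }
        in.close();
    }
}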

Chapter 6

Developer Wish List for Future Improvements

Image creates desire. You will what you imagine.

      -J. G. Gallimore

As with any other project in development, there are many features that were not implemented in the first version of the Bitext Harvester. Here are a few of the enhancements that might be implemented in future versions.

• Enhance the Topics table to support multi-layer topics by adding a ParentTopicId (e.g. News would pick up Domestic News and International News).

• Create a new table called TextCategories (TextId, TopicId) to support multiple topics per text.

• Create an Account table to store Bitext Delivery System users.

• An archive mechanism.

Below are items that were out of scope, but that the development team has noted as candidates for later development, listed in order of importance.

• The Bitext Delivery System will let the user decide whether to download bitexts or to receive a list of URLs.

• File Clean Up Tool - A web based tool that enables the system administrator to purge text files, and their related database table entries, downloaded before a date of his/her choice. Orphan files (files that for some reason have no information in the database) may be deleted at the same time.

• Account Manager - A part of the Bitext Delivery System that creates and manages user accounts to distinguish user affiliations within or outside of New York University, for copyright reasons.

• Bitext Link Queries - A part of the Bitext Delivery System that allows users outside of New York University to get a list of URLs for bitexts, instead of downloading the text files, so as not to violate copyright.

• Spider Manager Graphic User Interface - A graphical user interface that lets the system administrator start and stop spiders. It also allows him/her to set up configuration and arguments, and to see performance statistics for spiders.

• Filter Manager - A process that will automatically run and check the status of filter processes, and report on performance.

• Filter Manager Graphic User Interface - A graphical user interface that lets the system administrator start and stop filters. It also allows him/her to set up configuration and arguments, and to see performance statistics for filters.

• Multilingual (more than bilingual) filtering - A filter mechanism that allows a filter to test more than two text files at the same time.

Enhanced Topics:

• Change the table to support multi-layer topics by adding ParentTopicId (e.g. News would pick up Domestic News and International News).

• Create a new table called TextCategories (TextId, TopicId) to support multiple topics per text.

• If the number of topics grows very large, we may want to create a new page for the Bitext Delivery System.

Appendix A

Bitext Harvester Project

Transition Meeting Documentation

11/2/2003

Table of Contents

Overview and Project Status

1. Requirements Introduction

2. General Description

3. Functional Requirements

4. Database Schema and Dictionary

5. System Architecture Summary

6. Interface Requirements

7. Deployment Requirements

8. Non-Functional Requirements

9. Preliminary Design Documents

10. Team Composition and Assignments

11. Preliminary Schedule

12. Demo Plan for the Demo Show on December 18th, 2003

Overview and Project Status

The project team has gone through a full requirements gathering process with Professor Melamed resulting in a finalized requirements document, and has made it through most of the design phase of the project. It is our goal through the transition meeting to exit the design phase of our project and move fully to the development phase.

Key accomplishments include:

• Rebuilding much of the interface and the database used by the PT Miner spider.

• A signed-off requirements document by Professor Melamed.

• Design of the database to be used for the project.

• Creation of the MySQL database to be used for the project

• UML schemas completed for the filter (test) interfaces

• User Interface drafts of the researcher web site.

• Identification and testing of a compression solution for the researcher web site.

• Both development and staging servers are up and running, the development server with preliminary code for the system.

With strong progress made, we are now entering a rapid development phase to bring together all of the pieces of this project. As you will note in the final sections, the timeline to delivery is tight; however, our project plan was designed to accomplish enough up-front work to make the development process less labor-intensive.

In this document, we include the completed requirements document, to which we have added design materials and fleshed-out data dictionaries.

1. Requirements Introduction

1.2 Scope of this document

The scope of this document is to outline the functional requirements for the Melamed database project.

1.3 Overview

In this project, Professor Melamed is looking for the project team to build a database of candidate parallel texts that can be used in current ML research in language translation. This "candidate" database will be populated by web crawlers provided by the professor. The database will house, display to users, and serve as a management tool for these candidate bitexts.

1.4 Business Context

Professor Melamed's work uses parallel texts--copies of the same document in 2 languages--to help develop and test his translation software.  

This project involves designing and building a system for collecting and processing parallel texts retrieved from the Web.  The system will work as follows:

• A spider constantly trolls the Web looking for pairs of documents that might be parallel texts; an existing spider will be used

• The resulting documents are processed by some filtering programs (which Prof. Melamed has) which help decide whether the pair is indeed a parallel text worth saving; for example, one filter decides each document's language

• A database records the results of the previous steps

• A web site enables people to investigate the progress of the spider and the filtering; for example, someone could specify 2 languages and ask for all parallel texts in those particular languages

1.5 Definitions Used in this Document

o Bitext: A bitext is a pair of documents that are identical in content, but written in different languages.

o Candidate Bitext: A suspected bitext pair that needs to be tested.

o ML : Machine Learning

o Web Spider : A tool that downloads web pages

o Test: A test examines candidate bitexts to determine if they are real bitexts.

2. General Description

2.1 Product Functions

The final deliverable will be a database for candidate bitexts, an adequately robust input routine from a web crawler populating the database with candidate bitexts, and a web-based download interface to allow researchers to download candidate bitexts.

2.2 User Problem Statement: The ML project requires a huge quantity of bitexts. Currently, a web crawler exists to spider and download candidate bitexts; however, this spider is not automated. This limits the number of candidate bitexts that can be loaded into machine learning algorithms.

2.3 User Roles and Characteristics

o Professor Melamed. An administrative user with access to the entire system, who will also require some basic tools to manipulate data easily.

o Assistants to Professor Melamed. Will be working with the candidate bitexts, potentially moving data to other databases, etc.

o Outside Researchers. Outside researchers will want to enter a web site, search for candidate bitexts, and download bitexts.

o Future Developers.

2.4 Workflow Summaries of Each User Role

The Spider Workflow

Researchers

2.5 General Constraints

Software Cost: All software used in the project must be free (no charge to the final user)

Timing: Resources will be limited to a five member team, and approximately 100 to 150 man hours until mid December, 2003.

3. Functional Requirements

1. The Web Spider Manager – ‘Spiderman’

The ‘Spiderman’ is responsible for monitoring and controlling the running web spiders that have registered with the ‘Spiderman’.

NOTE: All web spiders are responsible for implementing the spider interface, and registering with the ‘Spiderman’. They can be run remotely or locally. This project will ship with the ‘SimpleSpider’ web crawler [details on this spider are out of scope for this document].

Additional ‘Spiderman’ requirements:

| # | Name | Description | Priority |
| 1 | Duplicate detection | Are 2 different records pointing to the same file on the web? | 2 |
| 2 | Use configurable parameter lists | To allow the spider to find different web pages without duplicating effort. | 1 |
| 3 | Each text should be fingerprinted | Using MD5, this will help to identify if we have duplicates, even when file names change. | 1 |
| 5 | Throttling | Give the admin a throttling capability so downloads don't get too far ahead of filtering. | 2 |
| 6 | File naming | On download of a document, need to append a unique ID to the name so it is a unique file. [possibly using the IP address as a trie data structure] | 1 |

2. Test Interface

The filter interface allows an administrative researcher to grab candidate bitexts from the database and to update the database after a filter is applied.

| # | Name | Description | Priority |
| 1 | Provide a generic API to call for bitext pairs | This will allow tests to be conducted on candidate bitext pairs to determine if they are true bitexts | 1 |
| 2 | Record the results of the test | This is also an API where the call into the system will record the results of the test | 2 |
| 3 | Plan for future enhancements | Parameters may be enriched in the future. Allow for expansion | 3 |
| 4 | Provide a generic API to call for texts | For tests that need to be conducted directly on the text data. | 2 |

3.3 Bitext Delivery Web Portal

| # | Name | Description | Priority |
| 1 | Provide a search of candidate bitexts | By languages, date, and test results (one for each test.) | 1 |
| 2 | Return a list of search results of candidate texts | Allow users to look at a list of the bitexts | 2 |
| 3 | Zip search results | Provide a zipped set of results to download | 1 |
| 4 | The package download | Should have the files to download, plus a tab delimited file of bitext pairs. | 1 |

3.4 Database Requirements

| # | Name | Description | Priority |
| 1 | Texts are single records | One text may be translated into many different languages. Keeping the text as a single row will allow this | 1 |
| 2 | Bitexts are single records that relate to texts in different relationships | Another table will store relationships between texts | 1 |
| 3 | Provide table locks for tests | Allow for a row to be unavailable to a test by flagging it as "in-process" | 2 |
| 4 | Record test results | For each test on a particular bitext, record the test results. | 2 |

4. Database Schema and Dictionary

The database schema should follow the rules set in the requirements above and the following relationships.

[Database schema diagram]

Data Dictionary for Major Tables

Texts

A row stores a single file downloaded by the web spider. Texts are related to bitexts and can be associated with many bitext pairs. This will allow us to create many bitext relationships while reusing the same texts.

| Text ID | Unique ID |
| URL | This is the URL of the text being examined. |
| Fingerprint | The text will be hashed using a technique to be determined that can be used to help determine duplicate data. |
| Date of discovery | Downloaded date. |
| TextPath | The named file path to retrieve the document. |
| Stage of Process | This is a flag to indicate whether a text should be kept as a valid text in a bitext pair. |
| Length | Bit length of the file. |
| Topic ID | Allows a text to be categorized. |
| Language ID | Tests will determine what language this document is in. |
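
Read as a concrete (and hedged) sketch, this dictionary suggests a table definition roughly like the following; the authoritative types and names are in Appendix B and Table.xls:

-- Hypothetical sketch; see Appendix B / Table.xls for the real definition.
CREATE TABLE Texts (
    TextId          INT NOT NULL AUTO_INCREMENT,
    URL             VARCHAR(255),
    Fingerprint     CHAR(32),        -- e.g. an MD5 hash
    DateOfDiscovery DATETIME,
    TextPath        VARCHAR(255),
    StageOfProcess  INT,
    Length          INT,
    TopicId         INT,
    LanguageId      INT,
    PRIMARY KEY (TextId)
);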

Bitexts

Bitext rows associate texts in candidate bitext pairs. This is where relationships will exist and where tests will look for documents to examine.

| Bitext ID | Unique ID |
| Text1ID | Relation |
| Text2ID | Relation |
| Fail | After a test is run on a bitext, we will want to mark a pair as failed, to remove the bitext relationship. |
| Creation Time | Time the association was created. |

TestResults

Tests will be run on bitext pairs. This table will store the results of those tests.

|Test ID |Unique ID |

|Bitext Pair ID |The test refers to this bitext pair |

|Filter ID |The test refers to this specific test type |

|ResultValue |Tests may pass back values |

|Result |Tests may also pass back pass / fail |

|Test time |Time test was run |

Filters

A list of all filters that run tests against bitext pairs.

|FilterID |Unique ID |

|Filter Name | |

|Filter File Name |Location of the filter code |

|Description |Up to admin to use |

|Hostname |Where the filter is coming from |

|Port |Expected port filter runs on |

|NumofInstances |Number of instances of the filter running from the filter manager |

|Creation Time | |

5. System Architecture Summary

6. Interface Requirements

The interface requirements for the application will be as follows:

1. User Interfaces

1. Researcher

This is a simple web interface requiring no more than HTML 4.0 browser support.

2. Admin

Will be a combination of API and web interfaces.

2. Software Interfaces

1. Spider Interface

This is an API only

2. Admin

Will be a combination of API and web interfaces.

7. Deployment Requirements

1. Documentation: In the final deliverable, one (1) document is needed to generally describe the functions and API definitions for tests (filters) and for the spider manager – how to run the scripts and the config files, also noting the structure of calls. The documentation must include information on:

• System requirements for deployment

• Test API

• Spider manager API

2. Packaging: The final deliverable must be packaged and then successfully deployed, by someone not on the development team, to a non-development machine.

8. Non-Functional Requirements

Software Tool Choices

Why we made these choices:

1. The software chosen must be free software, so that the final user (possibly DARPA) has no locked-in financial commitment in adopting it.

2. The selections represent best-in-class open source software. There are certainly other databases available; however, MySQL is widely adopted, and happens to be used in the main spider tool we are using.

Software Selection:

• OS – Red Hat Linux 7.3

• Development Language – Java

• Execution Environment – J2SE 1.4.1

• Web Server – Apache HTTP Server 2.0.40

• Application Server – Tomcat 5.0.x

• Database – MySQL 4.1

9. Preliminary Design Documents

9.1 Researcher Web-based Front End

9.1.1 Bitext Delivery System – Home Page


9.1.2 Bitext Delivery System – Search Page (Under Construction)


9.1.3 Bitext Delivery System – Search Results Page


9.1.4 Bitext Delivery System – After clicking on the download link, the system will ask you if you want to save the file to your hard drive.


Design for Spider Management Interface

SpiderMan console menu:

**********************************

SPIDERMAN MENU

1. run [spider] [configfile]

2. halt [spider]

3. continue [spider]

4. status [spider]

5. showconfig [spider]

6. throttle [0=none]

7. help

**********************************

?

SimpleSpider config sample:

#Search Engine parameters

ENGINE=?

ADDTL_PARAMS=&ie=UTF-8&oe=UTF-8&num=100

#Language Info

SEARCH_LANG=lh=en

TRANS_LANG=fr

#Regular Expressions used for html scraping

# LINK_REGEXP: for finding links in search engine results

# NEXT_REGEXP: for next group of search results

LINK_REGEXP=/(.*?) -/i

NEXT_REGEXP=