GIGAWORDCORPUS

Department of Computer Science

University of Minnesota Duluth

Goal of the Project:

Retrieving top ranked words based on the TF*IDF measure from a corpus.

Brief description of the problem:

The task is to retrieve a ranked list of R distinct terms from a very large corpus, where terms are ranked by how interesting they are according to the TF*IDF measure and their lengths range from M to N (N > M). R, M, and N are input parameters to the program.

Hardware Details:

Blade Cluster at University of Minnesota Twin Cities.

Software Details:

C language with MPI Package for parallelization.

Project Guide:

Dr. Ted Pedersen

Keywords:

Suffix Arrays

Quick Sort

Binomial Reduction

How we implemented it:

We first create one suffix array per file, then use quick sort to order the positions in each suffix array by the words they point to in the file.
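As a rough illustration of this step, the sketch below builds a word-level suffix array for a single in-memory string. It is not the project's source code: the function name build_suffix_array, the whitespace tokenisation, and the use of the C library's qsort() (which stands in for the quick sort described above and is not guaranteed to be a quicksort) are all assumptions made for the example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Text being indexed. Kept at file scope because the standard qsort()
   offers no user-data argument for its comparator. */
static const char *g_text;

/* Order two positions by the text they point to. */
static int cmp_suffix(const void *a, const void *b)
{
    size_t pa = *(const size_t *)a;
    size_t pb = *(const size_t *)b;
    return strcmp(g_text + pa, g_text + pb);
}

/* Hypothetical helper: record the offset of every word start in text,
   then sort those offsets by the suffix beginning at each one.
   Returns the number of entries; the caller frees *out. */
size_t build_suffix_array(const char *text, size_t **out)
{
    size_t len = strlen(text), n = 0, i;
    size_t *sa = malloc((len + 1) * sizeof *sa);

    if (sa == NULL) {
        *out = NULL;
        return 0;
    }
    for (i = 0; i < len; i++) {
        int after_gap = (i == 0) || isspace((unsigned char)text[i - 1]);
        if (after_gap && !isspace((unsigned char)text[i]))
            sa[n++] = i;            /* a word starts here */
    }
    g_text = text;
    qsort(sa, n, sizeof *sa, cmp_suffix);
    *out = sa;
    return n;
}
```

For the text "to be or not to be" the sorted array lists the final "be" first, because a suffix that is a prefix of another sorts before it.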

After the suffix array is sorted, we eliminate duplicate entries, that is, different positions that point to identical words.
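Because sorting places positions that point to the same words next to each other, the duplicates can be dropped in a single pass. The sketch below assumes a term is the first k words at a position, that words are separated by single spaces, and that the positions are already sorted; the names span_of_words, same_term and dedupe_positions are hypothetical and not taken from the project.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the first k space-separated words starting at s
   (fewer words if the text ends first). */
static size_t span_of_words(const char *s, int k)
{
    size_t i = 0;
    while (k > 0 && s[i] != '\0') {
        while (s[i] != '\0' && s[i] != ' ')
            i++;                        /* consume one word */
        if (--k > 0 && s[i] == ' ')
            i++;                        /* step over the separator */
    }
    return i;
}

/* True when positions a and b start with the same k-word term. */
static int same_term(const char *text, size_t a, size_t b, int k)
{
    size_t la = span_of_words(text + a, k);
    size_t lb = span_of_words(text + b, k);
    return la == lb && memcmp(text + a, text + b, la) == 0;
}

/* Compact a sorted position array in place, keeping one position per
   distinct k-word term. Returns the new number of entries. */
size_t dedupe_positions(const char *text, size_t *sa, size_t n, int k)
{
    size_t kept = 0, i;
    for (i = 0; i < n; i++) {
        if (kept == 0 || !same_term(text, sa[kept - 1], sa[i], k))
            sa[kept++] = sa[i];
    }
    return kept;
}
```

A real implementation would probably also keep a count of the positions it discards, since those counts are the term frequencies needed later.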

Once each processor has computed its final suffix array, all processors take part in a binomial reduction, so that at the end the processor with id zero holds the merged suffix arrays of all the processors.
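The communication pattern of a binomial reduction can be sketched without MPI by simulating the ranks in one process. In the example below each "processor" is a sorted list of integers standing in for term keys, and the loop body marks where a real program would exchange data with MPI send and receive calls. The List type and the functions merge_sorted and binomial_reduce are assumptions for illustration, not the project's code.

```c
#include <stdlib.h>

/* One simulated processor's sorted data. */
typedef struct {
    int *v;
    size_t n;
} List;

/* Merge two sorted lists into a new sorted list, keeping a single
   copy of any value that appears more than once. */
static List merge_sorted(List a, List b)
{
    List out;
    size_t i = 0, j = 0;

    out.n = 0;
    out.v = malloc((a.n + b.n + 1) * sizeof *out.v);
    if (out.v == NULL)
        abort();
    while (i < a.n || j < b.n) {
        int x;
        if (j == b.n || (i < a.n && a.v[i] <= b.v[j]))
            x = a.v[i++];
        else
            x = b.v[j++];
        if (out.n == 0 || out.v[out.n - 1] != x)
            out.v[out.n++] = x;
    }
    return out;
}

/* Binomial-tree reduction over p simulated ranks. In the round with
   stride s, each rank r that is a multiple of 2*s takes the data of
   rank r + s (if that rank exists) and merges it into its own.
   After about log2(p) rounds, rank 0 holds the merged result.
   Every ranks[i].v must have been allocated with malloc(). */
void binomial_reduce(List *ranks, int p)
{
    int s, r;
    for (s = 1; s < p; s *= 2) {
        for (r = 0; r + s < p; r += 2 * s) {
            /* With MPI, rank r + s would send here and rank r receive. */
            List merged = merge_sorted(ranks[r], ranks[r + s]);
            free(ranks[r].v);
            free(ranks[r + s].v);
            ranks[r] = merged;
            ranks[r + s].v = NULL;
            ranks[r + s].n = 0;
        }
    }
}
```

The appeal of this pattern is that the number of rounds grows logarithmically with the number of processors, instead of processor zero receiving from every other processor in turn.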

After this, we retrieve the top R interesting terms.

Downloads

Please read this before downloading and running

Download the project report

Download the project slides

Download the project source code

Authors

Pavan Poluri

Siddharth Deokar

Varun Sudhakar