A sample of product development, consulting, and research work in search, text mining, natural language processing, machine learning, information management, and user interface design.
SEARCH AND INFORMATION RETRIEVAL
Large-volume news search

Architecture, engineering management, and UI design for an innovative news search application. Indexes over 100K news items per day. Features persistent search, alerts, heat maps, classification, clustering, and other functions. Implemented using a blend of Lucene, Solr, and PyLucene, written in Python and Java.

SEC filings search

Architecture, engineering management, and UI design for an SEC filings search application. Indexes live and archived SEC filings. Features filings drill-down within search results, hit highlighting within sub-documents, and user tagging functions.

Hedge fund portfolio search

Blended search with structured browsing of portfolio positions for a hedge fund. Searches run over both structured and unstructured data, and each search result takes the user to a point in the hierarchical portfolio data. Architecture, engineering management, and UI design.

Other search consulting projects

Internal oil company portal using Verity; natural language search for CRM; search of medical literature.

Context-based document retrieval

Co-designed and co-implemented a new approach to desktop search based on past context ("Find the file I emailed to Frank in June or July"), for HP's NewWave desktop environment. Implemented a natural language interface based on semantic grammars.

Free-text search engine

1991: As part of a collaboration with Karen Spärck Jones, designed and implemented an early online search system based on inverse document frequency weighting and relevance feedback. Karen Spärck Jones invented inverse document frequency (the IDF in TF-IDF), and was a pioneer in the field of information retrieval from the 1960s onwards.
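
As an illustration of inverse document frequency weighting, here is a minimal Python sketch. It is not the 1991 system: the function names (`build_idf`, `rank`) and the sample documents are hypothetical, and the scoring is the textbook form, where a term appearing in `df` of `N` documents gets weight `log(N / df)`.

```python
import math
from collections import Counter


def build_idf(docs):
    """Return {term: idf} for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document, however often it occurs.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def rank(query, docs, idf):
    """Rank document indices by the summed IDF of the query terms they contain."""
    scores = []
    for i, tokens in enumerate(docs):
        present = set(tokens)
        score = sum(idf.get(t, 0.0) for t in set(query) if t in present)
        scores.append((score, i))
    # Highest score first; ties keep the original document order.
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]


# Hypothetical sample collection, for illustration only.
docs = [
    "oil prices rise on supply fears".split(),
    "hedge fund increases oil position".split(),
    "football team signs new striker".split(),
]
idf = build_idf(docs)
ranking = rank(["hedge", "oil"], docs, idf)
```

In this sample, "hedge" occurs in one document and "oil" in two, so "hedge" carries the larger weight and the second document ranks first.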

PARSING
Parsing financial data from natural language

Architected, implemented, and deployed a major system for information extraction from text. The system has been in use since 2003, extracting over 5,000 data records from over 2,000 distinct documents per day. Used for portfolio management and automated trading by hedge funds. Emphasis on high precision (over 99.8%) and good recall (over 70%). Text content is unpredictable and changes on a daily basis; content includes structured, semi-structured, and unstructured language. The system involves document classification, syntactic parsing, semantic filtering, heuristic slot-filler information extraction, error-checking, machine learning, and named-entity detection (including Asian, European, and US company names, and varying security identifiers). Written in Python, .NET, and Visual Basic.

AtomicParser

Designed and developed AtomicParser, a proprietary Python library for parsing both unstructured (linguistic) and structured text. AtomicParser is a nondeterministic rule-based parser where the rules can be applied top-down or bottom-up. Machine learning can be used to infer categories and rules from training text. The system uses a specialized regular expression module, written in C, which can return thousands of named captures from a single regular expression match.
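
To show what named captures look like in practice, here is a small sketch using Python's standard `re` module. It does not reproduce AtomicParser or its C regular expression module, whose API is not described here. The pattern, the `extract_trades` helper, and the sample sentence are all hypothetical. Because the standard module keeps only the last value of a repeated group, the sketch collects one set of named captures per match with `finditer`.

```python
import re

# Hypothetical pattern for text such as "ACME Corp bought 5,000 shares at $12.50".
TRADE = re.compile(
    r"(?P<company>[A-Z][\w&. ]*?)\s+"
    r"(?P<action>bought|sold)\s+"
    r"(?P<quantity>[\d,]+)\s+shares\s+at\s+"
    r"\$(?P<price>\d+(?:\.\d+)?)"
)


def extract_trades(text):
    """Return one dict of named captures per trade found in the text."""
    records = []
    for match in TRADE.finditer(text):
        fields = match.groupdict()
        # Convert the numeric captures from strings to numbers.
        fields["quantity"] = int(fields["quantity"].replace(",", ""))
        fields["price"] = float(fields["price"])
        records.append(fields)
    return records


# Hypothetical sample text, for illustration only.
sample = "ACME Corp bought 5,000 shares at $12.50. Globex sold 300 shares at $7."
trades = extract_trades(sample)
```

Each entry in `trades` is a slot-filled record keyed by capture name, for example `company`, `action`, `quantity`, and `price`.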

MACHINE LEARNING
AtomicML

Specified and deployed AtomicML, a proprietary Python library for machine learning. Used in various text analysis tasks, including document routing and creation of custom news channels.

TEXT MINING
Search-based content analytics

Designed and developed a search-based analytics web application for lightweight text mining of any searchable content. A free, simplified version of this service is available as AtomicIQ, supporting analysis of web, news, and Wikipedia content. Used by public relations professionals to measure news coverage.

Text mining of customer opinion data

Several projects using automated techniques to analyze customer statements and opinions in text, including product review forums, open-ended customer survey responses, and call center records. See the presentation by Bacon and Haddock (2004) for work that came out of one of these projects.

CONSUMER

Topic-based news aggregation

2009: Web application for monitoring the latest football team news on a single page. Includes news, blog posts, video clips, and quotations. Uses content filtering and sentence detection techniques: for an example, see Sir Alex Ferguson quotes.

Time-based bookmarking

2002: Designed and implemented a web-based service for saving and sharing web links. The service lists links according to when they were saved, making it easy to retrieve a link with no up-front organisational effort. Similar to later, more widely known services such as Del.icio.us.

Trail-based web browsing

2000: Designed and developed a web application for browsing travel web sites based on guided trails of information. Shared some ideas with the earlier eTour and the later StumbleUpon.

Personalized news reader

1993: Designed and implemented an early personalized news filtering system for extracting relevant articles from the Nikkei Weekly News. Featured a simple user interface and an automatic mechanism for detecting areas of interest, based on modified relevance feedback.
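
The modification to relevance feedback used in this system is not described here. As a point of reference only, the sketch below shows the classic Rocchio formulation, in which an interest profile is moved towards articles the reader found relevant and away from those they did not. The function names, the weights (`alpha`, `beta`, `gamma`), and the sample articles are assumptions made for illustration.

```python
from collections import Counter


def rocchio_update(profile, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a new term-weight profile adjusted by reader feedback."""
    updated = Counter({term: alpha * w for term, w in profile.items()})
    feedback = ((relevant_docs, 1, beta), (nonrelevant_docs, -1, gamma))
    for docs, sign, weight in feedback:
        if not docs:
            continue
        for tokens in docs:
            # Add (or subtract) the average term frequency over the feedback set.
            for term, tf in Counter(tokens).items():
                updated[term] += sign * weight * tf / len(docs)
    # Keep only terms with positive weight.
    return {term: w for term, w in updated.items() if w > 0}


def score(profile, tokens):
    """Score an article as the sum of profile weights of its tokens."""
    return sum(profile.get(t, 0.0) for t in tokens)


# Hypothetical profile and feedback, for illustration only.
profile = {"semiconductor": 1.0}
relevant = [["semiconductor", "export", "chip"]]
nonrelevant = [["sumo", "tournament"]]
new_profile = rocchio_update(profile, relevant, nonrelevant)
```

After the update, an article mentioning "chip" or "export" scores above zero even though neither term was in the original profile, which is how areas of interest can be detected without the reader naming them.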

SPEECH PROCESSING
Search and management of voice recordings

Founded and managed an HP R&D project developing a suite of techniques for information extraction from voice records, with applications to voicemail, personal voice notes, and recorded meetings. Supported visual navigation and tagging of voice records via graphical "chunks" of speech. These chunks could be extracted to other applications, and tagged with icons, such as a telephone-number icon.

Speech processing components

Graphical user interfaces to speech data depend in part on speech processing algorithms to extract higher-level information from the speech signal. Algorithms developed included:

  • Word-spotting
  • Event-spotting (phrasal constituents such as telephone numbers and dates)
  • Partial text transcription using large vocabulary speech recognition
  • Variable-speed playback
  • Speaker separation
  • Phrase segmentation

NATURAL LANGUAGE INTERFACES
Textual NL interfaces

Designed and implemented NLP interfaces to financial monitoring and other ERP systems. Implementation used external NLP tools for syntactic and semantic interpretation.

Speech NL interfaces

Developed grammars for interactive voice interfaces, using commercial and academic speech toolkits.

PH.D. RESEARCH

Incremental Interpretation

Developed a computational model of word-by-word syntactic and semantic processing, based on Combinatory Categorial Grammar and incremental evaluation of referential constraints.

Constraint networks for noun phrase evaluation and generation

Further work demonstrating that fast, low-power network consistency algorithms are sufficient for NP evaluation, and an application to generation of noun phrases.