Tag: Regular expressions

  • Regular expressions are a tool, not a problem

    What are regular expressions?

    Regular expressions are a way to describe a pattern to match a piece of text.

    So if you want to find another “similar” word in a text, badly formatted phone numbers, or to replace the “same kind” of (X)HTML element regardless of its attributes, regular expressions are the way to go. A regular expression is also referred to as a “regex” or “regexp”.

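    To give a flavour of what such patterns look like, here is a minimal C# illustration; the phone number and image patterns are simplified examples made up for this post, not taken from any real project.

        using System;
        using System.Text.RegularExpressions;

        class RegexDemo
        {
            static void Main()
            {
                // Find phone-number-like sequences regardless of the separators used.
                string text = "Call (555) 123-4567 or 555 987 6543 for details.";
                var phone = new Regex(@"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}");
                foreach (Match m in phone.Matches(text))
                    Console.WriteLine(m.Value);

                // Replace every <img> element, regardless of its attributes, with a placeholder.
                string html = "<p>Hi <img src=\"a.png\" alt=\"logo\"> there <img src='b.png'></p>";
                Console.WriteLine(Regex.Replace(html, @"<img\b[^>]*>", "[image]", RegexOptions.IgnoreCase));
            }
        }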

  • Matchpoint

    During 2007 and 2008 I worked on a document analysis tool called “Matchpoint”.

    The idea is to parse the document by identifying content blocks and then finding certain keywords within their context. The document is then tagged based on the information found.

    I did the software architecture first, creating the concepts, entities, and relations, and identifying the crucial parts of the system.

    The heart of the system is the parsing engine that identifies segments of the document, for example education, experience, and so on. All permutations of the segments are tried, and the one that matches the most segments is selected for further analysis. Each recognized segment is then searched for keywords. Each keyword has appropriate tags assigned, and in this way the document ends up tagged.

    Since the algorithm has to analyze documents in different languages, using semantic algorithms seemed a bit too complicated, so I went with regular expressions; a simplified sketch of the keyword tagging is shown below.

    Documents can be emailed or uploaded by FTP to the web server, where a Windows service monitors a configured folder (a sketch of such monitoring also follows below). A .NET console application is then run to convert each document to plain text using IFilters, run the analysis, and finally upload the data to the Microsoft SQL Server database.

    Users can search and view the indexed documents through a web application built on ASP.NET Web Forms.
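
    To make the keyword tagging concrete, here is a hypothetical sketch of the idea; the keyword patterns, tags, and sample segment are invented for this post and are not the actual Matchpoint rules.

        using System;
        using System.Collections.Generic;
        using System.Text.RegularExpressions;

        class TaggingSketch
        {
            // Invented keyword patterns and their tags; the real rules are not shown in the post.
            static readonly Dictionary<string, string> KeywordTags = new Dictionary<string, string>
            {
                { @"\bASP\.NET\b",   "web-development" },
                { @"\bSQL Server\b", "databases" },
                { @"\buniversity\b", "education" }
            };

            static void Main()
            {
                // One recognized segment of a document, e.g. the "experience" block.
                string segment = "Built ASP.NET applications backed by SQL Server.";

                var tags = new HashSet<string>();
                foreach (var pair in KeywordTags)
                    if (Regex.IsMatch(segment, pair.Key, RegexOptions.IgnoreCase))
                        tags.Add(pair.Value);

                // The union of the tags found across all segments becomes the document's tags.
                Console.WriteLine(string.Join(", ", tags));
            }
        }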
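
    The post does not say how the drop folder is watched; the sketch below shows one common way to do it in .NET, using FileSystemWatcher, as an assumption rather than the actual service code. The IFilter conversion and the database upload are left out.

        using System;
        using System.IO;

        class FolderMonitorSketch
        {
            static void Main()
            {
                // Hypothetical drop folder; the real path would come from the service configuration.
                var watcher = new FileSystemWatcher(@"C:\Matchpoint\Incoming");

                // When a new document arrives, hand it off for conversion and analysis.
                watcher.Created += (sender, e) =>
                    Console.WriteLine($"New document: {e.FullPath}");

                watcher.EnableRaisingEvents = true;

                Console.WriteLine("Watching for documents, press Enter to stop.");
                Console.ReadLine();
            }
        }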

  • Spider

    During 2003, I was asked to create an interesting tool: a spider to walk web pages and extract data.

    I analyzed how to achieve this and implemented it fully. My focus was on parsing the HTML, collecting the HTTP request parameters and values, and extracting the data.

    The tool can use a page with links, as well as a form, as a source for generating the possible request parameter and value combinations. The links within each page are then located and appended to the list of pages to be processed; a minimal sketch of this crawling loop is shown at the end of this description.

    Configurations and results are saved as XML files. I also made a viewer in which the results can be sorted by any attribute.
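
    As an illustration of the crawling loop described above, here is a minimal C# sketch. The start URL and page limit are placeholders, and the original 2003 tool also generated request parameter combinations from forms and saved its results to XML, which is not shown here.

        using System;
        using System.Collections.Generic;
        using System.Net.Http;
        using System.Text.RegularExpressions;
        using System.Threading.Tasks;

        class SpiderSketch
        {
            // A deliberately simple pattern for href attributes, good enough for a demonstration.
            static readonly Regex HrefPattern =
                new Regex(@"href\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase);

            static async Task Main()
            {
                var client = new HttpClient();
                var queue = new Queue<string>();
                var visited = new HashSet<string>();
                queue.Enqueue("https://example.com/");          // placeholder start page

                while (queue.Count > 0 && visited.Count < 10)   // arbitrary page limit
                {
                    string url = queue.Dequeue();
                    if (!visited.Add(url)) continue;

                    string html;
                    try { html = await client.GetStringAsync(url); }
                    catch (HttpRequestException) { continue; }

                    // Locate the links on the page and append them to the list of pages to process.
                    foreach (Match m in HrefPattern.Matches(html))
                        if (Uri.TryCreate(new Uri(url), m.Groups[1].Value, out Uri next)
                            && (next.Scheme == Uri.UriSchemeHttp || next.Scheme == Uri.UriSchemeHttps))
                            queue.Enqueue(next.ToString());
                }

                Console.WriteLine($"Visited {visited.Count} pages.");
            }
        }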