KillerNay
Chanon Ngerntongdee Biography

Open Source Web Crawlers Written in Java

Saturday, 24 May 2008 06:59 by KillerNay
  1. Heritrix - Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags.
  2. WebSPHINX - WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically. WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.
  3. Nutch - Nutch provides a transparent alternative to commercial web search engines. As of June, 2003, we have successfully built a 100 million page demo system. It uses Lucene for its indexing but provides its own crawler implementation.
  4. WebLech - WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.
  5. Arale - While many bots around are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers.
  6. J-Spider - Based on the book "Programming Spiders, Bots and Aggregators in Java". This book begins by showing how to create simple bots that will retrieve information from a single website. Then a spider is developed that can move from site to site as it crawls across the Web. Next we build aggregators that can take data from many sites and present a consolidated view.
  7. HyperSpider - HyperSpider (Java app) collects the link structure of a website. Data import/export from/to database and CSV-files. Export to Graphviz DOT, Resource Description Framework (RDF/DC), XML Topic Maps (XTM), Prolog, HTML. Visualization as hierarchy and map.
  8. Arachnid - Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed.
  9. Spindle - Spindle is a web indexing/search tool built on top of the Lucene toolkit. It includes an HTTP spider that is used to build the index, and a search class that is used to search the index. In addition, support is provided for the Bitmechanic listlib JSP TagLib, so that a search can be added to a JSP based site without writing any Java classes.
  10. Spider - Spider is a complete standalone Java application designed to easily integrate varied datasources. It is an XML driven framework for data retrieval from network accessible sources, with scheduled pulling. It is highly extensible, provides hooks for custom post-processing and configuration, and is implemented as an Avalon/Keel framework datafeed service.
  11. LARM - LARM is a 100% Java search solution for end-users of the Jakarta Lucene search engine framework. It contains methods for indexing files, database tables, and a crawler for indexing web sites. Well, it will be. At the moment we only have some specifications. It's up to you to turn this into a working program. Its predecessor was an experimental crawler called larm-webcrawler available from the Jakarta project.
  12. Metis - Metis is a tool to collect information from the content of web sites. This was written for the Ideahamster Group for finding the competitive intelligence weight of a web server and assists in satisfying the CI Scouting portion of the Open Source Security Testing Methodology Manual (OSSTMM).
  13. SimpleSpider - The simple spider is a real application to provide the search capability for DevelopMentor's web site. It is also an example application, for classroom use learning about open source programming with Java.
  14. Grunk - Grunk (for GRammar UNderstanding Kernel) is a library for parsing and extracting structured metadata from semi-structured text formats. It is based on a very flexible parsing engine capable of detecting a wide variety of patterns in text formats and extracting information from them. Formats are described in a simple and powerful XML configuration from which Grunk builds a parser at runtime, so adapting Grunk to a new format does not require a coding or compilation step. Not really a crawler, but something that may prove extremely useful in crawling.
  15. CAPEK - CAPEK is an Open Source robot entirely written in Java. It gathers web pages for EGOTHOR in a sophisticated way. The pages are ordered by their pagerank, stability of the connection between Capek and the respective web-site, and many other factors.
  16. Aperture - Aperture crawls information systems such as file systems, websites, mail boxes and mail servers. It can extract full-text and metadata from many common file formats. Aperture has a flexible architecture that can be extended with custom file formats, data sources, etc., with support for deployment on OSGi platforms.
  17. Smart and Simple Web Crawler - A framework that crawls a web site with integrated Lucene support. Supports two crawling modes, Max Iterations and Max Depth. Provides a filter interface to limit the links to be crawled. Filters can be combined with AND, OR and NOT.
  18. Web Harvest - Web-Harvest collects Web pages and extracts useful data from them. It leverages technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites. However it can be extended by custom Java libraries to augment its extraction capabilities.
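
The link filters mentioned for Smart and Simple Web Crawler (combined with AND, OR and NOT) are easy to picture with plain Java. The sketch below is not that project's API; `LinkFilters` and its methods are hypothetical names, and it only assumes that a filter is a yes/no test on a link string, which `java.util.function.Predicate` already models.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical helper, not taken from any crawler listed above. */
final class LinkFilters {
    private LinkFilters() {}

    /** Accepts links that start with the given prefix, e.g. to stay on one site. */
    static Predicate<String> startsWith(String prefix) {
        return link -> link.startsWith(prefix);
    }

    /** Accepts links that end with the given suffix, e.g. a file extension. */
    static Predicate<String> endsWith(String suffix) {
        return link -> link.endsWith(suffix);
    }

    /** Returns only the links the filter accepts, preserving their order. */
    static List<String> apply(List<String> links, Predicate<String> filter) {
        return links.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // AND, OR and NOT map onto Predicate.and, Predicate.or and Predicate.negate.
        Predicate<String> onSite = startsWith("http://example.com/");
        Predicate<String> download = endsWith(".zip").or(endsWith(".pdf"));
        Predicate<String> crawlable = onSite.and(download.negate());

        List<String> links = List.of(
                "http://example.com/index.html",
                "http://example.com/manual.pdf",
                "http://other.org/index.html");

        // Only the on-site HTML page passes the combined filter.
        System.out.println(apply(links, crawlable));
    }
}
```

A real crawler would probably parse each link into a URL before filtering rather than compare raw strings, but the combination logic stays the same.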

Categories:   seo

Open Source Full Text Search Engines Written In Java

Saturday, 24 May 2008 06:54 by KillerNay
  • Lucene - The de-facto open source search index used almost everywhere. Features include Ranked searching, boolean and phrase queries, fielded searching and date-range searching. Lucene also serves as the search engine of Nutch.
  • Egothor - Impressive demo is worth a look. Key features include: HTML, PDF, PS, and Microsoft's DOC, and XLS indexing; Golomb, Elias-Gamma and Block coding; Universal stemmer that can process almost any language; Boolean model and Vector model.
  • Carrot2 - Carrot2 is a research framework for experimenting with automated querying of various data sources (such as search engines), processing search results and their visualization.
  • BDDBot - BDDBot is a web robot, search engine, and web server written entirely in Java. It was written by Tim Macinta for his book (co-authored with Wes Sonnenreich), a Web Developer's Guide to Search Engines. It was written as an example for a chapter on how to write your search engines, and as such it is very simplistic.
  • MG4J - MG4J lets you build compressed full-text indices for large collections of documents using sophisticated techniques such as interpolative coding. Moreover, it provides utility classes that are essential in any serious text-processing activity.
  • eXist - Primarily designed as an XML database, but it includes an inverted index that speeds up XPath based queries. The author describes it this way: "Indexing in eXist is based on a numbering scheme which supports quick identification of structural relationships between nodes, such as parent-child, ancestor-descendant or previous-/next-sibling. This way, a wide range of common path expressions is processed only using indexing information".
  • JXTA Search - JXTA Search is a JXTA service which enables efficient search in distributed networks. JXTA Search is based on technology originally developed by Infrasearch which was acquired by Sun in March 2001. JXTA Search searches for content and services on JXTA nodes and on the web from either network. I'm not 100% certain whether this project includes its own full-text search engine, but from a quick glance it appears to.
  • XQEngine - A full-text search engine for XML documents. Utilizes XQuery as its front-end query language. XPath expressions let you specify constraints on attributes and element hierarchies, in addition to the specific word content.
  • Zilverline - Searches a collection of files and directories. PDF, Word, txt, java, CHM and HTML are supported, as well as zip and rar files. Search results can be retrieved from local disk or remotely, if you run a webserver on your machine. Files inside zip, rar and chm files are extracted, indexed and can be cached. The cache can be mapped to sit behind your webserver as well.
  • XXL - XXL is a Java library that contains a rich infrastructure for implementing advanced query processing functionality. The library offers low-level components like access to raw disks as well as high-level ones like a query optimizer. On the intermediate levels, XXL provides a demand-driven cursor algebra, a framework for indexing and a powerful package for supporting aggregation.
  • Red Piranha - Red Piranha combines Lucene (Searching Ability), XML-RDF (ability to learn), Tomcat (for P2P Power) and Spring (Ease of use) to not only let you find anything, anywhere, but to actually understand what you are looking for.
  • Regain - Regain is a search engine that doesn't search the web, but searches your own files and documents. There are two versions of Regain: the desktop search and the server search. The desktop search is meant for a normal desktop computer and offers a fast search for documents or intranet webpages. The server search can be installed on web servers. It provides searching functionality for a website or for intranet fileservers.
  • Solr - Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat. It includes an Extensible Plugin Architecture.
  • OpenGrok - OpenGrok is a fast and usable source code search and cross reference engine. It helps you search, cross-reference and navigate your source tree. It can understand various program file formats and version control histories like SCCS, RCS, CVS and Subversion. OpenGrok provides a fast search engine that can: search for full text, definitions, symbols, path and revision history; limit searches to any subtree (hierarchical search); search with Google-like query syntax (e.g. path:Makefile defs:target); search for files modified within a date range; and search using wild cards like * (many characters) or ? (one character).
  • Terrier - TErabyte RetrIEveR is a comprehensive, flexible, robust, and transparent platform for research and experimentation in text retrieval. Terrier has been tested to handle at least 25 million documents. It offers out-of-the-box indexing for documents of various formats, such as HTML, PDF, or Microsoft Word, Excel and PowerPoint files. It supports classic retrieval models, such as tf-idf and Okapi's BM25, as well as several language models and Rocchio's query expansion.
  • JZKit 2 - JZKit 2 is a toolset for building advanced search and retrieve applications. The framework provides components that cover all aspects of building searching applications from directory and collection description services through to record schema / syntax translation and aggregate item deduplication.
  • Argos - Argos is a Java based interface designed to provide unified methods for querying internet search engines. Currently many search engines provide their own interfaces for programmatic access. These access mechanisms vary from simply providing search results in XML to supplying code in one or more languages.
  • Snapper - This fulltext indexing and search engine is designed to work on millions of documents in unlimited different intranet / LAN "sites". The included search client application is completely XML/XSLT based, representing search and result pages as easily customizable XML/XSLT/HTML documents. Common office file formats are supported by native Java file parsers: MS Office, Outlook, PDF, HTML, TXT, ZIP, tar.gz, PST, Pictures and scanned images. Document metadata from relational databases can be merged into the document index. Index data can be updated incrementally.
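
Most of the engines above are built around an inverted index (eXist's entry mentions one explicitly), so a toy version may help show what that means. This is only a sketch under simple assumptions: `TinyIndex` is a hypothetical class, not the API of Lucene or any project listed here, tokens are lowercased runs of ASCII letters and digits, and queries are plain boolean AND with no ranking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical toy index; real engines add ranking, stemming and compression. */
final class TinyIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text and splits it on anything that is not a-z or 0-9. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Records every term of the text under the given document id. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Boolean AND query: ids of documents containing every query term. */
    List<Integer> searchAll(String query) {
        List<String> terms = tokenize(query);
        if (terms.isEmpty()) {
            return Collections.emptyList();
        }
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return Collections.emptyList(); // an unknown term matches nothing
            }
            if (result == null) {
                result = new TreeSet<>(docs);   // copy so the index is not modified
            } else {
                result.retainAll(docs);         // intersect posting lists
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Lucene is a search index");
        index.add(2, "Nutch uses Lucene for indexing");
        index.add(3, "Solr is a search server based on Lucene");
        System.out.println(index.searchAll("lucene search")); // prints [1, 3]
    }
}
```

The lookup cost depends on the length of the posting lists for the query terms rather than on the total number of documents, which is the main reason full-text engines index this way.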
    Categories:   seo

    “Formatting” An iPhone To Wipe Data

    Saturday, 24 May 2008 06:50 by KillerNay

    It appears people are recovering data off old iPhones. Whoops, looks like you can pull data out of memory using forensics tools, just like any other platform. While your Mac includes the ability to overwrite old data when formatting your hard drive to prevent recovery (very cool that this is included in a consumer operating system), there is no equivalent mechanism to clear off that “ancient” original iPhone when you trade up to the 3G version next month.

    For those of you who aren’t just convincing your spouses to take your “old” iPhone off your hands to justify that new toy, Securosis presents a simple process to minimize the chances of recovery. It’s not perfect, but it’s easy and should offer enough protection for those of you forced to eBay your once-precious-but-now-obsolete device:

    1. Restore the iPhone from within iTunes.
    2. On the “Info” tab, un-check all options so you don’t synchronize calendars, email, bookmarks, and contacts.
    3. On the Photos, Podcasts, and Video tabs, uncheck “Sync …”.
    4. Create 3 big playlists, each as large as the storage capacity of your iPhone.
    5. On the Music tab, select the first of your 3 playlists to sync. Make sure the storage bar at the bottom looks full after syncing.
    6. Sync your iPhone, change to the next playlist, sync again, and repeat one last time.

    This will hopefully overwrite any of the free space on your phone, helping prevent recovery of any of those love letters and bad jokes lingering from old emails. I won’t have a chance to test this anytime soon, and odds are high some fragments will survive depending on how the iPhone allocates at the file system level, but this should be more than sufficient to prevent casual recovery of sensitive stuff if you’d like to hock your “old” phone.


    Open Letter to the Jason Gambert Fanboys & the SEO Standards Crowd

    Tuesday, 13 May 2008 03:36 by KillerNay

    Dear People Who Want Other People to Tell Them How to Run Your Business,

    You all know how I feel about SEO standards, and lord knows I know how all you all feel. But here’s the deal - you people are dragging me down.

    Especially you Jason Gambert fanboys.

    Most of the SEO Standards crowd are, for the most part, polite. Sure, some of the conversation has gotten heated, but hey, passionate debate does that and it’s been heated on both sides. But all you all need to police the Gambert fanboys. It’s kind of like keeping your own in check. They’re coming in here and making clowns of themselves and making you people look a little silly. And I think it’s all one dude. But whatever. Just don’t let the vocal minority become your mouthpiece is what I’m getting at.

    Next, all you all are gonna’ screw this collection of SEO industry junk up! People are going to start coming here expecting to learn something! I don’t need that kind of shit! That’s way too much pressure! I gave you people my own set of SEO Standards - either follow them or don’t. And I’m guessing you all are in the “don’t” column as I have yet to see any friggin’ silver! And hey, I’m not even strong arming you like Gambert is!

    Now I’m going to be serious for a moment. I hear a lot of people demanding friggin’ SEO standards to legitimize the industry, to make us appear to be more than service providers. I read people worrying about others getting scammed and fretting over making “legitimate” search marketers look different than their “black hat” brethren and sisterthen (I know, I know, it isn’t a friggin’ word. Deal with it.). The fact of the matter is that as long as people refuse to do the due diligence research before entering into a partnership with a company, people are going to get scammed. SEO standards will not help these people. Furthermore, just because there are guidelines and badges saying a company is approved, that doesn’t mean that company will not scam anyone; it does not mean that the client won’t feel scammed; it doesn’t mean that the approved company has any ethics. It simply means they paid their money, perhaps signed something and maybe took a test. And if you want SEO standards just so you look better, then you need to hire public relations. There’s a lot more to being a “professional” than simply having standards and guidelines. In the end, SEO standards may just be giving those that are causing all the hand-wringing a way to look legitimate while screwing over a client.

    The bottom line is you can only trust yourself to do what’s right. Spell out what you feel is appropriate and live by it.

    Sincerely,

    SEO Hack
