Human knowledge has become staggeringly huge. As my growing pile of unread books shows, we can't keep up even in areas we are interested in. I constantly find that curiosity leads me from one thing to the next (Richard I to the Knights Templar, Wavelets to signal processing to Kolmogorov complexity). I will not live long enough to delve into all of the areas of computer science and mathematics that I am interested in. This is one of the saddest parts of mortality.
The web is wonderful because it allows me to reference work that I would otherwise have a hard time getting access to or even learning about. The web also makes things worse in some ways too, as the web page demonstrates. I frequently find information that I might want to reference at some point in the future, so I make a note of it on this Web page.
This web page was started soon after www.bearcave.com came on-line, in June of 1995. Back then, everyone had a links pages, so we created a links page too. As with the rest of bearcave.com, this links page has evolved. Many of the links on this page I have added for myself, although the narrative is usually written for others. But as this Web page expands I feel a bit like Ted Nelson and his files (see The Curse of Xanadu by Gary Wolf, Wired Magazine, June 1995). There is a real danger that any system of notes and annotations will become useless as it grows to the point where finding material becomes difficult (then you need an index to the index). While this web page does not have an index (other than Google), it has acquired a disturbingly long table of contents (see Links to the Links, below). In fact, it is starting to resemble a sort of demented "blog".
Without better tools (see the links on Cyc, below) there may be a limit to human knowledge, not because there are not new things to discover, but because the human knowledge base becomes too unwieldy. Our lifetimes are limited, as is our brain capacity. From the point of view of intellectual work, upgrading the life span would be not very useful without upgrading our brain storage and retrieval capacity. As human knowledge grows there may come a time when people become so narrowly specialized that progress stops. Or that a person must spend their lifetime simply mastering what is known in a given field, without contributing anything new.
I have used the ANTLR parser generator to develop the Bear Productions International Java compiler front end. I have also written a fair amount about ANTLR on the bearcave.com web pages.
ANTLR has been used to create a wide variety of software tools. One interesting example is Ephedra, a C/C++ to Java translator developed by Johannes Martin in his PhD dissertaton.
Java has references, which are pointers, by another name. However, Java does not have references to references, which could be used to model pointers to pointers in C++. This is a problem when it comes to translating some C++ algorithms to Java. For example, translating the algorithm that supports binary tree delete.

SPAM is the term applied to unsolicited e-mail sent out by mass junk e-mail advertisers. SPAM threatens to flood our e-mail accounts with junk mail, making them unusable. I started writing about SPAM here years ago, when the problem first started to appear. Here is something I wrote about the infamous Stanford Wallace an "innovator" in mass spamming:
One of the most despicable SPAMers is Stanford Wallace (known as SPAMford). Mr. Wallace runs a company called Cyberpromotions. Wallace believes that the more controversy he can stir up, the more attention he will get and that his business will increase as a result. Recently Wallace hosted a site registered with interNIC as godhatesfags.com. This is a "Christian" hate site mainly aimed at gay men and lesbian women. The site is also antisemitic and states that Jews are in league with gays and are damned as well.
Wallace hosted this fountain of pathalogical hate in the hope that it would attract media attention. Wallace is a good example of the kind of scum that take part in SPAM "marketing". Fight SPAM and don't do business with anyone who uses SPAM marketing. Don't do business with an ISP (Internet Service Provider) that allows their site to be a SPAM platform.
Things change fast in Internet time. SPAMford Wallace has been driven off the Internet. He lost several lawsuits and has been looking for some other way to scam bucks. Those wonderful "Christians" at godhatesfags.com are (or were) hosted by www.L7.net (or at least they were at the time of this writing). As a strong believer in the first amendment, I have to support their right to post their spew.
Ever the master of reinvention Spamford Wallace went legit. He is opened a nightclub, complete with go-go dancers. But as Wallace points out at the end of a Wired News article, the spam continues. Wallace's night club, Plum Crazy filed for bankrupcy in 2004. Ever the master of the quetionable businesses, Spamford apparently went back to his roots:
Spam King must step down: This might make you feel a little better the next time you have to close dozens of pop-up windows or spend an hour removing "spyware" or "mal-ware" from your computer: A federal judge has issued a temporary restraining order against Stanford Wallace, a man known as the "Spam King," which will force him to disable most of his software. Mr. Wallace's case is the first action launched by the U.S. Federal Trade Commission in its crackdown on ad-ware and spam. He has also allegedly been selling software called "Spy Wiper" and "Spy Deleter" that the FTC says doesn't work.
The Globe and Mail, Mathew Ingram's Globe and Mail Update, October 25, 2004
You don't have to be very smart to be a spammer, just totally lacking in morality. A great site that can be used to track SPAMmers down to their ISP or web provider is Sam Spade (dot org). I use this site frequently and am grateful to steve@blighty.com who hosts this site and keeps its software running in the face of outraged spammers.
At the time I wrote this some people estimate that the industry wide cost for SPAM is about $10 billion (US). The SPAM problem has been growning, day by day, week by week, month by month. We definitely see this at bearcave.com. Yet there are no laws that have any teeth that attack the SPAM problem, nation wide, in the United States. Why is this? An excellent answer is provided by Keith H. Hammonds in his Fast Company article The Dirty Little Secret About Spam (Fast Company, August 2003, Issue 73, Pg. 84). The sub-title is: What J.P. Morgan Chase and Kraft want is exactly what the guys peddling porn and gambling want: free access to your inbox. That's why there's no easy solution to a problem that could soon make the world's email system crash and burn.. The point made by Mr. Hammonds is that laws have not been passed because "direct marketing" companies and large corporations have fought any laws that might stop them from SPAMming.
Banks, morgages companies and other large comporations, like Kraft which sells the widely spammed "Gevalia Kaffe", pay for leads or affiliates that bring them buyers. Spammers provide these leads, sometimes through multiple "cutouts". In Who profits from spam? Surprise: Many companies with names you know are benefiting by Bob Sullivan (MSNBC, August 8, 2003) followed the path from spammer to the company that benifited (and ultimately paid the spammer). These companies all denied that they had anything to do with spam. But the system was set up so that they did not know who provided the "leads" or affiliate links.
One company arms both sides in spam war, By Saul Hansell, November 25, 2003, The New York Times, republished on news.com
This article discusses a company called IronPort, which makes a specialized computer system described by some as a "spam cannon". Apparently this system is designed to rapidly send vast amounts of e-mail. Some believe that IronPort's customers are spammers. IronPort claims that their products are used to send e-mail only to those how have "opted-in".
Ironicaly, IronPort has recently purchased SpamCop. SpamCop runs a spammer blacklist which includes IronPort's customers, according to this article. After being repeatedly attacked by spammers, Julian Haight, who runs SpamCop, was forced to look for a buyer.
Report: A third of spam spread by RAT-infested PCs By Munir Kotadia, December 3, 2003, CNET News.com
RAT is the acronym for Remote Access Trojan. This article discusses an alarming trend where spammers are using viruses and other computer security attacks in order to take over computers so that they can be used to send SPAM, or in some cases Distributed Denial of Service attacks (DDOS). Such a DDOS attack was recently launched at the anti-spam site run by the Spamhaus Project (see reference below).
New e-mail worm targets antispammers, By Reuters, December 3, 2003 (published on news.com)
This article discusses a distributed denial of service attack that targeted the Spamhaus Project. This attack was launched from virus infected computers.
Sending someone spam is the quickest way possible of demonstrating that you are a clueless Internet user who knows little about technology. In particular sending spam to people who, at one time, had e-mail addresses that had "!" characters in them is a quick way to become very unpopular. These people remember an Internet without spam. Spamming people like this with your resume is the quickest way to get identified as someone they don't want to work with, as a gentleman named Bernard Shifman demonstrates. Finally, as Mr. Shifman should have known, the Internet has a long memory (sometimes disturbingly so given Google's USENET archives). It is not a good idea to send anyone you are not intimate friends with e-mail that you would not want posted on a company bulletin board (you know, the old kind, made of cork).
The mass layoffs that resulted form the 2000/2001 "dotcom dieoff" has lead more desperate people than Mr. Shifman to send out their resume in spam e-mail. I suppose that it is now a phenomena, since there is a Washington Post article discussing resume spam.
There are a variety of reasons that bearcave.com exists, but one of them is to open professional doors for the humble author of this site. The lesson here is, content, not spam.
Since I registered bearcave.com, I have fought spam by tracking down those who sent my account spam and getting their accounts and web pages canceled. My efforts and the efforts of others in the anti-spam Jihad have had little effect. The tide of spam keeps rising. Currently I get between 50 and 100 spam e-mails a day. So I finally threw in the towel and wrote an e-mail filter for my UNIX shell account. The C++ source code for the mail filter, along with its documentation is published here. This mail filter not only gets rid of the vast majority of the spam, but it also gets rid of stupid bear mail (see below).
The spam filter above allowed me to continue using my iank at bearcave.com email address. However, I still got vast amounts of spam: a few hundred junk emails in my junk "folder". My spam filter is pretty accurate, but once in a while it would throw valid email into the junk folder. When my junk email got into the hundreds of emails, I would frequently delete it. On a few occasions I lost emails that I would have wanted to read. These problems made it clear that yet another version of this spam filter would be needed. I was happy to discover a Google Gmail based solution.
Google's Gmail provides directions for setting up a Gmail account so that it handles domain email. For example, I have routed email to iank at bearcave.com to my Gmail account. The GMail directions include the MX record settings to give your ISP for forward your email.
Paul Graham's Web pages on filtering out spam
Paul Graham is the author of at least two books on Lisp and some elegantly written and thoughtful essays. On of my favorite is The Hundred-Year Language, which is a thoughtful essay on the evolution of languages for designing software. Paul also wrote this Web page which consists of links to his work on filtering spam, particularly using Baysian techniques.
SpamBayes: A Bayesian anti-spam classifer written in Python
SpamBayes is an open source Python implementation that started with Paul Graham's Bayesian spam filtering algorithm. They claim that they have improved on this algorithm. The spam filter is available as an Microsoft Outlook plugin, a POP proxy filter, or a procmail filter.
Gary Robinson's Rants: SPAM Detection
This is a link rich web page which discusses the technical detials of various spam filtering techniques.
Field Guide to SPAM by John Graham-Cumming
Compiled by Dr. John Graham-Cumming, a leading anti-spam researcher and member of the ActiveState Anti-Spam Task Force, the ActiveState Field Guide to Spam is a selection of the tricks spammers use to hide their messages from filters, providing examples taken from real-world spam messages.
From the Field Guide to SPAM web page
Other Bear Web sites of various flavors
I sort of lumber around and take life at a slow considered pace. I am getting hairier by the day and I'm a fairly big guy (6'1"). So, in short, I'm just an ursine sort of person. So when got my domain in 1995, the name bearcave seemed to make sense. As it turns out, there are other people who consider themselves "bears". While browsing around on the Web, I met some really nice people who consider themselves Bears also (there is even an IRC channel named #bearcave, which is totally independent of this Web site).
The most common definition of a "bear" is a homosexual or bisexual man who is hairy, has facial hair, and a cuddly body. However, the word "Bear" means many things to different people, even within the bear movement. Many men who do not have one or all of these characteristics define themselves as bears, making the term a very loose one. Suffice it to say, "bear" is often defined as more of an attitude than anything else - a sense of comfort with our natural masculinity and bodies that is not slavish to the vogues of male attractiveness that is so common in gay circles and the culture at large.
Resources for Bears FAQ
I corresponded a bit with a couple of Bears (Scott and Kevyn), who seem like really nice people. There is little enough love and warmth in this world and these Bears seem to be loving and warm people. Although I happen to be straight, I would be honored to be considered a Bear.
In general I hold the essayist, writer and professional pundit Andrew Sullivan in contempt (gay Republican Bush II supporter pretty much summarizes it). But he did write a good essay on bears (of the gay variety), published in Salon.
Stupid Bear Mail
OK, some bears are way cool. Having said this, let me take a moment to comment on bearcave.org. This now defunct "bear" site once offered free Web e-mail accounts. Back then the ISP that hosted bearcave.com routed all of the email that went to bearcave.com to my email account. As it turned out, some of the bearcave.ORG users would enter .com when they meant .org. A few others intended to send their email to bearcave.net, the more intelligent "bear" site that at one time hosted several users (currently it seems to be exclusively Brian's web site).
There are those who don't really understand what domains are and think that it would be cool to have an address like clueless_dweeb@bearcave.com. One misdirected note had an enclosed picture of the guy's cock and asked the addressee to send him a picture of his. My only comment is that being gay does not excuse acting like a teenage boy. If you're sending e-mail like this out you should grow up or move to Palm Springs. For a selection of e-mail from bears trolling for other bears, click here.
In order to use my e-mail in the face of the massive stream of spam that gets directed to anyone with web pages and published e-mail addresses, I implemented a spam filter which put the misdirected "bear" mail in my junk "folder", along with the spam.
The Evolution of Bears by Don Middleton, published on The Bear Den Web page. The Bear Den Web page has lots of information on bears (the real ones, not the human kind) and links to other bear sites on the Web.
The impact of mankind on our planet is causing one of the largest species die outs in geologic history. Some species of bear, like the Giant Panda Bear, are near extinction. Gary Coulbourne and Phil Pollard have created the www.bears.org Web site "dedicated to the preservation of the accurate Bear beliefs". Gary writes:
Many of the species of bears in the world are slowly dwindling. We are not alone on the Earth. Thousands of other forms of life live here too. If we are to survive, we must do it together. Not just humans and bears, but humans and everything that lives
Among the pages on this site is an interesting set of pages on the various species of bear. When I looked at the Web page, it was still growing. Gary and Phil's page looks like it will be an important contribution to ursine information on the Web.
Beautiful Kitchen Knives
At the Bearcave we love good food (and other sensual experiences). I've been cooking since I was a little kid (my sister was given a Susie Homemaker oven which I quickly appropriated (hey, I shared the cakes with her). Good knives are important tools for any chef. I had been using a terrible Henkel knife that my mother gave me when I went off to college. The knife would not hold and edge and I had to sharpen it all them time. After spending years suffering with this knife I decided to buy a new 6-inch kitchen knife.
Some of the kitchen knives that caught my eye are made in Japan, using the techniques that are similar to those used to make samurai swords. The web site Japanese Woodworking Tools has an amazing selection of beautiful knives, in a range of prices (from about $70 US to over $1000 US). The 4" Ryusen Damascus paring knife and the 6" Ryusen Damascus small slicing knife are shown below:
I own both of these knives and I absolutely love them. They are not cheap, but they should last a lifetime and then can be passed on to your heirs. The folded steel construction makes the blades very strong, so the knives can be remarkable thin. They hold a very sharp edge. In fact so sharp that you really need to threat these knives with respect (we never leave them in the sink or the dish drainer to avoid any accidents).
One word of caution about these knives: the thinness of the blades makes them more fragil than thicker chef's knives. I was slicing down the lenght of some snow crab legs with my 6" knife and I shattered parts of the edge. Fortunately I was able to repair it with a Sharpton Water Stone (also sold by The Japan Woodworker Catalog). So if you have these knives, it's useful to have a thick chef's knife as well.
An article on Japanese knives and the chefs who love them can be found in This blade slices, it dices by Harris Salat, February 1, 2008, Salon.com
La Quercia Artisan Cured Meats (pronounced La Kwair-cha)
I have gone through some evolution when it comes to cured meats like prosciutto. Cured pork is not cooked and the idea of eating uncooked pork bothered me for some time. After visiting Italy twice, I could not escape the fact that cured pork tasts pretty good. To reconcile the two conflicting ideas: cured pork tastes good and eating uncooked pork is bad I came up with the following train of thought. Cooking meat denatures the proteins, which makes the proteins more difficult for bacteria to digest (but makes it easier for humans to digest). This is why cooked meat keeps longer than raw meat. Curing meat also denatures the proteins in the meat. Cured meat also contains preservatives like salt and sometimes smoke. The process of curing meat is similar to cooking. Ergo, cured meat is similar to cooked meat, so it's OK to eat cured pork. That's my story and I'm sticking to it.
After returning from Italy I started buying Italian prosciutto from CostCo. I read about La Quercia Artisan Cured Meats in the New York Times. I like to support US artisans, so I ordered the La Quercia kitchen sampler. The prosciutto is fantastic, better than the Italian prosciutto. The sampler comes with Prosciutto Americano crumble and it took me weeks to go through it. A couple of teaspoons adds wonderful flavor to an omlette with aged cheddar cheese.
The pigs that become the La Quercia meat are all free range and are raised humanely:
All of the pork we use comes from suppliers who subscribe to humane practices. To us this means that the animals have access to the out of doors, have room to move around and socially congregate, and root in deep bedding. We do not use meat from animals that have been given subtherapeutic doses of antibiotics or kept in large animal confinement facilities.
The meat for our Prosciutto Americano is all from antibiotic free, animal by product free, hormone free pork.
For La Quercia Rossa, our Heirloom Breed Culaccia, we use Berkshire meat from animals that have not had subtherapeutic antibiotics and have had no antibiotics at all for at least 100 days prior to harvest.
Americans eat too much meat and we'd be better off if we ate half as much meat and paid twice as much for animals that are raised well.
At least on paper this system is as close as any commercial system I've seen to the Nebraska/ADVISE Distributed Semantic Graph System Database System I worked on at Lawrence Livermore. They seem to be thinking about the important issues, like entity disambiguation and security. They also seem to be trying to integrate unstructured data input via Attensity (good luck with that). Visual Analytics also claims to be "partnering" with TARGUSInfo, which is a consumer data mining company.
Investigtive Analysis Software
Investigative Analysis Software is a British company that has developed Analyst's Notebook and Analyst's Workstation. They have also purchased a small company which used to be on this list called Anacubis which makes software that allows graph browsing and construction of networks of information. Interestingly, this product is aimed at "commercial intelligence". Obviously a tools like this has applications in other areas (assumign that it provides some level of power).
Cogito was named for one of the few propositions that are provable without reference to the senses, Cogito Ergo Sum (I think, therefore I am). Cogito has developed a database system based on semantic graph technology.
IBM Entity Analytics Solutions. IBM purchased Jeff Jonas' company SRD (Systems Research and Development), which made a product called NORA. Apparently Jonas' company lives on in Nevada as a division of IBM, with Jeff Jonas becoming an IBM distinguished engineer and the chief scientist of IBM's Entity Analytics division. See also IBM Beefs Up Criminal Detection/Analytics Software (news.yahoo.com)
SRD's original web pages stated:
NORA. is a software solution that uses SRD's unique Entity Resolution. technology to cross-reference databases and identify potentially alarming non-obvious relationships among and between individuals and companies. Companies use the generated NORA Intelligence Reports and alerts to focus investigative and audit resources on areas of real concern.
NORA is apparently used by gambling casinos to identify card counters and cheaters (casinos seem to view these groups as roughly similar, since both groups are engaged in taking the casino's money, rather then the casino taking theirs).
For more on Systems Research and Development:
IBM semantic graph related research projects:
Web Fountain
Web Fountain is a project at the IBM Almeden Research Center involved with data mining the Web. According to the New York Times (Entrepreneurs see a web guided by Common Sense by John Markoff, November 12, 2006), Web Fountain "has been used to determine the attitudes of young people on death for an insurance company was able to choose between the terms "utility computer" and "grid computing" for an I.B.B. branding effort." IBM has also used Web Fountain "to do market research for television networks on the pop;ularity of showns by mining a popular online community site" and "mining the 'buzz' on college music Web sites". In this last case "researchers were able to predict songs that woujld hit the top of thte pop charts in the next two weeks." The IBM researcher mentioned in reference to Web Fountain is Daniel Gruhl.
From the Visible Path web pages:
The Visible Path platform applies the science of social network analysis to allow professionals to access the entire enterprise's trusted relationship network without invading privacy or compromising relationships. The platform integrates tightly with corporate SFA, CRM and business intelligence applications to measurably accelerate sales cycles, increase close rates, and reduce the cost of lead generation and customer acquisition.
According to a December 4, 2007 CNET article, Visible Path is being acquired:
Visible Path, which makes social-networking tools for business users, is set to be acquired.
A company representative on Tuesday said that Visible Path has signed a term sheet with a multibillion international company to sell the firm. She said the service is expected to continue operating.
Visible Path has confirmed the buyout, but the identify of those clueless enough to purchase this company remains in question as I write this (Dec. 7, 2007).
Apparently the Visible Path management "will not be continuing on" after the acquision. I have heard (but not confirmed) that the Visible Path engineering group has been entirely offshored to India.
Given the meager Visible Path technology, Kleiner Perkins will be lucky to get their money back, much less the factor of ten return that Venture Capitalists like to get.
One of the founders of Radar Networks is Nova Spivack. His Minding the Planet blog can be found here (apparently also at www.mindingtheplanet.net). According to a New York Times article (Entrepreneurs see a web guided by Common Sense by John Markoff, November 12, 2006):
Radar Networks, for example, is one of several working to exploit the content of social computing sites, which allow users to collaboratet in gathering and adding their thoughts to a wide array of conent, from travel to movies.
Radar's technologhy is based on a next-generation database system that stores associations, such as one person's relationship to another (colleague, friend, brother), rather than specific items like text or numbers.
Radar Networks has a product called Twine which builds semantic content on the Twine site. An hour long demo, with the Radar Networks founder, Nova Spivak can be found on Robert Scoble's blog here. I tried to add a comment, which didn't seem to get posted. I was able to grab it from the web page and I've pasted it in below:
One of the features that is very powerful with the Web and the Internet that it is built on is that it is decentralized. Short of destroying technological civilization there is no way to destroy the Internet and the Web. Nor can the web be controlled. It is distributed on vast numbers of computer systems that are scattered around the world. Sites like Google do mirror much of the web, but the data exists elsewhere and is just mirrored on the Google mega server farm.
Currently Twine is entirely centralized. This creates a variety of problems. Twine has control of your Twine content and it exists in one logical place, the Twine server farm. Twine is not distributed through the Internet. Even if Twine opens their API, everything is still on Twine. I suppose that you could suck your graph off in something like RDF and put it on another system, but these systems don't exist yet.
It could be argued that FaceBook, LinkedIn and MySpace also control all content. At least some of these sites have probably also had to manage exponential growth. But Twine is not Semantic Web. Twine is a site that builds semantic graphs and presumably supports some of the semantic web standards. Perhaps the idea is that Twine is a seed from which the Semantic Web will grow. This remains to be seen.
There are also some serious scalability problems, which I at least, have not idea how to solve. As the size of the Twine graph grows, the link structure of the graph will grow as well. Exponential growth is easy to imagine with Twine data and their link structure. This will require a similar increase in their hardware support. Even if they purchase the necessary hardware infrastructure, some kinds of queries will become more difficult. For example, path traversal between two points (e.g., the connection in the graph between Ian Kaplan and Nova Spivack). As the size of the graph increases, each "hop" in that connection can bring in more links entities. This problem could be especially bad because Twine seeks to forge topic links automatically throughout the graph.
On the hype end of things: automatic entity extraction, the automatic recognition of names, places and organizations, is error prone. We can argue about how error prone, but I have yet to see an entity extraction system that does not mark "bank of cooling towers" as an organization (perhaps a financial organization). If these errors are not filtered out by a human, there will be some amount of "cruft" build-up of false links.
Link extraction is even more difficult. For example, the link between Paul Wolfowitz and Iraq. In Twine links are simply links through the entities (e.g., I have a reference to Wolfowitz and it become linked to other content with the same word).
Semantic Web has not been very promising so far because there was no motivation to use it and no tools to automatically build content. Twine is interesting because it is actually a step toward motivating semantic graph use. But it still seems like an early step.
21st Century Technologies appears to be a small software consulting and product company (with a very impressive staff). They provide a variety of services based on the background of their staff. 21st Century Technologies is on this list because they do semantic graph data mining and analysis. In fact, reading their description of their Lynxeon system was like reading the description of the ADVISE system I worked on. It is not clear from the web site that Lynxeon is a core focus of their company, however. Building large scale semantic graph systems takes a lot of resources. Whether the Lynxeon system lives up to their description remains to be seen.
Danny Hillis, someone who keeps turning up in my list of interests, is one of (or the) founder of Metaweb Technologies. Metaweb is, apparently, doing something with semantic graphs or social networks. At the time this was written they were in "stealth" mode and what they are actually doing has not been publicly disclosed.
Despite the fact that few people have probably heard of Metaweb, they apparently think that they are Google (or the next Google). Perhaps Hillis thinks his past record with start up companies justifies this view. Or perhaps it is the amazing quality of Hillis' Phd thesis (which became the book The Connection Machine) that justifies this view. Or perhaps Hillis is a legend in his own mind. What ever the case, in order to apply for a job as a database engineer they want you to answer the following questions (which I'm quoting here under the copyright doctrine of fair use):
To apply, please respond to the following four questions in your cover letter. Brevity is the soul of wit.
1. Programming Languages have changed very little in the past 30 years: OO (Smalltalk) dates from the mid seventies. Closures and continuations (Scheme) were invented in the late seventies. List comprehensions, and other lazy functional constructs date from the early eighties. Is this vocabulary "it" for programming?
2. Have you ever built something you could have bought? If so, what and why?
3. What is your favorite time of day?
4. Imagine a graph that consists of directional links between nodes identified by small non-negative integers < 2**16. We define a "cycle" in the graph as a nonempty set of links that connect a node to itself.
Imagine an application that allows insertion of links, but wants to prevent insertion of links that close cycles in the graph.
For example, starting from an empty graph, inserting links 1
2 and 2
3 would succeed; but inserting a third link 3
1 would fail, since it would close the cycle 1
2
3
1. However, inserting a link 1
3 instead would succeed.
In your favorite programming language, declare data structures to represent your graph, and give code for an "insert link" function that fails if a new link would close a cycle. What, roughly, is the space- and time-complexity of your solution?
Instructions
Please submit cover letters and resumes in plain text or HTML only to (their email address) and include your answers to the questions above.
Perhaps the job market in the computer industry is so bad now that people will actually respond to all of these questions just for the chance of getting a job at an unknown startup company. This kind of game, that employers play with prospective employees, is what makes me glad that I have a job that I'm currently happy with.
How much you are willing to tolerate these little job interview questions probably depends on how desperate you are to get the job and whether you find the question interesting. I will confess that I did find one of MetaWeb's questions (for a "Semantic Tools Engineer") interesting:
Mark V. Shaney is an ancient Usenet bot that generated realistic (for some value of reality) prose that fooled many educated people into thinking a human was the author. (See http://groups.google.com/group/net.singles/msg/531b9a2ef72fe58 for an example.) Describe succinctly an approach, algorithm, or technique you would use to automatically distinguish Mark's prose from human prose, assuming you don't have access to his compiled program or source code.
According to Wikipedia the "Mark V. Shaney" text was created with a Markov model which mirrors the statistical distribution of words in english text. So it can be assumed that simple statistical tests on the text will not distinguish it from human written text. The text also appears to have relatively correct grammar structure, so simple grammar parsing does not look like it would be a good test either. The irony is that the software to differentiate such synthetic text from real text may be much more complicated that the software that created the text.
As of May 2007 MetaWeb has disclosed a bit more about what they do, although their business model is still obscure to ordinary mortals like myself. MetaWeb has spun off the an a site called freebase. Freebase is a term most often associated with hard core cocaine use, but in this case refers to free + database = freebase according to the web site. Freebase has been created with MetaWeb's technology. MetaWeb is looking for engineers with experience with the Semantic Web, so perhaps MetaWeb provides semantic mark up and semantic graph technology. As it noted, my work and interests have an odd intersection with Hillis'.
What is Freebase?
Freebase.com is home to a global knowledge base: a structured, searchable, writeable and editable database built by a community of contributors, and open to everyone. It could be described as a data commons. Freebase.com is enabled by the technology of Metaweb, which is described at www.metaweb.com.
Obviously this sounds a great deal like Wikipedia. To this freebase provides the cute and content free answer:
How is Freebase different than the Wikipedia?
It's an apple versus an orange: each is deliciously different. Wikipedia is an encyclopedia with information arranged in the form of articles. Freebase is more of an almanac, organized like a database, and readable by people or software. Wikipedia and Freebase both appeal to people who love to use and organize information. In fact, many of the founding contributors to Freebase are also active in the Wikipedia community. Whenever Freebase and Wikipedia cover the same topic, Freebase will link to the Wikipedia article to make it easy for users to access the best of both sites.
Given the vast and growing size of Wikipedia, I'm not sure what content freebase thinks to publish that is not on Wikipedia. Wikipedia attempts to avoid gross bias and overt commercialism, perhaps this will be freebase's forte.
This North Carolina based company seems to build learning algorithms on top of graphs. It's not clear whether these are semantic graphs or not. The company was founded by people with experience in neural nets and learning systems.
Analytic Technologies is a small software company run by Steve Borgatti and Roberta Chase. Steve is a professor at the University of Kentucky where he teaches social network theory and analysis. The social analysis software is targeted at small graphs and, with refreshing candor, Prof. Borgatti writes that it goes slow around 10K vertices.
(Graphs are also frequently called networks)
The idea of representing relationships as graphs is both very old and very new. The idea has been around for a long time, but people are just starting to structure database systems in this way. See my long review of the book Linked, which discusses self-organizing complexity and networks. This review includes links to related literature and links to graph (network) visualization software.
Simile: Semantic Interoperability of Metadata and Information in unLike Environments
Simile is an MIT group that is working on a variety of useful Web software which can be used to build and process semantic web content. Simile includes tools for RDF, web page processing ("screen scraping") and building semantic content from other web pages.
InFlow Software from orgnet.com
Orgnet.com is a consulting company that appears to also sell a software tool that they claim you can use to analyze your organization to understand the self-organizing network relations.
The Semantic Web and the Resource Description Framework (RDF)
Arguably the most well know graph research project is the Semantic Web which is sponsored by W3C (the World Wide Web Consortium). Tim Burners-Lee is the Semantic Web's most well known proponent.
The Resource Description Framework is apparently a language and software framework for publishing and discovering semantic web information. A very "link rich" review, written by Brian Donovan, of the book Practical RDF: Solving Problems with the Resource Description Framework by Shelley Powers, O'Reilly, July 2003, was published on Slashdot.
An Introduction to RDF by Ian Davis
Uncloaking Terrorist Networks by Valdis E. Krebs, First Monday, Vol. 7, No. 4, April 2002
Valdis Krebs is involved (or perhaps the principle behind) www.orgnet.com, mentioned above.
This is a somewhat primative tool that allows graphs to be constructed from google results and other data.
WebGraph Department of Science and Information, Universita degli Studi di Milano
From the abstract of the paper The WebGraph framework I: Compression techniques by Paolo Boldi and Sebastiano Vigna, Technical Report 293-03, Universitartimento di Scienze dell'Informazione, 2003:
Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow to store [sic] a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This papers presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.
PuTTY: an Open Source Telnet by Simon Tatham
At the Bearcave we do not use Microsoft's Outlook Express. This software is simply an invitation for viral infection (for some examples see my web page Why are there still e-mail viruses?). If we really, really need a Windows based e-mail program we use Eudora. Interestingly, Eudora is also the choice for the email program used at the Lawrence Livermore National Labs., where security is taken very seriously.
In general we read our e-mail on our shell accounts on Idiom. This means that we use Telnet a lot. This was no problem when we lived in California with its high Internet router density, but it became a problem when we moved to New Mexico, where there can be significant lag over the Internet. Telnet is particularly sensitive to this lag and the Microsoft Windows Telnet program has a habit of hanging up the connection while we are in the middle of an edit. Obviously this is pretty irritating, so I looked around for an alternative. One of my colleagues pointed me to PuTTY. I just started using this program. It seems to be slower than Windows Telnet. On a 56K baud connection I can type slightly faster than PuTTY sends and echos characters back. It handles window resizing properly, sending windows size information to the UNIX system properly.
Unfortunately it turned out that PuTTY was no more tolerant of network lag (timeout) than Microsoft's telnet. However, PuTTY supports a number of nice features, including SSH (encrypted) login. This makes it a good choice for network links that are monitored.
I think that breaking into computer systems and destroying information is both juvenile and criminal. I never cease to be amazed that the people who do this don't have something better to do with their time and talents. However, writing btp gave me an apprecation for client-server software. I'm also very irritated with Microsoft, which seems to take a very cavalier attitude toward system security. So I admire the work of the Cult of The Dead Cow. They have written a client-server program called Back Orifice 2000, which can be used to remotely and control a Windows PC over the Internet. This control can optionally be done covertly, which has lead to the controversy surrounding The Cult of the Dead Cow. Back Orifice is available as Free Source, which is the only way I would download it since The Cult of the Dead Cow notes:
1999-07-13:
BO2K Defcon CDs accidentally tainted with Chernobyl virus. We did it, it's our mistake. If you got a BO2K CD or downloaded the software from a 3rd party mirror, you need to scan your system for CIH v1.2 TTIT.
So look at the source code before you compile the program. There have been a number of cases where trojan horse code has shown up in open source software (for example, in DNS support code). Despite all this, the sophisticated features and Open Source make Back Orifice a nice program to learn from. Programs like Back Orifice would be much less dangerous if Microsoft paid any real attention to security. The fact that executable e-mail enclosures can install a trojan horse like Back Orifice on a Microsoft system is inexcusable on Microsoft's part. Given the current state of security on Windows NT, I would never use a Windows NT system as an Internet front end.
Update: January 13, 2000
The Cult of the Dead Cow web page seems to be a dead end. All that is seen is something that says "filling ditches with snitches, head up cDc 2000". I did not find any links to click elsewhere. The cDc folk are also associated with L0pht Heavy Industries. The L0pht folk have recently formed a venture funded company known as @Stake, which specializes in computer security issues.
Update: September 29, 2003
The company @Stake does not resemble the sort of company that I would have expected the CdC/L0pht people to be involved with. @Stake recently announced that they fired their Chief Technology Officer, Daniel Geer for co-authoring a report critical of Microsoft. Given that the L0pht fold wrote Back Orifice, this is a strange turn of events. I have moved my notes on Daniel Geer's firing and @Stake off this web page, since this web page is already so long. These notes can be found here.
Update: August 7, 2005
The Cult of the Dead Cow web page is back. I'm trying to get them to sell Cult of the Dead Cow teeshirts with the way cool graphic from the front page (La Vaca es Muerto).
Snort is a open source packet sniffer and intrusion detection package. It is described in the article Snort - Lightweight Intrusion Detection for Networks by Martin Roesch. Snort runs on both Unix systems and Windows systems.
Ethereal is another network sniffing program. Like many tools Ethereal can be used a number of ways. I've used it to track down network problems on my local network.
Those who do not know history are doomed to repeat it.
The study of an academic discipline like computer science is, in large part, the study of history. We learn what others have done before. With Linux there seems to be an entire generation that is almost willfully ignorant of history. Linux has a monolithic kernel which is now so big that it is a struggle for anyone to understand it. Before Linux other operating systems like Mach, BSD Unix and even Windows NT were moving away from a monolithic structure into an operating system composed of components.
UNIX has a long history. The first version of UNIX that I used did not have virtual memory (e.g., System III). Once UNIX escaped from Bell Labs different versions started to develop. There was AT&T's version, System V, Sun's version (called SunOS back then), and the Berkeley version. The Berkeley version of UNIX had TCP/IP networking first and had a huge influence on the course of UNIX developement. Once the AT&T code was removed from UNIX, an open source version of the Berkeley Standard Development (BS) release was born. Versions of BSD calved off as well. James Howard has written a great article titled The BSD Family Tree which describes the various BSD versions and other BSD derived UNIXs.
What ever the virtues of BSD UNIX, Linux has now become the UNIX standard. There are number of reasons I like to use Linux for development. The operating system is transparent and easier to understand than Windows (NT, 2000, XP or what ever Microsoft calls their latest release). But despite what the Slashdot crowd claims, all is not wonderful on Linux. There are not many options when it comes to software to burn CD-ROMs and especially DVD-ROMs. I purchased a Plextor DVD-RW drive so that I could back up my hard drives. Here are some notes on CD/DVD-ROM software for linux:
Most people use X-CD-Roast on Linux. While I've succeeded in using this software to make copies of my music CDs (for personal use at work, I should add), I have never gotten it to work with a DVD (although I've destroyed several DVDs in the process). I find the user interface difficult and the documentation inadequate.
This software, for the Linux KDE Window system, looks really promising. I have not tried it yet, but I have hopes that it will be better than X-CD-Roast. One point of concern is that the documentation link does not point to anything but links to discussion groups. There is, apparently, a book in german on the software. So I will hope for an english translation.
Dvdrtools dvdrecord - dvd-rw/dvd-r made easy and free
Lets see, CD-Roast uses cdrecord. The cdrecord software is open source and this is a branch off of the cdrecord tree. I guess my comment is "free, perhaps, but easy?"
DVD+RW/+R/-R[W] for Linux by Andy Polyadov
Some notes by Andy Polyadov on CD/DVD tools on Linux. Some of this material may be a bit dated.
A Slashdot discussion: Free DVD Recording Tool for Linux?
A Slashdot discussion on DVD recording tools. As I recall the summary was that CD-Roast is about the best there is.
This section includes links to various software applications that I've found useful.
Audiograbber: CD-ripper for Windows
I am preparing to move my office at work into an area where personal electronics and music CDs are not allowed. MP3 files are allowed, however. So I looked around for a music CD "ripper" that I could use "rip" my music CDs to MP3 files and store them on my computer's hard disk. I have been very happy to find the freeware Audiograbber application. Audiograbber is very easy to use, has a small disk footprint and is not infected with spyware. I am also pleasantly surprised at how small the compressed MP3 files are. I'm sure that I'm losing some quality compared to a music CD, but I'm playing the music on cheap computer speakers, so I have not noticed any difference.
The evil RIAA stalks the earth looking for grandmothers who might be downloading rock oldies without paying the members of the RIAA (the horror, the horror). I'm sure that the RIAA dislikes programs like Audiograbber because this software makes it easy to convert music CDs into MP3 files. However, despite what the RIAA would like to think, it is still legal to make a personal copy, for your own use, of the music that you have purchased. That is exactly what I have done.
While I despise the heavy handed tactics of the RIAA and the general stupidity of the major music publishers (suing your customers, there's a good business model), I strongly believe that people should be paid for creating music (or books or software). So I would never publish music that I've ripped as a matter of principle.
At Bear Products International we are involved in compiler development, including software tools and compilers for Java. Compared to, for example, the IEEE Verilog standard, the Java Language Specification and the Java Virtual Machine Specification are models of clarity. However, there are still some holes. Some of these issues are discussed in The Java Spec Report. There is a lot of good material here, but its last update is listed as September 1998.
The www.bmsi.com Java Web page by Stuart D. Gathman publishes several interesting sources, including a Java class dependence analyzer. Java dependence analysis is interesting for a Java compiler because the compiler must compile classes that are referenced by the class being compiled. Clearly this is a recursive process.
Bill Venners, author of Inside the Java 2 Virtual Machine has a great web site with lots of information on the JVM, Java and Jini (the coolest Java technology yet released). The site is named after Venners' consulting company, Artima Software. You can click on the icon below to go to the Artima site.
Other bearcave.com Java related link pages:
Links to Web Pages Related to Compiling Java (to native and JVM byte code)
Sources for the Java Virtual Machine (JVM). This list concentrates on open source JVMs and JVMs for embedded systems. I have not included obvious sources like Sun and IBM.
This web page provides a limited set of links to compiler related resources on the Web.
Castor (from Exolab.org)
Castor supports serialization from Java to XML and to SQL (to update a database to provide persistance). The Castor software reads an XML Schema and generates Java classes which include code to "marshal" (write) to XML and "unmarshal" (read from) XML into the Java class.
I found Castors approach somewhat unexpected, at least for Java. Using Java introspection it is possible for Java software to discover the structure of an object. The object can then be written to or read from XML. In a non-XML context this is the operation defined by the Serializable interface.
Sun's Java 1.4 release (and later releases) includes an XMLEncoder and XMLDecoder which will encode and decode a Java class into XML form. John Zukowski, an active voice in the ANTLR community, wrote a brief tutorial on using the XMLEncoder and XMLDecoder classes. These XML serialization classes presumably makes use of the Java reflection API.
The Rogue Wave XML Object Link is a tool that reads XML schemas and generates C++ classes that include the marshalling and unmarshalling code. Although I like Rogue Wave software, I've always found it too expensive for use on my personal software projects, so I expect that this is another example of expensive Rogue Wave software. Perhaps its time to create an open source version.
Gnome libxml: the XML C parser and toolkit
This is an open source (MIT software license) C toolkit for parsing and validating XML. It apparently supports both "push" SAX type parsers, where you supply callback functions and "pull" parsers, where the parser requests the next token (this is a great feature, since many applications fall into this catagory). The parser can also generate DOM objects. All-in-all, it looks pretty good.
The Expat XML Parser developed by James Clark (no, not the Netscape guy).
Expat is an XML parser for C (or C++, using C linkage I assume). The main claim to fame, at least originally, for Expat, was speed. I don't know if this remains true compared to later versions of Xerces. However, this is the parser that was used for the Mozilla browser. Expat is probably smaller and easier to understand that Xerces, which can be a big plus.
James Clark has a software company in Bangkok, Thailand called Thai Open Source. Perhaps the name Expat is inspired by Mr. Clark's experience.
Building a Data Structure with Expat by David M. Howard (the link is on this web page, under publications).
David M. Howard's web page Building a Data Structure with Expat describes a limited version of Rogue Wave's XML Object Link. That is, a software tool that reads XML Schemas and generates C structs.
XP: a high-performance XML parser for Java
The Expat parser is targeted at C (or at least C linkage). XP is for Java. What makes XP interesting is that it seems to be designed for speed and it is does "pull" style (or demand style) parsing, not SAX callbacks. In the demand style parsing, the program doing XML semantics asks for the next token, which is delivered by the XML parser. In the SAX style the parser calls the semantic routines, which makes many applications difficult to implement. This is why people still tend to use DOM parsers.
The XP link above is from James Clark's XML Resources web page. This includes links to a number of tools that James Clark has developed for XML processing.
XML Pull is a fast Java XML parser with a very simple interface. Unlike SAX parsing, which produces events as it parses an XML document, XML pull allows a parser to be written that requests the next token. The XML Pull web site includes a brief tutorial and a discussion of how to use XML Pull with the Java XmlSerializer interface. I have also written some web pages on how XML Pull can be used and why it is better than SEX, ah SAX.
XPA: XML Processing for ANTLR, developed by Oliver Zeigermann
XPA allows SAX to be integrated with ANTLR. Oliver writes in the antlr-interest Yahoo discussion group:
It allows you to feed XML SAX events into ANTLR parsers as token streams. Optionally, if you do not care for space, you can create an AST from a SAX parser and transform it using ANTLR tree parsers.
The Resin XML Application Server from Caucho Technology
The Simple Object Access Protocol (SOAP) for XML is a way to pass data to and from Java servelets. The Resin XML application server is very fast and light weight. It is available for developing without fee and can be deployed in applications were no fee is charged (e.g., open source) without a license fee.
I've published some notes I wrote while installing and using Resin. These notes can be found here.
Unlike Resin, JBoss is an open source project. However, JBoss is definitely a commercial endeavor. Most open source projects are described in purely engineering terms. The project has these objectives, the software currently provides these features. One of the strange things about reading the JBoss web site is that it departs from simple concise engineering description, toward the language used by marketing people. At times the buzzwords start to get out of hand as well. For example:
The Aspect-Oritented Programming architecture of JBoss 4.0 enables it to provide a wide range of services, including object persistence, caching, replication, acidity, remoteness, transactions and security. The framework allows developers to write plain Java objects and apply these enterprise-type services later on in the development cycle -- without changing a line of Java code. This new concept of programming provides a clean separation between the system architect and the application developer. The iterative development process becomes more fluid as architectural design decisions can be made later on in the development process without changing any of your Java code. Entirely unique among Java-based application servers today, this architecture combines the simplicity of standard Java with the power of J2EE.
JBoss 4.0 brings Aspect-Oriented technology to Java through a pure 100% Java interface. Base on the new JBoss.org project Javassist, JBoss-AOP allows you to apply interceptor technology and patterns to plain Java classes and Dynamic Proxies.
JBoss has sometimes been given as an example of an open source project which actually brings in revenue. Does this mean that marketing and marketing driven exposition are necessary for a commerical enterprise?
Amazon has a SOAP/XML over HTTP interface to their software. They call this the Amazon Web Service.
IBM's Emerging Technologies Toolkit (ETTK)
The ETTK is based on SOAP and Apache AXIS. This is a package to support distributed computing and what IBM has come to call "autonomic technologies". The link above is under IBM's alphaWorks . In many cases projects move from alphaWorks to either a IBM product or an open source project.
Before there was XML, there was IDL (the Interface Definition Language). Sun's Java platform also supports IDL and CORBA. As the above link entries show, there is a vast and rich set of software to support XML. However, from a simple language pont of view it is not clear to me what advantages XML provides above and beyond IDL what IDL provided.
There was a theory that humans would not actually read XML. It would be created and consumed by software. XML's structure certainly reflects this. However, humans do read and write XML Schemas. XML is, arguably, more difficult to read than IDL.
XML related web pages on bearcave.com:
Notes on Resin, Axis, Servlets and SOAP.
This web page consists of my notes on installing the Resin HTTP server, AXIS SOAP support and Apache SOAP services.
This web page is intended to be a commentary on XML. Right now it is a set of annotated links. I plan to eventually finish this commentary and add the links above to this web page.
Somehow What is the Matrix is a more interesting question. But, like everything the evil empire foists on us (remember OLE, COM, DCOM and ActiveX), .Net is here whether we like it or not.
Microsoft .NET by DrPizza in ARStechnica, February 2002
Once Microsoft marketing gets ahold of something, it becomes entirely obscured by smoke, mirrors and the grand vision that Microsoft wants to sell you (which, of course, is only available on the Microsoft platform). This was certainly true of a now fading object technology called OLE. It is even more true of Microsoft's ".NET". As DrPizza (Peter Bright) writes at the start of his excellent and extensive article on Microsoft's .NET:
In a remarkable feat of journalistic sleight-of-hand, thousands of column inches in many "reputable" on-line publications have talked at length about .NET whilst remaining largely ignorant of its nature, purpose, and implementation. Ask what .NET is, and you'll receive a wide range of answers, few of them accurate, all of them conflicting. Confusion amongst the press is rampant.
The more common claims made of .NET are that it's a Java rip-off, or that it's subscription software. The truth is somewhat different.
What is impressive is that Mr. Bright seems to have written this article while an first year undergrad at the British Imperial College.
Microsoft's .NET seems to consist of two components: the common language interface (CLI), which is a compiler intermediate/virtual machine instruction set and Web services. The Web services are based on Microsoft's C# language and their web server technology. The real "point of the spear" as far as Microsoft is concerned is Web services, since this sells Microsoft operating systems and application software.
Microsoft .NET has not exactly taken the world by storm. There are several reasons for this. .NET is platform specific, although some componenets, like the CLI and C# are public standards. If you want to make use of .NET technology you pretty much have to do it on a Windows platform. However, a significant fraction of the web servers in the world use Apache or other non-Microsoft software. In most cases these non-Microsoft web servers run on UNIX or Linux. Java is also increasingly being used for Web services. Java has the advantage of running on UNIX, Linux and Windows. This appears to be a case where Microsoft attempted define a standard on the Windows platform and failed. .Net: 3 Years of the 'Vision' Thing by Peter Galli, eWeek, July 7, 2003, provides a brief discussion of the tepid adoption of .NET.
In Pursuit of Simplicity: the manuscripts of Edsger W. Dijkstra, University of Texas
Computer science and computer engineering are now getting old enough that the pioneers are starting to pass from us. Seymour Cray is gone, as is Edsger Dijkstra. I have read some of the essays published on this web site in Dijkstra's book Selected Writings on Computing: A Personal Perspective. Dijkstra come at computer science from an applied mathematics perspective. His computer science essays are interesting and educational. He was also a man of strong opinions and this makes his essays amusing. I am not convinced that viewing a 200,000 line compiler as a piece of mathematics is the right approach. Rather I'd say that the compiler is a set of data structures and transformations, with some mathematical algorithms like register coloring. So Dijkstra is certainly not the final work on software engineering and computer science. But he is someone who greatly influenced the field and this influence is felt today (Java has no GOTO).
I placed this link near the top of the computer science section because Paul Graham represents many of the things that I love about computer science. In Paul's essay Hackers and Painters he writes
When I finished grad school in computer science I went to art school to study painting. A lot of people seemed surprised that someone interested in computers would also be interested in painting. They seemed to think that hacking and painting were very different kinds of work-- that hacking was cold, precise, and methodical, and that painting was the frenzied expression of some primal urge.
Both of these images are wrong. Hacking and painting have a lot in common. In fact, of all the different types of people I've known, hackers and painters are among the most alike.
Like Don Knuth, Paul views computer science as both a science and an art. Sometimes I feel that this view is dying in the world around me. Computer science does not seem much valued anymore. Software engineering jobs have become hard to come by and employers seem to care most about whether you have mastered the latest buzzwords, not whether you have a deep background in computer science and software engineering. Job interviews frequently degenerate into inquisitions where the interviewee is asked to write a series of minor algorithms on a whiteboard. Fewer and fewer people seem interested in whether you can solve complex engineering problems. Like Knuth before him, Graham shows that there is more to computer science that whether you can pick the right Java class library component.
A slashdot discussion of Robert Milner's work on algorithmic proofs and proof of correctness.
This slashdot posting includes a number of interesting links to Robert Milner's work and related work by others including Peter Lee at CMU.
I've always been fascinated by the idea of applying theorem proving techniques to software. These techniques have been used in VLSI logic design tools to show that two designs are equivalent.
However, the application to software is more problematic. The problem in software is that one would like to assure that the software "does the right thing" or at least does not suffer from catastrophic defects. In theory you can show that a body of software implements a specification (which must, in turn, be specified in some formal language). But there is no way to prove that the formal specification is without catastrophic error in the context of the application.
Concepts, Techniques and Models of Computer Programming by Peter Van Roy and Seif Haridi
The authors write:
This textbook brings the computer science student a comprehensive and up-to-date presentation of all major programming concepts, techniques, and paradigms. It is designed for second-year to graduate courses in computer programming. It has the following notable features:
- Concurrency: the broadest presentation of practical concurrent programming available anywhere. All important paradigms are presented, including the three most practical ones: declarative concurrency, message-passing concurrency, and shared-state concurrency.
- Practicality: all examples can be run on the accompanying software development platform, the Mozart Programming System.
- Programming paradigms: the most complete integration of programming paradigms available anywhere.
- Formal semantics: a complete and simple formal semantics that lets practicing programmers predict behavior, execution time, and memory usage.
The book is organized around programming concepts. It starts with a small language containing just a few concepts. It shows how to design, write programs, and reason in this language. It then adds concepts one by one to the language to overcome limitations in expressiveness. In this way, it situates most well-known programming paradigms in a uniform framework. More than twenty paradigms are given, all represented as subsets of the multiparadigm language Oz.
Parsing Techniques - A Practical Guide by Dick Grune and Ceriel J.H. Jacobs
Parsing Techniques was originally published as a book. The authors write:
The latest publisher, Prentice Hall, claims their stock has run out. Specialized book shops may still have a copy or two. Prentice Hall has indicated that they will not reprint. Copyright of the book has been returned to us, and we are now (Spring 2003) working on a second edition, updated with recent developments. At the moment we are looking for a publisher.
I have not been able to find this book on any of the used book sites (e.g., abebooks.com, amazon.com used book sellers or alibris.com). Fortunately, Parsing Techniques is now published on-line by the authors. This is an extensive, practical, well regarded discussion on parsing.
ISO 14977 EBNF Standard (PDF)
Extended Bacus Naur Fform is the oldest language used to define grammars. Apparently there is an ISO standard (ISO 14977). The above link is a copy of the PDF in some one's home directory, so it is possible that this link will disappear.
Stratego: Strategies for Program Transformation
In a formal sense, the process of correctly compiling a program in a language like Java or C++ into byte code or processor assembly language is a process of rewriting the program from the input language into the lower level langauge. The distance between the input language (C++) and the target language (assembly language) means that the rewrite process takes many steps.
Stratego is described as:
Stratego is a modular language for the specification of fully automatic program transformation systems based on the paradigm of rewriting strategies. The construction of transformation systems with Stratego is supported by the XT bundle of transformation tools. The Stratego/XT distribution integrates Stratego and XT.
The Stratego compiler is, interestingly enough, implemented in Stratego (via bootstrapping).
I believe, by the way, that stratego is a latin term for general.
The Elegant compiler generator tool kit from Philips Research (released under the GNU copyleft).
From the Elegant web page:
What is elegant?
Elegant started as a compiler generator based on attributed grammars (the name stands for Exploiting Lazy Evaluation for the Grammar Attributes of Non-Terminals) and has grown into a full programming language. Although it has been inspired by the abstraction mechanisms found in modern functional languages, Elegant is an imperative language that does not discourage side-effects.
Elegant is written in Elegant. (Beware of any language implementation not written in that same language!) and has been used for internal use within Philips for about 15 years now. In this period, dozens of compilers have been built with Elegant. Elegant release 7 is distributed under the Gnu General Public License.
Front is front-end generator for Elegant. It will generate an Elegant attribute, a scanner and an implementation for the abstarct syntax tree from one BNF-like specification. Front can alternatively generate a front-end in C, using Bison and Flex. Click below for more details on the C-version of Front
My colleagues and I have been working on a query language which has grown to be fairly complex. One of the tools that is provided with Elegant is an EBNF to railroad diagram translator. Having syntax railroad diagrams would make our documentation easier to understand. But we have been unable to get this software to build and execute properly.
Like many large corporations, Philips has made big cutbacks in their research groups. I'm not sure that Elegant is still alive.
For other EBNF to graphic format (e.g., postscript) see:
Ebnf2ps: Peter's Syntax Diagram Drawing Tool
Unfortunately this software is written in Haskell98, which makes it a bit less portable than it would be if it were written in Java or C++.
Quoting from the comp.compilers posting that announced the availability of LLVM:
LLVM is a new infrastructure designed for compile-time, link-time, runtime, and "idle-time" optimization of programs from arbitrary programming languages.
LLVM uses a low-level, RISC-like, language-independent representation to analyze and optimize programs. Key features include explicit control flow, dataflow (SSA), and a language-independent type system that can capture the _operational behavior_ of high-level languages. The LLVM representation is low-level enough to represent arbitrary application and system code, yet is powerful enough to support aggressive "high-level" transformations. The LLVM infrastructure uses this representation to allow these optimizations to occur at compile-time, link-time and runtime.
Release 1.0 is intended to be a fully functional release of our compiler system for C and C++. As such, it includes the following:
- Front-ends for C and C++ based on GCC 3.4, supporting the full ANSI-standard C and C++ languages, plus many GCC extensions.
- A wide range of global scalar optimizations
- A link-time interprocedural optimization framework, with a rich set of analyses and transformations, including sophisticated whole-program pointer analysis and call graph construction.
- Native code generators for x86 and Sparc
- A JIT code generation system for x86 and Sparc
- A C back-end, useful for testing and to support other targets
- A test framework with a number of benchmark codes and some applications
- APIs and debugging tools to simplify rapid development
The Scheme Dialect of Lisp
How to Design Programs: An Introduction to Computing and Programming, by Matthias Felleisen, Robert Bruce Findler, Matthew Flatt and Shriram Krishnamurthi, MIT Press (paper book edition, 2001, on-line edition, 2003)
The DrScheme Programming Environment. The DrScheme Programming Environment runs on must platforms, including Windoz and Linux.
The TeachScheme! Project. A guide to teaching programming, via Scheme and How to Design Programs.
For some people the answer to "How to Design Programs" is "in LISP (or perhaps, Scheme). I have to confess that while I have worked in LISP and Scheme, I've never been very good. Not like I am in C++ or Java. Perhaps this is a result of having my mind warped by Pascal at an early age.
The book How to Design Programs appears to be a more approachable version of The Structure and Interpretation of Computer Programs (see below). This books seems to be an attempt at an everyperson's tutorial on programming. At least where "everyperson" has taken high school algebra. The DrScheme programming environment looks promising.
The Structure and Interpretation of Computer Programs
The Structure and Interpreation of Computer Programs is a classic computer science text book. I read somewhere that MIT is revising its CS and EE courses, but for many years this book was the text used by the MIT freshman core computer science course. This is an amazingly deep and wide ranging book. It is available via the above link on-line, in HTML form.
Partial Evaluation and Automatic Program Generation by N.D. Jones, C.K. Gomard, and P. Sestoft, with chapters by L.O. Andersen and T. Mogensen, 1993. In the preface the authors write:
This book is about partial evaluation, a program optimization technique also known as program specialization.
It presents general principles for constructing partial evaluators for a variety of programming languages, and it gives examples of applications and numerous references to the literature.
This web page publishes the complete text of this book in postscript and PDF format.
Rel - An Implementation of Date and Darwin's "Tutorial D"
The SQL expression language that is used to access and modify relational databases has been criticized (by C. J. Date, Hugh Darwen and Fabian Pascal) as not being faithful to the relational model. The critics of SQL claim that the shortcomings of SQL make is more difficult to use and to understand.
Rel is an experimental implementation of a language that is supposed to avoid the pitfalls of SQL. Rel is based on a language originally described in the book Foundation for Object/Relational Databases -- The Third Manifesto by C. J. Date Hugh Darwen, (Addison-Wesley, 1998 and published as a second edition in 2000). This book contains a section titled "Tutorial D" which describes the language.
The Beyond3D Web site has excellent articles on 3D graphics, hardware graphics accelerator architecture, products and much more. This site provides the kind of clear and detailed discussion of 3D graphics hardware that Ars Technica has been providing for processor architectures.
ReiserFS seems pretty dead right now. As I write this, Hans Reiser is on trial for the murder of his wife. This case has had many twists, including the revelation that Reiser's wife was involved with a man who claims to be a serial killer. Interestingly, the police have not arrested this individual.
Even if it were not for the tragic events surrounding Hans Reiser and
his family, work in the Linux community on file systems seems to have
eclipsed ReiserFS.
June 5, 2007
The ReiserFS is a file system for Linux. It does "journaling" which means that like data base transactions, a ReiserFS file system transaction is either complete or in a known state of process. If it is in a known state of process it can be either completed or removed. This avoids the horrible UNIX fsck (f-suck) which has been done for twenty years when UNIX systems boot.
The ReiserFS uses "fast balanced trees" and they claim that this gives it much better worst case performance.
The file system supports "plug-ins" so one can, for example, plug in an encryption module. Slashdot.com reported that the company that is developing ReiserFS (which is an open source project), Namesys, received $600K from DARPA to develop encryption plug-ins for the file system.
The ReiserFS web pages include documentation on high performance file system data structures. One of the great things about the ReiserFS project is that it is dedicated to sharing the computer science information behind ReiserFS, rather than keeping it secret.
the Journey Operating System has some interesting features. It apparently has a journaling file system (e.g., a file system that acts like a database, where operations are either committed or rolled back). Journey OS also includes something called a HyperQueue for interprocess communication and communication with the operating system. HyperQueue apparently avoids most context switches on what would usually be operating system calls.
Apparently Journey OS is largely or completely the work of J. Charles Kain. Sadly he has written very little about the design objectives of Journey OS or the architecture of the software components he has published so far. Perhaps looking at the example of Linux he believes that if you publish the software, they will come, even thought there is little in the way of documentation beyond an overview. After all, getting the source code out there is the important thing, right? And Journey OS has some way cool icons, so what more does it need.
The popularity of Linux is to some degree a historical accident. Linux is not popular because of technical excellence. Compared to operating systems like Mach, which evolved into the Next operating system, which evolved into Apple's OS X, Linux has been, historically primative. In fact, the BSD based operating systems (freeBSD, openBSD and netBSD) have been through much of Linux's existence better designed and more stable. So it is unlikely that Journey OS is going to take the world by storm because it is "way cool".
This is not the appropriate place to publish a long rant on software documentation. Such a rant can be found elsewhere on these web pages. But it is probably worth quoting something I read on the web page of a computer science professor at Carnegie Mellon University. This professor wrote that if you could not write well, don't bother applying for a graduate student position in his group. It does not matter how brilliant you are, he wrote, if no one understands what you have done or why you did it.
Some University projects may suffer from the opposite problem of projects like Journey OS: they publish lots of paper and don't build anything that works. This does not change the fact that software source code is not a medium for communicating ideas to humans. Programming languages are designed to represent algorithms in a form that can be efficiently compiled for execution on computer hardware. Anyone who claims that a large body of software is "self documenting" either can't write clearly or is simply lazy. People who do not document their source code are pushing work that they should have done onto someone else who will have to maintain the code at a later time. These poor souls will have the unenviable job of trying to understand a large body of undocumented source code so that it can be fixed or modified. In the case of the Journey OS, it seems likely that most people will not bother to wade through some obscure project whose motives and design are unstated. Regardless of how cool and brilliant it is.
Statistical Techniques in Language Translation: Franz Josef Och's web site at the University of Southern California's Information Science Institute
See also my web page on natural language processing information extraction.
Scott R. Ladd's Coyote Gulch Web site
Scott Ladd is a software engineer and author of a number of books including books on C++ and genetic algorithms. His web site has some interesting material, including some benchmarks that compare C++, Fortran and Java (Linux Number Crunching). Although this remains controversial for some people, it was no surprise to me that Scott concludes that Java is a poor choice for numerical simulation.
Large Limits to Software Estimation by J.P. Lewis, July 2001
The link to this paper is on J.P. Lewis' Web site hosted on idiom.com, which hosts bearcave.com as well. Large Limits to Software Estimation is an interesting paper. The paper shows that there are theoretical limits to estimating software complexity. Since software complexity cannot be exactly estimated, schedules for constructing software are sure to be off.
One has to be careful with mathematical proofs like this. They can be both true and false in practice. For example, perfect register allocation is impossible to solve in a reasonable amount of time (e.g., the time in which a compiler should compile a piece of code). So on one had we can say that optimal register allocation is an intractable problem. But a solution that is 80% of the perfect solution will be good enough, in practice. So if we could estimate the complexity of software with 80% accuracy, this would be good enough in practice. Of course it's not clear that we can do this and the history of software development projects does not encourage optimism.
This article was discussed at length on SlashDot. I did not actually see anyone address the content J.P. Lewis' article. They simply blathered on about their own experience and how their methodology worked or did not work in estimating software projects. Perhaps this is part of the problem in software estimation too.
J.P. Lewis' paper caused a stir in the software engineering community, probably as a result of its clarity and its negative implications. A subsequent paper, elaborating on the July 2001 paper, can be found here.
Are there limits to software estimation?, a response to J.P. Lewis' article by Charles Connell, published on slashdot.org, January 11, 2001
One of the points raised by Lewis in his original article is that we are more likely to be right in our estimates for the time to complete a project if we have implemented similar software before. For software where we have no previous experience, the estimates become doubtful.
Extensive design and implementation documentation is, to some degree, a dry run for the software implementation. If this documentation is complete the author has, at least in thought, completed a prototype. As a result, the time estimates are likely to be more accurate.
I like Lewis' paper and I think he has some valid points. I certainly prefer Lewis' approach to the hype and fuzzy thinking in some of the software engineering literature (e.g., if methodology X is followed all of the problems previously encountered in large software projects will be avoided). But as Charles Connell points out in his critique, software estimation is not aimed at finding a perfect time estimate:
The real-world problem of software estimation is much less strict than Lewis states. We are just trying to get somewhere close a reasonable percentage of the time!
One of the points that Lewis raises is that the process tends to be inherently subjective and claims at objective methodologies are likely to be wrong.
Complexity, in terms of a software project, is not reducable to Kolmogorov complexity. For example, I have been working on wavelet algorithms for a year. The actual algorithms, once they are developed, are relatively small and simple (a.k.a. elegant). These algorithms should have a relatively low Kolmogorov complexity. In fact, they can be described largely using matrices, which ties directly into the Kolmogorov description.
The ideas behind wavelet algorithms are complex. In particular, some wavelet algorithms are only reversible for infinite data sets. In the case of finite data sets, errors are introduced at the edges of the data. Techniques like Gram-Schmidt orthogonalization have been proposed to deal with this problem, but this technique does not always work in practice. Other wavelet algorithms (e.g., some lifting scheme algorithms) do not have these problems, but developing the wavelet and the scaling functions takes a lot of insight.
As these issues demonstrate, Kolmogorov complexity will provide no measure for the "conceptual complexity" behind an algorithm. Yet conceptual complexity is of critical importance if the task at hand is developing software for wavelet compression of images.
Fast Algorithms for Sorting and Searching Strings by Jon L. Bentley and Robert Sedgewick (in pdf format), January, 1997
The related web page on string algorithms includes the article Ternary Search Trees (also by Bentley and Sedgewick)
Cotse Security is a web site run by Stephen K. Gielda, who does computer security consulting work. Stephen's web site discusses issues concerning computer security, privacy, anonymity and freedom of expression. As with many other people, Stephen has found that there tends to be a conflict between anonymity and accountability.
Complexification.net: Gallery of Computation
This web page showcases the work of Jared Tarbell. Jared's work is truly the intersection of art and computer science. He seems to use as a starting point a computer algorithm, from which he develops striking dynamically evolving art.
I first saw Jared's work Substrate running as a screen saver on a colleagues computer. Substrate looks like a cross between a fractal or cellular automata created city and impressionist art.
This is Jared Tarbell's studio/company. This has links to Jared's publications and more of his images.
This section includes links to software libraries, software frameworks, system tools (e.g., compilers) and software that I find interesting from a computer science perspective.
The Boost library publishes a set of library functions (in source form) that are compatible with the C++ Standard Template Library (STL). There are some interesting functions published in the Boost library. These include classes to support graphs and graph manipulation. However, unlike STL, there does not seem to be any guiding direction for what is included. For example, the math library section includes quaterion functions, which are used by people doing 3-D programming and tracking applications. This is cools stuff, but certainly not as generally useful as the string class or the vector template in the C++ STL.
POSIX Threads
Pthreads Win32: Open Source POSIX Threads for Win32
POSIX Threads (known as Pthreds) have kernel level support on several UNIX systems and provide a platform independent way to implement therading. Pthreads have no native support on Windows (Win32). The Pthreads Win32 package from RedHat provides a Pthreads interfact to the Win32 native threads.
The Boost library also has a threads package. I can't see why anyone would use this package if Pthreads is available. Pthreads is well documented and is a POSIX standard. This means that your code is more likely to be portable between platforms.
Posix Threads Programming: a tutorial from the Lawrence Livermore National Labs. This is part of the Livermore Computing Training. There are additional tutorials on Python, OpenMP, MPI and other topics.
OpenTop is an implementation of some of the core components of the Java class library, for C++. Compared to Java, C++ has a fairly limited class library (consisting of at most the Standard Template Library, Boost and a few others). OpenTop provides many of the core Java classes, which is a big step forward. An added advantage is that these classes will also be familiar to any Java programmer.
OpenTop is available as GPLed Open Source and as a commercial product (presumably the commercial product is a different source base).
This appears to be a Free Software project repository for projects that are not an official part of the Free Software Project. This software is described as "Free Software" for free operating systems (e.g., not Windows).
Digital Mars is Walter Bright's company and publishing site for his compilers. Walter was the author of the Zortech C and later C++ compilers. The Zortech C++ compiler was one of the first compilers available for the IBM PC (this was the version of C++ based on The Annotated C++ Reference Manual by Ellis and Stroustrup. The Zortech compiler was also one of the most reasonably priced and was the first C++ compiler I purchased. Zortech was purchased by Synantec, before they got into the anti-virus software. Walter apparently also wrote the Semantec Java compiler.
The Digital Mars web site publishes compilers for Win32 and DOS. The Digital Mars compilers and runtime libraries are available on CD-ROM for around $30, which as far as I'm concerned is just about free. You can also download the compilers if the CD is too expensive for you (go on, buy the CD).
There are also some interesting libraries published on the site, including the C++ STL and Boehm's garbage collector ported for the Digital Mars compiler/runtime environment.
Walter Bright continues an amazingly productive and innovative career. He has designed the D Programming Language. He has apparently written a compiler for D as well. As the name suggests, D is meant to be a successor to C/C++. Unlike C++, D does not have lots of compatibility baggage.
Walter has not only written a number of optimizing compilers for various languages, he has also written a multiplayer strategy game called Empire.
BrookGPU: Brook for GPUs is a compiler and runtime implementation of the Brook stream program language for modern graphics hardware.
Modern graphics processors usually have at least four floating point arithmetic units. This make graphic processors (GPUs) the fastest numeric engines available for microprocessor prices. The BrookGPU project is based at Standford and is sponsored in part by DARPA.
Mono: a Free Software implementation of .NET
The Mono project is an attempt to implement version of Microsoft's stadardized software components on Linux. Currently these are the C# langauge and the Microsoft CLI (Common Language Interface). Other, non-standardized components, will be implemented as well.
I have mixed feelings about the Mono project. The C# language and the .NET framework can be seen as yet another attempt by Microsoft to control the computing platform.
My perception, at the time this was written, is that .NET is losing ground to Enterprise Java. Enterprise Java is not simply Sun Microsystems, but includes a number of other companies like IBM, Oracle and Weblogic. Potentially this gives enterprise Java a development base that is so large that even Microsoft cannot compete with it. One advantage of enterprise Java is that it runs on both Windows and Linux. Portability and the large developer and user base make enterprise Java a huge treat to Microsoft's attempt to dominate enterprise software.
The Mono project brings portability to .NET, removing one of the advantages of enterprise Java. To the extent that Mono is usable (currently an open question), this could help Microsoft gain acceptance of .NET. Mono is an open source project and most developers are donating their development time. To the extent that developing Mono helps Microsoft, developers have to ask themselves whether this is how they want to spent their free time.
Some writers have pointed out that a successful Mono project which becomes widely adopted for on Linux could be destroyed by Microsoft, threatening any project that relied on Mono. See Mono-culture and the .NETwork effect posted on Librenix, October 13, 2003.
From www.hoard.org:
Hoard is a fast, scalable and memory-efficient allocator for multiprocessors. Hoard solves the heap contention problem caused when multiple threads call dynamic memory allocation functions like malloc() and free() (or new and delete). Hoard can dramatically improve the performance of multithreaded programs running on multiprocessors.
Valgrind: an open-source memory debugger for x86-GNU/Linux
I have implemented a C++ reference counted String container. As it turns out, this software exposes much of the complexity of C++. You can only assure proper function with a fairly large test suite and with a software tool to verify memory use (this is why people like Java).
I've used Purify, a commercial software tool that runs on the Sun Microsystems workstations, for verifying memory use in the past. For a cave based software developer like me Purify has a substantial license fee. This license fee can definitely be justified for commercial software, but it is harder to justify for software that I give away. Purify was originally developed for the Sun platform and has a good reputation on this system. At least at the time of this writing, Purify's reputation is not as strong on other systems, especially Linux.
Valgrind is an excellent open source alternative to Purify. Object code that is processed by Valgrind does not seem to suffer a huge performance penalty. On Linux I was able to use Valgrind to find some reference before definition errors in my String container test suite. The results produced by Valgrind are not as easy to understand as those produced by Purify. Also, Valgrind does not have Purify's nice GUI interface.
Commercial C++ Memory Debugging Tools
The complexity of some C++ code means that it can be difficult to tell whether there are memory errors. Memory errors take several forms, including references to memory that has been deallocated and failure to free allocated memory. In addition to the Open Source valgrind tool mentioned above, there are a few commercial tools. One of the first commercial tools to provide sophisticated memory reference debugging was Purify which is currently sold by IBM/Rational.
As noted above, Purify is an excellent tool and is very easy to use. I have used it to verify a number of large software systems on Sun's Solaris verson of UNIX. From my point of view, the problem with Purify is its cost, which at the time of this writing is over $1,000 for a single license.
Parasoft sells a tools called Insure++ which does many of the same things that Purify. However, like Purify, the Insure++ software tool is expensive (I've seen quotes around $1,200 per license).
A company named Software Verification sells a tool called Memory Validator. This is more affordable. At the time of this writing their web site states that it is available for at an "introductory price" of $299 and is usually priced at $399. The authors of Memory Validator discuss it in Software Verification's Memory Validator
TOM: A Pattern Matching Compiler for Multiple Target Languages
TOM is a tool for doing pattern matching on trees. For example, the abstract syntax trees generated by a compiler. TOM has also been used for pattern matching in rule based systems. This last application particularly caught my eye. Perhaps it is possible to use TOM for pattern matching in information extraction applications in natural language processing.
Fnorb: An Object Request Broker for Python
As I write this XML technologies seem to be all the rage. Or at least them seem like a disease that can't be eradicated. Before there were XML Schemas (XSDs), XSLT and XPath, there was IDL, CORBA and related object marshalling and unmarshalling technologies. Like XML, these offered platform independent ways to distribute data. Fnorb is a Object Request Broker platform for Python.
Zope is yet another Web application framework. The web site humbly describes Zope as:
Zope is a unique software system: a high-performance application server, a web server, and a content management system. Straight out of the box, it is a complete and self-contained solution, that includes a robust, scalable object database, web services architecture, and powerful programming capabilities.
Apparently Zope is closely integerated with the Python language, originally developed by Guido van Rossum. On a side note: although object databases been around for some time, they do not seem to have made much of a dent in the relational database model.
Zope is refered to as a "content management framework". Apparently Plone builds on top of Zope to provide a "content management system". The main goals of CMS are to allow easy creation, publishing and retrieval of content to fit a business needs. (from the plone.org web pages). I guess if you already know what content management frameworks and contement management systems are, this is probably all crystal clear. Obviously bearcave.com could use something to help beat it into a more organized whole.
The Visualization Toolkit: an open source toolkit for graphics visualization
this is a 2D and 3D graphics visualizatin toolkit, which seems to run on all major platforms. Unlike many open source projects, the Visualization Toolkit is extensively documented in a user guide and a text book.
Steganography is the art and science of hiding information within other information. For example, hiding a message within an image. The Stegdetect program claims to find such messages. I think that the proper way to refer to this is that it claims to find some such messages.
One of the problems with steganography and internet images is that most images are encoded in .jpg (JPEG) format. In general JPEG compression uses lossy compression (information is lost when the image is compressed). The human eye usually does not notice the information that is lost, so the "lossyness" of the compression scheme does not effect the apparent image quality. However, if information is buried in the image "noise", JPEG compression may destroy the hidden message.
The problem with JPEG compression aside, one interesting way to hide information is to apply a wavelet noise filter to the image. This will remove noise from the image without affecting its quality. An encrypted message (which is similar to noise) can then be plugged into the "holes" left by the noise filter. If the message appears in a particular wavelet spectrum, it can be recovered by applying a wavelet transform. Assuming that such a technique worked (which is an open question), statistical tests might not work well in detecting such an image.
At the Bearcave we have a large library of books. Readerware supports book indexing and cataloging. I first found out about it at a bookstore in Barcelona (Hibernian Books) which used Readerware to track their inventory.
Misinterpretation, yet another excellent article by "Robert X. Cringely" on code obfuscation (as it applies to Java byte code, .NET code and native code).
The article mentions a small software company called PreEmptive Solutions, which makes obfuscator software for Java and .NET object. They have developed a technique that they call Program State Code Protection which could be applied to native code as well.
Arxan is another company involved in this area.
On February 17, 2004 I attended an unclassified talk at LLNL given by John Grosh, who is the associate Director for Advanced Computing, Information Systems Directorate at the Department of Defense. He was talking about the various DoD computing initiatives. One of these involved issues of code obfuscation to protect codes with national security importance. The thinking is that the US can no longer control the export of high performance computers, but it may be able to control the export of important software and the associated algorithms.
While DoD may be concerned with byte code based codes, my impression was that the primary area of concern was native (microprocessor) code. While there are several companies that sell "obfuscation" products for Java byte code, I have not heard of a company that does this for microprocessor instruction sets.
There are a number of interesting questions here. If you have highly optimized code for a modern processor, how well can you do disassembling it into the source language? I am not sure of the answer to this question. But at MasPar Computer Corp. we worked on a debugger for optimized code. It was a difficult problem and in many cases the debuger got confused. My impression is that the general feeling is that disassembly of machine code does not do very well.
Another problem, pointed out by one of my colleagues, involves embedded system code. Modern weapons platforms (e.g., planes, tanks, helicopters) have increasing amounts of embedded software. These platforms are captured every-once-in-a-while (like the signal intelligence plane captured three years ago by China). If the computer systems are not destroyed, the possessor of the platform can understand the system by examining the software. So DoD is probably concerned on a number of levels about protecting code.
GnuPG: the GNU Privacy Guard
Network Associates bought Phil Zimmerman's PGP and commercialized it. Apparently it never made money, perhaps because a free version existed in parallel. In early March Network Associates announced that they had been unable to find a buyer for their PGP division. They subsequently fired 18 members of the staff and announced that there would be no new development. GnuPG offers a complete replacement:
GnuPG is a complete and free replacement for PGP. Because it does not use the patented IDEA algorithm, it can be used without any restrictions.
GnuPG is, of course, open source.
CDex: CD-ROM to .wav, MPEG, etc
A great source of 1-D data for signal processing via wavelets or Fourier transforms is music stored on CDs. The problem is getting the music in digital form off the CD so it can be fed to signal processing code. CDex is an open source program read a track off a CD-ROM into a .wav file (which is more or less a pure bit-stream). It turns on Windows NT, which is the platform I use for developing a lot of my signal processing software. It apparently needs Adaptec's ASPI manager, what ever that is (I have not installed this software yet, but I do have an