Seth's To-Do List

I have more ideas for neat projects than I have time or energy to do them. Want to help?

Web Stuff

Personal Web Indexer

This has its own page now. It's done. It works. Yay!

Automatic Portal

Why doesn't my browser make a portal page for me? It knows which URLs I keep visiting. I'd still want something organized decently, and I don't expect a program to manage that tricky task, but it could handle a "things not already on this page that should be" section.

Update: It turned out to be really easy to make a page of links to the pages I visit most often, so I now have one that updates itself every night. The problem is that it's hard to use and contains some crap I've never heard of. To be useful, it needs to be filtered, organized, and presented well. I haven't done this yet.
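
For anyone who wants the easy part, here's roughly what the nightly job amounts to. This is a sketch, not the script I actually run: it assumes a plain-text history dump with one visited URL per line (real browser history formats vary), and the filenames are made up. The hard part, filtering and presentation, isn't in here.

    from collections import Counter

    # Sketch of the nightly job: count visits and emit a page of the most
    # visited URLs, skipping anything already on the hand-made portal page.
    # "history.txt" and "portal-auto.html" are invented names; real browser
    # history isn't a flat text file, so some export step is assumed.
    def build_portal(history_file="history.txt", out_file="portal-auto.html",
                     top_n=50, already_linked=()):
        with open(history_file) as f:
            counts = Counter(line.strip() for line in f if line.strip())
        candidates = [(url, n) for url, n in counts.most_common()
                      if url not in already_linked][:top_n]
        with open(out_file, "w") as out:
            out.write("<html><body><h1>Most visited</h1><ul>\n")
            for url, n in candidates:
                out.write('<li><a href="%s">%s</a> (%d visits)</li>\n' % (url, url, n))
            out.write("</ul></body></html>\n")

    if __name__ == "__main__":
        build_portal(already_linked={"http://example.com/"})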

Personalized What's New Page

Using cookies, a web site could have a personalized What's New list that told you exactly which pages had changed since the last time you looked at them. With version control, it could even tell how much each page had changed and report that or report the changes themselves. Using inline images or a server hack to get realtime access statistics, the service could be provided entirely by a third party. A server doing this should give users an option to disable data collection for them.

A note for the uninitiated: as with all site-specific user data, it could be stored in cookies on the browser, or it could be stored on the server, using cookies only for a user id. The decision affects system complexity, server load, bandwidth use, and load balancing options.

Another way to do it, without requiring site support, is to have your browser do the whole thing using its history database and an RDF site map (or some spidering and HEAD requests, for sites that don't supply RDF maps). The RDF map is unlikely to contain enough information to tell you how much each page has changed, but it could tell you when they've changed. I've heard good things about Mozilla's RDF support. They'll probably have something like this.
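
A quick sketch of that third approach, using HEAD requests and the Last-Modified header. The URL list and last-visit times below are stand-ins for what would really come out of the browser's history database; pages whose servers don't send Last-Modified would need spidering or diffing instead.

    import urllib.request
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    # Ask the server when a page last changed and compare to when we last
    # looked at it. Returns None if the server won't say.
    def changed_since(url, last_visit):
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            stamp = resp.headers.get("Last-Modified")
        if stamp is None:
            return None
        return parsedate_to_datetime(stamp) > last_visit

    if __name__ == "__main__":
        # Stand-in for the browser's history database.
        last_visits = {"http://example.com/": datetime(2024, 1, 1, tzinfo=timezone.utc)}
        for url, seen in last_visits.items():
            print(url, changed_since(url, seen))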

Web Springies Site Use Analysis

I'd like a map for site maintainers that shows how people actually use the site. I just want an xspringies view of a site in which the strength of each spring depends on the traffic the link gets. Xspringies is an old unix program that animates in 2-d a bunch of masses connected to each other by springs. It's lots of fun. The edge strengths could be shown with bolder/brighter/thicker lines. It'd be an interesting and maybe even educational way to view your site. I took a quick look at a few professional log analysis tools, and the best they do is give you a list of internal links with usage statistics, in a big text list or bar chart. Not very easy to absorb.

I'm not much of a graphics programmer, so I probably won't attempt this unless I can find a visualization package to which I can just pass the configuration data.
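
The data-gathering half, at least, is easy. Something like this turns an access log into weighted edges (referring page, requested page, hit count) that a spring-layout package could then animate. It assumes Apache's combined log format and a made-up site prefix.

    import re
    from collections import Counter

    # Pull (target, referrer) out of Apache combined-format log lines.
    LOG_LINE = re.compile(r'"(?:GET|POST) (\S+)[^"]*" \d+ \S+ "([^"]*)"')

    def edge_weights(log_path, site_prefix="http://example.com"):
        edges = Counter()
        with open(log_path) as f:
            for line in f:
                m = LOG_LINE.search(line)
                if not m:
                    continue
                target, referrer = m.group(1), m.group(2)
                if referrer.startswith(site_prefix):  # internal links only
                    source = referrer[len(site_prefix):] or "/"
                    edges[(source, target)] += 1      # spring strength
        return edges

    if __name__ == "__main__":
        for (src, dst), hits in edge_weights("access.log").most_common(20):
            print(hits, src, "->", dst)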


Text Stuff

Concert Agent

I like it when my favorite bands come to town, but I often don't find out in time. There are too many lists to check, and checking each one periodically is too much work. There are centralized event listings, but they rely on people to enter info manually, and that just doesn't happen reliably.

I need something that continuously spiders the web and populates a concert listing database with info posted on fan, artist, and venue sites.

This will involve determining whether a page has relevant info, and inducing extraction patterns.

  1. Define an initial feature set and some candidate feature patterns.
  2. Define field set.
  3. Implement something to resolve feature values for a given web page (a rough sketch of this step follows the list).
  4. Write or find a classifier and inducer. A simple algorithm is fine to start with. One with certainty factors will probably do better.
  5. Collect and label a training set. Train a classifier.
  6. For results that don't suck, use multiple classifiers that each use a subset of the features. (For example, one that only uses X-is-in-list L, and another that uses the rest.) Use co-learning to have the classifiers train each other using thousands of unlabeled web pages. This is how Flipdog works.
  7. Formatting styles are often consistent within a site, so it would be valuable to induce and retain patterns for specific sites. Flipdog does this.
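
Here's the rough shape of step 3. The particular features (date patterns, price patterns, venue-ish words) are guesses at what might matter, not a tested feature set, and the tag stripping is deliberately crude.

    import re

    # Guessed-at features for "does this page list concerts?"
    DATE_RE  = re.compile(r'\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)'
                          r'[a-z]*\.?\s+\d{1,2}\b', re.I)
    PRICE_RE = re.compile(r'\$\d+(?:\.\d{2})?')
    VENUE_WORDS = {"tickets", "doors", "venue", "tour", "live", "show", "admission"}

    def features(html_text):
        text = re.sub(r'<[^>]+>', ' ', html_text).lower()  # crude tag stripping
        words = set(re.findall(r'[a-z]+', text))
        return {
            "num_dates":  len(DATE_RE.findall(text)),
            "num_prices": len(PRICE_RE.findall(text)),
            "venue_word_hits": len(VENUE_WORDS & words),
            "mentions_tickets": "tickets" in words,
        }

    if __name__ == "__main__":
        sample = "<p>The Mountain Goats, Sep 14, doors 8pm, tickets $15</p>"
        print(features(sample))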

Grammatical Dissociator

Text dissociators analyze a bunch of text, and then produce text that's sort of similar, but all mixed up. All cases I know of use Markov chains of either letters or words. That means they notice the frequency of letter pairs or word pairs in the source text, then produce something where each word (or letter) is followed by a random word (or letter), with probabilities based on the observed frequencies. (Doing it by pairs is using a Markov chain of order one. You can use longer chains (triplets, etc.) to get more similarity at the cost of exponentially increasing memory usage.)
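
For the uninitiated, the word-pair version is only a few lines. This is the order-one case; keying on pairs of preceding words instead of single words gives the longer-chain behavior.

    import random
    from collections import defaultdict

    # Record which words follow which (duplicates preserve the observed
    # frequencies), then walk the chain at random.
    def build_chain(text):
        chain = defaultdict(list)
        words = text.split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
        return chain

    def dissociate(chain, length=30):
        word = random.choice(list(chain))
        out = [word]
        for _ in range(length - 1):
            followers = chain.get(word)
            word = random.choice(followers) if followers else random.choice(list(chain))
            out.append(word)
        return " ".join(out)

    if __name__ == "__main__":
        print(dissociate(build_chain(open("source.txt").read())))  # any text file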

Those dissociators are lots of fun, but the sentences they produce don't follow the same grammar as the source text. You get sentences of the form noun-verb-noun-verb-noun-verb-noun. There are random text generators that do produce more normal sentences, but these all have a grammar coded in by a programmer.

There has been research done into inducing grammars from text. I haven't read the publications, but it sounds like fun. If it works decently, it should be easy to go from that to a grammatical dissociator.

Content-Based Spam Detector

I actually did most of the work on this way back in 1998, but never cleaned it up and turned it into a functioning package. I made a simple Bayesian classifier, but the data I collected could be used to train other kinds of classifiers. As always, the real magic is in selecting features.

There are address-based detectors out there, but this would be more fun and wouldn't require as much maintenance. Mine is mostly done. I really should finish it.
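
This isn't my old code, just a bare-bones word-count Bayesian scorer to show the shape of the thing; the interesting work is in what you count, not in the formula.

    import math, re
    from collections import Counter

    def tokens(text):
        return re.findall(r"[a-z$']+", text.lower())

    class NaiveBayes:
        def __init__(self):
            self.words = {"spam": Counter(), "ham": Counter()}
            self.docs = {"spam": 0, "ham": 0}

        def train(self, text, label):
            self.words[label].update(tokens(text))
            self.docs[label] += 1

        def is_spam(self, text):
            vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
            score = {}
            for label in ("spam", "ham"):
                total = sum(self.words[label].values())
                logp = math.log(self.docs[label] / sum(self.docs.values()))
                for w in tokens(text):
                    # Laplace smoothing so unseen words don't zero things out.
                    logp += math.log((self.words[label][w] + 1) / (total + vocab))
                score[label] = logp
            return score["spam"] > score["ham"]

    if __name__ == "__main__":
        nb = NaiveBayes()
        nb.train("make money fast click here", "spam")
        nb.train("lunch tomorrow at noon?", "ham")
        print(nb.is_spam("click here for money"))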

Netsam Detector

Netsam is the crap that floats around on the net. That can include jokes, virus warnings, calls for political action on some bogus cause, etc.

While it'd be neat to somehow detect these things by the form they tend to take, I'd be happy with something that just determined whether a message was similar enough to anything in a set of known netritus, like the CIAC lists of hoaxes and chain letters.
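
The "similar enough" part could be as cheap as comparing word shingles. The threshold and filenames below are arbitrary; the known texts would be the CIAC lists or whatever else accumulates.

    import re

    # Compare word 4-grams (shingles) with Jaccard similarity.
    def shingles(text, k=4):
        words = re.findall(r"[a-z']+", text.lower())
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def looks_like_netsam(message, known_texts, threshold=0.3):
        msg = shingles(message)
        return any(jaccard(msg, shingles(t)) >= threshold for t in known_texts)

    if __name__ == "__main__":
        known = [open(p).read() for p in ("hoax1.txt", "chain1.txt")]
        print(looks_like_netsam(open("suspect.txt").read(), known))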

Quote Selector

Given some context and a set of quotes, select a quote that might be somehow appropriate. This is for frivolous things like email signatures, so my accuracy requirements are low.
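
With accuracy requirements this low, plain word overlap between the context and each quote is probably good enough to start with. The stopword list here is token-sized.

    import re

    STOP = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "i"}

    def words(text):
        return set(re.findall(r"[a-z']+", text.lower())) - STOP

    # Pick the quote sharing the most (non-stopword) words with the context.
    def pick_quote(context, quotes):
        ctx = words(context)
        return max(quotes, key=lambda q: len(ctx & words(q)))

    if __name__ == "__main__":
        quotes = ["Time flies like an arrow; fruit flies like a banana.",
                  "There is no spoon.",
                  "Any sufficiently advanced technology is indistinguishable from magic."]
        print(pick_quote("this scheduling code takes forever, time just flies", quotes))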


Visual

Text Color Fixer

My designer friends tell me that site authors should have complete control over the web browsing experience because they're better at it than users are. Some are, but some of them suck. Sometimes the choice of text color is so bad that the text is unreadable on the background. Sometimes the vlink color is the same as the link color. Sometimes the links aren't linky enough. These problems could be detected and fixed automatically by my browser (or by a proxy, since that's easier to hack).
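
Detecting the unreadable cases is the easy half. Something like this, using a standard relative-luminance contrast formula, flags text/background pairs that are too close; deciding what to change them to is the more tasteful problem.

    # Colors are (r, g, b) tuples in 0-255. Ratios under about 4.5:1 are
    # generally considered hard to read.
    def luminance(rgb):
        def channel(c):
            c = c / 255.0
            return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        r, g, b = (channel(c) for c in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b

    def contrast_ratio(fg, bg):
        hi, lo = sorted((luminance(fg), luminance(bg)), reverse=True)
        return (hi + 0.05) / (lo + 0.05)

    def needs_fixing(text_color, bg_color, minimum=4.5):
        return contrast_ratio(text_color, bg_color) < minimum

    if __name__ == "__main__":
        print(needs_fixing((120, 120, 120), (130, 130, 130)))  # gray on gray: True
        print(needs_fixing((0, 0, 0), (255, 255, 255)))        # black on white: False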

Photo Cropper

I'd have to try it to be sure, but I think I could train a computer to crop photos to improve balance and composition for a majority of portrait and landscape shots. If it works, it could be built into a digital camera to suggest framing, though there may not be much advantage to doing it at shoot time.


Misc

GIMP Shuffler

I want to be able to select an arbitrary region and randomly scramble the positions of the contained pixels. So if I apply this to a square that's blue on one side and red on the other, I should get a square containing a random but equal distribution of red and blue pixels. Being able to sort would be nice too.

I wanted to use this to set the color balance of a textured region. "Oh, this much red, this much blue, this much green. Now blend it all." So I guess it'd be neat to also be able to scramble the three channels (R, G, B) independently, though with enough resolution that's not important.
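
I haven't written the GIMP plugin, but outside GIMP the operation itself is short. This sketch uses Pillow and NumPy, works on a rectangular region, and has a switch for scrambling the channels independently; the filenames are placeholders.

    import numpy as np
    from PIL import Image

    def shuffle_region(img, box, per_channel=False):
        """box is (left, top, right, bottom); returns a new Image."""
        arr = np.array(img)
        left, top, right, bottom = box
        region = arr[top:bottom, left:right].reshape(-1, arr.shape[2])
        if per_channel:
            for c in range(region.shape[1]):  # scramble R, G, B separately
                region[:, c] = np.random.permutation(region[:, c])
        else:
            np.random.shuffle(region)         # whole pixels keep their colors
        arr[top:bottom, left:right] = region.reshape(bottom - top, right - left, -1)
        return Image.fromarray(arr)

    if __name__ == "__main__":
        out = shuffle_region(Image.open("photo.png").convert("RGB"), (10, 10, 200, 120))
        out.save("shuffled.png")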