Saturday, November 20, 2010

Reading Notes for 11/22/10

1.) I really liked the David Hawking articles about the basics of web crawlers: how they function, the problems they face, and why they are such great tools. They must leave out "low-value automated content," since it is not necessary and makes searching harder. They must be able to handle multiple languages, misspellings, and even made-up words (e.g., google, yahoo). They must work with a vast amount of information and avoid tricky spam sites, and search results still have to come back within a few milliseconds. It's quite amazing when you think about it like that!


2.) The Shreeves, S. L., et al. article had an example of one of its customers that I thought was intriguing. The Sheet Music Consortium is trying to digitize sheet music, and they initially had some trouble figuring out how to digitize the various components of their information, such as the "cover art, the sheet music itself, the lyrics, etc." I assume this also includes all the other small markings written onto the music, such as piano, forte, or staccato. In class, we sometimes discuss the concept of using other languages for digitizing data, but music was not something I had thought of before. Essentially, music is like another language.


3.) The Bergman article was great because it explained the true depth of the deep web and how difficult it is for crawlers to find this information. Even though crawlers do a great job with all that they are responsible for, there is still a massive amount of information that is left out of the equation. The article discusses how "Internet searchers are therefore searching only 0.03% — or one in 3,000 — of the pages available to them today." That is such a tiny fraction of the information that we could be accessing! I find that number almost unbelievable. The article also notes that crawlers can't even look in firewalled or "Intranet" sites within institutions. So there's even more information there that we cannot access.
