Wednesday, July 27, 2005

Advanced Query Operators: A Great Tool for Search


I'm constantly telling friends, family and clients about advanced search query operators. My search engine of choice for personal use is Google, so that is what I tell most people about, but those advanced operators are available from all search engines, usually via the link on their home pages labeled "Advanced Search." That "Advanced" label appears to scare off casual users, so I tell them not to be frightened by the word and to look anyway. MSN recently announced several new query operator tools on their blog. Those tools are invaluable if you REALLY want to find something when your searches are too cluttered with SE spam or lost in vagueness. So to further the value and use of advanced query operators, I'm reproducing below an article that came to my attention today by Cari Haus.

Cache in the Bank: Understanding Google's Advanced Operators

Copyright 2005 Log Cabin Rustics

If you would like to know when your site was last indexed by Google, you can find out easily by using the Google cache command. By typing "cache:www.logcabinrustics.com" into the Google search engine, I learned that my site was last indexed yesterday. The Google cache also displays the web page as it looked at the time of indexing, so you can see the latest version of your page that Google has indexed.

As some webmasters have learned, the Google cache feature can be particularly handy when a valuable website and its backup have been lost due to computer failures. It may be time-consuming, especially if you have hundreds of pages, but you can actually retrieve the "lost" pages from your site in the form that Google last indexed them. If this doesn't work, you might also try the Wayback Machine at archive.org.
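If you go the archive.org route, the Wayback Machine lookup can even be scripted. A rough sketch in Python (the availability endpoint and response fields are assumptions about archive.org's public service, and the domain is just an example):

  import json
  import urllib.parse
  import urllib.request

  page = "http://www.logcabinrustics.com/"  # the lost page to recover
  api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(page, safe="")

  with urllib.request.urlopen(api) as resp:
      data = json.load(resp)

  closest = data.get("archived_snapshots", {}).get("closest")
  if closest:
      print("Most recent snapshot:", closest["url"], "captured", closest["timestamp"])
  else:
      print("No archived copy found.")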

Forensic experts have also used the Google cache feature to their advantage, retrieving incriminating evidence from the web. This should be an important reminder to all webmasters not to publish sensitive material online. A later decision not to publish some tantalizing tidbit, and the frantic page-pulling that ensues, may not be enough to erase those ill-said words from the Net.

Webmasters are supposed to be able to block Google from caching their site by using the "no cache" (noarchive) tag. However, many don't even try this for fear of losing favor in the company's powerful search rankings. Although Google says the "no cache" tag doesn't affect web rankings, some webmasters aren't so sure.
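In practical terms, the "no cache" tag is generally the robots meta tag with a noarchive value, placed in a page's head section:

  <meta name="robots" content="noarchive">

A Googlebot-only variant, <meta name="googlebot" content="noarchive">, blocks only Google's cache and leaves other engines alone.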

Other Helpful Google Operators

Other helpful search engine operators of particular value to webmasters include:

LINK: The LINK operator, when used in conjunction with your domain name, is supposed to tell you how many links are pointing to your site. The syntax for this command is "link:http://www.thevegetarianexpress.com/". By way of caution, this only shows the links pointing to you that Google has indexed. A more inclusive option is found at the Marketleap website, where the Link Popularity Tool reports how many links are pointing to your site from other well-traveled search engines as well.

INURL: Google's INURL operator restricts your search to pages whose URLs contain the term you specify. For example, typing "inurl:www.logcabinrustics.com log beds" will bring up log beds only on the Log Cabin Rustics furniture website, since every URL on that site contains the domain name. This is a particularly helpful option if you are looking for a specific phrase on one site.

INTITLE: The INTITLE operator is helpful if you are looking for sites with a particular keyword in their title tag. Use it at Google by typing "intitle:furniture" or whatever other search term you are looking for.

Variations on the above themes include the ALLINURL and ALLINTITLE search operators. These are particularly useful when you are looking for a string of keywords in either a title or a URL. For example, if you start a query with allinurl:, Google will restrict the results to those with all of the query words in the URL. For instance, [allinurl:logcabinrustics.com bunk beds] will return only documents that have both "bunk" and "beds" in the URL.
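A few illustrative queries pulling these operators together (the domains are examples only):

  link:http://www.logcabinrustics.com
  inurl:www.logcabinrustics.com log beds
  intitle:"log furniture"
  allintitle: rustic log furniture
  allinurl: logcabinrustics.com bunk beds
  cache:www.logcabinrustics.com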

Google operators can be especially helpful in analyzing the web pages of key competitors. To learn more, visit http://www.google.com/help/operators.html.

About the Author:

Cari Haus is webmaster for http://www.logcabinrustics.com, an online retailer of quality log furniture.

Tuesday, July 19, 2005

More Google Privacy Concerns - Now National News


How much does Google know about you? I ignored this story when I first saw it at Wired News. Apparently the Associated Press picked up the story, and now we see it at CNN. Privacy at Google is clearly becoming a large public issue, given how many privacy breaches and data losses we have seen from large public companies. The story focuses first on hackers and Google employees with evil intent, and points to the oft-heard Google mantra "Don't be evil," which has created a lot of goodwill for the company.

There is little that could stop serious hackers, or Google insiders with a grudge against the company, from seriously harming Google's image by gathering and selling the valuable information the company holds.

It is not impossible to imagine a scenario where bad guys inside the company cooperate with external hackers, or do some hacking themselves from the inside, compromising the endless stores of data Google holds on Gmail users, Google Toolbar users, Google personalized search users, Google Desktop Search users, AdWords and AdSense users, Froogle shopping search users, Google Sitemaps users, Blogger users, Orkut users and more, across the many Google company units.

As a user of all of those services, I shiver a bit thinking of all the information gathered on my habits and client activities through each of them, aggregated into one database. When you add the payment services that Google is currently developing, it presents a very large target for bad guys.

Will we see a large-scale Google security breach in the future? Gee - I hope not. What are you doing to prevent it, Google?

Sunday, July 17, 2005

60 Day Sandbox for Google, AskJeeves; MSN Quickest, Yahoo Next


60 Day Sandbox for Google, AskJeeves; MSN Quickest, Yahoo Next

© Copyright July 18, 2005 Mike Banks Valentine

Search engine listing delays have been called the Google Sandbox effect, but they are actually in effect at each of the four top tier search engines in one form or another. MSN, it seems, has the shortest indexing delay at 30 days. This article is the second in a series following the spiders through a brand new web site, which first went live on May 11, 2005 under a newly purchased domain name. http://publish101.com/Sandbox2

Previously we looked at the first 35 days and detailed the crawling behavior of Googlebot, Teoma, MSNbot and Slurp as they traversed the pages of this new site. We discovered that each robot spider displays distinctly different crawling frequency and similarly differing indexing patterns.

For reference, about 15 to 20 new pages are added to the site daily, each linked from the home page for a day. Site structure is non-traditional, with no categories; the linking structure is tied to author pages listing their articles, plus a "related articles" index that links to relevant pages containing similar content.

So let's review where we stand with each spider, look at pages crawled, and compare pages indexed by engine.

The AskJeeves spider, Teoma, has crawled most of the pages on the site, yet indexes no pages 60 days later at this writing. This is clearly a site aging delay that's modeled on Google's Sandbox behavior. The Teoma spider from Ask.com has crawled more pages on this site than any other engine over the 60 day period, though it appears to have tired of crawling, having not returned since July 13 - its first break in 60 days.

In the first two days, Googlebot gobbled up 250 pages, then didn't return for nearly 60 days, and has not indexed a single page since that initial crawl. Googlebot is showing renewed interest since this crawling case study was published on several high-traffic sites, and is now looking at a few pages each day - so far no more than about 20 pages, at a decidedly lackluster pace, a true "crawl" that would keep it occupied for years if continued that slowly.

MSNbot crawled timidly for the first 45 days, looking over 30 to 50 pages daily - but only after it found a robots.txt file. We'd neglected to post that file for the first week, then bobbled the ball when we changed site structure and failed to implement robots.txt on the new subdomains until day 25 - and THEN MSNbot didn't return until day 30. If little else is learned about initial crawls and indexing, we have seen that MSNbot relies heavily on the robots.txt file, and that proper implementation of that file will speed crawling.

MSNbot is now crawling with enthusiasm at anywhere between 200 and 800 pages daily. As a matter of fact, we had to use a "crawl-delay" command in the robots.txt file after MSNbot began hitting 6 pages per second last week. The MSN index now shows 4905 pages 60 days into this experiment. Cached pages change weekly. MSNbot apparently likes the new page structure, which includes a feature linking to questions from several other article pages.

Slurp is strangely inactive and then hyperactive for alternating periods. The Yahoo crawler will look at 40 pages one day and 4,000 the next, then simply check the home page for a few days, jump back in for 3,000 pages, and go back to reviewing only robots.txt for two days. Consistency is not a curse suffered by Slurp. Yahoo now shows 6 pages in its index - one an error page and another an "index of" page, since we have not posted a home page to several subdomains. But Slurp has easily crawled 15,000 pages to date.

Lessons learned in the first 60 days on a new site follow:

  1. Google crawls 250 pages on first discovery of links to the site, then doesn't return until it finds more links, and crawls slowly after that. Google has failed to index the new domain for 60 days.

  2. Yahoo looks for error pages, and once it finds bad links it will crawl them ceaselessly until you tell it to stop. It then won't crawl at all for weeks, before crawling heavily one day and lightly the next in random fashion.

  3. MSNbot requires a robots.txt file, and once it decides it likes your site it may crawl too fast, requiring a "crawl-delay" instruction in that robots.txt file. Implement it immediately (see the sample file after this list).

  4. Bad bots can strain resources and hit too many pages too quickly until you tell them to stay out. We banned three bots outright after they slammed our servers for a day or two: "aipbot" crawled first, then "BecomeBot" came along, and then "Pbot" from Picsearch.com crawled heavily looking for image files we don't have. Bad bots, stay out. It's best to implement robots.txt exclusions for all but the top engines if their crawlers strain your server resources (again, see the sample file below). We considered excluding the Chinese search engine Baidu.com when it began crawling heavily early on. We don't expect much traffic from China, but why exclude one billion people? Especially since Google is rumored to be considering a purchase of Baidu.com as an entry to the Chinese market.
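For reference, a robots.txt along the lines described in points 3 and 4 might look like this (the crawl-delay value and the bot names are illustrative rather than the exact file used on Publish101):

  User-agent: msnbot
  Crawl-delay: 10

  User-agent: BecomeBot
  Disallow: /

  User-agent: aipbot
  Disallow: /

  User-agent: Pbot
  Disallow: /

  User-agent: *
  Disallow:

The final wildcard record leaves everything open to any crawler not named above.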

The bottom line is that we've discovered all engines seem to delay indexing of new domain names for at least thirty days. Google so far has delayed indexing THIS new domain for 60 days since first crawling it. AskJeeves has crawled thousands of pages while indexing none of them. MSN indexes faster than all the other engines but requires a robots.txt file. Yahoo's Slurp crawls on-again, off-again for 60 days, but has indexed only six of the 15,000 or more pages crawled to date.

We seem to have settled that there is a clear indexing delay, but whether this site specifically is "Sandboxed" and whether delays apply universally is less clear. Many webmasters claim that they have been indexed fully within 30 days of first posting a new domain. We'd love to see others track the spiders through new sites following launch and document their results publicly, so that indexing and crawling behavior can be better established.

Mike Banks Valentine is a search engine optimization specialist who operates WebSite101, a small business ecommerce tutorial, and will continue this case study chronicling search indexing of Publish101.

Friday, July 08, 2005

Playing in Googlebot's Sandbox with Slurp, Teoma, & MSNbot


Playing in Googlebot's Sandbox with Slurp, Teoma, & MSNbot
By Mike Valentine

There has been endless webmaster speculation and worry about the so-called "Google Sandbox" - the indexing time delay for new domain names - rumored to last for at least 45 days from the date of first "discovery" by Googlebot. This recognized listing delay came to be called the "Google Sandbox effect."

Ruminations on the algorithmic elements of this sandbox time delay have ranged widely since the indexing delay was first noticed in spring of 2004. Some believe it to be an issue of one single element of good search engine optimization, such as linking campaigns. Link building has been the focus of most discussion, but others have pointed to the size of a new site, its internal linking structure, or simply fixed time delays as the most relevant algorithmic elements.

Rather than contribute to this speculation and further muddy the Sandbox, we'll look at a case study of a site on a new domain name established May 11, 2005 - its specific site structure, submission activity, and external and internal linking - and see how this plays out in search engine spider activity versus indexing dates at the top four search engines.

Ready? We'll give dates and crawler action in daily lists and see how this all plays out on this single new site over time.
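Counts like those in the daily lists can be pulled straight from a raw server access log. A minimal sketch in Python (the log file name and the user-agent substrings are illustrative assumptions, not the actual setup behind these numbers):

  import re
  from collections import defaultdict

  # Substrings matched against each log line's User-Agent field.
  BOTS = ["Googlebot", "msnbot", "Slurp", "Teoma"]

  counts = defaultdict(int)  # (date, bot) -> number of requests

  with open("access_log") as log:  # hypothetical Apache-style access log
      for line in log:
          # Pull the date out of the "[15/May/2005:..." timestamp.
          stamp = re.search(r"\[(\d{2}/\w{3}/\d{4})", line)
          if not stamp:
              continue
          for bot in BOTS:
              if bot.lower() in line.lower():
                  counts[(stamp.group(1), bot)] += 1

  for (day, bot), hits in sorted(counts.items()):
      print(day, bot, hits)

Something of that sort is enough to keep the kind of day-by-day timeline that follows.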

* May 11, 2005 - Basic text on a large site posted on the newly purchased domain name, going live by day's end. Search-friendly structure implemented with text linking, making full discovery of all content possible by robots. Home page updated with 10 new text content pages added daily. Submitted site at Google's "Add URL" submission page.

* May 12 - 14 - No visits by Slurp, MSNbot, Teoma or Google. (Slurp is Yahoo's spider and Teoma is from Ask Jeeves.) Posted a link on WebSite101 to the new domain at Publish101.com.

* May 15 - Googlebot arrives and eagerly crawls 245 pages on the new domain after looking for, but not finding, the robots.txt file. Oooops! Gotta add that robots.txt file!

* May 16 - Googlebot returns for 5 more pages and stops. Slurp greedily gobbles 1480 pages and 1892 bad links! Those bad links were caused by our email masking meant to keep out bad bots. How ironic that Slurp likes these.

* May 17 - Slurp finds 1409 more masking links & only 209 new content pages. MSNbot visits for the first time and asks for robots.txt 75 times during the day, but leaves when it finds that file missing! Finally get around to adding robots.txt by day's end to stop Slurp from crawling the email masking links and let MSNbot know it's safe to come in!

* May 23 - The Teoma spider shows up for the first time and crawls 93 pages. Site gets slammed by BecomeBot, a spider that hits a page every 5 to 7 seconds and strains our resources with 2409 rapid-fire requests for pages. Added BecomeBot to the robots.txt exclusion list to keep 'em out.

* May 24 - MSNbot hasn't shown up for a week since finding the robots.txt file missing. Slurp is showing up every few hours, looking at robots.txt and leaving again without crawling anything now that it is excluded from the email masking links. BecomeBot appears to be honoring the robots.txt exclusion but asks for that file 109 times during the day. Teoma crawls 139 more pages.

* May 25 - We realize that we need to re-allocate server resources and redesign the database, and this requires changes to URLs, which means all previously crawled pages are now bad links! We implement subdomains and wonder: what now? Slurp shows up and finds thousands of new email masking links, as robots.txt was not moved to the new directory structure. Spiders get error pages on new visits. Scrambling to put out fires after wide-ranging changes to the site, we miss this for a week. Spider action is spotty for 10 days until we fix robots.txt.

* June 4 - Teoma returns and crawls 590 pages! No others.

* June 5 - Teoma returns and crawls 1902 pages! No others.

* June 6 - Teoma returns and crawls 290 pages. No others.

* June 7 - Teoma returns and crawls 471 pages. No others.

* June 8-14 Odd spider behavior, looking at robots.txt only.

* June 15 - Slurp gets thirsty, gulps 1396 pages! No others.

* June 16 - Slurp still thirsty, gulps 1379 pages! No others.

So we'll take a break here at the 5 week point and take note of the very different behavior of the top crawlers. Googlebot visits once and looks at a substantial number of pages but doesn't return for over a month. Slurp finds bad links and seems addicted to them, as it stops crawling good pages until it is told to lay off the bad liquor - er, links - by a robots.txt that slaps Slurp to its senses. MSNbot visits looking for that robots.txt and won't crawl any pages until told what NOT to do by the robots.txt file. Teoma just crawls like crazy, takes breaks, then comes back for more.

This behavior may imitate the differing personalities of the software engineers who designed them. Teoma is tenacious and hard-working. MSNbot is timid, needs instruction and some reassurance that it is doing the right thing, and picks up pages slowly and carefully. Slurp has an addictive personality and performs erratically on a random schedule. Googlebot takes a good long look and leaves. Who knows whether it will be back, or when.

Now let's look at indexing by each engine. As of this writing on July 7, each engine shows differing indexing behavior as well. Google shows no pages indexed, although it crawled 250 pages nearly two months ago. Yahoo has three pages indexed, in a clear aging routine that lists almost none of the nearly 8,000 pages it has crawled to date (not all itemized above). MSN has 187 pages indexed while crawling fewer pages than any of the others. Ask Jeeves has crawled more pages to date than any search engine, yet has not indexed a single page.

Each of the engines will show the number of pages indexed if you use the query operator "site:publish101.com" without the quotes. MSN shows 187 pages, Ask none, Yahoo 3 pages, and Google none.

The daily activity in the three weeks since June 16 (not listed above) has not varied dramatically, with Teoma crawling a bit more than the other engines, Slurp erratically up and down, and MSN slowly gathering 30 to 50 pages daily. Google is absent.

The linking campaign has been minimal: posts to discussion lists, a couple of articles and some blog activity. Looking back over this time, it is apparent that a listing delay is actually quite sensible from the search engines' point of view. Our site restructuring and bobbled robots.txt implementation seem to have abruptly stalled crawling, but the indexing behavior of each engine reflects a distinctly different policy from each major player.

The sandbox is apparently not just Google's playground, but it is certainly tiresome after nearly two months. I think I'd like to leave for home, have some lunch and take a nap now.

Back to class before we leave for the day, kiddies. What did we learn today? Watch early crawler activity, be certain to implement robots.txt early, and adjust often for bad bots. Oh yes, and the sandbox belongs to all search engines.

Mike Banks Valentine is a search engine optimization specialist who operates http://WebSite101.com and will continue this case study chronicling search indexing of http://Publish101.com.