Chinese Kid IT World: 2008

Thursday, October 30, 2008

What is broadband?

The term broadband commonly refers to high-speed Internet access. The FCC defines broadband service as data transmission speeds exceeding 200 kilobits per second (Kbps), or 200,000 bits per second, in at least one direction: downstream (from the Internet to the user’s computer) or upstream (from the user’s computer to the Internet).

How is broadband different from dial-up service?
-Broadband service provides a higher speed of data transmission. It allows more content to be carried through the transmission “pipeline.”
-Broadband provides access to the highest quality Internet services: streaming media, VoIP (Internet phone), gaming, and interactive services. Many of these current and newly developing services require the transfer of large amounts of data, which may not be technically feasible with dial-up service. Therefore, broadband service may be increasingly necessary to access the full range of services and opportunities that the Internet can offer.
-Broadband is always on. It does not block phone lines, and there is no need to reconnect to the network after logging off.
-There is less delay in the transmission of content when using broadband.

What are the benefits of broadband?
It's fast ... generally 10-20 times faster than your existing dial-up modem. A typical dial-up modem operates at either 28.8 kbit/s or 56 kbit/s. A broadband connection operates at between 256 kbit/s and 10 Mbit/s, depending on the service you have selected.
To give you an idea of the difference that this speed can make, a 3.5 minute MP3 music file takes about 18 minutes to download using a 28.8 kbit/s dial-up modem but only about 21 seconds on a 1.5 Mbit/s broadband link. An e-mail containing a family photo takes about 55 seconds at 28.8 kbit/s but only about three seconds on a 512 kbit/s link.
Broadband's high speed gives you access to applications that are either not feasible at the speed of a dial-up connection or just annoyingly slow. Broadband can:
-let you transfer large files of text or graphics at high speeds;
-give you instant access to webpages, even those with large amounts of graphics that are typically very slow to download on a dial-up connection;
-allow employees to telecommute, operating from their home or elsewhere with the same response speeds and level of security as if they were in their office;
-link several computers to the Internet through the same connection;
-make videoconferencing faster, smoother and more practical;
-save money by allowing a business to rationalise and centralise its servers.
It's always on. As long as your computer is switched on, you can be connected to the Internet. This means that you do not waste time dialling up and waiting for your modem to connect you to the Internet every time you go online. You will not be subject to annoying busy signals and your connection is unlikely to drop out. Your phone line is not tied up while using the Internet. Therefore there is no need to pay for a second phone line. There are no additional dial-up charges to connect each time you use the service.
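
The download times quoted in this post can be roughly checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the 3.5 MB file size used in the note after it is an assumption, since the post does not state file sizes. It computes the ideal transfer time, ignoring protocol overhead.

```c
#include <stdio.h>

/* Ideal transfer time in seconds for a file of size_bytes over a link
 * running at kbits_per_second (1 kbit = 1000 bits). Real links are
 * slower because of protocol overhead, so this is a lower bound. */
double transfer_seconds(double size_bytes, double kbits_per_second)
{
    double bits = size_bytes * 8.0;
    return bits / (kbits_per_second * 1000.0);
}

/* Print the ideal time for one file at one link speed. */
void print_transfer_time(const char *label, double size_bytes,
                         double kbits_per_second)
{
    printf("%s: %.1f s\n", label,
           transfer_seconds(size_bytes, kbits_per_second));
}
```

For an assumed 3.5 MB file, transfer_seconds(3500000.0, 28.8) gives about 972 seconds (roughly 16 minutes) and transfer_seconds(3500000.0, 1500.0) gives about 19 seconds. Both are a little below the figures quoted in the post, which presumably allow for overhead.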

Thursday, October 23, 2008

Basics of a computer or PC

Definition of a Computer -A machine that can be given instructions and can manipulate data by itself.

Basic Computer Needs
-Need to perform calculations faster and more accurately.
-Need to control processes consistently.
-Need to handle larger and larger amounts of data.

Computer System-A computer system contains two parts:
-Hardware: actual pieces of equipment (keyboards, screens, components inside the boxes, printers, modems, etc.)
-Software: instructions that direct the hardware (operating systems, compilers, applications)

Hardware
-Access to data
-Read: fetching information from somewhere
-Write: putting information somewhere
-Common hardware components
-Central Processing Unit(CPU)
-Main Memory
-Secondary Storage
-Input & Output(I/O)

Software operating systems:
Windows:
(Windows history:
1985: Windows 1.0
1987: Windows 2.0
1990: Windows 3.0
1993: Windows NT 3.1
1993: Windows for Workgroups 3.11
1994: Windows NT Workstation 3.5
1995: Windows 95
1996: Windows NT Workstation 4.0
1998: Windows 98
1999: Windows 98 Second Edition
2000: Windows Millennium Edition (Windows Me)
2000: Windows 2000 Professional
2001: Windows XP
2001: Windows XP Professional
2001: Windows XP Home Edition
2001: Windows XP 64-bit Edition
2002: Windows XP Media Center Edition
2002: Windows XP Tablet PC Edition
2007: Windows Vista Home Basic, Home Premium, Business, Ultimate)
Linux
UNIX
Mac OS

CPU
-The CPU is the heart of any computer.
Three functions
1. control: the CPU takes one instruction at a time and follows it to direct the rest of the system.
2. arithmetic operations: addition, subtraction, multiplication, division.
3. logical operations: comparisons.

Main Memory
-Main memory is a temporary, working storage area.
-Stores two types of things:
1. current set of instructions the CPU is following.
2. data these instructions manipulate.
-Made up of locations with unique addresses.
-Addresses allow random access of data.
-Very fast, but expensive and volatile.

Secondary Storage
-Permanent storage device. Disks, hard and floppy, are common examples.
-Relatively cheap, but slow (can be hundreds of times slower than Main Memory).
-Accessed in blocks (many characters at a time).

I/O Devices
-Provide a way for humans to communicate with computers
-Common input devices:
1. Keyboard
2. Mouse
3. Scanner
-Common output devices:
1. Monitor
2. Printer
3. Speaker

Quick Solution to remove Browser Hijacker from IP address: 85.12.43.84

If you have finished reading this page about how to solve the browser hijacker problem and want a quick solution to remove the browser hijacker from IP address 85.12.43.84, you can go to this website and download an antivirus.

What is this IP address 85.12.43.84? It belongs to an "attacker" website, and when the web page loads, it loads content from this IP address: 85.12.43.84

If you are having this problem, I recommend using the tool below to solve it: website.eset-nod32-antivirus-smart-security/ Here you can also find the serial key for the above.

How to solve Browser Hijacker Problem

Here's a series of steps you can take to use Hijack This to remove a browser hijack.
(BTW, thanks to my good friend RT for teaching me this, providing the notes this was based on and allowing me to pass this on to you.)
BEFORE YOU START - Download and install Hijack This from http://www.downloads.com/

-STEP 1- SAFETY STUFF

Back up your documents and create a system restore point.

-STEP 2- CHECK FOR SUSPICIOUS STARTUP ITEMS

You can use Hijack This to clean out hijacked items from Microsoft's Internet Explorer (redirections due to spyware); however, they will return if the executable program causing them is not removed.

a. Click on Start > Run, type "msconfig" and click OK.
b. Select the "Startup" tab.
c. Uncheck any items you don't recognize.

Note that many legitimate programs will appear here too. Most spyware will load from this area.

If unsure if a particular item is legitimate or not, do a Google search on the .exe file name that loads. The only caveat here is that some spyware .exe files get a randomly generated name, so a search will not identify them.

You can look in the Command column to see the name of the .exe file itself and you can stretch this column if you cannot see the entire line of text.

By the way, it IS safe to uncheck everything here as a test anyway - nothing critical to Windows loads here. So, if in doubt, it is OK to uncheck something.

d. Apply the changes, and restart Windows.

-STEP 3 - RUN HIJACK THIS

1. Run the tool, and select "Scan".

2. Look mostly at the R0, R1 and O2 entries. These relate to the hijack and represent changes to your default browser settings (homepage, search page).

3. Have a look at the addresses for these entries. If they are different from your preferences, check the box next to each one.

4. Click on "Fix Checked" and confirm.

This process cleans out the modified (hijacked) entries. You can also define what Hijack This uses by clicking the Config button (lower right), however this is not required.

-STEP 4 - DOUBLE-CHECK HOME PAGE AND TEST

One problem is that if the IE Home Page isn't cleared, you'll get "rehijacked" when you launch IE. This is because that particular page is the source of the problem. (It may try to load an ActiveX control.)

Hijack This may have already reset your Home Page in STEP 3, but double check before starting IE:

a. Head to Control Panel, Internet Options.
b. Change your Home Page on the General tab.
c. Browse the Internet, reboot your machine, and test over the next little while.
If the hijack stays away, you've successfully cleared it, and one of the Startup items you disabled in STEP 2 might still be the cause.

-STEP 5- PERMANENTLY DELETE THE CAUSE
We need to find the Startup item that is causing this, if any. Recall that in STEP 2 we disabled some suspicious startup items. One or several of them may be triggering the hijack. Also note that we've been testing the machine with the Startup items disabled. We want to ensure the computer runs fine (no errors) with all these items unchecked.
If you are unsure about deleting an item or using the registry editor, seek help from your local tech expert.

a. Launch MSCONFIG once more.
b. For the first suspicious item, expand the "Location" column to see where it is loading from in the registry.
c. Click on Start, Run, type "regedit" and click OK.
d. Browse to the key listed in the "Location" column for MSCONFIG.
e. Delete the key on the right hand side only, that specifically matches that startup item.
f. Note the "Command" folder in MSCONFIG. Browse to this folder, and delete the .exe file itself.

What is a browser hijacker?

How To Resolve / Remove this IP address: 85.12.43.84
Problem?
Solution?

First of all, a browser hijacker (sometimes called hijackware) is a type of malware program that alters your computer's browser settings so that you are redirected to Web sites that you had no intention of visiting. Most browser hijackers alter default home pages and search pages to those of their customers, who pay for that service because of the traffic it generates.

More virulent versions often: add bookmarks for pornographic Web sites to the users' own bookmark collection; generate pornographic pop-up windows faster than the user can click them shut; and redirect users to pornographic sites when they inadvertently mistype a URL or enter a URL without the www. preface. Poorly coded browser hijackers -- which, unsurprisingly, are common -- may also slow your computer down and cause browser crashes.

Browser hijackers and the pornographic material they often leave in their wake can also be responsible for a variety of non-technical problems. Employees have lost jobs because of content and links found on their computers at work; people have been charged with possession of illegal material; and personal relationships have been severed or strained. In one case in the United States, a Russian immigrant was convicted for possession of child pornography, although he claims to have been the victim of a browser hijacker.

Like adware and spyware, a browser hijacker may be installed as part of freeware installation. In this case, the browser hijacker is probably mentioned in the user agreement -- although, obviously, not identified as a browser hijacker. The problem is that users typically either ignore the fine print or only give it a cursory reading. A browser hijacker may also be installed without user permission, as the result of an infected e-mail, a file share, or a drive-by download.

To avoid contamination, experts advise users to read user agreements carefully, and to be cautious about freeware downloads and e-mail messages from unknown sources.

If you have ever been infected by a browser hijacker, such as one from this IP address: 85.12.43.84, you can get the solution here.

Saturday, September 20, 2008

Algorithm and Data structure: Hashing

Introduction

•Dictionaries are data structures that support search, insert, and delete operations.
•One of the most important criteria for a dictionary is to provide fast searching of information.
•One of the effective representations is a hash table.
•The implementation of a hash table is frequently called hashing.

Hash Table

•Hash Tables represent the storage space for the entire search. The primary concern in the Hash Table is to provide appropriate indexing and to allow many of the different possible keys to be mapped onto the same location.
•There are two common methods of Hash Tables:
1. Array Hash Table
2. Linked Hash Table

•An Array Hash Table is simply an array that stores all the elements.
•Since an array is indexed by default, we do not have the problem of deciding the method of indexing the Hash Table.
•The problems that we face with an Array Hash Table are collision and overflow.
•A Linked Hash Table uses an array for indexing.
•Each element of the array is a pointer that points to a linked list that stores the data.
•The elements are not stored in the array itself but rather in a separate list.
•This method is also known as chaining.
•Chaining provides a simple method to overcome collision.
•However, if the list of elements is too long, it may slow down the searching process.
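
The linked hash table described above can be sketched in C roughly as follows. This is a minimal illustration, not a definitive implementation: the names ChainTable, chain_hash, chain_insert and chain_search, the table size of 11, and the character-sum hash are all assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 11            /* hypothetical table size */

typedef struct Node {
    char *key;
    struct Node *next;         /* next node in the same chain */
} Node;

typedef struct {
    Node *bucket[HASHSIZE];    /* each slot points to a chain, or is NULL */
} ChainTable;

/* Illustrative hash: sum of the character codes modulo HASHSIZE. */
static unsigned chain_hash(const char *key)
{
    unsigned sum = 0;
    while (*key)
        sum += (unsigned char)*key++;
    return sum % HASHSIZE;
}

void chain_init(ChainTable *t)
{
    for (int i = 0; i < HASHSIZE; i++)
        t->bucket[i] = NULL;
}

/* Return 1 if key is stored in the table, 0 otherwise. */
int chain_search(const ChainTable *t, const char *key)
{
    for (const Node *n = t->bucket[chain_hash(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

/* Insert key at the front of its chain. Returns 1 on success (or if the
 * key is already present) and 0 if memory allocation fails. */
int chain_insert(ChainTable *t, const char *key)
{
    if (chain_search(t, key))
        return 1;
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->key = malloc(strlen(key) + 1);
    if (n->key == NULL) {
        free(n);
        return 0;
    }
    strcpy(n->key, key);
    unsigned h = chain_hash(key);
    n->next = t->bucket[h];    /* a colliding key simply extends the chain */
    t->bucket[h] = n;
    return 1;
}
```

With this hash, "ab" and "ba" collide, yet both can be stored and found because they share one chain.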

Hash Function Techniques
•If the hash function is uniform, or equally distributes the data keys among the hash table indices, then hashing effectively subdivides the list to be searched.
• Worst-case behavior occurs when all keys hash to the same index. Then we simply have a single linked list that must be sequentially searched.
•Consequently, it is important to choose a good hash function.
•Several methods may be used to hash key values.
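
As one possible illustration of a hash function, the sketch below folds the characters of a string into an integer and takes the remainder modulo the table size (the division method). The function names and the multiplier 31 are assumptions chosen for this example; the notes do not prescribe a particular function. A deliberately weak first-letter hash is included for contrast.

```c
#include <stddef.h>

/* Division-method hash for a string key: fold the characters into an
 * unsigned integer, then take the remainder modulo the table size. */
unsigned hash_string(const char *key, unsigned table_size)
{
    unsigned h = 0;
    for (size_t i = 0; key[i] != '\0'; i++)
        h = h * 31u + (unsigned char)key[i];  /* 31: a common small multiplier */
    return h % table_size;
}

/* Weak hash for contrast: only the first character matters, so all
 * keys that start with the same letter land on the same index. */
unsigned hash_first_letter(const char *key, unsigned table_size)
{
    return (unsigned char)key[0] % table_size;
}
```

With a table size of 101, hash_first_letter maps "John" and "James" to the same index, while hash_string separates them.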

Advantages of Linked Hash Table (Chaining)
•Space saving - A linked list, as dynamic memory storage, saves memory space because nodes can be added to and removed from the list.
•Collision resolution - With chaining, a collision does not cause a problem as it does in an array. A collision occurs when 2 items need to be stored in the same location. This problem appears in the array hash table and is easily overcome by chaining.
•Overflow - A linked list avoids the record overflow that occurs in the array implementation. It is extremely useful when dealing with an unknown number of records.
•Deletion - Deletion is a quick and easy task, the same as deletion from a simple linked list.

Disadvantages of Linked Hash Table (Chaining)
•Use of space - All the links require space. If the records are large, then this space is negligible in comparison with that needed for the records themselves; but if the records are small, then it is not.
•Small records - Suppose, for example, that the links take one word each and that the items themselves take one word (which is the key alone). Such applications are quite common, where we use the hash table only to answer some yes-no question about the key. Suppose that we use chaining and make the hash table itself quite small, with the same number n of entries as the number of items. Then we shall use 3n words of storage altogether: n for the hash table, n for the keys, and n for the links to find the next node (if any) on each chain.
•Search speed - Search time for chaining is typically somewhat slower than for the array implementation.

Collision and The Resolution with Open Addressing
•The idea of a hash table is to allow many of the different possible keys that might occur to be mapped to the same location under the action of the hash function.
•In the array implementation, it is common for 2 records to be mapped into the same location.
•If this is the case, we call it a collision.
•A simple example: if a hash function hashes a name based on its first letter, then John and James will be hashed into the same location.
•The simplest method to resolve a collision is to start with the hash address (the location where the collision occurred) and do a sequential search for the desired key or an empty location. Hence this method searches in a straight line and is therefore called linear probing. The array should be considered circular, so that when the last location is reached, the search proceeds to the first location of the array.
•The major drawback of linear probing is that, as the table becomes about half full, there is a tendency toward clustering; that is, records start to appear in long strings of adjacent positions with gaps between the strings. Thus the sequential searches needed to find an empty position become longer and longer.
•The problem of clustering is essentially one of instability; if a few keys happen randomly to be near each other, then it becomes more and more likely that other keys will join them, and the distribution will become progressively more unbalanced.
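
Linear probing as described above might look like this in C. The sketch assumes non-negative integer keys, a small table of 7 slots, and key % HASHSIZE as the hash function; the names lp_init, lp_insert and lp_search are made up for this example.

```c
#define HASHSIZE 7      /* hypothetical small table so wrap-around is easy to see */
#define EMPTY    (-1)   /* marks an unused slot; keys are assumed non-negative */

void lp_init(int table[HASHSIZE])
{
    for (int i = 0; i < HASHSIZE; i++)
        table[i] = EMPTY;
}

/* Insert key and return the slot used, or -1 if the table is full. */
int lp_insert(int table[HASHSIZE], int key)
{
    int start = key % HASHSIZE;                /* the hash address */
    for (int i = 0; i < HASHSIZE; i++) {
        int pos = (start + i) % HASHSIZE;      /* treat the array as circular */
        if (table[pos] == EMPTY || table[pos] == key) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}

/* Return the slot holding key, or -1 if the key is not in the table. */
int lp_search(const int table[HASHSIZE], int key)
{
    int start = key % HASHSIZE;
    for (int i = 0; i < HASHSIZE; i++) {
        int pos = (start + i) % HASHSIZE;
        if (table[pos] == key)
            return pos;
        if (table[pos] == EMPTY)               /* an empty slot ends the search */
            return -1;
    }
    return -1;
}
```

Inserting 5, 12 and 19 (which all hash to 5) places them in slots 5, 6 and 0, showing both the cluster that forms and the wrap-around to the start of the array.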

•Rehashing
–If we are to avoid the problem of clustering, then we must use some more sophisticated way to select the sequence of locations to check when a collision occurs. There are many ways to do so. One, called rehashing, uses a second hash function to obtain the second position to consider. If this position is filled, then some other method is needed to get the third position, and so on. But if we have a fairly good spread from the first hash function, then little is to be gained by an independent second hash function.

•Quadratic Probing
–Another method is to use Quadratic Probing. If there is a collision at hash address h, this method probes the table at locations h + 1, h + 4, h + 9, …, that is, at location (h + i^2) % HASHSIZE for i = 1, 2, 3, …. That is, the increment function is i^2. This method substantially reduces clustering, but it is not obvious that it will probe all locations in the table, and in fact it does not.
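
The claim that quadratic probing does not reach every location can be checked with a short sketch. The table size of 11 and the function names are assumptions for illustration only.

```c
#include <stdbool.h>

#define HASHSIZE 11   /* hypothetical prime table size */

/* Location examined on the i-th probe after a collision at address h. */
int quadratic_probe(int h, int i)
{
    return (h + i * i) % HASHSIZE;
}

/* Count how many distinct locations the probe sequence visits for
 * i = 0, 1, ..., HASHSIZE - 1 (i = 0 is the hash address itself). */
int quadratic_distinct_slots(int h)
{
    bool seen[HASHSIZE] = { false };
    int count = 0;
    for (int i = 0; i < HASHSIZE; i++) {
        int pos = quadratic_probe(h, i);
        if (!seen[pos]) {
            seen[pos] = true;
            count++;
        }
    }
    return count;
}
```

For HASHSIZE = 11 the sequence visits only 6 of the 11 locations, whatever the starting address h, so an insertion can fail even though the table still has free slots.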

•Key-Dependent Increments
–Rather than having the increment depend on the number of probes already made, we can let it be some simple function of the key itself. For example, we could truncate the key to a single character and use its code as the increment. In C, we might write
–increment = *key;
–A good approach, when the remainder after division is taken as the hash function, is to let the increment depend on the quotient of the same division. An optimizing compiler should perform the division only once, so the calculation will be fast, and the result generally satisfactory.
–In this method, the increment, once determined, remains constant. If HASHSIZE is a prime, it follows that the probes will step through all the entries of the array before any repetition. Hence overflow will not be indicated until the array is completely full.
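
A sketch of the quotient-based increment, assuming non-negative integer keys and a prime table size of 11. The names kd_init and kd_insert are hypothetical, and forcing a zero increment up to 1 is a choice made for this example so that the probe always moves.

```c
#define HASHSIZE 11     /* a prime, as the method requires */
#define EMPTY    (-1)   /* marks an unused slot; keys are assumed non-negative */

void kd_init(int table[HASHSIZE])
{
    for (int i = 0; i < HASHSIZE; i++)
        table[i] = EMPTY;
}

/* Insert key using the remainder as the hash address and the quotient of
 * the same division as the (constant) increment. Returns the slot used,
 * or -1 if every slot has been probed without success. */
int kd_insert(int table[HASHSIZE], int key)
{
    int pos = key % HASHSIZE;
    int increment = (key / HASHSIZE) % HASHSIZE;
    if (increment == 0)
        increment = 1;          /* an increment of 0 would never move */

    for (int probes = 0; probes < HASHSIZE; probes++) {
        if (table[pos] == EMPTY || table[pos] == key) {
            table[pos] = key;
            return pos;
        }
        pos = (pos + increment) % HASHSIZE;
    }
    return -1;
}
```

The keys 5, 16, 27 and 38 all hash to address 5, but their increments differ (1, 1, 2 and 3), so they end up in slots 5, 6, 7 and 8 by different routes.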

•Random Probing
–A final method is to use a pseudorandom number generator to obtain the increment.
–The generator used should be one that always generates the same sequence provided it starts with the same seed.
–The seed, then, can be specified as some function of the key.
–This method is excellent in avoiding clustering, but is likely to be slower than the others.

Algorithm and Data structure: Sorting and Searching

Introduction
•Sorting is a process of arranging a set of similar elements into an increasing or decreasing order.
•For example, we might want to arrange a list of student names into alphabetical order or a list of student marks into descending (highest to lowest) order. We usually stored all the data in an array.
•Specifically, given a sorted list i of n elements, then
i1 <= i2 <= ... <= in

Two types of sorting:
•Internal sorting
–algorithms that sort arrays
–the amount of data to be sorted is sufficiently small so that the entire process can be carried out in the computer's main memory
•External sorting
–algorithms that sort sequential disk or magnetic tape files
–there is too much data to permit internal sorting, so the data is stored on a secondary storage device

•Usually when information is sorted, a portion of the information is used as the sort key.
•The key is that part of the data that determines which item comes before another.
•Thus, the key is used in comparisons, but when an exchange is made, the entire data structure is swapped.
•For example, in a mailing list the ZIP code field might be used as the key, but the entire address is sorted.

Classes of Sorting Algorithms
•There are three general methods for sorting arrays:
–Exchange
–Selection
–Insertion
•To understand these three methods, imagine a deck of cards.
–To sort the cards by using exchange, spread them on a table, face up, and then exchange out-of-order cards until the deck is ordered.
–Using selection, spread the cards on the table, select the card of lowest value, take it out of the deck, and hold it in your hand. Then from the remaining cards on the table, select the lowest card and place it behind the one already in your hand. This process continues until all the cards are in your hand. The cards in your hand will be sorted when you finish the process.
–To sort the cards by using insertion, hold all the cards in your hand. Place one card at a time on the table, always inserting it in the correct position. The deck will be sorted when you have no cards in your hand.
•Preliminaries
–The algorithms we describe will all be interchangeable. Each will be passed an array containing the elements and an integer containing the number of elements.
–We will assume that N, the number of elements passed to our sorting routines, has already been checked and is legal. In accordance with C conventions, the data will start at position 0 for all sorts.
–We will also assume the existence of the "<" and ">" operators, which can be used to place a consistent ordering on the input. Besides the assignment operator, these are the only operations allowed on the input data. Sorting under these conditions is known as comparison-based sorting.

Insertion Sort
•Insertion sort is one of the simplest sorting algorithms.
•Insertion sort consists of N – 1 passes. For pass P = 1 through N – 1, insertion sort ensures that the elements in positions 0 through P are in sorted order. Insertion sort makes use of the fact that elements in position 0 through P – 1 are already known to be in sorted order.
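
The passes described above can be sketched in C as follows, assuming an array of int. The function name is chosen for this example.

```c
/* Sort a[0..n-1] into increasing order. On pass p, elements 0..p-1 are
 * already sorted; a[p] is moved left until it is in its correct place,
 * so that elements 0..p are sorted when the pass ends. */
void insertion_sort(int a[], int n)
{
    for (int p = 1; p < n; p++) {
        int tmp = a[p];
        int j = p;
        while (j > 0 && a[j - 1] > tmp) {
            a[j] = a[j - 1];    /* shift the larger element one place right */
            j--;
        }
        a[j] = tmp;
    }
}
```

Shifting elements instead of swapping them pair by pair keeps the inner loop to one assignment per step.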

Bubble Sort
•The most well known sort.
•Its popularity is derived from its simplicity.
•It is named Bubble sort because, during sorting, values seem to rise to the top of the list like bubbles in a fish tank. Larger values seem to sink like stones.
•Bubble sort variations:
–The simplest is the single bubble sort.
–The more complex double-bubble sort.

•Single bubble sort
–Bubble sort is an exchange sort.
–The basic operation is the exchange of an adjacent pair of elements.
–So in the single bubble sort algorithm, the program will pass through the data, switching consecutive items which are out of order.
–After each pass through the list, the program checks to see if any switches were made.
–If there were, it passes through the list again, switching consecutive items which are still out of order.
–If no switches are made during an entire pass through the list, the data is sorted.
–In a typical C implementation of this sort, item is a pointer to the character array to be sorted and count is the number of elements in the array.
–The bubble sort is driven by two loops.
–Given that there are count elements in the array, the outer loop causes the array to be scanned count-1 times.
–This ensures that, in the worst case, every element is in the proper position when the function terminates.
–The inner loop actually performs the comparisons and exchanges.
–Here is an example : we want to sort 390, 205, 182, 45, 235
•Bubble sort for the first pass (exchange of adjacent pairs)
390 205 182 45 235 (Switch 1)
205 390 182 45 235 (Switch 2)
205 182 390 45 235 (Switch 3)
205 182 45 390 235 (Switch 4)
205 182 45 235 390 (List after the first pass)
•The first pass moves the largest element (390) to the nth position, forming a sorted list of length one
•The second pass only has to consider (n-1) elements and moves the second largest element to the (n-1) position.
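
A single bubble sort along the lines described above might be written as below. This is a sketch rather than the original listing: it uses the names item and count from the text, but sorts an int array (to match the numeric example) and combines the count-1 outer passes with the early exit when no switches were made.

```c
#include <stdbool.h>

/* Sort item[0..count-1] into increasing order by repeatedly switching
 * adjacent elements that are out of order. */
void bubble_sort(int *item, int count)
{
    /* Outer loop: at most count-1 passes over the array. */
    for (int pass = 1; pass < count; pass++) {
        bool switched = false;

        /* Inner loop: compare adjacent pairs. After each pass the largest
         * remaining value is in place, so the scan can stop one earlier. */
        for (int i = 0; i < count - pass; i++) {
            if (item[i] > item[i + 1]) {
                int tmp = item[i];
                item[i] = item[i + 1];
                item[i + 1] = tmp;
                switched = true;
            }
        }

        if (!switched)          /* no switches: the data is sorted */
            break;
    }
}
```

Applied to 390, 205, 182, 45, 235, the first pass of the inner loop performs exactly the four switches listed in the example.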

•Double Bubble Sort
–The same as the bubble sort except that, instead of continuing down the list after switching two consecutive items, the double bubble sort goes back up the list, comparing and switching consecutive items until either two items in the correct order are found or the top of the list is reached.
–This means that only one pass through the data is required, because each out-of-order item 'bubbles up' to its correct position before the next out-of-order item is encountered.
–The forward pass still makes N-1 comparisons and, in the worst case, the program will have to move up and down the list an equivalent number of times to the single bubble sort, but no unnecessary passes through the data are needed.

Shell Sort
•Works by comparing elements that are distant; the distance between comparison decreases as the algorithm runs until the last phase, in which adjacent elements are compared. For this reason, Shell sort is sometimes referred to as diminishing increment sort.
•Shell sort uses a sequence h1, h2, …, ht, called the increment sequence. Any increment sequence will do as long as h1 = 1, but some choices are better than others. After a phase, using some increment hk, for every i, we have A[i] <= A[i + hk].

•… > 7, so we traverse to the right child.
•On the third comparison, we succeed.

•Each comparison results in reducing the number of items to inspect by one-half.
•Figure above shows another tree containing the same values.
•While it is a binary search tree, its behavior is more like that of a linked list, with search time increasing proportional to the number of elements stored.
•This problem can be overcome by using an AVL tree.

Algorithm and Data structure: Trees and Heaps (Priority Queues)

Trees

•A tree is a non-linear data structure that supports most operations in O(log N) time, provided the tree is kept balanced.
•Trees are very useful in computer science.
•The structure of a tree itself provides hierarchical grouping of its elements, thus helping the programmer to structure a problem in a hierarchical manner.

Binary Tree
•A binary tree is a special type of tree in which every node has at most two children.
•A binary tree consists of
–A node (called the root node)
–Left and right sub-trees
•Both sub-trees are themselves binary trees
•The nodes at the lowest level of the tree (the ones with no sub-trees) are called leaves

•In an ordered binary tree:
–The keys of all the nodes in the left sub-tree are less than that of the root
–The keys of all the nodes in the right sub-tree are greater than that of the root
–The left and right sub-trees are themselves ordered binary trees
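
An ordered binary tree with integer keys could be sketched in C like this. The type and function names are assumptions for this example, and duplicate keys are simply ignored.

```c
#include <stdlib.h>

typedef struct TreeNode {
    int key;
    struct TreeNode *left;     /* sub-tree with smaller keys */
    struct TreeNode *right;    /* sub-tree with larger keys */
} TreeNode;

/* Insert key into the tree rooted at root and return the (possibly new)
 * root. If memory allocation fails, the tree is left unchanged. */
TreeNode *bst_insert(TreeNode *root, int key)
{
    if (root == NULL) {
        TreeNode *n = malloc(sizeof *n);
        if (n != NULL) {
            n->key = key;
            n->left = NULL;
            n->right = NULL;
        }
        return n;
    }
    if (key < root->key)
        root->left = bst_insert(root->left, key);
    else if (key > root->key)
        root->right = bst_insert(root->right, key);
    return root;
}

/* Return the node holding key, or NULL if the key is not in the tree.
 * Each comparison discards one whole sub-tree. */
TreeNode *bst_search(TreeNode *root, int key)
{
    while (root != NULL && root->key != key)
        root = (key < root->key) ? root->left : root->right;
    return root;
}
```

If keys arrive in sorted order, every insertion goes to the right and the tree degenerates into a chain, which is why search time then grows in proportion to the number of elements.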

Tree Traversal
•There are 3 ways to traverse the tree:
–inOrder
–preOrder
–postOrder
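
The three traversals differ only in when the node itself is visited relative to its sub-trees. In the sketch below (names chosen for this example) each function appends the visited keys to an output array, which the caller must make large enough, and returns the updated count.

```c
#include <stddef.h>

typedef struct TreeNode {
    int key;
    struct TreeNode *left;
    struct TreeNode *right;
} TreeNode;

/* inOrder: left sub-tree, then the node, then the right sub-tree. */
int in_order(const TreeNode *t, int out[], int n)
{
    if (t == NULL)
        return n;
    n = in_order(t->left, out, n);
    out[n++] = t->key;
    return in_order(t->right, out, n);
}

/* preOrder: the node first, then the left and right sub-trees. */
int pre_order(const TreeNode *t, int out[], int n)
{
    if (t == NULL)
        return n;
    out[n++] = t->key;
    n = pre_order(t->left, out, n);
    return pre_order(t->right, out, n);
}

/* postOrder: both sub-trees first, then the node. */
int post_order(const TreeNode *t, int out[], int n)
{
    if (t == NULL)
        return n;
    n = post_order(t->left, out, n);
    n = post_order(t->right, out, n);
    out[n++] = t->key;
    return n;
}
```

For a root 2 with left child 1 and right child 3, inOrder yields 1 2 3, preOrder yields 2 1 3 and postOrder yields 1 3 2. On an ordered binary tree, inOrder always visits the keys in increasing order.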

AVL Tree
•AVL trees were the first dynamically balanced trees to be proposed.
•They are not perfectly balanced, but pairs of sub-trees differ in height by at most 1, maintaining an O (log n) search time.
•Addition and deletion operations also take O (log n) time.

Heaps (Priority Queues)
•Although queues may solve FIFO, or First Come First Served, problems, there are certain cases in which we need to violate this rule.
•For instance, certain jobs are more important than others.
•Those jobs must be given higher priority so that they are executed first.
•These particular jobs require a special kind of queue known as a priority queue.
•A priority queue is a special type of queue which allows items with a special characteristic to be removed from the list first.
•One typical example is to remove the item with the smallest value from the list.
•A heap is an example of a priority queue.

Binary Heap
•A heap is a binary tree that is completely filled with the possible exception of the bottom level, which is filled from left to right.
•Such a tree is known as a complete binary tree

Basic Heap Operations

DeleteMins are handled in a similar manner to insertions. Finding the minimum is easy; the hard part is removing it. When the minimum is removed, a hole is created at the root (remember that the root contains the minimum value), making the heap one smaller. It follows that the last element X in the heap must move somewhere in the heap. If X can be placed in the hole, then we are done; however, this is unlikely to happen. In the case of an ascending heap, the smaller child of the hole takes over the hole, thus pushing the hole down one level. We repeat this step until X can be placed in the hole. Thus, our action is to place X in its correct spot along a path from the root containing minimum children.
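
The hole-pushing procedure can be sketched for an array-based binary min-heap as follows. The layout (root at index 1, children of i at 2i and 2i + 1), the fixed capacity and the names are assumptions for this example. An insert routine is included only so that the sketch can be exercised.

```c
#define HEAP_CAPACITY 100

typedef struct {
    int size;
    int elements[HEAP_CAPACITY + 1];   /* index 0 unused; root at index 1 */
} MinHeap;

void heap_init(MinHeap *h)
{
    h->size = 0;
}

/* Insert x by moving a hole up from the new last position.
 * Returns 1 on success, 0 if the heap is full. */
int heap_insert(MinHeap *h, int x)
{
    if (h->size == HEAP_CAPACITY)
        return 0;
    int i = ++h->size;
    while (i > 1 && h->elements[i / 2] > x) {
        h->elements[i] = h->elements[i / 2];   /* parent moves down */
        i /= 2;
    }
    h->elements[i] = x;
    return 1;
}

/* Remove and return the minimum. The caller must ensure size > 0. */
int heap_delete_min(MinHeap *h)
{
    int min = h->elements[1];
    int last = h->elements[h->size--];   /* X, the element that must move */
    int i = 1;                           /* the hole starts at the root */

    while (i * 2 <= h->size) {
        int child = i * 2;
        /* Pick the smaller of the two children, if there are two. */
        if (child != h->size && h->elements[child + 1] < h->elements[child])
            child++;
        if (last <= h->elements[child])
            break;                       /* X fits in the hole */
        h->elements[i] = h->elements[child];   /* child fills the hole */
        i = child;                       /* hole moves down one level */
    }
    h->elements[i] = last;
    return min;
}
```

Repeated calls to heap_delete_min return the stored values in increasing order, which is the idea behind HeapSort.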

HeapSort
•One of the popular applications of the heap is HeapSort.
•HeapSort sorts in O(N log N) time, which is one of the best Big-O running times among sorting methods.

Advanced Heaps
•d-Heap
–A simple generalization is a d-heap, which is exactly like a binary heap, except that all nodes have d children (thus, a binary heap is a 2-heap).
–Since each node of a d-heap has more children, a d-heap is much shallower than a binary heap, improving the running time of Insert to O(log_d N).
•The most glaring weakness of the heap implementation, aside from the inability to perform Finds, is that combining two heaps into one is a hard operation.
•This extra operation is known as a merge.

Algorithm and Data structure: Stack and queues

STACK
A stack is a constrained version of a list, with the restriction that insertions and deletions can be performed in only one position, namely the end of the list. This position is called the top.
For this reason, a stack is referred to as a Last-In, First-Out (LIFO) ADT.

Array Implementation
For the array implementation, we only need to declare an array for the stack data and an integer variable to mark the index of the top of the stack.
The array implementation of a stack is much simpler; however, it has the limitation of being fixed in size.
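
A minimal sketch of the array implementation, assuming a stack of int with a fixed capacity of 100. The names and the capacity are illustrative only.

```c
#include <stdbool.h>

#define STACK_MAX 100          /* the fixed size is the limitation noted above */

typedef struct {
    int data[STACK_MAX];       /* the stack data */
    int top;                   /* index of the top element, -1 when empty */
} Stack;

void stack_init(Stack *s)
{
    s->top = -1;
}

bool stack_is_empty(const Stack *s)
{
    return s->top == -1;
}

/* Push x; returns false if the stack is full. */
bool stack_push(Stack *s, int x)
{
    if (s->top == STACK_MAX - 1)
        return false;
    s->data[++s->top] = x;
    return true;
}

/* Pop the top element into *out; returns false if the stack is empty. */
bool stack_pop(Stack *s, int *out)
{
    if (stack_is_empty(s))
        return false;
    *out = s->data[s->top--];
    return true;
}
```

Both push and pop touch only the top index, so each takes constant time.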

Application of Stack
•Applications that use stack ADT:
–Balancing Symbols
–Converting Infix to Postfix Expression
–Evaluating Postfix Expression
–Function Call

•Balancing Symbols
–Check a mathematical expression to ensure that the parentheses are balanced
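
One way to check balance with a stack is sketched below. It handles round, square and curly brackets, ignores every other character, and uses a small fixed-size stack. These choices and the function name are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 256          /* hypothetical limit on nesting depth */

/* Return true if every opening bracket in expr is closed by a matching
 * bracket in the correct order. */
bool is_balanced(const char *expr)
{
    char stack[MAX_DEPTH];
    int top = -1;

    for (size_t i = 0; expr[i] != '\0'; i++) {
        char c = expr[i];
        if (c == '(' || c == '[' || c == '{') {
            if (top == MAX_DEPTH - 1)
                return false;             /* nesting too deep for this sketch */
            stack[++top] = c;             /* push the opening symbol */
        } else if (c == ')' || c == ']' || c == '}') {
            if (top < 0)
                return false;             /* closing symbol with nothing open */
            char open = stack[top--];     /* pop and compare */
            if ((c == ')' && open != '(') ||
                (c == ']' && open != '[') ||
                (c == '}' && open != '{'))
                return false;
        }
    }
    return top == -1;                     /* nothing may remain open */
}
```

For example, "(a + b) * [c - d]" is balanced, while "(a + b]" and "((a)" are not.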

•Evaluating Postfix Expression
–A stack is also used to evaluate a postfix expression, i.e. to calculate the value produced by the postfix expression
–When a number is seen, we push it onto the stack;
–When an operator is seen, the operator is applied to the two numbers that are popped from the stack.
–The result is then pushed onto the stack
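
The evaluation rule above can be sketched as follows. To keep the example short it assumes single-digit operands, the four operators + - * /, integer arithmetic, and optional spaces between tokens. The function name and these restrictions are not from the notes.

```c
#include <ctype.h>
#include <stdbool.h>

#define MAX_OPERANDS 64        /* hypothetical stack capacity */

/* Evaluate a postfix expression. On success store the value in *result
 * and return true; return false if the expression is malformed. */
bool eval_postfix(const char *expr, int *result)
{
    int stack[MAX_OPERANDS];
    int top = -1;

    for (; *expr != '\0'; expr++) {
        char c = *expr;
        if (isspace((unsigned char)c))
            continue;
        if (isdigit((unsigned char)c)) {
            if (top == MAX_OPERANDS - 1)
                return false;
            stack[++top] = c - '0';       /* a number: push it */
        } else {
            if (top < 1)
                return false;             /* an operator needs two operands */
            int right = stack[top--];     /* popped first: right operand */
            int left = stack[top--];
            switch (c) {
            case '+': stack[++top] = left + right; break;
            case '-': stack[++top] = left - right; break;
            case '*': stack[++top] = left * right; break;
            case '/':
                if (right == 0)
                    return false;
                stack[++top] = left / right;
                break;
            default:
                return false;             /* unknown symbol */
            }
        }
    }
    if (top != 0)
        return false;                     /* exactly one value must remain */
    *result = stack[0];
    return true;
}
```

For instance, "6 5 2 3 + 8 * + 3 + *" evaluates to 288.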

Queues

•Queues are lists.
•With a queue, insertion is done at one end whereas deletion is performed at the other end.
•A queue is referred to as a First-In, First-Out (FIFO) ADT.
•A queue can be implemented using an array or a linked list.
•Pointers in the queue implementation:
–Head
–Tail
•Node process:
–Enqueue: is done through the tail pointer
–Dequeue: is done through the head pointer
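
A linked-list queue with head and tail pointers might be sketched like this. The type and function names are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct QNode {
    int value;
    struct QNode *next;
} QNode;

typedef struct {
    QNode *head;               /* dequeue happens here */
    QNode *tail;               /* enqueue happens here */
} Queue;

void queue_init(Queue *q)
{
    q->head = NULL;
    q->tail = NULL;
}

/* Add value at the tail; returns false if memory allocation fails. */
bool enqueue(Queue *q, int value)
{
    QNode *n = malloc(sizeof *n);
    if (n == NULL)
        return false;
    n->value = value;
    n->next = NULL;
    if (q->tail == NULL)
        q->head = n;           /* first node: it is head and tail at once */
    else
        q->tail->next = n;
    q->tail = n;
    return true;
}

/* Remove the value at the head into *out; returns false if the queue is empty. */
bool dequeue(Queue *q, int *out)
{
    QNode *n = q->head;
    if (n == NULL)
        return false;
    *out = n->value;
    q->head = n->next;
    if (q->head == NULL)
        q->tail = NULL;        /* the queue became empty */
    free(n);
    return true;
}
```

Values come out in the order they went in, which is the First-In, First-Out behaviour described above.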

Algorithm and Data structure: Linear Lists

Goals of studying Data Structures:
–To identify and develop useful mathematical entities and operations and to determine what classes of problems can be solved by using these entities and operations
•This goal views a high-level data type as a tool that can be used to solve other problems
–To determine representations of those abstract entities and to implement the abstract operations on these concrete representations
•This goal views the implementation of such a data type as a problem to be solved using already existing data types

Abstract Data Type
•Data type = a collection of values and a set of operations on those values.
•Some instances of data types are integer, character, string, float and double.
•Abstract = conceptual solution to a specific problem. The conceptual solutions must be independent from the implementation.
•Abstract Data Type (ADT) = the basic mathematical concept that defines the data type.
•An ADT is not concerned with implementation at all, not even the type of programming language used.
•An ADT is a useful guideline to implementers and a useful tool to programmers who wish to use the data type correctly
•Two parts of an ADT:
–Value definition: defines the collection of values of the ADT
–Operator definition: defines the operations which can be performed on the data type
•The reason for creating an ADT is to help programmers solve problems

Arrays
•An array is a composite (structured) data type.
•It is an ordered collection of elements of the same data type.
•Two basic operations on an array:
–Extraction: a function that accepts an array, a, and an index, i, and returns the element of the array at that index
–Storing: accepts an array, a, an index, i, and an element, x, and stores x at position i
•Bounds of an array’s index:
–Lower bound: the smallest value of the array’s index.
–Upper bound: the highest value of the array’s index.
–Range: the number of elements in the array
•Formula: range = upper bound – lower bound + 1
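As a small illustration, the two operations and the range formula might look like this in C. The bounds LOWER and UPPER and the function names extract and store are assumptions made for the example.

```c
#include <assert.h>

#define LOWER 0                       /* lower bound of the index */
#define UPPER 4                       /* upper bound of the index */
#define RANGE (UPPER - LOWER + 1)     /* range = upper - lower + 1 = 5 */

/* Extraction: return the element of a at index i. */
int extract(const int a[], int i)
{
    assert(i >= LOWER && i <= UPPER);
    return a[i];
}

/* Storing: put element x into a at index i. */
void store(int a[], int i, int x)
{
    assert(i >= LOWER && i <= UPPER);
    a[i] = x;
}
```

An array declared as int a[RANGE] then has valid indexes LOWER through UPPER.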

Linked Lists
•A list is a collection of values arranged in sequence
•An array is a data type suitable for representing a list; however, with an array we implement the list as a static data structure
•A linked list is used to represent a list as a dynamic data structure
•A linked list is a linear collection of self-referential structures, called nodes, connected by pointer links
•Advantages of linked lists:
–A linked list is appropriate when the number of data elements to be represented in the data structure at once is unpredictable.
–Linked lists are dynamic, so the length of a list can increase or decrease as necessary
–A linked list becomes full only when the system has insufficient memory to satisfy dynamic storage allocation requests

•Pointers
•Self-Referential Structures
•Dynamic Memory Allocation
•Linked List
•Implementation of Linked List
•Traverse a Linked List
•Advanced Linked List
•Linked List in Applications
•Weakness of Linked List

•A list usually refers to a sequence of similar elements, for example a list of student names or a list of books
•In a computer, a list can easily be represented using an array. For example, a list of integers can be represented by an array of integers
•A linked list consists of a series of structures, which are not necessarily adjacent in memory.
•Each structure is linked to the next structure (its successor) through a self-referential pointer.
•A linked list is thus a linear collection of self-referential structures, called nodes.
•Subsequent nodes are accessed via the link pointer member stored in each node. By convention, the link pointer in the last node of a list is set to NULL to mark the end of the list.
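A minimal sketch of a self-referential structure and of building a list with dynamic memory allocation. The names struct ListNode, nextPtr, list_push_front and list_free are illustrative assumptions, and integer data is assumed.

```c
#include <assert.h>
#include <stdlib.h>

/* Self-referential structure: it contains a pointer to its own type. */
struct ListNode {
    int data;
    struct ListNode *nextPtr;   /* link to the next node, NULL at the end */
};

/* Allocate a node dynamically and link it in front of head.
   Returns the new head of the list. */
struct ListNode *list_push_front(struct ListNode *head, int value)
{
    struct ListNode *node = malloc(sizeof *node);
    assert(node != NULL);       /* allocation fails only when memory runs out */
    node->data = value;
    node->nextPtr = head;
    return node;
}

/* Release every node of the list. */
void list_free(struct ListNode *head)
{
    while (head != NULL) {
        struct ListNode *next = head->nextPtr;
        free(head);
        head = next;
    }
}
```

An empty list is represented here by a NULL head pointer.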

Implementation of Linked List
•Deleting A Node
–Deleting a node requires 3 pointers.
–currentPtr and previousPtr are used for traversal, and a temporary pointer (tempPtr) holds the node to be deleted before it is removed from the list.
–Unlinking a node from the list does not release its memory; the memory is released only when we call free(tempPtr)
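A sketch of deletion using the three pointers named above. The node layout, the helper list_push_front and the function name list_delete are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

struct ListNode {
    int data;
    struct ListNode *nextPtr;
};

/* Helper for building a list: insert value in front of head. */
struct ListNode *list_push_front(struct ListNode *head, int value)
{
    struct ListNode *node = malloc(sizeof *node);
    assert(node != NULL);
    node->data = value;
    node->nextPtr = head;
    return node;
}

/* Delete the first node holding value. Returns 1 if a node was deleted,
   0 if the value was not found. headPtr is the address of the head pointer
   so that the head itself can be deleted. */
int list_delete(struct ListNode **headPtr, int value)
{
    struct ListNode *previousPtr = NULL;
    struct ListNode *currentPtr = *headPtr;
    struct ListNode *tempPtr;

    /* Traverse until the value is found or the list ends. */
    while (currentPtr != NULL && currentPtr->data != value) {
        previousPtr = currentPtr;
        currentPtr = currentPtr->nextPtr;
    }
    if (currentPtr == NULL)
        return 0;                               /* not found */

    tempPtr = currentPtr;                       /* hold the node to delete */
    if (previousPtr == NULL)
        *headPtr = currentPtr->nextPtr;         /* deleting the first node */
    else
        previousPtr->nextPtr = currentPtr->nextPtr;
    free(tempPtr);                              /* now the memory is released */
    return 1;
}
```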

Traverse A Linked List
•Traverse means moving from one node to another continuously.
•To traverse a linked list, we need to use a pointer to move from one node to another.
•The same traversal is used to search the list for a matching element.
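A minimal traversal-and-search sketch. The node layout and the names list_search and list_length are assumptions for illustration.

```c
#include <stddef.h>

struct ListNode {
    int data;
    struct ListNode *nextPtr;
};

/* Move a pointer from node to node until key is found.
   Returns the matching node, or NULL if the list holds no such value. */
struct ListNode *list_search(struct ListNode *head, int key)
{
    struct ListNode *currentPtr = head;
    while (currentPtr != NULL) {
        if (currentPtr->data == key)
            return currentPtr;
        currentPtr = currentPtr->nextPtr;   /* step to the next node */
    }
    return NULL;
}

/* Traverse the whole list and count its nodes. */
int list_length(const struct ListNode *head)
{
    int count = 0;
    for (; head != NULL; head = head->nextPtr)
        count++;
    return count;
}
```

The loop stops at the NULL link that marks the end of the list.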

Advanced Linked List
•Extensions of the linked list:
–Circular Linked List
–Doubly Linked List
–Skip List

Linked List In Applications
•A linked list can be viewed as a dynamic, non-contiguous array
•A linked list adds the capability of dynamic space allocation and efficient insertion and deletion
•Linked lists are typically important for databases; moreover, they are often used as a supporting ADT that enables other ADTs
•Some of the ADTs that use lists are Stack, Queue, Tree and Graph

Weakness of Linked List
•The weakness of a linked list compared to an array is in searching and data retrieval
•Because a linked list is not indexed, the only way to retrieve an item is to traverse the list from the head until the item is found
•If the data is located near the beginning of the list, the problem is not significant
•However, if the data is located at the end of the list, it may seriously affect program performance
•An array, on the other hand, is indexed. A program can directly access an array element if it knows the index of the element

Algorithm and Data structure: Algorithm Analysis

Introduction

•The role of algorithm analysis is to estimate resource consumption
•Algorithm analysis methods:
–Empirical analysis
–Mathematical (theoretical) analysis

Algorithm Analysis

•Algorithm analysis
–Is the process of estimating the running time of an algorithm
–The main consideration in algorithm analysis is the size of input
•Algorithm Efficiency
–In order to determine the efficiency of an algorithm, we have to consider how the running time of an algorithm grows with the amount of data
–An algorithm whose running time grows more slowly is considered more efficient than one whose running time grows faster
–The usual way to express this estimate is Big-O notation

Estimating Running Time

•Multiple Consecutive Line
•Looping
•Nested Loops
•If-Else
•For Loop
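One way to see how these constructs affect running time is to count how often the innermost statement executes. The sketch below is only an illustration; the function names count_single and count_nested are hypothetical.

```c
/* Single loop: the body runs n times, so the running time is O(n). */
long count_single(int n)
{
    long steps = 0;
    for (int i = 0; i < n; i++)
        steps++;
    return steps;
}

/* Nested loops: the inner body runs n * n times, so the running time is O(n^2). */
long count_nested(int n)
{
    long steps = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            steps++;
    return steps;
}
```

For consecutive statements the costs add up and the largest term dominates, and for an if-else the cost is that of the test plus the more expensive branch.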

Case Analysis

•Best Case, Average Case and Worst Case Analysis
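Linear search is a common illustration of the three cases. In this sketch the function name linear_search and the comparisons output parameter are assumptions added so that the work done can be observed.

```c
/* Search a[0..n-1] for key. Returns the index of key, or -1 if absent.
   *comparisons receives the number of elements examined. */
int linear_search(const int a[], int n, int key, int *comparisons)
{
    *comparisons = 0;
    for (int i = 0; i < n; i++) {
        (*comparisons)++;
        if (a[i] == key)
            return i;       /* best case: found at index 0 after 1 comparison */
    }
    return -1;              /* worst case: all n elements were examined */
}
```

The best case needs one comparison, the worst case needs n, and on average about n/2 comparisons are needed when the key is present and equally likely to be at any position.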

Selecting Best Algorithm

•Points to consider when writing a program:
–Nested looping reduces algorithm efficiency; deeply nested loops are very bad for program efficiency
–Recursion can sometimes be more efficient than looping, but be careful to implement it correctly; a badly written recursion can perform horribly
–The structure of the data is also important. If the data is structured badly, the algorithm is more likely to run in its worst case
•Selecting the best algorithm depends on your program requirements:
–Critical parts of your program may require a more efficient algorithm because those parts are executed more often
–Parts that are seldom used can use a less efficient algorithm that is simpler to code

Algorithm and Data structure: Programming Principles

Introduction

•What is Programming?
–Program = a set of instructions that tells a machine to carry out a specific task or to solve a specific problem.
–Algorithm = a step-by-step procedure that accomplishes a desired task
–Program structure = the actions and the order of execution of an algorithm
–Data structure = the format of the data in computer memory
–Programming = the activity of communicating algorithms
–Programming process is analogues
–Programming process:
•Identifying programming task or problem
•Formulating algorithm for its solutions
•Translating the algorithm into programming language
•Testing and debugging the program

Memory Concepts

•The memory concept describes how data (input) and information (output) are stored in the computer’s memory
int int1, int2, sum;
scanf("%d", &int1);
scanf("%d", &int2);
sum = int1 + int2;
•Variable names such as int1, int2 and sum actually correspond to locations in the computer's memory

Program Structures

•Sequential execution = statements in a program are executed one after the other in the order in which they are written
•Examples of basic C program structures:
–Conditional
–Looping and Iteration
–Recursion
–Pointers
–Advanced Data Types
–File Manipulation

Program Structures: Conditional

•C provides logical operators for building conditions
•These operators are used in conjunction with the following statements:
–if
–? (ternary operator)
–switch

•The if statement has the same function as in other languages.

•The ? (ternary conditional) operator is a more compact form for expressing a simple if-else statement

•The switch statement allows a multiple-choice selection among items at one level of a conditional
•It is a far neater way of writing multiple if statements

•The break statement is needed if we want to terminate the switch after the execution of one choice
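The three conditional forms can be compared in a short sketch. The functions is_leap, max_of and days_in_month are hypothetical examples, not taken from the notes.

```c
/* if with the logical operators && (and), || (or) and != (not equal). */
int is_leap(int year)
{
    if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0)
        return 1;
    else
        return 0;
}

/* Ternary operator: a compact if-else that yields a value. */
int max_of(int a, int b)
{
    return (a > b) ? a : b;
}

/* switch: multiple choice at one level. Each break ends the switch;
   without it, execution would fall through into the next case. */
int days_in_month(int month, int leap)
{
    int days;
    switch (month) {
    case 2:
        days = leap ? 29 : 28;
        break;
    case 4: case 6: case 9: case 11:
        days = 30;
        break;
    default:
        days = 31;
        break;
    }
    return days;
}
```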

Program Structures: Looping and Iteration

•C mechanisms for controlling looping and iteration:
–The for statement
–The while statement
–The do-while statement
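As a sketch, the same sum 1 + 2 + ... + n written with each of the three statements. The function names are hypothetical.

```c
/* for: initialisation, test and update are all in the loop header. */
int sum_for(int n)
{
    int sum = 0;
    for (int i = 1; i <= n; i++)
        sum += i;
    return sum;
}

/* while: the test is made before each iteration. */
int sum_while(int n)
{
    int sum = 0;
    int i = 1;
    while (i <= n) {
        sum += i;
        i++;
    }
    return sum;
}

/* do-while: the test is made after the body, so the body always runs
   at least once. This version therefore assumes n >= 1. */
int sum_do_while(int n)
{
    int sum = 0;
    int i = 1;
    do {
        sum += i;
        i++;
    } while (i <= n);
    return sum;
}
```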

Program Structures: Recursion

•Recursion = a function that calls itself either directly or indirectly through another function
•A recursive function is called to solve a problem
•Two cases when the function is called:
–The function is called with a base case
•The function simply returns a result
–The function is called with a more complex problem
•The function divides the problem into two conceptual pieces:
–A piece that the function knows how to do
–A piece that the function does not know how to do; this piece must be a smaller version of the original problem, so the function calls itself to solve it
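The factorial function is a standard illustration of the base case and the recursive step; it is offered here as an example and is not taken from the notes.

```c
/* n! = n * (n - 1)!, with 0! = 1! = 1. */
unsigned long factorial(unsigned int n)
{
    if (n <= 1)
        return 1;                    /* base case: simply return a result */
    /* The piece the function knows how to do is the multiplication by n.
       The piece it does not know how to do, (n - 1)!, is a smaller
       version of the same problem, so the function calls itself. */
    return n * factorial(n - 1);
}
```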

Program Structures: Pointers

•C uses pointers a lot, because:
–They are the only way to express some computations
–They produce compact and efficient code
–They provide a very powerful tool.
•C uses pointers explicitly with:
–Arrays
–Structures
–Functions
•Pointer = a variable which contains the address in memory of another variable
•A pointer to any variable type holds an address in memory. An address is a number, but a pointer is not an integer type
•Two pointer operators:
–The unary (monadic) operator & gives the ‘address of a variable’
–The indirection or dereference operator * gives the ‘content of an object pointed to by a pointer’
•Declaring a pointer to a variable:
–int *pointer;

Program Structures: Advanced Data Types

•Structures
–Structures in C are similar to records in Pascal

•Defining New Data Type
–typedef can also be used with structures.
–C allows arrays of structures

•Unions
–A union is a variable which may hold (at different times) objects of different sizes and types.
–C uses the union keyword to create unions

•Coercion or Type-Casting
–C allows coercion.
–Coercion means forcing a value of one type to be treated as another type
–Coercion in C is done with the cast operator ()

•Enumerated Types
–Enumerated types contain a list of constants that can be addressed as integer values

•Static Variables
–A static variable is local to a particular function
–It is initialized only once (on the first call to the function) and keeps its value between calls
–To define a static variable, simply prefix the variable declaration with the static keyword
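The features of this section can be sketched together as follows. All type and function names (Student, union Number, enum Color, average, next_id) are made up for the example.

```c
#include <assert.h>

/* typedef with a structure: Student becomes a new type name. */
typedef struct {
    char name[32];
    int age;
} Student;

/* A union holds one member at a time; its members share the same storage. */
union Number {
    int i;
    double d;
};

/* Enumerated type: the constants are numbered 0, 1, 2 by default. */
enum Color { RED, GREEN, BLUE };

/* Cast operator: (double) forces a floating-point division. */
double average(int total, int count)
{
    assert(count != 0);
    return (double)total / count;
}

/* Static local variable: initialized once, keeps its value between calls. */
int next_id(void)
{
    static int counter = 0;
    return ++counter;
}
```

An array of structures can then be declared as, for example, Student class_list[30].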

Program Structures: File Manipulation

•All text file functions and types in C come from the stdio library
•The library provides commands for reading from and writing to a file

•Random Access Files
–Records are fixed in length
–Records can be accessed directly without searching through other records
–Appropriate for database systems that require rapid access to specific data
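A minimal sketch of random access with fixed-length records, using the standard stdio functions fseek, fwrite and fread. The record layout struct Record and the function names write_record and read_record are assumptions for illustration; the file is expected to be open in a binary read/write mode.

```c
#include <assert.h>
#include <stdio.h>

/* Every record has the same size, so record k starts at byte
   k * sizeof(struct Record). */
struct Record {
    int id;
    double balance;
};

/* Write rec as record number index. Returns 1 on success, 0 on failure. */
int write_record(FILE *fp, long index, const struct Record *rec)
{
    assert(fp != NULL && index >= 0);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0)
        return 0;
    return fwrite(rec, sizeof *rec, 1, fp) == 1;
}

/* Read record number index into rec. Returns 1 on success, 0 on failure. */
int read_record(FILE *fp, long index, struct Record *rec)
{
    assert(fp != NULL && index >= 0);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0)
        return 0;                       /* jump directly, no searching */
    return fread(rec, sizeof *rec, 1, fp) == 1;
}
```

A file opened with fopen(name, "rb+") or "wb+" could be passed as fp; reading a record number beyond the end of the file returns 0.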

Programming Style
Names:

•The names of variables and functions should be chosen with care so as to identify their meanings clearly and succinctly
•Guidelines in choosing name:
–Names should be meaningful and should suggest clearly the purpose of the function, variable and the like
–Keep the name simple for variables used only briefly and locally
–Use common prefixes or suffixes to associate names of the same general category
–Develop proper conventions to name your variables and functions
–Avoid deliberate misspellings and meaningless suffixes to obtain different names
–Avoid choosing cute names whose meaning has little or nothing to do with the problem
–Avoid choosing names that are close to each other in spelling or otherwise easy to confuse
–Be careful in the use of the letters l (small ell) and O (capital oh) and the digit 0, which are easy to confuse

Documentation and Format

•A good habit is to prepare documentation as the program is being written.
•Guidelines for documentation style:
•Place a prologue at the beginning of each function
•When each variable, constant, or type is declared, explain what it is and how it is used
•Introduce each significant section of the program with a comment stating briefly its purpose or action
•Indicate the end of each significant section
•Avoid comments that parrot what the code does
•Explain any statement that employs a trick or whose meaning is unclear
•The code itself should explain how the program works. The documentation should explain why it works and what it does
•Whenever a program is modified, be sure that the documentation is correspondingly modified

Refinement and Modularity

•One of the most important parts of the refinement process is deciding exactly what the task of each function is, specifying precisely what its input will be and what result it will produce
•Specifying the action of the function:
–Preconditions
•State what must be true before the function begins
–Postconditions
•State what will be true when the function finishes

Program Tracing and Debugging

•A debugger is used to keep track of function calls, changes of variables and so on.
•Scaffolding is temporary code, such as statements that print a snapshot of selected variables, that helps the programmer converge quickly on the particular location where an error is occurring.
•Scaffolding is excellent for tracing pointer errors
•Scaffolding can also help a novice programmer to test whether code is correct
•Tracing and debugging a program is a skill to be mastered. Hence, put more effort into debugging your own programs, as the only way to improve the skill is through practice

e-commerce and IT law: Selling and Marketing on the Web

•Business models for selling on the Web
•Effective Web Presence
•Identifying and Reaching for Customers
•Branding

Business/revenue models for selling on the Web

•Web Catalog Revenue Model
–Based on the mail order catalog business model.
–Seller establishes a brand image and uses the strength to sell to prospective buyers.
–Buying activities:
•Customers get the product information from the Web site and place orders through the Web sites or telephone.
–Typical items: apparel, computers, electronics, housewares, and gifts.

•Web Catalog Revenue Model (cont.)
–Examples :
•Some companies have Web sites in addition to physical stores. Both sell the same products.
•Dell (http://www.dell.com/) allows customers to specify the computer configuration they order on the Web.
•Amazon.com (http://www.amazon.com/) started with selling books because they were easy and inexpensive to ship
•Luxury goods and high-fashion clothing items are rarely bought through the Web site. E.g. the Web site of Versace provides information to shoppers, who then visit the physical stores to examine the items shown on the Web site.
•Clothing retailers include photos, prices, sizes, colors and tailoring details of their products. A problem for clothing retailers is that color settings vary between computer monitors, which makes it difficult for customers to judge the true color.

•Digital Content Revenue Model
–Firms that own intellectual property or rights to that property have embraced the Web as a new and highly efficient distribution mechanism.
–Typical items: newspapers, journals, court cases, laws, tax regulations and other digital content.
–Buying activities:
•Most of the time, users have to subscribe to these sites by paying a subscription fee. However, some companies offer a credit card charge option or e-cash payment for infrequent users who do not want a subscription.
–Examples below are accessible through Unitar Virtual Library (try it out):
•ProQuest – sells digital copies of published documents
•ACM Digital Library – subscriptions to electronic versions of journals
–Electronic publishing of journals eliminates the costs of paper, printing and delivery.

•Advertising-Supported Revenue Model
–Originally used by network television that provides free programming to an audience along with advertising messages.
–Only a few general-interest Web sites have generated sufficient traffic to be profitable based on advertising revenue alone.
–Web employment advertising is one example of successful implementation of the advertising-supported business model.
–‘Stickiness’ of a Web site is its ability to keep visitors at the site and attract repeat visitors.
–2 major problems in web advertising:
•No consensus on how to measure and charge for site visitor views – e.g. number of visitors/number of click-throughs
•Very few Web sites have sufficient numbers of visitors to interest large advertisers.

•Advertising-Supported Model examples:
–About.com, HowStuffWorks
–Portals : e.g. Yahoo
•Yahoo was a Web directory which expanded into a portal that includes a Web directory, search engine, e-mail, calendar etc.
–Advertising on the results page is triggered by the terms in the search
•A few other portal sites use the advertising-supported revenue model, e.g. AOL, MSN
–Newspaper publishers
–Classified advertising sites e.g. Web employment advertising – very successful
•Can add targeted banner ads to the results page, for which advertisers pay more
•Can add short articles on topics of interest to increase the site’s stickiness and to attract people not necessarily looking for a job

•Advertising-Subscription Mixed Revenue Model
–Subscribers pay a fee and accept some level of advertising.
–In most cases, the subscribers are subjected to much less advertising than on advertising supported sites.
–Mostly used by newspapers and magazines.
–Examples:
•The New York Times
•The Wall Street Journal
•TmNet E-browse offers digital version for a few newspapers e.g. Berita Harian, New Straits Times at a fee which can be paid quarterly, half-yearly or yearly
http://ebrowse.bluehyppo.com/index_nst.asp

•Fee-for-Transaction Revenue Model
–Involves receiving a fee for facilitating a transaction. Fee is based on the number or size of transactions they process.
–Travel agencies earn commissions from transportation and lodging providers, e.g. on airplane tickets and hotel reservations.
–Travel agencies also generate advertising revenue from ads placed on travel information pages.
–Examples: http://www.travelocity.com/

•Fee-for-Transaction Revenue Model (cont.)
–Some auto dealers use a Web site to buy/sell autos and remove the intermediary (salesperson) from the value chain. This is known as disintermediation.
–Some stock brokerage firms use this model by charging their customers a commission for each trade executed.
–Other possible candidates that can use fee-for-transaction model are insurance companies, real estate brokerages, event ticketing and online banking.

•Fee-for-Service Revenue Model
–Fee is charged by companies providing services based on the value of the service.
–Examples:
•online games : visitors pay to play games by downloading or entering games area e.g. Sony’s Station.com
•Streaming video e.g. RealOne SuperPass
•Financial advice and professional services e.g. accountants

Effective Web Presence

•An effective site is one that creates an attractive presence that meets the objectives of the business or other organization.
•Web presences convey the images that companies want to project e.g. favored by a younger generation, classic image
•Most Web sites for organizations contain links to a fairly standard information set e.g. history, mission statement, information about products and services and how to communicate with the organization
•The Web must be used as a two-way communication between organization and customers.

•Objectives of effective site:
•Attracting visitors to the web site
•Making the site interesting enough that visitors stay and explore
•Convincing visitors to follow the web site’s links to obtain information
•Creating an impression consistent with the organization’s desired image
•Building a trusting relationship with visitors
•Reinforcing positive images that the visitor might already have about the organization
•Encouraging visitors to return to the site

•Visitors come to Web sites for reasons such as:
•Learning about products/services
•Buying products/services
•Obtaining information about warranty, service or repair policies
•Obtaining general information about the organization
•Obtaining financial information for making an investment
•Identifying the people managing the organization
•Obtaining contact information for the organization

•Goals that should be met for a Web site:
–Offer easily accessible facts about the organization
–Allow visitors to experience the site in different ways and at different levels e.g. same information but in different file format
–Provide visitors with a two-way communication link with the organization
–Sustain visitor attention and encourage return visits
–Offer easily accessible information about products/services and how to use them

Appealing Web Site Design

•Page organization
•Online news readers view headlines and news briefs first
•Color – must appeal either to general public or to the demographic group that the site tries to attract
•Bright colors attract visitors
•Cooler colors make visitors comfortable
•Pizzazz
•Small interactive programs such as crossword puzzles, trivia, eye-popping animations and prize-bearing competitions.
•To gain repeated visits, offer ‘a little something’ for free e.g. free downloads
•Menus – vertical, horizontal, layered
–Order of the items : items that are important to less important
–Use of color
–Use of interesting bullets
•Links to other pages
•Others:
–Offer more information
–change appearance every few months

Marketing

•Marketing mix = combination of elements that companies use to achieve their goals for selling and promoting their products and services.
•Marketing strategy = a particular marketing mix which consists of particular elements that the company decided to use.
•Essential issue of marketing = 4Ps of marketing
–Product
–Price
–Promotion – spreading the word about the product
–Place – distribution of product

•Marketing strategies:
–Product-based – products are displayed in categories e.g. Staples
–Customer-based
•products are shown based on groups of customers e.g. Sabre
•More common on B2B sites

Customers

•3 ways to reach customers
–Personal contact (one-to-one)
–Mass media (one-to-many)
–The Web (many-to-one, many-to-many)

Marketing Approaches

•Market segmentation = dividing the pool of potential customers into segments whose members have common characteristics.
•Segments are normally defined in terms of demographic characteristics e.g. age, gender, marital status, income level, geographic location.
•Micromarketing = the practice of targeting very small and well-defined market segments.
•It uses specific advertising and promotion efforts
•Geographic segmentation – based on geographic group of customers
•Demographic segmentation – based on age, gender, family size, income, education, religion or ethnicity of customers
•Psychographic segmentation – based on social class, personality, approach to life
–E.g. sports car for customers with high need for achievement

How consumers reach product information

•Search by product type
•Search by brand name
•Click on advertisements
•Shopping channels
•Yellow pages

Advertising on the Web

•Formats:
–Banner ads
–Pop-up ad
–Interstitial ad – the ad appears in its own browser window
–Rich media ad (active ad) – appears/floats over the web page itself, contains moving graphics and sometimes audio/video
–Site sponsorship
–Email marketing

Creating and Maintaining Brands on the Web

•What is a brand?
–Brands are the sum total of all the images that people have in their heads about a particular company and a particular mark.
•Elements of Branding
–Differentiation
•In what significant ways is this product or service unlike its competitors?
–Relevance
•How does this product or service fit into my life?
–Perceived Value
•Is this product or service good?

Types of branding

•Emotional branding
•Use emotional appeal to maintain branding.
•Suitable when ad targets are in a passive mode of information acceptance such as TV and radio.
•Not suitable for web ad because web is an active medium controlled by the users/customers.
•Rational branding
•Offers to help web users in some way in exchange for their viewing an ad.
•Emotional appeal is replaced by cognitive appeal of providing functional assistance.
•Involves interactive marketing - advertisers can interact with consumers and customers and vice versa.

Branding Strategies

•Use Dominant Position and leverage approach
•Only works for firms that already have Web sites that dominate a particular market such as Yahoo!
•Online registry/Brand Consolidation Strategies
•Della & James (online bridal registry) offers a single registry that connects to several local and national department and gift stores.
•Act as a market intermediary as well as selling their own products.
•Affiliate marketing
•Is an arrangement whereby a marketing partner refers consumers to the selling company’s (the merchant’s) website
•This is done by placing an ad or logo or link of the selling company on the affiliated company’s Website
•When a customer who was referred to the selling company’s Website makes a purchase there, the affiliated partner receives a commission.
•For example, Amazon.com has almost 500,000 affiliates.
•Viral marketing
•Depends on existing customers to tell other people

Affiliate Marketing Models

•Pay-per-sale model
•Pay-per-click model
•Pay-per-lead model
–E.g. the consumer fills in and submits a registration form
•Hybrid programs
–E.g. the merchant pays affiliates for the amount of sales and the number of new customers
•Multitiered programs
–Similar to multilevel marketing
–Affiliates at all levels receive a portion of the commission paid by the merchant
•Cross-linking
–Two firms agree to place each other’s clickable icon on banner with no charges.

Branding Issues

•Costs
•It is less expensive to transfer and maintain existing products on the web than to create an entirely new brand on the web.
•Firms have to constantly promote their URL through product packaging, mass media advertising, search engine databases and other information distribution mechanisms.
•Naming a web site – domain name recognition
•Company’s URL is as important as company’s legal trademark in order to identify their products.
•Thus, some companies are willing to buy/invest a lot of money for the URL that can represent their products.

e-commerce and IT law: Introduction to Electronic Commerce

Electronic Commerce

Objectives:

•Differences between e-commerce and traditional commerce
•Advantages and disadvantages of using e-commerce to conduct business
•International nature of e-commerce
•Fostering of e-commerce through economic forces
•Utilizing value chains

What is Electronic Commerce?

•or “e-commerce”
•The marketing, buying and selling of products and services on the Internet
•Used everywhere in everyday life e.g. ?
•E-business = connecting critical business systems directly to customers, vendors, and suppliers via the Internet, Extranet and Intranets

Types of e-commerce: EFT and EDI

•Electronic Funds Transfers (EFT)
–Used by the banking industry to exchange account information over secured networks
•Electronic Data Interchange (EDI)
–Used by businesses to transmit data from one business to another

Traditional Commerce:

•The exchange of valuable objects or services between at least two parties
•Includes all activities that each party undertakes to complete the transaction
•Barter system eventually gave way to the use of currency

Activities as Business Processes:

•The activities in which businesses engage as they conduct commerce are often referred to as Business Processes.
–Transferring funds
–Placing orders
–Sending invoices
–Shipping goods to customers

E-Commerce Drivers:

•Digital convergence
•Anytime, anywhere, anyone
•Changes in organizations
•Increasing pressure on operating costs and profit margins
•Demand for customized products and services

E-commerce Myths:

•Setting up a Web site is easy
•E-commerce is cheap.
•E-commerce means end of mass marketing.
•Everyone is doing it.
•Online retailing is always the low-cost channel.
•Build it and they (customers) will come.

Advantages of Electronic Commerce:

•Increased sales
–Reach narrow market segments in geographically dispersed locations
–Create virtual communities
•Decreased costs
–Handling of sales inquiries
–Providing price quotes
–Determining product availability
•Better customer service
•Quick comparison shopping

Disadvantages of Electronic Commerce:

•Loss of ability to inspect products from remote locations
•Rapidly developing pace of underlying technologies
•Difficult to calculate return on investment
•Cultural and legal impediments
•Security, system and data integrity problems
•Customer relation problems

International Electronic Commerce:

•Language barriers must be overcome
•Political structures
–Currency conversion
–Tariffs and import/export restrictions
•Legal, tax and privacy concerns

Economic Forces and Electronic Commerce:

•Transaction Costs
–The total of all costs that the buyer and seller incur as they gather information and negotiate a purchase-sale transaction
•The “Market”
–Potential sellers must come in contact with potential buyers
–A medium of exchange must be available

Transaction Costs:

•Brokerage fees
•Sales Commissions
•Information search and acquisition
•Investment in equipment
•Hiring of skilled employees

Role of Electronic Commerce in Economy:

•Reduces transaction costs
–Improves information flow
–Increases coordination of actions
•Improvement of existing markets
•Creation of new markets

Value Chains in Electronic Commerce:

Defined as the way of organizing the activities undertaken to design, produce, promote, market, deliver, and support the products or services a business sells.

Value Chain Primary Activities:

•Identify customers
–Market research, customer satisfaction surveys
•Design
–Concept research, engineering, test marketing
•Purchase materials and supplies
–Vendor selection, quality and timeliness of delivery
•Manufacture
–Fabrication, assembly, testing, packaging
•Market and sell
–Advertising, promotion, pricing, monitoring sales and distribution channels
•Deliver
–Warehousing, materials handling, monitoring timeliness of delivery
•Provide after-sale service and support
–Installation, testing, maintenance, repair, warranty replacement, replacement parts

Value Chain Support Activities:

•Finance and administration
–Accounting, bill payment, borrowing, regulations, compliance with laws
•Human resources
–Recruiting, hiring, training, compensation, benefits
•Technology development
–Research, development, improvement studies, maintenance procedures

Role of Electronic Commerce in Value Chains:

•Reducing costs
•Improving product quality
•Reaching new customers or suppliers
•Creating new ways of selling existing products

Categories of E-commerce:

•Business-to-consumer (B2C), on the Internet
•Business-to-business (B2B), on the Internet and Extranet
•Business-within-business, on the Intranet
•Business-to-Government (B2G)
•Customer-to-Customer (C2C)
•Mobile Commerce (M-Commerce)

The Internet and World Wide Web:

•The Internet is a large system of interconnected networks that spans the globe
•The World Wide Web (WWW) is part of the Internet and allows users to share information with an easy-to-use interface

Origins of the Internet:

•Developed by the U.S. Department of Defense in the early 1960s
•The world’s telephone companies were early models for networked computers
•Researchers at universities were connected in 1969

New Uses for the Internet:

•E-mail
–The ability to send messages to one or many across the Internet
•File Transfer Protocol (FTP)
–The ability to transfer data files from one computer to another
•Telnet
–The ability to log on remotely to another computer
•World Wide Web (WWW)
–The ability to access information using a common interface
•Videoconferencing
–The ability to use video across the Internet for conferencing purposes
•Multimedia
–The ability to use video, audio, and animations across the Internet

Information retrieval: Query Language

•Keyword-based querying
–Basic queries
–Boolean queries
–Weighted queries
•Queries in weighted systems
•Pattern matching
•Natural language
•Structured queries
•Query protocols

Keyword-based Querying

•Queries are combinations of words
•The document collection is searched for documents that contain these words
•Word queries are intuitive, easy to express and provide fast ranking
•The concept of word must be defined:
–A word is a sequence of letters terminated by a separator (period, comma, blank, etc)
–Definition of letter and separator is flexible; e.g.: hyphen could be defined as a letter or as a separator
–Usually, “trivial words” (such as “a”, “the”, “or”, “of”) are ignored

Basic Queries

Single-word queries: A query is a single word
•Simplest form of query
•All documents that include this word are retrieved
•Documents may be ranked by the frequency of this word in the document.

Phrase queries: A query is a sequence of words treated as a
single unit. It is also called “literal string” or “exact phrase”
query.
•Phrase is usually surrounded by quotation marks.
•All documents that include this phrase are retrieved
•Usually separators (commas, colons, etc) and “trivial words” (e.g. “a”, “the” or “of”) in the phrase are ignored
•In effect, this query is for a set of words that must appear in sequence.
•Allows users to specify a context and thus gain precision
•Example: “The Lord of The Rings”
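
One possible way to evaluate a phrase query is sketched below: separators and trivial words are dropped from both the phrase and the document, and the remaining words must appear in sequence. The helper names and the stop-word list are assumptions made for the illustration.

```python
STOP_WORDS = {"a", "the", "or", "of"}  # assumed list of trivial words

def content_words(text):
    """Lowercase the text, turn separators into blanks, and drop trivial words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def phrase_match(phrase, document):
    """True if the phrase's content words appear consecutively in the document."""
    needle = content_words(phrase)
    haystack = content_words(document)
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

print(phrase_match("The Lord of The Rings", "I read the lord of the rings twice."))  # True
print(phrase_match("The Lord of The Rings", "The rings belonged to a lord."))        # False
```

The second document contains both content words but in the wrong order, so it is rejected; this is the gain in precision that the phrase query offers over a plain multiple-word query.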

Multiple-word queries: A query is a set of words (or phrase)
•Two interpretations:
–A document is retrieved if it includes any of the query words
–A document is retrieved if it includes each of the query words
•Documents may be ranked by the number of query words they contain:
–A document containing n query words is ranked higher than a document containing m < n query words

Boolean Queries

•Queries combine words with set operators:
–And -> intersection
–Or -> union
–Except -> difference

•The use of except prevents creation of very large answers: not B will compute all documents that do not include B (complement), whereas A except B limits the universe to the documents that include A.
•Precedence: except, and, or; use parentheses to override; process left-to-right among operators with the same precedence.
•Example:
–computer or server except mainframe
•Select all documents that discuss computers, or documents that discuss servers but do not discuss mainframes
–(computer or server) except mainframe
•Select all documents that discuss computers or servers, but do not select any documents that discuss mainframes
–computer except (server or mainframe)
•Select all documents that discuss computers, and do not discuss either servers or mainframes.
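
The three example queries can be tried out with Python sets, where `|` stands for or, `&` for and, and `-` for except. The document ids below are hypothetical. Python's operator precedence happens to agree with the order given above (`-` binds tighter than `&`, which binds tighter than `|`).

```python
# Hypothetical postings: the set of document ids that mention each word.
computer = {1, 2, 3, 4}
server = {3, 4, 5, 6}
mainframe = {2, 4, 6}

q1 = computer | server - mainframe      # computer or (server except mainframe)
q2 = (computer | server) - mainframe    # (computer or server) except mainframe
q3 = computer - (server | mainframe)    # computer except (server or mainframe)

print(sorted(q1))  # [1, 2, 3, 4, 5]
print(sorted(q2))  # [1, 3, 5]
print(sorted(q3))  # [1]
```

Documents 2 and 4 mention mainframes but still appear in the first answer, because without parentheses the except applies to server only.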

•Classical Boolean systems do not rank documents: a document either satisfies the query (and is retrieved) or it does not satisfy the query (and is not retrieved).
•The Boolean formalism is not simple for users without training in mathematics

Weighted Queries

Weighted multiple-word queries: Each of the words is
assigned a different weight, expressing the relative
importance of the word within the request.
•A query is then a set of word-weight pairs: (word1, weight1), …, (wordn, weightn)
•The ranking of a document is the sum of the weights for the query words that it satisfies
•Example:
–Query: (A, 0.8), (B, 0.5), (C, 0.3)
–Document 1: (A, B, D)
–Document 2: (A, C, D)
–Ranking of Document 1: 0.8 + 0.5 = 1.3
–Ranking of Document 2: 0.8 + 0.3 = 1.1
–Each document includes two words from the query, but Document 1 is ranked higher because it includes more important words
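
The ranking in this example can be reproduced with a short sketch. The function name `weighted_score` is an assumption; the query and documents are the ones from the example.

```python
def weighted_score(query, doc_words):
    """Sum the weights of the query words that occur in the document."""
    return sum(weight for word, weight in query.items() if word in doc_words)

query = {"A": 0.8, "B": 0.5, "C": 0.3}
doc1 = {"A", "B", "D"}
doc2 = {"A", "C", "D"}

print(round(weighted_score(query, doc1), 2))  # 1.3
print(round(weighted_score(query, doc2), 2))  # 1.1
```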

Weighted Boolean queries: Each word in a Boolean query is
associated with a weight.
•Example: A and (B or C), where B carries a higher weight than C
–A document with A and B satisfies this query better than a document with A and C (without such weights, both documents satisfy the query equally)

Information retrieval: Retrieval Evaluation

Retrieval Evaluation

•1.Motivation
•2.Precision and recall
•3.Single value measures
•4.Reference collections

•Most systems are evaluated on the basis of their time and space performance.
•For example, in the case of database management systems:
–Time: How long is the response time to queries.
–Space: How much storage is required for index structures, etc.
•In an information retrieval system, where there is no guarantee that answers satisfy the requests as intended by the user, we must also consider:
–Retrieval performance: how good is the answer.

•Precision
–The ability to retrieve top-ranked documents that are mostly relevant.
•Recall
–The ability of the search to find all of the relevant items in the corpus.
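
In their usual set-based form, precision is the fraction of retrieved documents that are relevant and recall is the fraction of relevant documents that are retrieved. A minimal sketch with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of retrieved and relevant ids."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]          # the answer returned by the system
relevant = ["d1", "d3", "d7", "d8", "d9"]     # the ideal answer
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```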

•Total number of relevant items is sometimes not available:
–Sample across the database and perform relevance judgment on these items.
–Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total relevant set.

Effect of enlarging/reducing answers on precision and recall:
•Enlarge answer:
–Recall: Can only improve (denominator remains constant).
–Precision: Unpredictable; will improve (worsen) if the precision in the documents added is better (worse) than the current precision.
•Reduce answer:
–Recall: Can only worsen (denominator remains constant).
–Precision: Unpredictable; will improve (worsen) if the precision in the documents removed is worse (better) than the current precision.
•When documents are ranked, and are added/removed according to their rankings, precision will tend to worsen when lower-ranked documents are added and to improve when they are removed, because lower-ranked documents are less likely to be relevant.

Optimization options:
•Optimize recall => large answers, lower precision.
–Tuning a system to optimize recall would normally result in larger answers with decreased precision.
–Extreme case: Retrieve the entire collection.
•Optimize precision => small answers, lower recall.
–Tuning a system to optimize precision would normally result in smaller answers with decreased recall.
–Extreme case: Retrieve only one or two items.

Estimating the precision and recall of a given answer:
•Precision: Easier to estimate.
–An expert scans the answer and determines the documents that are relevant to the query.
•Recall: Harder to estimate.
–In a small collection, an expert may be able to scan the entire collection to determine the complete set of relevant documents.
–In a large collection (e.g., the WWW), the complete set of relevant documents might never be known; it could be estimated by using a variety of systems (e.g., search engines) and deciding that a document is relevant if it is included in a majority of answers.

Interpolating a Recall/Precision Curve

•Interpolate a precision value for each standard recall level:
–rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
–r0 = 0.0, r1 = 0.1, …, r10=1.0
•The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and (j + 1)-th level:
–P(rj) = max P(r), taken over all r with rj <= r <= rj+1
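
The interpolation rule as stated here can be sketched as follows. The ranking is hypothetical: 15 retrieved documents with relevant ones at ranks 1, 3, 10, 11 and 15, out of 10 relevant documents in total. Recall values are kept as exact fractions so that comparisons with the standard levels are not affected by rounding. Some textbooks take the maximum over all higher recall levels instead; this sketch follows the definition given in the text.

```python
from fractions import Fraction

def recall_precision_points(relevance, total_relevant):
    """Return (recall, precision) measured each time a relevant document is seen."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((Fraction(hits, total_relevant), hits / rank))
    return points

def interpolate(points):
    """Interpolated precision at the 11 standard recall levels 0.0 .. 1.0."""
    levels = [Fraction(j, 10) for j in range(11)]
    result = []
    for j, low in enumerate(levels):
        high = levels[j + 1] if j < 10 else low
        known = [p for r, p in points if low <= r <= high]
        # With no known precision in the interval, 0.0 is used (an assumption).
        result.append(max(known, default=0.0))
    return result

relevant_ranks = {1, 3, 10, 11, 15}  # hypothetical ranking
relevance = [rank in relevant_ranks for rank in range(1, 16)]
curve = interpolate(recall_precision_points(relevance, total_relevant=10))
print([round(p, 2) for p in curve])
# [1.0, 1.0, 0.67, 0.36, 0.36, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0]
```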

•Average Precision at Seen Relevant Documents:
–Measure the precision each time a new relevant document is retrieved.
–Average these precision values.
–In the previous example, 5 relevant documents were retrieved overall, hence 5 precision values are averaged: (1 + 0.67 + 0.3 + 0.36 + 0.33)/5 = 0.53.
•R-precision:
–Let r be the total number of relevant documents (i.e., r is the size of the ideal answer).
–Consider the r top-ranked documents.
–The R-precision measure is defined as the precision of this set.
–In the ideal case, all these documents will be relevant and P = R = 1.
–In the previous example, r = 10 and there were 3 relevant documents among the top 10, hence R-precision is 3/10 = 0.3.
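
Both single-value measures are easy to compute from a ranked list of relevance judgments. The ranking below is an assumption: relevant documents at ranks 1, 3, 10, 11 and 15 is one ranking that reproduces the figures quoted above (0.53 and 0.3).

```python
def average_precision_at_seen(relevance):
    """Mean of the precision values measured at each relevant document seen."""
    precisions, hits = [], 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(relevance, r):
    """Precision of the r top-ranked documents."""
    return sum(relevance[:r]) / r

relevant_ranks = {1, 3, 10, 11, 15}  # hypothetical ranking
relevance = [rank in relevant_ranks for rank in range(1, 16)]

print(round(average_precision_at_seen(relevance), 2))  # 0.53
print(r_precision(relevance, r=10))                    # 0.3
```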

•Typically average performance over a large set of queries.
•Compute average precision at each standard recall level across all queries.
•Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.

Subjective Relevance Measure

•Novelty Ratio: The proportion of items retrieved and judged relevant by the user and of which they were previously unaware.
–Ability to find new information on a topic.
•Coverage Ratio: The proportion of relevant items retrieved out of the total relevant documents known to a user prior to the search.
–Relevant when the user wants to locate documents which they have seen before (e.g., the budget report for Year 2000).
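
A small sketch of the two ratios, assuming the usual reading: novelty is measured against the relevant documents retrieved, coverage against the relevant documents the user already knew. The function names and document ids are hypothetical.

```python
def novelty_ratio(retrieved_relevant, known_before):
    """Share of the relevant retrieved items that the user did not know before."""
    if not retrieved_relevant:
        return 0.0
    return len(retrieved_relevant - known_before) / len(retrieved_relevant)

def coverage_ratio(retrieved_relevant, known_before):
    """Share of the relevant items the user already knew that were retrieved."""
    if not known_before:
        return 0.0
    return len(retrieved_relevant & known_before) / len(known_before)

retrieved_relevant = {"d1", "d2", "d3", "d4"}  # retrieved and judged relevant
known_before = {"d1", "d2", "d9"}              # relevant items known before the search

print(novelty_ratio(retrieved_relevant, known_before))             # 0.5
print(round(coverage_ratio(retrieved_relevant, known_before), 2))  # 0.67
```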

Other Factors to Consider

•User effort: Work required from the user in formulating queries, conducting the search, and screening the output.
•Response time: Time interval between receipt of a user query and the presentation of system responses.
•Form of presentation: Influence of search output format on the user’s ability to utilize the retrieved materials.
•Collection coverage: Extent to which any/all relevant items are included in the document corpus.

Experimental Setup for Benchmarking

•Analytical performance evaluation is difficult for document retrieval systems because many characteristics such as relevance, distribution of words, etc., are difficult to describe with mathematical precision.
•Performance is measured by benchmarking. That is, the retrieval effectiveness of a system is evaluated on a given set of documents, queries, and relevance judgments.
•Performance data is valid only for the environment under which the system is evaluated.

BENCHMARK

Benchmarking - The Problems

•Performance data is valid only for a particular benchmark.
•Building a benchmark corpus is a difficult task.
•Benchmark web corpora are just starting to be developed.
•Benchmark foreign-language corpora are just starting to be developed.

Early Test Collections

•Previous experiments were based on the SMART collection which is fairly small. (ftp://ftp.cs.cornell.edu/pub/smart)
Collection   Number of    Number of   Raw Size
Name         Documents    Queries     (Mbytes)
CACM         3,204        64          1.5
CISI         1,460        112         1.3
CRAN         1,400        225         1.6
MED          1,033        30          1.1
TIME         425          83          1.5
•Different researchers used different test collections and evaluation techniques.

The TREC Benchmark

• TREC: Text REtrieval Conference (http://trec.nist.gov/)
Originated from the TIPSTER program sponsored by
Defense Advanced Research Projects Agency (DARPA).
• Became an annual conference in 1992, co-sponsored by the
National Institute of Standards and Technology (NIST) and
DARPA.
• Participants are given parts of a standard set of documents
and TOPICS (from which queries have to be derived) in
different stages for training and testing.
• Participants submit the P/R values for the final document
and query corpus and present their results at the conference.

The TREC Objectives

• Provide a common ground for comparing different IR
techniques.
–Same set of documents and queries, and same evaluation method.
• Sharing of resources and experiences in developing the
benchmark.
–With major sponsorship from government to develop large benchmark collections.
• Encourage participation from industry and academia.
• Development of new evaluation techniques, particularly for
new applications.
–Retrieval, routing/filtering, non-English collection, web-based collection, question answering.

TREC Advantages
•Large scale (compared to a few MB in the SMART Collection).
•Relevance judgments provided.
•Under continuous development with support from the U.S. Government.
•Wide participation:
–TREC 1: 28 papers 360 pages.
–TREC 4: 37 papers 560 pages.
–TREC 7: 61 papers 600 pages.
–TREC 8: 74 papers.

TREC Tasks
•Ad hoc: New questions are being asked on a static set of data.
•Routing: Same questions are being asked, but new information is being searched. (news clipping, library profiling).
•New tasks added after TREC 5 - Interactive, multilingual, natural language, multiple database merging, filtering, very large corpus (20 GB, 7.5 million documents), question answering.

Characteristics of the TREC Collection
•Both long and short documents (from a few hundred to over one thousand unique terms in a document).
•Test documents consist of:
Source   Description                                     Size (MB)
WSJ      Wall Street Journal articles (1986-1992)        550
AP       Associated Press Newswire (1989)                514
ZIFF     Computer Select Disks (Ziff-Davis Publishing)   493
FR       Federal Register                                469
DOE      Abstracts from Department of Energy reports     190

More Details on Document Collections
•Volume 1 (Mar 1994) - Wall Street Journal (1987, 1988, 1989), Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and Information from the Computer Select disks (1989, 1990)
•Volume 2 (Mar 1994) - Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988) and Information from the Computer Select disks (1989, 1990)
•Volume 3 (Mar 1994) - San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and Information from the Computer Select disks (1991, 1992)
•Volume 4 (May 1996) - Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
•Volume 5 (Apr 1997) - Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).

Information retrieval: Modelling

Introduction to Modelling

•IR systems usually adopt index terms to process queries
•Index term:
–a keyword or group of selected words
–any word (more general)
•Stemming might be used:
–connect: connecting, connection, connections
•An inverted file is built for the chosen index terms
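
An inverted file maps each index term to the documents that contain it. The sketch below is a simplification: every whitespace-separated word is taken as an index term and no stemming is applied. The function name and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each index term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical documents.
docs = {1: "connect the modem", 2: "the modem is fast", 3: "connect fast"}
index = build_inverted_file(docs)
print(index["connect"])  # [1, 3]
print(index["modem"])    # [1, 2]
```

With stemming in place, "connecting" and "connection" would be reduced to "connect" before being added, so all three forms would share one postings list.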

•Matching at index term level is quite imprecise
•No surprise that users are frequently dissatisfied
•Since most users have no training in query formation, the problem is even worse
•Frequent dissatisfaction of Web users
•Issue of deciding relevance is critical for IR systems: ranking

•A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
•A ranking is based on fundamental premisses regarding the notion of relevance, such as:
–common sets of index terms
–sharing of weighted terms
–likelihood of relevance
•Each set of premisses leads to a distinct IR model

IR : MODELS

Retrieval: Ad Hoc vs. Filtering

Classic IR Models - Basic Concepts

•Each document represented by a set of representative keywords or index terms
•An index term is a document word useful for remembering the document's main themes
•Usually, index terms are nouns because nouns have meaning by themselves
•However, search engines assume that all words are index terms (full text representation)

•Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
•The importance of the index terms is represented by weights associated to them
•Let
–ki be an index term
–dj be a document
–wij is a weight associated with (ki,dj)
•The weight wij quantifies the importance of the index term for describing the document contents

–ki is an index term
–dj is a document
–t is the total number of index terms
–K = {k1, k2, …, kt} is the set of all index terms
–wij >= 0 is a weight associated with (ki,dj)
–wij = 0 indicates that the term does not belong to the document
–vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
–gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki,dj)

The Boolean Model

•Simple model based on set theory
•Queries specified as boolean expressions
–precise semantics
–neat formalism
–q = ka ∧ (kb ∨ ¬kc)
•Terms are either present or absent. Thus, wij ∈ {0,1}
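
Reading the example query as ka AND (kb OR NOT kc), retrieval in the Boolean model reduces to a yes/no test on binary weights. The function names and the four documents are hypothetical.

```python
def binary_weight(term, doc_terms):
    """wij is 1 if the index term occurs in the document, otherwise 0."""
    return 1 if term in doc_terms else 0

def satisfies_query(doc_terms):
    """Evaluate q = ka AND (kb OR NOT kc) on a document's set of index terms."""
    ka = binary_weight("ka", doc_terms)
    kb = binary_weight("kb", doc_terms)
    kc = binary_weight("kc", doc_terms)
    return bool(ka and (kb or not kc))

# Hypothetical documents, each given as its set of index terms.
docs = {"d1": {"ka", "kb", "kc"}, "d2": {"ka", "kc"}, "d3": {"ka"}, "d4": {"kb"}}
print([d for d, terms in docs.items() if satisfies_query(terms)])  # ['d1', 'd3']
```

Each document is either in the answer or not; nothing in the result says that d1 matches better than d3.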

Drawbacks of the Boolean Model

•Retrieval based on binary decision criteria with no notion of partial matching
•No ranking of the documents is provided (absence of a grading scale)
•Information need has to be translated into a Boolean expression which most users find awkward
•The Boolean queries formulated by the users are most often too simplistic
•As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

The Vector Model

•Use of binary weights is too limiting
•Non-binary weights provide consideration for partial matches
•These term weights are used to compute a degree of similarity between a query and each document
•Ranked set of documents provides for better matching

•Define:
–wij > 0 whenever ki ∈ dj
–wiq >= 0 associated with the pair (ki,q)
– vec(dj) = (w1j, w2j, ..., wtj) vec(q) = (w1q, w2q, ..., wtq)
–To each term ki is associated a unitary vector vec(i)
–The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
•The t unitary vectors vec(i) form an orthonormal basis for a t-dimensional space
•In this space, queries and documents are represented as weighted vectors

•Sim(q,dj) = [Σ wij * wiq] / (|dj| * |q|), where the sum runs over i = 1, …, t and |dj|, |q| are the lengths of the two vectors
•How to compute the weights wij and wiq ?
•A good weight must take into account two effects:
–quantification of intra-document contents (similarity)
•tf factor, the term frequency within a document
–quantification of inter-documents separation (dissimilarity)
•idf factor, the inverse document frequency
–wij = tf(i,j) * idf(i) (unnormalized)
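
The similarity in the vector model is the cosine of the angle between the document and query vectors: their dot product divided by the product of their lengths. A minimal sketch, with hypothetical weight vectors over three index terms:

```python
import math

def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
    doc_norm = math.sqrt(sum(w * w for w in doc_weights))
    query_norm = math.sqrt(sum(w * w for w in query_weights))
    if doc_norm == 0 or query_norm == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (doc_norm * query_norm)

q = [1.0, 1.0, 0.0]
d1 = [2.0, 2.0, 0.0]   # same direction as q
d2 = [0.0, 0.0, 3.0]   # shares no term with q
print(round(cosine_similarity(d1, q), 3))  # 1.0
print(cosine_similarity(d2, q))            # 0.0
```

Because of the division by the vector lengths, a long document is not favoured just for containing more words.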

•Let,
–N be the total number of docs in the collection
–ni be the number of docs which contain ki
–freq(i,j) raw frequency of ki within dj
•A normalized tf factor is given by
–f(i,j) = freq(i,j) / max(freq(l,j))
–where the maximum is computed over all terms which occur within the document dj
•The idf factor is computed as
–idf(i) = log (N/ni)
–the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
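
The two factors can be combined into document weights as sketched below. The natural logarithm is used here as an assumption, since the text does not fix the base; the function name and the three documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return {doc_id: {term: wij}} with wij = f(i,j) * log(N / ni)."""
    n_docs = len(docs)
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # ni: number of documents that contain term ki
    doc_freq = Counter(term for c in counts.values() for term in c)
    weights = {}
    for doc_id, c in counts.items():
        max_freq = max(c.values(), default=1)  # most frequent term in dj
        weights[doc_id] = {
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        }
    return weights

# Hypothetical collection of N = 3 documents.
docs = {"d1": "broadband broadband modem", "d2": "modem cable", "d3": "modem dialup"}
w = tf_idf_weights(docs)
print(round(w["d1"]["broadband"], 3))  # 1.099
print(w["d1"]["modem"])                # 0.0
```

"modem" occurs in every document, so its idf is log(3/3) = 0 and it carries no weight, while "broadband" occurs in only one document and gets the full log(3).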

•The best term-weighting schemes use weights which are given by
–wij = f(i,j) * idf(i)
= [freq(i,j) / max(freq(l,j))] * log(N/ni)
–This strategy is called a tf-idf weighting scheme (normalized)
•For the query term weights, a suggestion is
–wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) * log(N/ni)
= (0.5 + 0.5 * f(i,q)) * idf(i)
•The vector model with tf-idf weights is a good ranking strategy with general collections
•The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
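
The suggested query weights can be computed as in the sketch below. The collection size and the ni values are hypothetical, the natural logarithm is an assumption, and a query term that occurs in no document is given weight 0 because the formula is undefined for ni = 0.

```python
import math
from collections import Counter

def query_weights(query, n_docs, doc_freq):
    """wiq = (0.5 + 0.5 * f(i,q)) * log(N / ni) for each query term."""
    counts = Counter(query.lower().split())
    max_freq = max(counts.values(), default=1)
    weights = {}
    for term, freq in counts.items():
        ni = doc_freq.get(term, 0)
        idf = math.log(n_docs / ni) if ni else 0.0  # assumed fallback for ni = 0
        weights[term] = (0.5 + 0.5 * freq / max_freq) * idf
    return weights

doc_freq = {"broadband": 10, "modem": 100}  # hypothetical ni values
w = query_weights("broadband broadband modem", n_docs=1000, doc_freq=doc_freq)
print(round(w["broadband"], 3))  # 4.605
print(round(w["modem"], 3))      # 1.727
```

The constant 0.5 keeps every query term at no less than half of its idf, so a word that the user typed only once is not pushed aside by a repeated one.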

•Advantages:
–term-weighting improves quality of the answer set
–partial matching allows retrieval of docs that approximate the query conditions
–cosine ranking formula sorts documents according to degree of similarity to the query
•Disadvantages:
–assumes independence of index terms (??); not clear that this is bad though