The Web is an almost unlimited source of information. Using search engines such as Google and Bing, we can easily find web pages that may contain relevant information. However, the number of returned pages is usually very large, which makes manual processing impractical. The solution is computer programs that can find and extract relevant information from a possibly very large number of unstructured or semi-structured documents and return the results in structured form.
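
As a rough illustration of returning results "in structured form", the sketch below pulls the title and the hyperlinks out of a small HTML fragment using only Python's standard library. The class name `LinkExtractor`, the function `extract`, and the sample page are hypothetical and chosen for illustration; they are not part of the course material.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the page title and all hyperlinks from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []         # list of {"href": ..., "text": ...} records
        self._in_title = False
        self._current = None    # link record being filled while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"href": href, "text": ""}

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def extract(html):
    """Return a structured record (a dict) for one semi-structured page."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical sample page used only for this illustration.
SAMPLE = """<html><head><title>Course list</title></head>
<body><p>See <a href="/ir">Information Retrieval</a> and
<a href="/crawl">Web Crawling</a>.</p></body></html>"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

Real pages are far messier than this sample, which is why techniques such as wrapper induction and tree matching are needed in practice.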


The main objective of this course is to teach students how to develop programs for web search (covering both the surface web and the deep web) and for extracting structured data from both static and dynamic web pages. Besides the basic concepts of web search and retrieval, students will learn the relevant techniques and approaches. On successful completion of the course, students will be able to develop programs for automatic web search and structured data extraction from web pages (including search and extraction from online social media).


The main topics that will be addressed within the course are:

  • Information Retrieval and Web Search (Basic Concepts of Information Retrieval, Information Retrieval Models, Relevance Feedback, Evaluation Measures, Text and Web Page Pre-Processing, Inverted Index and Its Compression, Latent Semantic Indexing, Web Search, Meta-Search...)
  • Web Crawling (A Basic Crawler Algorithm, Implementation Issues, Universal Crawlers, Focused Crawlers, Topical Crawlers, Structured Data Extraction, Wrapper Induction, Instance-Based Wrapper Learning, Automatic Wrapper Generation, String Matching and Tree Matching, Multiple Alignment, Building DOM Trees, Extraction Based on a Single List Page or Multiple Pages...)
  • Information Integration (Schema-Level Matching, Domain and Instance-Level Matching, Combining Similarities, 1:m Match, Integration of Web Query Interfaces, Constructing a Unified Global Query Interface...) 
  • Opinion Mining and Sentiment Analysis (Document Sentiment Classification, Sentence Subjectivity and Sentiment Classification, Opinion Lexicon Expansion, Aspect-Based Opinion Mining...)
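
To give a flavour of one of the listed topics, the inverted index, here is a minimal sketch in Python. It assumes a tiny in-memory collection, a naive lowercase tokenizer, and Boolean AND queries; the function names and the sample documents are hypothetical, and compression, ranking, and stemming are deliberately left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Naive pre-processing: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Boolean AND query: ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return sorted(result)


# Hypothetical mini-collection used only for this illustration.
DOCS = {
    1: "Web crawling and web search",
    2: "Opinion mining on the web",
    3: "Sentiment analysis and opinion mining",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "web"))
    print(search(index, "opinion mining"))
    print(search(index, "web mining"))
```

Storing postings as sets keeps the intersection step short; a production index would typically use sorted, compressed posting lists instead.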


Students are expected to know at least the basics of programming languages and technologies such as Java, JavaScript, Python, HTML, and CSS, as well as web page structure.

To pass the course, students are expected to successfully complete two homework assignments, a project, and a written examination (at least 50% of all points).


This course covers the principles and guidelines for designing user interfaces (UI), and communication between the brain and a computer via movement imagery, i.e., non-invasive brain-computer interfaces (BCI).

The main topics that will be addressed within the course are:

  • Human Capabilities (Memory and Learning, Perception, Cognition)
  • Types of UI Communication (Input Models, Models and Metaphors)
  • UI Design Principles (Norman's Hints, Mandel's Principles, Nielsen's Principles)
  • UI Design Guidelines (Selecting and Arranging Graphical Controls for Interaction, Graphic Design, Feedback and Interactions, Selection and Design of Icons)
  • Electroencephalogram (EEG) and Brain-Computer Communication
  • International Reference Database for Designing BCI (EEGMMI DS: EEG Motor Movement/Imagery Dataset)
  • Designing Non-Invasive BCI
  • Spectral Analysis of EEG Signals (Power Spectrum, Autoregressive Method, Time-Frequency Representations, Parametric Modeling)
  • Feature Extraction in the Time and Frequency Domains
  • Feature Selection
  • Classification of Imagined Movements
  • BCI with Machine Learning
  • BCI Applications (Cursor Movement, Spelling, Communication for People with Disabilities)

The environments used will be NetBeans and MATLAB.