In [None]:
import requests
import bs4
import re
from time import sleep
import random
import networkx as nx

When you look at a website in a web browser, what you see is a rendered view of its HTML code. This graphical version is much easier for us to read, but it is not trivial to automate data retrieval from it. That's why we will work with the raw HTML instead.

The first step is to download the content. We can use the requests library for this. Basic usage is very simple.

In [None]:
response = requests.get('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')
response.status_code # it's wise to check the status code 

In [None]:
response.text

Working with such a raw string is cumbersome. However, HTML is structured, and we can take advantage of that. The Beautiful Soup package parses an HTML string into a tree and lets us query it much more easily.

In [None]:
parsed = bs4.BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning
parsed

Typically, text content is stored in paragraph elements, denoted by the 'p' tag. We can take a look at the text from all paragraphs.

In [None]:
for p in parsed.select('p'):
    print(p.getText())

## Task 1
Implement a function getText(url) that:
 - downloads the content from the given URL
 - parses it with Beautiful Soup
 - returns the text from all paragraphs

In [None]:
def getText(url):
    output = ""
    
    return output

In [None]:
getText('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')

In [None]:
getText("http://wp.pl")

Nowadays, websites are often dynamic rather than static: much of their content is generated by JavaScript after the page loads, so it does not appear in the raw HTML. Scraping such sites is possible, and there are Python packages that support it, but we will not cover them in this course.

In [None]:
getText("http://facebook.com")

## Task 2
Extract the number of students from the infobox table.

In [None]:
response = requests.get('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')
parsed = bs4.BeautifulSoup(response.text, 'html.parser')
parsed.find('table')

In this task, you have to extract a specific value from a specific table. The table we are interested in has the class infobox, and Beautiful Soup lets us search by class.

In [None]:
parsed.find('table', class_="infobox")

Since Beautiful Soup creates a tree structure, you can navigate through it and call the find and find_all methods on the nodes they return.

In [None]:
parsed.find('table', class_="infobox").find_all("tr")

In [None]:
parsed.find('table', class_="infobox").find_all("tr")[6]

In [None]:
parsed.find('table', class_="infobox").find_all("tr")[6].find('th').text

In [None]:
parsed.find('table', class_="infobox").find_all("tr")[6].find('td').text


Write a function that returns the number of students from the infobox table of the page at the provided URL. You can use your browser's inspect tool: right-click on the page, choose Inspect, and analyze the HTML structure.
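Infobox cells usually contain more than a bare number, for example a thousands separator or a footnote marker. The helper below is a minimal sketch of cleaning such a cell; the name `parse_count` and the sample strings are hypothetical, and it assumes commas are the only thousands separator.

```python
import re

def parse_count(cell_text):
    """Extract the first integer from a table cell, or None if there is none."""
    # Drop footnote markers such as [1] or [note 2].
    cleaned = re.sub(r'\[[^\]]*\]', '', cell_text)
    # Match either a comma-grouped number (21,000) or a plain run of digits.
    match = re.search(r'\d{1,3}(?:,\d{3})+|\d+', cleaned)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

# Hypothetical cell contents, not taken from a real page.
print(parse_count('21,000[1]'))
print(parse_count('28,314 (2019)'))
```

Locating the right row in the table is still up to you; the helper only converts the text of a cell you have already found.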

In [None]:
def getStudentCount(url):
    # TODO: download the page, find the student count in the infobox, and return it as an int
    pass

In [None]:
assert getStudentCount("https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology") == 21000

In [None]:
assert getStudentCount("https://en.wikipedia.org/wiki/Wroc%C5%82aw_University_of_Science_and_Technology") == 28314

# Regex
A regular expression (regex) is a notation for describing patterns to search for in text.

You can practice at https://regexone.com/
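As a quick illustration, here is a small sketch using Python's built-in `re` module. The sample sentence is made up for this example.

```python
import re

# Made-up sample text.
text = "Lab 1 starts at 09:45, lab 2 at 13:30."

# findall returns every match; with groups in the pattern, each match is a tuple.
times = re.findall(r'\b(\d{2}):(\d{2})\b', text)
print(times)

# search returns the first match (or None); group(1) is the first captured group.
first = re.search(r'lab (\d+)', text, flags=re.IGNORECASE)
print(first.group(1))

# sub replaces every match; here each digit is masked with '#'.
masked = re.sub(r'\d', '#', text)
print(masked)
```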

## Task 3

We can also retrieve URLs from a page, and once we have them, we can scrape the pages they point to as well.

In [None]:
parsed.find_all('a')

It's important to add a delay between consecutive requests. Otherwise, you might generate too much traffic and get temporarily banned.

In [None]:
links = parsed.find_all('a', attrs={'href': re.compile(r'^/wiki')}) # find all links starting with /wiki
random.shuffle(links)
for link in links[:10]:
    print(link['href'])
    response = requests.get("https://en.wikipedia.org" + link['href'])
    print(response.status_code)
    sleep(random.random()*3)

Implement DFS (depth-first search) based on the code above. Try to skip links that do not lead to an article.
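One possible way to recognize article links is sketched below. It assumes that article paths look like `/wiki/Title` and that non-article pages carry a namespace prefix with a colon (such as `File:` or `Help:`); the name `is_article_link` is hypothetical. The rule is a heuristic: it also rejects the few articles whose titles contain a colon.

```python
import re

# Assumption: a regular article path has no colon (namespace) and no fragment.
ARTICLE_HREF = re.compile(r'^/wiki/[^:#]+$')

def is_article_link(href):
    """Return True if href looks like a link to a regular article."""
    if href == '/wiki/Main_Page':
        return False  # the main page is not an article
    return ARTICLE_HREF.match(href) is not None

print(is_article_link('/wiki/Pozna%C5%84'))
print(is_article_link('/wiki/File:Logo.png'))
```

Such a predicate can be applied to each `link['href']` before the page is requested.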

In [None]:
def dfs(link):
    # TODO: visit the page at link, then follow its article links depth-first
    pass

In [None]:
dfs('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')

Write BFS (breadth-first search) that prints the link currently being visited.
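The difference between the two traversal orders can be seen on a small in-memory example before any pages are downloaded. The sketch below uses a hypothetical dictionary `TOY_LINKS` in place of real pages; the only difference between the two modes is which end of the frontier the next page is taken from.

```python
from collections import deque

# Hypothetical link structure standing in for real pages.
TOY_LINKS = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D', 'E'],
    'D': [],
    'E': ['A'],
}

def traverse(start, breadth_first=True):
    """Return the pages in the order they are visited."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # A queue (popleft) gives BFS; a stack (pop) gives DFS.
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in seen:
            continue  # already visited via another path
        seen.add(page)
        order.append(page)
        for target in TOY_LINKS.get(page, []):
            if target not in seen:
                frontier.append(target)
    return order

print(traverse('A', breadth_first=True))
print(traverse('A', breadth_first=False))
```

In a real crawler, the dictionary lookup would be replaced by downloading the page and extracting its links, with a delay between requests.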

In [None]:
def bfs(link):
    # TODO: visit pages level by level, printing each link as it is visited
    pass


## NetworkX

NetworkX is a Python package for creating, analyzing, and drawing graphs (networks).

In [None]:
G = nx.Graph()

G.add_edge("a", "b")
G.add_edge("a", "c")
G.add_edge("c", "d")
G.add_edge("c", "e")
G.add_edge("c", "f")
G.add_edge("a", "d")

nx.draw(G, with_labels=True)


Can you plot a graph of Wikipedia links? Extend the bfs function by plotting a network. Limit the search to a reasonable number of nodes.
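One way to keep the search bounded is to stop admitting new pages once a node budget is used up, while still recording links between pages that are already known. The sketch below is an assumption about how this could be organized, not a reference solution: `collect_edges`, `get_links`, and the `toy` dictionary are hypothetical names, and `get_links` stands in for a function that downloads a page and returns its article links.

```python
from collections import deque

def collect_edges(start, get_links, max_nodes=20):
    """BFS from start that records (source, target) links among at most max_nodes pages."""
    known = {start}
    queue = deque([start])
    edges = []
    while queue:
        page = queue.popleft()
        for target in get_links(page):
            if target not in known:
                if len(known) >= max_nodes:
                    continue  # node budget used up: ignore new pages
                known.add(target)
                queue.append(target)
            edges.append((page, target))
    return edges

# Hypothetical stand-in for downloading a page and extracting its links.
toy = {'A': ['B', 'C'], 'B': ['C', 'D'], 'C': ['A'], 'D': ['E']}
edges = collect_edges('A', lambda page: toy.get(page, []), max_nodes=3)
print(edges)
```

The resulting edge list can be passed to `G.add_edges_from(edges)` and drawn with `nx.draw(G, with_labels=True)`, as in the cell above.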

# Scrapy

Scrapy is an efficient library for web crawling and scraping. It has a somewhat steeper learning curve than requests + Beautiful Soup, but it makes complex tasks much easier.

In [None]:
import requests
import bs4
import re
from time import sleep
import random

import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
class MySpider(scrapy.Spider):
    name = "lab1"

    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes: # you can extract data you need
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()  # find the next page URL

        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)  # and process it

In [None]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() 

## Additional task

Which car color is the most expensive? Analyze the offers from https://www.olx.pl/motoryzacja/samochody/. Some of them lead to olx, while others lead to otomoto. Can you use data from both sources?

<details>

<summary>Bonus</summary>

A fun website for scraping: https://web.archive.org/web/20190615072453/https://sirius.cs.put.poznan.pl/~inf66204/WKC.html
</details>