{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import bs4\n", "import re\n", "from time import sleep\n", "import random\n", "import networkx as nx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you look at a website using a web browser what you see is a rendered view from an HTML code. This graphical version is much easier for us to read, however, it's not trivial to automatize the process of any data retrieval from it. That's why we will use raw HTML form\n", "\n", "The first step is to download the content. We can use a request library. Basic usage is very simple." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')\n", "response.status_code # it's wise to check the status code " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Working with such a string might be problematic. However, HTML is structured and we can benefit from it. The Beautiful Soup package transforms HTML string into a tree form and allows us to query it in a much more efficient and easier way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parsed = bs4.BeautifulSoup(response.text)\n", "parsed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically text content is stored in a paragraph element denoted with a 'p' tag. We can take a look at the text from all paragraphs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for p in parsed.select('p'):\n", " print(p.getText())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task1\n", "Implement a function getText(url) \n", " - download content from a given url\n", " - transform it using bs\n", " - return text from all paragraphs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def getText(url):\n", " output = \"\"\n", " \n", " return output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getText('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getText(\"http://wp.pl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nowadays websites are often dynamic, not static. Working with them would require dealing with javascript. It's possible and there are python packages supporting this processing but we will not cover them in this course." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getText(\"http://facebook.com\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2\n", "Extract number of students from infobox table" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')\n", "parsed = bs4.BeautifulSoup(response.text)\n", "parsed.find('table')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this task, you have to extract specific information from a specific table. The table we are interested in is of class infobox and bs allows us to use this information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parsed.find('table', class_=\"infobox\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since bs creates a tree structure you can navigate through it and use methods find and find_all on the next nodes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parsed.find('table', class_=\"infobox\").find_all(\"tr\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parsed.find('table', class_=\"infobox\").find_all(\"tr\")[6]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parsed.find('table', class_=\"infobox\").find_all(\"tr\")[6].find('th').text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parsed.find('table', class_=\"infobox\").find_all(\"tr\")[6].find('td').text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Write a function that returns the number of students from the infobox table from the provided URL. You can use inspect tool - just right click on a website, choose inspect and analyze the html structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def getStudentCount(url):\n", " \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert getStudentCount(\"https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology\") == 21000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert getStudentCount(\"https://en.wikipedia.org/wiki/Wroc%C5%82aw_University_of_Science_and_Technology\") == 28314" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Regex\n", "Regular expression is a scheme for defining patterns to be found in a text\n", "\n", "Let's try it https://regexone.com/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3\n", "\n", "We can also retrieve URLs and once we have them we can scrape them as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parsed.find_all('a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's important to add some delay between accessing the next page. Otherwise, you might cause too much traffic and be temporarily banned" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "links = parsed.find_all('a', attrs={'href': re.compile(r'^/wiki')}) # find all links starting with /wiki\n", "random.shuffle(links)\n", "for link in links[:10]:\n", " print(link['href'])\n", " response = requests.get(\"https://en.wikipedia.org\" + link['href'])\n", " print(response.status_code)\n", " sleep(random.random()*3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "implement DFS based on the code above, try to avoid links not leading to an article" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def dfs(link):\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfs('https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "write BFS with printing current link" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bfs(link):\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Networkx\n", "\n", "It's a package for working with various networks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "G = nx.Graph()\n", "\n", "G.add_edge(\"a\", \"b\")\n", "G.add_edge(\"a\", \"c\")\n", "G.add_edge(\"c\", \"d\")\n", "G.add_edge(\"c\", \"e\")\n", "G.add_edge(\"c\", \"f\")\n", "G.add_edge(\"a\", \"d\")\n", "\n", "nx.draw(G, with_labels=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Can you plot a graph of Wikipedia links? Extend the bfs function by plotting a network. Limit the search to a reasonable number of nodes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scrapy\n", "\n", "Scrapy is an efficient library for web crawling and scraping. It has a slightly higher entrance level than requests + bs but it's much easier for complex tasks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import bs4\n", "import re\n", "from time import sleep\n", "import random\n", "\n", "import scrapy\n", "from scrapy.crawler import CrawlerProcess" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MySpider(scrapy.Spider):\n", " name = \"lab1\"\n", "\n", " start_urls = ['http://quotes.toscrape.com']\n", "\n", " def parse(self, response):\n", " quotes = response.css('div.quote')\n", " for quote in quotes: # you can extract data you need\n", " yield {\n", " 'text': quote.css('.text::text').get(),\n", " 'author': quote.css('.author::text').get(),\n", " }\n", "\n", " next_page = response.css('li.next a::attr(href)').get() #find next URL\n", "\n", " if next_page is not None:\n", " next_page = response.urljoin(next_page)\n", " yield scrapy.Request(next_page, callback=self.parse) #and process it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "process = CrawlerProcess({\n", " 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'\n", "})\n", "\n", "process.crawl(MySpider)\n", "process.start() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional task\n", "\n", "Which color of a car is the most expensive? Analyze offers from https://www.olx.pl/motoryzacja/samochody/ some of them leads to olx while other to otomoto. Can you use data from both sources?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Bonus\n", "\n", "Funny website for scraping https://web.archive.org/web/20190615072453/https://sirius.cs.put.poznan.pl/~inf66204/WKC.html\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 2 }