BeautifulSoup parses an HTML document into a tree of Python objects. The main object types are Tag, NavigableString, and Comment.
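To make the three object types concrete, here is a minimal sketch. The one-line document is made up for this illustration, and it uses Python's built-in 'html.parser' so the snippet needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document, used only for this illustration.
html = '<p>Hello<!-- a note --></p>'

# Assumption: the built-in 'html.parser' is used here instead of lxml.
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
# The paragraph has two children: a text node and a comment node.
text, note = p.contents

kinds = [type(p).__name__, type(text).__name__, type(note).__name__]
print(kinds)

# Comment is a subclass of NavigableString, so both checks hold.
print(isinstance(note, Comment), isinstance(note, NavigableString))
```

Because a comment is also a string node, code that collects text may want to test for Comment first.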

Install it with pip: `sudo pip3 install lxml bs4`.

Consider a sample HTML file, index.html:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Header</title>
</head>
<body>
    <h2>Operating systems</h2>

    <ul id="mylist">
        <li>Solaris</li>
        <li>FreeBSD</li>
        <li>Debian</li>
        <li>NetBSD</li>
        <li>Windows</li>
    </ul>

    <p>FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.</p>

    <p>Debian is a Unix-like computer operating system that is composed entirely of free software.</p>
</body>
</html>
```

1. Get tags:

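#!/usr/bin/python

from bs4 import BeautifulSoup

# Read the sample document from disk.
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# Attribute access returns the first matching tag in the document.
print(soup.h2)
print(soup.head)
print(soup.li)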

2. Get tags, name, text:

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# The tag itself, its name, and the text it contains.
print(f'HTML: {soup.h2}, name: {soup.h2.name}, text: {soup.h2.text}')

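3. Traverse tags:

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# .descendants walks the whole tree; it replaces the older
# recursiveChildGenerator() alias.
for child in soup.descendants:
    # Text nodes have no tag name, so only tags are printed.
    if child.name:
        print(child.name)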

4. Get child elements:

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html

# .children yields direct children only; text nodes are skipped.
root_children = [e.name for e in root.children if e.name is not None]
print(root_children)

5. Get all descendants:

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.body

# .descendants yields children, grandchildren, and so on; text nodes are skipped.
root_descendants = [e.name for e in root.descendants if e.name is not None]
print(root_descendants)
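The difference between children and descendants is easiest to see side by side. The sketch below uses a shortened, hypothetical fragment of the sample page and the built-in 'html.parser' (an assumption made to keep it dependency-light), and filters with isinstance rather than the name check.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened fragment of the sample page.
html = ('<body><h2>Operating systems</h2>'
        '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li></ul></body>')

# Assumption: the built-in parser is used here instead of lxml.
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# Direct children only: the heading and the list.
children = [e.name for e in body.children if isinstance(e, Tag)]

# All nested tags, in document order: the list items appear as well.
descendants = [e.name for e in body.descendants if isinstance(e, Tag)]

print(children)
print(descendants)
```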

6. Find elements by ID:

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# Equivalent form using an attrs dictionary:
# print(soup.find('ul', attrs={'id': 'mylist'}))
print(soup.find('ul', id='mylist'))
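As a small sketch of how the two lookup forms relate, and of what happens when nothing matches: the fragment and the id 'nolist' below are made up for illustration, and the built-in 'html.parser' is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for index.html.
html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li></ul>'

# Assumption: the built-in parser is used here instead of lxml.
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find('ul', id='mylist')
by_attrs = soup.find('ul', attrs={'id': 'mylist'})

# 'nolist' is a hypothetical id that does not occur in the fragment.
missing = soup.find('ul', id='nolist')

# Both forms locate the same element in the tree.
print(by_keyword is by_attrs)

# find() returns None when nothing matches, so check before using the result.
if missing is None:
    print('no <ul> with id "nolist"')
```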

7. Get all tags:

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# find_all returns every matching tag, not just the first.
for tag in soup.find_all('li'):
    print(f'{tag.name}: {tag.text}')

8. CSS selectors:

#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# select returns a list of all tags matching the CSS selector.
print(soup.select('li:nth-of-type(3)'))
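A short sketch of how select and select_one behave, assuming the list markup shown in the sample file. The fragment is inlined here and parsed with the built-in 'html.parser' so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Assumed list markup from the sample page, inlined for this sketch.
html = '''
<ul id="mylist">
  <li>Solaris</li>
  <li>FreeBSD</li>
  <li>Debian</li>
  <li>NetBSD</li>
  <li>Windows</li>
</ul>
'''

# Assumption: the built-in parser is used here instead of lxml.
soup = BeautifulSoup(html, 'html.parser')

# select always returns a list, even for a single match.
third = soup.select('li:nth-of-type(3)')
third_text = [li.text for li in third]

# select_one returns the first match, or None when nothing matches.
first = soup.select_one('#mylist > li')
missing = soup.select_one('ol')

print(third_text)
print(first.text)
print(missing)
```

Using select_one avoids indexing into a possibly empty list when only one element is expected.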