Python-BeautifulSoup
Beautiful Soup transforms a complex HTML document into a tree of Python objects such as `Tag`, `NavigableString`, and `Comment`.
- Install using pip: `sudo pip3 install lxml bs4`
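To make the object types mentioned above concrete, here is a minimal sketch that parses a tiny inline snippet and reports the class of each node. The snippet and the `kind` helper are made up for illustration, and the built-in `html.parser` is used instead of `lxml` only so that the fragment is not wrapped in extra `html` and `body` tags.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical inline markup, used only for illustration.
html = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
text, comment = p.contents  # the two children of <p>


def kind(node):
    """Return the name of the Beautiful Soup class of a node."""
    return type(node).__name__


kinds = [kind(p), kind(text), kind(comment)]
print(kinds)
```

Note that `Comment` is a subclass of `NavigableString`, so an `isinstance(node, NavigableString)` check matches comments as well as plain text.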
Consider a sample HTML file, `index.html`:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>Header</title>
    </head>
    <body>
        <h2>Operating systems</h2>
        <ul id="mylist">
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>
            <li>NetBSD</li>
            <li>Windows</li>
        </ul>
        <p>FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.</p>
        <p>Debian is a Unix-like computer operating system that is composed entirely of free software.</p>
    </body>
</html>
```
1. Get tags:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# Attribute access returns the first matching tag in the document
print(soup.h2)
print(soup.head)
print(soup.li)
```
2. Get tags, name, text:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# The tag itself, its name, and the text it contains
print(f'HTML: {soup.h2}, name: {soup.h2.name}, text: {soup.h2.text}')
```
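3. Traverse tags:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# soup.descendants replaces the older recursiveChildGenerator() name
for child in soup.descendants:
    # Text nodes have no tag name, so only tags are printed
    if child.name:
        print(child.name)
```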
4. Get direct children:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')
root = soup.html

# .children yields only direct children; text nodes are skipped
root_children = [e.name for e in root.children if e.name is not None]
print(root_children)
```
5. Get all descendants:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')
root = soup.body

# .descendants yields children, grandchildren, and so on
root_descendants = [e.name for e in root.descendants if e.name is not None]
print(root_descendants)
```
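The difference between `.children` and `.descendants` is easier to see side by side. The following is a small sketch on a made-up inline fragment (not the `index.html` file); the `tag_names` helper is hypothetical, and `html.parser` is assumed so that no wrapper tags are added around the fragment.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, used only to contrast the two iterators.
html = "<div><ul><li>a</li><li>b</li></ul><p>text</p></div>"
soup = BeautifulSoup(html, "html.parser")
root = soup.div


def tag_names(nodes):
    """Collect the names of Tag nodes, skipping text nodes."""
    return [n.name for n in nodes if isinstance(n, Tag)]


direct = tag_names(root.children)      # one level below <div>
nested = tag_names(root.descendants)   # every level below <div>
print(direct)
print(nested)
```

Here `direct` holds only `ul` and `p`, while `nested` also includes the two `li` tags in document order.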
6. Find elements by id:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# Equivalent: soup.find('ul', attrs={'id': 'mylist'})
print(soup.find('ul', id='mylist'))
```
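One detail worth keeping in mind is that `find` returns `None` when nothing matches, so chaining further calls on the result can fail. Below is a minimal sketch of a guarded lookup; the `items_by_id` helper and the inline markup are assumptions made for illustration, not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for index.html.
html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li></ul>'
soup = BeautifulSoup(html, "html.parser")


def items_by_id(soup, list_id):
    """Return the text of each <li> inside the element with the given id."""
    node = soup.find(id=list_id)  # None when no element has this id
    if node is None:
        return []
    return [li.get_text(strip=True) for li in node.find_all("li")]


found = items_by_id(soup, "mylist")
missing = items_by_id(soup, "nolist")
print(found, missing)
```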
7. Find all tags with a given name:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# find_all returns every matching tag, not just the first
for tag in soup.find_all('li'):
    print(f'{tag.name}: {tag.text}')
```
8. CSS selectors:
```python
#!/usr/bin/env python3
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# select returns a list of every tag matching the CSS selector
print(soup.select('li:nth-of-type(3)'))
```
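Since `select` always returns a list, `select_one` can be handier when a single element is expected. The sketch below shows both on a made-up inline list; the markup is an assumption for illustration, and `html.parser` is used so the snippet does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup, a shortened version of the list in index.html.
html = """
<ul id="mylist">
  <li>Solaris</li>
  <li>FreeBSD</li>
  <li>Debian</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches
third = soup.select_one("li:nth-of-type(3)")
no_match = soup.select_one("li:nth-of-type(9)")

# An id selector combined with the child combinator
all_items = [li.get_text() for li in soup.select("#mylist > li")]

print(third.get_text())
print(all_items)
print(no_match)
```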