What Will I Learn?
I intend to cover the following concepts in this part of the tutorial:
- You will learn how to navigate through the pages of a website.
- You will learn different functions and methods for interacting with forms.
- You will learn to build a basic form filling bot.
Requirements
The user is expected to meet the following requirements for a clear understanding of the tutorial:
- Basic knowledge of the Python programming language.
- A PC with Python 3+ installed (for practical understanding)
- Read my previous tutorial on MechanicalSoup to ensure continuation.
Difficulty
- Basic
Tutorial Contents
So let's continue our journey to learn MechanicalSoup. As stated above, we will learn how to automate interactions with a webpage using MechanicalSoup, just as we would interact with it through a browser. This might come in handy if you are into creating bots and web scrapers.
1. Navigation
First of all, let's look at how to navigate through the pages of a website. We know that every page in a website has a unique URL associated with it. While using a browser, we click on the links available on a page to navigate to other pages in the website.
open() method
We already used the open() method in the previous part of this tutorial, where we saw how to open a webpage in a MechanicalSoup browser instance. It involves passing the full absolute URL of the target webpage to browser.open().
For example, to open Steemit.com:
browser.open('https://steemit.com')
You are free to use the open() method every time you need to go to another page in the site or to a different website. But it gets messy if we have to specify the absolute URL every time we want a specific page (unless we want to move to a different website). Luckily there is a shortcut method for this.
follow_link() method
The follow_link() method lets us move to another page without writing out the full URL. When it is given a string, it treats the string as a regular expression, looks for a link on the current page whose href matches it, and follows that link. So we can skip the https://website.domain part and specify only the distinguishing part of the link.
For example, if we want to move to the page that lists the latest posts on Steemit, we just have to do this after browser.open('https://steemit.com'):
browser.follow_link('created') # Since the new posts are listed in https://steemit.com/created
Now our browser instance points to https://steemit.com/created and contains the contents of that page.
NOTE:
follow_link() only follows links that actually appear on the page currently loaded in the browser instance; if no link matches, it raises a LinkNotFoundError. To jump to a page that isn't linked from the current one, such as a different website, use the open() method instead of follow_link().
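To make the relative-versus-absolute distinction concrete, here is a small sketch that uses only Python's standard library. It shows how a relative path is resolved against the URL of the current page with urllib.parse.urljoin, which, as far as I know, is the same rule MechanicalSoup relies on when it follows a link. The 'created' path comes from the example above; the '/trending' path and the example.com address are only illustrative assumptions.

```python
from urllib.parse import urljoin

current_url = "https://steemit.com"

# A relative path is resolved against the page we are currently on.
created_url = urljoin(current_url, "created")

# A path starting with "/" is resolved against the site root (hypothetical path).
trending_url = urljoin(created_url, "/trending")

# An absolute URL replaces the current one entirely (illustrative address).
other_site = urljoin(created_url, "https://example.com/")

print(created_url)
print(trending_url)
print(other_site)
```

If you already know the relative path and don't want to search the page for a link, the browser instance also has an open_relative() method that takes a relative URL directly.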
So I hope it's clear how to surf through the pages of a website.
2. Interacting with Forms
Let us see the different methods we need for this:
select_form() method
It is a function to select a particular form on a page. It takes a CSS selector, which is really helpful for picking the form we need when a page contains more than one form.
Everyone who has worked with HTML and CSS is familiar with CSS selectors, which are used to apply styling properties to a single element or a group of elements. Here is a great guide on CSS selectors from w3schools
It is a method of the browser instance, and it is called like this:
form = browser.select_form('optional_css_selector')
The above code returns a mechanicalsoup.form.Form object, which holds all the input fields of the form. The fields can be assigned with dictionary-style syntax, and the object also has some cool functions to help us with form filling.
If the page contains only a single form, calling just select_form() without any arguments will do the trick.
get_current_form() method
This method is a member function of the browser instance which will return the currently selected form object.
form = browser.get_current_form()
print_summary() method
It is a method of the Form object returned by the select_form() method. Calling it prints all the input elements present inside the form.
You can print the list of inputs either like this:
browser.select_form('optional_css_selector').print_summary()
Or like this, by using get_current_form() function:
browser.get_current_form().print_summary()
Assigning values to input fields
It's very simple to assign values to form fields in MechanicalSoup. First, select the form using select_form(); after that you can assign values to its fields.
We use the name of each input field to assign its value. If you have some experience in web development, you will know that a submitted form is sent as a set of key-value pairs, where the name attribute of each field acts as the key and the data entered acts as the corresponding value.
The same mechanism is applied here: you assign values to the corresponding input fields through the browser object, indexing it by field name.
For example, if we have an input named "Name", then to assign a value to that field you just have to do:
browser['Name'] = 'Ajmal Noushad'
Simple isn't it?
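As a rough illustration of the name/value idea, the sketch below uses the standard library to encode a couple of fields the way a typical HTML form POST body is encoded (application/x-www-form-urlencoded). The 'age' field and its value are made up for the example. When you submit a form with MechanicalSoup, this encoding step is handled for you.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields: the name attribute is the key, the input is the value.
fields = {"Name": "Ajmal Noushad", "age": 21}

# Encode the pairs the way a browser does for a regular form submission.
body = urlencode(fields)
print(body)

# The receiving side decodes the body back into name/value pairs.
decoded = parse_qs(body)
print(decoded)
```

Notice that everything arrives at the server as text, which is why numbers can be given either as integers or as strings.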
launch_browser() method
This launches a real browser showing the page currently held by the browser instance. You will notice that the browser doesn't go to the original URL but to a local file stored on your PC, because that copy also contains the form values you just filled in. So you can use launch_browser() to confirm that you did everything right.
So that's all the methods we need. Let's head to the playground.
Creating a Basic form filling bot
We will now see how to fill a form on a webpage using the above functions. For that purpose I have made a dummy webpage with a form that consists of some input fields. I used Django to build it. You can find the code here: GitHub Repo
The form looks like this:
I hosted it on PythonAnywhere for practice; you can access it here: DummyForm
Let's proceed.
First of all, open a Python console.
If you have read the previous tutorial, we set up an environment with mechanicalsoup installed in it. So you just have to activate the virtualenv and type python in the terminal.
$ source env-name/bin/activate
$ python
Now in the python console, import mechanicalsoup
import mechanicalsoup
Create a new browser instance
browser = mechanicalsoup.StatefulBrowser()
Open the URL of the webpage that contains the form, in our case 'ajmal.pythonanaywhere.com'
browser.open('http://ajmal.pythonanaywhere.com')
Select the form in the webpage using select_form()
browser.select_form()
Remember: No CSS selector is given since the page contains a single form.
To list the input fields we use print_summary()
browser.get_current_form().print_summary() # get_current_form() returns the form object pointing to the currently selected form
The above command will give you an output like:
<input name="csrfmiddlewaretoken" type="hidden" value="lIMnuL2olx1GEnGyTms3rDLMEB8lZKqCRWd9qo111631GkSEBNhEjv4IOAHDniym"/>
<input class="form-control" id="id_name" maxlength="20" name="name" required="" type="text"/>
<input class="form-control" id="id_age" name="age" required="" type="number"/>
<select class="form-control" id="id_gender" name="gender">
<option value="1">MALE</option>
<option value="2">FEMALE</option>
</select>
<textarea class="form-control" cols="40" id="id_about_me" name="about_me" required="" rows="10"></textarea>
Now fill the input fields with data using their name attributes.
browser['name'] = 'My Name'
browser['age'] = 21
browser['gender'] = '1' # See that we gave the value attribute of the select options for the gender input field. ie. '1' for 'MALE' and '2' for 'FEMALE'
browser['about_me'] = 'I am learning MechanicalSoup'
Note that
radio inputs and select inputs should be given the corresponding value attribute of the item that needs to be selected.
Inputs for checkboxes can be given as a list of values, like browser['checkbox'] = ['val1', 'val2']
Now let's take a look at how the form looks in a browser:
browser.launch_browser()
The above command opens a local webpage with the same contents as the original webpage, with the form filled in with the values we gave.
Finally, submit the form:
browser.submit_selected()
Now we get a response with status 200, indicating that everything went well and the form was submitted successfully.
<Response [200]>
Finally, let's check out the content of the response:
browser.get_current_page()
The above returns the page received after the form submission. If you submit the form through a real browser, you will see 'Success' as the response message.
In the console we see it as:
<html><body><p>Success</p></body></html>
So that's it, you have now learned how to use MechanicalSoup to interact with webpages. Feel free to ask any questions. Thanks for reading...
Curriculum
My previous tutorial on MechanicalSoup
Posted on Utopian.io - Rewarding Open Source Contributors