![](https://steemitimages.com/640x0/https://i1.wp.com/blog.fossasia.org/wp-content/uploads/2017/07/loklak-twitter-video-1-copy.png?resize=825%2C510&ssl=1)
The primary web service that loklak scrapes is Twitter. As a news and social networking service, Twitter lets its users post videos directly, and videos often convey more than text can. For an automated scraper, however, getting the links to those videos is not a simple task.
Let us see what problems we faced with videos and how we solved them in the loklak server project.
Previous setup and embedded videos
In a previous version of loklak server, the TwitterScraper searched for videos in two ways:
- YouTube links
- HTML5 video links
To fetch the video URL from an HTML5 video, the following snippet was used:
    if ((p = input.indexOf("<source video-src")) >= 0 && input.indexOf("type=\"video/") > p) {
        // Read the value of the video-src attribute that starts at position p
        String video_url = new prop(input, p, "video-src").value;
        videos.add(video_url);
        continue;
    }
Here, input is the current line of the raw HTML being processed, and prop is a class defined in loklak that helps parse HTML attributes. This is how the HTML5 videos were extracted.
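To make the attribute lookup concrete, here is a minimal, self-contained sketch of the same check. It does not use loklak's prop class; the helper attributeValue, the class name and the sample line are hypothetical stand-ins written only for illustration.

```java
final class VideoSrcExample {

    /** Hypothetical stand-in for loklak's prop: value of attr="..." at or after position from, or null. */
    static String attributeValue(String line, int from, String attr) {
        String key = attr + "=\"";
        int start = line.indexOf(key, from);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = line.indexOf('"', start);
        return end < 0 ? null : line.substring(start, end);
    }

    /** Returns the HTML5 video URL found in the line, or null if the line has no matching source tag. */
    static String html5VideoUrl(String line) {
        int p = line.indexOf("<source video-src");
        if (p < 0 || line.indexOf("type=\"video/") <= p) {
            return null;
        }
        return attributeValue(line, p, "video-src");
    }

    public static void main(String[] args) {
        // Sample line invented for this example
        String line = "<source video-src=\"https://example.com/clip.mp4\" type=\"video/mp4\">";
        System.out.println(html5VideoUrl(line)); // prints https://example.com/clip.mp4
    }
}
```

A line without a source tag simply yields null, which mirrors how the scraper skips lines that do not match.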
The Problem – Embedded videos
The previous setup worked as written, but it was of little use in practice: Twitter embeds its videos in an iFrame, so they can't be fetched using simple HTML5 tag extraction.
A Tweet with an attached video, for example, exposes the video only through such an iFrame. So we needed to come up with a better technique to get those videos.
Parsing video URL from iFrame
The <div> which contains the video is marked with the AdaptiveMedia-videoContainer class. So if a Tweet has an iFrame containing video, it will also have the mentioned class.
Also, the source of the iFrame is of the form https://twitter.com/i/videos/tweet/{Tweet-ID}. So now we can programmatically go to any Tweet's video and parse it to get results.
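As a small illustration of the URL form above, the sketch below derives the iFrame source from a Tweet's status URL by taking everything after the last slash as the Tweet ID. The class name, the method name and the sample status URL are assumptions made for this example.

```java
final class IframeUrlExample {

    /** Builds the video iFrame URL for a Tweet status URL, or returns null if no Tweet ID can be found. */
    static String iframeUrlFor(String tweetURL) {
        int slashIndex = tweetURL.lastIndexOf('/');
        if (slashIndex < 0 || slashIndex == tweetURL.length() - 1) {
            return null; // no slash, or nothing after the last slash
        }
        String tweetID = tweetURL.substring(slashIndex + 1);
        return "https://twitter.com/i/videos/tweet/" + tweetID;
    }

    public static void main(String[] args) {
        // Sample status URL invented for this example
        System.out.println(iframeUrlFor("/loklak_org/status/1234567890"));
        // prints https://twitter.com/i/videos/tweet/1234567890
    }
}
```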
Extracting video URL from iFrame source
Now that we have the source of the iFrame, we can get the video source using the following flow:
    public final static Pattern videoURL = Pattern.compile("video_url\\\":\\\"(.*?)\\\"");

    private static String[] fetchTwitterIframeVideos(String iframeURL) {
        // Read from iframeURL line by line into BufferedReader br
        while ((line = br.readLine()) != null) {
            int index;
            if ((index = line.indexOf("data-config=")) >= 0) {
                // data-config holds the player configuration as HTML-escaped JSON
                String jsonEscHTML = (new prop(line, index, "data-config")).value;
                String jsonUnescHTML = HtmlEscape.unescapeHtml(jsonEscHTML);
                Matcher m = videoURL.matcher(jsonUnescHTML);
                if (!m.find()) {
                    return new String[]{};
                }
                String url = m.group(1);
                url = url.replace("\\/", "/");  // Clean URL
                /*
                 * Play with url and return results
                 */
            }
        }
        return new String[]{};  // No data-config attribute found
    }
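The unescape-and-match step can be tried in isolation. The following is a minimal sketch that assumes the data-config value only needs &quot; and &amp; unescaped, so it uses plain string replacement instead of the HtmlEscape helper. The class name and the sample value are invented for illustration and are not taken from a real Twitter response.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DataConfigExample {

    // Matches "video_url":"<value>" in unescaped JSON and captures <value>
    private static final Pattern VIDEO_URL = Pattern.compile("video_url\":\"(.*?)\"");

    /** Returns the video URL stored in an HTML-escaped data-config value, or null if there is none. */
    static String videoUrlFromDataConfig(String escapedJson) {
        // Simplified unescaping (assumption): only the two entities this sample needs
        String json = escapedJson.replace("&quot;", "\"").replace("&amp;", "&");
        Matcher m = VIDEO_URL.matcher(json);
        if (!m.find()) {
            return null;
        }
        // JSON escapes forward slashes as \/, so turn them back into /
        return m.group(1).replace("\\/", "/");
    }

    public static void main(String[] args) {
        String sample = "{&quot;video_url&quot;:&quot;https:\\/\\/video.twimg.com\\/a\\/b.m3u8&quot;,"
                + "&quot;duration&quot;:3000}";
        System.out.println(videoUrlFromDataConfig(sample)); // prints https://video.twimg.com/a/b.m3u8
    }
}
```

A real page may contain other HTML entities, which is why a full unescaping library is the safer choice outside of a sketch like this.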
MP4 and M3U8 URLs
If we encounter an mp4 URL, we're fine, as it is a direct link to the video. But if we encounter an m3u8 URL, we need to process it further before we can actually get to the videos.
For Twitter, the hosted m3u8 file contains links to further m3u8 files, one for each resolution. Each of these in turn contains links to the .ts files that hold the actual video in parts of 3 seconds each, to support a better streaming experience on the web.
To resolve videos in such a setup, we need to recursively parse the m3u8 files and collect all the .ts videos.
    private static String[] extractM3u8(String url) {
        return extractM3u8(url, "https://video.twimg.com/");
    }

    private static String[] extractM3u8(String url, String baseURL) {
        List<String> links = new ArrayList<>();
        // Resolve url against baseURL and read the playlist line by line into BufferedReader br
        while ((line = br.readLine()) != null) {
            if (line.startsWith("#")) {  // Skip comments in m3u8
                continue;
            }
            String currentURL = (new URL(new URL(baseURL), line)).toString();
            if (currentURL.endsWith(".m3u8")) {
                String[] more = extractM3u8(currentURL, baseURL);  // Recursively add all
                Collections.addAll(links, more);
            } else {
                links.add(currentURL);
            }
        }
        return links.toArray(new String[links.size()]);
    }
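The recursion is easier to follow without the network involved. The sketch below applies the same idea to playlists held in an in-memory map that stands in for the HTTP fetch. The class name, the map and the playlist contents are assumptions made for this example, and URI.resolve is used here in place of the nested URL constructor.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class M3u8Example {

    /**
     * Collects all non-playlist entries reachable from url.
     * playlists maps an absolute playlist URL to its text; it replaces the real download step.
     */
    static List<String> collectSegments(String url, String baseURL, Map<String, String> playlists) {
        List<String> links = new ArrayList<>();
        String body = playlists.get(URI.create(baseURL).resolve(url).toString());
        if (body == null) {
            return links; // unknown playlist: nothing to collect
        }
        for (String rawLine : body.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) { // skip blank lines and m3u8 tags/comments
                continue;
            }
            String currentURL = URI.create(baseURL).resolve(line).toString();
            if (currentURL.endsWith(".m3u8")) {
                links.addAll(collectSegments(currentURL, baseURL, playlists)); // nested playlist
            } else {
                links.add(currentURL); // media segment
            }
        }
        return links;
    }

    public static void main(String[] args) {
        Map<String, String> playlists = new HashMap<>();
        playlists.put("https://video.twimg.com/v/master.m3u8",
                "#EXTM3U\n#EXT-X-STREAM-INF:BANDWIDTH=256000\n/v/320x180/low.m3u8\n");
        playlists.put("https://video.twimg.com/v/320x180/low.m3u8",
                "#EXTM3U\n#EXTINF:3.0,\n/v/320x180/0.ts\n#EXTINF:3.0,\n/v/320x180/1.ts\n#EXT-X-ENDLIST\n");
        for (String link : collectSegments("/v/master.m3u8", "https://video.twimg.com/", playlists)) {
            System.out.println(link);
        }
    }
}
```

With the two sample playlists, the master playlist leads to one nested playlist, and the result is the two .ts URLs in playback order.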
And then in fetchTwitterIframeVideos, we can return all the .ts URLs for the video:
    if (url.endsWith(".mp4")) {
        return new String[]{url};
    } else if (url.endsWith(".m3u8")) {
        return extractM3u8(url);
    }
Putting things together
Finally, the TwitterScraper can discover the video links with a small tweak:
    if (input.indexOf("AdaptiveMedia-videoContainer") >= 0) {
        // Fetch Tweet ID
        String tweetURL = props.get("tweetstatusurl").value;
        int slashIndex = tweetURL.lastIndexOf('/');
        if (slashIndex < 0) {
            continue;
        }
        String tweetID = tweetURL.substring(slashIndex + 1);
        String iframeURL = "https://twitter.com/i/videos/tweet/" + tweetID;
        String[] videoURLs = fetchTwitterIframeVideos(iframeURL);
        Collections.addAll(videos, videoURLs);
    }
Conclusion
This blog post explained the process of extracting video URLs from Twitter and the problems faced along the way. The discussed change enabled loklak to extract and serve video URLs for tweets. It was introduced in PR loklak/loklak_server#1193 by me (@singhpratyush).
The service was further enhanced to collect a single mp4 link for videos (see PR loklak/loklak_server#1206), which is discussed in another blog post.
Resources
- The m3u extension – https://www.lifewire.com/g00/m3u8-file-2621956.
- Loklak’s TwitterScraper – https://github.com/loklak/loklak_server/blob/development/src/org/loklak/harvester/TwitterScraper.java.
- iFrames – https://www.w3schools.com/tags/tag_iframe.asp.
Originally posted on FOSSASIA blog - Fetching URL for Embedded Twitter Videos in loklak server